python - Is POS tagging deterministic?


I have been trying to wrap my head around why this is happening, and I am hoping someone can shed some light on it. I am trying to tag the following text:

    ae0.475      x  mod
    ae0.842      x  mod
    ae0.842      x  mod
    ae0.775      x  mod

using the following code:

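    import nltk

    # Tag each line of the input file and print its tokens next to their POS tags.
    with open("test", "r") as file:
        for line in file:
            # Split on single spaces, then drop the empty strings left by repeated spaces.
            words = line.strip().split(' ')
            words = [word.strip() for word in words if word != '']
            tags = nltk.pos_tag(words)  # list of (token, tag) pairs
            pos = [tag for _, tag in tags]
            key = ' '.join(pos)
            print(words, " : ", key)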

and I am getting the following result:

    ['ae0.475', 'x', 'mod']  :  NN NNP NN
    ['ae0.842', 'x', 'mod']  :  -NONE- NNP NN
    ['ae0.842', 'x', 'mod']  :  -NONE- NNP NN
    ['ae0.775', 'x', 'mod']  :  NN NNP NN

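And I don't get it. Does anyone know the reason for this inconsistency? I am not very particular about the accuracy of the POS tagging, because I am only attempting to extract some templates, but it seems to be using different tags at different instances for a word that looks "almost" the same.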

As a workaround, I replaced all the digits with 1, and that solved the problem:

    ['ae1.111', 'x', 'mod']  :  NN NNP NN
    ['ae1.111', 'x', 'mod']  :  NN NNP NN
    ['ae1.111', 'x', 'mod']  :  NN NNP NN
    ['ae1.111', 'x', 'mod']  :  NN NNP NN

But I am curious why it tagged the same kind of token with different tags in the first case. Any suggestions?
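The digit-replacement workaround described above can be sketched roughly as follows. This is only an illustration: the helper names `normalize_digits` and `normalize_line` are made up here, and the step would run on each line before the tokens are passed to the tagger.

```python
import re


def normalize_digits(token):
    """Replace every ASCII digit in a token with '1' so numeric variants collapse."""
    return re.sub(r"[0-9]", "1", token)


def normalize_line(line):
    """Split a line on whitespace and normalize the digits in each token."""
    return [normalize_digits(word) for word in line.split()]


# Two of the input lines from the question.
lines = ["ae0.475      x  mod", "ae0.842      x  mod"]
normalized = [normalize_line(line) for line in lines]
print(normalized)
```

After this step every line yields the same token list, so whatever tag the tagger assigns to the first token, it assigns it consistently.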

My best effort to understand this uncovered the following, from someone who was not using the whole Brown corpus:

"Note that words that the tagger has not seen before, such as decried, receive a tag of None."

So my guess is that something that looks like ae1.111 must appear in the corpus file, but nothing like ae0.842 does. That's kind of weird, but that's the reasoning behind the -NONE- tag.
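The "unseen words get None" behaviour can be illustrated with a toy lookup tagger. This is a sketch of the general idea only, not how `nltk.pos_tag` is implemented; the function `make_lookup_tagger` and the small vocabulary are invented for the example.

```python
def make_lookup_tagger(known_tags):
    """Return a tagger that looks tokens up in a dict; unseen tokens get None."""
    def tag(tokens):
        return [(token, known_tags.get(token)) for token in tokens]
    return tag


# Hypothetical vocabulary, invented for illustration only.
known_tags = {"ae1.111": "NN", "x": "NNP", "mod": "NN"}
tagger = make_lookup_tagger(known_tags)

seen = tagger(["ae1.111", "x", "mod"])
unseen = tagger(["ae0.842", "x", "mod"])
print(seen)
print(unseen)
```

Under this toy model, a token that is absent from the vocabulary has no tag to fall back on, which would match the pattern in the question if the tagger's model only knows some of the numeric strings.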

Edit: I got super-curious, downloaded the Brown corpus myself, and plain-text-searched inside it. The number 111 appears in it 34 times, while the number 842 appears only 4 times. 842 appears only in the middle of dollar amounts or as the last 3 digits of a year, whereas 111 appears many times on its own as a page number. 775 also appears once as a page number.
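A plain-text search of this kind could look something like the sketch below. The sample sentence is made up for illustration and is not taken from the Brown corpus; `count_occurrences` is a hypothetical helper that separates "appears anywhere" from "appears on its own as a token".

```python
def count_occurrences(text, number):
    """Return (substring matches, matches standing alone as a whole token)."""
    anywhere = text.count(number)
    # A token counts as standalone if it equals the number once
    # surrounding punctuation is stripped.
    standalone = sum(
        1 for token in text.split() if token.strip(".,;:()") == number
    )
    return anywhere, standalone


# Invented sample text, not real corpus data.
sample = "See p. 111 for details. The cost was $1,842 in 1842. Page 111, again."

count_111 = count_occurrences(sample, "111")
count_842 = count_occurrences(sample, "842")
print(count_111, count_842)
```

The distinction matters for the argument above: a number that only ever occurs inside a larger token (a dollar amount or a year) is never seen by a tagger as a token of its own.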

So I'm going to make a conjecture: because of Benford's law, you will end up matching numbers that start with 1, 2, or 3 much more often than numbers that start with 8 or 9, since the former are more often the page numbers of a random page cited in a book. I'd be really interested in finding out whether that's true (but not interested enough to do it myself, of course!).
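For reference, Benford's law gives the probability of a leading digit d as log10(1 + 1/d). The sketch below only evaluates that formula; whether page numbers in the corpus actually follow it is the conjecture above, not something this code checks.

```python
import math


def benford_probability(digit):
    """Probability that a number's leading digit is `digit` under Benford's law."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


probabilities = {d: benford_probability(d) for d in range(1, 10)}

# Combined probability of a leading 1, 2, or 3 versus a leading 8 or 9.
low_digits = probabilities[1] + probabilities[2] + probabilities[3]
high_digits = probabilities[8] + probabilities[9]
print(round(low_digits, 3), round(high_digits, 3))
```

Under the formula, leading digits 1 to 3 together account for log10(4), about 60%, while 8 and 9 together account for log10(1.25), a little under 10%.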

