python - Is POS tagging deterministic?
I have been trying to wrap my head around why this is happening, and I am hoping someone can shed some light on it. I am trying to tag the following text:

    ae0.475 x mod
    ae0.842 x mod
    ae0.842 x mod
    ae0.775 x mod

using the following code:

    import nltk

    # Tag each line of the input file and print the tokens next to their tags.
    file = open("test", "r")
    for line in file:
        words = line.strip().split(' ')
        # Drop the empty strings produced by repeated spaces.
        words = [word.strip() for word in words if word != '']
        tags = nltk.pos_tag(words)
        # Keep only the tag from each (word, tag) pair.
        pos = [tags[x][1] for x in range(len(tags))]
        key = ' '.join(pos)
        print(words, " : ", key)

and I am getting the following result:

    ['ae0.475', 'x', 'mod'] : NN NNP NN
    ['ae0.842', 'x', 'mod'] : -NONE- NNP NN
    ['ae0.842', 'x', 'mod'] : -NONE- NNP NN
    ['ae0.775', 'x', 'mod'] : NN NNP NN

And I don't get it. Does anyone know the reason for this inconsistency? I am not very particular about the accuracy of the POS tagging, because I am attempting to extract templates, but it seems to be using different tags at different instances for a word that looks "almost" the same.
As a workaround, I replaced all the digits with 1, and that solved the problem:

    ['ae1.111', 'x', 'mod'] : NN NNP NN
    ['ae1.111', 'x', 'mod'] : NN NNP NN
    ['ae1.111', 'x', 'mod'] : NN NNP NN
    ['ae1.111', 'x', 'mod'] : NN NNP NN

But I am curious why it tagged the same kind of token with different tags in the first case. Any suggestions?
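For reference, the digit replacement described above can be sketched roughly as follows. This is only an illustration: `normalize_digits` is a hypothetical helper name, not part of NLTK, and the sketch assumes that collapsing every ASCII digit to 1 is acceptable for the template extraction.

```python
import re

def normalize_digits(token):
    """Replace every ASCII digit in a token with '1' (hypothetical helper)."""
    return re.sub(r"[0-9]", "1", token)

# Normalize the tokens before handing them to the tagger, so that
# tokens differing only in their digits become identical strings.
tokens = ["ae0.842", "x", "mod"]
normalized = [normalize_digits(t) for t in tokens]
print(normalized)
```

Since identical strings go into the tagger, every line is then tagged from the same input, which is consistent with the uniform output shown above.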
My best effort to understand this uncovered the following note, from a case where the whole Brown corpus was not used:

"Note that words that the tagger has not seen before, such as decried, receive a tag of None."

So, I guess something that looks like ae1.111 must appear in the corpus file, but nothing like ae0.842 does. That's kind of weird, but that's the reasoning for giving the -NONE- tag.
Edit: I got super-curious, so I downloaded the Brown corpus myself and plain-text-searched inside it. The number 111 appears in it 34 times, and the number 842 appears 4 times. 842 appears either in the middle of dollar amounts or as the last 3 digits of a year, whereas 111 appears many times on its own as a page number. 775 appears once, as a page number.
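A search like the one described above could be sketched as below. The sample string is made up for illustration and is not taken from the Brown corpus, and both function names are hypothetical. The sketch separates a plain substring count (which also matches digits inside years and amounts) from a count of occurrences that are not adjacent to another digit.

```python
import re

def substring_count(text, number):
    """Count every occurrence of the digits, even inside longer numbers."""
    return text.count(number)

def standalone_count(text, number):
    """Count occurrences with no digit immediately before or after."""
    pattern = r"(?<![0-9])" + re.escape(number) + r"(?![0-9])"
    return len(re.findall(pattern, text))

# Made-up sample text, only to show how the two counts can differ.
sample = "See page 111. It cost $1,842.50 in 1842; compare page 111."
for number in ("111", "842"):
    print(number, substring_count(sample, number), standalone_count(sample, number))
```

Note that the digit-adjacency check still counts the 842 in "$1,842.50", because the neighbouring characters are a comma and a period, so telling page numbers apart from amounts would need a stricter pattern.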
So, I'm going to make a conjecture: because of Benford's law, you end up matching numbers that start with 1s, 2s, and 3s more often than numbers that start with 8s or 9s, since these are more likely to be the page numbers of a random page cited in a book. I'd be interested in finding out if that's true (but not interested enough to do it myself, of course!).
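To put rough numbers on that conjecture: Benford's law gives the expected share of leading digit d as log10(1 + 1/d). The short sketch below just evaluates that formula; `benford_probability` is a hypothetical helper name, and whether the numbers in the Brown corpus actually follow this distribution is an assumption that has not been checked here.

```python
import math

def benford_probability(d):
    """Expected share of numbers whose leading digit is d under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Print the expected share for each possible leading digit.
for d in range(1, 10):
    print(d, round(benford_probability(d), 3))
```

Under this formula a leading 1 is expected about 30% of the time and a leading 8 about 5% of the time, which is the kind of imbalance the conjecture relies on.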