python - Is POS tagging deterministic?
I have been trying to wrap my head around why this is happening, and I'm hoping someone can shed some light on it. I am trying to tag the following text:
    ae0.475 x mod
    ae0.842 x mod
    ae0.842 x mod
    ae0.775 x mod
using the following code:
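    import nltk

    file = open("test", "r")

    for line in file:
        # split the line on spaces and drop empty tokens
        words = line.strip().split(' ')
        words = [word.strip() for word in words if word != '']
        # tag the tokens and keep only the tag of each (word, tag) pair
        tags = nltk.pos_tag(words)
        pos = [tags[x][1] for x in range(len(tags))]
        key = ' '.join(pos)
        print(words, " : ", key)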
and I am getting the following result:
    ['ae0.475', 'x', 'mod'] : NN NNP NN
    ['ae0.842', 'x', 'mod'] : -NONE- NNP NN
    ['ae0.842', 'x', 'mod'] : -NONE- NNP NN
    ['ae0.775', 'x', 'mod'] : NN NNP NN
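And I don't get it. Does anyone know the reason for this inconsistency? I am not very particular about the accuracy of the POS tagging, because I am attempting to extract some templates, but it seems to be using different tags at different instances for a word that looks "almost" the same.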
As a solution, I replaced all the numbers with 1, and that solved the problem:
    ['ae1.111', 'x', 'mod'] : NN NNP NN
    ['ae1.111', 'x', 'mod'] : NN NNP NN
    ['ae1.111', 'x', 'mod'] : NN NNP NN
    ['ae1.111', 'x', 'mod'] : NN NNP NN
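The digit replacement described above can be done with a regular expression before tagging. This is only a minimal sketch: the helper name `normalize_digits` is made up for illustration, and it shows the preprocessing step alone, not the tagging.

```python
import re

def normalize_digits(token):
    """Replace every ASCII digit in a token with '1' (hypothetical helper)."""
    return re.sub(r"[0-9]", "1", token)

line = "ae0.842 x mod"
words = [normalize_digits(w) for w in line.split()]
print(words)  # ['ae1.111', 'x', 'mod']
```

The normalized tokens can then be passed to `nltk.pos_tag` exactly as in the loop from the question.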
But I am curious why it tagged the same instance with different tags in the first case. Any suggestions?
My best effort to understand this uncovered the following, from someone who was not using the whole Brown corpus:
"Note that words that the tagger has not seen before, such as decried, receive a tag of None."
So, I guess something that looks like ae1.111 must appear in the corpus file, but nothing like ae0.842 does. That's kind of weird, but that's the reasoning for giving the -NONE- tag.
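To illustrate the behaviour in the quoted note, here is a toy lookup tagger written from scratch. It is not NLTK's implementation, and the training pairs are invented for the example; it only shows how a tagger that memorizes the most frequent tag per word ends up with no tag for anything it has never seen.

```python
from collections import Counter, defaultdict

def train_lookup_tagger(tagged_words):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words):
    """Tag each word; unseen words get None because there is no fallback."""
    return [(w, model.get(w)) for w in words]

# Invented training data, for illustration only.
training = [("x", "NNP"), ("mod", "NN"), ("mod", "NN"), ("mod", "VB")]
model = train_lookup_tagger(training)
print(tag(model, ["ae0.842", "x", "mod"]))
# [('ae0.842', None), ('x', 'NNP'), ('mod', 'NN')]
```

Whether `nltk.pos_tag` behaves this way internally is an assumption of the guess above, not something this sketch demonstrates.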
Edit: I got super-curious, downloaded the Brown corpus myself, and plain-text-searched inside it. The number 111 appears in it 34 times, and the number 842 appears only 4 times. 842 only appears either in the middle of dollar amounts or as the last 3 digits of a year, while 111 appears many times on its own as a page number. 775 also appears once as a page number.
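That plain-text search could be repeated with a small script that separates standalone occurrences from ones embedded in a larger number. This is a sketch under assumptions: the sample string is invented, and `count_number` is a hypothetical helper. To reproduce the counts, you would read the corpus files into `text` instead.

```python
import re

def count_number(text, number):
    """Return (total, standalone) occurrence counts of a digit string.

    total counts every substring match; standalone counts only matches
    that are not part of a larger number (no digit, '.', ',' or '$'
    directly before, and no digit or decimal/thousands part directly after).
    """
    escaped = re.escape(number)
    total = len(re.findall(escaped, text))
    standalone = len(re.findall(r"(?<![\d.,$])" + escaped + r"(?!\d|[.,]\d)", text))
    return total, standalone

# Invented sample text, for illustration only.
sample = "See p. 111 and the $1,842 payment made in 1842."
print(count_number(sample, "111"))  # (1, 1)
print(count_number(sample, "842"))  # (2, 0)
```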
So, I'm going to make a conjecture that, because of Benford's law, you will end up matching numbers that start with 1s, 2s, and 3s more often than numbers that start with 8s or 9s, since these are more often the page numbers of a random page that would be cited in a book. I'd be interested in finding out if that's true (but not interested enough to do it myself, of course!).
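For reference, Benford's law gives the probability of a leading digit d as log10(1 + 1/d). The short sketch below just evaluates that formula; it says nothing about whether the numbers in the Brown corpus actually follow it, which is still the untested part of the conjecture.

```python
import math

def benford_probability(digit):
    """Probability that a number's leading digit is `digit` under Benford's law."""
    return math.log10(1 + 1 / digit)

for d in range(1, 10):
    print(d, round(benford_probability(d), 3))
```

Under this formula a leading 1 has probability of about 0.301, while a leading 8 or 9 comes in at about 0.051 and 0.046 respectively.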