Python - finding date in a string
I want to be able to read a string and return the first date that appears in it. Is there a ready-made module that I can use? I tried to write regexes for all the possible date formats, but that gets quite long. Is there a better way to do it?
You can run a date parser on all subtexts of a text and pick the first date. Of course, such a solution will either catch things that are not dates, or fail to catch things that are, or most likely both.
Let me provide an example that uses dateutil.parser to catch anything that looks like a date:
    import re
    from itertools import chain

    import dateutil.parser

    # Add more strings that confuse the parser to this set.
    UNINTERESTING = set(chain(dateutil.parser.parserinfo.JUMP,
                              dateutil.parser.parserinfo.PERTAIN,
                              ['a']))


    def _get_date(tokens):
        # Try the longest prefix first and shrink it until something parses.
        for end in range(len(tokens), 0, -1):
            region = tokens[:end]
            # Skip regions made up only of whitespace and filler words.
            if all(token.isspace() or token in UNINTERESTING
                   for token in region):
                continue
            text = ''.join(region)
            try:
                date = dateutil.parser.parse(text)
                return end, date
            except (ValueError, OverflowError):
                pass


    def find_dates(text, max_tokens=50, allow_overlapping=False):
        # Tokens are whitespace runs, word runs and leftover punctuation.
        tokens = [t for t in re.split(r'(\s+|\w+)', text) if t]
        skip_dates_ending_before = 0
        for start in range(len(tokens)):
            region = tokens[start:start + max_tokens]
            result = _get_date(region)
            if result is not None:
                end, date = result
                if allow_overlapping or end > skip_dates_ending_before:
                    skip_dates_ending_before = end
                    yield date


    test = """Adelaide was born in Finchley, North London on 12 May 1999. She was a
    child during the Daleks' abduction and invasion of Earth in 2009.
    On 1st July 2058, Bowie Base 1 became the first human colony on Mars. It
    was commanded by Captain Adelaide Brooke, and seemed to prove that
    it was possible for humans to live long term on Mars."""

    print("With no overlapping:")
    for date in find_dates(test, allow_overlapping=False):
        print(date)

    print("With overlapping:")
    for date in find_dates(test, allow_overlapping=True):
        print(date)

The result of this code is, quite unsurprisingly, rubbish whether you allow overlapping or not. If overlapping is allowed, you get a lot of dates that are nowhere to be seen in the text, and if it is not allowed, you miss the important date in the text.
    With no overlapping:
    1999-05-12 00:00:00
    2009-07-01 20:58:00
    With overlapping:
    1999-05-12 00:00:00
    1999-05-12 00:00:00
    1999-05-12 00:00:00
    1999-05-12 00:00:00
    1999-05-03 00:00:00
    1999-05-03 00:00:00
    1999-07-03 00:00:00
    1999-07-03 00:00:00
    2009-07-01 20:58:00
    2009-07-01 20:58:00
    2058-07-01 00:00:00
    2058-07-01 00:00:00
    2058-07-01 00:00:00
    2058-07-01 00:00:00
    2058-07-03 00:00:00
    2058-07-03 00:00:00
    2058-07-03 00:00:00
    2058-07-03 00:00:00

Essentially, if overlapping is allowed:
- "12 May 1999" is parsed to 1999-05-12 00:00:00
- "May 1999" is parsed to 1999-05-03 00:00:00 (because today is the 3rd day of the month)
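The second case happens because dateutil fills in any field missing from the string using a default datetime, which is today's date at midnight unless you pass one yourself. The sketch below makes that visible by passing an explicit `default`; the variable name `pretend_today` and the chosen date are arbitrary assumptions for illustration, not something taken from the code above.

```python
from datetime import datetime

import dateutil.parser

# Fixed stand-in for "today" so the result does not depend on the run date.
# The specific date is an arbitrary assumption; only its day (3) matters here.
pretend_today = datetime(2011, 1, 3)

# All three fields are present, so nothing is borrowed from the default.
full = dateutil.parser.parse("12 May 1999", default=pretend_today)

# The day is missing, so it is copied from the default datetime.
partial = dateutil.parser.parse("May 1999", default=pretend_today)

print(full)     # 1999-05-12 00:00:00
print(partial)  # 1999-05-03 00:00:00
```

This is why a partial match such as "May 1999" looks like a complete date in the output: the parser gives no indication of which fields were actually found in the text.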
If, however, overlapping is not allowed, "2009. On 1st July 2058" is parsed as 2009-07-01 20:58:00, and no attempt is made to parse the date after the period.
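One possible mitigation, which is not part of the code above, would be to split the text at sentence boundaries first, so that no candidate region can span a period followed by whitespace. The following is only a minimal sketch using a deliberately naive splitter; the helper name `split_sentences` is hypothetical, and it will split wrongly on abbreviations such as "Dr." or on date styles like "12. May 1999".

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]


sample = ("Adelaide was born in Finchley, North London on 12 May 1999. "
          "She was a child during the Daleks' abduction and invasion "
          "of Earth in 2009. On 1st July 2058, Bowie Base 1 became the "
          "first human colony on Mars.")

sentences = split_sentences(sample)
for sentence in sentences:
    print(repr(sentence))
```

With this split, "2009." and "On 1st July 2058" end up in different chunks, so each chunk could be searched for dates separately. It does nothing about the other false positives described above.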