Python - finding a date in a string


I want to be able to read a string and return the first date that appears in it. Is there a ready-made module I can use? I tried writing regexes for every possible date format, but that gets quite long. Is there a better way to do it?

You can run a date parser on every subtext of the text and pick the first date. Of course, such a solution will either catch things that are not dates, or fail to catch things that are, or most likely both.

Let me provide an example that uses dateutil.parser to catch anything that looks like a date:

    import re
    from itertools import chain

    import dateutil.parser

    # Add more strings that confuse the parser to this list.
    UNINTERESTING = set(chain(dateutil.parser.parserinfo.JUMP,
                              dateutil.parser.parserinfo.PERTAIN,
                              ['a']))

    def _get_date(tokens):
        # Try the longest prefix first, then progressively shorter ones.
        for end in range(len(tokens), 0, -1):
            region = tokens[:end]
            # Skip regions made up only of whitespace and filler words.
            if all(token.isspace() or token in UNINTERESTING
                   for token in region):
                continue
            text = ''.join(region)
            try:
                date = dateutil.parser.parse(text)
                return end, date
            except (ValueError, OverflowError):
                pass
        return None

    def find_dates(text, max_tokens=50, allow_overlapping=False):
        # The capturing group keeps the separators; drop empty strings.
        tokens = [t for t in re.split(r'(\S+|\W+)', text) if t]
        skip_dates_ending_before = 0
        for start in range(len(tokens)):
            region = tokens[start:start + max_tokens]
            result = _get_date(region)
            if result is not None:
                end, date = result
                if allow_overlapping or end > skip_dates_ending_before:
                    skip_dates_ending_before = end
                    yield date


    test = """Adelaide was born in Finchley, North London on 12 May 1999.
    She was a child during the Daleks' abduction and invasion of Earth in 2009.
    On 1st July 2058, Bowie Base One became the first human colony on Mars.
    It was commanded by Captain Adelaide Brooke, and seemed to prove that
    it was possible for humans to live long term on Mars."""

    print("With no overlapping:")
    for date in find_dates(test, allow_overlapping=False):
        print(date)

    print("With overlapping:")
    for date in find_dates(test, allow_overlapping=True):
        print(date)

The result of this code is, quite unsurprisingly, rubbish whether you allow overlapping or not. If overlapping is allowed, you get a lot of dates that appear nowhere in the text, and if it is not allowed, you miss an important date in the text.

    With no overlapping:
    1999-05-12 00:00:00
    2009-07-01 20:58:00
    With overlapping:
    1999-05-12 00:00:00
    1999-05-12 00:00:00
    1999-05-12 00:00:00
    1999-05-12 00:00:00
    1999-05-03 00:00:00
    1999-05-03 00:00:00
    1999-07-03 00:00:00
    1999-07-03 00:00:00
    2009-07-01 20:58:00
    2009-07-01 20:58:00
    2058-07-01 00:00:00
    2058-07-01 00:00:00
    2058-07-01 00:00:00
    2058-07-01 00:00:00
    2058-07-03 00:00:00
    2058-07-03 00:00:00
    2058-07-03 00:00:00
    2058-07-03 00:00:00

Essentially, if overlapping is allowed:

  1. "12 May 1999" is parsed as 1999-05-12 00:00:00
  2. "May 1999" is parsed as 1999-05-03 00:00:00 (because today is the 3rd day of the month)
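The 1999-05-03 and 1999-07-03 results come from the parser filling in whatever the text does not supply from a default datetime, which is today at midnight unless you pass `default=` to `dateutil.parser.parse`. Here is a minimal stdlib-only sketch of that overlay, not dateutil's actual code; `fill_missing` and `assumed_today` are hypothetical names, and the year 2011 is an arbitrary assumption (only "3 July" can be inferred from the output).

```python
from datetime import datetime

def fill_missing(parsed_fields, default):
    """Overlay the fields found in the text on top of a default datetime."""
    return default.replace(**parsed_fields)

# Assumed "today": 3 July of an arbitrary year, at midnight.
assumed_today = datetime(2011, 7, 3)

# "May 1999" supplies a year and a month; the day comes from the default.
may_1999 = fill_missing({"year": 1999, "month": 5}, assumed_today)
# "1999" supplies only a year; month and day both come from the default.
only_1999 = fill_missing({"year": 1999}, assumed_today)

print(may_1999)   # 1999-05-03 00:00:00
print(only_1999)  # 1999-07-03 00:00:00
```

Passing a fixed `default` would at least make the invented day and month predictable, although it would not stop partial matches from being reported as dates.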

If, however, overlapping is not allowed, "2009. On 1st July 2058" is parsed as 2009-07-01 20:58:00, and no attempt is made to parse the date after the period.
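One reason a candidate region can straddle a sentence boundary is the tokenisation: assuming the split pattern is `r'(\S+|\W+)'`, trailing punctuation stays glued to the preceding word, so "2009." is a single token and nothing marks the period as a stopping point. A small sketch that shows the tokens for a fragment of the test text (`tokenize` is a hypothetical helper name):

```python
import re

def tokenize(text):
    # The capturing group makes re.split keep the separators;
    # the empty strings between adjacent matches are dropped.
    return [tok for tok in re.split(r'(\S+|\W+)', text) if tok]

fragment = "Earth in 2009. On 1st July 2058, Bowie"
tokens = tokenize(fragment)
print(tokens)

# The nine tokens from "2009." to "2058," join back into the string
# that ends up being handed to the parser as one candidate date.
candidate = ''.join(tokens[4:13])
print(repr(candidate))  # '2009. On 1st July 2058,'
```

Splitting the text into sentences first, or treating "." followed by whitespace as a hard boundary, might avoid this particular misparse, at the cost of more special cases.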

