Parsing text into dates?

Thomas W

I'm developing a web-application where the user sometimes has to enter
dates in plain text, although a format may be provided to give clues.
On the server side this piece of text has to be parsed into a datetime
python-object. Does anybody have any pointers on this?

Besides the actual parsing, my main concern is the different locale
date formats and how to be able to parse those strange us-like
"month/day/year" compared to the clever and intuitive european-style
"day/month/year" etc.

I've searched Google, but haven't found any good references that helped
me solve this problem, especially with regards to the locale date
format issues.

Best regards,
Thomas
 
John Machin

Thomas W said:
I'm developing a web-application where the user sometimes has to enter
dates in plain text, although a format may be provided to give clues.
On the server side this piece of text has to be parsed into a datetime
python-object. Does anybody have any pointers on this?

Besides the actual parsing, my main concern is the different locale
date formats and how to be able to parse those strange us-like
"month/day/year" compared to the clever and intuitive european-style
"day/month/year" etc.

<rant>
Well I'm from a locale that uses the dd/mm/yyyy style and I think it's
only marginally less stupid than the mm/dd/yyyy style.
</rant>

How much intuition is required to determine in an international
context what was meant by 01/12/2004? First of December or 12th of
January? The consequences of misinterpretation can be enormous.

If this application is being deployed from a central server where the
users can be worldwide, you have two options:

(a) try to work out somehow what the user's locale is, and then work
with dates in the legacy format "appropriate" to the locale.

(b) Use the considerably-less-stupid ISO 8601 standard format
yyyy-mm-dd (e.g. 2004-12-01) -- throughout your web-application, not
just in your data entry.
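
A minimal sketch of option (b), using only the standard library. The helper name parse_iso_date is made up for illustration. It first shows why 01/12/2004 cannot be read safely without knowing the order, then accepts nothing but yyyy-mm-dd:

```python
import re
from datetime import date, datetime

# The same text yields two different dates depending on the assumed order.
ambiguous = "01/12/2004"
as_day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()    # 1 December 2004
as_month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()  # 12 January 2004

# Exactly four digits, two digits, two digits, separated by hyphens.
_ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_iso_date(text):
    """Parse a yyyy-mm-dd string into a datetime.date.

    Hypothetical helper, not part of any library. Raises ValueError for
    any other layout and for impossible dates such as 2004-02-30.
    """
    match = _ISO_DATE.fullmatch(text.strip())
    if match is None:
        raise ValueError("expected yyyy-mm-dd, got %r" % text)
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)
```

Rejecting everything else outright is a design choice: the user gets an error message instead of a silently swapped day and month.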

Having said all of that, [bottom-up question] how are you handling
locale differences in language, script, currency symbol, decimal
"point", thousands separator, postal address formats, surname /
given-name order, etc etc etc? [top-down question] What *is* your
target audience?
 
Peter Hansen

John said:
If this application is being deployed from a central server where the
users can be worldwide, you have two options:

(a) try to work out somehow what the user's locale is, and then work
with dates in the legacy format "appropriate" to the locale.

And this inevitably screws a large number of Canadians (and probably
others), those poor conflicted folk caught between their European roots
and their American neighbours, some of whom use mm/dd/yy and others of
whom use dd/mm/yy on a regular basis. And some of us who switch
willy-nilly, much as we do between metric and imperial. :-(

(b) Use the considerably-less-stupid ISO 8601 standard format
yyyy-mm-dd (e.g. 2004-12-01) -- throughout your web-application, not
just in your data entry.

+1 (emphatically!) (I almost always use this form even on government
submissions, and nobody has complained yet. Of course, they haven't
started changing the forms yet, either...)

-Peter
 
George Sakkis

Thomas W said:
I'm developing a web-application where the user sometimes has to enter
dates in plain text, although a format may be provided to give clues.
On the server side this piece of text has to be parsed into a datetime
python-object. Does anybody have any pointers on this?

Besides the actual parsing, my main concern is the different locale
date formats and how to be able to parse those strange us-like
"month/day/year" compared to the clever and intuitive european-style
"day/month/year" etc.

I've searched Google, but haven't found any good references that helped
me solve this problem, especially with regards to the locale date
format issues.

Best regards,
Thomas

Although it is not a solution to the general localization problem, you
may try the mx.DateTimeFrom() factory function
(http://www.egenix.com/files/python/mxDateTime.html#DateTime) for the
parsing part. I had also written some time ago a more robust and
customized version of such parser. The ambiguous us/european style
dates are disambiguated by the provided optional argument USA (False by
default <wink>). Below is the doctest and the documentation (with
epydoc tags); mail me offlist if you'd like to check it out.

George

#=======================================================

def parseDateTime(string, USA=False, implyCurrentDate=False,
                  yearHeuristic=_20thcenturyHeuristic):
    '''Tries to parse a string as a valid date and/or time.

    It recognizes most common (and less common) date and time formats.

    Examples: '2001-05-12 15:45:00'

    @param USA: Disambiguates strings that are valid dates in both
        (month, day, year) and (day, month, year) order
        (e.g. 05/03/2002). If True, the first format is assumed.
    @param implyCurrentDate: If True and the date is not given, the
        current date is implied.
    @param yearHeuristic: If not None, a callable f(year) that
        transforms the value of the given year. The default heuristic
        transforms 2-digit years to 4-digit years assuming they are in
        the 20th century::
            lambda year: (year >= 100 and year
                          or year >= 10 and 1900 + year
                          or None)
        The heuristic should return None if the year is not considered
        valid. If yearHeuristic is None, no year transformation takes
        place.
    @return:
        - C{datetime.date} if only the date is recognized.
        - C{datetime.time} if only the time is recognized and
          implyCurrentDate is False.
        - C{datetime.datetime} if both date and time are recognized.
    @raise ValueError: If the string cannot be parsed successfully.
    '''
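
The default year heuristic quoted in the docstring can be tried on its own. In the sketch below the lambda is copied from the docstring; the names docstring_heuristic and expand_year are invented for illustration, since the post does not show the body of _20thcenturyHeuristic:

```python
# The lambda exactly as quoted in the docstring, bound to an illustrative name.
docstring_heuristic = lambda year: (year >= 100 and year
                                    or year >= 10 and 1900 + year
                                    or None)


def expand_year(year):
    """The same rule spelled out with plain if statements."""
    if year >= 100:
        return year          # already a full year, returned unchanged
    if year >= 10:
        return 1900 + year   # two-digit year, assumed to be 20th century
    return None              # years below 10 are treated as invalid


# Both spellings agree on these sample inputs.
for sample in (2001, 99, 10, 5):
    assert docstring_heuristic(sample) == expand_year(sample)
```

Note that under this rule 99 becomes 1999 and 5 is rejected, so any two-digit year meant for the 21st century needs a different callable.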
 
John Machin

George Sakkis said:
#=======================================================

def parseDateTime(string, USA=False, implyCurrentDate=False,
                  yearHeuristic=_20thcenturyHeuristic):
    '''Tries to parse a string as a valid date and/or time.

    It recognizes most common (and less common) date and time formats.

Impressive!

    Examples: [snip]
    str(parseDateTime('15.6.2001'))
    '2001-06-15'
    str(parseDateTime('6.15.2001'))
    '2001-06-15'

A dangerous heuristic -- 6.12.2001 (meaning 2001-12-06) can be easily
typoed into 6.13.2001 or 6.15.2001 on the numeric keypad.
 
George Sakkis

John Machin said:
#=======================================================

def parseDateTime(string, USA=False, implyCurrentDate=False,
                  yearHeuristic=_20thcenturyHeuristic):
    '''Tries to parse a string as a valid date and/or time.

    It recognizes most common (and less common) date and time formats.

Impressive!

    Examples: [snip]
    str(parseDateTime('15.6.2001'))
    '2001-06-15'
    str(parseDateTime('6.15.2001'))
    '2001-06-15'

A dangerous heuristic -- 6.12.2001 (meaning 2001-12-06) can be easily
typoed into 6.13.2001 or 6.15.2001 on the numeric keypad.

Sure, but how is this different from a typo of 2001-12-07 instead of
2001-12-06 ? There's no way you can catch all typos anyway by parsing
alone. Besides, 6.15.2001 is to be interpreted as 2001-06-15 in US
format. Currently the 'USA' flag is used only for ambiguous dates, but
that's easy to change to apply to all dates. Essentially you would gain
a little extra safety at the expense of a little lost recall over the
set of parseable dates.
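
The trade-off can be made concrete with a small stand-in. This is not the parseDateTime implementation, only a hypothetical sketch for dotted numeric dates with a 4-digit year; the function name and the strict parameter are assumptions:

```python
from datetime import date


def parse_numeric_date(text, usa=False, strict=False):
    """Parse 'a.b.yyyy' where a and b are day and month in some order.

    Hypothetical sketch:
    - strict=False: `usa` only decides inputs that are valid both ways.
    - strict=True:  `usa` fixes the order for every input.
    """
    first, second, year = (int(part) for part in text.split("."))
    day_first = (first, second)     # read as day.month
    month_first = (second, first)   # read as month.day, stored as (day, month)
    if usa:
        preferred, fallback = month_first, day_first
    else:
        preferred, fallback = day_first, month_first
    orders = [preferred] if strict else [preferred, fallback]
    for day, month in orders:
        try:
            return date(year, month, day)
        except ValueError:
            continue  # this reading is not a real calendar date
    raise ValueError("cannot parse %r as a date" % text)
```

With the defaults, '6.15.2001' is quietly accepted as 2001-06-15; with strict=True and usa=False it raises ValueError, which catches the keypad typo at the cost of rejecting some inputs the lenient mode would parse.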

George
 
John Roth

Thomas W said:
I'm developing a web-application where the user sometimes has to enter
dates in plain text, although a format may be provided to give clues.
On the server side this piece of text has to be parsed into a datetime
python-object. Does anybody have any pointers on this?

Besides the actual parsing, my main concern is the different locale
date formats and how to be able to parse those strange us-like
"month/day/year" compared to the clever and intuitive european-style
"day/month/year" etc.

I've searched Google, but haven't found any good references that helped
me solve this problem, especially with regards to the locale date
format issues.

There is no easy answer if you want to be able to enter three
numbers. There are two answers that work, although there will
be a lot of complaining. One is to use the international yyyy-mm-dd
form, and the other is to accept a 4 digit year, an alphabetic month
and a two digit day in any order.

Otherwise, if you get 4 digits as the first component, and it passes your
validation (whatever that is) for reasonable years, you're probably
pretty safe to assume that you've got yyyy-mm-dd. Otherwise
if you can't get a clean answer (one is > 31, one is 12 < x < 32
and one is <= 12), just give them a list of possibilities and politely
suggest that they enter it as yyyy-mm-dd next time.

I don't validate separators. As long as there is something that isn't a
number or a letter, it's a separator and which one doesn't matter. At
times I've even taken the transition between a digit and a letter as
a separator.
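
A rough sketch of these heuristics for purely numeric input, assuming a 4-digit year either first (yyyy-mm-dd) or last. The function name candidate_dates is invented for illustration, and alphabetic months and 2-digit years are deliberately left out:

```python
import re
from datetime import date


def candidate_dates(text):
    """Return every calendar date that three numeric components could mean.

    Hypothetical helper: any run of characters that are not ASCII digits
    or letters counts as a separator, whatever it is.
    """
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", text.strip()) if p]
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError("expected three numeric components: %r" % text)
    a, b, c = (int(p) for p in parts)
    if len(parts[0]) == 4:
        orders = [(a, b, c)]                # year first: take it as y-m-d
    elif len(parts[2]) == 4:
        orders = [(c, b, a), (c, a, b)]     # d/m/y, then m/d/y
    else:
        raise ValueError("expected a 4-digit year: %r" % text)
    found = []
    for year, month, day in orders:
        try:
            candidate = date(year, month, day)
        except ValueError:
            continue                        # not a real calendar date
        if candidate not in found:
            found.append(candidate)
    return found
```

One result means a clean answer; two results are the list of possibilities to show the user, along with the polite suggestion to use yyyy-mm-dd next time.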

John Roth
 
gene.tani

The beautiful brand new cookbook2 has "Fuzzy parsing of Dates" using
dateutil.parser, which you run once you have a decent guess at locale
(page 127 of cookbook)
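
For reference, a short sketch of dateutil.parser itself (not the cookbook recipe, which is not reproduced here). It assumes the third-party python-dateutil package is installed; the dayfirst and fuzzy keywords are the ones that matter for this thread:

```python
from dateutil import parser  # third-party package: python-dateutil

text = "01/12/2004"

# dayfirst tells the parser how to read an ambiguous numeric date.
european = parser.parse(text, dayfirst=True)    # day/month/year
american = parser.parse(text, dayfirst=False)   # month/day/year

# fuzzy=True skips words the parser does not recognize.
fuzzy = parser.parse("Today is January 1, 2047 at 8:21:00AM", fuzzy=True)

print(european.date(), american.date(), fuzzy)
```

The parser only applies the order it is told to use; guessing the user's locale still has to happen before the call.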
 
Mike Meyer

Thomas W said:
I'm developing a web-application where the user sometimes has to enter
dates in plain text, although a format may be provided to give clues.
On the server side this piece of text has to be parsed into a datetime
python-object. Does anybody have any pointers on this?

Why are you making it possible for the users to screw this up? Don't
give them a text widget to fill in and you have to figure out what the
format is, give them three widgets so you *know* what's what.

In doing that, you can also go to dropdown widgets for month, with
month names (in a locale appropriate for the page language), and for
the days in the month. For the latter, you can use JScript to get the
number of days in the list right, but make sure you fill it in with a
full 31 in case the user has JScript disabled. Finally, if you are
dealing with a fixed range of years, you can use a dropdown list for
that as well, eliminating having to deal with any text from the user
at all.
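
On the server side the three values still need checking, because a day list padded to a full 31 can produce dates like 31 February. A minimal sketch, assuming the values arrive as strings; the helper name and the None-on-error convention are made up for illustration:

```python
from datetime import date


def date_from_fields(year, month, day):
    """Combine three submitted form values into a date, or return None.

    Hypothetical helper: returns None when a value is missing, is not a
    number, or the combination is not a real calendar date.
    """
    try:
        return date(int(year), int(month), int(day))
    except (TypeError, ValueError):
        return None
```

For example, date_from_fields("2004", "2", "31") returns None, so the form can be shown again with an error next to the day field.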

If the spec calls for plain text entry and you've already tried to get
that changed to something sane, my apologies.

<mike
 
John Machin

Why are you making it possible for the users to screw this up? Don't
give them a text widget to fill in and you have to figure out what the
format is, give them three widgets so you *know* what's what.

In doing that, you can also go to dropdown widgets for month, with
month names (in a locale appropriate for the page language), and for
the days in the month.

My experience: drop-down lists generate off-by-one errors. They also
annoy the bejaysus out of users -- e.g. year of birth, a 60+ element
list. It's quite possible of course that YMMV :)

BTW: I have seen a web page with a drop-down list for year of birth
where the first 18 entries were <current year>, <current year - 1>,
etc for a transaction that wasn't for minors.
 
