Converting a string to the most probable type

Pierre Quentel

Hi,

I would like to know if there is a module that converts a string to a
value of the "most probable type"; for instance:
- if the string is "abcd" the value is the same string "abcd"
- string "123": value = the integer 123
- string "-1.23" (or "-1,23" if the locale's decimal separator is ","): value
= the float -1.23
- string "2008/03/06" (the format is also locale-dependent): value =
datetime.date(2008, 3, 6)

Like in spreadsheets, special prefixes could be used to force the
type: for instance '123 would be converted to the *string* "123"
instead of the *integer* 123

I could code it myself, but this wheel has probably already been invented

Regards,
Pierre
 
George Sakkis

Pierre Quentel said:
I could code it myself, but this wheel is probably already invented

Maybe, but that's such a domain-specific and easy-to-code wheel that
reinventing it is no big deal.

George
 
rockingred

George Sakkis said:
Maybe, but that's such a domain-specific and easy-to-code wheel that
reinventing it is no big deal.

Actually, you could probably write your own code very easily with a couple
of try/except clauses. I would recommend you try int() first, then
float(), then a date check, and when all else fails leave it as a
string. However, it may be an interesting challenge for those who are
willing to make the attempt: "Define Cell/Field contents".
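
A minimal sketch of that try/except chain, assuming Python 3. The function name guess_value and the single date layout in DATE_FORMAT are made up for illustration; a real application would have to pick its own layout.

```python
from datetime import datetime

# Assumption: one fixed date layout. The thread notes that the real
# layout is locale-dependent.
DATE_FORMAT = "%Y/%m/%d"


def guess_value(text):
    """Try int, then float, then a date; fall back to the original string."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        pass
    try:
        return datetime.strptime(text, DATE_FORMAT).date()
    except ValueError:
        return text
```

Note that float() also accepts strings such as "nan" and "inf", so a stricter check may be wanted before calling it.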
 
Luis M. González

Pierre Quentel said:
I would like to know if there is a module that converts a string to a
value of the "most probable type"

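    def convert(x):
        # Strings containing a dot are tried as floats, everything else as
        # ints; anything that fails to parse is returned unchanged.
        if '.' in x:
            try: return float(x)
            except ValueError: return x
        else:
            try: return int(x)
            except ValueError: return x

    # convert('hello') returns 'hello'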
 
hvendelbo.dev

Neat solution. The real challenge, though, is whether to support
localised dates; these are all valid:
20/10/01
102001
20-10-2001
20011020
 
rockingred

Dates can be a pain. I wrote my own date program, simply because
there are so many different ways to write a date:

Mar 8, 2008
March 8th, 08
03/08/08
03-08-2008

And so on and so forth. The tricky bit is how to tell the difference
between Day, Month and Year.

I wrote a program to check the format used for a date. I assumed that
any group of 4 digits was the year. Then I compared the field being
checked against a list of the first 3 characters of each month name;
if one matched, that was the month. Then I assumed any number greater
than 12 was a day. If I couldn't match those criteria I assumed Month
Day Year (the standard at the company I worked for).
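
A rough sketch of that heuristic in Python 3, not the original program. The name guess_date, the century default, and the rule that a 2-digit year is the last number are all assumptions added here so the example is complete.

```python
import re
from datetime import date

MONTH_PREFIXES = ["jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec"]
ORDINAL_SUFFIXES = {"st", "nd", "rd", "th"}


def guess_date(text, century=2000):
    """Return a datetime.date, or None when the heuristic cannot decide."""
    # Split into runs of letters and runs of digits, dropping "th" and friends.
    tokens = [t for t in re.findall(r"[A-Za-z]+|\d+", text)
              if t.lower() not in ORDINAL_SUFFIXES]
    if len(tokens) != 3:
        return None

    year = month = None
    numbers = []
    for token in tokens:
        if token.isdigit():
            if len(token) == 4 and year is None:
                year = int(token)          # a 4-digit group is the year
            else:
                numbers.append(int(token))
        elif month is None and token[:3].lower() in MONTH_PREFIXES:
            month = MONTH_PREFIXES.index(token[:3].lower()) + 1
        else:
            return None                    # a word that is not a month name

    if year is None:
        # Assumption (not in the post): a 2-digit year is the last number.
        year = century + numbers.pop()

    if month is None:
        if len(numbers) != 2:
            return None
        first, second = numbers
        if first > 12:                     # cannot be a month, so it is the day
            day, month = first, second
        else:                              # fallback order: month, then day
            month, day = first, second
    else:
        if len(numbers) != 1:
            return None
        day = numbers[0]

    try:
        return date(year, month, day)
    except ValueError:                     # e.g. month 13 or day 32
        return None
```

Under these assumptions all four spellings listed above come out as 8 March 2008, and 20/10/01 is read as 20 October 2001 because 20 cannot be a month. A string with no separators, such as 102001, is not handled.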
 
John Machin

hvendelbo.dev said:
Neat solution. The real challenge, though, is whether to support
localised dates; these are all valid:
20/10/01
102001
20-10-2001
20011020

The neat solution doesn't handle the case of dots as date separators,
e.g. 20.10.01 [they are used in dates in some locales, and the
location of . on the numeric keypad is easier on the pinkies than / or
-]

I'm a bit dubious about the utility of "most likely format" for ONE
input.

I've used a brute-force approach when inspecting largish CSV files
(with a low but non-zero rate of typos etc.) with the goal of
determining what is the most likely type of data in each column.

E.g. 102001 could be a valid MMDDYY date, but not a valid DDMMYY or
YYMMDD date. 121212 could be all of those. Both qualify as int, float
and text. A column with 100% of entries qualifying as text, 99.999% as
float, 99.99% as integer, 99.9% as DDMMYY, and much lower percentages
as MMDDYY and YYMMDD would be tagged as DDMMYY.

The general rule is: pick the type whose priority is highest and whose
score exceeds a threshold. Priorities: date > int > float > text.

Finding the date order works well with things like date of birth where
there is a wide distribution of days and years. However a field (e.g.
date interest credited to bank account) where the day is always 01 and
the year is in 01 to 08 would give the same scores for each of 3 date
orders ... eye-balling the actual data never goes astray.
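
A small sketch of that scoring rule, assuming Python 3 and 6-digit dates only. The function names, the 0.99 default threshold, and the choice to take the best-scoring date order first are illustrative assumptions, not the code actually used on those CSV files.

```python
from datetime import datetime

# The three 6-digit date orders discussed above.
DATE_FORMATS = {"DDMMYY": "%d%m%y", "MMDDYY": "%m%d%y", "YYMMDD": "%y%m%d"}


def qualifies(convert, value):
    """True if convert(value) succeeds without a ValueError."""
    try:
        convert(value)
    except ValueError:
        return False
    return True


def qualifies_as_date(value, fmt):
    # Insist on exactly six digits so every field is two digits wide.
    if len(value) != 6 or not value.isdigit():
        return False
    return qualifies(lambda v: datetime.strptime(v, fmt), value)


def column_scores(values):
    """Fraction of entries in a non-empty column qualifying as each type."""
    checks = {"int": lambda v: qualifies(int, v),
              "float": lambda v: qualifies(float, v)}
    for name, fmt in DATE_FORMATS.items():
        checks[name] = lambda v, fmt=fmt: qualifies_as_date(v, fmt)
    return {name: sum(map(check, values)) / len(values)
            for name, check in checks.items()}


def guess_column_type(values, threshold=0.99):
    """Highest-priority type whose score reaches the threshold."""
    scores = column_scores(values)
    # Priority: date > int > float > text. Among date orders, take the best.
    best_date = max(DATE_FORMATS, key=lambda name: scores[name])
    if scores[best_date] >= threshold:
        return best_date
    for name in ("int", "float"):
        if scores[name] >= threshold:
            return name
    return "text"
```

When several date orders tie, max() simply returns the first one in DATE_FORMATS, which is exactly the situation where looking at the data by eye is still needed.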
 
Paul Rubin

Pierre Quentel said:
I would like to know if there is a module that converts a string to a
value of the "most probable type"

Python 2.4.4 (#1, Oct 23 2006, 13:58:00)
The Zen of Python, by Tim Peters
...
In the face of ambiguity, refuse the temptation to guess.
 
John Machin

Paul Rubin said:
In the face of ambiguity, refuse the temptation to guess.

Reminds me of the story about the avionics software for a fighter
plane which when faced with the situation of being exactly upside-down
refused to guess whether it should initiate a left roll or a right
roll :)
 
Steven D'Aprano

Paul Rubin said:
In the face of ambiguity, refuse the temptation to guess.

Good advice, but one which really only applies to libraries. At the
application level, sometimes (but not always, or even most times!)
guessing is the right thing to do.

E.g. spreadsheet applications don't insist on different syntax for
strings, dates and numbers. You can use syntax to force one or the other,
but by default the application will auto-detect what you want according
to relatively simple, predictable and intuitive rules:

* if the string looks like a date, it's a date;
* if it looks like a number, it's a number;
* otherwise it's a string.
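
Those three rules, plus a forcing syntax, might be sketched like this in Python 3. The name parse_cell, the leading-apostrophe prefix (borrowed from the original question) and the single date layout are assumptions for illustration only.

```python
from datetime import datetime

DATE_FORMAT = "%Y/%m/%d"  # assumption: one fixed, agreed date layout


def parse_cell(text):
    """Date if it looks like a date, number if it looks like one, else string."""
    # A leading apostrophe forces the rest of the cell to stay a string.
    if text.startswith("'"):
        return text[1:]
    try:
        return datetime.strptime(text, DATE_FORMAT).date()
    except ValueError:
        pass
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text
```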

Given the user-base of the application, the consequences of a wrong
guess, and the ease of fixing it, guessing is the right thing to do.

Auto-completion is another good example of guessing in the face of
ambiguity. It's not guessing that is bad, but what you do with the guess.
 
Lie

rockingred said:
The tricky bit is how to tell the difference between Day, Month and
Year.

If humans are sometimes confused about this, how could a computer
reliably tell the correct date? I don't think it's possible (to
_reliably_ convert string to date), unless you've got an agreed
convention on how to input the date.
 
Lie

Steven D'Aprano said:
Good advice, but one which really only applies to libraries. At the
application level, sometimes (but not always, or even most times!)
guessing is the right thing to do.

Guessing should only be done when it has to be done. Users should
input data in an unambiguous way (such as using 4-digit years and
textual month names; this is the preferred solution, as
flexibility is retained but ambiguity is ruled out), or be forced to
use a certain convention, or be aware of how to properly input the
date. Guessing should be kept to a minimum. Personally, when I'm
working with spreadsheet applications (in MS Office or OpenOffice) I
always input dates in an unambiguous way using a 4-digit year and
textual month name (usually the 3-letter abbreviations for quicker
inputting), then I can confidently rely on the spreadsheet to convert it
to its internal format correctly.
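
As an illustration of accepting only unambiguous input, here is a sketch in Python 3. The list of accepted layouts and the name parse_unambiguous_date are assumptions; note that %b and %B match month names in the current locale, which is English unless the program changes it.

```python
from datetime import datetime

# Assumed set of accepted layouts: each needs a month name and a 4-digit year.
UNAMBIGUOUS_FORMATS = ("%d %b %Y", "%d %B %Y", "%b %d, %Y", "%B %d, %Y")


def parse_unambiguous_date(text):
    """Parse a date written with a month name, or raise ValueError."""
    for fmt in UNAMBIGUOUS_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(
        "expected a month name and a 4-digit year, got %r" % text)
```

Anything else, such as 03/08/08, is rejected rather than guessed at.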

The general parsers like the OP wanted are easy to create if dates
aren't involved.

Steven D'Aprano said:
E.g. spreadsheet applications don't insist on different syntax for
strings, dates and numbers. You can use syntax to force one or the other,
but by default the application will auto-detect what you want according
to relatively simple, predictable and intuitive rules:

* if the string looks like a date, it's a date;
* if it looks like a number, it's a number;
* otherwise it's a string.

The worst thing that can happen is when we input a date in a format we
know but the application can't parse, and it considers it a string
instead. This kind of thing can easily slip past us. I
remember I once formatted a column in Excel to write dates in a
certain style, but when I tried to input a date in the same style,
Excel couldn't recognize it, leaving the whole column as useless
strings and requiring me to input the dates again.
 
Pierre Quentel

Luis M. González said:
def convert(x):
    if '.' in x:
        try: return float(x)
        except ValueError: return x
    else:
        try: return int(x)
        except ValueError: return x

Hi,

That's fine for people who write floats with a "."; but others learn
to enter them with ","

For the same float, the French write the literal 123.456.789,99 when
others write 123,456,789.99; for us, today is 8/3/2008 (or
08/03/2008) where for others it's 3/8/2008 or perhaps 2008/3/8

Popular spreadsheets know how to "guess" literals; if the guess is
not correct users can usually specify the pattern to use. My question
was just to know if something similar had already been developed in
Python; I understand that the answer is no
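
For the number part, a sketch of what "specify the pattern" could look like in Python 3, with the separators passed in explicitly. The name parse_number and its parameters are hypothetical, and the grouping is not validated (a misplaced thousands separator is silently dropped).

```python
def parse_number(text, decimal_sep=".", group_sep=","):
    """Parse text written with the given separators into an int or a float."""
    if decimal_sep == group_sep:
        raise ValueError("decimal and group separators must differ")
    # Drop the thousands separators, then normalise the decimal separator.
    plain = text.replace(group_sep, "")
    if decimal_sep in plain:
        return float(plain.replace(decimal_sep, "."))
    return int(plain)
```

For example, parse_number("123.456.789,99", decimal_sep=",", group_sep=".") and parse_number("123,456,789.99") give the same float. The standard library's locale.atof() does a similar conversion using the process-wide locale set with locale.setlocale().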

Thanks,
Pierre
 
castironpi

Pierre Quentel said:
Popular spreadsheets know how to "guess" literals; if the guess is
not correct users can usually specify the pattern to use.

In particular, you can retain access to the user/interface, and always
just query when probabilities aren't 100%. In general, retain access
to a higher power, such as by offering a hook (def butdonotguess(raw,
guesses):). If you're making the decision, and a case-by-case review is
too expensive, then you've made a policy, and someone gets the shaft
"'cause policy". You can throw an exception that's guaranteed to be
handled or exits. '121212' isn't as bad as '121110'! If you want to find
out who writes dates like that and when, apply to PR. Obviously a person
knew at one point what the string was; how much is left?

P.S. def welltheydidthatthistime(raw, guesses):.
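
One possible reading of that hook idea, sketched in Python 3 for 6-digit dates. The names date_guesses, parse_date and on_ambiguous are made up here; only the (raw, guesses) signature comes from the post.

```python
from datetime import datetime

# Assumed candidate layouts: DDMMYY, MMDDYY, YYMMDD.
DATE_FORMATS = ("%d%m%y", "%m%d%y", "%y%m%d")


def date_guesses(raw):
    """Return the distinct dates raw could mean under each layout."""
    guesses = []
    if len(raw) != 6 or not raw.isdigit():
        return guesses
    for fmt in DATE_FORMATS:
        try:
            guess = datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
        if guess not in guesses:
            guesses.append(guess)
    return guesses


def parse_date(raw, on_ambiguous):
    """Return the only possible date, or defer to the caller's hook."""
    guesses = date_guesses(raw)
    if len(guesses) == 1:
        return guesses[0]
    # Zero or several readings: the hook decides (ask the user, raise, ...).
    return on_ambiguous(raw, guesses)
```

With these layouts '121212' is the same date whichever order is assumed, so the hook is never called, while '121110' has three different readings and always goes to the hook.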
 
Malcolm Greene

Pierre Quentel said:
That's fine for people who write floats with a "."; but others learn to enter them with ","

Pierre,

I have also been looking for a similar Python conversion library. One of
my requirements is that such a library must be locale-aware (so it can
make reasonable assumptions regarding locale properties like thousands
separators, decimal points, etc.), either via a locale attribute or by
examining the locale of the thread it's running under.

Malcolm
 
rockingred

John Machin said:
Neat solution doesn't handle the case of using dots as date separators
e.g. 20.10.01

In the case where dots are used as a date separator, count the number
of dots (you should also count commas). If a single comma appears and
is preceded by only digits or digits with dots, assume "foreign
float". If a single dot appears and is preceded by only digits
or digits with commas, assume "float". If 2 dots appear and each
field is numeric and 2 or fewer characters long, assume date. If
2 dots appear, the first 2 fields are numeric and 2 or fewer
characters long, and the last field is numeric and 4 characters long,
assume date.
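
These counting rules could be sketched as follows in Python 3. The function name classify, the order in which the rules are tried, and the handling of a leading sign are assumptions made to get a runnable example.

```python
def classify(text):
    """Label text as 'date', 'foreign float', 'float' or 'text'."""
    # Assumption: a leading sign is allowed and ignored for classification.
    unsigned = text[1:] if text[:1] in ("+", "-") else text
    dots, commas = unsigned.count("."), unsigned.count(",")

    # Two dots and no commas: dd.mm.yy or dd.mm.yyyy.
    if dots == 2 and commas == 0:
        first, second, last = unsigned.split(".")
        short = all(f.isdigit() and len(f) <= 2 for f in (first, second))
        if short and last.isdigit() and (len(last) <= 2 or len(last) == 4):
            return "date"

    # A single comma preceded by digits (possibly grouped with dots).
    if commas == 1:
        head, _, tail = unsigned.partition(",")
        if head.replace(".", "").isdigit() and tail.isdigit():
            return "foreign float"

    # A single dot preceded by digits (possibly grouped with commas).
    if dots == 1:
        head, _, tail = unsigned.partition(".")
        if head.replace(",", "").isdigit() and tail.isdigit():
            return "float"

    return "text"
```

Applied literally, the rules label "1,234" a foreign float, even though many writers would mean the integer 1234 by it.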

There are things you can do, but you must be wary of the fact that it
may not always be 100% perfect.
 
cokofreedom

The trick in the case of when you do not want to guess, or the choices
grow too much, is to ask the user to tell you in what format they want
it and format according to their wishes.

Neatly avoids too much guessing and isn't much extra to add.
 
John Machin

rockingred said:
If a single comma appears and is preceded by only numbers or numbers
with decimals, assume "foreign float".

Bzzzzt. 1,234 is the integer 1234 to John, and it's only a float to
Jean if Jean is not a Scots lass :)
 
John Machin

cokofreedom said:
The trick in the case of when you do not want to guess, or the choices
grow too much, is to ask the user to tell you in what format they want
it and format according to their wishes.

The plot is about understanding input, not formatting output.
 
