Converting strings to their most efficient types: '1' --> 1, 'A' --> 'A', '1.2' --> 1.2

Discussion in 'Python' started by py_genetic, May 19, 2007.

  1. py_genetic Guest

    Hello,

    I'm importing large text files of data using csv. I would like to add
    some more auto-sensing abilities. I'm considering sampling the data
    file and doing some fuzzy-logic scoring on the attributes (columns in a
    database / CSV file, e.g. height, weight, income, etc.) to determine the
    most efficient 'type' to convert each attribute column into for further
    processing and efficient storage...

    Example row from sampled file data: [['8', '2.33', 'A', 'BB', 'hello
    there', '100,000,000,000'], [next row...], ...]

    Aside from a missing attribute designator, we can assume that the same
    type of data continues through a column. For example, a string, int8,
    int16, float etc.

    1. What is the most efficient way in Python to test whether a string
    can be converted into a given numeric type, or left alone if it's
    really a string like 'A' or 'hello'? Speed is key. Any thoughts?

    2. Is there anything out there already which deals with this issue?

    Thanks,
    Conor
     
    py_genetic, May 19, 2007
    #1

  2. Dustan Guest

    given the string s:

    try:
        integerValue = int(s)
    except ValueError, e:
        try:
            floatValue = float(s)
        except ValueError:
            pass
        else:
            s = floatValue
    else:
        s = integerValue

    I believe int() will also identify base 8 and base 16 integers if you
    give it an explicit base (or base 0); plain int(s) only accepts decimal
    strings, and there is no equivalent for base 8/16 floats.
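
    For example:

    int('12')        # 12
    int('0x1a')      # raises ValueError
    int('0x1a', 16)  # 26
    int('0x1a', 0)   # 26 -- base 0 infers the base from the prefix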
     
    Dustan, May 19, 2007
    #2

  3. James Stroud Guest

    This is untested, but here is an outline to do what you want.

    First convert rows to columns:


    columns = zip(*rows)


    Okay, that was a lot of typing. Now, you should run down the columns,
    testing with the most restrictive type and working to less restrictive
    types. You will also need to keep in mind the potential for commas in
    your numbers--so you will need to write your own converters, determining
    for yourself what literals map to what values. Only you can decide what
    you really want here. Here is a minimal idea of how I would do it:


    def make_int(astr):
        if not astr:
            return 0
        else:
            return int(astr.replace(',', ''))

    def make_float(astr):
        if not astr:
            return 0.0
        else:
            return float(astr.replace(',', ''))

    make_str = lambda s: s


    Now you can put the converters in a list, remembering to order them.


    converters = [make_int, make_float, make_str]


    Now, go down the columns checking, moving to the next, less restrictive,
    converter when a particular converter fails. We assume that the make_str
    identity operator will never fail. We could leave it out and have a
    flag, etc., for efficiency, but that is left as an exercise.


    new_columns = []
    for column in columns:
        for converter in converters:
            try:
                new_column = [converter(v) for v in column]
                break
            except:
                continue
        new_columns.append(new_column)
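
    For example, with the converters above and something like the sample row
    from the original post, you would end up with (untested):

    rows = [['8', '2.33', 'A', 'BB', 'hello there', '100,000,000,000']]
    columns = zip(*rows)
    # after running the loop above:
    # new_columns == [[8], [2.33], ['A'], ['BB'], ['hello there'],
    #                 [100000000000]]
    # column 1 settles on make_int; column 2 falls through to make_float
    # because int('2.33') raises ValueError; the rest stay plain strings.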


    For no reason at all, convert back to rows:


    new_rows = zip(*new_columns)


    You must decide for yourself how to deal with ambiguities. For example,
    will '1.0' be a float or an int? The above assumes you want all values
    in a column to have the same type. Reordering the loops can give mixed
    types in columns, but would not fulfill your stated requirements. Some
    things are not as efficient as they might be (for example, eliminating
    the clumsy make_str). But adding tests to improve efficiency would cloud
    the logic.

    James
     
    James Stroud, May 19, 2007
    #3
  4. John Machin Guest

    [apologies in advance if this appears more than once]

    This approach is quite reasonable, IF:
    (1) the types involved follow a simple "ladder" hierarchy [ints pass the
    float test, floats pass the str test]
    (2) the supplier of the data has ensured that all values in a column are
    actually instances of the intended type.

    Constraint (1) falls apart if you need dates. Consider 31/12/99,
    31/12/1999, 311299 [int?], 31121999 [int?], 31DEC99, ... and that's
    before you allow for dates in three different orders (dmy, mdy, ymd).

    Constraint (2) just falls apart -- with user-supplied data, there seem
    to be no rules but Rafferty's and no laws but Murphy's.

    The approach that I've adopted is to test the values in a column for all
    types, and choose the non-text type that has the highest success rate
    (provided the rate is greater than some threshold e.g. 90%, otherwise
    it's text).

    For large files, taking a 1/N sample can save a lot of time with little
    chance of misdiagnosis.
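
    In rough Python, the idea is something like this (just a sketch -- the
    converter names and the 90% / 1-in-101 numbers are placeholders):

    def diagnose_column(values, converters, threshold=0.9, step=101):
        # converters: list of (name, callable) pairs, ordered from most to
        # least restrictive, e.g. [('int', make_int), ('float', make_float),
        # ('date_dmy', make_date_dmy)]
        sample = values[::step]
        if not sample:
            return 'text'
        best_name, best_rate = 'text', 0.0
        for name, conv in converters:
            hits = 0
            for value in sample:
                try:
                    conv(value)
                    hits += 1
                except (ValueError, TypeError):
                    pass
            rate = float(hits) / len(sample)
            # a tie goes to the earlier (more restrictive) converter
            if rate >= threshold and rate > best_rate:
                best_name, best_rate = name, rate
        return best_name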

    Example: file of 1,079,000 records, with 15 columns, ultimately
    diagnosed as being 8 x text, 3 x int, 1 x float, 2 x date (dmy order),
    and [no kidding] 1 x date (ymd order). Using N==101 took about 15
    seconds [Python 2.5.1, Win XP Pro SP2, 3.2GHz dual-core]; N==1 takes
    about 900 seconds. The "converter" function for dates is written in C.

    Cheers,
    John
     
    John Machin, May 19, 2007
    #4
  5. Paddy Guest

    You might try investigating what can generate your data. With luck,
    it could turn out that the data generator is methodical and column
    data-types are consistent and easily determined by testing the
    first or second row. At worst, you will get to know how much you
    must check for human errors.

    - Paddy.
     
    Paddy, May 19, 2007
    #5
  6. James Stroud Guest


    Why stop there? You could lower the minimum 1/N by straightforward
    application of Bayesian statistics, using results from previous tables
    as priors.


    James
     
    James Stroud, May 19, 2007
    #6
  7. John Machin Guest

    The example I gave related to one file out of several files prepared at
    the same time by the same organisation from the same application by the
    same personnel using the same query tool for a yearly process which has
    been going on for several years. All files for a year should be in the
    same format, and the format should not change year by year, and the
    format should match the agreed specifications ... but this doesn't
    happen. Against that background, please explain to me how I can use
    "results from previous tables as priors".

    Cheers,
    John
     
    John Machin, May 20, 2007
    #7
  8. John Machin Guest

    Here you go, Paddy, the following has been generated very methodically;
    what data type is the first column? What is the value in the first
    column of the 6th row likely to be?

    "$39,082.00","$123,456.78"
    "$39,113.00","$124,218.10"
    "$39,141.00","$124,973.76"
    "$39,172.00","$125,806.92"
    "$39,202.00","$126,593.21"

    N.B. I've kindly given you five lines instead of one or two :)

    Cheers,
    John
     
    John Machin, May 20, 2007
    #8
  9. James Stroud Guest

    It depends on how you want to model your probabilities, but, as an
    example, you might find the following frequencies of columns in all
    tables you have parsed from this organization: 35% Strings, 25% Floats,
    20% Ints, 15% Date MMDDYYYY, and 5% Date YYMMDD. Let's say that you have
    also used prior counting statistics to find that there is a 2% error
    rate in the columns (2% of the values of a typical Float column fail to
    cast to Float, 2% of values in Int columns fail to cast to Int, and
    so on, though these need not all be equal). Let's also say that for
    non-Int columns, 1% of cells randomly selected cast to Int.

    These percentages could be converted to probabilities and these
    probabilities could be used as priors in Bayesian scheme to determine a
    column type. Let's say you take one cell randomly and it can be cast to
    an Int. What is the probability that the column is an Int? (See
    <http://tinyurl.com/2bdn38>.)
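
    (That is, P(H|D) = P(D|H)*P(H) / [P(D|H)*P(H) + P(D|H')*P(H')], with
    P(H') = 1 - P(H).)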

    P_1(H) = 0.20 --> Prior (20% prior columns are Int columns)
    P(D|H) = 0.98
    P(D|H') = 0.01

    P_1(H|D) = 0.9607843 --> Posterior & new prior "P_2(H)"


    Now with one test positive for Int, you are getting pretty certain you
    have an Int column. Now we take a second cell randomly from the same
    column and find that it too casts to Int.

    P_2(H) = 0.9607843 --> Confidence it's an Int column from round 1
    P(D|H) = 0.98
    P(D|H') = 0.02

    P_2(H|D) = 0.9995836


    Yikes! But I'm still not convinced it's an Int because I haven't even had
    to wait a millisecond to get the answer. Let's burn some more clock cycles.

    Let's say we really have an Int column and get "lucky" with our tests (P
    = 0.98**4 = 92% chance) and find two more random cells successfully cast
    to Int:

    P_4(H) = 0.9999957
    P(D|H) = 0.98
    P(D|H') = 0.02

    P(H|D) = 0.9999999


    I don't know about you, but after only four positives, my calculator ran
    out of significant digits and so I am at least 99.99999% convinced it's
    an Int column and I'm going to stop wasting CPU cycles and move on to
    test the next column. How do you know it's not a float? Well, given
    floats with only one decimal place, you would expect only 1/10th could
    be cast to Int (were the tenths-decimal place to vary randomly). You
    could generate a similar statistical model to convince yourself with
    vanishing uncertainty that the column that tests positive for Int four
    times in a (random sample) is not actually a Float (even with only one
    decimal place known).


    James
     
    James Stroud, May 20, 2007
    #9
  10. James Stroud Guest


    I had typos. P(D|H') should be 0.01 for all rounds.

    Also, I should clarify that 4 of 4 are positive with no fails observed.
    Integrating fails would use the last posterior as a prior in a similar
    scheme.

    Also, given a 1% false positive rate, after only 4 rounds you are 1 -
    (0.01**4) = 99.999999% sure your observations aren't because you
    accidentally pulled 4 false positives in succession.
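
    For anyone who wants to reproduce the numbers, the update is just Bayes'
    rule applied repeatedly (a sketch -- the function name is mine; 0.98 and
    0.01 are used in every round):

    def update(prior, p_pass_given_int=0.98, p_pass_given_other=0.01):
        # P(H|D) = P(D|H)P(H) / (P(D|H)P(H) + P(D|H')P(H'))
        evidence = p_pass_given_int * prior + p_pass_given_other * (1.0 - prior)
        return p_pass_given_int * prior / evidence

    p = 0.20                 # P_1(H): 20% of previous columns were Int
    for i in range(4):       # four random cells, all casting to Int
        p = update(p)
        print("round %d: %.8f" % (i + 1, p))
    # -> 0.96078431, 0.99958368, 0.99999575, 0.99999996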

    James
     
    James Stroud, May 20, 2007
    #10
  11. John Machin Guest

    The model would have to be a lot more complicated than that. There is a
    base number of required columns. The kind suppliers of the data randomly
    add extra columns, randomly permute the order in which the columns
    appear, and, for date columns, randomly choose the day-month-year order,
    how much punctuation to sprinkle between the digits, and whether to
    append some bonus extra bytes like " 00:00:00".
    Past stats on failure to cast are no guide to the future ... a sudden
    change in the failure rate can be caused by the kind folk introducing a
    new null designator i.e. outside the list ['', 'NULL', 'NA', 'N/A',
    '#N/A!', 'UNK', 'UNKNOWN', 'NOT GIVEN', etc etc etc]


    There is also the problem of first-time-participating organisations --
    in police parlance, they have no priors :)

    So, all in all, Bayesian inference doesn't seem much use in this scenario.
    That's fancy -- a great improvement on the slide rule and squared paper :)

    Cheers,
    John
     
    John Machin, May 20, 2007
    #11
  12. James Stroud Guest

    This is equivalent to saying that any statistical analysis doesn't seem
    much use in this scenario--but you go ahead and use statistics anyway?
     
    James Stroud, May 20, 2007
    #12
  13. Paddy Guest

    John,
    I've had cases where some investigation of the source of the data has
    completely removed any ambiguity. I've found that data was generated
    from one or two sources, and been able to know what every field type is
    just by examining a field that I have determined will tell me the
    source program that generated the data.

    I have also found that the flow generating some data is subject to
    hand editing, so I have had to both put extra checks in my reader and,
    on some occasions, create specific editors to replace raw hand edits
    with checked, assisted hand edits.
    I stand by my statement: "Know the source of your data" -- it's less
    likely to bite!

    - Paddy.
     
    Paddy, May 20, 2007
    #13
  14. James Stroud Guest

    I'm going to ignore this because these things have absolutely no effect
    on the analysis whatsoever. Random order of columns? How could this
    influence any statistics, counting, Bayesian, or otherwise?

    I absolutely do not understand how bonus bytes or any of the above would
    selectively adversely affect any single type of statistics--if your
    converter doesn't recognize it then your converter doesn't recognize it
    and so it will fail under every circumstance and influence any and all
    statistical analysis. Under such conditions, I want very robust
    analysis--probably more robust than simple counting statistics. And I
    definitely want something more efficient.
    Not true when using Bayesian statistics (and any type of inference for
    that matter). For example, where did you get the 90% cutoff? From
    experience? I thought that past stats are no guide to future expectations?

    Using the rough model and having no idea that they threw in a few weird
    designators so that you might suspect a 20% failure (instead of the 2% I
    modeled previously), the *low probabilities of false positives* (say 5%
    of the non-Int columns evaluate to integer--after you've eliminated
    dates because you remembered to test more restrictive types first) would
    still *drive the statistics*. Remember, the posteriors become priors
    after the first test.

    P_1(H) = 0.2 (Just a guess, it'll wash after about 3 tests.)
    P(D|H) = 0.8 (Are you sure they have it together enough to pay you?)
    P(D|H') = 0.05 (5% of the names, salaries, etc., evaluate to float?)

    Let's model failures, since the companies you work with have bad typists.
    We have to reverse the probabilities for this:

    Pf_1(H) = 0.2 (Only if this is round 1.)
    Pf(D|H) = 0.2 (We *guess* a 20% chance that any random column is Int.)
    Pf(D|H') = 0.80 (80% of Ints fail because of carpal tunnel, ennui, etc.)

    You might take issue with Pf(D|H) = 0.2. I encourage you to try a range
    of values here to see what the posteriors look like. You'll find that
    this is not as important as the *low false positive rate*.

    For example, let's not stop until we are 99.9% sure one way or the other.
    With this cutoff, let's suppose this deplorable display of typing integers:

    pass-fail-fail-pass-pass-pass

    which might be expected from the above very pessimistic priors (maybe
    you got data from the _Apathy_Coalition_ or the _Bad_Typists_Union_ or
    the _Put_a_Quote_Around_Every_5th_Integer_League_):

    P_1(H|D) = 0.800 (pass)
    P_2(H|D) = 0.500 (fail)
    P_3(H|D) = 0.200 (fail--don't stop, not 99.9% sure)
    P_4(H|D) = 0.800 (pass)
    P_6(H|D) = 0.9846153 (pass--not there yet)
    P_7(H|D) = 0.9990243 (pass--got it!)

    Now this is with 5% of all salaries, names of people, addresses, favorite
    colors, etc., evaluating to integers. (Pausing while I remember fondly
    Uncle 41572--such a nice guy...funny name, though.)
    Yes, because they teleported from Alpha Centauri where organizations are
    fundamentally different from here on Earth and we can not make any
    reasonable assumptions about them--like that they will indeed cough up
    money when the time comes or that they speak a dialect of an earth
    language or that they even generate spreadsheets for us to parse.

    James
     
    James Stroud, May 20, 2007
    #14
  15. John Machin Guest

    The source program that produced my sample dataset was Microsoft Excel
    (or OOo Calc or Gnumeric); it was induced to perform a "save as CSV"
    operation. Does that help you determine the true nature of the first column?

    My dataset has a known source, and furthermore meets your "lucky"
    criteria (methodically generated, column type is consistent) -- I'm
    waiting to hear from you about the "easily determined" part :)

    Cheers,
    John
     
    John Machin, May 20, 2007
    #15
  16. Paddy Guest

    John,
    Open up your Excel spreadsheet and check what the format is for the
    column. It's not a contest. If you KNOW what generated the data then
    USE that knowledge. It would be counter-productive to do otherwise
    surely?

    (I know, don't call you Shirley :)

    - Paddy.
     
    Paddy, May 20, 2007
    #16
  17. John Machin Guest

    .... and I won't call you Patsy more than this once :)

    Patsy, re-read. The scenario is that I don't have the Excel
    spreadsheet; I have a CSV file. The format is rather obviously
    "currency" but that is not correct. The point is that (1) it was
    methodically [mis-]produced by a known source [your criteria] but the
    correct type of column 1 can't be determined by inspection of a value
    or two.

    Yeah, it's not a contest, but I was kinda expecting that you might
    have taken first differences of column 1 by now ...
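
    (For anyone playing along at home:

    vals = [39082, 39113, 39141, 39172, 39202]
    print([b - a for a, b in zip(vals, vals[1:])])   # [31, 28, 31, 30]

    ... differences that look suspiciously like month lengths.)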

    Cheers,
    John
     
    John Machin, May 20, 2007
    #17
  18. George Sakkis Guest

    There are several replies to your immediate column type-guessing
    problem, so I'm not going to address that. Once you decide the
    converters for each column, you have to pass the dataset through them
    (and optionally rearrange or omit some of them). That's easy to
    hardcode for a few datasets with the same or similar structure but it
    soon gets tiring.

    I had a similar task recently so I wrote a general and efficient (at
    least as far as pure python goes) row transformer that does the
    repetitive work. Below are some examples from an IPython session; let
    me know if this might be useful and I'll post it here or at the
    Cookbook.

    George


    #======= RowTransformer examples ============================

    In [1]: from transrow import RowTransformer
    In [2]: rows = [row.split(',') for row in "1,3.34,4-3.2j,John",
    "4,4,4,4", "0,-1.1,3.4,None"]
    In [3]: rows
    Out[3]:
    [['1', '3.34', '4-3.2j', 'John'],
    ['4', '4', '4', '4'],
    ['0', '-1.1', '3.4', 'None']]

    # adapt the first three columns; the rest are omitted
    In [4]: for row in RowTransformer([int,float,complex])(rows):
    ...:     print row
    ...:
    [1, 3.3399999999999999, (4-3.2000000000000002j)]
    [4, 4.0, (4+0j)]
    [0, -1.1000000000000001, (3.3999999999999999+0j)]

    # return the 2nd column as float, followed by the 4th column as is
    In [5]: for row in RowTransformer({1:float, 3:None})(rows):
    ....:     print row
    ....:
    [3.3399999999999999, 'John']
    [4.0, '4']
    [-1.1000000000000001, 'None']


    # return the 3rd column as complex, followed by the 1st column as int
    In [6]: for row in RowTransformer([(2,complex),(0,int)])(rows):
    ....:     print row
    ....:
    [(4-3.2000000000000002j), 1]
    [(4+0j), 4]
    [(3.3999999999999999+0j), 0]

    # return the first three columns, adapted by eval()
    # XXX: use eval() only for trusted data
    In [7]: for row in RowTransformer(include=range(3), default_adaptor=eval)(rows):
    ....:     print row
    ....:
    [1, 3.3399999999999999, (4-3.2000000000000002j)]
    [4, 4, 4]
    [0, -1.1000000000000001, 3.3999999999999999]

    # equivalent to the previous
    In [8]: for row in RowTransformer(default_adaptor=eval, exclude=[3])(rows):
    ....:     print row
    ....:
    [1, 3.3399999999999999, (4-3.2000000000000002j)]
    [4, 4, 4]
    [0, -1.1000000000000001, 3.3999999999999999]
     
    George Sakkis, May 21, 2007
    #18
  19. James Stroud Guest

    I need to correct myself here before someone else does. I didn't
    actually reverse the probabilities as promised for the failing case. It
    was late last night and I was starting to get a little cloudy.
    Pf(D|H) = 0.2 can instead be read as "probability that it will fail the
    test given that it is really from an Int column", which is 20% of the time.
    Pf(D|H') can be read as "probability it will fail the test if it is not
    really from an Int column". That should be Pf(D|H') = 0.95 (i.e. testing
    the inability to cast to Int is a pretty bad test for Int because it
    gives false positives 95% of the time).

    This change doesn't alter the conclusions of the example: the corrected
    P_3(H|D) = 0.1505882 (lower than 20%, but nowhere near the 0.001 cutoff
    needed to conclude the column is not Int) and the final probability
    P_7(H|D) = 0.9986247 (rounding up to our 0.999 criterion for confidence
    that it is an Int ;).
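
    A few lines to reproduce the corrected sequence (a sketch; the update
    function just applies Bayes' rule, with 0.80/0.05 for a pass and
    0.20/0.95 for a fail):

    def update(prior, p_d_given_h, p_d_given_not_h):
        return p_d_given_h * prior / (
            p_d_given_h * prior + p_d_given_not_h * (1.0 - prior))

    p = 0.2
    for outcome in ['pass', 'fail', 'fail', 'pass', 'pass', 'pass']:
        if outcome == 'pass':
            p = update(p, 0.80, 0.05)
        else:
            p = update(p, 0.20, 0.95)
        print("%s -> %.7f" % (outcome, p))
    # third posterior ~0.150588, final ~0.998625 -- the figures quoted above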

    James
     
    James Stroud, May 21, 2007
    #19
  20. Neil Cerutti Guest

    My experience with Excel-related mistakes leads me to think that
    column one contains dates that got somehow misformatted on
    export.
     
    Neil Cerutti, May 21, 2007
    #20
