Determine datatype

Discussion in 'Java' started by max_ny, Jan 12, 2006.

  1. max_ny

    max_ny Guest

    Hii,
    i am trying write some code to transfer data from csv file to a
    database.
    my problem is to determine the datatype of various column values in the
    csv file. I read the csv file column wise and have to determine what
    datatype each col. contains, so to map it to correct type for creating
    tables in the DB.
    Can someone provide me any reference for determining datatype of a col.
    value read from csv file, (my google search couldn't yield exact
    results); i can't figure out how should i determine the type of each
    column data

    thanks
     
    max_ny, Jan 12, 2006
    #1
    1. Advertising

  2. max_ny wrote on 12.01.2006 23:11:
    > Hii,
    > i am trying write some code to transfer data from csv file to a
    > database.
    > my problem is to determine the datatype of various column values in the
    > csv file. I read the csv file column wise and have to determine what
    > datatype each col. contains, so to map it to correct type for creating
    > tables in the DB.
    > Can someone provide me any reference for determining datatype of a col.
    > value read from csv file, (my google search couldn't yield exact
    > results); i can't figure out how should i determine the type of each
    > column data


    There is no way you can do it 100%. Any character sequence you read
    could be mapped to several data types.

    Take the sequence "20060112". This could be a VARCHAR column that
    happens to have no characters (only numbers), or it could be an integer
    value, or it could be a float value that happens to have no decimal
    digits, or it could be a date.

    I don't see a different solution then to let your user provide the
    necessary information.

    Thomas
     
    Thomas Kellerer, Jan 12, 2006
    #2
    1. Advertising

  3. max_ny

    h4hrr Guest

    okk, very true

    but suppose, i this csv data

    21.10.2003;6 ;0.99;Feb 96;1
    22.10.2003;5 ;0.99;Feb 66;1
    23.10.2003;10;0.99;Feb 34;1
    24.10.2003;3 ;1 ;02.Jun;1

    def. i can't say anything about "20060112", what type it can be

    but if its 0.99 (like in col 3, and fourth value is int in that col.);
    and i read it each value from file as String[], how can i determine its
    float and fourth value is int?

    i mean, any java commands or way to determine it..

    thanks.. (just wanna be sure what exactly i shud do and how to do or
    let user decide it)
     
    h4hrr, Jan 12, 2006
    #3
  4. max_ny

    Eric Sosman Guest

    max_ny wrote On 01/12/06 17:11,:
    > Hii,
    > i am trying write some code to transfer data from csv file to a
    > database.
    > my problem is to determine the datatype of various column values in the
    > csv file. I read the csv file column wise and have to determine what
    > datatype each col. contains, so to map it to correct type for creating
    > tables in the DB.
    > Can someone provide me any reference for determining datatype of a col.
    > value read from csv file, (my google search couldn't yield exact
    > results); i can't figure out how should i determine the type of each
    > column data


    There is no foolproof way. Fundamentally, all you
    get from the file is a bunch of Strings that are the
    external appearance of data fields. You're trying to
    deduce the original data by gazing at its appearance,
    and your problem is that there may be many different
    internal data items that could produce the same String.

    "4" - It looks like a number, but is it an integer
    or a floating-point value? Is it a percentage
    that really represents 0.04? Or maybe it's
    part of a date: do the nearby columns contain
    7 and 1776?

    "F" - It might be a short string, as in F Murray
    Abraham. Then again, it might be the Boolean
    value "false."

    "1:10" - Very likely a time, but what kind? If it
    denotes an interval, does it mean 70 seconds
    or 70 minutes? If it denotes a specific time
    of day, does it mean early in the morning or
    early in the afternoon? Then again, it might
    be the odds your bookie is quoting.

    "1/2" - If a date, is it January 2 or February 1,
    and in what year? Or is it just 0.5 (does
    the next column say "tsp")?

    The very best way to resolve such ambiguities is to
    get some outside authority (e.g., the user) to tell you
    what the columns contain. Failing that, a reasonable
    approach might be to take a sample of fifty or a hundred
    rows, and try several ways of "understanding" the set of
    values you find in each column: If one of them works
    more often than the others, guess that it's the right
    interpretation. (It may be a good idea to ignore the
    first several rows; they might be column headings from
    a spreadsheet or something of the kind.)

    For example, if the values found in column 3 are

    Quantity, "", 42, 29, 3001, 86, 107, 1336

    you might well decide that the first two rows are titles
    and the real data consists of integers in rows 3 and on.
    (They could also be floating-point values or percentages,
    of course, but a reasonable call might be to prefer the
    "simpler" explanation.) On the other hand, if you found

    Exchange, Rate, 1.0035, 0.9926, 1, 99.4217

    it would be reasonable to assume you're looking at a
    column of floating-point numbers plus two rows of titles.
    Yes, one of the values would also work as an integer
    (it's telling you how many gold splonders you can get for
    one gold splonder), but floating-point successfully
    "explains" more of the values and so is more likely to
    characterize the whole column.

    Similarly, an entire column of T and F or of Y and N
    is probably a column of Booleans; if A and R and L and
    S are also mixed in, it's probably a column of strings.
    A column consisting entirely of integers in the range 1..12
    may well be month numbers; 1..31 might mean day numbers.
    Maybe you're lucky and there's a leading "$" or trailing
    "%" to give you a hint (careful of those dollars, though:
    are they American, Canadian, Australian, or Doodledorfian?).
    Start with a set of "value interpreters" of various kinds,
    apply each of them to all the values in a column, and score
    each by how many times it's able to make sense of what it
    finds.

    It's going to be an imprecise business ...

    --
     
    Eric Sosman, Jan 12, 2006
    #4
  5. h4hrr wrote on 12.01.2006 23:51:
    > 21.10.2003;6 ;0.99;Feb 96;1
    > 22.10.2003;5 ;0.99;Feb 66;1
    > 23.10.2003;10;0.99;Feb 34;1
    > 24.10.2003;3 ;1 ;02.Jun;1
    >
    > def. i can't say anything about "20060112", what type it can be
    >
    > but if its 0.99 (like in col 3, and fourth value is int in that col.);
    > and i read it each value from file as String[], how can i determine its
    > float and fourth value is int?


    Why would you want to have a different data type for the first and
    fourth value? They all go into the same database column, right? So
    you'll have to use float for all of them.

    But this is precisely another situation why you cannot reliable
    determine the data type from a plain text file. Suppose you analyze the
    first row to find the datatype and the "1" is in there, so you'll go for
    an int, but later you'll hit 0.99 and your code throws an exception.

    But if you really need to do it, you could try something like this:

    String valueFromFile = ....;

    Object dbValue = null;
    try
    {
    dbValue = Integer.valueOf((valueFromFile);
    }
    catch (NumberFormatException e)
    {
    dbValue = null;

    }

    if (dbValue == null)
    {
    try
    {
    dbValue = Double.valueOf(valueFromFile);
    }
    catch (NumberFormatException e)
    {
    dbValue = null;
    }
    }

    if (dbValue == null)
    {
    .....
    }

    You get the idea.

    Basically try with one type, if that fails try the next until the
    conversion does not throw an Exception. Not the fastest way, but if you
    have a limited set of types, this might work. Won't help with the
    situation I described in my first post though.

    The actual dbValue object can be written to the database using
    setObject() on a PreparedStatement.

    Thomas
     
    Thomas Kellerer, Jan 12, 2006
    #5
  6. max_ny

    Roedy Green Guest

    On 12 Jan 2006 14:11:48 -0800, "max_ny" <>
    wrote, quoted or indirectly quoted someone who said :

    >my problem is to determine the datatype of various column values in the
    >csv file. I read the csv file column wise and have to determine what
    >datatype each col. contains, so to map it to correct type for creating
    >tables in the DB.


    This is similar to the logic in my Align utility, and also in the HTML
    table sorter, that scan a file first to determine if each column is
    alpha or numeric. If all entries are numeric, I right align or sort
    numerically.

    So you do a pass through the file with CSVReader (see
    http://mindprod.com/jgloss/csv.html

    There is an option to give you each entire line as an array of
    strings.

    Examine each string, and adjust your types[col] array.

    Now you know the types of each column. You can then read the file
    converting each entry to the column type and know it will work.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Jan 12, 2006
    #6
  7. max_ny

    h4hrr Guest

    Thanks for the nice explanation.

    my main aim was to do the same as you mentioned,

    >to take a sample of fifty or a hundred
    >rows, and try several ways of "understanding" the set of
    >values you find in each column: If one of them works
    >more often than the others, guess that it's the right
    >interpretation.


    but how to determine, whether the string[] value is integer, double or
    something else, i mean the java reference for it (as i said in my other
    reply).. i am not sure how to do that.. the java reference or command
    to determine the value type ..

    what to finalize which is datatype is it at the end, may be one has to
    do some heuristics for determining it out of the values read..

    thanks
     
    h4hrr, Jan 12, 2006
    #7
  8. max_ny

    h4hrr Guest

    ah, didn't look at this before posting the last reply..

    thanks for code sample.. yeah looks like its not a easy way to go and
    if i check 1000 rows or so, for all the different types.. numerics,
    dates,

    hmm..
     
    h4hrr, Jan 12, 2006
    #8
  9. max_ny wrote:
    > Hii,
    > i am trying write some code to transfer data from csv file to a
    > database.
    > my problem is to determine the datatype of various column values in the
    > csv file. I read the csv file column wise and have to determine what
    > datatype each col. contains, so to map it to correct type for creating
    > tables in the DB.
    > Can someone provide me any reference for determining datatype of a col.
    > value read from csv file, (my google search couldn't yield exact
    > results); i can't figure out how should i determine the type of each
    > column data
    >
    > thanks
    >


    You could use String.matches(r) with r being a regular expression.
    For example

    if (col.matches("[0-10]")) { type = integer }
    else if (col.matches("[0-10]\.{0-10}+")) { type = float }
    else if (col.matches("(true)(false)")) { type = bool }
    else if (col.matches("[0-31] [(jan)(feb)...] [0-2034]")) { type = date }
    and so on...

    (not sure about the regexp syntax, just to show how you could do it)

    Francesco
     
    Francesco Devittori, Jan 13, 2006
    #9
  10. max_ny

    Eric Sosman Guest

    h4hrr wrote On 01/12/06 18:42,:
    > Thanks for the nice explanation.
    >
    > my main aim was to do the same as you mentioned,
    >
    >
    >>to take a sample of fifty or a hundred
    >>rows, and try several ways of "understanding" the set of
    >>values you find in each column: If one of them works
    >>more often than the others, guess that it's the right
    >>interpretation.

    >
    >
    > but how to determine, whether the string[] value is integer, double or
    > something else, i mean the java reference for it (as i said in my other
    > reply).. i am not sure how to do that.. the java reference or command
    > to determine the value type ..
    >
    > what to finalize which is datatype is it at the end, may be one has to
    > do some heuristics for determining it out of the values read..


    Exactly: your code needs to do some guesswork. My
    suggestion is that you come up with a limited set of
    data types that you expect to find in the data: Let's
    say they are Date, Integer, Float, and String. Take
    all (or a good-sized sample) of the text values in a
    column and try to convert them to Dates; if none of the
    texts successfully convert to Date then you conclude
    that the column doesn't represent Dates. Try again,
    converting them all to Integers; if 30% of them convert
    successfully and 70% don't, you can be reasonably sure
    that the column doesn't represent Integers. Try again,
    this time converting every value to a Float; if they
    all succeed, you can be pretty sure the column contains
    floating-point numbers. If nothing else seems to fill
    the bill, declare that the column contains Strings (the
    conversion of a String to a String nearly always succeeds).

    The "interesting" cases are when none of your candidate
    data types successfully converts all the column's values,
    but one converts "most" of the column. You could fall
    back on String (the ultimate "I don't know"), or you could
    decide to reject the few un-convertable rows (especially
    if other columns in the same rows prove recalcitrant).
    Another "interesting" case is when two or more data types
    successfully convert all or most of the data, but neither
    is obviously "more specific" than the other. These call
    for judgment, a quality notably lacking in computers.

    --
     
    Eric Sosman, Jan 13, 2006
    #10
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. AJ

    Datatype issue

    AJ, Jan 12, 2004, in forum: ASP .Net
    Replies:
    1
    Views:
    500
    Mattias Sjögren
    Jan 12, 2004
  2. Mostafa

    VHDL SIGNED datatype

    Mostafa, Jul 1, 2003, in forum: VHDL
    Replies:
    6
    Views:
    28,607
    hussain.elsaid
    Dec 1, 2008
  3. Ralph Stuber
    Replies:
    2
    Views:
    513
    Ralph Stuber
    Apr 22, 2005
  4. Jeff
    Replies:
    2
    Views:
    1,183
    Steve C. Orr [MCSD, MVP, CSM, ASP Insider]
    Apr 16, 2007
  5. Peña, Botp
    Replies:
    1
    Views:
    252
    Robert Klemme
    Jan 24, 2004
Loading...

Share This Page