Determine datatype

M

max_ny

Hii,
i am trying write some code to transfer data from csv file to a
database.
my problem is to determine the datatype of various column values in the
csv file. I read the csv file column wise and have to determine what
datatype each col. contains, so to map it to correct type for creating
tables in the DB.
Can someone provide me any reference for determining datatype of a col.
value read from csv file, (my google search couldn't yield exact
results); i can't figure out how should i determine the type of each
column data

thanks
 
T

Thomas Kellerer

max_ny wrote on 12.01.2006 23:11:
Hii,
i am trying write some code to transfer data from csv file to a
database.
my problem is to determine the datatype of various column values in the
csv file. I read the csv file column wise and have to determine what
datatype each col. contains, so to map it to correct type for creating
tables in the DB.
Can someone provide me any reference for determining datatype of a col.
value read from csv file, (my google search couldn't yield exact
results); i can't figure out how should i determine the type of each
column data

There is no way you can do it 100%. Any character sequence you read
could be mapped to several data types.

Take the sequence "20060112". This could be a VARCHAR column that
happens to have no characters (only numbers), or it could be an integer
value, or it could be a float value that happens to have no decimal
digits, or it could be a date.

I don't see a different solution then to let your user provide the
necessary information.

Thomas
 
H

h4hrr

okk, very true

but suppose, i this csv data

21.10.2003;6 ;0.99;Feb 96;1
22.10.2003;5 ;0.99;Feb 66;1
23.10.2003;10;0.99;Feb 34;1
24.10.2003;3 ;1 ;02.Jun;1

def. i can't say anything about "20060112", what type it can be

but if its 0.99 (like in col 3, and fourth value is int in that col.);
and i read it each value from file as String[], how can i determine its
float and fourth value is int?

i mean, any java commands or way to determine it..

thanks.. (just wanna be sure what exactly i shud do and how to do or
let user decide it)
 
E

Eric Sosman

max_ny wrote On 01/12/06 17:11,:
Hii,
i am trying write some code to transfer data from csv file to a
database.
my problem is to determine the datatype of various column values in the
csv file. I read the csv file column wise and have to determine what
datatype each col. contains, so to map it to correct type for creating
tables in the DB.
Can someone provide me any reference for determining datatype of a col.
value read from csv file, (my google search couldn't yield exact
results); i can't figure out how should i determine the type of each
column data

There is no foolproof way. Fundamentally, all you
get from the file is a bunch of Strings that are the
external appearance of data fields. You're trying to
deduce the original data by gazing at its appearance,
and your problem is that there may be many different
internal data items that could produce the same String.

"4" - It looks like a number, but is it an integer
or a floating-point value? Is it a percentage
that really represents 0.04? Or maybe it's
part of a date: do the nearby columns contain
7 and 1776?

"F" - It might be a short string, as in F Murray
Abraham. Then again, it might be the Boolean
value "false."

"1:10" - Very likely a time, but what kind? If it
denotes an interval, does it mean 70 seconds
or 70 minutes? If it denotes a specific time
of day, does it mean early in the morning or
early in the afternoon? Then again, it might
be the odds your bookie is quoting.

"1/2" - If a date, is it January 2 or February 1,
and in what year? Or is it just 0.5 (does
the next column say "tsp")?

The very best way to resolve such ambiguities is to
get some outside authority (e.g., the user) to tell you
what the columns contain. Failing that, a reasonable
approach might be to take a sample of fifty or a hundred
rows, and try several ways of "understanding" the set of
values you find in each column: If one of them works
more often than the others, guess that it's the right
interpretation. (It may be a good idea to ignore the
first several rows; they might be column headings from
a spreadsheet or something of the kind.)

For example, if the values found in column 3 are

Quantity, "", 42, 29, 3001, 86, 107, 1336

you might well decide that the first two rows are titles
and the real data consists of integers in rows 3 and on.
(They could also be floating-point values or percentages,
of course, but a reasonable call might be to prefer the
"simpler" explanation.) On the other hand, if you found

Exchange, Rate, 1.0035, 0.9926, 1, 99.4217

it would be reasonable to assume you're looking at a
column of floating-point numbers plus two rows of titles.
Yes, one of the values would also work as an integer
(it's telling you how many gold splonders you can get for
one gold splonder), but floating-point successfully
"explains" more of the values and so is more likely to
characterize the whole column.

Similarly, an entire column of T and F or of Y and N
is probably a column of Booleans; if A and R and L and
S are also mixed in, it's probably a column of strings.
A column consisting entirely of integers in the range 1..12
may well be month numbers; 1..31 might mean day numbers.
Maybe you're lucky and there's a leading "$" or trailing
"%" to give you a hint (careful of those dollars, though:
are they American, Canadian, Australian, or Doodledorfian?).
Start with a set of "value interpreters" of various kinds,
apply each of them to all the values in a column, and score
each by how many times it's able to make sense of what it
finds.

It's going to be an imprecise business ...
 
T

Thomas Kellerer

h4hrr wrote on 12.01.2006 23:51:
21.10.2003;6 ;0.99;Feb 96;1
22.10.2003;5 ;0.99;Feb 66;1
23.10.2003;10;0.99;Feb 34;1
24.10.2003;3 ;1 ;02.Jun;1

def. i can't say anything about "20060112", what type it can be

but if its 0.99 (like in col 3, and fourth value is int in that col.);
and i read it each value from file as String[], how can i determine its
float and fourth value is int?

Why would you want to have a different data type for the first and
fourth value? They all go into the same database column, right? So
you'll have to use float for all of them.

But this is precisely another situation why you cannot reliable
determine the data type from a plain text file. Suppose you analyze the
first row to find the datatype and the "1" is in there, so you'll go for
an int, but later you'll hit 0.99 and your code throws an exception.

But if you really need to do it, you could try something like this:

String valueFromFile = ....;

Object dbValue = null;
try
{
dbValue = Integer.valueOf((valueFromFile);
}
catch (NumberFormatException e)
{
dbValue = null;

}

if (dbValue == null)
{
try
{
dbValue = Double.valueOf(valueFromFile);
}
catch (NumberFormatException e)
{
dbValue = null;
}
}

if (dbValue == null)
{
.....
}

You get the idea.

Basically try with one type, if that fails try the next until the
conversion does not throw an Exception. Not the fastest way, but if you
have a limited set of types, this might work. Won't help with the
situation I described in my first post though.

The actual dbValue object can be written to the database using
setObject() on a PreparedStatement.

Thomas
 
R

Roedy Green

my problem is to determine the datatype of various column values in the
csv file. I read the csv file column wise and have to determine what
datatype each col. contains, so to map it to correct type for creating
tables in the DB.

This is similar to the logic in my Align utility, and also in the HTML
table sorter, that scan a file first to determine if each column is
alpha or numeric. If all entries are numeric, I right align or sort
numerically.

So you do a pass through the file with CSVReader (see
http://mindprod.com/jgloss/csv.html

There is an option to give you each entire line as an array of
strings.

Examine each string, and adjust your types[col] array.

Now you know the types of each column. You can then read the file
converting each entry to the column type and know it will work.
 
H

h4hrr

Thanks for the nice explanation.

my main aim was to do the same as you mentioned,
to take a sample of fifty or a hundred
rows, and try several ways of "understanding" the set of
values you find in each column: If one of them works
more often than the others, guess that it's the right
interpretation.

but how to determine, whether the string[] value is integer, double or
something else, i mean the java reference for it (as i said in my other
reply).. i am not sure how to do that.. the java reference or command
to determine the value type ..

what to finalize which is datatype is it at the end, may be one has to
do some heuristics for determining it out of the values read..

thanks
 
H

h4hrr

ah, didn't look at this before posting the last reply..

thanks for code sample.. yeah looks like its not a easy way to go and
if i check 1000 rows or so, for all the different types.. numerics,
dates,

hmm..
 
F

Francesco Devittori

max_ny said:
Hii,
i am trying write some code to transfer data from csv file to a
database.
my problem is to determine the datatype of various column values in the
csv file. I read the csv file column wise and have to determine what
datatype each col. contains, so to map it to correct type for creating
tables in the DB.
Can someone provide me any reference for determining datatype of a col.
value read from csv file, (my google search couldn't yield exact
results); i can't figure out how should i determine the type of each
column data

thanks

You could use String.matches(r) with r being a regular expression.
For example

if (col.matches("[0-10]")) { type = integer }
else if (col.matches("[0-10]\.{0-10}+")) { type = float }
else if (col.matches("(true)(false)")) { type = bool }
else if (col.matches("[0-31] [(jan)(feb)...] [0-2034]")) { type = date }
and so on...

(not sure about the regexp syntax, just to show how you could do it)

Francesco
 
E

Eric Sosman

h4hrr wrote On 01/12/06 18:42,:
Thanks for the nice explanation.

my main aim was to do the same as you mentioned,

to take a sample of fifty or a hundred
rows, and try several ways of "understanding" the set of
values you find in each column: If one of them works
more often than the others, guess that it's the right
interpretation.


but how to determine, whether the string[] value is integer, double or
something else, i mean the java reference for it (as i said in my other
reply).. i am not sure how to do that.. the java reference or command
to determine the value type ..

what to finalize which is datatype is it at the end, may be one has to
do some heuristics for determining it out of the values read..

Exactly: your code needs to do some guesswork. My
suggestion is that you come up with a limited set of
data types that you expect to find in the data: Let's
say they are Date, Integer, Float, and String. Take
all (or a good-sized sample) of the text values in a
column and try to convert them to Dates; if none of the
texts successfully convert to Date then you conclude
that the column doesn't represent Dates. Try again,
converting them all to Integers; if 30% of them convert
successfully and 70% don't, you can be reasonably sure
that the column doesn't represent Integers. Try again,
this time converting every value to a Float; if they
all succeed, you can be pretty sure the column contains
floating-point numbers. If nothing else seems to fill
the bill, declare that the column contains Strings (the
conversion of a String to a String nearly always succeeds).

The "interesting" cases are when none of your candidate
data types successfully converts all the column's values,
but one converts "most" of the column. You could fall
back on String (the ultimate "I don't know"), or you could
decide to reject the few un-convertable rows (especially
if other columns in the same rows prove recalcitrant).
Another "interesting" case is when two or more data types
successfully convert all or most of the data, but neither
is obviously "more specific" than the other. These call
for judgment, a quality notably lacking in computers.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,776
Messages
2,569,602
Members
45,182
Latest member
BettinaPol

Latest Threads

Top