converting strings to most their efficient types '1' --> 1, 'A' ---> 'A', '1.2'---> 1.2


P

py_genetic

This is excellect advise, thank you gentelman.

Paddy:

We can't really, in this arena make assumtions about the data source.
I fully agree with your point, but if we had the luxury of really
knowing the source we wouldn't be having this conversation. Files we
can deal with could be consumer data files, log files, financial
files... all from different users BCP-ed out or cvs excell etc.
However, I agree that we can make one basic assumtion, for each coll
there is a correct and furthermore optimal format. In many cases we
may have a supplied "data dictionary" with the data in which case you
are right and we can override much of this process, except we still
need to find the optimal format like int8 vs int16.

James:

Using a baysian method were my inital thoughts as well. The key to
this method, I feel is getting a solid random sample of the entire
file without having to load the whole beast into memory.

What are your thoughts on other techniques? For example training a
neural net and feeding it a sample, this might be nice and very fast
since after training (we would have to create a good global training
set) we could just do a quick transform on a coll sample and ave the
probabilities of the output units (one output unit for each type).
The question here would encoding, any ideas? A bin rep of the vars?
Furthermore, niave bayes decision trees etc?


John:
The approach that I've adopted is to test the values in a column for all
types, and choose the non-text type that has the highest success rate
(provided the rate is greater than some threshold e.g. 90%, otherwise
it's text).
For large files, taking a 1/N sample can save a lot of time with little
chance of misdiagnosis.

I like your approach, this could be simple. Intially, I was thinking
a loop that did exactly this, just test the sample colls for "hits"
and take the best. Thanks for the sample code.


George:

Thank you for offering to share your transform function. I'm very
interested.
 
Ad

Advertisements

J

James Stroud

py_genetic said:
Using a baysian method were my inital thoughts as well. The key to
this method, I feel is getting a solid random sample of the entire
file without having to load the whole beast into memory.

If you feel only the first 1000 rows are representative, then you can
take a random sample from the first 200-1000 rows depending on how good
you think the typing was. At 10,000 bytes per row, you are only reading
in a 10MB file if you read 1000 rows. I just timed reading a 28MB file
(source tgz file of Open Office 2.0.1) from a local drive at about 1s.
As I hope I have demonstrated, you will need only a small random sample
from this 200-1000 for testing (maybe 5-15 depending on quality and priors).
What are your thoughts on other techniques? For example training a
neural net and feeding it a sample, this might be nice and very fast
since after training (we would have to create a good global training
set) we could just do a quick transform on a coll sample and ave the
probabilities of the output units (one output unit for each type).
The question here would encoding, any ideas? A bin rep of the vars?
Furthermore, niave bayes decision trees etc?

I think these latter ideas require more characterization of your input
than is necessary, especially with judicious ordering of simple
converter tests. What properties, aside from passing and failing certain
converter tests, would you use for the data types? For example, you
might use fraction of the string that are integer digits as a property
to classify integers. But how many CPU cycles would you spend counting
these digits and calculating the fractions for your sample? Do you
expect that your data quality is poor enough to warrant expending CPU
cycles to quantify the several properties that might characterize each type?

However, you might need some advanced learning tools if you want to
automatically decide whether a column is a last name or a first name.
I'm guessing this is not required and you would only want to know
whether to make such a column an int or a string, in which case the
testing is pretty straightforward and quick (number of tests per column
< 10).

I've never converted tables from organizations on a routine basis, but I
have a feeling that the quality of these tables are not as poor as one
might fear, especially given reasonable foreknowledge of how data types
are typically encoded.

James
 

Top