Sniffing Text Files


David Pratt

Hi. I have files that I will be importing in at least four different
plain text formats: one is tab-delimited, a couple are token-based
using pipes (but not delimited with pipes), and another is XML. There
will likely be others as well, but the data needs to be extracted and
rewritten to a single format. The files can be fairly large (several
MB), so I do not want to read the whole file into memory. What
approach would be recommended for sniffing the files for the different
text formats? I realize the csv module has a Sniffer, but it is more
or less limited to delimited files. I have a couple of ideas on what I
could do, but I am interested in hearing how others might handle
something like this so I can determine the best approach to take. Many
thanks.

Regards,
David
 

Mike Meyer

David Pratt said:
Hi. I have files that I will be importing in at least four different
plain text formats: one is tab-delimited, a couple are token-based
using pipes (but not delimited with pipes), and another is XML. There
will likely be others as well, but the data needs to be extracted and
rewritten to a single format. The files can be fairly large (several
MB), so I do not want to read the whole file into memory. What
approach would be recommended for sniffing the files for the different
text formats? I realize the csv module has a Sniffer, but it is more
or less limited to delimited files. I have a couple of ideas on what I
could do, but I am interested in hearing how others might handle
something like this so I can determine the best approach to take. Many
thanks.

With GB-memory machines being common, I wouldn't think twice about
slurping a couple of megabytes into RAM to examine. But if that's too
much, how about simply reading in the first <chunk> bytes and checking
that for the characters you want? <chunk> should be large enough to
reveal what you need, but small enough that you're comfortable reading
it in. I'm not sure that there aren't funny interactions between
read() and readline(), so do be careful with that.
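A minimal sketch of that chunk-reading idea (the 64 KB size, the
format labels, and the function name are illustrative assumptions, not
anything from the thread):

```python
CHUNK = 64 * 1024  # 64 KB is usually plenty to identify a format

def sniff_chunk(filename):
    # Read only the first CHUNK characters, never the whole file.
    with open(filename, "r") as fp:
        head = fp.read(CHUNK)
    # Check for telltale characters, most distinctive first.
    if head.lstrip().startswith("<"):
        return "xml"
    if "\t" in head:
        return "tab"
    if "|" in head:
        return "pipe-token"
    return "unknown"
```

Since the chunk is read with a single read() call and the file is
closed before any parsing happens, this avoids mixing read and
readline on the same file object.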

Another approach to consider is libmagic. Google turns up a number of
links to Python wrappers for it.

<mike
 

Steven D'Aprano

Hi. I have files that I will be importing in at least four different
plain text formats: one is tab-delimited, a couple are token-based
using pipes (but not delimited with pipes), and another is XML. There
will likely be others as well, but the data needs to be extracted and
rewritten to a single format. The files can be fairly large (several
MB), so I do not want to read the whole file into memory.

Why ever not? On modern machines, "several MB" counts as a small file.
Let your operating system worry about memory, at least until you get
to really big files (several hundred megabytes).
What approach would be recommended for sniffing the files for the
different text formats?

In no particular order:

(1) Push the problem onto the user: they specify what sort of file they
think it is. If they tell your program the file is XML when it is in fact
a CSV file, your XML importer will report back that the input file is a
broken XML file.

(2) Look at the file extension (.xml, .csv, .txt, etc.) and assume that it
is correct. If the user gives you an XML file called "data.csv", you can
hardly be blamed for treating it incorrectly. This behaviour is more
accepted under Windows than Linux or Macintosh.
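A sketch of that extension-based approach, for what it's worth (the
mapping below is illustrative; extend it with whatever formats
actually turn up):

```python
import os

# Hypothetical mapping from extension to format label.
EXTENSION_MAP = {
    ".xml": "xml",
    ".csv": "csv",
    ".txt": "txt",
    ".tab": "txt",
}

def format_from_extension(filename):
    # Normalise case so "DATA.XML" and "data.xml" are treated alike.
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_MAP.get(ext, "???")
```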

(3) Use the Linux command "file" to determine the contents of the file.
There may be equivalents on other OSes.

(4) Write your own simple scanner that tries to determine if the file is
xml, csv, tab-delimited text, etc. A basic example:

(Will need error checking and hardening)

def sniff(filename):
    """Return one of "xml", "csv", "txt" or "tkn", or "???"
    if it can't decide the file type.
    """
    scores = {"xml": 0, "csv": 0, "txt": 0, "tkn": 0}
    fp = open(filename, "r")
    # Iterate over the file directly rather than calling readlines(),
    # so the whole file is never held in memory at once.
    for line in fp:
        if not line.strip():
            continue
        if line[0] == "<":
            scores["xml"] += 1
        if '\t' in line:
            scores["txt"] += 1
        if ',' in line:
            scores["csv"] += 1
        if SOMETOKEN in line:  # SOMETOKEN: whatever pipe-based token you expect
            scores["tkn"] += 1
    fp.close()
    # Pick the best guess:
    L = [(score, name) for (name, score) in scores.items()]
    L.sort()
    L.reverse()
    # L is now sorted from highest down to lowest by score.
    best_guess = L[0]
    second_best_guess = L[1]
    if best_guess[0] > 10*second_best_guess[0]:
        return best_guess[1]
    return "???"


Note that the above code really isn't good enough for production work, but
it should give you an idea how to proceed.
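Since the original concern was memory, the same scoring idea can be
restricted to the first few hundred lines with itertools.islice, so
only a bounded sample is ever read (the sample size and the pipe token
below are assumptions):

```python
import itertools

SAMPLE_LINES = 500   # bound on how much of the file is examined
TOKEN = "|"          # stand-in for whatever pipe-based token the files use

def sniff_sample(filename):
    scores = {"xml": 0, "csv": 0, "txt": 0, "tkn": 0}
    with open(filename, "r") as fp:
        # islice stops after SAMPLE_LINES lines, so large files are
        # never read in full.
        for line in itertools.islice(fp, SAMPLE_LINES):
            if line.lstrip().startswith("<"):
                scores["xml"] += 1
            if "\t" in line:
                scores["txt"] += 1
            if "," in line:
                scores["csv"] += 1
            if TOKEN in line:
                scores["tkn"] += 1
    best = max(scores, key=scores.get)
    return best if scores[best] else "???"
```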


Hope that helps.
 
