Working with Huge Text Files

Lorn Davies

Hi there, I'm a Python newbie hoping for some direction in working with
text files that range from 100MB to 1G in size. Basically certain rows,
sorted by the first (primary) field and maybe the second (date), need to be
copied and written to their own file, and some string manipulations
need to happen as well. An example of the current format:

XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
|
| followed by like a million rows similar to the above, with
| incrementing date and time, and then on to next primary field
|
ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
|
| etc., there are usually 10-20 of the first field per file
| so there's a lot of repetition going on
|

The export would ideally look like this where the first field would be
written as the name of the file (XYZ.txt):

19930104, 93027, 2887, 7600, 40, 0, Z, N

Pretty ambitious for a newbie? I really hope not. I've been looking at
simpleParse, but it's a bit intense at first glance... not sure where
to start, or even if I need to go that route. Any help from you guys in
what direction to go or how to approach this would be hugely
appreciated.

Best regards,
Lorn
 
mensanator

Lorn said:
Hi there, I'm a Python newbie hoping for some direction in working with
text files that range from 100MB to 1G in size.
.....

You could use the csv module.

Here's the example from the manual with your sample data in a file
named simple.csv:

import csv
reader = csv.reader(file("some.csv"))
for row in reader:
    print row

"""
['XYZ', '04JAN1993', '9:30:27', '28.87', '7600', '40', '0', 'Z', 'N ']
['XYZ', '04JAN1993', '9:30:28', '28.87', '1600', '40', '0', 'Z', 'N ']
['ABC', '04JAN1993', '9:30:27', '28.875', '7600', '40', '0', 'Z', 'N ']
"""

The csv module will bring each line in as a list of strings.
Of course, you want to process each line before printing it.
And you don't just want to print it, you want to write it to a file.

So after reading the first line, open a file for writing with the
first field (row[0]) as the file name. Then you want to process
fields row[1], row[2] and row[3] to get them in the right format
and then write all the row fields except row[0] to the file that's
open for writing.

On every subsequent line you must check to see if row[0] has changed,
so you'll have to store row[0] in a variable. If it's changed, close
the file you've been writing to and open a new file with the new
row[0]. Then continue processing lines as before.

It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.
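
Untested, but something along these lines (reformat() here is just a
placeholder for the field processing described above):

import csv

def reformat(row):
    # placeholder: fix up row[1], row[2] and row[3] as described
    return row[1:]

reader = csv.reader(open("some.csv"))
current = None
outfile = None
for row in reader:
    if row[0] != current:            # first field changed
        if outfile is not None:
            outfile.close()
        current = row[0]
        outfile = open(current + '.txt', 'w')
    outfile.write(', '.join(reformat(row)) + '\n')
if outfile is not None:
    outfile.close()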
 
Michael Hoffman

mensanator said:
It will only be this simple if you can guarantee that the original
file is actually sorted by the first field.

And if not you can either sort the file ahead of time, or just keep
reopening the files in append mode when necessary. You could sort them
in memory in your Python program but given the size of these files I
think one of the other alternatives would be simpler.
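
A rough sketch of the append-mode variant (untested; it reopens a
file for every row, which trades speed for simplicity, and the field
reformatting from the earlier post is left out):

import csv

reader = csv.reader(open("some.csv"))
for row in reader:
    # 'a' creates the file if needed and never clobbers rows
    # already written for the same key, so order doesn't matter
    outfile = open(row[0] + ".txt", "a")
    outfile.write(", ".join(row[1:]) + "\n")
    outfile.close()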
 
mensanator

Quoting myself:
Here's the example from the manual with your sample data in a file
named simple.csv:

Obviously, I meant "some.csv". Make sure the name in the program
matches the file you want to process, or pass the input file name
to the program as an argument.
 
cwazir

Hi,

Lorn said:
..... working with text files that range from 100MB to 1G in size.
.....
XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
.....

I've found that for working with simple large text files like this,
nothing beats the plain old built-in string operations. Using a parsing
library is convenient if the data format is complex, but otherwise it's
overkill.
In this particular case, even the csv module isn't much of an
advantage. I'd just use split.

The following code should do the job:

data_file = open('data.txt', 'r')
months = {'JAN': '01', 'FEB': '02', 'MAR': '03', 'APR': '04',
          'MAY': '05', 'JUN': '06', 'JUL': '07', 'AUG': '08',
          'SEP': '09', 'OCT': '10', 'NOV': '11', 'DEC': '12'}
output_files = {}
for line in data_file:
    fields = line.strip().split(',')
    filename = fields[0]
    if filename not in output_files:
        output_files[filename] = open(filename + '.txt', 'w')
    fields[1] = fields[1][5:] + months[fields[1][2:5]] + fields[1][:2]  # 04JAN1993 -> 19930104
    fields[2] = fields[2].replace(':', '')  # 9:30:27 -> 93027
    fields[3] = fields[3].replace('.', '')  # 28.87 -> 2887
    print >>output_files[filename], ', '.join(fields[1:])
for filename in output_files:
    output_files[filename].close()
data_file.close()

Note that it does work with unsorted data - at the minor cost of
keeping all output files open till the end of the entire process.

Chirag Wazir
http://chirag.freeshell.org
 
Al Christians

I did some similar stuff way back about 12-15 years ago -- in 640k
MS-DOS with gigabyte files on 33 MHz machines. I got good performance,
able to bring up any record out of 10 million or so on the screen in a
couple of seconds (not using Python, but that should not make much
difference, maybe even some things in Python would make it work better.)

Even though my files were text, I read them as random-access binary
files. You need to be able to dive in at an arbitrary point in the
file, read a chunk of data, split it up into lines, discarding any
partial lines at the beginning and end, pull out the keys and see where
you are. Even with a gigabyte of file, if you are reading a decent size
chunk, you can binary search down to the spot you want in 15-20 tries or
so. That's the first time, but after that you've got a better idea
where to look. Use a dictionary to save the information from each chunk
to give you an index to get a headstart on the next search. If you can
keep 10k to 100k entries in your index, you can do 1000's of searches or
so before you even have to worry about having too many index entries.

I did learn that on 32-bit hardware, doing a binary search of a file
over a gigabyte will fail if you calculate the next place to look as
(a+b)/2, because a+b can be more than 2GB and overflow. You've gotta
do a + (b-a)/2 instead.
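
In Python the integers themselves won't overflow, so the midpoint
trick matters less than it did back then, but here's a rough sketch
of the seek-and-realign idea (untested; edge cases such as a match on
the very first line of the file need extra care):

import os

def seek_to_key(path, target):
    # Binary search a text file that's sorted on its first
    # comma-separated field, using byte offsets as the search space.
    f = open(path, 'rb')
    lo, hi = 0, os.path.getsize(path)
    while lo < hi:
        mid = lo + (hi - lo) // 2   # the overflow-safe midpoint
        f.seek(mid)
        f.readline()                # discard the partial line
        line = f.readline()
        if line and line.split(b',')[0] < target:
            lo = mid + 1
        else:
            hi = mid
    f.seek(lo)
    if lo:
        f.readline()                # realign to a line boundary
    return f                        # reads from here start near the key

Usage would be something like f = seek_to_key('data.txt', b'XYZ'),
then read lines forward until the first field stops matching.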


Al
 
Al Christians

Michael said:
And if not you can either sort the file ahead of time, or just keep
reopening the files in append mode when necessary. You could sort them
in memory in your Python program but given the size of these files I
think one of the other alternatives would be simpler.

There used to be a very nice sort program for PCs that came from
someplace in Nevada. It cost less than $100 and could sort files
faster than most programming languages could read or write them. For
Linux, you've gotta figure out the POSIX sort. If you do, please
'splain it to me.
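
For what it's worth, sorting this particular data on its first field
looks like it would be:

sort -t, -k1,1 -s data.txt > sorted.txt

where -t sets the field separator and -k1,1 sorts on the first field
only; -s (a stable sort, to preserve the original order within each
group) is a GNU extension. Note that a plain text sort won't put
dates like 04JAN1993 into chronological order.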

Al
 
Lorn Davies

Thank you all very much for your suggestions and input... they've been
very helpful. I found the easiest approach, as a beginner to this, was
working with Chirag's code. Thanks Chirag, I was actually able to read
and make some edits to the code and then use it... woohooo!

My changes are annotated with ##:

data_file = open('G:\pythonRead.txt', 'r')
data_file.readline()  ## this was to skip the first line
months = {'JAN': '01', 'FEB': '02', 'MAR': '03', 'APR': '04',
          'MAY': '05', 'JUN': '06', 'JUL': '07', 'AUG': '08',
          'SEP': '09', 'OCT': '10', 'NOV': '11', 'DEC': '12'}
output_files = {}
for line in data_file:
    fields = line.strip().split(',')
    length = len(fields[3])  ## check how long the field is
    N = 'P', 'N'
    filename = fields[0]
    if filename not in output_files:
        output_files[filename] = open(filename + '.txt', 'w')
    if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
        ## This line above doesn't work, can't figure out how to struct?
        fields[1] = fields[1][5:] + months[fields[1][2:5]] + fields[1][:2]
        fields[2] = fields[2].replace(':', '')
        if length == 6:  ## check for 6, if not add a 0
            fields[3] = fields[3].replace('.', '')
        else:
            fields[3] = fields[3].replace('.', '') + '0'
        print >>output_files[filename], ', '.join(fields[1:5])
for filename in output_files:
    output_files[filename].close()
data_file.close()

The main changes were to create a check for the length of fields[3]; I
wanted to normalize it at 6 digits... the problem I can see with it
potentially is if I come across lengths < 5, but I have some ideas to
fix that. The other change I attempted was a criterion for what to print
based on the value of fields[8] and fields[6]. It didn't work so well.
I'm a little confused at how to structure booleans like that... I come
from a little experience in a Pascal-type scripting language where "x
and y" would entail both having to be true before continuing and "x or
y" would mean either could be true before continuing. Python, unless
I'm misunderstanding (very possible), doesn't organize it as such. I
thought of perhaps using a set of if, elif, else statements for
processing the fields, but didn't think that would be the most
elegant/efficient solution.

Anyway, any critiques/ideas are welcome... they'll most definitely help
me understand this language a bit better. Thank you all again for your
great replies and thank you Chirag for getting me up and going.

Lorn
 
cwazir

Lorn said:
if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
## This line above doesn't work, can't figure out how to struct?

In Python you would need to phrase that as follows:
if (fields[8] == 'N' or fields[8] == 'P') and (fields[6] == '0' or fields[6] == '1'):
or alternatively:
if (fields[8] in ['N', 'P']) and (fields[6] in ['0', '1']):

Lorn said:
The main changes were to create a check for the length of fields[3],
I wanted to normalize it at 6 digits...

Well, you needn't really check the length - you could directly do this:
fields[3] = (fields[3].replace('.', '') + '000000')[:6]
(of course if there are more than 6 digits originally, they'd get
truncated in this case)
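
To make the truncation concrete:

>>> ('28.87'.replace('.', '') + '000000')[:6]
'288700'
>>> ('28.875'.replace('.', '') + '000000')[:6]
'288750'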

Chirag Wazir
http://chirag.freeshell.org
 
John Machin

Lorn said:
if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
## This line above doesn't work, can't figure out how to struct?

cwazir said:
In Python you would need to phrase that as follows:
if (fields[8] == 'N' or fields[8] == 'P') and (fields[6] == '0' or fields[6] == '1'):
or alternatively:
if (fields[8] in ['N', 'P']) and (fields[6] in ['0', '1']):

and given that the files are huge, a little bit of preprocessing
wouldn't go astray:

initially:

valid_8 = set(['N', 'P'])
valid_6 = set(['0', '1'])

then for each record:

if fields[8] in valid_8 and fields[6] in valid_6:

More meaningful names wouldn't go astray either :)
 
cwazir

John said:
More meaningful names wouldn't go astray either :)

I heartily concur!

Instead of starting with:
fields = line.strip().split(',')
you could use something like:
(f_name, f_date, f_time, ...) = line.strip().split(',')

Of course then you won't be able to use ', '.join(fields[1:])
for the output, but the rest of the program will be
MUCH more readable/maintainable.
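
For instance (everything past the first few names is invented just
for illustration):

line = 'XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N'  # sample row
(f_name, f_date, f_time, f_price,
 f_size, f_x, f_y, f_cond, f_flag) = line.strip().split(',')
# rebuild the output list explicitly instead of slicing fields[1:]
out_line = ', '.join([f_date, f_time, f_price, f_size,
                      f_x, f_y, f_cond, f_flag])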

Chirag Wazir
http://chirag.freeshell.org
 
