list comprehension help

R

rkmr.em

Hi
I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that
"list comprehensions" can speed things up. Can you point out how to
do it in this case?
thanks a lot!


f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()
 
M

Marc 'BlackJack' Rintsch

In <[email protected]>, rkmr.em wrote:
I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that
"list comprehensions" can speed things up. Can you point out how to
do it in this case?

No way I can see here.
f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()

You can get rid of splitting the same line twice, or use `split()` and
`rsplit()` with the `maxsplit` argument to avoid splitting the line at
*every* space character.
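
For illustration, a minimal sketch of the maxsplit variant (assuming, as in
the original snippet, that `db` is an already-open BerkeleyDB-style mapping
and 'file.txt' is the input file):

f = open('file.txt', 'r')
for line in f:
    key = line.split(' ', 1)[0]        # split at most once, from the left
    value = line.rsplit(' ', 1)[-1]    # split at most once, from the right
    db[key] = value
    db.sync()

The other variant is simply to call line.split(' ') once, bind the result to
a name, and index that list twice instead of re-splitting the same line.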

And if the names give the right hints, `db.sync()` may be an expensive
operation. Try to call it less often if possible.
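
As an illustration of the lower sync frequency (the original poster later
settles on once every 100,000 lines), a sketch that syncs once per batch
instead of once per line:

SYNC_EVERY = 100000                    # batch size chosen later in the thread

f = open('file.txt', 'r')
for i, line in enumerate(f):
    db[line.split(' ', 1)[0]] = line.rsplit(' ', 1)[-1]
    if (i + 1) % SYNC_EVERY == 0:      # sync every SYNC_EVERY lines
        db.sync()
db.sync()                              # flush whatever is left in the last batch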

Ciao,
Marc 'BlackJack' Rintsch
 
G

George Sakkis

Hi
I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that
"list comprehensions" can speed things up. Can you point out how to
do it in this case?
thanks a lot!

f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()

You got several good suggestions; one that has not been mentioned but
makes a big (or even the biggest) difference for large/huge files is
the buffering parameter of open(). Set it to the largest value you can
afford to keep the I/O as low as possible. I'm processing 15-25 GB
files (you see "huge" is really relative ;-)) on 2-4GB RAM boxes, and
setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
compared to the default value. BerkeleyDB should have a buffering
option too; make sure you use it, and don't synchronize on every line.
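
A sketch of what that can look like (the buffer size below is only a
placeholder to tune to the RAM you can spare, and `db` is again assumed
to be the poster's already-open BerkeleyDB-style mapping):

BUF_SIZE = 512 * 1024 * 1024           # e.g. 512 MB; George reports 1 GB or more

f = open('file.txt', 'r', BUF_SIZE)    # third argument is the buffer size in bytes
for line in f:
    db[line.split(' ', 1)[0]] = line.rsplit(' ', 1)[-1]
db.sync()                              # sync once at the end, not on every line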

Best,
George
 
R

rkmr.em

I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that
"list comprehensions" can speed things up. Can you point out how to
do it in this case?
thanks a lot!

f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()
You got several good suggestions; one that has not been mentioned but
makes a big (or even the biggest) difference for large/huge files is
the buffering parameter of open(). Set it to the largest value you can
afford to keep the I/O as low as possible. I'm processing 15-25 GB
files (you see "huge" is really relative ;-)) on 2-4GB RAM boxes, and
setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
compared to the default value. BerkeleyDB should have a buffering
option too; make sure you use it, and don't synchronize on every line.

Can you give an example of how you process the 15-25GB files with the
buffering parameter? It would be educational for everyone, I think.

I changed the sync to once every 100,000 lines.
thanks a lot everyone!
 
A

Alex Martelli

George Sakkis said:
Hi
I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that
"list comprehensions" can speed things up. Can you point out how to
do it in this case?
thanks a lot!

f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()

You got several good suggestions; one that has not been mentioned but
makes a big (or even the biggest) difference for large/huge files is
the buffering parameter of open(). Set it to the largest value you can
afford to keep the I/O as low as possible. I'm processing 15-25 GB
files (you see "huge" is really relative ;-)) on 2-4GB RAM boxes, and
setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
compared to the default value. BerkeleyDB should have a buffering
option too; make sure you use it, and don't synchronize on every line.

Out of curiosity, what OS and FS are you using? On a well-tuned FS and
OS combo that does "read-ahead" properly, I would not expect such
improvements for moving from large to huge buffering (unless some other
pesky process is perking up once in a while and sending the disk heads
on a quest to never-never land). IOW, if I observed this performance
behavior on a server machine I'm responsible for, I'd look for
system-level optimizations (unless I know I'm being forced by myopic
beancounters to run inappropriate OSs/FSs, in which case I'd spend the
time polishing my resume instead) - maybe tuning the OS (or mount?)
parameters, maybe finding a way to satisfy the "other pesky process"
without flapping disk heads all over the prairie, etc, etc.

The delay of filling a "1 GB or more" buffer before actual processing
can begin _should_ defeat any gains over, say, a 1 MB buffer -- unless,
that is, something bad is seriously interfering with the normal
read-ahead system level optimization... and in that case I'd normally be
more interested in finding and squashing the "something bad", than in
trying to work around it by overprovisioning application bufferspace!-)


Alex
 
R

rkmr.em

George Sakkis said:
I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that
"list comprehensions" can speed things up. Can you point out how to
do it in this case?
f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()
You got several good suggestions; one that has not been mentioned but
makes a big (or even the biggest) difference for large/huge files is
the buffering parameter of open(). Set it to the largest value you can
afford to keep the I/O as low as possible. I'm processing 15-25 GB
files (you see "huge" is really relative ;-)) on 2-4GB RAM boxes, and
setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
compared to the default value. BerkeleyDB should have a buffering
option too; make sure you use it, and don't synchronize on every line.

Out of curiosity, what OS and FS are you using?

Fedora Core 4 and ext3. Is there something I should do to the FS?

On a well-tuned FS and OS combo that does "read-ahead" properly, I
would not expect such improvements for moving from large to huge
buffering (unless some other pesky process is perking up once in a
while and sending the disk heads on a quest to never-never land). IOW,
if I observed this performance behavior on a server machine I'm
responsible for, I'd look for system-level optimizations (unless I
know I'm being forced by myopic beancounters to run inappropriate
OSs/FSs, in which case I'd spend the time polishing my resume instead)
- maybe tuning the OS (or mount?) parameters, maybe finding a way to
satisfy the "other pesky process" without flapping disk heads all over
the prairie, etc, etc.

The delay of filling a "1 GB or more" buffer before actual processing
can begin _should_ defeat any gains over, say, a 1 MB buffer -- unless,
that is, something bad is seriously interfering with the normal
read-ahead system level optimization... and in that case I'd normally be
more interested in finding and squashing the "something bad", than in
trying to work around it by overprovisioning application bufferspace!-)

Which should I do? How much buffer should I allocate? I have a box
with 2GB memory.
thanks!
 
A

Alex Martelli

Fedora Core 4 and ext3. Is there something I should do to the FS?

In theory, nothing. In practice, this is strange.

Which should I do? How much buffer should I allocate? I have a box
with 2GB memory.

I'd be curious to see a read-only loop on the file, opened with (say)
1MB of buffer vs 30MB vs 1GB -- just loop on the lines, do a .split() on
each, and do nothing with the results. What elapsed times do you
measure with each buffersize...?
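
A rough sketch of that read-only benchmark (assuming the same 'file.txt';
the three buffer sizes are the ones mentioned above):

import time

# Parse-only pass: read every line, split it, discard the result, and time
# the whole loop for each buffer size.
for bufsize in (1024 ** 2, 30 * 1024 ** 2, 1024 ** 3):   # 1 MB, 30 MB, 1 GB
    start = time.time()
    f = open('file.txt', 'r', bufsize)
    for line in f:
        line.split(' ')
    f.close()
    print('%d bytes of buffer: %.1f s elapsed' % (bufsize, time.time() - start))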

If the huge buffers confirm their worth, it's time to take a nice
critical look at what other processes you're running and what all are
they doing to your disk -- maybe some daemon (or frequently-run cron
entry, etc) is out of control...? You could try running the benchmark
again in single-user mode (with essentially nothing else running) and
see how the elapsed-time measurements change...


Alex
 
