list comprehension help

R

rkmr.em

Hi
I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that
"list comprehensions" can speed things up. Can you point out how to
do it in this case?
thanks a lot!


f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()
 
M

Marc 'BlackJack' Rintsch

In <[email protected]>, rkmr.em wrote:
I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that
"list comprehensions" can speed things up. Can you point out how to
do it in this case?

No way I can see here.
f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()

You can get rid of splitting the same line twice, or use `split()` and
`rsplit()` with the `maxsplit` argument to avoid splitting the line at
*every* space character.
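
For illustration, a minimal sketch of the maxsplit variant (assuming, as in
the original snippet, that `db` is an already-open BerkeleyDB-style mapping
and 'file.txt' is the input file):

f = open('file.txt', 'r')
for line in f:
    key = line.split(' ', 1)[0]        # split at most once, from the left
    value = line.rsplit(' ', 1)[-1]    # split at most once, from the right
    db[key] = value
    db.sync()

The other variant is simply to call line.split(' ') once, bind the result to
a name, and index that list twice instead of re-splitting the same line.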

And if the names give the right hints, `db.sync()` may be an expensive
operation. Try to call it less often if possible.
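
As an illustration of the lower sync frequency (the original poster later
settles on once every 100,000 lines), a sketch that syncs once per batch
instead of once per line:

SYNC_EVERY = 100000                    # batch size chosen later in the thread

f = open('file.txt', 'r')
for i, line in enumerate(f):
    db[line.split(' ', 1)[0]] = line.rsplit(' ', 1)[-1]
    if (i + 1) % SYNC_EVERY == 0:      # sync every SYNC_EVERY lines
        db.sync()
db.sync()                              # flush whatever is left in the last batch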

Ciao,
Marc 'BlackJack' Rintsch
 
G

George Sakkis

Hi
I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that
"list comprehensions" can speed things up. Can you point out how to
do it in this case?
thanks a lot!

f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()

You got several good suggestions; one that has not been mentioned but
makes a big (or even the biggest) difference for large/huge files is
the buffering parameter of open(). Set it to the largest value you can
afford to keep the I/O as low as possible. I'm processing 15-25 GB
files (you see "huge" is really relative ;-)) on 2-4GB RAM boxes, and
setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
compared to the default value. BerkeleyDB should have a buffering
option too; make sure you use it, and don't synchronize on every line.
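
A sketch of what that can look like (the buffer size below is only a
placeholder to tune to the RAM you can spare, and `db` is again assumed
to be the poster's already-open BerkeleyDB-style mapping):

BUF_SIZE = 512 * 1024 * 1024           # e.g. 512 MB; George reports 1 GB or more

f = open('file.txt', 'r', BUF_SIZE)    # third argument is the buffer size in bytes
for line in f:
    db[line.split(' ', 1)[0]] = line.rsplit(' ', 1)[-1]
db.sync()                              # sync once at the end, not on every line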

Best,
George
 
R

rkmr.em

I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that
"list comprehensions" can speed things up. Can you point out how to
do it in this case?
thanks a lot!

f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()
You got several good suggestions; one that has not been mentioned but
makes a big (or even the biggest) difference for large/huge files is
the buffering parameter of open(). Set it to the largest value you can
afford to keep the I/O as low as possible. I'm processing 15-25 GB
files (you see "huge" is really relative ;-)) on 2-4GB RAM boxes, and
setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
compared to the default value. BerkeleyDB should have a buffering
option too; make sure you use it, and don't synchronize on every line.

Can you give an example of how you process the 15-25GB files with the
buffering parameter? It would be educational for everyone, I think.

I changed the sync to once every 100,000 lines.
thanks a lot everyone!
 
A

Alex Martelli

George Sakkis said:
Hi
I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that
"list comprehensions" can speed things up. Can you point out how to
do it in this case?
thanks a lot!

f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()

You got several good suggestions; one that has not been mentioned but
makes a big (or even the biggest) difference for large/huge files is
the buffering parameter of open(). Set it to the largest value you can
afford to keep the I/O as low as possible. I'm processing 15-25 GB
files (you see "huge" is really relative ;-)) on 2-4GB RAM boxes, and
setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
compared to the default value. BerkeleyDB should have a buffering
option too; make sure you use it, and don't synchronize on every line.

Out of curiosity, what OS and FS are you using? On a well-tuned FS and
OS combo that does "read-ahead" properly, I would not expect such
improvements for moving from large to huge buffering (unless some other
pesky process is perking up once in a while and sending the disk heads
on a quest to never-never land). IOW, if I observed this performance
behavior on a server machine I'm responsible for, I'd look for
system-level optimizations (unless I know I'm being forced by myopic
beancounters to run inappropriate OSs/FSs, in which case I'd spend the
time polishing my resume instead) - maybe tuning the OS (or mount?)
parameters, maybe finding a way to satisfy the "other pesky process"
without flapping disk heads all over the prairie, etc, etc.

The delay of filling a "1 GB or more" buffer before actual processing
can begin _should_ defeat any gains over, say, a 1 MB buffer -- unless,
that is, something bad is seriously interfering with the normal
read-ahead system level optimization... and in that case I'd normally be
more interested in finding and squashing the "something bad", than in
trying to work around it by overprovisioning application bufferspace!-)


Alex
 
R

rkmr.em

George Sakkis said:
I need to process a really huge text file (4GB), and this is what I
need to do. It takes forever to complete. I read somewhere that
"list comprehensions" can speed things up. Can you point out how to
do it in this case?
f = open('file.txt','r')
for line in f:
    db[line.split(' ')[0]] = line.split(' ')[-1]
    db.sync()
You got several good suggestions; one that has not been mentioned but
makes a big (or even the biggest) difference for large/huge files is
the buffering parameter of open(). Set it to the largest value you can
afford to keep the I/O as low as possible. I'm processing 15-25 GB
files (you see "huge" is really relative ;-)) on 2-4GB RAM boxes, and
setting a big buffer (1GB or more) reduces the wall time by 30 to 50%
compared to the default value. BerkeleyDB should have a buffering
option too; make sure you use it, and don't synchronize on every line.

Out of curiosity, what OS and FS are you using?

Fedora Core 4 and ext3. Is there something I should do to the FS?

On a well-tuned FS and OS combo that does "read-ahead" properly, I
would not expect such improvements for moving from large to huge
buffering (unless some other pesky process is perking up once in a
while and sending the disk heads on a quest to never-never land). IOW,
if I observed this performance behavior on a server machine I'm
responsible for, I'd look for system-level optimizations (unless I
know I'm being forced by myopic beancounters to run inappropriate
OSs/FSs, in which case I'd spend the time polishing my resume instead)
- maybe tuning the OS (or mount?) parameters, maybe finding a way to
satisfy the "other pesky process" without flapping disk heads all over
the prairie, etc, etc.

The delay of filling a "1 GB or more" buffer before actual processing
can begin _should_ defeat any gains over, say, a 1 MB buffer -- unless,
that is, something bad is seriously interfering with the normal
read-ahead system level optimization... and in that case I'd normally be
more interested in finding and squashing the "something bad", than in
trying to work around it by overprovisioning application bufferspace!-)

Which should I do? How much buffer should I allocate? I have a box
with 2GB memory.
thanks!
 
A

Alex Martelli

Fedora Core 4 and ext3. Is there something I should do to the FS?

In theory, nothing. In practice, this is strange.

Which should I do? How much buffer should I allocate? I have a box
with 2GB memory.

I'd be curious to see a read-only loop on the file, opened with (say)
1MB of buffer vs 30MB vs 1GB -- just loop on the lines, do a .split() on
each, and do nothing with the results. What elapsed times do you
measure with each buffersize...?
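
A rough sketch of that read-only benchmark (assuming the same 'file.txt';
the three buffer sizes are the ones mentioned above):

import time

# Parse-only pass: read every line, split it, discard the result, and time
# the whole loop for each buffer size.
for bufsize in (1024 ** 2, 30 * 1024 ** 2, 1024 ** 3):   # 1 MB, 30 MB, 1 GB
    start = time.time()
    f = open('file.txt', 'r', bufsize)
    for line in f:
        line.split(' ')
    f.close()
    print('%d bytes of buffer: %.1f s elapsed' % (bufsize, time.time() - start))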

If the huge buffers confirm their worth, it's time to take a nice
critical look at what other processes you're running and what all are
they doing to your disk -- maybe some daemon (or frequently-run cron
entry, etc) is out of control...? You could try running the benchmark
again in single-user mode (with essentially nothing else running) and
see how the elapsed-time measurements change...


Alex
 
