list comprehension help

Discussion in 'Python' started by rkmr.em@gmail.com, Mar 18, 2007.

  1. Guest

    Hi
    I need to process a really huge text file (4GB), and this is what I
    need to do. It takes forever to complete. I read somewhere that
    "list comprehension" can speed things up. Can you point out how to
    do it in this case?
    Thanks a lot!


    # 'db' is assumed to be an already-open BerkeleyDB (bsddb) mapping
    f = open('file.txt', 'r')
    for line in f:
        db[line.split(' ')[0]] = line.split(' ')[-1]
        db.sync()
     
    , Mar 18, 2007
    #1

  2. In <>, wrote:

    > I need to process a really huge text file (4GB), and this is what I
    > need to do. It takes forever to complete. I read somewhere that
    > "list comprehension" can speed things up. Can you point out how to
    > do it in this case?


    No way I can see here.

    > f = open('file.txt', 'r')
    > for line in f:
    >     db[line.split(' ')[0]] = line.split(' ')[-1]
    >     db.sync()


    You can get rid of splitting the same line twice, or use `split()` and
    `rsplit()` with the `maxsplit` argument to avoid splitting the line at
    *every* space character.
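
    A minimal sketch of that idea, assuming `db` is the same BerkeleyDB
    mapping as in the original post:

    f = open('file.txt', 'r')
    for line in f:
        key = line.split(' ', 1)[0]      # stop after the first space
        value = line.rsplit(' ', 1)[-1]  # stop after the last space
        db[key] = value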

    And if the names give the right hints, `db.sync()` may be an
    expensive operation. Try to call it less frequently if possible.
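
    For instance (a sketch; syncing every 100,000 lines is just an
    illustrative choice):

    N = 100000
    f = open('file.txt', 'r')
    for i, line in enumerate(f):
        db[line.split(' ', 1)[0]] = line.rsplit(' ', 1)[-1]
        if i % N == N - 1:   # every N-th line
            db.sync()
    db.sync()                # flush whatever is left at the end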

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Mar 18, 2007
    #2

  3. On Mar 18, 12:11 pm, "" <> wrote:

    > Hi
    > I need to process a really huge text file (4GB), and this is what I
    > need to do. It takes forever to complete. I read somewhere that
    > "list comprehension" can speed things up. Can you point out how to
    > do it in this case?
    > Thanks a lot!
    >
    > f = open('file.txt', 'r')
    > for line in f:
    >     db[line.split(' ')[0]] = line.split(' ')[-1]
    >     db.sync()


    You got several good suggestions; one that has not been mentioned but
    makes a big (or even the biggest) difference for large/huge files is
    the buffering parameter of open(). Set it to the largest value you can
    afford, to keep the I/O as low as possible. I'm processing 15-25 GB
    files (you see, "huge" is really relative ;-)) on 2-4 GB RAM boxes, and
    setting a big buffer (1 GB or more) reduces the wall time by 30 to 50%
    compared to the default value. BerkeleyDB should have a buffering
    option too; make sure you use it, and don't synchronize on every line.
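
    A minimal sketch of the idea (Python 2 `open()` signature; the 256 MB
    figure is only an illustrative guess, not a measured recommendation):

    BUF_SIZE = 256 * 1024 * 1024          # 256 MB; tune to available RAM
    f = open('file.txt', 'r', BUF_SIZE)   # third argument is the buffer size
    for line in f:
        db[line.split(' ', 1)[0]] = line.rsplit(' ', 1)[-1]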

    Best,
    George
     
    George Sakkis, Mar 19, 2007
    #3
  4. Guest

    On 18 Mar 2007 19:01:27 -0700, George Sakkis <> wrote:
    > On Mar 18, 12:11 pm, "" <> wrote:
    > > I need to process a really huge text file (4GB), and this is what I
    > > need to do. It takes forever to complete. I read somewhere that
    > > "list comprehension" can speed things up. Can you point out how to
    > > do it in this case?
    > > Thanks a lot!
    > >
    > > f = open('file.txt', 'r')
    > > for line in f:
    > >     db[line.split(' ')[0]] = line.split(' ')[-1]
    > >     db.sync()

    > You got several good suggestions; one that has not been mentioned but
    > makes a big (or even the biggest) difference for large/huge files is
    > the buffering parameter of open(). Set it to the largest value you can
    > afford, to keep the I/O as low as possible. I'm processing 15-25 GB


    Can you give an example of how you process the 15-25 GB files with
    the buffering parameter?
    It would be educational for everyone, I think.

    > files (you see, "huge" is really relative ;-)) on 2-4 GB RAM boxes, and
    > setting a big buffer (1 GB or more) reduces the wall time by 30 to 50%
    > compared to the default value. BerkeleyDB should have a buffering
    > option too; make sure you use it, and don't synchronize on every line.


    I changed the sync to once every 100,000 lines.
    Thanks a lot everyone!
     
    , Mar 19, 2007
    #4
  5. George Sakkis <> wrote:

    > On Mar 18, 12:11 pm, "" <> wrote:
    >
    > > Hi
    > > I need to process a really huge text file (4GB), and this is what I
    > > need to do. It takes forever to complete. I read somewhere that
    > > "list comprehension" can speed things up. Can you point out how to
    > > do it in this case?
    > > Thanks a lot!
    > >
    > > f = open('file.txt', 'r')
    > > for line in f:
    > >     db[line.split(' ')[0]] = line.split(' ')[-1]
    > >     db.sync()

    >
    > You got several good suggestions; one that has not been mentioned but
    > makes a big (or even the biggest) difference for large/huge files is
    > the buffering parameter of open(). Set it to the largest value you can
    > afford, to keep the I/O as low as possible. I'm processing 15-25 GB
    > files (you see, "huge" is really relative ;-)) on 2-4 GB RAM boxes, and
    > setting a big buffer (1 GB or more) reduces the wall time by 30 to 50%
    > compared to the default value. BerkeleyDB should have a buffering
    > option too; make sure you use it, and don't synchronize on every line.


    Out of curiosity, what OS and FS are you using? On a well-tuned FS and
    OS combo that does "read-ahead" properly, I would not expect such
    improvements for moving from large to huge buffering (unless some other
    pesky process is perking up once in a while and sending the disk heads
    on a quest to never-never land). IOW, if I observed this performance
    behavior on a server machine I'm responsible for, I'd look for
    system-level optimizations (unless I know I'm being forced by myopic
    beancounters to run inappropriate OSs/FSs, in which case I'd spend the
    time polishing my resume instead) - maybe tuning the OS (or mount?)
    parameters, maybe finding a way to satisfy the "other pesky process"
    without flapping disk heads all over the prairie, etc, etc.

    The delay of filling a "1 GB or more" buffer before actual processing
    can begin _should_ defeat any gains over, say, a 1 MB buffer -- unless,
    that is, something bad is seriously interfering with the normal
    read-ahead system level optimization... and in that case I'd normally be
    more interested in finding and squashing the "something bad", than in
    trying to work around it by overprovisioning application bufferspace!-)


    Alex
     
    Alex Martelli, Mar 19, 2007
    #5
  6. Guest

    On 3/18/07, Alex Martelli <> wrote:
    > George Sakkis <> wrote:
    > > On Mar 18, 12:11 pm, "" <> wrote:
    > > > I need to process a really huge text file (4GB), and this is what I
    > > > need to do. It takes forever to complete. I read somewhere that
    > > > "list comprehension" can speed things up. Can you point out how to
    > > > f = open('file.txt', 'r')
    > > > for line in f:
    > > >     db[line.split(' ')[0]] = line.split(' ')[-1]
    > > >     db.sync()

    > > You got several good suggestions; one that has not been mentioned but
    > > makes a big (or even the biggest) difference for large/huge files is
    > > the buffering parameter of open(). Set it to the largest value you can
    > > afford, to keep the I/O as low as possible. I'm processing 15-25 GB
    > > files (you see, "huge" is really relative ;-)) on 2-4 GB RAM boxes, and
    > > setting a big buffer (1 GB or more) reduces the wall time by 30 to 50%
    > > compared to the default value. BerkeleyDB should have a buffering

    > Out of curiosity, what OS and FS are you using? On a well-tuned FS and


    Fedora Core 4 and ext3. Is there something I should do to the FS?

    > OS combo that does "read-ahead" properly, I would not expect such
    > improvements for moving from large to huge buffering (unless some other
    > pesky process is perking up once in a while and sending the disk heads
    > on a quest to never-never land). IOW, if I observed this performance
    > behavior on a server machine I'm responsible for, I'd look for
    > system-level optimizations (unless I know I'm being forced by myopic
    > beancounters to run inappropriate OSs/FSs, in which case I'd spend the
    > time polishing my resume instead) - maybe tuning the OS (or mount?)
    > parameters, maybe finding a way to satisfy the "other pesky process"
    > without flapping disk heads all over the prairie, etc, etc.
    >
    > The delay of filling a "1 GB or more" buffer before actual processing
    > can begin _should_ defeat any gains over, say, a 1 MB buffer -- unless,
    > that is, something bad is seriously interfering with the normal
    > read-ahead system level optimization... and in that case I'd normally be
    > more interested in finding and squashing the "something bad", than in
    > trying to work around it by overprovisioning application bufferspace!-)



    Which should I do? How much buffer should I allocate? I have a box
    with 2 GB of memory.
    Thanks!
     
    , Mar 19, 2007
    #6
  7. <> wrote:
    ...
    > > > files (you see, "huge" is really relative ;-)) on 2-4 GB RAM boxes, and
    > > > setting a big buffer (1 GB or more) reduces the wall time by 30 to 50%
    > > > compared to the default value. BerkeleyDB should have a buffering

    > > Out of curiosity, what OS and FS are you using? On a well-tuned FS and

    >
    > Fedora Core 4 and ext3. Is there something I should do to the FS?


    In theory, nothing. In practice, this is strange.

    > Which should I do? How much buffer should I allocate? I have a box
    > with 2 GB of memory.


    I'd be curious to see a read-only loop on the file, opened with (say)
    1 MB of buffer vs 30 MB vs 1 GB -- just loop on the lines, do a
    .split() on each, and do nothing with the results. What elapsed times
    do you measure with each buffer size...?
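
    Something along these lines (a sketch, assuming Python 2 and the same
    'file.txt' as before):

    import time

    # Time a read-only pass over the file at several buffer sizes.
    for bufsize in (1024 ** 2, 30 * 1024 ** 2, 1024 ** 3):  # 1 MB, 30 MB, 1 GB
        start = time.time()
        f = open('file.txt', 'r', bufsize)
        for line in f:
            line.split()    # split each line, discard the result
        f.close()
        print '%11d-byte buffer: %.1f s elapsed' % (bufsize, time.time() - start)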

    If the huge buffers confirm their worth, it's time to take a nice
    critical look at what other processes you're running and what all are
    they doing to your disk -- maybe some daemon (or frequently-run cron
    entry, etc) is out of control...? You could try running the benchmark
    again in single-user mode (with essentially nothing else running) and
    see how the elapsed-time measurements change...


    Alex
     
    Alex Martelli, Mar 19, 2007
    #7
