Re: Populating a dictionary, fast

Discussion in 'Python' started by Michael Bacarella, Nov 11, 2007.

  1. > That's an awfully complicated way to iterate over a file. Try this
    > instead:
    >
    > id2name = {}
    > for line in open('id2name.txt'):
    >     id,name = line.strip().split(':')
    >     id = long(id)
    >     id2name[id] = name
    >
    > > This takes about 45 *minutes*
    >
    > On my system, it takes about a minute and a half to produce a dictionary
    > with 8191180 entries.


    Doing something similar on my system is very fast as well.

    $ cat dict-8191180.py
    #!/usr/bin/python

    v = {}
    for i in xrange(8191180):
        v[i] = i

    $ time ./dict-8191180.py

    real 0m5.877s
    user 0m4.953s
    sys 0m0.924s

    But...

    > > If I comment out the last line in the loop body it takes only about 30
    > > _seconds_ to run. This would seem to implicate the line id2name[id] =
    > > name as being excruciatingly slow.
    >
    > No, dictionary access is one of the most highly-optimized, fastest, most
    > efficient parts of Python. What it indicates to me is that your system is
    > running low on memory, and is struggling to find room for 517MB worth of
    > data.



    If only it were so easy.

    $ free
                 total       used       free     shared    buffers     cached
    Mem:       7390244    2103448    5286796          0      38996    1982756
    -/+ buffers/cache:      81696    7308548
    Swap:      2096472      10280    2086192

    Here's your Python implementation running as badly as mine did.

    $ wc -l id2name.txt
    8191180 id2name.txt



    $ cat cache-id2name.py
    #!/usr/bin/python

    id2name = {}
    for line in open('id2name.txt'):
        id,name = line.strip().split(':',1)
        id = long(id)
        id2name[id] = name

    $ time ./cache-id2name.py
    ^C

    I let it go 30 minutes before killing it since I had to leave. Here it is in
    top before I did the deed.

    PID   USER  PR  NI  VIRT   RES   SHR  S  %CPU  %MEM  TIME+     COMMAND
    18802 root  25   0  1301m  1.2g  1148 R  99.9  17.1  36:05.99  cache-id2name.p

    36 minutes, 05.99 seconds.



    To rule out the file reading/parsing logic as the culprit, here's the same
    thing, but with the dictionary insertion removed.

    $ cat nocache-id2name.py
    #!/usr/bin/python

    id2name = {}
    for line in open('id2name.txt'):
        id,name = line.strip().split(':',1)
        id = long(id)

    $ time ./nocache-id2name.py

    real 0m33.518s
    user 0m33.070s
    sys 0m0.415s





    Here's a Perl implementation running very fast.

    $ cat cache-id2name.pl
    #!/usr/bin/perl

    my %id2name = ();
    my $line;
    my $id;
    my $name;
    open(F,"<id2name.txt");

    foreach $line (<F>) {
        chomp($line);
        ($id,$name) = split(/:/,$line,1);
        $id = int($id);
        $id2name{$id} = $name;
    }

    $ time ./cache-id2name.pl

    real 0m46.363s
    user 0m43.730s
    sys 0m2.611s





    So, you think Python's dict implementation degrades towards O(N)
    performance when it's fed millions of 64-bit pseudo-random longs?
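
    A minimal sketch (not from the original post) of one way to test that
    hypothesis directly: time insertions of pseudo-random 64-bit longs with
    the cyclic collector enabled and disabled. The key count and output
    format here are illustrative only.

    #!/usr/bin/python
    # Sketch: compare dict insertion of random 64-bit longs with the
    # cyclic garbage collector enabled vs. disabled.
    import gc
    import random
    import time

    def fill(n, use_gc):
        if not use_gc:
            gc.disable()
        try:
            d = {}
            start = time.time()
            for i in xrange(n):
                d[random.randint(0, 2**64 - 1)] = 'x'
            return time.time() - start
        finally:
            gc.enable()

    if __name__ == '__main__':
        n = 1000000   # smaller than the thread's 8191180 keys, for a quick run
        print "gc on:  %.1fs" % fill(n, True)
        print "gc off: %.1fs" % fill(n, False)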
     
    Michael Bacarella, Nov 11, 2007
    #1

  2. On Sat, 10 Nov 2007 17:18:37 -0800, Michael Bacarella wrote:


    > So, you think Python's dict implementation degrades towards O(N)
    > performance when it's fed millions of 64-bit pseudo-random longs?


    No.

    Here's my sample file:

    $ wc -l id2name.txt
    8191180 id2name.txt
    $ ls -lh id2name.txt
    -rw-rw-r-- 1 steve steve 371M 2007-11-11 14:00 id2name.txt

    And the results of reading it into a dict (code follows below):

    $ time ./slurp_dict.py
    Starting at Sun Nov 11 14:26:51 2007
    Line 0
    Line 1000000
    Line 2000000
    Line 3000000
    Line 4000000
    Line 5000000
    Line 6000000
    Line 7000000
    Line 8000000
    Items in dict: 8191180
    Completed import at Sun Nov 11 14:29:31 2007
    Starting to delete dict...


    Traceback (most recent call last):
      File "./slurp_dict.py", line 20, in <module>
        del id2name
    KeyboardInterrupt

    real 35m52.334s
    user 1m17.663s
    sys 0m16.758s


    Notice that the dict is completely read into memory in just two and a
    half minutes. The script then tries to delete the dict, and 32 minutes
    later is still struggling. That's the point I got sick of waiting and
    interrupted the script.

    Conclusion: it's a memory issue, or maybe a garbage collection issue, not
    a problem with dicts.




    Here's my code for creating the key:value file in the first place:

    #!/usr/bin/python
    """Make a big file of 64-bit integer keys plus random values."""

    bits64 = 2**64
    import random
    template = '%d:This is a bunch of text...\n'
    fp = open('id2name.txt', 'w')
    for i in xrange(8191180):
        fp.write(template % random.randint(0, bits64))
    fp.close()

    ###

    And here's my code for slurping it in to a dict:

    #!/usr/bin/python
    """Read a big file into a dict."""

    import gc
    import time
    print "Starting at %s" % time.asctime()
    flag = gc.isenabled()
    gc.disable()
    id2name = {}
    for n, line in enumerate(open('id2name.txt', 'r')):
        if n % 1000000 == 0:
            # Give feedback.
            print "Line %d" % n
        id,name = line.strip().split(':', 1)
        id = long(id)
        id2name[id] = name
    print "Items in dict:", len(id2name)
    print "Completed import at %s" % time.asctime()
    print "Starting to delete dict..."
    del id2name
    print "Completed deletion at %s" % time.asctime()
    if flag:
        gc.enable()
    print "Finishing at %s" % time.asctime()

    ###


    So, what can you do? Given that I'm not willing to do any more unpaid
    experimentation for you, here are my suggestions, in no particular order:

    (1) Presumably you don't want to run your app with the garbage collector
    turned off. You might still want to play around with the gc module to see
    what you can learn.

    (2) More memory will help avoid paging. If you can't get more memory, try
    more virtual memory. It will still be slow, but at least the operating
    system doesn't have to try moving blocks around as much.

    (3) Are you sure you need all eight-million-plus items in the cache all
    at once?

    (4) There are lots of algorithms out there for dealing with data too big
    to fit into main memory. Do some research.

    (5) There is a data structure designed for dealing with tens of millions
    of records at once. It is called "a database". If you can't find a better
    algorithm, and refuse to use an existing RDBMS, I suspect you're going to
    end up inventing a primitive, inefficient, buggy database which is no
    faster than existing systems out there.
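
    As a rough illustration of suggestion (5), here is a minimal sketch using
    the anydbm module from the standard library to keep the mapping on disk
    instead of in a ~500MB in-memory dict. The database file name and the
    lookup key are made up for illustration; the text file is the same
    id2name.txt used above.

    #!/usr/bin/python
    # Sketch of suggestion (5): store the id -> name mapping on disk with
    # anydbm instead of building a huge in-memory dict.
    import anydbm

    db = anydbm.open('id2name.db', 'c')   # dbm keys and values must be strings
    for line in open('id2name.txt'):
        id, name = line.strip().split(':', 1)
        db[id] = name                     # keep the id in its string form
    db.close()

    # Later, individual lookups read from disk rather than from RAM:
    db = anydbm.open('id2name.db', 'r')
    try:
        print db['12345']                 # hypothetical key, for illustration
    except KeyError:
        print 'not found'
    db.close()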



    --
    Steven.
     
    Steven D'Aprano, Nov 11, 2007
    #2

  3. Steven D'Aprano wrote:
    > (2) More memory will help avoid paging. If you can't get more memory, try
    > more virtual memory. It will still be slow, but at least the operating
    > system doesn't have to try moving blocks around as much.
    >


    Based on his previous post, it would seem he has 7GB of RAM (with about
    5GB free) and 2GB of swap. I don't think RAM is the issue.

    Maybe there's something wrong with his specific Python installation.
    What version is it and was it compiled by him?
     
    Yu-Xi Lim, Nov 11, 2007
    #3
  4. Michael Bacarella <> writes:
    > If only it were so easy.


    I think I know what's going on: the dictionary updates are sending the
    GC into quadratic behavior. Try turning off the GC:

    import gc
    gc.disable()
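
    For instance, a minimal sketch of that suggestion applied to the loop from
    the original post, with a try/finally wrapper (an editorial addition, not
    something from the post) so the collector is re-enabled afterwards:

    import gc

    gc.disable()
    try:
        id2name = {}
        for line in open('id2name.txt'):
            id, name = line.strip().split(':', 1)
            id2name[long(id)] = name    # build the big dict with GC off
    finally:
        gc.enable()                     # restore normal collection afterwards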
     
    Paul Rubin, Nov 11, 2007
    #4
  5. >>>>> "Steven" == Steven D'Aprano <> writes:

    Steven> $ time ./slurp_dict.py
    Steven> Starting at Sun Nov 11 14:26:51 2007
    Steven> Line 0
    Steven> Line 1000000
    Steven> Line 2000000
    Steven> Line 3000000
    Steven> Line 4000000
    Steven> Line 5000000
    Steven> Line 6000000
    Steven> Line 7000000
    Steven> Line 8000000
    Steven> Items in dict: 8191180
    Steven> Completed import at Sun Nov 11 14:29:31 2007
    Steven> Starting to delete dict...
    Steven>
    Steven> Traceback (most recent call last):
    Steven>   File "./slurp_dict.py", line 20, in <module>
    Steven>     del id2name
    Steven> KeyboardInterrupt
    Steven>
    Steven> real 35m52.334s
    Steven> user 1m17.663s
    Steven> sys 0m16.758s


    Steven> Notice that the dict is completely read into memory in
    Steven> just two and a half minutes. The script then tries to
    Steven> delete the dict, and 32 minutes later is still
    Steven> struggling. That's the point I got sick of waiting and
    Steven> interrupted the script.

    Steven> Conclusion: it's a memory issue, or maybe a garbage
    Steven> collection issue, not a problem with dicts.


    Uh, strange results...

    I ran your scripts with and without garbage collection enabled, and
    these are the results:

    with gc enabled:

    azazel@lizard:~/wip/zodb_test$ python slurp_dict.py
    Starting at Sun Nov 11 16:35:12 2007
    Line 0
    Line 1000000
    Line 2000000
    Line 3000000
    Line 4000000
    Line 5000000
    Line 6000000
    Line 7000000
    Line 8000000
    Items in dict: 8191180
    Completed import at Sun Nov 11 16:36:03 2007
    Starting to delete dict...
    Completed deletion at Sun Nov 11 16:36:09 2007
    Finishing at Sun Nov 11 16:36:09 2007

    and without gc enabled:

    azazel@lizard:~/wip/zodb_test$ python slurp_dict.py
    Starting at Sun Nov 11 16:39:02 2007
    Line 0
    Line 1000000
    Line 2000000
    Line 3000000
    Line 4000000
    Line 5000000
    Line 6000000
    Line 7000000
    Line 8000000
    Items in dict: 8191180
    Completed import at Sun Nov 11 16:39:49 2007
    Starting to delete dict...
    Completed deletion at Sun Nov 11 16:39:56 2007
    Finishing at Sun Nov 11 16:39:56 2007


    all with python2.4 on i386 Linux

    cheers

    Alberto
     
    Alberto Berti, Nov 11, 2007
    #5
  6. On Sun, 11 Nov 2007 04:32:44 -0000, Steven D'Aprano
    <> declaimed the following in
    comp.lang.python:

    >
    > real 35m52.334s
    > user 1m17.663s
    > sys 0m16.758s
    >
    >
    > Notice that the dict is completely read into memory in just two and a
    > half minutes. The script then tries to delete the dict, and 32 minutes
    > later is still struggling. That's the point I got sick of waiting and
    > interrupted the script.
    >

    There sure seems to be a problem somewhere... Out of insanity I just ran your
    exact programs -- on a 32-bit WinXP Pro machine with an ActiveState
    Python 2.4.x release. 2GB physical RAM, 1.5GB swap.

    >pythonw -u "Dload.py"

    Starting at Sun Nov 11 10:25:22 2007
    Line 0
    Line 1000000
    Line 2000000
    Line 3000000
    Line 4000000
    Line 5000000
    Line 6000000
    Line 7000000
    Line 8000000
    Items in dict: 8191180
    Completed import at Sun Nov 11 10:26:08 2007
    Starting to delete dict...
    Completed deletion at Sun Nov 11 10:26:12 2007
    Finishing at Sun Nov 11 10:26:12 2007
    >Exit code: 0


    Peak commit charge: 1437724K, which may have just been large enough
    to start swapping when one accounts for the OS, Firefox, Eudora, and
    Agent (as user applications -- I won't bother to track down the
    background processes).
    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
     
    Dennis Lee Bieber, Nov 11, 2007
    #6
  7. Paul Rubin <http://> writes:

    > Michael Bacarella <> writes:
    >> If only it were so easy.

    >
    > I think I know what's going on: the dictionary updates are sending the
    > GC into quadratic behavior. Try turning off the GC:
    >
    > import gc
    > gc.disable()


    This is becoming an FAQ on this newsgroup, having come up first with
    lists and now with dictionaries. It can also be considered a serious
    implementation problem, since code like Michael's is expected to
    perform in (close to) linear time and in fact did so in previous
    versions of Python, and still does in Perl and other popular scripting
    languages. Neither introductory nor intermediate Python learning
    materials cover the need to disable GC when building large data
    structures.

    Maybe the default GC strategy should be rethought.
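
    A minimal sketch of one middle ground (an editorial illustration, not
    something proposed in the thread): instead of disabling collection
    outright, raise the generation-0 threshold so the collector runs far less
    often during a bulk load. The threshold value below is illustrative, not
    tuned.

    import gc

    old = gc.get_threshold()            # defaults to (700, 10, 10)
    gc.set_threshold(100000, 10, 10)    # collect far less frequently
    try:
        id2name = {}
        for line in open('id2name.txt'):
            id, name = line.strip().split(':', 1)
            id2name[long(id)] = name
    finally:
        gc.set_threshold(*old)          # restore the previous thresholds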
     
    Hrvoje Niksic, Nov 12, 2007
    #7
