Populating a dictionary, fast

Michael Bacarella

The id2name.txt file is an index of primary keys to strings. They look like this:

11293102971459182412:Descriptive unique name for this record\n
950918240981208142:Another name for another record\n

The file's properties are:

# wc -l id2name.txt
8191180 id2name.txt
# du -h id2name.txt
517M id2name.txt

I'm loading the file into memory with code like this:

id2name = {}
for line in iter(open('id2name.txt').readline, ''):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name

This takes about 45 *minutes*.

If I comment out the last line in the loop body it takes only about 30 _seconds_ to run.
This would seem to implicate the line id2name[id] = name as being excruciatingly slow.

Is there a fast, functionally equivalent way of doing this?

(Yes, I really do need this cached. No, an RDBMS or disk-based hash is not fast enough.)
 
Ben Finney

Michael Bacarella said:
id2name = {}
for line in iter(open('id2name.txt').readline, ''):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name

This takes about 45 *minutes*.

If I comment out the last line in the loop body it takes only about
30 _seconds_ to run. This would seem to implicate the line
id2name[id] = name as being excruciatingly slow.

Or, rather, that the slowdown is caused by allocating these items in a
dictionary at all.

Dictionaries are implemented very efficiently in Python, but there
will still be overhead in inserting millions of distinct items. Of
course, if you just throw each item away instead of allocating space
for it, the loop will run very quickly.
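A rough way to see that insertion cost in isolation (just a sketch,
reusing the entry count from the original post) is to time the bare
insertions, with no file I/O or parsing at all::

    import time

    # Time ~8.2 million bare dict insertions, isolating insertion
    # cost from parsing and file reading.
    start = time.time()
    d = {}
    for i in xrange(8191180):
        d[i] = 'x'
    print 'pure insertions took %.1f seconds' % (time.time() - start)

If even that is slow on your machine, the dict really is the
bottleneck; if it's quick, the problem is elsewhere.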
Is there a fast, functionally equivalent way of doing this?

You could, instead of individual assignments in a 'for' loop, try
letting the 'dict' type operate on a generator::

    input_file = open("id2name.txt")
    id2name = dict(
        (long(id), name)
        for (id, name) in (line.strip().split(":") for line in input_file)
    )

All that code inside the 'dict()' call is a "generator expression"; if
you don't know what they are yet, have a read of Python's
documentation on them. It creates a generator which will spit out
key+value tuples to be fed directly to the dict constructor as it
requests them.

That allows the generator to parse each item from the file exactly as
the 'dict' constructor needs it, possibly saving some extra "allocate,
assign, discard" steps. Not having your data set, I can't say if it'll
be significantly faster.
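
As a quick sanity check of the same pattern on made-up values
(hypothetical data, just to show the shape of the expression)::

    sample = ["123:alpha\n", "456:beta\n"]
    d = dict(
        (long(id), name)
        for (id, name) in (line.strip().split(":") for line in sample)
    )
    assert d == {123L: "alpha", 456L: "beta"}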
 
Steven D'Aprano

Michael Bacarella said:
The id2name.txt file is an index of primary keys to strings. They look
like this:

11293102971459182412:Descriptive unique name for this record\n
950918240981208142:Another name for another record\n

The file's properties are:

# wc -l id2name.txt
8191180 id2name.txt
# du -h id2name.txt
517M id2name.txt

I'm loading the file into memory with code like this:

id2name = {}
for line in iter(open('id2name.txt').readline, ''):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name

That's an awfully complicated way to iterate over a file. Try this
instead:

id2name = {}
for line in open('id2name.txt'):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name


On my system, it takes about a minute and a half to produce a dictionary
with 8191180 entries.

This takes about 45 *minutes*.

If I comment out the last line in the loop body it takes only about 30
_seconds_ to run. This would seem to implicate the line id2name[id] =
name as being excruciatingly slow.

No, dictionary access is one of the most highly-optimized, fastest, most
efficient parts of Python. What it indicates to me is that your system is
running low on memory, and is struggling to find room for 517MB worth of
data.
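
One rough way to check that (a sketch; it assumes Linux, where
ru_maxrss is reported in kilobytes) is to compare the process's peak
resident set size before and after the load:

import resource

def peak_rss_kb():
    # ru_maxrss is in kilobytes on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
id2name = {}
for line in open('id2name.txt'):
    id, name = line.strip().split(':')
    id2name[long(id)] = name
print 'load grew peak RSS by about %d KB' % (peak_rss_kb() - before)

If the growth approaches your physical RAM, the machine is swapping,
and no amount of dict tuning will help.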

Is there a fast, functionally equivalent way of doing this?

(Yes, I really do need this cached. No, an RDBMS or disk-based hash is
not fast enough.)

You'll pardon me if I'm skeptical. Considering the convoluted, weird way
you had to iterate over a file, I wonder what other less-than-efficient
parts of your code you are struggling under. Nine times out of ten, if a
program runs too slowly, it's because you're using the wrong algorithm.
 
Paul Rubin

Michael Bacarella said:
Is there a fast, functionally equivalent way of doing this?

(Yes, I really do need this cached. No, an RDBMS or disk-based hash
is not fast enough.)

As Steven says, maybe you need to add more RAM to your system. The
memory overhead of dictionary cells is considerable. If worse comes
to worst, you could concoct some more storage-efficient representation.
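
For example, one such representation (just a sketch, not tested on the
original data; same parsing assumptions as the code above) trades the
dict's hash table for two parallel lists sorted by key, with a binary
search per lookup:

import bisect

keys = []
names = []
for line in open('id2name.txt'):
    id, name = line.strip().split(':')
    keys.append(long(id))
    names.append(name)

# Sort both lists together by key; a one-off cost at load time.
order = sorted(range(len(keys)), key=keys.__getitem__)
keys = [keys[i] for i in order]
names = [names[i] for i in order]

def lookup(key):
    # Binary search replaces hashing: O(log n) per lookup, but no
    # hash-table overhead per entry.
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return names[i]
    raise KeyError(key)

Lookups go from O(1) to O(log n), about 23 comparisons for 8 million
entries, which may be a fair trade for the memory saved.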
 
Istvan Albert

Michael Bacarella said:
This would seem to implicate the line id2name[id] = name as being
excruciatingly slow.

As others have pointed out, there is no way that this takes 45
minutes. It must be something with your system or setup.

Functionally equivalent code runs for me in about 49 seconds!
(It ends up using about twice as much RAM as the data on disk.)

i.
 
