Populating a dictionary, fast

Michael Bacarella

The id2name.txt file is an index of primary keys to strings. They look like this:

11293102971459182412:Descriptive unique name for this record\n
950918240981208142:Another name for another record\n

The file's properties are:

# wc -l id2name.txt
8191180 id2name.txt
# du -h id2name.txt
517M id2name.txt

I'm loading the file into memory with code like this:

id2name = {}
for line in iter(open('id2name.txt').readline, ''):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name

This takes about 45 *minutes*.

If I comment out the last line in the loop body it takes only about 30 _seconds_ to run.
This would seem to implicate the line id2name[id] = name as being excruciatingly slow.

Is there a fast, functionally equivalent way of doing this?

(Yes, I really do need this cached. No, an RDBMS or disk-based hash is not fast enough.)
 
Ben Finney

Michael Bacarella said:
id2name = {}
for line in iter(open('id2name.txt').readline, ''):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name

This takes about 45 *minutes*.

If I comment out the last line in the loop body it takes only about
30 _seconds_ to run. This would seem to implicate the line
id2name[id] = name as being excruciatingly slow.

Or, rather, that the slowdown is caused by allocating these items in a
dictionary at all.

Dictionaries are implemented very efficiently in Python, but there
will still be overhead in inserting millions of distinct items. Of
course, if you just throw each item away instead of allocating space
for it, the loop will run very quickly.
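A rough way to see that insertion cost in isolation (just a sketch,
reusing the entry count from the original post) is to time the bare
insertions, with no file I/O or parsing at all::

    import time

    # Time ~8.2 million bare dict insertions, isolating insertion
    # cost from parsing and file reading.
    start = time.time()
    d = {}
    for i in xrange(8191180):
        d[i] = 'x'
    print 'pure insertions took %.1f seconds' % (time.time() - start)

If even that is slow on your machine, the dict really is the
bottleneck; if it's quick, the problem is elsewhere.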
Is there a fast, functionally equivalent way of doing this?

You could, instead of individual assignments in a 'for' loop, try
letting the 'dict' type operate on a generator::

    input_file = open("id2name.txt")
    id2name = dict(
        (long(id), name)
        for (id, name) in (line.strip().split(":") for line in input_file)
    )

All that code inside the 'dict()' call is a "generator expression"; if
you don't know what they are yet, have a read of Python's
documentation on them. It creates a generator which will spit out
key+value tuples to be fed directly to the dict constructor as it
requests them.

That allows the generator to parse each item from the file exactly as
the 'dict' constructor needs it, possibly saving some extra "allocate,
assign, discard" steps. Not having your data set, I can't say if it'll
be significantly faster.
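
As a quick sanity check of the same pattern on made-up values
(hypothetical data, just to show the shape of the expression)::

    sample = ["123:alpha\n", "456:beta\n"]
    d = dict(
        (long(id), name)
        for (id, name) in (line.strip().split(":") for line in sample)
    )
    assert d == {123L: "alpha", 456L: "beta"}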
 
Steven D'Aprano

Michael Bacarella said:
The id2name.txt file is an index of primary keys to strings. They look
like this:

11293102971459182412:Descriptive unique name for this record\n
950918240981208142:Another name for another record\n

The file's properties are:

# wc -l id2name.txt
8191180 id2name.txt
# du -h id2name.txt
517M id2name.txt

I'm loading the file into memory with code like this:

id2name = {}
for line in iter(open('id2name.txt').readline, ''):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name

That's an awfully complicated way to iterate over a file. Try this
instead:

id2name = {}
for line in open('id2name.txt'):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name


On my system, it takes about a minute and a half to produce a dictionary
with 8191180 entries.

This takes about 45 *minutes*.

If I comment out the last line in the loop body it takes only about 30
_seconds_ to run. This would seem to implicate the line id2name[id] =
name as being excruciatingly slow.

No, dictionary access is one of the most highly-optimized, fastest, most
efficient parts of Python. What it indicates to me is that your system is
running low on memory, and is struggling to find room for 517MB worth of
data.
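
One rough way to check that (a sketch; it assumes Linux, where
ru_maxrss is reported in kilobytes) is to compare the process's peak
resident set size before and after the load:

import resource

def peak_rss_kb():
    # ru_maxrss is in kilobytes on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
id2name = {}
for line in open('id2name.txt'):
    id, name = line.strip().split(':')
    id2name[long(id)] = name
print 'load grew peak RSS by about %d KB' % (peak_rss_kb() - before)

If the growth approaches your physical RAM, the machine is swapping,
and no amount of dict tuning will help.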

Is there a fast, functionally equivalent way of doing this?

(Yes, I really do need this cached. No, an RDBMS or disk-based hash is
not fast enough.)

You'll pardon me if I'm skeptical. Considering the convoluted, weird way
you had to iterate over a file, I wonder what other less-than-efficient
parts of your code you are struggling under. Nine times out of ten, if a
program runs too slowly, it's because you're using the wrong algorithm.
 
Paul Rubin

Michael Bacarella said:
Is there a fast, functionally equivalent way of doing this?

(Yes, I really do need this cached. No, an RDBMS or disk-based hash
is not fast enough.)

As Steven says, maybe you need to add more RAM to your system. The
memory overhead of dictionary cells is considerable. If worse comes
to worst, you could concoct some more storage-efficient representation.
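
For example, one such representation (just a sketch, not tested on the
original data; same parsing assumptions as the code above) trades the
dict's hash table for two parallel lists sorted by key, with a binary
search per lookup:

import bisect

keys = []
names = []
for line in open('id2name.txt'):
    id, name = line.strip().split(':')
    keys.append(long(id))
    names.append(name)

# Sort both lists together by key; a one-off cost at load time.
order = sorted(range(len(keys)), key=keys.__getitem__)
keys = [keys[i] for i in order]
names = [names[i] for i in order]

def lookup(key):
    # Binary search replaces hashing: O(log n) per lookup, but no
    # hash-table overhead per entry.
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return names[i]
    raise KeyError(key)

Lookups go from O(1) to O(log n), about 23 comparisons for 8 million
entries, which may be a fair trade for the memory saved.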
 
Istvan Albert

Michael Bacarella said:
This would seem to implicate the line id2name[id] = name as being
excruciatingly slow.

As others have pointed out, there is no way that this takes 45
minutes. It must be something with your system or setup.

Functionally equivalent code runs for me in about 49 seconds!
(It ends up using about twice as much RAM as the data on disk.)

i.
 
