dict is really slow for big truck


forrest yang

I am trying to load a big file into a dict. The file is about 9,000,000
lines, something like:

1 2 3 4
2 2 3 4
3 4 5 6

code:

d = {}  # a plain dict; don't shadow the built-in name `dict`
for line in open(file):
    arr = line.strip().split('\t')
    d[arr[0]] = arr

but the dict gets really slow as I load more data into memory. By the
way, the Mac I use has 16G of memory.
Is this caused by poor performance when the dict grows, or by some
other reason?
Can anyone provide a better solution?
 

Thomas G. Willis

> I am trying to load a big file into a dict. The file is about
> 9,000,000 lines, something like:
>
> 1 2 3 4
> 2 2 3 4
> 3 4 5 6
>
> code:
>
> d = {}
> for line in open(file):
>     arr = line.strip().split('\t')
>     d[arr[0]] = arr
>
> but the dict gets really slow as I load more data into memory. By
> the way, the Mac I use has 16G of memory.
> Is this caused by poor performance when the dict grows, or by some
> other reason?
> Can anyone provide a better solution?

I think you need to upgrade the L2 cache on your CPU.
 

Aahz

> I am trying to load a big file into a dict. The file is about
> 9,000,000 lines, something like:
>
> 1 2 3 4
> 2 2 3 4
> 3 4 5 6
>
> code:
>
> d = {}
> for line in open(file):
>     arr = line.strip().split('\t')
>     d[arr[0]] = arr
>
> but the dict gets really slow as I load more data into memory. By
> the way, the Mac I use has 16G of memory.
> Is this caused by poor performance when the dict grows, or by some
> other reason?

Try gc.disable() before the loop and gc.enable() afterward.
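
Something like this minimal sketch of the idea (reusing the loop from
the original post; in older CPythons the cyclic collector is triggered
by allocation counts, so it keeps re-running while millions of
long-lived objects are being built):

import gc

d = {}
gc.disable()            # suspend cyclic GC during the bulk load
try:
    for line in open(file):
        arr = line.strip().split('\t')
        d[arr[0]] = arr
finally:
    gc.enable()         # always restore GC, even if the loop raises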
 

bearophileHUGS

> I am trying to load a big file into a dict. The file is about
> 9,000,000 lines, something like:
>
> 1 2 3 4
> 2 2 3 4
> 3 4 5 6
>
> code:
>
> d = {}
> for line in open(file):
>     arr = line.strip().split('\t')
>     d[line.split(None, 1)[0]] = arr
>
> but the dict gets really slow as I load more data into memory. By
> the way, the Mac I use has 16G of memory.
> Is this caused by poor performance when the dict grows, or by some
> other reason?
> Can anyone provide a better solution?

Keys are integers, so they are very efficiently managed by the dict.
If I do this:
d = dict.fromkeys(xrange(9000000))

It takes only a little more than a second on my normal PC.
So probably the problem isn't in the dict, it's the I/O and/or the
list allocation. A possible suggestion is to not split the arrays, but
keep them as strings, and split them only when you use them:

d = {}
for line in open(file):
    line = line.strip()
    d[line.split(None, 1)[0]] = line

if that's not fast enough you can simplify it:

d = {}
for line in open(file):
    d[line.split(None, 1)[0]] = line

If you still have memory problems, then you can keep only the line
numbers as dict values, or even absolute file positions, to seek to
later. You can also use memory-mapped files.
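
A sketch of the file-position idea (untested; the file should be opened
in binary mode and read with readline(), because tell() is not reliable
inside a plain "for line in f" loop):

f = open(file, 'rb')
offsets = {}
while True:
    pos = f.tell()
    line = f.readline()
    if not line:
        break
    offsets[line.split(None, 1)[0]] = pos   # store only the position

# later, fetch a single record on demand instead of keeping them all:
f.seek(offsets[some_key])                   # some_key: whatever key you need
arr = f.readline().strip().split('\t')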

Tell us how the performance is now.

Bye,
bearophile
 

Steven D'Aprano

> Try gc.disable() before the loop and gc.enable() afterward.

And if that doesn't help, it's possible that you have run into a variant
of this obscure, hardware-specific bug:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/77e5d747c4a727cb/


Warning: it's a LONG discussion, and the message threading in it is badly broken.
There was no consensus as to whether it was a real bug or not, although
I'm convinced it was. If you care, please spend some time reading the
various posts with subject lines:

"Populating a dictionary, fast"
"Populating a dictionary, fast [SOLVED]"
"Populating a dictionary, fast [SOLVED SOLVED]"

here:

http://mail.python.org/pipermail/python-list/2007-November/thread.html
 

Sion Arrowsmith

>> for line in open(file):
>>     arr = line.strip().split('\t')
>>     d[line.split(None, 1)[0]] = arr
>
> Keys are integers, so they are very efficiently managed by the dict.

The keys aren't integers, though; they're strings. Though I don't
think that's going to make much difference.

You can still get the original result with a single split call
(the second split may indeed be a significant overhead):

for line in open(file):
    arr = line.strip().split('\t')
    d[arr[0]] = arr

unless I've misunderstood the data format. (And if I have, is it
really the intention that "1 1\t1\t1" gets overwritten by a
subsequent "1 2\t3\t4", as the original code does?)
 

Bruno Desthuilliers

(e-mail address removed) wrote:
>> I am trying to load a big file into a dict. The file is about
>> 9,000,000 lines, something like:
>>
>> 1 2 3 4
>> 2 2 3 4
>> 3 4 5 6
>>
>> code:
>>
>> d = {}
>> for line in open(file):
>>     arr = line.strip().split('\t')
>>     d[line.split(None, 1)[0]] = arr
>>
>> but the dict gets really slow as I load more data into memory. By
>> the way, the Mac I use has 16G of memory.
>> Is this caused by poor performance when the dict grows, or by some
>> other reason?
>> Can anyone provide a better solution?

> Keys are integers,

Actually strings. But this is probably not the problem here.

> so they are very efficiently managed by the dict.
> If I do this:
> d = dict.fromkeys(xrange(9000000))
> It takes only a little more than a second on my normal PC.
> So probably the problem isn't in the dict, it's the I/O

If the OP experiences a noticeable slowdown during the process, then I
doubt the problem is with I/O. If he finds the process slow but of
constant slowness, then it may or may not have to do with I/O, though
probably not as the single factor.

Hint: don't guess, profile.
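
For instance (just a sketch; 'data.txt' stands in for the real file name):

import cProfile

def load(path):
    d = {}
    for line in open(path):
        arr = line.strip().split('\t')
        d[arr[0]] = arr
    return d

cProfile.run('load("data.txt")')    # prints where the time is actually spent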
> and/or the list allocation. A possible suggestion is to not split the
> arrays, but keep them as strings, and split them only when you use
> them:
>
> d = {}
> for line in open(file):
>     line = line.strip()
>     d[line.split(None, 1)[0]] = line

You still split the string - but only once, which is indeed better !-)

But you can have your cake and eat it too:

d = {}
for line in open(thefile):
    arr = line.strip().split()
    d[arr[0]] = arr

> if that's not fast enough you can simplify it:
>
> d = {}
> for line in open(file):
>     d[line.split(None, 1)[0]] = line

I doubt this will save that much processing time...

> If you still have memory problems, then you can keep only the line
> numbers as dict values, or even absolute file positions, to seek to
> later. You can also use memory-mapped files.
>
> Tell us how the performance is now.

IMHO, not much better...
 

pruebauno

Scott said:
> Bruno said:
>> d = {}
>> for line in open(thefile):
>>     arr = line.strip().split()
>>     d[arr[0]] = arr
>
> Sorry, not picking on Bruno in particular, but I keep seeing
> this formulation around various places.
> When does line.strip().split() ever differ from line.split()?

They don't. It is probably out of habit of using the generalized idiom:

>>> line = "a,b,c\n"
>>> line.strip().split(",")
['a', 'b', 'c']
>>> line.split(",")
['a', 'b', 'c\n']
 

MRAB

Scott said:
> Bruno said:
>> d = {}
>> for line in open(thefile):
>>     arr = line.strip().split()
>>     d[arr[0]] = arr
>
> Sorry, not picking on Bruno in particular, but I keep seeing
> this formulation around various places.
> When does line.strip().split() ever differ from line.split()?

From http://www.python.org/doc/current/library/stdtypes.html?highlight=split#str.split :

"If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start or
end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns []."
 

J. Cliff Dyer

Scott said:
> Bruno said:
>> d = {}
>> for line in open(thefile):
>>     arr = line.strip().split()
>>     d[arr[0]] = arr
>
> Sorry, not picking on Bruno in particular, but I keep seeing
> this formulation around various places.
> When does line.strip().split() ever differ from line.split()?

Good question. I can't count the number of times I've used
line.strip().split() in my own code, just because I didn't 1) read the
documentation closely enough, or 2) try the alternative; but lo and
behold, they are the same, at least in the cases I was trying to
account for:

>>> ' a b c '.split() == ' a b c '.strip().split() == 'a b c'.split()
True

Thanks for pointing this out.

Cheers,
Cliff
 

MRAB

Scott said:
> MRAB said:
>> Scott said:
>>> Bruno Desthuilliers wrote:
>>>> d = {}
>>>> for line in open(thefile):
>>>>     arr = line.strip().split()
>>>>     d[arr[0]] = arr
>>>
>>> Sorry, not picking on Bruno in particular, but I keep seeing
>>> this formulation around various places.
>>> When does line.strip().split() ever differ from line.split()?
>>
>> ... <explanation of what split does>
>
> You misunderstand the question. For what values of line is
> the following expression False?
>
> line.strip().split() == line.split()
>
> If you know of one, I'd like to hear of it. If not, I assert
> that (barring some incredibly over-ambitious optimizer operations)
> line.strip().split()
> is simply an inefficient (both performance and text) way of saying:
> line.split()

From the explanation it's clear that the .strip() is unnecessary.
 

Bruno Desthuilliers

(e-mail address removed) wrote:
> Sion Arrowsmith:
>
> You are right, sorry. I need to add an int() there.

Which is not guaranteed to speed up the code, FWIW.
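
One way to check rather than guess - a quick timeit sketch (the
absolute numbers will vary by machine and interpreter):

import timeit

stmt = "d = dict.fromkeys(keys)"
int_setup = "keys = range(1000000)"
str_setup = "keys = [str(i) for i in range(1000000)]"
print(timeit.timeit(stmt, setup=int_setup, number=5))   # int keys
print(timeit.timeit(stmt, setup=str_setup, number=5))   # str keys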
 

Bruno Desthuilliers

forrest yang wrote:
> I am trying to load a big file into a dict. The file is about
> 9,000,000 lines, something like:
>
> 1 2 3 4
> 2 2 3 4
> 3 4 5 6

How "like" is it ?-)

> code:
>
> d = {}
> for line in open(file):
>     arr = line.strip().split('\t')
>     d[arr[0]] = arr

> but the dict gets really slow as I load more data into memory,

Looks like your system is starting to swap. Use 'top' or any other
system monitor to check it out.

> by the way, the Mac I use has 16G of memory.
> Is this caused by poor performance when the dict grows,

dicts are Python's central data type (objects are based on dicts, all
non-local namespaces are based on dicts, etc.), so you can safely
assume they are highly optimized.

> or by some other reason?

FWIW, a very loose (and partially wrong, cf. below) estimation based on
wild guesses: assuming an average size of 512 bytes per object
(remember that Python doesn't have 'primitive' types), the above would
use ~22G.

Fortunately, CPython does some caching for some values of some immutable
types (specifically, small ints and strings that respect the grammar for
Python identifiers), so depending on your real data, you might need a
bit less RAM. Also, the 512 bytes per object is really more of a wild
guess than anything else (but given the internal structure of a CPython
object, I think it's about that order - please someone correct me if I'm
plain wrong).
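
You can sanity-check the per-record guess on a sample line with
sys.getsizeof (a rough sketch: it counts each object's own memory only,
ignoring the dict's internal table and any sharing due to caching):

import sys

line = "1\t2\t3\t4"
arr = line.strip().split('\t')

per_record = sys.getsizeof(arr) + sum(sys.getsizeof(s) for s in arr)
print(per_record)              # bytes for one record's list + strings
print(per_record * 9000000)    # naive scale-up to 9,000,000 records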

Anyway: I'm afraid the problem has more to do with your design than with
your code or Python's dict implementation itself.
> Can anyone provide a better solution?

Use a DBMS. They are designed - and highly optimised - for fast lookup
over huge data sets.
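
For example, with the sqlite3 module from the standard library (a
sketch only: 'data.txt', the table and the column names are made up
here, and the input is assumed to be tab-separated with the key in the
first column):

import sqlite3

conn = sqlite3.connect('records.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (k TEXT PRIMARY KEY, v TEXT)')

with open('data.txt') as f:
    rows = ((line.split('\t', 1)[0], line.rstrip('\n')) for line in f)
    conn.executemany('INSERT OR REPLACE INTO records VALUES (?, ?)', rows)
conn.commit()

# look up one record without holding 9,000,000 lines in RAM:
row = conn.execute('SELECT v FROM records WHERE k = ?', ('42',)).fetchone()

The PRIMARY KEY gives you an index, so lookups stay fast without
keeping the whole data set in memory.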

My 2 cents.
 
