dict is really slow for big truck


forrest yang

I am trying to load a big file into a dict. The file is about 9,000,000
lines, something like:

1 2 3 4
2 2 3 4
3 4 5 6

code:

d = {}  # a plain dict; don't shadow the built-in name `dict`
for line in open(file):
    arr = line.strip().split('\t')
    d[arr[0]] = arr

but the dict gets really slow as I load more data into memory. By the
way, the Mac I use has 16G of memory.
Is this caused by poor performance when the dict grows, or by some
other reason?
Can anyone provide a better solution?
 

Thomas G. Willis

> I am trying to load a big file into a dict. The file is about
> 9,000,000 lines, something like:
>
> 1 2 3 4
> 2 2 3 4
> 3 4 5 6
>
> code:
>
> d = {}
> for line in open(file):
>     arr = line.strip().split('\t')
>     d[arr[0]] = arr
>
> but the dict gets really slow as I load more data into memory. By
> the way, the Mac I use has 16G of memory.
> Is this caused by poor performance when the dict grows, or by some
> other reason?
> Can anyone provide a better solution?

I think you need to upgrade the L2 cache on your CPU.
 

Aahz

> I am trying to load a big file into a dict. The file is about
> 9,000,000 lines, something like:
>
> 1 2 3 4
> 2 2 3 4
> 3 4 5 6
>
> code:
>
> d = {}
> for line in open(file):
>     arr = line.strip().split('\t')
>     d[arr[0]] = arr
>
> but the dict gets really slow as I load more data into memory. By
> the way, the Mac I use has 16G of memory.
> Is this caused by poor performance when the dict grows, or by some
> other reason?

Try gc.disable() before the loop and gc.enable() afterward.
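
Something like this minimal sketch of the idea (reusing the loop from
the original post; in older CPythons the cyclic collector is triggered
by allocation counts, so it keeps re-running while millions of
long-lived objects are being built):

import gc

d = {}
gc.disable()            # suspend cyclic GC during the bulk load
try:
    for line in open(file):
        arr = line.strip().split('\t')
        d[arr[0]] = arr
finally:
    gc.enable()         # always restore GC, even if the loop raises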
 

bearophileHUGS

> I am trying to load a big file into a dict. The file is about
> 9,000,000 lines, something like:
>
> 1 2 3 4
> 2 2 3 4
> 3 4 5 6
>
> code:
>
> d = {}
> for line in open(file):
>     arr = line.strip().split('\t')
>     d[line.split(None, 1)[0]] = arr
>
> but the dict gets really slow as I load more data into memory. By
> the way, the Mac I use has 16G of memory.
> Is this caused by poor performance when the dict grows, or by some
> other reason?
> Can anyone provide a better solution?

Keys are integers, so they are very efficiently managed by the dict.
If I do this:
d = dict.fromkeys(xrange(9000000))

It takes only a little more than a second on my normal PC.
So probably the problem isn't in the dict, it's the I/O and/or the
list allocation. A possible suggestion is to not split the arrays, but
keep them as strings, and split them only when you use them:

d = {}
for line in open(file):
    line = line.strip()
    d[line.split(None, 1)[0]] = line

if that's not fast enough you can simplify it:

d = {}
for line in open(file):
    d[line.split(None, 1)[0]] = line

If you still have memory problems, then you can keep only the line
numbers as dict values, or even absolute file positions, to seek to
later. You can also use memory-mapped files.
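
A sketch of the file-position idea (untested; the file should be opened
in binary mode and read with readline(), because tell() is not reliable
inside a plain "for line in f" loop):

f = open(file, 'rb')
offsets = {}
while True:
    pos = f.tell()
    line = f.readline()
    if not line:
        break
    offsets[line.split(None, 1)[0]] = pos   # store only the position

# later, fetch a single record on demand instead of keeping them all:
f.seek(offsets[some_key])                   # some_key: whatever key you need
arr = f.readline().strip().split('\t')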

Tell us how the performance is now.

Bye,
bearophile
 

Steven D'Aprano

> Try gc.disable() before the loop and gc.enable() afterward.

And if that doesn't help, it's possible that you have run into a variant
of this obscure, hardware-specific bug:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/77e5d747c4a727cb/


Warning: it's a LONG discussion, and the message threading in it is badly broken.
There was no consensus as to whether it was a real bug or not, although
I'm convinced it was. If you care, please spend some time reading the
various posts with subject lines:

"Populating a dictionary, fast"
"Populating a dictionary, fast [SOLVED]"
"Populating a dictionary, fast [SOLVED SOLVED]"

here:

http://mail.python.org/pipermail/python-list/2007-November/thread.html
 

Sion Arrowsmith

>> for line in open(file):
>>     arr = line.strip().split('\t')
>>     d[line.split(None, 1)[0]] = arr
>
> Keys are integers, so they are very efficiently managed by the dict.

The keys aren't integers, though; they're strings. Though I don't
think that's going to make much difference.

You can still get the original result with a single split call
(the second split may indeed be a significant overhead):

for line in open(file):
    arr = line.strip().split('\t')
    d[arr[0]] = arr

unless I've misunderstood the data format. (And if I have, is it
really the intention that "1 1\t1\t1" gets overwritten by a
subsequent "1 2\t3\t4", as the original code does?)
 

Bruno Desthuilliers

(e-mail address removed) wrote:
>> I am trying to load a big file into a dict. The file is about
>> 9,000,000 lines, something like:
>>
>> 1 2 3 4
>> 2 2 3 4
>> 3 4 5 6
>>
>> code:
>>
>> d = {}
>> for line in open(file):
>>     arr = line.strip().split('\t')
>>     d[line.split(None, 1)[0]] = arr
>>
>> but the dict gets really slow as I load more data into memory. By
>> the way, the Mac I use has 16G of memory.
>> Is this caused by poor performance when the dict grows, or by some
>> other reason?
>> Can anyone provide a better solution?

> Keys are integers,

Actually strings. But this is probably not the problem here.

> so they are very efficiently managed by the dict.
> If I do this:
> d = dict.fromkeys(xrange(9000000))
> It takes only a little more than a second on my normal PC.
> So probably the problem isn't in the dict, it's the I/O

If the OP experiences a noticeable slowdown during the process, then I
doubt the problem is with I/O. If he finds the process slow but of
constant slowness, then it may or may not have to do with I/O, though
probably not as the single factor.

Hint: don't guess, profile.
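
For instance (just a sketch; 'data.txt' stands in for the real file name):

import cProfile

def load(path):
    d = {}
    for line in open(path):
        arr = line.strip().split('\t')
        d[arr[0]] = arr
    return d

cProfile.run('load("data.txt")')    # prints where the time is actually spent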
> and/or the list allocation. A possible suggestion is to not split the
> arrays, but keep them as strings, and split them only when you use
> them:
>
> d = {}
> for line in open(file):
>     line = line.strip()
>     d[line.split(None, 1)[0]] = line

You still split the string - but only once, which is indeed better !-)

But you can have your cake and eat it too:

d = {}
for line in open(thefile):
    arr = line.strip().split()
    d[arr[0]] = arr

> if that's not fast enough you can simplify it:
>
> d = {}
> for line in open(file):
>     d[line.split(None, 1)[0]] = line

I doubt this will save that much processing time...

> If you still have memory problems, then you can keep only the line
> numbers as dict values, or even absolute file positions, to seek to
> later. You can also use memory-mapped files.
>
> Tell us how the performance is now.

IMHO, not much better...
 

pruebauno

Scott said:
> Bruno said:
>> d = {}
>> for line in open(thefile):
>>     arr = line.strip().split()
>>     d[arr[0]] = arr
>
> Sorry, not picking on Bruno in particular, but I keep seeing
> this formulation around various places.
> When does line.strip().split() ever differ from line.split()?

They don't. It is probably out of habit of using the generalized idiom:

>>> line = "a,b,c\n"
>>> line.strip().split(",")
['a', 'b', 'c']
>>> line.split(",")
['a', 'b', 'c\n']
 

MRAB

Scott said:
> Bruno said:
>> d = {}
>> for line in open(thefile):
>>     arr = line.strip().split()
>>     d[arr[0]] = arr
>
> Sorry, not picking on Bruno in particular, but I keep seeing
> this formulation around various places.
> When does line.strip().split() ever differ from line.split()?

From http://www.python.org/doc/current/library/stdtypes.html?highlight=split#str.split :

"If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start or
end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns []."
 

J. Cliff Dyer

Scott said:
> Bruno said:
>> d = {}
>> for line in open(thefile):
>>     arr = line.strip().split()
>>     d[arr[0]] = arr
>
> Sorry, not picking on Bruno in particular, but I keep seeing
> this formulation around various places.
> When does line.strip().split() ever differ from line.split()?

Good question. I can't count the number of times I've used
line.strip().split() in my own code, just because I didn't 1) read the
documentation closely enough, or 2) try the alternative; but lo and
behold, they are the same, at least in the cases I was trying to
account for:

>>> ' a b c '.split() == ' a b c '.strip().split() == 'a b c'.split()
True

Thanks for pointing this out.

Cheers,
Cliff
 

MRAB

Scott said:
> MRAB said:
>> Scott said:
>>> Bruno Desthuilliers wrote:
>>>> d = {}
>>>> for line in open(thefile):
>>>>     arr = line.strip().split()
>>>>     d[arr[0]] = arr
>>>
>>> Sorry, not picking on Bruno in particular, but I keep seeing
>>> this formulation around various places.
>>> When does line.strip().split() ever differ from line.split()?
>>
>> ... <explanation of what split does>
>
> You misunderstand the question. For what values of line is
> the following expression False?
>
> line.strip().split() == line.split()
>
> If you know of one, I'd like to hear of it. If not, I assert
> that (barring some incredibly over-ambitious optimizer operations)
> line.strip().split()
> is simply an inefficient (both performance and text) way of saying:
> line.split()

From the explanation it's clear that the .strip() is unnecessary.
 

Bruno Desthuilliers

(e-mail address removed) wrote:
> Sion Arrowsmith:
>
> You are right, sorry. I need to add an int() there.

Which is not guaranteed to speed up the code, FWIW.
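
One way to check rather than guess - a quick timeit sketch (the
absolute numbers will vary by machine and interpreter):

import timeit

stmt = "d = dict.fromkeys(keys)"
int_setup = "keys = range(1000000)"
str_setup = "keys = [str(i) for i in range(1000000)]"
print(timeit.timeit(stmt, setup=int_setup, number=5))   # int keys
print(timeit.timeit(stmt, setup=str_setup, number=5))   # str keys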
 

Bruno Desthuilliers

forrest yang wrote:
> I am trying to load a big file into a dict. The file is about
> 9,000,000 lines, something like:
>
> 1 2 3 4
> 2 2 3 4
> 3 4 5 6

How "like" is it ?-)

> code:
>
> d = {}
> for line in open(file):
>     arr = line.strip().split('\t')
>     d[arr[0]] = arr

> but the dict gets really slow as I load more data into memory,

Looks like your system is starting to swap. Use 'top' or any other
system monitor to check it out.

> by the way, the Mac I use has 16G of memory.
> Is this caused by poor performance when the dict grows,

dicts are Python's central data type (objects are based on dicts, all
non-local namespaces are based on dicts, etc.), so you can safely
assume they are highly optimized.

> or by some other reason?

FWIW, a very loose (and partially wrong, cf. below) estimation based on
wild guesses: assuming an average size of 512 bytes per object
(remember that Python doesn't have 'primitive' types), the above would
use ~22G.

Fortunately, CPython does some caching for some values of some immutable
types (specifically, small ints and strings that respect the grammar for
Python identifiers), so depending on your real data, you might need a
bit less RAM. Also, the 512 bytes per object is really more of a wild
guess than anything else (but given the internal structure of a CPython
object, I think it's about that order - please someone correct me if I'm
plain wrong).
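
You can sanity-check the per-record guess on a sample line with
sys.getsizeof (a rough sketch: it counts each object's own memory only,
ignoring the dict's internal table and any sharing due to caching):

import sys

line = "1\t2\t3\t4"
arr = line.strip().split('\t')

per_record = sys.getsizeof(arr) + sum(sys.getsizeof(s) for s in arr)
print(per_record)              # bytes for one record's list + strings
print(per_record * 9000000)    # naive scale-up to 9,000,000 records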

Anyway: I'm afraid the problem has more to do with your design than with
your code or Python's dict implementation itself.
> Can anyone provide a better solution?

Use a DBMS. They are designed - and highly optimised - for fast lookup
over huge data sets.
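
For example, with the sqlite3 module from the standard library (a
sketch only: 'data.txt', the table and the column names are made up
here, and the input is assumed to be tab-separated with the key in the
first column):

import sqlite3

conn = sqlite3.connect('records.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (k TEXT PRIMARY KEY, v TEXT)')

with open('data.txt') as f:
    rows = ((line.split('\t', 1)[0], line.rstrip('\n')) for line in f)
    conn.executemany('INSERT OR REPLACE INTO records VALUES (?, ?)', rows)
conn.commit()

# look up one record without holding 9,000,000 lines in RAM:
row = conn.execute('SELECT v FROM records WHERE k = ?', ('42',)).fetchone()

The PRIMARY KEY gives you an index, so lookups stay fast without
keeping the whole data set in memory.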

My 2 cents.
 
