gather information from various files efficiently

Klaus Neuner

Hello,

I need to gather information that is contained in various files.

Like so:

file1:
=====================
foo : 1 2
bar : 2 4
baz : 3
=====================

file2:
=====================
foo : 5
bar : 6
baz : 7
=====================

file3:
=====================
foo : 4 18
bar : 8
=====================


The straightforward way to solve this problem is to create a
dictionary. Like so:


[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]


Yet, I have got 43 such files. Together they are 4.1 MB in size.
In the future, they will probably become much larger.
At the moment, the process takes several hours. As it is a process
that I have to run very often, I would like it to be faster.

How could the problem be solved more efficiently?


Klaus
 
Keith Dart

Klaus said:
Hello,

I need to gather information that is contained in various files.

Like so:

file1:
=====================
foo : 1 2
bar : 2 4
baz : 3
=====================

file2:
=====================
foo : 5
bar : 6
baz : 7
=====================

file3:
=====================
foo : 4 18
bar : 8
=====================


The straightforward way to solve this problem is to create a
dictionary. Like so:


[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]


Aye...

the dict.keys() line creates a temporary list, and then the 'in' does a
linear search of the list. Better would be:

try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]

Since you expect the key to be there most of the time, this method is
most efficient. You optimistically get the dictionary entry, and in the
exceptional case where it doesn't yet exist you add it.
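
[A minimal sketch of how this pattern might sit in the OP's file-reading loop; the filename pattern and the "key : values" line format are assumptions based on the sample files above:]

import glob

dct = {}
for name in glob.glob("file*"):                  # assumed filename pattern
    for line in open(name):
        line = line.strip()
        if not line or line.startswith("="):     # skip blank and separator lines
            continue
        a, b = [s.strip() for s in line.split(":", 1)]
        try:
            dct[a].extend(b.split())             # collect all values for this key
        except KeyError:
            dct[a] = b.split()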






--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <[email protected]>
public key: ID: F3D288E4
============================================================================
 
Fernando Perez

Keith said:
Aye...

the dict.keys() line creates a temporary list, and then the 'in' does a
linear search of the list. Better would be:

try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]

Since you expect the key to be there most of the time, this method is
most efficient. You optimistically get the dictionary entry, and in the
exceptional case where it doesn't yet exist you add it.


I wonder if

dct.setdefault(a,[]).append(b)

wouldn't be even faster. It saves setting up the try/except frame handling in
python (I assume the C implementation of dicts achieves similar results with
much less overhead).

Cheers,

f

ps. I changed dict->dct because it's generally a Bad Idea (TM) to name local
variables after builtin types. This is for the benefit of the OP (I know you were
just following his code conventions).
 
Keith Dart

Kent said:
Keith said:
try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]



or my favorite Python shortcut:
dict.setdefault(a, []).append(b)

Kent


Hey, when did THAT get in there? ;-) That's nice. However, the
try..except block is a useful pattern for many similar situations that
the OP might want to keep in mind. It is usually better than the
following, also:

if dct.has_key(a):
    dct[a].append(b)
else:
    dct[a] = [b]


Which is a pattern I have seen often.




--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <[email protected]>
vcard: <http://www.kdart.com/~kdart/kdart.vcf>
public key: ID: F3D288E4 URL: <http://www.kdart.com/~kdart/public.key>
============================================================================
 
Fredrik Lundh

Keith said:
try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]


the drawback here is that exceptions are relatively expensive; if the
number of collisions is small, you end up throwing and catching lots
of exceptions. in that case, there are better ways to do this.

dict.setdefault(a, []).append(b)

the drawback here is that you create a new object for each call, but
if the number of collisions is high, you end up throwing most of them
away. in that case, there are better ways to do this.

(gotta love that method name, btw. a serious candidate for the "most
confusing name in the standard library" contest... or maybe even the
"most confusing name in the history of python" contest...)
Hey, when did THAT get in there? ;-) That's nice. However, the try..except block is a useful
pattern for many similar situations that the OP might want to keep in mind. It is usually better
than the following, also:

if dct.has_key(a):
    dct[a].append(b)
else:
    dct[a] = [b]


the drawback here is that if the number of collisions is high, you end
up doing lots of extra dictionary lookups. in that case, there are better
ways to do this.
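
[One of those better ways, for readers on Python 2.5 or later (collections.defaultdict did not exist when this thread was written), is to let the dictionary build the missing list itself:]

from collections import defaultdict

dct = defaultdict(list)        # missing keys automatically get a fresh empty list
for a, b in [("foo", "1"), ("foo", "2"), ("bar", "6")]:    # toy data
    dct[a].append(b)           # one lookup, no exception handling, no throwaway lists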

</F>
 
Keith Dart

Fredrik said:
...
if dct.has_key(a):
    dct[a].append(b)
else:
    dct[a] = [b]



the drawback here is that if the number of collisions are high, you end
up doing lots of extra dictionary lookups. in that case, there are better
ways to do this.


Sigh, this reminds me of a discussion I had at my work once... It seems
that to write optimal Python code one must understand the various probabilities
of your data, and code according to the likely scenario. :cool: Now, perhaps
we could write an adaptive data analyzer-code-generator... ;-)







--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <[email protected]>
public key: ID: F3D288E4
============================================================================
 
Richie Hindle

[Keith]
Sigh, this reminds me of a discussion I had at my work once... It seems
that to write optimal Python code one must understand the various probabilities
of your data, and code according to the likely scenario. :cool:

s/Python //g
 
Peter Hansen

Keith said:
Sigh, this reminds me of a discussion I had at my work once... It seems
that to write optimal Python code one must understand the various probabilities
of your data, and code according to the likely scenario.

And this is different from optimizing in *any* other language
in what way?

-Peter
 
Simon Brunning

And this is different from optimizing in *any* other language
in what way?

In other languages, by the time you get the bloody thing working it's
time to ship, and you don't have to bother worrying about making it
optimal.
 
Steven Bethard

Klaus said:
The straightforward way to solve this problem is to create a
dictionary. Like so:


[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]


So I timed the three suggestions with a few different datasets:
> cat builddict.py
def askpermission(d, k, v):
    if k in d:
        d[k].append(v)
    else:
        d[k] = [v]

def askforgiveness(d, k, v):
    try:
        d[k].append(v)
    except KeyError:
        d[k] = [v]

def default(d, k, v):
    d.setdefault(k, []).append(v)

def test(items, func):
    d = {}
    for k, v in items:
        func(d, k, v)



Dataset where every value causes a collision:
> python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.askpermission)"
1000 loops, best of 3: 1.62 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.askforgiveness)"
1000 loops, best of 3: 1.58 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.default)"
100 loops, best of 3: 2.03 msec per loop


Dataset where no value causes a collision:
> python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.askpermission)"
100 loops, best of 3: 2.29 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.askforgiveness)"
100 loops, best of 3: 9.96 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.default)"
100 loops, best of 3: 2.98 msec per loop


Dataset with only five distinct keys (so nearly every value causes a collision):
> python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.askpermission)"
1000 loops, best of 3: 1.82 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.askforgiveness)"
1000 loops, best of 3: 1.79 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.default)"
100 loops, best of 3: 2.2 msec per loop


So, when there are lots of collisions, you may get a small benefit from
the try/except solution. If there are very few collisions, you would
probably prefer the if/else solution. The setdefault solution performs
about the same as the if/else solution, but is somewhat slower.

I will probably continue to use setdefault, because I think it's
prettier =) but if you're running into a speed bottleneck in this sort
of situation, you might consider one of the other solutions.

Steve
 
Jeff Shannon

Klaus said:
Yet, I have got 43 such files. Together they are 4.1 MB in size.
In the future, they will probably become much larger.
At the moment, the process takes several hours. As it is a process
that I have to run very often, I would like it to be faster.

Others have shown how you can make your dictionary code more efficient,
which should provide a big speed boost, especially if there are many
keys in your dicts.

However, if you're taking this long to read files each time, perhaps
there's a better high-level approach than just a brute-force scan of
every file every time. You don't say anything about where those files
are coming from, or how they're created. Are they relatively static?
(That is to say, are they (nearly) the same files being read on each
run?) Do you control the process that creates the files? Given the
right conditions, you may be able to store your data in a shelve, or
even a proper database, saving you lots of time parsing through these
files on each run. Even if it's entirely new data on each run, you may
be able to find a more efficient way of transferring data from whatever
the source is into your program.
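
[A minimal sketch of the shelve idea, with made-up names; dct stands in for the dictionary built by the parsing code:]

import shelve

dct = {"foo": ["1", "2", "5"], "bar": ["2", "4", "6"]}   # stands in for the parsed data

# One-time (or whenever the source files change): store the parsed results.
db = shelve.open("gathered_data")        # made-up shelf filename
for key, values in dct.items():
    db[key] = values
db.close()

# Later runs can reopen the shelf instead of re-parsing all 43 files.
db = shelve.open("gathered_data")
foo_values = db.get("foo", [])
db.close()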

Jeff Shannon
Technician/Programmer
Credit International
 
Mike Meyer

Simon Brunning said:
In other languages, by the time you get the bloody thing working it's
time to ship, and you don't have to bother worrying about making it
optimal.

+1 QOTW.

<mike
 
Paul McGuire

Klaus Neuner said:
Hello,

I need to gather information that is contained in various files.

Like so:

file1:
=====================
foo : 1 2
bar : 2 4
baz : 3
=====================

file2:
=====================
foo : 5
bar : 6
baz : 7
=====================

file3:
=====================
foo : 4 18
bar : 8
=====================


The straightforward way to solve this problem is to create a
dictionary. Like so:


[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]


Yet, I have got 43 such files. Together they are 4.1 MB in size.
In the future, they will probably become much larger.
At the moment, the process takes several hours. As it is a process
that I have to run very often, I would like it to be faster.

How could the problem be solved more efficiently?


Klaus


You have gotten a number of suggestions on the relative improvements for
updating your global dictionary of values. My business partner likens code
optimization to lowering the water in a river. Performance bottlenecks
stick out like rocks in a river. Once you resolve one problem
(remove the rock), you lower the water level, and the next rock/bottleneck
appears. Have you looked at what is happening in your get_information
method? If you are still taking long periods of time to scan through these
files, you should look into what get_information is doing. In working with
my pyparsing module, I've seen people scan multimegabyte files in seconds,
so taking hours to sift through 4 MB of data sounds like there may be other
problems going on.

With this clean a code input, something like:

def get_information(line):
    return map(str.strip, line.split(":", 1))

should do the trick. For that matter, you could get rid of the function
call (calls are expensive in Python), and just inline this to :

a, b = map(str.strip, line.split(":", 1))
if a in dct:
    dct[a] += b.split()
else:
    dct[a] = b.split()

(I'm guessing you want to convert b values that have multiple numbers to a
list, based on your "dict[a] = [b]" source line.)
I also renamed dict to dct, per Fernando Perez's suggestion.
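
[For example, assuming a line that looks like the samples above, the inlined parse behaves like this:]

line = "foo : 4 18"
a, b = map(str.strip, line.split(":", 1))
# a == "foo", b == "4 18", and b.split() == ["4", "18"]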

-- Paul
 
