Sorted and reversed on huge dict?


vd12005

Hello,

I would like to sort and reverse the results of many huge
dictionaries (a single dictionary will contain ~150,000 entries). Keys
are words, values are counts (integers).

I'm wondering if I can have 10 or so of these in memory at once, or if
I should process them one after the other.

Moreover, I'm interested in saving these as value/key pairs, sorted
and reversed (i.e. most frequent first). I can do it with the Unix
sort command, but I wonder how I should do it with Python to be
memory friendly.

Can it be done by using:

from itertools import izip
pairs = izip(d.itervalues(), d.iterkeys())
for v, k in reversed(sorted(pairs)):
    print k, v

Or will it be the same as building the whole list?
 

Paul Rubin

> I would like to sort and reverse the results of many huge
> dictionaries (a single dictionary will contain ~150,000 entries). Keys
> are words, values are counts (integers).
>
> I'm wondering if I can have 10 or so of these in memory at once,

Depends on how much memory your machine has.
> or if I should process them one after the other.

Obviously that's more memory friendly if you can do it that way.
> from itertools import izip
> pairs = izip(d.itervalues(), d.iterkeys())
> for v, k in reversed(sorted(pairs)):
>     print k, v
>
> Or will it be the same as building the whole list?

I think the above is pretty good. sorted() necessarily builds and
returns a list, but itervalues/iterkeys, izip, and reversed just
build small iterator objects.
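To make the allocation pattern concrete, here is a small self-contained
sketch of the same pipeline on toy data (Python 2); the comments mark
which step actually materializes a list:

from itertools import izip

d = {"spam": 3, "eggs": 7, "ham": 1}  # stand-in for a huge word-count dict

pairs = izip(d.itervalues(), d.iterkeys())  # lazy: just a tiny iterator
ranked = sorted(pairs)           # the only step that builds a full list
for v, k in reversed(ranked):    # lazy again: reversed() wraps the list
    print k, v                   # prints: eggs 7 / spam 3 / ham 1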

If the lists are really large, your next step after the above is
probably to use an external sort, but 150000 entries is not that many,
and anyway if sorting is a strain on available memory, then having the
dicts in memory at all will probably also be a strain. Maybe you
should start looking into storing the dict contents externally, such
as in a database.
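If you do go the database route, here is a hedged sketch using the
sqlite3 module (new in Python 2.5; the table name and schema are my
own invention, not anything from the thread). The point is that ORDER
BY lets the database do the sort out of core instead of in a Python
list:

import sqlite3

conn = sqlite3.connect("counts.db")
conn.execute("CREATE TABLE IF NOT EXISTS freq (word TEXT PRIMARY KEY, n INTEGER)")

def bump(word):
    # increment a word's count; SELECT-then-REPLACE keeps the sketch simple
    row = conn.execute("SELECT n FROM freq WHERE word = ?", (word,)).fetchone()
    n = row[0] + 1 if row else 1
    conn.execute("INSERT OR REPLACE INTO freq VALUES (?, ?)", (word, n))

for w in ["spam", "eggs", "spam"]:
    bump(w)
conn.commit()

# most frequent first, streamed row by row rather than built as one list
for word, n in conn.execute("SELECT word, n FROM freq ORDER BY n DESC"):
    print word, n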
 

Fredrik Lundh

> I would like to sort and reverse the results of many huge
> dictionaries (a single dictionary will contain ~150,000 entries). Keys
> are words, values are counts (integers).

not sure 150k entries qualify as huge, though, unless you're short on
memory.
> I'm wondering if I can have 10 or so of these in memory at once, or if
> I should process them one after the other.

why not just try it out?
> Moreover, I'm interested in saving these as value/key pairs, sorted
> and reversed (i.e. most frequent first). I can do it with the Unix
> sort command, but I wonder how I should do it with Python to be
> memory friendly.
>
> Can it be done by using:
>
> from itertools import izip
> pairs = izip(d.itervalues(), d.iterkeys())
> for v, k in reversed(sorted(pairs)):
>     print k, v
>
> Or will it be the same as building the whole list?

sorted() needs access to all data, so it'll build a list, even if you
feed it a generator. you will have to test it yourself, but I suspect
that the most memory-efficient way to do what you want could be:

import operator

items = d.items()
items.sort(key=operator.itemgetter(1), reverse=True)

the items list would require a couple of megabytes for 150k dictionary
entries, or so. the key map needs some memory too, but the rest of the
sort is done in place.

</F>
 

Paul Rubin

Fredrik Lundh said:
> items = d.items()
> items.sort(key=operator.itemgetter(1), reverse=True)
>
> the items list would require a couple of megabytes for 150k dictionary
> entries, or so. the key map needs some memory too, but the rest of
> the sort is done in place.

I think the OP's method avoided the key map.
 

Fredrik Lundh

Paul said:
> I think the OP's method avoided the key map.

right. I should have written "most efficient", not "most memory-efficient".
the items+itemgetter approach is a bit faster, but uses a few extra
megabytes of memory temporarily, during the sort.

</F>
 

vd12005

Thanks for your replies :)

So I just tried it, even though I thought it would not run to
completion => I was wrong: it is around 1,400,000 entries per dict...

But maybe it can be done if the keys of the dicts are not duplicated
in memory (as all dicts will have the same keys, with different
(count) values)?

Memory is 4 GB of RAM. Is there a good way to know how much RAM is
used directly from Python (or should I rely on 'top' and other Unix
commands)? By now around 220 MB is used for around 200,000 words
handled in 15 dicts.
 

Paul Rubin

> But maybe it can be done if the keys of the dicts are not duplicated
> in memory (as all dicts will have the same keys, with different
> (count) values)?

There will still be a pointer for each key; the strings themselves
won't be duplicated.
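A quick way to convince yourself (a toy sketch; it assumes you insert
the same string object into every dict, which is exactly what
interning, discussed below, gives you):

s = "".join(["some", "word"])  # build a fresh string at runtime
d1 = {s: 1}
d2 = {s: 2}
# both dicts hold the very same string object: one copy, two pointers
print d1.keys()[0] is d2.keys()[0]   # True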
> Memory is 4 GB of RAM.

That sounds like enough RAM to hold all your stuff easily.
> Is there a good way to know how much RAM is used directly from Python
> (or should I rely on 'top' and other Unix commands)?

I'd try resource.getrusage().
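For instance (a minimal sketch; note Nick's caveat further down that
Linux fills in very few of these fields, and that the units of
ru_maxrss differ between platforms):

import resource

usage = resource.getrusage(resource.RUSAGE_SELF)
print usage.ru_maxrss  # peak resident set size: kB on Linux, bytes on BSD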
> By now around 220 MB is used for around 200,000 words handled in 15
> dicts.

That sounds very manageable given a 4 GB machine. Otherwise, since
it sounds like you're trying to scan some big text corpus and figure
out word frequencies, you could do it the old-fashioned way: write
the words one per line into a big file, then sort the file with the
Unix sort utility (which is an external sort), then read the sorted
file (or pipe it through "uniq -c") to figure out the counts.
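Driven from Python, that pipeline might look like this (a sketch; the
file names are placeholders):

import subprocess

# sort(1) does a disk-based external merge sort, so memory use stays flat
subprocess.call("sort words.txt | uniq -c | sort -rn > counts.txt",
                shell=True)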
 

Klaas

> Thanks for your replies :)
>
> So I just tried it, even though I thought it would not run to
> completion => I was wrong: it is around 1,400,000 entries per dict...
>
> But maybe it can be done if the keys of the dicts are not duplicated
> in memory (as all dicts will have the same keys, with different
> (count) values)?

Definitely a good strategy. The easiest way is to use intern(key) when
storing the values. (This will only work if you are using 8-bit
strings. You'll have to maintain your own object cache if you are
using unicode.)

I've reduced the memory requirements of very similar apps this way.
-Mike
 

vd12005

So it is still unfinished :) around 1 GB for 1,033,268 words :) (that
comes from the Unix top command).

Paul > I was also thinking of doing it like that, by piping to 'sort |
uniq -c | sort -nr', but I'm pleased if Python can handle it. (Well,
but maybe Python is slower? Will check later...)

Klaas > I did not know about the intern construct, I will have a look,
but when googling I first found a post from Raymond Hettinger, so I'm
going to mess up my mental space :)
http://mail.python.org/pipermail/python-dev/2003-November/040433.html

Best regards.
 

Nick Craig-Wood

Paul Rubin said:
> I'd try resource.getrusage().

That should work (under BSD anyway) but unfortunately doesn't under
Linux :-(

From the man page:

    CONFORMING TO
        SVr4, 4.3BSD. POSIX.1-2001 specifies getrusage(), but only
        specifies the fields ru_utime and ru_stime.

and

    The above struct was taken from 4.3BSD Reno. Not all fields are
    meaningful under Linux. In Linux 2.4 only the fields ru_utime,
    ru_stime, ru_minflt, and ru_majflt are maintained. Since Linux 2.6,
    ru_nvcsw and ru_nivcsw are also maintained.

I.e. none of the memory-based ones are filled in.

This Linux-only code works, though:

import resource

def memory_used():
    """Return the memory used by this process under Linux, in bytes."""
    return int(file("/proc/self/statm").read().split()[0]) * resource.getpagesize()

If anyone knows a (unix) portable way of doing this I'd be interested!
 

vd12005

So it has worked :) and it took 12h04:56, 15 dicts with 1,133,755
keys. I do not know how much RAM was used as I was not always
monitoring it.

Thanks for all the replies. I'm going to study intern and the other
suggestions, and hope someone will also bring a Pythonic way to know
memory usage :)

Best.
 

vd12005

Just to be sure about intern, it is used as:
d, f = {}, {}
s = "this is a string"
d[intern(s)] = 1
f[intern(s)] = 1

So actually the keys in d and f are pointers to the same interned
string? If so, it can be interesting. From the docs:
intern(string) -> string

    ``Intern'' the given string. This enters the string in the (global)
    table of interned strings whose purpose is to speed up dictionary
    lookups. Return the string itself or the previously interned string
    object with the same value.

Regarding the comment "(Changed in version 2.3: Interned strings used
to be immortal, but you now need to keep a reference to the interned
string around.)": if the string is used as a key, it is still
referenced, am I right?
 

Sion Arrowsmith

> So I just tried it, even though I thought it would not run to
> completion => I was wrong: it is around 1,400,000 entries per dict...
>
> But maybe it can be done if the keys of the dicts are not duplicated
> in memory (as all dicts will have the same keys, with different
> (count) values)?

I've had programs handling dicts with 1.7 million items and it isn't
a great problem, provided you're careful not to duplicate data.
Creating a copy of keys() in a separate list, for example, will
inflate memory usage noticeably.
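For example, a small self-contained sketch of the difference
(Python 2):

d = {"spam": 3, "eggs": 7}   # stand-in dict

keys_copy = d.keys()         # materializes a full list of key references
for word in d.iterkeys():    # lazy: walks the dict with no separate copy
    print word, d[word]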
> Memory is 4 GB of RAM, [ ... ]

Unless you're on a 64-bit OS, that's irrelevant. You'll hit the 2 GB
per-process limit before you start putting a strain on real memory.
 
