optimizing memory utilization

Anon

Hello all,

I'm hoping for some guidance here... I am a c/c++ "expert", but a
complete python virgin. I'm trying to create a program that loads in the
entire FreeDB database (excluding the CDDBID itself) and uses this
"database" for other subsequent processing. The problem is, I'm running
out of memory on a linux RH8 box with 512MB. The FreeDB database that I'm
trying to load presently consists of two "CSV" files. The first file
contains a "list" of albums with the artist name, an arbitrary sequential
album ID, and the CDDBID (an ASCII-hex representation of a 32-bit value).
The second file contains a list of all of the tracks on each of the
albums, cross-referenced via the album ID. When I load into memory, I
create a python list where each entry in the list is itself a list
representing the data for a given album. The album data list consists of
a small handful of text items like the album title, author, genre, and
year, as well as a list which itself contains a list for each of the
tracks on the album.

[[<Alb1ID#>, '<Alb1Artist>', '<Alb1Title>', '<Alb1Genre>','<Alb1Year>',
[["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
[<Alb2ID#>, '<Alb2Artist>', '<Alb2Title>', '<Alb2Genre>','<Alb2Year>',
[["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
...
[<AlbNID#>, '<AlbNArtist>', '<AlbNTitle>', '<AlbNGenre>','<AlbNYear>',
[["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]]]]

So the problem I'm having is that I want to load it all in memory (the two
files total about 250MB of raw data) but just loading the first 50,000
lines of tracks (about 25MB of raw data) consumes 75MB of RAM. If the
approximation is fairly accurate, I'd need >750MB of available RAM just to
load my in-memory database.

The bottom line is: is there a more memory-efficient way to load all this
arbitrary-field-length, arbitrary-field-count data into RAM? I can already
see that creating a separate list for the group of tracks on an album is
probably wasteful versus just appending them to the album list, but I
doubt that will yield the desired level of memory savings.

Any data structure suggestions for this application? BTW, the later
accesses to this database would not benefit in any way from being
presorted, so no need to warn me in advance about concepts like
presorting the albums list to facilitate faster look-ups later...

Thanks,
Wes
 
Skip Montanaro

anon> [[<Alb1ID#>, '<Alb1Artist>', '<Alb1Title>', '<Alb1Genre>','<Alb1Year>',
anon> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
anon> [<Alb2ID#>, '<Alb2Artist>', '<Alb2Title>', '<Alb2Genre>','<Alb2Year>',
anon> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
anon> ...
anon> [<AlbNID#>, '<AlbNArtist>', '<AlbNTitle>', '<AlbNGenre>','<AlbNYear>',
anon> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]]]]

anon> So the problem I'm having is that I want to load it all in memory
anon> (the two files total about 250MB of raw data) but just loading the
anon> first 50,000 lines of tracks (about 25MB of raw data) consumes
anon> 75MB of RAM. If the approximation is fairly accurate, I'd need
anon> >750MB of available RAM just to load my in-memory database.

anon> The bottom line is: is there a more memory-efficient way to load
anon> all this arbitrary-field-length, arbitrary-field-count data into RAM?

Sure, assuming you know what your keys are, store them in a db file. Let's
assume you want to search by artist. Do your csv thing, but store the
records in a shelve keyed by the AlbNArtist field:

import shelve
import csv

reader = csv.reader(open("file1.csv", "rb"))
db = shelve.open("file1.db")

for row in reader:
    stuff = db.get(row[1], [])
    stuff.append(row)
    db[row[1]] = stuff
db.close()

I'm not sure I've interpreted your sample csv quite right, but I think
you'll get the idea. You can of course have multiple db files, each keyed
by a different field (or part of a field).

Obviously, using a db file will be slower than an in-memory dictionary, but
if memory is a bottleneck, this will likely help. You can also avoid
initializing the db file on subsequent program runs if the csv file is older
than the db file, probably resulting in faster startup.
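A minimal way to express that check, assuming the same file names as in the snippet above:

import os

def needs_rebuild(csv_path="file1.csv", db_path="file1.db"):
    # Rebuild only if the db file is missing or the csv has changed since.
    if not os.path.exists(db_path):
        return True
    return os.path.getmtime(csv_path) > os.path.getmtime(db_path)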

Skip
 
Frithiof Andreas Jensen

Any data structure suggestions for this application? BTW, the later
accesses to this database would not benefit in any way from being
presorted,

You keep saying it yourself: Use a Database ;-).

Databases "knows" about stuff in memory, as well as any searching and
sorting you might dream up later.

Python supports several, the simplest is probably PySQLite:
http://sourceforge.net/projects/pysqlite/
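As a rough sketch of what that might look like through the DB-API interface PySQLite exposes (shown here with the sqlite3 module name bundled into later Pythons; older PySQLite releases import as "sqlite" or "pysqlite2.dbapi2"). The table layout is an assumption, not FreeDB's actual schema:

import sqlite3  # with older PySQLite: from pysqlite2 import dbapi2 as sqlite3

con = sqlite3.connect("freedb.db")
cur = con.cursor()
cur.execute("CREATE TABLE albums (id INTEGER PRIMARY KEY, artist TEXT, "
            "title TEXT, genre TEXT, year TEXT)")
cur.execute("CREATE TABLE tracks (album_id INTEGER, title TEXT, num INTEGER)")
# load the two csv files here, e.g. with cur.executemany("INSERT INTO ...", rows)
cur.execute("INSERT INTO albums VALUES (?, ?, ?, ?, ?)",
            (1, 'Some Artist', 'Some Album', 'Rock', '1999'))
con.commit()

cur.execute("SELECT artist, title FROM albums WHERE artist = ?", ('Some Artist',))
print cur.fetchall()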
 
Wes Crucius

Frithiof Andreas Jensen said:
You keep saying it yourself: Use a Database ;-).

Databases "knows" about stuff in memory, as well as any searching and
sorting you might dream up later.

Python supports several, the simplest is probably PySQLite:
http://sourceforge.net/projects/pysqlite/

Thanks for the specific suggestion, but it's an approach I'd hoped to
avoid. I'd rather get 2Gig of RAM if I was confident I could make it
work that way... The problem with a file-based database (versus an
in-memory data-structure-based database as I was meaning) is
performance.

Maybe someone has a better suggestion if I give a little more info: I
want to iterate through about 10,000 strings (each ~256 characters or
less) looking for occurrences (within those 10,000 strings) of any one
of about 500,000 smaller (~30-50 characters average) strings. Then I
generate an XML doc that records which of the 500,000 strings were
found within each of the 10,000 strings.

I guess I am hoping to optimize memory usage in order to avoid
thrashing my drives to death (due to using a file-based database). I
can't believe that python requires 200% memory "overhead" to store my
data as lists of lists. I gotta believe I've chosen the wrong
data type or have misapplied it somehow??

BTW, my illustration wasn't my input data, it was literally the output
of "print Albums" (with only a couple of albums' worth of dummy data
loaded). Albums is a list containing a list for each album, which in
turn contains another list for each track on the album, which in turn
contains a couple of items of data about the given track.

Bottom line question here: Is a 'list' the most efficient data type
to use for storing "arrays" of "arrays" of "arrays" and "strings",
given that I'm happy to iterate and don't need any sort of lookup or
search functionality??


Thanks again,
Wes
 
Bengt Richter

Thanks for the specific suggestion, but it's an approach I'd hoped to
avoid. I'd rather get 2Gig of RAM if I was confident I could make it
work that way... The problem with a file-based database (versus an
in-memory data-structure-based database as I was meaning) is
performance.

Maybe someone has a better suggestion if I give a little more info: I
want to iterate through about 10,000 strings (each ~256 characters or
less) looking for occurrences (within those 10,000 strings) of any one
of about 500,000 smaller (~30-50 characters average) strings. Then I
generate an XML doc that records which of the 500,000 strings were
found within each of the 10,000 strings.

Need example of raw input and desired output ;-)

Patterns starting at any character? What if you have overlapping matches? I.e.,
if you had a string 'abbcde123', and were looking for 'bb' and 'bc',
would it count as two matches? Would longer matches have priority?
E.g., would matching 'bbcd' override 'bb' and 'cd' as separate matches?

Anyway, let's see what a brute force simple approach does. E.g., just put the
500,000 small strings in a small tree of dictionaries based on, say, string length, then
the string (you could tier it again with some leading characters and the rest, but I'd
try the easy thing first). Anyway, the idea (practically untested) is something like:

This finds repeating and overlapping patterns, since it was easy. That may not be what
you want, but you didn't say ;-)

----< s500kin10k.py >---------------------------------------
def main(path500k, path10k):
    root = {}
    for line in file(path500k):
        line = line.rstrip()
        if not line: continue
        root.setdefault(len(line), {})[line] = None
    lengths = root.keys()
    lengths.sort()
    print lengths
    print root
    # and then walk through your 10k strings of ~256 characters

    for nl, line in enumerate(file(path10k)):
        line = line.rstrip()
        hdr = 'Line %s: %r\n' % (nl, line)
        nc = len(line)
        for length in lengths:
            if length > nc: break
            tier1 = root[length]
            for nstep in xrange(nc+1-length): # walk a window of length chars
                if line[nstep:nstep+length] in tier1:
                    print '%s %r' % (hdr, line[nstep:nstep+length])
                    hdr = ''

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 3: raise SystemExit, """
Usage: s500kin10k.py path500k path10k
    where path500k has 500k lines of one string each, and
    path10k has 10k lines to search for the 500k in any position.
"""
    main(*sys.argv[1:])
------------------------------------------------------------
I guess I am hoping to optimize memory usage in order to avoid
thrashing my drives to death (due to using a file-based database). I
can't believe that python requires 200% memory "overhead" to store my
data as lists of lists. I gotta believe I've chosen the wrong
data type or have misapplied it somehow??

BTW, my illustration wasn't my input data, it was literally the output
of "print Albums" (with only a couple of albums' worth of dummy data
loaded). Albums is a list containing a list for each album, which in
turn contains another list for each track on the album, which in turn
contains a couple of items of data about the given track.

Bottom line question here: Is a 'list' the most efficient data type
to use for storing "arrays" of "arrays" of "arrays" and "strings",
given that I'm happy to iterate and don't need any sort of lookup or
search functionality??
I suspect that you won't be that happy iterating through just arrays,
unless they are arrays of patternmatching tables like IIRC flex generates
to do parsing. Basically doing tricky parallel pattern matching by keeping
track of which patterns are alive as you go from character to character.
But 500k patterns are a lot of patterns....
Thanks again,

See if the above runs you out of memory. Or time ;-)
I'd try it on shorter files first. You didn't say what you wanted
your final output to look like. Small examples of input => output communicate well ;-)

Regards,
Bengt Richter
 
Anon

Need example of raw input and desired output ;-)
The input is:
- one "csv" file containing a list of albums with their artist and an
AlbumID - one "csv" file containing a list of track names and associated
AlbumID - one big file containing fully-qualified paths to mp3 files

The desired output is:
an XML file with the "proposed" artist, album, and track name for each
file. The file name would be the top level element for a given XML node
and then there would be element groupings of "proposed" artist, album, and
track name for each file.

The last step is to edit the XML file and "select" the desired fields from
the list of candidates for each field, and then use the XML as a work order
for tagging (and possibly renaming) all of the identified files.

Patterns starting at any character? What if you have overlapping
matches? I.e., if you had a string 'abbcde123', and were looking for
'bb' and 'bc', would it count as two matches? Would longer matches have
priority? E.g., would matching 'bbcd' override 'bb' and 'cd' as separate
matches?

Yes, yes, and yes...

Anyway, let's see what a brute force simple approach does. E.g., just put
the 500,000 small strings in a small tree of dictionaries based on, say,
string length, then the string (you could tier it again with some
leading characters and the rest, but I'd try the easy thing first).
Anyway, the idea (practically untested) is something like:

I had planned to sort the list based upon string length (and probably
apply some minimum length requirement too), but the tree of dictionaries
surely seems a bit more elegant.

This finds repeating and overlapping patterns, since it was easy. That
may not be what you want, but you didn't say ;-)

----< s500kin10k.py >---------------------------------------
def main(path500k, path10k):
    root = {}
    for line in file(path500k):
        line = line.rstrip()
        if not line: continue
        root.setdefault(len(line), {})[line] = None
    lengths = root.keys()
    lengths.sort()
    print lengths
    print root
    # and then walk through your 10k strings of ~256 characters

    for nl, line in enumerate(file(path10k)):
        line = line.rstrip()
        hdr = 'Line %s: %r\n' % (nl, line)
        nc = len(line)
        for length in lengths:
            if length > nc: break
            tier1 = root[length]
            for nstep in xrange(nc+1-length): # walk a window of length chars
                if line[nstep:nstep+length] in tier1:
                    print '%s %r' % (hdr, line[nstep:nstep+length])
                    hdr = ''

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 3: raise SystemExit, """
Usage: s500kin10k.py path500k path10k
    where path500k has 500k lines of one string each, and
    path10k has 10k lines to search for the 500k in any position.
"""
    main(*sys.argv[1:])
Gee, thanks. That's a nice example!
I suspect that you won't be that happy iterating through just arrays,
unless they are arrays of patternmatching tables like IIRC flex
generates to do parsing. Basically doing tricky parallel pattern
matching by keeping track of which patterns are alive as you go from
character to character. But 500k patterns are a lot of patterns....

Since I only want to do it once in a rare while, execution time isn't a huge
deal... I could readily write this code in C and make my data fit within
a reasonable amount of RAM, but I'm trying to use the project as an
opportunity to learn Python.

See if the above runs you out of memory. Or time ;-) I'd try it on
shorter files first. You didn't say what you wanted your final output to
look like. Small examples of input => output communicate well ;-)

Well, basically, I'm trying to use the FreeDB database as a list of valid
artists, albums, and tracks to locate these strings within the file names
of mp3 files and then use those results to apply ID tags to the mp3 files.
I'm assuming that I'll have multiple matches for any given file so my plan
was to output to an XML file, hand 'clean' that, and then use the XML as
the input for the actual tagging operation.

Thanks,
Wes
 
Kjetil Torgrim Homme

[Anon]:
Well, basically, I'm trying to use the FreeDB database as a list
of valid artists, albums, and tracks to locate these strings
within the file names of mp3 files and then use those results to
apply ID tags to the mp3 files.

have you considered using musicbrainz instead? it computes a checksum
of the music, not the bits and bytes.

http://www.musicbrainz.org/
 
Frithiof Andreas Jensen

Thanks for the specific suggestion, but it's an approach I'd hoped to
avoid. I'd rather get 2Gig of RAM if I was confident I could make it
work that way... The problem with a file-based database (versus an
in-memory data-structure-based database as I was meaning) is
performance.

I am not convinced ;-)

The people designing databases have run into exactly that kind of problem,
and they have spent a long time studying algorithms and efficient
data structures and writing conference papers on how to cleverly map and
search through said data structures that may not fit in memory - to the
point where I think one will have a hard time doing it better by hand.

PySQLite will try to keep everything in memory, so if you have enough, it
will only hit the disk on opening the database and on committing a change.
Commit can take a long time though.
 
Istvan Albert

can't believe that python requires 200% memory "overhead" to store my
data as lists of lists.

And it almost certainly doesn't.

On one hand there could be a sizeable overhead when running
your program even when loading a single data line,
so don't extrapolate from small datasets.

But what I believe is happening to you is that you keep references
alive and end up with data duplication.

For example if you do:

lines = file('whatever').readlines()

then you say for example:

stripped_lines = [ line.strip() for line in lines ]

you've just made your program use twice the memory
at that point of the code.

You may also inadvertently keep your 'lines' variable
in the scope of the full program; then
you'll effectively end up with a lot more memory consumption
overall.

Check and plug all these holes.
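For example, stripping while reading keeps only one copy of the data alive (the file name is just a placeholder):

stripped_lines = []
for line in file('whatever'):
    stripped_lines.append(line.strip())

# or, if the raw lines really are needed for a while, drop them explicitly:
# lines = file('whatever').readlines()
# stripped_lines = [line.strip() for line in lines]
# del lines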

Istvan.
 
John Lenton

Hello all,

I'm hoping for some guidance here... I am a c/c++ "expert", but a
complete python virgin. I'm trying to create a program that loads in the
entire FreeDB database (excluding the CDDBID itself) and uses this
"database" for other subsequent processing. The problem is, I'm running
out of memory on a linux RH8 box with 512MB. The FreeDB database that I'm
trying to load presently consists of two "CSV" files. The first file
contains a "list" of albums with the artist name, an arbitrary sequential
album ID, and the CDDBID (an ASCII-hex representation of a 32-bit value).
The second file contains a list of all of the tracks on each of the
albums, cross-referenced via the album ID. When I load into memory, I
create a python list where each entry in the list is itself a list
representing the data for a given album. The album data list consists of
a small handful of text items like the album title, author, genre, and
year, as well as a list which itself contains a list for each of the
tracks on the album.

[[<Alb1ID#>, '<Alb1Artist>', '<Alb1Title>', '<Alb1Genre>','<Alb1Year>',
[["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
[<Alb2ID#>, '<Alb2Artist>', '<Alb2Title>', '<Alb2Genre>','<Alb2Year>',
[["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
...
[<AlbNID#>, '<AlbNArtist>', '<AlbNTitle>', '<AlbNGenre>','<AlbNYear>',
[["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]]]]

silly question: have you looked into using tuples instead of lists for
the inner objects? They're supposed to be more memory efficient,
although I have found that that isn't necessarily the case.
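For illustration, here is the same album record from the original post written as a tuple of tuples (dummy values only); unlike lists, tuples are allocated at their exact size, with no spare room kept for future appends, which can add up across hundreds of thousands of small records:

album = (1, 'Artist', 'Title', 'Genre', '1999',
         (('Track1', 1), ('Track2', 2), ('Track3', 3)))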

--
John Lenton ([email protected]) -- Random fortune:
You're a card which will have to be dealt with.

 
Bengt Richter

The input is:
- one "csv" file containing a list of albums with their artist and an
AlbumID - one "csv" file containing a list of track names and associated
AlbumID - one big file containing fully-qualified paths to mp3 files
Is the latter the 10k file to be searched for matches to patterns from the former two?
If so, you could certainly do better than walking through mp3 paths sliding
a pattern window one character at a time, since windows containing path delimiters
are not likely to match anything. IOW, you probably want to split on the delimiters
and also do some text normalization before trying to match.

How about some actual raw data? ;-) I.e., a few lines from the raw csv files and
the mp3 path file. Actual samples will probably raise questions about how the search
patterns are formed and how the searched data may have to be modified to match. E.g.,
removing punctuation, collapsing multiple embedded spaces, case sensitivity, spelling, etc.
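Something along these lines, as an untested sketch of the normalization meant here (the example path is made up):

import os, re

def normalize(path):
    # Split on path separators, drop the extension, then collapse
    # punctuation and case so FreeDB strings have a chance of matching.
    parts = path.split(os.sep)
    parts[-1] = os.path.splitext(parts[-1])[0]
    clean = []
    for part in parts:
        part = re.sub(r'[^A-Za-z0-9]+', ' ', part).strip().lower()
        if part:
            clean.append(part)
    return clean

print normalize('/music/Some Artist/Some_Album/01-Some_Track.mp3')
# -> ['music', 'some artist', 'some album', '01 some track']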
The desired output is:
an XML file with the "proposed" artist, album, and track name for each
file. The file name would be the top level element for a given XML node
and then there would be element groupings of "proposed" artist, album, and
track name for each file.
XML lets you do stuff like <foo fie="fie" fo="fum" /> or <foo><fie>fie</fie><fo>fum</fo></foo>
and variations. You can generate a lot of fluff if you don't watch out. Compare the
economy of python source vs XML representing python source ;-) Do you have
a utility that is going to accept the XML as input, and so is defining the format?
The last step is to edit the XML file and "select" the desired fields from
the list of candidates for each field, and then use the XML as a work order
for tagging (and possibly renaming) all of the identified files.



Yes, yes, and yes...
You'll need a change to make long patterns come out first, but no big deal.
I had planned to sort the list based upon string length (and probably
apply some minimum length requirement too), but the tree of dictionaries
surely seems a bit more elegant.
Before that, though, are you going to modify the strings for consistent
white space and punctuation? Or check spellings? (On the other end too, of course.)
This finds repeating and overlapping patterns, since it was easy. That
may not be what you want, but you didn't say ;-)

----< s500kin10k.py >--------------------------------------- def [...]
Gee, thanks. That's a nice example!
You're welcome.
Since I only want to do it once in a rare while, execution time isn't a huge
deal... I could readily write this code in C and make my data fit within
a reasonable amount of RAM, but I'm trying to use the project as an
opportunity to learn Python.
Sounds like a good way to go. If you are good in C, you can also learn how
to make the best of both worlds by writing your own extension modules in C
that you can import in python.

Don't forget to look into available modules that may help. E.g., the array
module can make arrays of numbers very efficient. mmap can let you look at
a file as if it were a character array, etc. Also re for regular expressions,
which you may need at some point.
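A tiny illustration of the mmap idea, with a made-up file name: the whole file behaves like one big string that can be searched without reading it into a Python string first.

import mmap

f = open("paths.txt", "rb")
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
print m.find("some track name")   # byte offset of the first hit, or -1
m.close()
f.close()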
Well, basically, I'm trying to use the FreeDB database as a list of valid
artists, albums, and tracks to locate these strings within the file names
of mp3 files and then use those results to apply ID tags to the mp3 files.
I'm assuming that I'll have multiple matches for any given file so my plan
was to output to an XML file, hand 'clean' that, and then use the XML as
the input for the actual tagging operation.
Again, "locate these strings" is easy to say, but each word hides a lot of
meaning. Messing with a small batch of _representative_ actual data will tell you a lot ;-)

Good luck.

Regards,
Bengt Richter
 
Jeremy Bowers

Maybe someone has a better suggestion if I give a little more info: I
want to iterate through about 10,000 strings (each ~256 characters or
less) looking for occurrences (within those 10,000 strings) of any one
of about 500,000 smaller (~30-50 characters average) strings. Then I
generate an XML doc that records which of the 500,000 strings were
found within each of the 10,000 strings.

There may be a good reason for this, but it should be asked: Why not
stream one set of strings or the other in, instead of loading them up
front?

The other obvious answer that isn't cute, but will probably be faster to
program and still run faster than a process that runs into swap, is to
break the problem up: either check some % of the 10,000 in one run and
start over, or check some % of the 500,000 and start over with a new
chunk.
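A sketch of the second variant, with made-up file names and chunk size: stream the 500,000 patterns through memory one slice at a time while the 10,000 target strings stay loaded.

CHUNK = 50000

def chunks(path, size=CHUNK):
    # Yield the pattern file one block of 'size' stripped lines at a time.
    block = []
    for line in file(path):
        line = line.rstrip()
        if line:
            block.append(line)
        if len(block) == size:
            yield block
            block = []
    if block:
        yield block

targets = [line.rstrip() for line in file("names10k.txt")]
for patterns in chunks("patterns500k.txt"):
    for target in targets:
        for pat in patterns:
            if pat in target:
                print '%r found in %r' % (pat, target)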

There's nothing wrong with such chunking; when you start asking for
results from datasets that challenge your machine, there are many
techniques that seem wasteful but really aren't. One of my personal
canonical examples is something called "iterative deepening depth-first
search" for large graphs that may have cycles (which can trap the
depth-first search). On the first run, you search your first level. On the
second, you search down two levels, and so on. By stopping at a certain
depth you ensure that you cover all nodes down to that depth without being
trapped. It sounds horridly wasteful compared to trying to keep track of
the visited nodes, but it turns out that it may only double (or less) the
execution time while saving vast swathes of memory. It still boggles my
mind.
 
Anon

Hello all,

I'm hoping for some guidance here... I am a c/c++ "expert", but a
complete python virgin. I'm trying to create a program that loads in
the entire FreeDB database (excluding the CDDBID itself) and uses this
"database" for other subsequent processing. The problem is, I'm
running out of memory on a linux RH8 box with 512MB. The FreeDB
database that I'm trying to load presently consists of two "CSV"
files. The first file contains a "list" of albums with the artist name,
an arbitrary sequential album ID, and the CDDBID (an ASCII-hex
representation of a 32-bit value). The second file contains a list of
all of the tracks on each of the albums, cross-referenced via the album
ID. When I load into memory, I create a python list where each entry
in the list is itself a list representing the data for a given album.
The album data list consists of a small handful of text items like the
album title, author, genre, and year, as well as a list which itself
contains a list for each of the tracks on the album.

[[<Alb1ID#>, '<Alb1Artist>', '<Alb1Title>', '<Alb1Genre>','<Alb1Year>',
[["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
[<Alb2ID#>, '<Alb2Artist>', '<Alb2Title>', '<Alb2Genre>','<Alb2Year>',
[["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
...
[<AlbNID#>, '<AlbNArtist>', '<AlbNTitle>', '<AlbNGenre>','<AlbNYear>',
[["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]]]]

silly question: have you looked into using tuples instead of lists for
the inner objects? They're supposed to be more memory efficient,
although I have found that that isn't necessarily the case.

That's exactly the kind of feedback I was looking for when I asked
the question. However, the suggestion that MusicBrainz already does what I
want, possibly more accurately, looks like it might be an even better
approach, leaving me to concentrate my efforts on some other
as-yet-unsolved problem instead!

Thanks!
Wes
 
