optimizing memory utilization

Discussion in 'Python' started by Anon, Sep 14, 2004.

  1. Anon

    Anon Guest

    Hello all,

    I'm hoping for some guidance here... I am a c/c++ "expert", but a
    complete python virgin. I'm trying to create a program that loads in the
    entire FreeDB database (excluding the CDDBID itself) and uses this
    "database" for other subsequent processing. The problem is, I'm running
    out of memory on a linux RH8 box with 512MB. The FreeDB database that I'm
    trying to load presently consists of two "CSV" files. The first file
    contains a "list" of albums with artist name and an arbitrary sequential
    album ID and the CDDBID (an ascii-hex representation of a 32-bit value).
    The second file contains a list of all of the tracks on each of the
    albums, cross-referenced via the album ID. When I load into memory, I
    create a python list where each entry in the list is itself a list
    representing the data for a given album. The album data list consists of
    a small handful of text items like the album title, author, genre, and
    year, as well as a list which itself contains a list for each of the tracks
    on the album.

    [[<Alb1ID#>, '<Alb1Artist>', '<Alb1Title>', '<Alb1Genre>','<Alb1Year>',
    [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
    [<Alb2ID#>, '<Alb2Artist>', '<Alb2Title>', '<Alb2Genre>','<Alb2Year>',
    [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
    ...
    [<AlbNID#>, '<AlbNArtist>', '<AlbNTitle>', '<AlbNGenre>','<AlbNYear>',
    [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]]]]

    So the problem I'm having is that I want to load it all in memory (the two
    files total about 250MB of raw data) but just loading the first 50,000
    lines of tracks (about 25MB of raw data) consumes 75MB of RAM. If the
    approximation is fairly accurate, I'd need >750MB of available RAM just to
    load my in-memory database.

    The bottom line is: is there a more memory-efficient way to load all this
    arbitrary-field-length, arbitrary-count data into RAM? I can already see
    that creating a separate list for the group of tracks on an album is
    probably wasteful versus just appending them to the album list, but I
    doubt that will yield the desired level of memory optimization??

    Any data structures suggestions for this application? BTW, the later
    accesses to this database would not benefit in any way from being
    presorted, so no need to warn me in advance about concepts like
    presorting the albums list to facilitate faster look-up later...

    Thanks,
    Wes
     
    Anon, Sep 14, 2004
    #1

  2. anon> [[<Alb1ID#>, '<Alb1Artist>', '<Alb1Title>', '<Alb1Genre>','<Alb1Year>',
    anon> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
    anon> [<Alb2ID#>, '<Alb2Artist>', '<Alb2Title>', '<Alb2Genre>','<Alb2Year>',
    anon> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
    anon> ...
    anon> [<AlbNID#>, '<AlbNArtist>', '<AlbNTitle>', '<AlbNGenre>','<AlbNYear>',
    anon> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]]]]

    anon> So the problem I'm having is that I want to load it all in memory
    anon> (the two files total about 250MB of raw data) but just loading the
    anon> first 50,000 lines of tracks (about 25MB of raw data) consumes
    anon> 75MB of RAM. If the approximation is fairly accurate, I'd need
    anon> >750MB of available RAM just to load my in-memory database.

    anon> The bottom line is, is there a more memory efficient way to load
    anon> all this arbitrary field length and count type data into RAM?

    Sure, assuming you know what your keys are, store them in a db file. Let's
    assume you want to search by artist. Do your csv thing, but store the
    records in a shelve keyed by the AlbNArtist field:

    import shelve
    import csv

    reader = csv.reader(open("file1.csv", "rb"))
    db = shelve.open("file1.db")

    for row in reader:
        stuff = db.get(row[1], [])
        stuff.append(row)
        db[row[1]] = stuff
    db.close()

    I'm not sure I've interpreted your sample csv quite right, but I think
    you'll get the idea. You can of course have multiple db files, each keyed
    by a different field (or part of a field).

    Obviously, using a db file will be slower than an in-memory dictionary, but
    if memory is a bottleneck, this will likely help. You can also avoid
    initializing the db file on subsequent program runs if the csv file is older
    than the db file, probably resulting in faster startup.
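
    To implement the "only rebuild when the csv is newer" check, something
    along these lines should do (untested sketch; file names as above):

    import os

    def db_is_current(csv_path="file1.csv", db_path="file1.db"):
        # Rebuild only if the db file is missing or older than the csv file.
        if not os.path.exists(db_path):
            return False
        return os.path.getmtime(db_path) >= os.path.getmtime(csv_path)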

    Skip
     
    Skip Montanaro, Sep 14, 2004
    #2

  3. "Anon" <> wrote in message
    news:p...

    > Any data structures suggestions for this application? BTW, the later
    > accesses to this database would not benefit in any way from being
    > presorted,


    You keep saying it yourself: Use a Database ;-).

    Databases "knows" about stuff in memory, as well as any searching and
    sorting you might dream up later.

    Python supports several, the simplest is probably PySQLite:
    http://sourceforge.net/projects/pysqlite/
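
    Untested sketch of the general idea; pysqlite follows the Python DB-API,
    though the exact module name and connect() spelling depend on the version
    you install (in later Pythons the same engine ships as the standard
    sqlite3 module, which is what the import below assumes):

    import sqlite3

    con = sqlite3.connect("freedb.db")
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS album "
                "(id INTEGER PRIMARY KEY, artist TEXT, title TEXT, genre TEXT, year TEXT)")
    cur.execute("CREATE TABLE IF NOT EXISTS track "
                "(album_id INTEGER, num INTEGER, title TEXT)")
    # ... INSERT the rows from the two csv files, then let the engine do lookups:
    cur.execute("SELECT title FROM track WHERE album_id = ?", (42,))
    print cur.fetchall()
    con.commit()
    con.close()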
     
    Frithiof Andreas Jensen, Sep 14, 2004
    #3
  4. Anon

    Wes Crucius Guest

    "Frithiof Andreas Jensen" <frithiof.jensen@die_spammer_die.ericsson.com> wrote in message news:<ci6f73$921$>...
    > "Anon" <> wrote in message
    > news:p...
    >
    > > Any data structures suggestions for this application? BTW, the later
    > > accesses to this database would not benefit in any way from being
    > > presorted,

    >
    > You keep saying it yourself: Use a Database ;-).
    >
    > Databases "knows" about stuff in memory, as well as any searching and
    > sorting you might dream up later.
    >
    > Python supports several, the simplest is probably PySQLite:
    > http://sourceforge.net/projects/pysqlite/


    Thanks for the specific suggestion, but it's an approach I'd hoped to
    avoid. I'd rather get 2Gig of RAM if I was confident I could make it
    work that way... The problem with a file-based database (versus an
    in-memory data-structure-based database as I was meaning) is
    performance.

    Maybe someone has a better suggestion if I give a little more info: I
    want to iterate through about 10,000 strings (each ~256 characters or
    less) looking for occurrences (within those 10,000 strings) of any one
    of about 500,000 smaller (~30-50 characters average) strings. Then I
    generate an XML doc that records which of the 500,000 strings were
    found within each of the 10,000 strings.

    I guess I am hoping to optimize memory usage in order to avoid
    thrashing my drives to death (due to using a file based database). I
    can't believe that python requires memory 200% "overhead" to store my
    data as lists of lists. I gotta believe I've chosen the wrong
    data-type or have mis-applied it somehow??

    BTW, my illustration wasn't my input data, it was literally the output
    of "print Albums" (with only a couple albums worth of dummy data
    loaded). Albums is a list containing a list for each album, which in
    turn contains another list for each track on the album, which in-turn
    contains a couple of items of data about the given track.

    Bottom line question here: Is a 'list' the most efficient data type
    to use for storing "arrays" of "arrays" of "arrays" and "strings",
    given that I'm happy to iterate and don't need any sort of lookup or
    search functionality??
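
    For what it's worth, the crude way I have been measuring this is to watch
    the process size before and after the load (Linux-only sketch; load_albums
    is just a placeholder for my loading code):

    def rss_kb():
        # Resident set size of this process in KB, read from /proc (Linux only).
        for line in open("/proc/self/status"):
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
        return 0

    before = rss_kb()
    albums = load_albums()   # placeholder
    print "approx. memory held by the data: %d KB" % (rss_kb() - before)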


    Thanks again,
    Wes
     
    Wes Crucius, Sep 14, 2004
    #4
  5. On 14 Sep 2004 14:55:56 -0700, (Wes Crucius) wrote:

    >"Frithiof Andreas Jensen" <frithiof.jensen@die_spammer_die.ericsson.com> wrote in message news:<ci6f73$921$>...
    >> "Anon" <> wrote in message
    >> news:p...
    >>
    >> > Any data structures suggestions for this application? BTW, the later
    >> > accesses to this database would not benefit in any way from being
    >> > presorted,

    >>
    >> You keep saying it yourself: Use a Database ;-).
    >>
    >> Databases "knows" about stuff in memory, as well as any searching and
    >> sorting you might dream up later.
    >>
    >> Python supports several, the simplest is probably PySQLite:
    >> http://sourceforge.net/projects/pysqlite/

    >
    >Thanks for the specific suggestion, but it's an approach I'd hoped to
    >avoid. I'd rather get 2Gig of RAM if I was confident I could make it
    >work that way... The problem with a file-based database (versus an
    >in-memory data-structure-based database as I was meaning) is
    >performance.
    >
    >Maybe someone has a better suggestion if I give a little more info: I
    >want to iterate through about 10,000 strings (each ~256 characters or
    >less) looking for occurrences (within those 10,000 strings) of any one
    >of about 500,000 smaller (~30-50 characters average) strings. Then I
    >generate an XML doc that records which of the 500,000 strings were
    >found within each of the 10,000 strings.


    Need example of raw input and desired output ;-)

    Patterns starting at any character? What if you have overlapping matches? I.e.,
    if you had a string 'abbcde123', and were looking for 'bb' and 'bc'
    would it count as two matches? Would longer matches have priority?
    E.g., matching 'bbcd' overrides 'bb' and 'cd' as separate matches?

    Anyway, let's see what a brute-force simple approach does. E.g., just put the
    500,000 small strings in a small tree of dictionaries based on, say, string length,
    then the string (you could tier it again with some leading characters and the rest,
    but I'd try the easy thing first). Anyway, the idea (practically untested) is something like:

    This finds repeating and overlapping patterns, since it was easy. That may not be what
    you want, but you didn't say ;-)

    ----< s500kin10k.py >---------------------------------------
    def main(path500k, path10k):
        root = {}
        for line in file(path500k):
            line = line.rstrip()
            if not line: continue
            root.setdefault(len(line), {})[line] = None
        lengths = root.keys()
        lengths.sort()
        print lengths
        print root
        # and then walk through your 10k strings of ~256 characters
        for nl, line in enumerate(file(path10k)):
            line = line.rstrip()
            hdr = 'Line %s: %r\n' % (nl, line)
            nc = len(line)
            for length in lengths:
                if length > nc: break
                tier1 = root[length]
                for nstep in xrange(nc+1-length): # walk a window of length chars
                    if line[nstep:nstep+length] in tier1:
                        print '%s %r' % (hdr, line[nstep:nstep+length])
                        hdr = ''

    if __name__ == '__main__':
        import sys
        if len(sys.argv)!=3: raise SystemExit, """
    Usage: s500kin10k.py path500k path10k
        where path500k has 500k lines of one string each, and
        path10k has 10k lines to search for the 500k in any position.
    """
        main(*sys.argv[1:])
    ------------------------------------------------------------

    >
    >I guess I am hoping to optimize memory usage in order to avoid
    >thrashing my drives to death (due to using a file based database). I
    >can't believe that python requires memory 200% "overhead" to store my
    >data as lists of lists. I gotta believe I've chosen the wrong
    >data-type or have mis-applied it somehow??
    >
    >BTW, my illustration wasn't my input data, it was literally the output
    >of "print Albums" (with only a couple albums worth of dummy data
    >loaded). Albums is a list containing a list for each album, which in
    >turn contains another list for each track on the album, which in-turn
    >contains a couple of items of data about the given track.
    >
    >Bottom line question here: Is a 'list' the most efficient data type
    >to use for storing "arrays" of "arrays" of "arrays" and strings",
    >given that I'm happy to iterate and don't need any sort of lookup or
    >search functionality??
    >
    >

    I suspect that you won't be that happy iterating through just arrays,
    unless they are arrays of patternmatching tables like IIRC flex generates
    to do parsing. Basically doing tricky parallel pattern matching by keeping
    track of which patterns are alive as you go from character to character.
    But 500k patterns are a lot of patterns....
    >Thanks again,


    See if the above runs you out of memory. Or time ;-)
    I'd try it on shorter files first. You didn't say what you wanted
    your final output to look like. Small examples of input => output communicate well ;-)

    Regards,
    Bengt Richter
     
    Bengt Richter, Sep 15, 2004
    #5
  6. Anon

    Anon Guest

    On Wed, 15 Sep 2004 02:20:38 +0000, Bengt Richter wrote:

    > Need example of raw input and desired output ;-)

    The input is:
    - one "csv" file containing a list of albums with their artist and an AlbumID
    - one "csv" file containing a list of track names and associated AlbumID
    - one big file containing fully-qualified paths to mp3 files

    The desired output is:
    an XML file with the "proposed" artist, album, and track name for each
    file. The file name would be the top level element for a given XML node
    and then there would be element groupings of "proposed" artist, album, and
    track name for each file.

    The last step is to edit the XML file and "select" the desired fields from
    the list of candidates for each field, and then use the XML as a work order
    for tagging (and possibly renaming) all of the identified files.


    > Patterns starting at any character? What if you have overlapping
    > matches? I.e., if you had a string 'abbcde123', and were looking for
    > 'bb' and 'bc' would it count as two matches? Would longer matches have
    > priority? E.g., matching 'bbcd' overrides 'bb and 'cd' as separate
    > matches?


    Yes, yes, and yes...


    > Anyway, let's see what a brute force simple approach does. E.g, just put
    > the 500,000 small strings in a small tree of dictionaries based on say
    > string length, then the string (you could tier it again with some
    > leading characters and the rest, but I'd try the easy thing first. But
    > anyway, the idea (practically untested) is something like:


    I had planned to sort the list based upon string length (and probably
    apply some minimum length requirement too), but the tree of dictionaries
    surely seems a bit more elegant.


    > This finds repeating and overlapping patterns, since it was easy. That
    > may not be what you want, but you didn't say ;-)
    >
    > ----< s500kin10k.py >---------------------------------------
    > def main(path500k, path10k):
    >     root = {}
    >     for line in file(path500k):
    >         line = line.rstrip()
    >         if not line: continue
    >         root.setdefault(len(line), {})[line] = None
    >     lengths = root.keys()
    >     lengths.sort()
    >     print lengths
    >     print root
    >     # and then walk through your 10k strings of ~256 characters
    >     for nl, line in enumerate(file(path10k)):
    >         line = line.rstrip()
    >         hdr = 'Line %s: %r\n' % (nl, line)
    >         nc = len(line)
    >         for length in lengths:
    >             if length > nc: break
    >             tier1 = root[length]
    >             for nstep in xrange(nc+1-length): # walk a window of length chars
    >                 if line[nstep:nstep+length] in tier1:
    >                     print '%s %r' % (hdr, line[nstep:nstep+length])
    >                     hdr = ''
    >
    > if __name__ == '__main__':
    >     import sys
    >     if len(sys.argv)!=3: raise SystemExit, """
    > Usage: s500kin10k.py path500k path10k
    >     where path500k has 500k lines of one string each, and
    >     path10k has 10k lines to search for the 500k in any position.
    > """
    >     main(*sys.argv[1:])
    > ------------------------------------------------------------
    >
    >

    Gee, thanks. That's a nice example!

    > I suspect that you won't be that happy iterating through just arrays,
    > unless they are arrays of patternmatching tables like IIRC flex
    > generates to do parsing. Basically doing tricky parallel pattern
    > matching by keeping track of which patterns are alive as you go from
    > character to character. But 500k patterns are a lot of patterns....


    Since I only want to do it once in a rare while, execution time isn't a huge
    deal... I could readily write this code in C and make my data fit within
    a reasonable amount of RAM, but I'm trying to use the project as an
    opportunity to learn Python.


    > See if the above runs you out of memory. Or time ;-) I'd try it on
    > shorter files first. You didn't say what you wanted your final output to
    > look like. Small examples of input => output communicate well ;-)


    Well, basically, I'm trying to use the FreeDB database as a list of valid
    artists, albums, and tracks to locate these strings within the file names
    of mp3 files and then use those results to apply ID tags to the mp3 files.
    I'm assuming that I'll have multiple matches for any given file so my plan
    was to output to an XML file, hand 'clean' that, and then use the XML as
    the input for the actual tagging operation.

    Thanks,
    Wes
     
    Anon, Sep 15, 2004
    #6
  7. [Anon]:
    >
    > Well, basically, I'm trying to use the FreeDB database as a list
    > of valid artists, albums, and tracks to locate these strings
    > within the file names of mp3 files and then use those result to
    > apply ID tags to the mp3 files.


    Have you considered using MusicBrainz instead? It computes a checksum
    of the music itself, not of the bits and bytes.

    http://www.musicbrainz.org/
    --
    Kjetil T.
     
    Kjetil Torgrim Homme, Sep 15, 2004
    #7
  8. "Wes Crucius" <> wrote in message
    news:...

    > Thanks for the specific suggestion, but it's an approach I'd hoped to
    > avoid. I'd rather get 2Gig of RAM if I was confident I could make it
    > work that way... The problem with a file-based database (versus an
    > in-memory data-structure-based database as I was meaning) is
    > performance.


    I am not convinced ;-)

    The people designing databases have run into exactly that kind of problem.
    They have spent a long time studying algorithms and efficient data
    structures, and writing conference papers on how to cleverly map and
    search through said data structures when they may not fit in memory - to
    the point where I think one will have a hard time doing better by hand.

    PySQLite will try to keep everything in memory, so if you have enough, it
    will only hit the disk on opening the database and on committing a change.
    Commit can take a long time, though.
     
    Frithiof Andreas Jensen, Sep 15, 2004
    #8

  9. > can't believe that python requires memory 200% "overhead" to store my
    > data as lists of lists.


    And it almost certainly doesn't.

    On the one hand, there can be a sizeable fixed overhead just from running
    your program, even when loading a single data line,
    so don't extrapolate from small datasets.

    But what I believe is happening to you is that you keep references
    alive and end up with data duplication.

    For example if you do:

    lines = file('whatever').readlines()

    then you say for example:

    stripped_lines = [ line.strip() for line in lines ]

    you've just made your program use twice the memory
    at that point of the code.

    If you also inadvertently keep your 'lines' variable
    in the scope of the full program, then
    you'll effectively end up with a lot more memory consumption
    overall.

    Check and plug all these holes.
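
    For example, processing the file a line at a time never needs both copies
    in memory at once (process() below is just a stand-in for your own code):

    # One pass, one line at a time; the file object iterates lazily.
    for line in file('whatever'):
        process(line.strip())

    # Or, if you really need the stripped list, build only that one list
    # and never keep the raw 'lines' list around:
    stripped_lines = []
    for line in file('whatever'):
        stripped_lines.append(line.strip())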

    Istvan.
     
    Istvan Albert, Sep 15, 2004
    #9
  10. Anon

    John Lenton Guest

    On Tue, Sep 14, 2004 at 04:39:49AM +0000, Anon wrote:
    > Hello all,
    >
    > I'm hoping for some guidance here... I am a c/c++ "expert", but a
    > complete python virgin. I'm trying to create a program that loads in the
    > entire FreeDB database (excluding the CDDBID itself) and uses this
    > "database" for other subsequent processing. The problem is, I'm running
    > out of memory on a linux RH8 box with 512MB. The FreeDB database that I'm
    > trying to load presently consists of two "CSV" files. The first file
    > contains a "list" of albums with artist name and an arbitrary sequential
    > album ID and the CDDBID (an ascii-hex representation of a 32-bit value).
    > The second file contains a list of all of the tracks on each of the
    > albums, cross-referenced via the album ID. When I load into memory, I
    > create a python list where each entry in the list is itself a list
    > representing the data for a given album. The album data list consists of
    > a small handful of text items like the album title, author, genre, and
    > year, as well as a list which itself contains a list for each of the tracks
    > on the album.
    >
    > [[<Alb1ID#>, '<Alb1Artist>', '<Alb1Title>', '<Alb1Genre>','<Alb1Year>',
    > [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
    > [<Alb2ID#>, '<Alb2Artist>', '<Alb2Title>', '<Alb2Genre>','<Alb2Year>',
    > [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
    > ...
    > [<AlbNID#>, '<AlbNArtist>', '<AlbNTitle>', '<AlbNGenre>','<AlbNYear>',
    > [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]]]]


    silly question: have you looked into using tuples instead of lists for
    the inner objects? They're supposed to be more memory efficient,
    although I have found that that isn't necessarily the case.
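
    For instance, the same album record built as a tuple of tuples
    (dummy values, just to show the shape):

    albums = []
    tracks = (("Track1", 1), ("Track2", 2), ("Track3", 3))
    albums.append((1, "Artist", "Title", "Genre", "1999", tracks))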

    --
    John Lenton () -- Random fortune:
    You're a card which will have to be dealt with.

     
    John Lenton, Sep 15, 2004
    #10
  11. On Wed, 15 Sep 2004 03:44:36 GMT, "Anon" <> wrote:

    >On Wed, 15 Sep 2004 02:20:38 +0000, Bengt Richter wrote:
    >
    >> Need example of raw input and desired output ;-)

    >The input is:
    >- one "csv" file containing a list of albums with their artist and an AlbumID
    >- one "csv" file containing a list of track names and associated AlbumID
    >- one big file containing fully-qualified paths to mp3 files

    Is the latter the 10k file to be searched for matches to patterns from the former two?
    If so, you could certainly do better than walking through mp3 paths sliding
    a pattern window one character at a time, since windows containing path delimiters
    are not likely to match anything. IOW, you probably want to split on the delimiters
    and also do some text normalization before trying to match.
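
    Something like this, for instance, before doing any matching (untested;
    the normalization rules are guesses at what your data needs):

    import os, re

    def normalize(s):
        # lowercase, turn punctuation into spaces, collapse runs of whitespace
        s = s.lower()
        s = re.sub(r'[^a-z0-9]+', ' ', s)
        return ' '.join(s.split())

    def searchable_pieces(mp3_path):
        # Split the path on directory separators and drop the extension,
        # so a matching window never straddles a '/'.
        base, ext = os.path.splitext(mp3_path)
        return [normalize(part) for part in base.split(os.sep) if part]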

    How about some actual raw data? ;-) I.e., a few lines from the raw csv files and
    the mp3 path file. Actual samples will probably raise questions about how the search
    patterns are formed and how the searched data may have to be modified to match. E.g.,
    removing punctuation, collapsing multiple embedded spaces, case sensitivity, spelling, etc.

    >
    >The desired output is:
    >an XML file with the "proposed" artist, album, and track name for each
    >file. The file name would be the top level element for a given XML node
    >and then there would be element groupings of "proposed" artist, album, and
    >track name for each file.

    XML lets you do stuff like <foo fie="fie" fo="fum" /> or <foo><fie>fie</fie><fo>fum</fo></foo>
    and variations. You can generate a lot of fluff if you don't watch out. Compare the
    economy of Python source vs XML representing Python source ;-) Do you have
    a utility that is going to accept the XML as input, and so is defining the format?
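
    If you are free to choose the format, a flat attribute-per-candidate
    layout keeps the fluff down; a sketch (element and attribute names are
    made up):

    from xml.sax.saxutils import quoteattr

    def file_entry(path, candidates):
        # candidates: a list of (artist, album, track) guesses for this file
        lines = ['<file name=%s>' % quoteattr(path)]
        for artist, album, track in candidates:
            lines.append('  <candidate artist=%s album=%s track=%s/>'
                         % (quoteattr(artist), quoteattr(album), quoteattr(track)))
        lines.append('</file>')
        return '\n'.join(lines)

    print file_entry("/mp3/Some - File.mp3",
                     [("Artist A", "Album X", "Track 1")])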

    >
    >The last step is to edit the XML file and "select" the desired fields from
    >the list of candidates for each field, and then use the XML as a work order
    >for tagging (and possibly renaming) all of the identified files.
    >
    >
    >> Patterns starting at any character? What if you have overlapping
    >> matches? I.e., if you had a string 'abbcde123', and were looking for
    >> 'bb' and 'bc' would it count as two matches? Would longer matches have
    >> priority? E.g., matching 'bbcd' overrides 'bb and 'cd' as separate
    >> matches?

    >
    >Yes, yes, and yes...
    >

    You'll need a change to make long patterns come out first, but no big deal.
    >
    >> Anyway, let's see what a brute force simple approach does. E.g, just put
    >> the 500,000 small strings in a small tree of dictionaries based on say
    >> string length, then the string (you could tier it again with some
    >> leading characters and the rest, but I'd try the easy thing first. But
    >> anyway, the idea (practically untested) is something like:

    >
    >I had planned to sort the list based upon string length (and probably
    >apply some minimum length requirement too), but the tree of dictionaries
    >surely seems a bit more elegant.
    >

    Before that, though, are you going to modify the strings for consistent
    white space and punctuation? Or check spellings? (On the other end too, of course.)
    >
    >> This finds repeating and overlapping patterns, since it was easy. That
    >> may not be what you want, but you didn't say ;-)
    >>
    >> ----< s500kin10k.py >--------------------------------------- def

    [...]
    >>

    >Gee, thanks. That's a nice example!
    >

    You're welcome.

    >> I suspect that you won't be that happy iterating through just arrays,
    >> unless they are arrays of patternmatching tables like IIRC flex
    >> generates to do parsing. Basically doing tricky parallel pattern
    >> matching by keeping track of which patterns are alive as you go from
    >> character to character. But 500k patterns are a lot of patterns....

    >
    >Since I only want to do it once in a rare while, execution time isn't a huge
    >deal... I could readily write this code in C and make my data fit within
    >a reasonable amount of RAM, but I'm trying to use the project as an
    >opportunity to learn Python.

    Sounds like a good way to go. If you are good in C, you can also learn how
    to make the best of both worlds by writing your own extension modules in C
    that you can import in python.

    Don't forget to look into available modules that may help. E.g., the array
    module can make arrays of numbers very efficient. mmap can let you look at
    a file as if it were a character array, etc. Also re for regular expressions,
    which you may need at some point.
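
    E.g., mmap lets you treat the whole 10k-lines file like one big string
    without slurping it into Python objects first (sketch; the file name is
    made up, and this is the Unix flavor of the call):

    import mmap, os

    f = open("paths10k.txt", "r+b")
    m = mmap.mmap(f.fileno(), os.fstat(f.fileno()).st_size)
    print m[:80]                        # slicing works like a string
    print m.find("Some Album Title")    # -1 if absent, else the byte offset
    m.close()
    f.close()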

    >
    >
    >> See if the above runs you out of memory. Or time ;-) I'd try it on
    >> shorter files first. You didn't say what you wanted your final output to
    >> look like. Small examples of input => output communicate well ;-)

    >
    >Well, basically, I'm trying to use the FreeDB database as a list of valid
    >artists, albums, and tracks to locate these strings within the file names
    >of mp3 files and then use those results to apply ID tags to the mp3 files.
    >I'm assuming that I'll have multiple matches for any given file so my plan
    >was to output to an XML file, hand 'clean' that, and then use the XML as
    >the input for the actual tagging operation.
    >

    Again, "locate these strings" is easy to say, but each word hides a lot of
    meaning. Messing with a small batch of _representative_ actual data will tell you a lot ;-)

    Good luck.

    Regards,
    Bengt Richter
     
    Bengt Richter, Sep 15, 2004
    #11
  12. On Tue, 14 Sep 2004 14:55:56 -0700, Wes Crucius wrote:
    > Maybe someone has a better suggestion if I give a little more info: I
    > want to iterate through about 10,000 strings (each ~256 characters or
    > less) looking for occurrences (within those 10,000 strings) of any one
    > of about 500,000 smaller (~30-50 characters average) strings. Then I
    > generate an XML doc that records which of the 500,000 strings were
    > found within each of the 10,000 strings.


    There may be a good reason for this, but it should be asked: Why not
    stream one set of strings or the other in, instead of loading them up
    front?

    The other obvious answer that isn't cute, but will probably be faster to
    program and still run faster than a process that runs into swap is to
    break the problem up: Either check some % of the 10,000 in one run and
    start over, or check some % of the 500,000 and start over with a new
    chunk.
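
    A sketch of the second variant, assuming the 500,000 patterns sit one per
    line in a file (the chunk size and the per-chunk search function are
    arbitrary placeholders):

    def chunks_of(path, size=50000):
        # Yield successive lists of up to `size` stripped lines from the file.
        chunk = []
        for line in file(path):
            line = line.rstrip()
            if not line:
                continue
            chunk.append(line)
            if len(chunk) == size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

    # for patterns in chunks_of("path500k"):
    #     scan_the_10k_strings(patterns)   # one full pass per chunk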

    There's nothing wrong with such chunking; when you start asking for
    results from datasets that challenge your machine there are many
    techniques that seem wasteful but really aren't. One of my personal
    canonical examples is something called "iterative deepening depth first
    search" for large graphs that may have cycles (which can trap the
    depth-first search). On the first run, you search your first level. On the
    second, you search down two levels, and so on. By stopping at a certain
    depth you ensure that you cover all nodes down to that depth without being
    trapped. It sounds horridly wasteful compared to trying to keep track of
    the visited nodes, but it turns out that it may only double or less the
    execution time while saving vast swathes of memory. It still boggles my
    mind.
     
    Jeremy Bowers, Sep 15, 2004
    #12
  13. Anon

    Anon Guest

    On Wed, 15 Sep 2004 14:59:45 +0000, John Lenton wrote:

    > On Tue, Sep 14, 2004 at 04:39:49AM +0000, Anon wrote:
    >> Hello all,
    >>
    >> I'm hoping for some guidance here... I am a c/c++ "expert", but a
    >> complete python virgin. I'm trying to create a program that loads in
    >> the entire FreeDB database (excluding the CDDBID itself) and uses this
    >> "database" for other subsequent processing. The problem is, I'm
    >> running out of memory on a linux RH8 box with 512MB. The FreeDB
    >> database that I'm trying to load presently consists of two "CSV"
    >> files. The first file contains a "list" of albums with artist name and
    >> an arbitrary sequential album ID and the CDDBID (an ascii-hex
    >> representation of a 32-bit value). The second file contains a list of
    >> all of the tracks on each of the albums, cross-referenced via the album
    >> ID. When I load into memory, I create a python list where each entry
    >> in the list is itself a list representing the data for a given album.
    >> The album data list consists of a small handful of text items like the
    >> album title, author, genre, and year, as well as a list which itself
    >> contains a list for each of the tracks on the album.
    >>
    >> [[<Alb1ID#>, '<Alb1Artist>', '<Alb1Title>', '<Alb1Genre>','<Alb1Year>',
    >> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
    >> [<Alb2ID#>, '<Alb2Artist>', '<Alb2Title>', '<Alb2Genre>','<Alb2Year>',
    >> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]],
    >> ...
    >> [<AlbNID#>, '<AlbNArtist>', '<AlbNTitle>', '<AlbNGenre>','<AlbNYear>',
    >> [["Track1", 1], ["Track2", 2], ["Track3", 3], ..., ["TrackN",N]]]]

    >
    > silly question: have you looked into using tuples instead of lists for
    > the inner objects? They're supposed to be more memory efficient,
    > although I have found that that isn't necessarily the case.


    That's exactly the kind of feedback that I was looking for when I asked
    the question. However the suggestion that MusicBrainz already does what I
    want in a possibly more accurate way looks like it might be an even better
    suggestion, leaving me to concentrate my efforts on some other
    as-yet-unsolved problem instead!

    Thanks!
    Wes
     
    Anon, Sep 17, 2004
    #13
