Memory issues when storing as List of Strings vs. List of Lists

Discussion in 'Python' started by OW Ghim Siong, Nov 30, 2010.

  1. Hi all,

I have a big file, 1.5 GB in size, with about 6 million lines of
    tab-delimited data. I have to perform some filtering on the data and
    keep the good rows; after filtering, about 5.5 million rows remain.
    As you might have guessed, I have to read the file in batches, and I
    did so using .readlines(100000000). After reading each batch, I
    split each line (a string) into a list using .split("\t") and then
    check several conditions; if all conditions are satisfied, I store
    the list into the matrix.

    The code is as follows:
    -----Start------
a=open("bigfile")
    matrix=[]
    while True:
        lines = a.readlines(100000000)
        for line in lines:
            data=line.split("\t")
            if several_conditions_are_satisfied:
                matrix.append(data)
        print "Number of lines read:", len(lines), "matrix.__sizeof__:", matrix.__sizeof__()
        if len(lines)==0:
            break
    -----End-----

    Results:
    Number of lines read: 461544 matrix.__sizeof__: 1694768
    Number of lines read: 449840 matrix.__sizeof__: 3435984
    Number of lines read: 455690 matrix.__sizeof__: 5503904
    Number of lines read: 451955 matrix.__sizeof__: 6965928
    Number of lines read: 452645 matrix.__sizeof__: 8816304
    Number of lines read: 448555 matrix.__sizeof__: 9918368

    Traceback (most recent call last):
    MemoryError

The peak memory usage in the task manager is > 2 GB, which results
    in the memory error.

However, if I modify the code to store a list of strings rather
    than a list of lists, by changing the append statement above to
    matrix.append("\t".join(data)), then I do not run out of memory.

    Results:
    Number of lines read: 461544 matrix.__sizeof__: 1694768
    Number of lines read: 449840 matrix.__sizeof__: 3435984
    Number of lines read: 455690 matrix.__sizeof__: 5503904
    Number of lines read: 451955 matrix.__sizeof__: 6965928
    Number of lines read: 452645 matrix.__sizeof__: 8816304
    Number of lines read: 448555 matrix.__sizeof__: 9918368
    Number of lines read: 453455 matrix.__sizeof__: 12552984
    Number of lines read: 432440 matrix.__sizeof__: 14122132
    Number of lines read: 432921 matrix.__sizeof__: 15887424
    Number of lines read: 464259 matrix.__sizeof__: 17873376
    Number of lines read: 450875 matrix.__sizeof__: 20107572
    Number of lines read: 458552 matrix.__sizeof__: 20107572
    Number of lines read: 453261 matrix.__sizeof__: 22621044
    Number of lines read: 413456 matrix.__sizeof__: 22621044
    Number of lines read: 166464 matrix.__sizeof__: 25448700
    Number of lines read: 0 matrix.__sizeof__: 25448700

    In this case, the peak memory according to the task manager is about 1.5 GB.

Does anyone know why there is such a big difference in memory usage
    between storing the matrix as a list of lists and storing it as a
    list of strings? According to __sizeof__, though, the values are the
    same either way. Is there any method by which I can store all the
    info in a list of lists? I have tried creating a matrix of
    equivalent size and it only uses 35 MB of memory, but I am not sure
    why, with the code above, the memory usage shoots up so fast and
    exceeds 2 GB.

    Any advice is greatly appreciated.

    Regards,
    Jinxiang
     
    OW Ghim Siong, Nov 30, 2010
    #1

  2. OW Ghim Siong wrote:
    > I have a big file 1.5GB in size, with about 6 million lines of
    > tab-delimited data.


How many fields are there on each line?

> I have to perform some filtering on the data and
    > keep the good rows; after filtering, about 5.5 million rows
    > remain. As you might have guessed, I have to read the file in
    > batches, and I did so using .readlines(100000000).


I'd have guessed differently. Typically, you read one line, apply
    whatever operation you want to it, and then write out the result. At
    least that is the "typical" way of filtering.
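
    For illustration, here is a minimal sketch of that line-by-line
    filtering approach; the keep() helper and the file names are
    invented placeholders, not the actual logic:

    def keep(fields):
        # hypothetical stand-in for the real conditions
        return len(fields) > 3

    with open("bigfile", "rb") as src:
        with open("bigfile.filtered", "wb") as dst:
            for line in src:
                if keep(line.rstrip("\r\n").split("\t")):
                    dst.write(line)

    Only the current line is ever held in memory.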

    > a=open("bigfile")


I guess you are on MS Windows. There, textual and non-textual
    (binary) files are handled differently with regard to line endings.
    Generally, using non-textual mode for input is easier because it
    doesn't require any translation. However, textual mode is the
    default, therefore:

    a = open("bigfile", "rb")

    Or, even better:

    with open("bigfile", "rb") as a:

    to make sure the file is closed correctly and in time.

    > matrix=[]
    > while True:
    > lines = a.readlines(100000000)
    > for line in lines:


    I believe you could do

for line in a:
        # use line here

    > data=line.split("\t")


Question here: how many elements does each line contain, and what is
    their content? The point is that each object has its own overhead,
    and if the content is just, e.g., an integral number or a short
    string, the ratio of interesting content to overhead is rather bad.
    Compare that to storing one longer string with the overhead of a
    single string object; the difference should be obvious.
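
    A quick way to see this overhead (the sample row is invented for the
    example):

    import sys

    line = "GENE1\t0.25\t1.73\tchr7\t+"   # invented sample row
    fields = line.split("\t")

    # One long string: a single object header plus the characters.
    print sys.getsizeof(line)
    # Many short strings: each field pays its own object header, plus
    # the list that holds the pointers to them.
    print sys.getsizeof(fields) + sum(sys.getsizeof(f) for f in fields)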

> However, if I modify the code to store a list of strings rather
    > than a list of lists, by changing the append statement above to
    > matrix.append("\t".join(data)), then I do not run out of memory.


    You already have the result of that join:

    matrix.append(line)

> Does anyone know why there is such a big difference in memory
    > usage between storing the matrix as a list of lists and storing it
    > as a list of strings? According to __sizeof__, though, the values
    > are the same either way.


I can barely believe that. How are you using __sizeof__? Why aren't
    you using sys.getsizeof() instead? Are you aware that the size of a
    list doesn't include the size of its contents (even though it grows
    with the number of elements), while the size of a string does?
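
    A small demonstration of that shallow behaviour (the lengths are
    arbitrary):

    import sys

    # Equal-length lists report (nearly) the same shallow size via
    # __sizeof__, regardless of what their elements reference.
    lol = [["a", "b", "c"]] * 1000    # list of lists
    los = ["a\tb\tc"] * 1000          # list of strings
    print lol.__sizeof__(), los.__sizeof__()
    # sys.getsizeof() additionally counts the garbage-collector header.
    print sys.getsizeof(lol), sys.getsizeof(los)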


> Is there any method by which I can store all the info in a list of
    > lists? I have tried creating a matrix of equivalent size and it
    > only uses 35 MB of memory, but I am not sure why, with the code
    > above, the memory usage shoots up so fast and exceeds 2 GB.


The size of an empty list is 20 bytes here, plus 4 bytes per element
    (which makes sense on a 32-bit machine), excluding the elements
    themselves. That means you have around 6.4M elements (25448700/4),
    and the pointer array alone takes around 24 MB of memory, which is
    roughly what you are seeing. The point is that your 35 MB don't
    include any content: probably just a single interned integer or
    None, so that all elements of your list are the same object and only
    require memory once. In your real-world application that is
    obviously not so.
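
    A scaled-down sketch of that effect (the dimensions are invented for
    the demo):

    import sys

    # All rows below share the single interned None object, so only the
    # list structures cost memory, not 1.1 million distinct values.
    matrix = [[None] * 11 for _ in xrange(100000)]
    print sys.getsizeof(matrix)                        # outer pointers
    print sum(sys.getsizeof(row) for row in matrix)    # row pointers
    # The shared None is counted once, not 1.1 million times.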

My suggestions:
    1. Find out what exactly is going on here, in particular why our
    interpretations of the memory usage differ.
    2. Redesign your code to really use a filtering design, i.e. don't
    keep the whole data in memory.
    3. If you still have memory issues, take a look at the array module,
    which should make storage of large arrays a bit more efficient (see
    the sketch below).
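
    A minimal sketch of that idea, assuming one of the tab-separated
    columns happens to be numeric (the column and values are invented):

    from array import array

    # A typed array stores raw C doubles: 8 bytes per value instead of
    # a full Python object per value.
    scores = array("d")
    for text in ["0.5", "1.25", "2.0"]:   # stand-in for parsed fields
        scores.append(float(text))
    print len(scores), "values,", scores.itemsize, "bytes each"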


    Good luck!

    Uli

    --
    Domino Laser GmbH
    Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
     
    Ulrich Eckhardt, Nov 30, 2010
    #2

  3. OW Ghim Siong wrote:

    > Hi all,
    >
> I have a big file, 1.5 GB in size, with about 6 million lines of
    > tab-delimited data. I have to perform some filtering on the data
    > and keep the good rows; after filtering, about 5.5 million rows
    > remain. As you might have guessed, I have to read the file in
    > batches, and I did so using .readlines(100000000). After reading
    > each batch, I split each line (a string) into a list using
    > .split("\t") and then check several conditions; if all conditions
    > are satisfied, I store the list into the matrix.
    >
    > The code is as follows:
> -----Start------
    > a=open("bigfile")
    > matrix=[]
    > while True:
    >     lines = a.readlines(100000000)
    >     for line in lines:
    >         data=line.split("\t")
    >         if several_conditions_are_satisfied:
    >             matrix.append(data)
    >     print "Number of lines read:", len(lines), "matrix.__sizeof__:", matrix.__sizeof__()
    >     if len(lines)==0:
    >         break
    > -----End-----


    As Ulrich says, don't use readlines(), use

for line in a:
        ...

    that way you have only one line in memory at a time instead of the huge
    lines list.

    > Results:
    > Number of lines read: 461544 matrix.__sizeof__: 1694768
    > Number of lines read: 449840 matrix.__sizeof__: 3435984
    > Number of lines read: 455690 matrix.__sizeof__: 5503904
    > Number of lines read: 451955 matrix.__sizeof__: 6965928
    > Number of lines read: 452645 matrix.__sizeof__: 8816304
    > Number of lines read: 448555 matrix.__sizeof__: 9918368
    >
    > Traceback (most recent call last):
    > MemoryError
    >
> The peak memory usage in the task manager is > 2 GB, which results
    > in the memory error.
    >
> However, if I modify the code to store a list of strings rather
    > than a list of lists, by changing the append statement above to
    > matrix.append("\t".join(data)), then I do not run out of memory.
    >
    > Results:
    > Number of lines read: 461544 matrix.__sizeof__: 1694768
    > Number of lines read: 449840 matrix.__sizeof__: 3435984
    > Number of lines read: 455690 matrix.__sizeof__: 5503904
    > Number of lines read: 451955 matrix.__sizeof__: 6965928
    > Number of lines read: 452645 matrix.__sizeof__: 8816304
    > Number of lines read: 448555 matrix.__sizeof__: 9918368
    > Number of lines read: 453455 matrix.__sizeof__: 12552984
    > Number of lines read: 432440 matrix.__sizeof__: 14122132
    > Number of lines read: 432921 matrix.__sizeof__: 15887424
    > Number of lines read: 464259 matrix.__sizeof__: 17873376
    > Number of lines read: 450875 matrix.__sizeof__: 20107572
    > Number of lines read: 458552 matrix.__sizeof__: 20107572
    > Number of lines read: 453261 matrix.__sizeof__: 22621044
    > Number of lines read: 413456 matrix.__sizeof__: 22621044
    > Number of lines read: 166464 matrix.__sizeof__: 25448700
    > Number of lines read: 0 matrix.__sizeof__: 25448700
    >
    > In this case, the peak memory according to the task manager is about 1.5
    > GB.
    >
> Does anyone know why there is such a big difference in memory
    > usage between storing the matrix as a list of lists and storing it
    > as a list of strings? According to __sizeof__, though, the values
    > are the same either way. Is


sizeof gives you the "shallow" size of the list, basically the
    memory needed to hold C pointers to the items in the list. A better
    approximation of the total size of a list of lists of strings is

>>> from sys import getsizeof as sizeof
    >>> matrix = [["alpha", "beta"], ["gamma", "delta"]]
    >>> sizeof(matrix), sum(sizeof(row) for row in matrix), sum(sizeof(entry)
    ... for row in matrix for entry in row)
    (88, 176, 179)
    >>> sum(_)
    443

    As you can see the outer list requires only a small portion of the total
    memory, and its relative size will decrease as the matrix grows.

    The above calculation may still be wrong because some of the strings could
    be identical. Collapsing identical strings into a single object is also a
    way to save memory if you have a significant number of repetitions. Try

matrix = []
    with open(...) as f:
        for line in f:
            data = line.split("\t")
            if ...:
                matrix.append(map(intern, data))

    to see whether it sufficiently reduces the amount of memory needed.
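
    A quick check of the effect (CPython 2, where intern() is a builtin;
    the strings are invented):

    # Two equal strings built at runtime are distinct objects until
    # interned, after which both names point at one shared object.
    a = "".join(["chr", "7"])
    b = "".join(["chr", "7"])
    print a is b                    # False: two objects
    print intern(a) is intern(b)    # True: one object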

> there any method by which I can store all the info in a list of
    > lists? I have tried creating a matrix of equivalent size and it
    > only uses 35 MB of memory, but I am not sure why, with the code
    > above, the memory usage shoots up so fast and exceeds 2 GB.
    >
    > Any advice is greatly appreciated.
    >
    > Regards,
    > Jinxiang
     
    Peter Otten, Nov 30, 2010
    #3
