RE: Python does not take up available physical memory

Discussion in 'Python' started by Pradipto Banerjee, Oct 19, 2012.

  1. Thanks, I tried that. Still got MemoryError, but at least this time Python tried to use the physical memory. What I noticed is that before it gave me the error it used up to 1.5 GB (of the 2.23 GB originally shown as available), so in general, Python takes up more memory than the size of the file itself.

    -----Original Message-----
    From: Python-list [mailto:python-list-bounces+pradipto.banerjee=] On Behalf Of Emile van Sebille
    Sent: Saturday, October 20, 2012 2:46 AM
    To:
    Subject: Re: Python does not take up available physical memory

    On 10/19/2012 10:08 AM, Pradipto Banerjee wrote:
    > Hi,
    >
    > I am trying to read a file into memory. The size of the file is around 1
    > GB. I have a 3GB memory PC and the Windows Task Manager shows 2.3 GB
    > available physical memory when I was trying to read the file. I tried to
    > read the file as follows:
    >
    > >>> fdata = open(filename, 'r').read()

    >
    > I got a "MemoryError". I was watching the Windows Task Manager while I
    > ran the python command, and it appears that python **perhaps** never
    > even attempted to use more memory but gave me this error.
    >
    > Is there any reason why python can't read a 1GB file in memory even when
    > 2.3 GB of physical memory is available?


    The real issue is likely that there is more than one copy of the file in
    memory somewhere. I had a similar issue years back that I resolved by
    using numeric (now numpy?) as it had a more efficient method of
    importing content from disk.

    Also realize that Windows may not allow the full memory to user space.
    I'm not sure what exactly the restrictions are, but a 4 GB Windows box
    doesn't always get you 4 GB of memory.
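
Whether the interpreter itself is a 32-bit build matters as much as how much RAM is installed: a 32-bit process on Windows typically gets only about 2 GB of user address space by default, regardless of physical memory. A minimal sketch for checking the build, written for Python 3 and using only the standard library (the helper name `pointer_bits` is made up for illustration):

```python
import struct
import sys


def pointer_bits():
    """Return the pointer width of the running interpreter, in bits."""
    # "P" is the struct format code for a C void pointer.
    return struct.calcsize("P") * 8


bits = pointer_bits()
print("This interpreter is a %d-bit build" % bits)

# sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
is_64bit = sys.maxsize > 2**32
print("64-bit build: %s" % is_64bit)
```

If this reports a 32-bit build, a single 1 GB string plus any temporary copy can plausibly exhaust the address space even when Task Manager shows free physical memory.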

    Emile


    --
    http://mail.python.org/mailman/listinfo/python-list

    This communication is for informational purposes only. It is not intended to be, nor should it be construed or used as, financial, legal, tax or investment advice or an offer to sell, or a solicitation of any offer to buy, an interest in any fund advised by Ada Investment Management LP, the Investment advisor. Any offer or solicitation of an investment in any of the Funds may be made only by delivery of such Funds confidential offering materials to authorized prospective investors. An investment in any of the Funds is not suitable for all investors. No representation is made that the Funds will or are likely to achieve their objectives, or that any investor will or is likely to achieve results comparable to those shown, or will make any profit at all or will be able to avoid incurring substantial losses. Performance results are net of applicable fees, are unaudited and reflect reinvestment of income and profits. Past performance is no guarantee of future results. All financial data and other information are not warranted as to completeness or accuracy and are subject to change without notice.

    Any comments or statements made herein do not necessarily reflect those of Ada Investment Management LP and its affiliates. This transmission may contain information that is confidential, legally privileged, and/or exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution, or use of the information contained herein (including any reliance thereon) is strictly prohibited. If you received this transmission in error, please immediately contact the sender and destroy the material in its entirety, whether in electronic or hard copy format.
     
    Pradipto Banerjee, Oct 19, 2012
    #1

  2. Am 19.10.2012 21:03 schrieb Pradipto Banerjee:

    > Thanks, I tried that.


    What is "that"? It would be helpful to quote in a reasonable way. Look
    how others do it.


    > Still got MemoryError, but at least this time python tried to use the
    > physical memory. What I noticed is that before it gave me the error
    > it used up to 1.5GB (of the 2.23 GB originally showed as available) -
    > so in general, python takes up more memory than the size of the file
    > itself.


    Of course - the file is not the only thing to be held by the process.

    I see several approaches here:

    * Process the file part by part - as the others already suggested,
    line-wise, but if you have e.g. a binary file format, other partings may
    be suitable as well - e.g. fixed block size, or parts given by the file
    format.

    * If you absolutely have to keep the whole file data in memory, split it
    up in several strings. Why? Well, the free space in virtual memory is
    not necessarily contiguous. So even if you have 1.5G free, you might not
    be able to read 1.5G at once, but you might succeed in reading 3*0.5G.
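
Both approaches above can be sketched with one small generator. This is only an illustration under assumptions: the function name `read_in_chunks`, the 64 MB default block size, and the tiny demo file are all invented here, and the file is opened in binary mode.

```python
import os
import tempfile


def read_in_chunks(path, chunk_size=64 * 1024 * 1024):
    """Yield successive blocks of at most chunk_size bytes from path."""
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:  # an empty bytes object means end of file
                break
            yield block


# Demo on a small temporary file (hypothetical data, 1000 bytes).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"x" * 1000)

# Approach 1: process block by block, holding only one block at a time.
total = sum(len(block) for block in read_in_chunks(demo_path, chunk_size=300))

# Approach 2: keep everything, but as several smaller objects, so no
# single allocation needs one large contiguous region.
pieces = list(read_in_chunks(demo_path, chunk_size=300))
os.remove(demo_path)

print(total)                      # 1000
print([len(p) for p in pieces])   # [300, 300, 300, 100]
```

Note that fixed-size blocks can cut a line or record in half, so code that cares about record boundaries has to carry the tail of one block over to the next.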



    Thomas
     
    Thomas Rachel, Oct 19, 2012
    #2

  3. On Fri, 19 Oct 2012 14:03:37 -0500, Pradipto Banerjee wrote:

    > Thanks, I tried that. Still got MemoryError, but at least this time
    > python tried to use the physical memory. What I noticed is that before
    > it gave me the error it used up to 1.5GB (of the 2.23 GB originally
    > showed as available) - so in general, python takes up more memory than
    > the size of the file itself.


    Well of course it does. Once you read the data into memory, it has its
    own overhead for the object structure.

    You haven't told us what the file is or how you are reading it. I'm going
    to assume it is ASCII text and you are using Python 2.

    py> import os, sys
    py> open("test file", "w").write("abcde")
    py> os.stat("test file").st_size
    5L
    py> text = open("test file", "r").read()
    py> len(text)
    5
    py> sys.getsizeof(text)
    26

    So that confirms that a five byte ASCII string takes up five bytes on
    disk but 26 bytes in memory as an object.

    That overhead will depend on what sort of object, whether Unicode or not,
    the version of Python, and how you read the data.

    In general, if you have a huge amount of data to work with, you should
    try to work with it one line at a time:

    for line in open("some file"):
        process(line)


    rather than reading the whole file into memory at once:

    lines = open("some file").readlines()
    for line in lines:
        process(line)



    --
    Steven
     
    Steven D'Aprano, Oct 19, 2012
    #3
  4. Thanks for the illustration. This seems to be one of the biggest shortcomings of Python vs. Matlab. A number of people told me to read one line at a time, but I need to run processes on the whole data, e.g. compare one line versus another. So that option doesn't work.

     
    Pradipto Banerjee, Oct 19, 2012
    #4
  5. On Fri, 19 Oct 2012 17:20:23 -0500, Pradipto Banerjee declaimed the
    following in gmane.comp.python.general:

    > Thanks for the illustration. This seems to be one of the biggest shortcomings of Python vs. Matlab. A number of people told me to read one line at a time, but I need to run processes on the whole data, e.g. compare one line versus another. So that option doesn't work.


    And that requirement already suggests that reading the file en masse
    is inappropriate... Reading a 1GB mass and THEN splitting it into lines
    means you have 2GB (not counting overhead) in memory for some period of
    time (assuming the OS found a 1GB contiguous chunk of memory).

    I suspect Matlab's read is internally parsing on lines. You don't
    show the related Matlab read statement but...
    http://www.mathworks.com/help/matlab/ref/fscanf.html does both the read
    AND the conversion to the binary array format -- it doesn't read the
    file as a chunk and THEN convert it to an array; it only reads enough to
    fulfill one "format" string, saves that conversion, then reads the next
    amount.

    Large data DIFF and SORT are seldom run as in-memory operations --
    they work line-by-line using files (in the case of some SORT algorithms,
    many files: load 50-100 lines from source, sort in-memory, write to
    file-1; repeat for file-2, -3, ... -n; when you have written to "n"
    files, start back with the first file... Then do an -n file merge to
    another n-files... Repeat until there is only one output file)
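
A file-based sort along these lines can be sketched as follows. This is a simplified variant, not the exact multi-pass scheme described above: it writes sorted runs to temporary files and then merges all of them in a single pass with `heapq.merge`. The function names, the `lines_per_run` parameter, and the demo data are made up for illustration.

```python
import heapq
import itertools
import os
import shutil
import tempfile


def _write_run(sorted_lines):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as f:
        f.writelines(sorted_lines)
    return path


def external_sort(src_path, dst_path, lines_per_run=100000):
    """Sort the lines of src_path into dst_path, holding one run in memory."""
    run_paths = []
    with open(src_path) as src:
        while True:
            run = list(itertools.islice(src, lines_per_run))
            if not run:
                break
            # Ensure the final line ends with a newline so lines from
            # different runs cannot fuse together in the output.
            if not run[-1].endswith("\n"):
                run[-1] += "\n"
            run.sort()
            run_paths.append(_write_run(run))

    files = [open(p) for p in run_paths]
    try:
        with open(dst_path, "w") as dst:
            # heapq.merge reads lazily: roughly one line per run in memory.
            dst.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)


# Demo with a tiny hypothetical input and two lines per run.
demo_dir = tempfile.mkdtemp()
src = os.path.join(demo_dir, "unsorted.txt")
dst = os.path.join(demo_dir, "sorted.txt")
with open(src, "w") as f:
    f.write("pear\napple\nfig\nkiwi\nbanana")
external_sort(src, dst, lines_per_run=2)
with open(dst) as f:
    result = f.readlines()
shutil.rmtree(demo_dir)
print(result)
```

With very many runs, the single merge would hit the operating system's limit on open files, which is one reason the multi-pass merge described above exists.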
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Oct 20, 2012
    #5
  6. Thomas Rachel writes:

    > Am 19.10.2012 21:03 schrieb Pradipto Banerjee:


    [...]
    >> Still got MemoryError, but at least this time python tried to use the
    >> physical memory. What I noticed is that before it gave me the error
    >> it used up to 1.5GB (of the 2.23 GB originally showed as available) -
    >> so in general, python takes up more memory than the size of the file
    >> itself.

    >
    > Of course - the file is not the only thing to be held by the process.
    >
    > I see several approaches here:
    >
    > * Process the file part by part - as the others already suggested,
    > line-wise, but if you have e.g. a binary file format, other partings
    > may be suitable as well - e.g. fixed block size, or parts given by the
    > file format.
    >
    > * If you absolutely have to keep the whole file data in memory, split
    > it up in several strings. Why? Well, the free space in virtual memory
    > is not necessarily contiguous. So even if you have 1.5G free, you
    > might not be able to read 1.5G at once, but you might succeed in
    > reading 3*0.5G.


    * try mmap, if you're lucky it will give you access to your data.

    (Note that it is completely unreasonable to load several Gs of data in a
    32-bit address space, especially if this is text.) So my real advice
    would be:

    * read the file line by line and pack the contents of every line into
    a list of objects; once you have all your stuff, process it
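
For the mmap suggestion, a minimal sketch using the standard `mmap` module might look like this. The function name `count_lines_mmap` and the demo file are invented for illustration. The mapping lets the operating system page the file in on demand instead of building one large Python string, although a 32-bit process still needs enough contiguous address space for the mapping, hence "if you're lucky".

```python
import mmap
import os
import tempfile


def count_lines_mmap(path):
    """Count newline bytes in path without reading it into a Python string."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
        # Mapping an empty file raises ValueError.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            count = 0
            pos = mm.find(b"\n")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n", pos + 1)
            return count
        finally:
            mm.close()


# Demo on a small temporary file (hypothetical data).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"one\ntwo\nthree\n")
line_count = count_lines_mmap(demo_path)
os.remove(demo_path)
print(line_count)  # 3
```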

    -- Alain.
     
    Alain Ketterlin, Oct 20, 2012
    #6
  7. I tried this on a different PC with 12 GB RAM. As expected, this time reading the data was no issue. I noticed that for large files, Python takes up about 2.5x the file's on-disk size in memory when each line in the file is retained as a string within a Python list. As an anecdote, for MATLAB the corresponding overhead is 2x, slightly lower than Python, with each line retained as a string within a MATLAB cell. I'm curious, has anyone compared the in-memory overhead of data for other languages, for instance Ruby?
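
A rough way to estimate this kind of overhead for a list of lines is to add up `sys.getsizeof` for the list and for each string. The sketch below is written for Python 3 and uses made-up sample records. `sys.getsizeof` reports only each object's own size, not allocator overhead, so the result is an approximation and will differ between Python versions and builds.

```python
import sys


def list_of_lines_overhead(lines):
    """Return (payload_chars, approx_bytes_in_memory) for a list of strings."""
    # For ASCII text the character count equals the size on disk in bytes
    # (ignoring newline translation on Windows).
    payload = sum(len(line) for line in lines)
    # The list stores one pointer per element; each string carries its
    # own object header in addition to its characters.
    in_memory = sys.getsizeof(lines) + sum(sys.getsizeof(line) for line in lines)
    return payload, in_memory


# Hypothetical data: 100,000 short ASCII records of 25 characters each.
lines = ["%06d,some,short,record\n" % i for i in range(100000)]
payload, in_memory = list_of_lines_overhead(lines)
print("payload bytes:   %d" % payload)
print("approx. in RAM:  %d" % in_memory)
print("ratio:           %.2f" % (in_memory / payload))
```

The ratio depends strongly on line length: the shorter the lines, the more the fixed per-string overhead dominates.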


     
    Pradipto Banerjee, Oct 21, 2012
    #7
