Is there no compression support for large sized strings in Python?

F

Fredrik Lundh

Claudio said:
What started as a simple test of whether it is better to load uncompressed
data directly from the hard disk, or to load compressed data and uncompress
it (Windows XP SP2, Pentium 4 3.0 GHz system with 3 GByte RAM), seems to show
that none of the compression libraries available in Python really works for
large (i.e. 500 MByte) strings.

Test the provided code and see for yourself.

At least on my system:
zlib fails to decompress, raising a memory error
pylzma fails to decompress, running endlessly and consuming 99% of CPU time
bz2 fails to compress, running endlessly and consuming 99% of CPU time

The same works with a 10 MByte string without any problem.

So what? Is there no compression support for large sized strings in Python?

you're probably measuring Windows' memory management rather than the
compression libraries themselves (Python delegates all memory allocations
>256 bytes to the system).

I suggest using incremental (streaming) processing instead; from what I can tell,
all three libraries support that.
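For example, a minimal, untested sketch of such chunked processing with
zlib's streaming objects might look like this (the chunk size, file handling
and function names are arbitrary choices, not anything from the thread):

import zlib

CHUNK = 1048576  # 1 MByte per call, so no single call ever sees the whole string

def compress_file(srcPath, dstPath):
    # feed the source to a streaming compressor one chunk at a time
    src = open(srcPath, 'rb')
    dst = open(dstPath, 'wb')
    cobj = zlib.compressobj()
    while 1:
        data = src.read(CHUNK)
        if not data:
            break
        dst.write(cobj.compress(data))
    dst.write(cobj.flush())
    src.close()
    dst.close()

def decompress_file(srcPath, dstPath):
    # same idea in reverse, with a streaming decompressor
    src = open(srcPath, 'rb')
    dst = open(dstPath, 'wb')
    dobj = zlib.decompressobj()
    while 1:
        data = src.read(CHUNK)
        if not data:
            break
        dst.write(dobj.decompress(data))
    dst.write(dobj.flush())
    src.close()
    dst.close()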

</F>
 
J

jepler

On this system (Linux 2.6.x, AMD64, 2 GB RAM, python2.4) I am able to
construct a 1 GB string by repetition, as well as compress a 512MB
string with gzip in one gulp.

$ cat claudio.py
s = '1234567890'*(1048576*50)

import zlib
c = zlib.compress(s)
print len(c)
open("/tmp/claudio.gz", "wb").write(c)

$ python claudio.py
1017769

$ python -c 'print len("m" * (1048576*1024))'
1073741824

I was also able to create a 1GB string on a different system (Linux 2.4.x,
32-bit Dual Intel Xeon, 8GB RAM, python 2.2).

$ python -c 'print len("m" * 1024*1024*1024)'
1073741824

I agree with another poster that you may be hitting Windows limitations rather
than Python ones, but I am certainly not familiar with the details of Windows
memory allocation.

Jeff

 
C

Claudio Grondi

What started as a simple test of whether it is better to load uncompressed
data directly from the hard disk, or to load compressed data and uncompress
it (Windows XP SP2, Pentium 4 3.0 GHz system with 3 GByte RAM), seems to show
that none of the compression libraries available in Python really works for
large (i.e. 500 MByte) strings.

Test the provided code and see for yourself.

At least on my system:
zlib fails to decompress, raising a memory error
pylzma fails to decompress, running endlessly and consuming 99% of CPU time
bz2 fails to compress, running endlessly and consuming 99% of CPU time

The same works with a 10 MByte string without any problem.

So what? Is there no compression support for large sized strings in Python?
Am I doing something wrong here?
Is there a limit, and if yes, what is the theoretical upper limit on the
string size that each of the compression libraries can process?

The only limit I know about is 2 GByte for the python.exe process itself,
but this does not seem to be the actual problem in this case.
There are also some other strange effects when trying to create large
strings using the following code:

m = 'm'*1048576
# str1024MB = 1024*m # fails with memory error, but:
str512MB_01 = 512*m # works ok
# str512MB_02 = 512*m # fails with memory error, but:
str256MB_01 = 256*m # works ok
str256MB_02 = 256*m # works ok

and so on, down to allocating each single MByte as a separate string, which
pushes python.exe to the observed upper limit of memory available to it as
reported by the Windows Task Manager: 2,065,352 KByte.
Is the question of why the str1024MB = 1024*m instruction fails, even though
the memory is apparently there and the target size of 1 GByte can be reached
in smaller pieces, out of the scope of this discussion thread, or is it the
same problem that makes the compression libraries fail? And why is no memory
error raised then?

Any hints towards understanding what is going on and why, and/or towards a
workaround, are welcome.

Claudio

============================================================
# HDvsArchiveUnpackingSpeed_WriteFiles.py

strSize10MB = '1234567890'*1048576 # 10 MB
strSize500MB = 50*strSize10MB
fObj = file(r'c:\strSize500MB.dat', 'wb')
fObj.write(strSize500MB)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.zlib', 'wb')
import zlib
strSize500MBCompressed = zlib.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.pylzma', 'wb')
import pylzma
strSize500MBCompressed = pylzma.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

fObj = file(r'c:\strSize500MBCompressed.bz2', 'wb')
import bz2
strSize500MBCompressed = bz2.compress(strSize500MB)
fObj.write(strSize500MBCompressed)
fObj.close()

print
print ' Created files: '
print ' %s \n %s \n %s \n %s' % (
    r'c:\strSize500MB.dat',
    r'c:\strSize500MBCompressed.zlib',
    r'c:\strSize500MBCompressed.pylzma',
    r'c:\strSize500MBCompressed.bz2',
)

raw_input(' EXIT with Enter /> ')

============================================================
# HDvsArchiveUnpackingSpeed_TestSpeed.py
import time

startTime = time.clock()
fObj = file(r'c:\strSize500MB.dat', 'rb')
strSize500MB = fObj.read()
fObj.close()
print
print ' loading uncompressed data from file: %7.3f seconds' % (time.clock()-startTime,)

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.zlib', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds' % (time.clock()-startTime,)
import zlib
try:
    startTime = time.clock()
    strSize500MB = zlib.decompress(strSize500MBCompressed)
    print 'decompressing zlib data: %7.3f seconds' % (time.clock()-startTime,)
except:
    print 'decompressing zlib data FAILED'

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.pylzma', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds' % (time.clock()-startTime,)
import pylzma
try:
    startTime = time.clock()
    strSize500MB = pylzma.decompress(strSize500MBCompressed)
    print 'decompressing pylzma data: %7.3f seconds' % (time.clock()-startTime,)
except:
    print 'decompressing pylzma data FAILED'

startTime = time.clock()
fObj = file(r'c:\strSize500MBCompressed.bz2', 'rb')
strSize500MBCompressed = fObj.read()
fObj.close()
print
print 'loading compressed data from file: %7.3f seconds' % (time.clock()-startTime,)
import bz2
try:
    startTime = time.clock()
    strSize500MB = bz2.decompress(strSize500MBCompressed)
    print 'decompressing bz2 data: %7.3f seconds' % (time.clock()-startTime,)
except:
    print 'decompressing bz2 data FAILED'

raw_input(' EXIT with Enter /> ')
 
G

Gerald Klix

Did you consider the mmap library?
Perhaps it is possible to avoid holding these big strings in memory.
BTW: AFAIK it is not possible on 32-bit Windows for an ordinary program
to allocate more than 2 GB. That restriction comes from the Jurassic
MIPS processors, which reserved the upper 2 GB for the OS.

HTH,
Gerald
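For instance (an untested sketch with made-up file names and a 1 MByte slice
size; the zlib streaming object is my own choice here, not part of Gerald's
suggestion): the data file can be memory-mapped and compressed slice by
slice, so the full 500 MByte never has to live in one Python string. The
mapping itself is of course still limited to what fits into a C int on a
32-bit build.

import mmap
import zlib

fIn = open(r'c:\strSize500MB.dat', 'rb')
# length 0 maps the whole file; ACCESS_READ keeps the mapping read-only
fMap = mmap.mmap(fIn.fileno(), 0, access=mmap.ACCESS_READ)
fOut = open(r'c:\strSize500MB.zlib', 'wb')

compressor = zlib.compressobj()
for offset in xrange(0, len(fMap), 1048576):
    # slicing an mmap object returns an ordinary (small) string
    fOut.write(compressor.compress(fMap[offset:offset + 1048576]))
fOut.write(compressor.flush())

fOut.close()
fMap.close()
fIn.close()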

 
C

Claudio Grondi

Fredrik Lundh said:
you're probably measuring Windows' memory management rather than the
compression libraries themselves (Python delegates all memory allocations
>256 bytes to the system).

I suggest using incremental (streaming) processing instead; from what I can tell,
all three libraries support that.

</F>

I have solved the problem with bz2 compression the way Fredrik suggested:

fObj = file(r'd:\strSize500MBCompressed.bz2', 'wb')
import bz2
objBZ2Compressor = bz2.BZ2Compressor()
lstCompressBz2 = []
for indx in range(0, len(strSize500MB), 1048576):
    lowerIndx = indx
    upperIndx = indx + 1048576
    if upperIndx > len(strSize500MB): upperIndx = len(strSize500MB)
    lstCompressBz2.append(objBZ2Compressor.compress(strSize500MB[lowerIndx:upperIndx]))
#:for
lstCompressBz2.append(objBZ2Compressor.flush())
strSize500MBCompressed = ''.join(lstCompressBz2)
fObj.write(strSize500MBCompressed)
fObj.close()

:)
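By the same logic the decompression side should work with
bz2.BZ2Decompressor and chunked reads; a rough, untested sketch along the
same lines (chunk size and file name as above):

import bz2

fObj = file(r'd:\strSize500MBCompressed.bz2', 'rb')
objBZ2Decompressor = bz2.BZ2Decompressor()
lstDecompressed = []
while 1:
    strChunk = fObj.read(1048576)
    if not strChunk:
        break
    lstDecompressed.append(objBZ2Decompressor.decompress(strChunk))
#:while
fObj.close()
strSize500MB = ''.join(lstDecompressed)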

so I suppose that the decompression problems can also be solved that way,
but:

This still doesn't answer for me what the core of the problem was, how to
avoid it, and what memory request limits should be considered when working
with large strings.
Is it actually the case that on systems other than Windows 2000/XP there is
no problem with the original code I provided?
Maybe that is a good reason to go for Linux instead of Windows? Do e.g. SuSE
or Mandriva Linux also have a limit on the memory a single Python process
can use? Please let me know about your experience.

Claudio
 
C

Christophe

Gerald Klix wrote:
Did you consider the mmap library?
Perhaps it is possible to avoid holding these big strings in memory.
BTW: AFAIK it is not possible on 32-bit Windows for an ordinary program
to allocate more than 2 GB. That restriction comes from the Jurassic
MIPS processors, which reserved the upper 2 GB for the OS.

As a matter of fact, it's Windows which reserves the upper 2 GB. There is a
simple setting to change that value so that you have 3 GB available, and
another setting which can even go as far as 3.5 GB available per process.
 
H

Harald Karner

Claudio said:
Anyone on a big Linux machine able to do e.g.:
>python -c "print len('m' * 2500*1024*1024)"
or even more without a memory error?

I tried on a Sun with 16 GB RAM (Python 2.3.2);
seems like 2 GB is the limit for string size:

> python -c "print len('m' * 2048*1024*1024)"
Traceback (most recent call last):
> python -c "print len('m' * ((2048*1024*1024)-1))"
2147483647
 
C

Claudio Grondi

I was also able to create a 1GB string on a different system (Linux 2.4.x,
32-bit Dual Intel Xeon, 8GB RAM, python 2.2).

$ python -c 'print len("m" * 1024*1024*1024)'
1073741824

I agree with another poster that you may be hitting Windows limitations
rather than Python ones, but I am certainly not familiar with the details of
Windows memory allocation.

Jeff
----------

Here is my experience hunting for the memory limit exactly the way Jeff did
(Windows 2000, Intel Pentium 4, 3 GB RAM, Python 2.4.2):

>python -c "print len('m' * 1024*1024*1024)"
1073741824

>python -c "print len('m' * 1136*1024*1024)"
1191182336

>python -c "print len('m' * 1236*1024*1024)"
Traceback (most recent call last):
  File "<string>", line 1, in ?
MemoryError

Anyone on a big Linux machine able to do e.g.:
>python -c "print len('m' * 2500*1024*1024)"
or even more without a memory error?

I suppose that on the Dual Intel Xeon, even with 8 GByte RAM, the upper
limit of available memory will not be larger than 4 GByte.
Can someone point me to an Intel-compatible PC which is able to provide more
than 4 GByte of RAM to Python?

Claudio
 
F

Fredrik Lundh

Harald said:
I tried on a Sun with 16 GB RAM (Python 2.3.2);
seems like 2 GB is the limit for string size:

> python -c "print len('m' * 2048*1024*1024)"
Traceback (most recent call last):
> python -c "print len('m' * ((2048*1024*1024)-1))"
2147483647

the string type uses the ob_size field to hold the string length, and
ob_size is an integer:

$ more Include/object.h
...
int ob_size; /* Number of items in variable part */
...
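That lines up with the limit Harald ran into; for reference, a signed 32-bit
value simply cannot count past:

print 2**31 - 1            # 2147483647
print 2048*1024*1024 - 1   # 2147483647, the largest length Harald could get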

anyone out there with an ILP64 system?

</F>
 
C

Claudio Grondi

Harald Karner said:
I tried on a Sun with 16 GB RAM (Python 2.3.2);
seems like 2 GB is the limit for string size:

> python -c "print len('m' * 2048*1024*1024)"
Traceback (most recent call last):
> python -c "print len('m' * ((2048*1024*1024)-1))"
2147483647

In this context I am very curious how many such 2 GByte strings it is
possible to create within a single Python process, i.e. at which of the
following lines, executed as one script, is there a memory error?

dataStringA = 'A'*((2048*1024*1024)-1) # 2 GByte
dataStringB = 'B'*((2048*1024*1024)-1) # 4 GByte
dataStringC = 'C'*((2048*1024*1024)-1) # 6 GByte
dataStringD = 'D'*((2048*1024*1024)-1) # 8 GByte
dataStringE = 'E'*((2048*1024*1024)-1) # 10 GByte
dataStringF = 'F'*((2048*1024*1024)-1) # 12 GByte
dataStringG = 'G'*((2048*1024*1024)-1) # 14 GByte

leaving 2 GByte for the system on a 16 GByte machine ... ;-)

Claudio
 
C

Claudio Grondi

the string type uses the ob_size field to hold the string length, and
ob_size is an integer:

$ more Include/object.h
...
int ob_size; /* Number of items in variable part */

If this is what you mean,

#define PyObject_VAR_HEAD \
PyObject_HEAD \
int ob_size; /* Number of items in variable part */

and if I understand it the proper way
(i.e. that all Python types are derived from Python objects),
then the unlimited-size integers are also limited to values which
fit into 2 GByte of memory, right?
And a list or a dictionary is likewise not designed to have
more than 2 Giga elements, etc.

So the question which still remains open is: can Python by design
handle an address space larger than 2 GByte?

I can't check it out myself, being on a Windows system which already
limits a single process to this address space.
With lists I hit the memory limit at around:
python -c "print len(280*1024*1024*[None])"
(where the required memory for this list is around 1.15 GByte or more -
on Windows 2000, Pentium 4, with 3 GByte RAM and Python 2.4.2).

Claudio
 
A

Alex Martelli

Claudio Grondi said:
In this context I am very curious how many such 2 GByte strings it is
possible to create within a single Python process?

VM (Virtual Memory) may make the issue difficult to answer precisely.

With a Python build for 64-bit addressing (and running, of course, on a
64-bit machine), you could go on for a long time. If your virtual
memory space is large enough (say a nice entire terabyte RAID diskset),
and you don't use resource limiting to throttle the process, you could
be thrashing (with about 1000 GB of VM backed by only 14 GB of physical
RAM, I predict *LOTS AND LOTS* of disk activity!) for a very, very long
time before you finally get an out-of-memory error.
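For what it's worth, a minimal sketch of that kind of throttling on a POSIX
system where RLIMIT_AS is available (the figures are made up): with the cap
in place an oversized allocation fails promptly with MemoryError instead of
grinding through swap for hours.

import resource

limit = 1024 * 1024 * 1024                # cap the address space at ~1 GByte
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

try:
    s = 'm' * (1536 * 1024 * 1024)        # well beyond the cap
except MemoryError:
    print 'allocation refused by the resource limit'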

Change the parameters and the answer will change, of course -- Python
has relatively little to do with it, as you can build it for either
64-bit or 32-bit addressing, on suitable CPUs; the OS's VM
implementation (and of course the CPU) essentially dominate this
"problem space".


Alex
 
C

Claudio Grondi

Gerald Klix said:
Did you consider the mmap library?
Perhaps it is possible to avoid holding these big strings in memory.
BTW: AFAIK it is not possible on 32-bit Windows for an ordinary program
to allocate more than 2 GB. That restriction comes from the Jurassic
MIPS processors, which reserved the upper 2 GB for the OS.

HTH,
Gerald
objMmap = mmap.mmap(fHdl,os.fstat(fHdl)[6])

Traceback (most recent call last):
  File "<pyshell#21>", line 1, in -toplevel-
    objMmap = mmap.mmap(fHdl,os.fstat(fHdl)[6])
OverflowError: memory mapped size is too large (limited by C int)
4498001104L

The max. allowed value here is 256*256*256*128-1,
i.e. 2147483647.

The 'Jurassic' greets us in Python, too.

The only existing 'workaround' seems to be
to go for a 64-bit machine with a 64-bit Python version.

Is there no other known way? Can the Python code not be adjusted so that a
C long long is used instead of a C int?

Claudio
 
C

Christopher Subich

Fredrik said:
the string type uses the ob_size field to hold the string length, and
ob_size is an integer:

$ more Include/object.h
...
int ob_size; /* Number of items in variable part */
...

anyone out there with an ILP64 system?

I have access to an Itanium system with a metric ton of memory. I
-think- that the Python version is still only a 32-bit Python, though
(any easy way of checking?). It's an old version of Python, but I'm not the
sysadmin, and "I want to play around with python" isn't a good enough
reason for an upgrade. :)
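One quick way to check (assuming the struct module's native 'P' format is
available on that Python build): the size of a C pointer in the running
interpreter tells you whether the build itself is 32-bit or 64-bit,
independent of the hardware:

import struct
print struct.calcsize("P") * 8   # 32 -> 32-bit build, 64 -> 64-bit build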


Python 2.2.3 (#1, Nov 12 2004, 13:02:04)
[GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-42)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> len(str)
-2097152

Yes, that's a negative length. And I don't really care about rebinding
str for this demo. :)

>>> len(str[:])
-2097152
>>> l = list(str)
>>> len(l)
0
>>> l
[]

The string is actually created -- top reports 4.0GB of memory usage.
 
D

Dennis Lee Bieber

objMmap = mmap.mmap(fHdl,os.fstat(fHdl)[6])

Traceback (most recent call last):
  File "<pyshell#21>", line 1, in -toplevel-
    objMmap = mmap.mmap(fHdl,os.fstat(fHdl)[6])
OverflowError: memory mapped size is too large (limited by C int)
4498001104L

The max. allowed value here is 256*256*256*128-1,
i.e. 2147483647.

The 'Jurassic' greets us in Python, too.

The only existing 'workaround' seems to be
to go for a 64-bit machine with a 64-bit Python version.

Is there no other known way? Can the Python code not be adjusted so that a
C long long is used instead of a C int?
That looks like it might be a low-level run-time limitation --
especially if fseek() is used at some stage, since fseek(), as I recall,
permits negative seeks.
 
F

Fredrik Lundh

Christopher said:
I have access to an Itanium system with a metric ton of memory. I
-think- that the Python version is still only a 32-bit Python

an ILP64 system is a system where int, long, and pointer are all 64 bits,
so a 32-bit python on a 64-bit platform doesn't really qualify.

/... snip examples that show that python's string handling could need
some work for the len(s) > maxint case .../

</F>
 
C

Christopher Subich

Fredrik said:
an ILP64 system is a system where int, long, and pointer are all 64 bits,
so a 32-bit python on a 64-bit platform doesn't really qualify.

Did a quick check, and int is 32 bits, while long and pointer are each 64:

Python 2.2.3 (#1, Nov 12 2004, 13:02:04)
[GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-42)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(4, 8, 8)

So, as of 2.2.3, there might still be a problem.
 
