using mmap on large (> 2 Gig) files

Martin v. Löwis

Anyone ever done this? It looks like Python 2.4 won't take a length arg

What architecture are you on? On a 32-bit architecture, it's likely
impossible to map in 2GiB, anyway (since it likely won't fit into the
available address space).

On a 64-bit architecture, this is a known limitation of Python 2.4:
you can't have containers with more than 2Gi items. This limitation
was removed in Python 2.5, so I recommend upgrading. Notice that
the code has seen little testing, due to lack of proper hardware,
so I suggest you review the mmap code first before using
it (or just test it out and report bugs as you find them).

Regards,
Martin
 
Travis E. Oliphant

Martin said:
What architecture are you on? On a 32-bit architecture, it's likely
impossible to map in 2GiB, anyway (since it likely won't fit into the
available address space).

On a 64-bit architecture, this is a known limitation of Python 2.4:
you can't have containers with more than 2Gi items. This limitation
was removed in Python 2.5, so I recommend upgrading. Notice that
the code has seen little testing, due to lack of proper hardware,

NumPy uses the mmap object and I saw a paper at SciPy 2006 that used
Python 2.5 + mmap + numpy to do some pretty nice and relatively fast
manipulations of very large data sets.

So, the very useful changes by Martin have seen more testing than he is
probably aware of.

-Travis
 
sturlamolden

Anyone ever done this? It looks like Python 2.4 won't take a length arg

http://docs.python.org/lib/module-mmap.html

It seems that Python does take a length argument, but not an offset
argument (unlike Windows' CreateFileMapping/MapViewOfFile and UNIX'
mmap), so you always map from the beginning of the file. Of course if
you have ever worked with memory mapping files in C, you will probably
have experienced that mapping a large file from beginning to end is a
major slowdown. And if the file is big enough, it does not even fit
inside the 32 bit memory space of your process. Thus you have to limit
the portion of the file that is mapped, using the offset and the length
arguments.
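The workaround described above can be sketched in Python itself: later Python versions (2.6 and up) did grow an offset argument, which must be a multiple of mmap.ALLOCATIONGRANULARITY, so a helper rounds it down and remembers the difference. The name map_window is hypothetical, purely for illustration:

```python
import mmap
import os
import tempfile

# Map only a window of a (possibly huge) file instead of the whole thing.
# The offset passed to mmap must be granularity-aligned, so round it down
# and report how many extra leading bytes got mapped.
def map_window(path, offset, length):
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (offset // gran) * gran   # largest aligned offset <= offset
    delta = offset - aligned            # extra bytes mapped before the window
    with open(path, "rb") as f:
        m = mmap.mmap(f.fileno(), delta + length,
                      access=mmap.ACCESS_READ, offset=aligned)
    return m, delta                     # the window is m[delta:delta + length]
```

The caller slices from delta, so arbitrary (unaligned) file offsets work while the address-space cost stays bounded by the window size.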

But the question remains whether Python's "mmap" qualifies as a "memory
mapping" at all. Memory mapping a file means that the file is "mapped"
into the process address space. So if you access a certain address
(using a pointer type in C), you will actually read from or write to
the file. On Windows, this mechanism is even used to access "files"
that do not live on the file system. E.g. calling CreateFileMapping
with the file handle set to INVALID_HANDLE_VALUE creates a file
mapping backed by the OS paging file. That is, you actually obtain a
shared memory segment usable e.g. for inter-process communication.
How would you use Python's mmap for something like this?
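For what it's worth, Python's mmap can do something close to this: passing -1 as the file descriptor yields an anonymous mapping, which on Windows is backed by the paging file much like CreateFileMapping with INVALID_HANDLE_VALUE (a tagname= keyword there makes the mapping nameable, so another process can open it for IPC). A minimal sketch:

```python
import mmap

# Anonymous mapping: no regular file behind it. On Windows this is
# paging-file-backed; on UNIX it is MAP_ANONYMOUS memory.
shm = mmap.mmap(-1, 4096)   # 4 KiB of anonymous memory
shm[:5] = b"hello"          # write through the mapping
data = shm[:5]              # read it back through the same mapping
```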

I haven't looked at the source, but I'd be surprised if Python actually
maps the file into the process image when mmap is called. I believe
Python is not memory mapping at all; rather, it just opens a file in
the file system and uses fseek to move around. That is, you can use
slicing operators on Python's "memory mapped file object" as if it were
a list or a string, but it's not really memory mapping, it's just a
syntactic convenience. Because of this, you even need to manually
"flush" the memory mapping object. If you were talking to a real memory
mapped file, flushing would obviously not be required.

This probably means that your problem is irrelevant. Even if the file
is too large to fit inside a 32 bit process image, Python's memory
mapping would not be affected by this, as it is not memory mapping the
file when "mmap" is called.
 
sturlamolden

Martin said:
What architecture are you on? On a 32-bit architecture, it's likely
impossible to map in 2GiB, anyway (since it likely won't fit into the
available address space).

Indeed. But why does Python's memory mapping need to be flushed? And
why doesn't Python's mmap take an offset argument to handle large
files? Is Python actually memory mapping with mmap or just faking it
with fseek? If Python isn't memory mapping, there would be no limit
imposed by the 32 bit address space.
 
sturlamolden

Hi
Anyone ever done this? It looks like Python 2.4 won't take a length arg

Looking at Python's source (mmapmodule.c), it seems that "mmap.mmap"
always sets the offset argument in Windows' MapViewOfFile and UNIX'
mmap to 0. This means that it is always mapping from the beginning of
the file. Thus, Python's mmap module is useless for large files. This
is really bad coding. The one who wrote mmapmodule.c didn't consider
the possibility that a 64-bit file system like NTFS can harbour files
too large to fit in a 32-bit address space. Thus, mmapmodule.c needs
to be fixed before it can be used for large files.
 
 
myeates

Well, compiling Python 2.5 on Solaris 10 on an x86 is no walk in the
park. pyconfig.h seems to think SIZEOF_LONG is 4 and I SEGV during my
build, even after modifying the Makefile and pyconfig.h.

Mathew
 
Fredrik Lundh

sturlamolden said:
Looking at Python's source (mmapmodule.c), it seems that "mmap.mmap"
always sets the offset argument in Windows' MapViewOfFile and UNIX'
mmap to 0. This means that it is always mapping from the beginning of
the file. Thus, Python's mmap module is useless for large files. This
is really bad coding. The one who wrote mmapmodule.c didn't consider
the possibility that a 64-bit file system like NTFS can harbour files
too large to fit in a 32-bit address space. Thus, mmapmodule.c needs to be
fixed before it can be used for large files.

if you've gotten that far, maybe you could come up with a patch, instead
of stating that someone else "needs to fix it" ?

</F>
 
Donn Cave

"sturlamolden said:
It seems that Python does take a length argument, but not an offset
argument (unlike the Windows' CreateFileMapping/MapViewOfFile and UNIX'
mmap), so you always map from the beginning of the file. Of course if
you have ever worked with memory mapping files in C, you will probably
have experienced that mapping a large file from beginning to end is a
major slowdown.

I certainly have not experienced that. mmap itself takes nearly
no time; there should be no I/O. Access to mapped pages may
require I/O, but there is no way around that in any case.
I haven't looked at the source, but I'd be surprised if Python actually
maps the file into the process image when mmap is called. I believe
Python is not memory mapping at all; rather, it just opens a file in
the file system and uses fseek to move around.

Wow, you're sure a wizard! Most people would need to look before
making statements like that.

Donn Cave, (e-mail address removed)
 
Martin v. Löwis

sturlamolden said:
Indeed. But why does Python's memory mapping need to be flushed?

It doesn't need to; why do you think it does?
And why doesn't Python's mmap take an offset argument to handle large
files?

I don't know exactly; the most likely reason is that nobody has
contributed code to make it support that. That's, in turn, probably
because nobody had the problem yet, or nobody of those who did
cared enough to implement and contribute a patch.
Is Python actually memory mapping with mmap or just faking it
with fseek?

Read the source, Luke. It uses mmap or MapViewOfFile, depending
on the platform.

Regards,
Martin
 
Martin v. Löwis

sturlamolden said:
Looking at Python's source (mmapmodule.c), it seems that "mmap.mmap"
always sets the offset argument in Windows' MapViewOfFile and UNIX'
mmap to 0. This means that it is always mapping from the beginning of
the file. Thus, Python's mmap module is useless for large files. This
is really bad coding. The one who wrote mmapmodule.c didn't consider
the possibility that a 64-bit file system like NTFS can harbour files
too large to fit in a 32-bit address space. Thus, mmapmodule.c needs
to be fixed before it can be used for large files.

You know this isn't true in general. It is true for a 32-bit address
space only.

Regards,
Martin
 
S

sturlamolden

Fredrik said:
if you've gotten that far, maybe you could come up with a patch, instead
of stating that someone else "needs to fix it" ?

I did not say "someone else" needs to fix it. I can patch it, but I am
busy until next weekend. This is a typical job for a cold, rainy
Saturday afternoon. Also I am not in a hurry to patch mmapmodule.c for
my own projects, as I am not using it (but I am going to).

A patch would involve a new object, say, "mmap.mmap2", that takes the
additional offset parameter. I don't want it to break any code
dependent on the existing "mmap.mmap" object. Also, I think mmap.mmap2
should allow the file object to be None, and in that case return a
shared memory segment backed by the OS' paging file. Calling
CreateFileMapping with the file handle set to INVALID_HANDLE_VALUE is
how shared memory for IPC is created on Windows.
 
Martin v. Löwis

sturlamolden said:
A patch would involve a new object, say, "mmap.mmap2", that takes the
additional offset parameter. I don't want it to break any code
dependent on the existing "mmap.mmap" object. Also, I think mmap.mmap2
should allow the file object to be None, and in that case return a
shared memory segment backed by the OS' paging file. Calling
CreateFileMapping with the file handle set to INVALID_HANDLE_VALUE is
how shared memory for IPC is created on Windows.

Python has default parameters for that. Just add a new parameter,
and make it have a default value of 0. No need to add new functions
(let alone types).
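Martin's point can be sketched at the Python level: a defaulted offset extends the API without disturbing existing callers, so no separate mmap2 type is needed. (This is in fact the route Python eventually took in 2.6, where offset defaults to 0.) The wrapper name open_mapping is hypothetical:

```python
import mmap
import tempfile

# A defaulted parameter keeps old call sites working unchanged:
# open_mapping(fd, n) behaves exactly like before the offset existed.
def open_mapping(fileno, length, offset=0):
    return mmap.mmap(fileno, length, access=mmap.ACCESS_READ, offset=offset)
```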

In any case, take as much time as you need. Python 2.6 won't be
released until 2008.

Regards,
Martin
 
sturlamolden

Martin said:
You know this isn't true in general. It is true for a 32-bit address
space only.

Yes, but there are two other aspects:

1. Many of us use 32-bit architectures. The one who wrote the module
should have considered why UNIX' mmap and Windows' MapViewOfFile take
an offset parameter. As it is now, "mmap.mmap" can be considered
inadequate on 32-bit architectures.

2. The OS may be stupid. Mapping a large file may be a major slowdown
simply because the memory mapping is implemented suboptimally inside
the OS. For example, it may try to load and synchronise huge portions of
the file that you don't need. This will deplete the amount of free RAM,
and perhaps result in excessive swapping. "mmap.mmap" is therefore a
potential "tarpit" on any architecture. Thus, memory mapping more than
you need is not intelligent, even if you do have a 64 bit processor.
The missing offset argument is essential for getting adequate
performance from a memory-mapped file object.
 
sturlamolden

Donn said:
Wow, you're sure a wizard! Most people would need to look before
making statements like that.

I know, but your news-server doesn't honour cancel messages. :)

Python's mmap does indeed memory map the file into the process image.
It does not fake memory mapping by means of file seek operations.

However, "memory mapping" a file by means of fseek() is probably more
efficient than using UNIX' mmap() or Windows'
CreateFileMapping()/MapViewOfFile(). In Python, we don't always need
the file memory mapped; we normally just want to use slicing operators,
for-loops and other goodies on the file object -- i.e. we just want to
treat the file as a Python container object. There are many ways of
achieving that.

We can implement a container object backed by a binary file just as
efficiently (and possibly even more efficiently) without using the OS'
memory mapping facilities. The major advantage is that we can
"pseudo-memory-map" a lot more than a 32-bit address space can harbour.


However - as I wrote in another posting - memory-mapping may also be
used to create shared memory on Windows, and that doesn't fit easily
into the fseek scheme. But apart from that, I don't see why true memory
mapping has any real advantage in Python. As long as slicing operators
work, users will probably not be able to tell the difference.

There is in any case room for improving Python's mmap object.
 
sturlamolden

Martin v. Löwis wrote:
It doesn't need to; why do you think it does?

(Your news server doesn't honour cancel messages either...)

This was an extremely stupid question on my side. It needs to be
flushed after a write because that's how the memory pages mapping the
file are synchronized with the file. Writes to the mapped
addresses aren't immediately synchronized with the file on disk. Both
Windows and UNIX require this. I should think before I write, but I
realized this after posting and my cancel didn't reach you.
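The point above can be demonstrated in a few lines: writes land in the mapped pages first, and flush() pushes the dirty pages out to the file (msync() on UNIX, FlushViewOfFile() on Windows), after which ordinary I/O sees them.

```python
import mmap
import os
import tempfile

# Create a small file, modify it through a mapping, flush, and observe
# the change through a plain read.
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"\x00" * 16)
    path = tf.name

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 16)
    m[:4] = b"sync"   # modifies the mapped pages, not (yet) the disk
    m.flush()         # synchronize the dirty pages with the file
    m.close()

with open(path, "rb") as f:
    head = f.read(4)  # the write is now visible through ordinary I/O
```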
Read the source, Luke. It uses mmap or MapViewOfFile, depending
on the platform.

Yes, indeed.
 
Steve Holden

sturlamolden wrote:
[...]
This was an extremely stupid question on my side.

I take my hat off to anyone who's prepared to admit this. We all do it,
but most of us try to ignore the fact.

regards
Steve
 
Martin v. Löwis

sturlamolden said:
2. The OS may be stupid. Mapping a large file may be a major slowdown
simply because the memory mapping is implemented suboptimally inside
the OS. For example it may try to load and synchronise huge portions of
the file that you don't need.

Can you give an example of an operating system that behaves that way?
To my knowledge, all current systems integrate memory mapping
with the page/buffer caches, using various strategies to write back
(or just discard, in the case of no writes) pages that haven't been used
for a while.
The missing offset argument is essential for getting adequate
performance from a memory-mapped file object.

I very much question that statement. Do you have any numbers to
prove it?

Regards,
Martin
 
