using mmap on large (> 2 Gig) files

Tim Roberts

sturlamolden said:
However, "memory mapping" a file by means of fseek() is probably more
efficient than using UNIX' mmap() or Windows'
CreateFileMapping()/MapViewOfFile().

My goodness, do I disagree with that! At least on Windows, I/O on a file
mapped with MapViewOfFile uses the virtual memory pager -- the same
mechanism used by the swap file. Because it is so heavily used, that is
some of the most well-optimized code in the system.

sturlamolden said:
We can implement a container object backed by a binary file just as
efficient (and possibly even more efficient) without using the OS'
memory mapping facilities. The major advantage is that we can
"pseudo-memory map" a lot more than a 32-bit address space can harbour.

Both the Unix mmap and the Win32 MapViewOfFile allow a starting byte
offset. It wouldn't be rocket science to extend Python's mmap to allow
that.
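
As an illustration of the starting-offset idea: later Python releases did add an `offset` argument to `mmap.mmap`, which must be a multiple of `mmap.ALLOCATIONGRANULARITY`. The sketch below is a minimal example under that assumption; `map_window` and `demo_window` are hypothetical helper names, and the temporary file is only a small stand-in for a multi-gigabyte one.

```python
import mmap
import tempfile


def map_window(fileobj, start, length):
    """Map `length` bytes of `fileobj` beginning at byte `start`.

    Returns (mapping, delta); the requested bytes are
    mapping[delta:delta + length].
    """
    granularity = mmap.ALLOCATIONGRANULARITY
    # The offset passed to mmap must be aligned, so round `start` down.
    aligned = (start // granularity) * granularity
    delta = start - aligned
    mapping = mmap.mmap(fileobj.fileno(), delta + length,
                        access=mmap.ACCESS_READ, offset=aligned)
    return mapping, delta


def demo_window():
    granularity = mmap.ALLOCATIONGRANULARITY
    with tempfile.TemporaryFile() as f:
        # Filler, then a marker at an unaligned position, then more filler.
        f.write(b"\0" * (2 * granularity + 100))
        f.write(b"MARKER")
        f.write(b"\0" * granularity)
        f.flush()
        mapping, delta = map_window(f, 2 * granularity + 100, 6)
        try:
            return mapping[delta:delta + 6]
        finally:
            mapping.close()
```

Only the window is mapped, so the address space used depends on `length` rather than on the size of the file.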

sturlamolden said:
There is in any case room for improving Python's mmap object.

Here we agree.
 
Paul Rubin

sturlamolden said:
However, "memory mapping" a file by means of fseek() is probably more
efficient than using UNIX' mmap() or Windows'
CreateFileMapping()/MapViewOfFile().

Why on earth would you think that?! It is counterintuitive. fseek beyond
whatever is buffered in stdio (usually no more than 1 kbyte or so)
requires a system call, while mmap is just a memory access.

sturlamolden said:
In Python, we don't always need the file memory mapped, we normally
just want to use slicing-operators, for-loops and other goodies on
the file object -- i.e. we just want to treat the file as a Python
container object. There are many ways of achieving that.
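
One of those many ways might look like the sketch below: a read-only container that answers indexing and contiguous slices with `seek()` and `read()`. The class name `SeekBackedBytes` and the helper `demo_container` are hypothetical, and this is only an illustration of the idea, not code from anyone in this thread.

```python
import os
import tempfile


class SeekBackedBytes:
    """Read-only, sequence-like view of a binary file using seek() and read()."""

    def __init__(self, fileobj):
        self._f = fileobj
        self._size = os.fstat(fileobj.fileno()).st_size

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        if isinstance(index, slice):
            start, stop, step = index.indices(self._size)
            if step != 1:
                raise ValueError("only contiguous slices are supported here")
            self._f.seek(start)
            return self._f.read(max(0, stop - start))
        if index < 0:
            index += self._size
        if not 0 <= index < self._size:
            raise IndexError("index out of range")
        self._f.seek(index)
        return self._f.read(1)[0]  # a single byte, returned as an int


def demo_container():
    with tempfile.TemporaryFile() as f:
        f.write(bytes(range(256)) * 4)
        f.flush()
        data = SeekBackedBytes(f)
        return len(data), data[255:258], data[-1]
```

Every access here goes through the file object, so each slice is copied into a new `bytes` object, and a seek outside the stdio buffer costs a system call.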

Some of the time we want to share the region with other processes.
Sometimes we just want random access to a big file on disk without
having to do a lot of context switches seeking around in the file.

sturlamolden said:
There is in any case room for improving Python's mmap object.

IMO it should have some kind of IPC locking mechanism added, in
addition to the offset stuff suggested.
 
Chetan

Paul Rubin said:
Why on earth would you think that?! It is counterintuitive. fseek beyond
whatever is buffered in stdio (usually no more than 1 kbyte or so)
requires a system call, while mmap is just a memory access.

And the buffer copy required with every I/O from/to the application.

Paul Rubin said:
Some of the time we want to share the region with other processes.
Sometimes we just want random access to a big file on disk without
having to do a lot of context switches seeking around in the file.

IMO it should have some kind of IPC locking mechanism added, in
addition to the offset stuff suggested.

The type of IPC required differs depending on who is using the shared region:
either another Python process or another external program. Apart from the
spinlock primitives, other types of synchronization mechanisms are provided by
the OS. However, I do see value in providing a shared-memory-based spinlock
mechanism. These services can be built on top of the shared memory
infrastructure. I am not sure what kind of real-world Python applications use
it.

-Chetan
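
As one example of an OS-provided mechanism that can guard a shared mapping, the sketch below uses POSIX advisory record locks through `fcntl.lockf`. This is a swapped-in technique, not the spinlock or futex interface discussed in this thread, and it is Unix-only. The names `locked_region`, `increment_counter` and `demo_lock` are hypothetical, as is the layout of a 64-bit counter at offset 0.

```python
import fcntl
import mmap
import os
import struct
import tempfile
from contextlib import contextmanager


@contextmanager
def locked_region(fd, start, length):
    """Hold an exclusive advisory lock on bytes [start, start + length) of fd."""
    fcntl.lockf(fd, fcntl.LOCK_EX, length, start, os.SEEK_SET)
    try:
        yield
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)


def increment_counter(path):
    """Bump a little-endian 64-bit counter stored at offset 0 of a mapped file."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 8) as mm:
            with locked_region(f.fileno(), 0, 8):
                (value,) = struct.unpack_from("<Q", mm, 0)
                struct.pack_into("<Q", mm, 0, value + 1)
                mm.flush()
                return value + 1


def demo_lock():
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(struct.pack("<Q", 0))
        path = tmp.name
    try:
        increment_counter(path)
        return increment_counter(path)
    finally:
        os.unlink(path)
```

These locks are advisory: they only coordinate processes that all take the lock before touching the region, and they do not serialize threads inside a single process.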
 
Paul Rubin

Chetan said:
And the buffer copy required with every I/O from/to the application.

Even that can probably be avoided since the mmap region has to start
on a page boundary, but anyway regular I/O definitely has to copy the
data. For mmap, I'm thinking mostly of the case where the entire file
is paged in through most of the program's execution though. That
obviously wouldn't apply to every application.
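
To make the copy question concrete at the Python level, here is a small sketch. It assumes a Python 3 `mmap` object, where plain slicing builds a new `bytes` object while a `memoryview` slice refers to the mapped pages directly. `demo_view` is a hypothetical name, and this says nothing about what the kernel does underneath.

```python
import mmap
import tempfile


def demo_view():
    with tempfile.TemporaryFile() as f:
        f.write(b"0123456789" * 1000)
        f.flush()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            copied = mm[10:20]  # slicing the mmap builds a new bytes object
            with memoryview(mm) as view:
                with view[10:20] as window:  # a view into the mapping, no copy
                    same = window == copied      # compares the contents
                    is_view = window.obj is mm   # still backed by the mapping
            # Both views are released here, so the mapping can be closed.
            return copied, same, is_view
```

The views have to be released before the mapping is closed, which is why they are used as context managers.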

Chetan said:
The type of IPC required differs depending on who is using the
shared region - either another Python process or another external
program. Apart from the spinlock primitives, other types of
synchronization mechanisms are provided by the OS. However, I do see
value in providing a shared-memory-based spinlock mechanism.

I mean just have an interface to OS locks (Linux futex and whatever
the Windows counterpart is) and maybe also a utility function to do a
compare-and-swap in user space.
 
Chetan

Paul Rubin said:
I mean just have an interface to OS locks (Linux futex and whatever
the Windows counterpart is) and maybe also a utility function to do a
compare-and-swap in user space.

There is code for spinlocks, but it allocates the lock word in process
memory. This can be used for thread synchronization, but not for IPC with
external Python or non-Python processes.
I found a PyIPC IPC package that seems to provide an interface to Sys V shared
memory and semaphores, but I just found it, so I cannot comment on it at this
time.
 
nnorwitz

Martin said:
I don't know exactly; the most likely reason is that nobody has
contributed code to make it support that. That's, in turn, probably
because nobody had the problem yet, or nobody of those who did
cared enough to implement and contribute a patch.

Or because no one cared enough to test a patch that was produced 2.5
years ago (not directed at Martin, just pointing out why the patch
stalled).

http://python.org/sf/708374

With just a little community support, this can go in. I suppose now
that we have the buildbots, we can check in untested code and test it
that way. The patch should be reviewed.

n
 
Chetan

nnorwitz said:
Or because no one cared enough to test a patch that was produced 2.5
years ago (not directed at Martin, just pointing out why the patch
stalled).

http://python.org/sf/708374

With just a little community support, this can go in. I suppose now
that we have the buildbots, we can check in untested code and test it
that way. The patch should be reviewed.

n

I made my changes before I saw this. However, the patch seems to be quite
dated, and some of its changes are very interesting, especially if they were
tested for the special conditions they are supposed to handle and
if they were made after some discussion.
I can submit my patch as it is, but I am working on some of the other
changes I had in mind for mmap to be useful.
Some of the other changes would make more sense for py3k, if it supports a byte
array object, but I haven't looked at py3k at all.

Chetan
 
