Using virtual memory and/or disk to reduce memory footprint

nick

Hi,

I am writing a C++ GUI tool. I sometimes run out of memory (exceed the
2GB limit) on a 32-bit Linux machine. I have optimized my GUI's
database a lot (and am still working on it) to reduce the runtime memory
footprint.

I was thinking of ways to off-load part of the database to virtual
memory or disk and read it back in when required. Does anyone out
there have any papers or suggestions in that direction?

Regards
Nick
 
David Schwartz

nick said:
I am writing a C++ GUI tool. I sometimes run out of memory (exceed the
2GB limit) on a 32-bit Linux machine. I have optimized my GUI's
database a lot (and am still working on it) to reduce the runtime memory
footprint.
I was thinking of ways to off-load part of the database to virtual
memory or disk and read it back in when required. Does anyone out
there have any papers or suggestions in that direction?

You are running out of process virtual address space. You want to find
techniques that conserve process virtual address space. So the first
question is, what's using up all your address space?
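
On Linux you can see the answer directly in /proc. A quick sketch
(nothing project-specific; it just dumps the process's own map):

#include <cstdio>

int main()
{
    // Each line of /proc/self/maps is one mapped region: heap, stack,
    // shared libraries, mmap()ed files, and so on.
    std::FILE* maps = std::fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return 1;

    char line[512];
    while (std::fgets(line, sizeof line, maps) != NULL)
        std::fputs(line, stdout);

    std::fclose(maps);
    return 0;
}

Running "pmap <pid>" or "cat /proc/<pid>/maps" on the live tool gives
the same information without recompiling anything.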

If it's memory that you've allocated, you need to allocate less
memory. One solution might be to use a file on disk instead of
memory-mapped space. A memory-mapped file consumes process virtual
address space, but if you use 'pread' and 'pwrite' instead, almost no
address space is needed.
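
Something like this rough, untested sketch is what I have in mind (the
record layout and the file name are placeholders, not anything from
your tool):

#include <fcntl.h>      // open
#include <unistd.h>     // pread, pwrite, close
#include <cstddef>
#include <cstdio>

// Hypothetical fixed-size record; only one of these needs to be in
// memory at a time, however large the file grows.
struct Record {
    unsigned id;
    char     name[60];
};

bool load_record(int fd, std::size_t index, Record& out)
{
    // pread() needs no mapping, so only sizeof(Record) bytes of the
    // process address space are used, unlike mmap()ing the whole file.
    off_t offset = static_cast<off_t>(index * sizeof(Record));
    return pread(fd, &out, sizeof out, offset)
               == static_cast<ssize_t>(sizeof out);
}

bool store_record(int fd, std::size_t index, const Record& in)
{
    off_t offset = static_cast<off_t>(index * sizeof(Record));
    return pwrite(fd, &in, sizeof in, offset)
               == static_cast<ssize_t>(sizeof in);
}

int main()
{
    int fd = open("records.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return 1;

    Record r = { 42, "example" };
    store_record(fd, 0, r);

    Record back;
    if (load_record(fd, 0, back))
        std::printf("record %u: %s\n", back.id, back.name);

    close(fd);
    return 0;
}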

DS
 
nick

Thanks for your tips, David! My tool is a schematic-driven PCB design
tool. There are multiple levels of schematics that can be composed.
In my profiling, most of the memory is taken up during the elaboration
of the various schematics, i.e. storing the various design-related
information and then connecting it all up.


Regards
Nick
 
David Schwartz

nick said:
Thanks for your tips, David! My tool is a schematic-driven PCB design
tool. There are multiple levels of schematics that can be composed.
In my profiling, most of the memory is taken up during the elaboration
of the various schematics, i.e. storing the various design-related
information and then connecting it all up.

Rather than storing them in memory, why not store them in a file on
disk?

If there's enough physical memory, the file will stay in cache anyway.
If there isn't enough physical memory, trying to keep it in memory
would result in it swapping to disk anyway.

So it should be roughly performance neutral, but save a lot of vm
space.
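
As a rough, untested sketch of the sort of thing I mean (the class and
its members are invented for the example): keep the bulky elaborated
data in a scratch file and hold only the file offsets in RAM, reading
each blob back on demand.

#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <string>
#include <vector>

class DiskStore {
public:
    explicit DiskStore(const char* path)
        : fd_(open(path, O_RDWR | O_CREAT | O_TRUNC, 0644)), end_(0) {}
    ~DiskStore() { if (fd_ >= 0) close(fd_); }

    // Append a blob to the file; only an offset and a size stay in RAM.
    std::size_t put(const std::string& blob)
    {
        pwrite(fd_, blob.data(), blob.size(), end_);
        index_.push_back(Entry(end_, blob.size()));
        end_ += static_cast<off_t>(blob.size());
        return index_.size() - 1;
    }

    // Read a blob back on demand.
    std::string get(std::size_t id) const
    {
        std::string blob(index_[id].size, '\0');
        if (!blob.empty())
            pread(fd_, &blob[0], blob.size(), index_[id].offset);
        return blob;
    }

private:
    struct Entry {
        Entry(off_t o, std::size_t s) : offset(o), size(s) {}
        off_t       offset;
        std::size_t size;
    };
    int                fd_;
    off_t              end_;
    std::vector<Entry> index_;
};

The index grows with the design, but the blobs themselves only occupy
address space while they are actually being read.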

DS
 
nick

Why are you flattening the whole thing in memory?  I'm not criticizing,
just asking.  (I spent about 6 months working full-time on a schematic
editor for VLSI designs.)

Yes, you are right. Do you have any publications that I can look up or
suggestions on how to avoid elaborating everything in memory?

I know that what I have just asked for is probably proprietary
information, but if there is any public-domain information that you
could point me to, I would be eternally grateful :)

Regards
Nick
 
tharinda.gl

nick said:
Yes, you are right. Do you have any publications that I can look up or
suggestions on how to avoid elaborating everything in memory?

I know that what I have just asked for is probably proprietary
information, but if there is any public-domain information that you
could point me to, I would be eternally grateful :)

Regards
Nick

I remember somebody suggested this site to me; it might be helpful for
your work:

http://stxxl.sourceforge.net/

Haven't used it though :)
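
Judging by the tutorial on that page, it gives you STL-style containers
whose blocks live on disk, so usage would look something like this
(untested, adapted from their documentation; the element type and size
are arbitrary):

#include <stxxl/vector>
#include <iostream>

int main()
{
    // External-memory vector: the data blocks are kept on disk and
    // cached, so the container doesn't eat the 32-bit address space.
    typedef stxxl::VECTOR_GENERATOR<int>::result ext_vector;
    ext_vector v;

    for (int i = 0; i < 16 * 1024 * 1024; ++i)
        v.push_back(i);

    std::cout << "stored " << v.size() << " elements" << std::endl;
    return 0;
}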
 
James Kanze

What 2GB limit? Do you mean that the machine only has 2GB of
RAM? Or do you mean that you're exceeding the 3 or 4 GB limit
on address space?

I was wondering about that myself. I've had programs on a 32
bit Linux that used well over 3 GB.
The kernel already does that for you, automatically. Doing it manually
is called "overlaying."

The kernel can only do it when the entire image would fit into
the virtual address space (4 GB under 32 bit Linux).
"Overlaying" will allow a lot more. And it's only called
overlaying when you swap in and out code and named variables; if
you're just buffering data, the name doesn't apply. (As an
extreme example, programs like grep or sed easily work on data
sets that are in the Gigabyte range or larger; they only hold a
single line in memory at a time, however. And I'm sure you
wouldn't call this overlaying.)

FWIW: I don't think that the Linux linker supports overlay
generation, at least in the sense I knew it 25 or 30 years ago.
(Although explicitly loading and unloading dynamic objects
probably comes out to the same thing.)
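
Roughly, that would look like this (a sketch with made-up names; link
with -ldl):

#include <dlfcn.h>
#include <cstdio>

int main()
{
    // Map the "overlay" only while it is needed...
    void* overlay = dlopen("./librouting.so", RTLD_NOW | RTLD_LOCAL);
    if (overlay == NULL) {
        std::fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    // Look up an entry point by name (the symbol is invented here).
    typedef void (*run_fn)();
    run_fn run = reinterpret_cast<run_fn>(dlsym(overlay, "run_autorouter"));
    if (run != NULL)
        run();

    // ...and give the address space back afterwards.
    dlclose(overlay);
    return 0;
}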
 
James Kanze

[...]
Oddly, those kinds of overlays (IIUC) are coming back into
fashion, for processors with lots of little cores, each with
its own small, dedicated cache. On cell, the loading and
unloading have to be managed manually, by the programmer.

I don't know what the current situation is. (My current
applications run on dedicated machines with enough real memory
that they never have to page out.) But I remember talking to
people who worked on the mainframe Fortran when virtual memory
was being introduced; their experience was that programs using
virtual memory, instead of overlays, were often several orders
of magnitude slower. When you had to explicitly load an
overlay, you thought about it, and did it as infrequently as
possible. When the system decides to load or unload a virtual
page is invisible to you, and you can end up with a lot more
paging than you wanted. (On the other hand, increase the size
of the real memory, and the virtual memory based system will
cease paging entirely, whereas the overlays will still be
loaded and unloaded each time you request them.)
 
James Kanze

The Linux kernel needs about 1 GiB for housekeeping. User
code is typically stuck within the lower 3 GiB. If you need
more address space than that, either fork, or get a 64-bit
processor. :)

But not in your process address space, I hope. (I'm pretty sure
that I've had processes with more than 3GB mapped to the
process, but I don't remember the details---it was probably
under Solaris.)
 
James Kanze

Couldn't that be handled the same way we handle manual
swapping of data? I.e., couldn't unused code pages be
unloaded only conditionally? In fact, this sounds like a
tailor-made case for C++, with various template instantiations
for different amounts of RAM.

Probably. But in practice, no one is developing new code with
overlays, so it doesn't matter.
 
Eric Sosman

Jeff said:
James said:
[...]
(I'm pretty sure
that I've had processes with more than 3GB mapped to the
process, but I don't remember the details---it was probably
under Solaris.)

32-bit Solaris? 64-bit Linux is already a different ball of
address-space wax, and I'm not sure what Solaris does here. The last
time I developed for Solaris was five years ago, on 64-bit Sparc.

A 32-bit process on Solaris can use "almost all" of its
nominal 4GB address space. I forget precisely how much space
Solaris claims for its own purposes, but it's in the vicinity
of a couple megabytes.

A 64-bit process on Solaris can use somewhat more ...

(Disclaimer: I work for Sun, but don't speak for Sun.)
 
James Kanze

It is indeed the same address space, although user-space code
trying to access the uppermost GB directly will just get a
segv. The default allocation limit is 3056 MB for user-space.
I'm told there's a kernel patch that can override this limit,
but I've never used it.
32-bit Solaris?

That's a good question. My code was compiled in 32 bit mode,
but the OS was Solaris 2.8, running on a 64 bit Sparc.

Still, from a QoI point of view, I would not generally expect
the OS to take much of the user's address space---a couple of KB,
at the most. My attitude might be influenced here by the
requirements when I worked on OS's... and the maximum user
address space was either 64KB or 1 MB. But if the address space
was 32 bits, I'd feel cheated if the OS didn't allow me to use
very close to 4GB (provided sufficient other resources, of
course).
 
Pawel Dziepak

James said:
But not in your process address space, I hope. (I'm pretty sure
that I've had processes with more than 3GB mapped to the
process, but I don't remember the details---it was probably
under Solaris.)

In the most common approach the kernel is mapped into the top part of
the address space of *each* process; for example, 3GB for user space
and 1GB for the kernel. That's not because it's the easiest way, it's
because it's the most efficient way: copying data from kernel to user
space, or vice versa, is limited only by memory bandwidth. The main
disadvantage is, of course, that it reduces the space available to the
user-mode process.

There is another approach in which the kernel lives in a separate
address space, so the process gets the full 4GB. I don't know whether
it is used by any architecture and OS other than Solaris on sun4u.
That's because sun4u provides a mechanism called 'address space
identifiers' which allows data to be copied efficiently from one
address space to another. Of course, it is also possible to implement
this on x86 (or other architectures), but it's very slow and
inefficient.

Additionally, I would like to mention that PAE does not increase the
size of the address space in any way; the 4GB limit remains. What PAE
gives the kernel is the possibility to use up to 64GB of *physical*
memory.

Pawel Dziepak
 
Rainer Weikusat

James Kanze said:
James Kanze wrote:
[...]
32-bit Solaris?

That's a good question. My code was compiled in 32 bit mode,
but the OS was Solaris 2.8, running on a 64 bit Sparc.

Still, from a QoI point of view, I would not generally expect
the OS to take much of the user's address space---a couple of KB,
at the most. My attitude might be influenced here by the
requirements when I worked on OS's... and the maximum user
address space was either 64KB or 1 MB. But if the address space
was 32 bits, I'd feel cheated if the OS didn't allow me to use
very close to 4GB (provided sufficient other resources, of
course).

Other people 'feel' that the overhead of a TLB (and maybe even
cache) flush is too high for a system call, and because of this (as
already written by someone else) the kernel is mapped into the
address space of each process, like any other shared library would be.
 
James Kanze

In the most common approach the kernel is mapped into the top
part of the address space of *each* process; for example, 3GB
for user space and 1GB for the kernel. That's not because it's
the easiest way, it's because it's the most efficient way.

There's no difference in performance if the system knows how to
manage the virtual memory. At least on the processors I know.
Copying data from kernel to user space or vice versa is
limited only by memory bandwidth.

Copying data from kernel to user space suffers the same
constraints as copying it between two places in user space. If
the memory is already mapped, it is limited by memory bandwidth.
If it isn't, then you'll get a page fault. This is totally
independent of whether the memory is in the user address range
or not.
The main disadvantage is, of course, that it reduces the space
available to the user-mode process.
There is another approach in which the kernel lives in a
separate address space, so the process gets the full 4GB. I
don't know whether it is used by any architecture and OS other
than Solaris on sun4u. That's because sun4u provides a
mechanism called 'address space identifiers' which allows data
to be copied efficiently from one address space to another. Of
course, it is also possible to implement this on x86 (or other
architectures), but it's very slow and inefficient.

I don't know about other architectures, but Intel certainly
supports memory in different segments being mapped at the same
time; it actually offers more possibilities here than the Sparc.
 
Pawel Dziepak

James said:
Copying data from kernel to user space suffers the same
constraints as copying it between two places in user space. If
the memory is already mapped, it is limited by memory bandwidth.
If it isn't, then you'll get a page fault. This is totally
independent of whether the memory is in the user address range
or not.

That's true only in the first approach, where the kernel is placed at
the top of the address space of each process. Indeed, copying between
kernel and user space is then as efficient as copying between two
places in user space. That's why this approach is used in all common
kernels on x86 and similar architectures.
If the process were given the whole address space (4GB), the kernel
would have to live in a separate address space, which would mean
copying between two address spaces, and that is much less efficient
(TLB overhead, TSS switches, etc.).
I don't know about other architectures, but Intel certainly
supports memory in different segments being mapped at the same
time; it actually offers more possibilities here than the Sparc.

But only sun4u allows two address spaces to be accessed efficiently at
the same time. On Intel (and similar) processors the TLB would get
invalidated.

Pawel Dziepak
 
Rainer Weikusat

James Kanze said:
There's no difference in performance if the system knows how to
manage the virtual memory. At least on the processors I know.

This isn't even true for the processors you claim to know, let alone
for others. E.g., ARM CPUs (at least up to ARM9) have a virtually
addressed cache, and this means that not only is the TLB flushed on an
address space switch (as on Intel, IIRC as a side effect of writing to
CR3), but the complete contents of the cache have to be discarded as
well.
 
James Kanze

That's true only in the first approach, where the kernel is
placed at the top of the address space of each process. Indeed,
copying between kernel and user space is then as efficient as
copying between two places in user space. That's why this
approach is used in all common kernels on x86 and similar
architectures.
If the process were given the whole address space (4GB), the
kernel would have to live in a separate address space, which
would mean copying between two address spaces, and that is much
less efficient (TLB overhead, TSS switches, etc.).

That's simply not true, or at least it wasn't when I did my
evaluations (admittedly on an Intel 80386, quite some time ago).
And the address space of the 80386 was considerably more than
4GB; you could address 4GB per segment. (In theory, you could
have up to 64K segments, but IIRC, in practice, there were some
additional limitations.)

You will pay a performance hit when you first load a segment
register, but this is a one-time affair, normally taking place
when you switch modes.
But only sun4u allows two address spaces to be accessed
efficiently at the same time. On Intel (and similar) processors
the TLB would get invalidated.

I'm not too sure what you mean by "two address spaces". If you
mean two separate segments, at least in older Intel processors,
the TLB would remain valid as long as the segment identifier
remained in a segment register; there was one per segment.

(I'll admit that I find both Windows and Linux unacceptable
here. The Intel processor allows accessing far more than 4GB;
limiting a single process to 4GB is an artificial constraint,
imposed by the OS. Limiting it to even less is practically
unacceptable.)
 
