[Slightly OT] Memory management and custom allocator


Phil Carmody

I don't know when the number Two starts to exercise its mystical
fascination for programmers, but the fascination certainly exists: You
find people reaching for powers of two "just because," with no actual
reason for the choice.

You have 'struct x', you add padding, test it, it is faster. You have
'struct y', you add padding, test it, it is faster. You have 'struct
z', you add padding, test it, it is faster. You have 'struct w', you
add padding, test it, it is slower [oops doesn't always work].

You have 'struct a', and your boss says 'can't test! just guess!'.
What would you do?

I'd ask my boss if he's heard of Knuth.

However, it's hypothetical; I'd probably not be working at such a
company in the first place.

Phil
 

88888 Dihedral

88888 Dihedral wrote on Tuesday, 3 January 2012 at 22:07:20 UTC+8:
Joe keane wrote on Tuesday, 3 January 2012 at 21:14:46 UTC+8:

This is a system-level problem. With a huge DRAM space and plenty of free
memory remaining, the allocator can easily be designed for speed.

But if the DRAM is nearly fully allocated, then a good heap walker that can
swap pages out to the hard disk is important for reliability; the speed of
the allocator matters less than not degrading the whole system.

The heap-management module is very important in the OS.

Normally all allocations from malloc are rounded up to power-of-two sizes,
i.e. 2^S bytes for S = 4, 5, 6, ... up to 12 or 16.

A page might be 2 KB or 64 KB or even 4 MB (a 22-bit offset) on various systems.

These are trivial details.

The heap-walker part, which visits each node in the heap space, is more important.

After a heap walk, the buddy algorithm can be applied.

Do we need to discuss the details here?
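
To make the size-class rounding above concrete, here is a minimal sketch,
assuming classes of 2^S bytes for S = 4..16; the function name and limits
are made up for illustration, not taken from any particular allocator:

#include <stdio.h>
#include <stddef.h>

#define MIN_SHIFT 4    /* smallest class: 2^4  = 16 bytes */
#define MAX_SHIFT 16   /* largest class:  2^16 = 64 KB    */

/* Round a request up to the next power-of-two size class, or return 0
   so the caller can fall back to page-level (e.g. buddy) allocation. */
static size_t size_class(size_t request)
{
    unsigned s;
    for (s = MIN_SHIFT; s <= MAX_SHIFT; s++)
        if (request <= ((size_t)1 << s))
            return (size_t)1 << s;
    return 0;
}

int main(void)
{
    /* 7 -> 16, 44 -> 64, 70000 -> 0 (too big for the small classes) */
    printf("%zu %zu %zu\n", size_class(7), size_class(44), size_class(70000));
    return 0;
}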
 

Phil Carmody

BartC said:
Eric Sosman said:
If you have a forty-four-byte struct and allocate sixty-four
bytes to hold it, how have you avoided *either* sort of "waste?"

If the cache line is 16 bytes, we know that each object will take only
three lines; the fourth is never used [it may be worse with the VM].
For this case 48 would work just as well (the compiler will do that for
you if it thinks 8-byte alignment is a good idea). But then if the
cache line is 32 bytes, same argument applies.

All I can say is that you and I have divergent notions of
efficiency. I'm thinking "I can run over the entire array with
just N*44/C cache misses; Joe would prefer N*64/C." That is, I
think you'll incur 46% more cache misses than I will. Tell me
again, please, what I have wasted by using 31% fewer cache line
loads than you would?

I think he means that the last 16 bytes of the 64 might never be accessed,
so will never be fetched from memory in the first place; there is no cache
miss.

Were there to be no hits on google for "64 bytes (one cache line)", then
you might have a point. The fact that most of such hits refer to the single
most popular general purpose computer architecture in the world diminishes
your point significantly. Those bytes are being pulled into the cache whether
you access them or not.
(That does depend on how the 64 bytes are used; if the whole struct is
copied, the compiler might access all 64 bytes anyway.)

And if the cache blocks are 32 bytes, then it doesn't matter.

Nonsense. An adjacent bunch of Eric's structures being pulled into the
working set will displace fewer cache lines than the same number of your
structs. It's only if the cache lines are 16 bytes (i.e. not any of the
half dozen architectures I've worked with in recent years) that both
will cause 48 bytes to be churned.
No one was advocating wasting 20 bytes, but finding a use for them, perhaps
moving data there that would have been stored elsewhere.

How about the "use" of "still being available for subsequent allocations"?
If you had a heavily used struct, and it turned out it had 65 bytes, that
wouldn't bother you any? You wouldn't try and get it down to 64?

Straw man.

Phil
 

BartC

Phil Carmody said:
Were there to be no hits on google for "64 bytes (one cache line)", then
you might have a point. The fact that most of such hits refer to the single
most popular general purpose computer architecture in the world diminishes
your point significantly. Those bytes are being pulled into the cache
whether you access them or not.

I've no idea what cache-line sizes are being used these days. If it's 64
instead of 16 then perhaps we'd be talking about a 176-byte versus a
256-byte struct instead.
Nonsense. An adjacent bunch of Eric's structures being pulled into the
working set will displace fewer cache lines than the same number of your
structs.

That's true: if you touch any element of a 64-byte struct, and the cache
lines are 64 bytes, and the struct is aligned on a 64-byte boundary, then a
set of, say, 1000 serial accesses will require 1000 memory fetches, instead
of 688 if they were only 44 bytes. That's why I advocated not wasting those
extra 20 bytes per struct. (In fact I suggested making do with 32 in the
first place.)

The problem with 44 bytes in this case is that if you're touching two parts
of the struct at a time, and you're doing random access (which is what
arrays are for), you'll frequently be fetching two cache lines instead of
one, as the structs will straddle a 64-byte boundary.
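
To put a number on "frequently", here is a minimal sketch that counts the
straddlers, assuming 64-byte lines and a tightly packed array that starts
on a 64-byte boundary:

#include <stdio.h>

int main(void)
{
    int i, straddle = 0;
    for (i = 0; i < 1000; i++) {
        unsigned long first = 44UL * i;      /* first byte of element i */
        unsigned long last  = first + 43;    /* last byte of element i  */
        if (first / 64 != last / 64)
            straddle++;
    }
    /* prints 625: about 5 out of every 8 elements cross a line boundary */
    printf("%d of 1000 straddle a 64-byte line\n", straddle);
    return 0;
}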

And if you are only touching one member of the struct at a time, and are
doing serial accesses, then perhaps you'd be better off using a set of
parallel arrays, for at least some of the struct members. If a field is
4 bytes wide, then that would require only about 1/11 of the memory fetches
of the 44-byte struct.

You're quite likely to find these narrower array elements happen to have
power-of-two widths..
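
For what it's worth, a minimal sketch of the parallel-array idea, with
made-up field names; an 11-int record is exactly 44 bytes (with 4-byte int):

#include <stdio.h>

/* Array of structs: scanning just 'key' still drags each whole
   44-byte record through the cache. */
struct rec { int key; int data[10]; };   /* 11 * 4 = 44 bytes, no padding */
static struct rec recs[1000];

/* Parallel ("split") arrays: keys pack 16 per 64-byte line, so the
   same scan touches roughly 1/11 of the memory. */
static int keys[1000];
static int other[10][1000];              /* the remaining fields, split out */

static long sum_keys_aos(void)
{
    long s = 0;
    int i;
    for (i = 0; i < 1000; i++) s += recs[i].key;
    return s;
}

static long sum_keys_soa(void)
{
    long s = 0;
    int i;
    for (i = 0; i < 1000; i++) s += keys[i];
    return s;
}

int main(void)
{
    (void)other;
    printf("%ld %ld\n", sum_keys_aos(), sum_keys_soa());  /* both 0 here */
    return 0;
}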

Anyway, look at this struct (compiled with default padding):

struct record {
    char c1;
    double d1;
    char c2;
    double d2;
    char c3;
    double d3;
    char c4;
} x;

On my machine, these 28 bytes end up occupying 56, exactly double!

Rearrange these fields so all the small ones are at the end, and still you
get 32, not 28!

Wasting bytes is obviously OK when the compiler does it...

(BTW I put that struct into another language, and it consistently gave a
size of 28 bytes no matter how the fields are ordered.)
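
To make the comparison easy to reproduce, here is a quick sizeof check of
both orderings; the 56 and 32 figures assume a typical implementation where
double needs 8-byte alignment, and only the 28 bytes of actual data is
guaranteed:

#include <stdio.h>

struct record {       /* as posted: char and double interleaved */
    char c1; double d1;
    char c2; double d2;
    char c3; double d3;
    char c4;
};

struct record2 {      /* rearranged: all the small fields at the end */
    double d1, d2, d3;
    char c1, c2, c3, c4;
};

int main(void)
{
    /* typically prints 56 and 32 (e.g. x86-64 with 8-byte doubles) */
    printf("%zu %zu\n", sizeof(struct record), sizeof(struct record2));
    return 0;
}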
 

88888 Dihedral

FtM wrote on Saturday, 31 December 2011 at 22:56:53 UTC+8:
[xpost comp.lang.c, comp.unix.programmer, comp.os.ms-
windows.programmer.win32]
Hello NG,
I have a very specific question, but first of all I'll describe what
I'm facing, as I'm open to alternative general solutions.
I'm refactoring some old C code that deals with very basic data
structures (like lists, priority queues, hashtables and so on) and that
aims to be very portable. Regarding the memory management, I have
an internal wrapper that, on *nix and Windows systems, maps the malloc/
calloc/realloc/free functions one-to-one to the default ones, but it's
possible to switch to another custom allocator that works pretty much
like a standard allocator (a big chunk of memory split/joined, intended
for embedded systems).
I would like to implement my custom allocator for the *nix and Windows
systems too (of course losing some portability for a performance gain),
and maybe implement something like the C++ allocators to improve the
memory management of the different algorithms (like rare large
allocations for vectors / frequent small allocations for lists), but
I have some questions:
- On these systems, to have decent performance, I was thinking of
allocating page-sized (and page-aligned) chunks so that small
allocations share a page, but how is it possible to request
memory with these characteristics? Will it help the OS memory manager?
- Does all of this make any sense? :)
Thanks for any suggestion!
Ciao!

P.S. I don't really need a "performance boost", but I'm dealing with a
general, low-level library that is already in use, and since I'm
refactoring the code anyway I'd like to do it the best possible way ;-)

I mean that the allocation requests for heap space in an application can
be modeled as a discrete random process X[n] for n = 0, 1, 2, 3, ..., and
so can the free process Y[m] for m = 0, 1, 2, 3, ....

Therefore the amount of memory used can be counted as

for(i=0, s1=0; i<n; i++) s1 += X[i];
for(i=0, s2=0; i<m; i++) s2 += Y[i];

The number s1-s2, the amount of space currently in use, is another random process too.

The OS does not know these two random processes better than the programmer
who wrote the application does. Thus a customized heap manager can beat the
OS's, as argued above.
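
Regarding the quoted question about page-sized, page-aligned chunks: a
minimal POSIX sketch is below; posix_memalign and sysconf are standard
there, and on Windows VirtualAlloc returns page-aligned memory directly.
The function name is made up for illustration:

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <unistd.h>

/* Ask the OS for one page-sized, page-aligned chunk that a custom
   allocator can carve up for small allocations; NULL on failure. */
static void *alloc_page_chunk(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p = NULL;

    if (page <= 0)
        return NULL;
    if (posix_memalign(&p, (size_t)page, (size_t)page) != 0)
        return NULL;
    return p;   /* release with free() when the chunk is retired */
}

int main(void)
{
    void *chunk = alloc_page_chunk();
    free(chunk);    /* free(NULL) is fine if the request failed */
    return 0;
}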
 
