Another crazy new language effort - Language #42

JC

Grr.  Sorry Jason.  I'm web-browser challenged this afternoon, which
may have to do with just finishing lunch over some really good beer
with a bunch of geek friends of mine to discuss things like 42, and
other topics that friends are working on.  Instead of posting the full
message I meant to, I posted the first half twice!  No, it wasn't
me... it was Google!  Really! (Bill slinks into cave for a while)

No problem, thanks for the instructions, will check it out this
evening. Just blame it on your news server or something. :)

Jason

After doing "make" in the examples/graph_benchmark directory, you can
run c_graph (the DataDraw-backed benchmark) and compare it to
rawc_graph (the raw C benchmark).  The output:

[graph_benchmark] time ./rawc_graph

real    0m57.412s
user    0m56.664s
sys     0m0.700s
[graph_benchmark] time c_graph
Process completed in 0:00:08
Used 972.37 MB of memory

real    0m8.617s
user    0m7.832s
sys     0m0.768s
[graph_benchmark]

This isn't the worst benchmark for DataDraw; it's one of the best.
However, I've done extensive testing for years on memory-intensive
applications, and can confidently state that you can expect a 20%
speed improvement or better on almost any memory-intensive algorithm.
It's not 7X in general, but it is still faster than conventional
pointer-based C.

Bill
 
Bill Cox

Will do, thanks; the benchmarks were not in the non-svn tarball.

They're in examples/*_benchmark. I'm not so good at guessing what
will throw users off, like hiding benchmarks under an examples
directory!
I don't believe or disbelieve you; however, if the benchmarks are
invalid, then they might as well not exist.

You say you do not know how to argue about speed other than running
benchmarks. One important thing to remember to do when arguing about
speed is to provide evidence that your benchmarks are both meaningful
and applicable to relevant situations. Also, the issues raised in my
post above were rather important; ignoring them entirely, as you did,
is not convincing. You may consider going back and responding to them.

I write EDA programs for a living. In general, we have large
databases full of circuit descriptions. This is where the DataDraw
memory organization shines. In particular, most of our algorithms
have inner loops that traverse local graph-like data structures. This
is why I wrote the graph benchmark.

The README file describes more, but basically L2 cache misses are far
more frequent in the C-based code. This occurs for three reasons.
First, C pointer-to-struct based data structures typically have many
fields that get loaded on an L2 cache miss, and few if any of them are
ever used. Second, malloc allocates various object types in what seems
like random order, while DataDraw keeps all fields of a given class
together. Third, DataDraw uses indexes into arrays to access data, so
32-bit unsigned integers can address up to 4 billion objects, while
the raw C code uses 64-bit pointers on my 64-bit Ubuntu machine,
which wastes a ton of memory and hurts cache performance.
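As a rough sketch of the layout difference described above (the names
are illustrative, not DataDraw's actual generated API): an
array-of-structs node drags every field into cache on each miss, while
a struct-of-arrays layout with 32-bit indexes keeps each property
densely packed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Array-of-structs: one cache miss on a node pulls in every field,
   even when the traversal only needs `next`. */
struct node_aos {
    uint64_t id;
    double   weight;
    char     name[32];
    uint32_t next;              /* index of the next node */
};

/* Struct-of-arrays (the DataDraw-style layout): each property lives in
   its own array, so a loop touching only `next` loads cache lines that
   are 100% `next` values, and 32-bit indexes replace 64-bit pointers. */
struct node_soa {
    uint64_t *id;
    double   *weight;
    char    (*name)[32];
    uint32_t *next;
};

static struct node_soa node_alloc(size_t n) {
    struct node_soa t;
    t.id     = malloc(n * sizeof *t.id);
    t.weight = malloc(n * sizeof *t.weight);
    t.name   = malloc(n * sizeof *t.name);
    t.next   = malloc(n * sizeof *t.next);
    return t;
}

/* Walk the list using only the `next` array; the id/weight/name
   arrays never enter the cache. */
static uint32_t walk(const struct node_soa *t, uint32_t start, int steps) {
    uint32_t cur = start;
    while (steps-- > 0)
        cur = t->next[cur];
    return cur;
}
```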

It's quite intriguing to me to examine what limits inner-loop
performance for typical EDA applications nowadays. I was floored by
how much L2 cache performance dominates over details of inner loop
code efficiency. This is not just in benchmarks, but in several EDA
tools from the commercial EDA vendor where I work.

If you run into problems with the benchmarks, feel free to e-mail me
directly at (e-mail address removed). Happy hacking!

Bill
 
JC

They're in examples/*_benchmark.  I'm not so good at guessing what
will throw users off, like hiding benchmarks under an examples
directory!

As I said, they are not in the non-svn tarball (they do exist in SVN,
so it's not a major issue). Not including the code in the source
archive will likely throw users off :p. I downloaded it from here
(3.3.1):

http://sourceforge.net/project/showfiles.php?group_id=164872

And the examples directory contains the following subdirectories only:

array
attributes
extension
graph
hash
heap
sparse
sym

There is no file or directory with a name that contains or resembles
the word "benchmarks". There are also no files that appear to be C++
code.

You may wish to update your source archive with the most recent stable
contents of the repository, including the benchmark examples.

Jason
 
Chris M. Thomasson

Bill Cox said:
I challenge you to download DataDraw using:

I simply don't know how to argue about speed other than running
benchmarks. You can continue to deny all you like, but the memory
layout for C sucks. If you don't like to hear it, continue to pretend
that the benchmarks don't exist.

The memory layout for C sucks? You can basically lay data-structures out in
memory any way you want to in C. Here is a very simple example:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/7030f8ac1610a709


One can use alignment hacks to specifically pad and align shared
data-structures in memory such that the chance of false sharing
between them is eliminated. C is also flexible enough to allow one to
create high performance multi-threaded vector operations by taking
advantage of specific alignment, synchronization, cache-blocking and
clever pre-fetching techniques.
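
The false-sharing point can be sketched in a few lines of C11
(assuming a 64-byte cache line, which a real program should query
rather than hard-code): pad each hot counter out to a full line so
writes by one thread never invalidate the line holding the other.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size; query the CPU in real code */

/* Two counters updated by different threads. Padding each one out to a
   full cache line means a write to `a` never invalidates the line
   holding `b` -- no false sharing. */
struct padded_counter {
    alignas(CACHE_LINE) long value;
    char pad[CACHE_LINE - sizeof(long)];
};

struct counters {
    struct padded_counter a;
    struct padded_counter b;
};
```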

IMVHO, C is a flexible language indeed...
 
Flash Gordon

Bill Cox wrote:

together. Third, DataDraw uses indexes into arrays to access data, so
32-bit unsigned integers can address up to 4 billion objects, while
the raw C code uses 64-bit pointers on my 64-bit Ubuntu machine,
which wastes a ton of memory and hurts cache performance.

I've not looked at the other points, but this one can probably be
addressed by using the appropriate switches on the compiler. I suggest
you investigate the -m32 switch and, if you need 64 bit integers, use
"long long" or the appropriate type from inttypes.h
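A minimal sketch of that point (type names come from the standard
<stdint.h>; `obj_handle` is an illustrative name, not anything from
DataDraw): exact-width types keep object handles the same size on
either target, while pointer width tracks -m32/-m64.

```c
#include <stdint.h>
#include <stddef.h>

/* uint32_t is exactly 4 bytes whether the program is built with -m32
   or -m64; pointer width, by contrast, tracks the target. */
typedef uint32_t obj_handle;        /* up to 4 billion objects per class */

static size_t handle_bytes(void)  { return sizeof(obj_handle); }
static size_t pointer_bytes(void) { return sizeof(void *); }

/* When 64-bit arithmetic is still needed, int64_t (or "long long")
   stays 8 bytes on either target: */
static size_t wide_bytes(void)    { return sizeof(int64_t); }
```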
 
Bill Cox

[...]
I challenge you to download DataDraw using:
[...]
I simply don't know how to argue about speed other than running
benchmarks.  You can continue to deny all you like, but the memory
layout for C sucks.  If you don't like to hear it, continue to pretend
that the benchmarks don't exist.

The memory layout for C sucks? You can basically lay data-structures out in
memory any way you want to in C. Here is a very simple example:

http://groups.google.com/group/comp.programming.threads/browse_frm/th...

One can use alignment hacks to specifically pad and align shared
data-structures in memory such that the chance of false sharing
between them is eliminated. C is also flexible enough to allow one to
create high performance multi-threaded vector operations by taking
advantage of specific alignment, synchronization, cache-blocking and
clever pre-fetching techniques.

IMVHO, C is a flexible language indeed...

I completely agree. DataDraw backed C would not be possible if it
were not for C's flexibility. You can control pretty much everything.

However, most of us use pointers to structures to represent objects.
We typically (but not always) use malloc/calloc to allocate memory for
them. In the good old days that was fine, but nowadays L2 cache
performance dominates like never before. C gives us the flexibility to
allocate arrays, or objects sequentially in memory, but we are
responsible for taking advantage of that flexibility. DataDraw simply
automates some of it.
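The allocation discipline being automated can be sketched as a
per-class pool (a hypothetical helper, not DataDraw's actual code):
all objects of one type come from one growable array, so malloc's
interleaving of unrelated types never scatters them, and the handle
handed back is an index rather than a pointer.

```c
#include <stdint.h>
#include <stdlib.h>

/* A per-class pool: every object of one type sits in one contiguous,
   growable array. Error handling is omitted for brevity. */
struct pool {
    void  *data;
    size_t elem_size;
    size_t used, cap;
};

/* Allocate one object; returns its index, not its address. */
static uint32_t pool_new(struct pool *p) {
    if (p->used == p->cap) {
        p->cap  = p->cap ? p->cap * 2 : 16;
        p->data = realloc(p->data, p->cap * p->elem_size);
    }
    return (uint32_t)p->used++;
}

/* Turn an index back into a typed pointer when a field is needed. */
#define POOL_AT(p, type, i) (&((type *)(p)->data)[i])
```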
 
Bill Cox

Bill Cox wrote:



I've not looked at the other points, but this one can probably be
addressed by using the appropriate switches on the compiler. I suggest
you investigate the -m32 switch and, if you need 64 bit integers, use
"long long" or the appropriate type from inttypes.h

That's true, but only if you don't need to support programs that use
more than 4 gig of memory. Truthfully, that's most applications, just
not the ones I work on. DataDraw allows any C int size to be used as
an object handle, from 8 bits to 64, and 32-bit handles allow up to
4 billion objects of any specific class. The memory savings is quite
significant, as is the resulting improvement in cache performance.
DataDraw has a mode designed to allow the C compiler to detect type
errors, where it uses pointers rather than ints for all object
references. This allowed us to determine how much memory we were
saving when compiling in 64-bit mode, and it was quite impressive: The
same commercial EDA program takes only 58% as much memory when
compiled using DataDraw ints for references rather than 64-bit
pointers.
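
As a rough illustration of where that kind of saving comes from (a
hypothetical node with four object references; not the actual EDA
program's layout): with 64-bit pointers each reference costs 8 bytes,
with 32-bit handles it costs 4, halving the structure.

```c
#include <stdint.h>

/* Four object references as 64-bit pointers: 32 bytes per node on a
   64-bit target. */
struct node_ptr {
    struct node_ptr *parent, *left, *right, *hash_next;
};

/* The same references as 32-bit handles: 16 bytes per node, so twice
   as many references fit in each cache line. */
struct node_idx {
    uint32_t parent, left, right, hash_next;
};
```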

It turns out I'm a rare breed - a speed freak. We're a dying breed,
and for good reasons. However, when it comes to running that inner
loop just a bit faster, I'm on home turf. You really need to think
about memory layout and how it affects cache performance. 42 will be
faster than plain-old C using pointers to structures for the vast
majority of memory-intensive applications. DataDraw backed C already
is.
 
David Thompson

On Fri, 2 Jan 2009 06:48:51 -0800 (PST), Bill Cox wrote:
The class of applications where 42 will beat C is most programs that
use a lot of memory. If you use DataDraw (http://datadraw.sf.net) to
program in C using a semi-OO style, your code typically speeds up
dramatically. It's because of better cache hit rates. If you have
typical objects in memory that are larger than a cache line, then on
an L2 cache miss, you typically load a whole line of crud you don't
need. DataDraw by default causes the cache line to be loaded with
like-properties from objects of the same type, with much higher chance
of being useful in most cases. Benchmarks have proved this out many
times.
This will be better for programs that use only limited attributes of
a large number of objects, at least usually. (If it _never_ uses some
of the attributes, you could just omit them entirely and bring the
memory down that way.) It will be worse for programs that use most or
all attributes of each object but visit few(er) objects. (And here IME
it is often more difficult to predict which objects will be unused,
and omit them, although it is still worth trying more often than IME
people actually do.) Both occur in reality.

You assert elsethread this technique (parallel arrays) is difficult.
For anyone who learned in FORTRAN through at least the 70s, it was the
ONLY way, and quickly became second nature. And most/standard BASICs
also, although those were less often used for manipulating large data.
And APL! You mention 1983; although by that time PL/I-Pascal-C style
structures were more fashionable, and a common extension in FORTRAN
though not yet standard, I'm surprised you wouldn't have seen plenty
of existing code using the old (but perfectly serviceable) way.

Heck, it even provides a very rare arguably-defensible serious use of
C's famously silly commutative (aka 'reversible') subscripting:

struct { double b; char c[20]; } x [N], *p;
... p = &x[i] ... p -> b ... p -> c ...

int a [N]; double b [N]; char c [N] [20]; int p;
... p = i ... /*'obj'*/p [ b ] ... /*'obj'*/p [ c ] ...

/* yes, better allocation usually needed;
this is just a toy example to show the similarity */

It is harder for nonhomogeneous objects; your use of 'semi-OO' suggests
to me you have recognized that; and if I have time to spare, which
sadly isn't very likely, I'd be interested in looking at how. In some
simple but useful cases (up to maybe 4-6 significantly different
types) I have gotten away with compile-time partitioning of the
subscript space, but that gets yucky fast.
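The compile-time partitioning mentioned above can be sketched like
this (a hypothetical two-type EDA example, not code from any of the
posters): one shared index space is split into fixed ranges, and the
range an index falls in determines which type's attribute arrays it
addresses.

```c
#include <stdint.h>

/* One index space partitioned at compile time across two object
   types: indexes below WIRE_BASE are gates, the rest are wires. */
enum { MAX_GATES = 1000, WIRE_BASE = MAX_GATES, MAX_WIRES = 4000 };

static int is_gate(uint32_t obj)        { return obj < WIRE_BASE; }
static uint32_t wire_slot(uint32_t obj) { return obj - WIRE_BASE; }

/* Per-type attribute arrays, reached through the partition: */
static uint8_t  gate_kind[MAX_GATES];
static uint32_t wire_driver[MAX_WIRES];  /* object index of driving gate */
```

As David notes, this gets yucky fast: the partition boundaries are
baked in at compile time, so growing one type past its range means
recompiling and reshuffling every index.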

A compiler that does this automatically certainly can be useful; the
main point of tools like assemblers and compilers is for the machine
to handle the gruntwork. It makes the programmer's job easier, which
is good, but it doesn't create entirely new possibilities.
 
Bill

David said:
On Fri, 2 Jan 2009 06:48:51 -0800 (PST), Bill Cox wrote:
The class of applications where 42 will beat C is most programs that
use a lot of memory.  If you use DataDraw (http://datadraw.sf.net) to
program in C using a semi-OO style, your code typically speeds up
dramatically.  It's because of better cache hit rates.  If you have
typical objects in memory that are larger than a cache line, then on
an L2 cache miss, you typically load a whole line of crud you don't
need.  DataDraw by default causes the cache line to be loaded with
like-properties from objects of the same type, with much higher chance
of being useful in most cases.  Benchmarks have proved this out many
times.
[...]
You assert elsethread this technique (parallel arrays) is difficult.
For anyone who learned in FORTRAN through at least the 70s, it was the
ONLY way, and quickly became second nature. [...]
Heck, it even provides a very rare arguably-defensible serious use of
C's famously silly commutative (aka 'reversible') subscripting:
  struct { double b; char c[20]; } x [N], *p;
  ... p = &x[i] ... p -> b ... p -> c ...

  int a [N]; double b [N]; char c [N] [20]; int p;
  ... p = i ... /*'obj'*/p [ b ] ... /*'obj'*/p [ c ] ...
  /* yes, better allocation usually needed;
    this is just a toy example to show the similarity */
It is harder for nonhomogeneous objects; your use of 'semi-OO' suggests
to me you have recognized that; and if I have time to spare, which
sadly isn't very likely, I'd be interested in looking at how.

I am also interested. I came up with the same structure as you show
above, using arrays for each attribute. I was hoping that the scheme
didn't add any extra memory accesses. That is, the normal approach

    struct { int x, y; } *objects;

    int x_times_y( int i )
    {
        return objects[i].x * objects[i].y;
    }

requires three memory accesses: one to get the pointer to objects, one
for x, and one for y. Using separate arrays for each object

    int* objects_x;
    int* objects_y;

    int x_times_y( int i )
    {
        return objects_x[i] * objects_y[i];
    }

requires four memory accesses. I was thinking it might use something
like

    struct
    {
        int x [MAX_OBJ];
        int y [MAX_OBJ];
    } *objects;

    int x_times_y( int i )
    {
        return objects->x[i] * objects->y[i];
    }

which only requires three. But I guess the point is that like fields
of several objects will be accessed in a loop, so only one of the
array pointers is needed at any time, so

    struct { int x, y; } *objects;

    int sum( int count )
    {
        int n = 0;
        int i;
        for ( i = 0; i < count; ++i )
            n += objects[i].x;
        return n;
    }

will be inferior speed-wise in all ways to

    int* objects_x;
    int* objects_y;

    int sum( int count )
    {
        int n = 0;
        int i;
        for ( i = 0; i < count; ++i )
            n += objects_x[i];
        return n;
    }


You've got the idea. It would be great if compilers did this sort of
transformation automatically. 42 will. However, C exposes the layout
of structures to the user, and objects are normally allocated with
malloc rather than as arrays of structures. Cache performance with
malloc'ed structure-based code, especially on 64-bit systems, can
totally suck.
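
A self-contained version of the two `sum` variants sketched above
(illustrative only; the real trade-off shows up once the arrays exceed
the cache):

```c
/* Runnable form of the array-of-structs vs struct-of-arrays sums. */
struct obj { int x, y; };

static int sum_aos(const struct obj *objects, int count) {
    int n = 0;
    for (int i = 0; i < count; ++i)
        n += objects[i].x;      /* drags each object's y into cache too */
    return n;
}

static int sum_soa(const int *objects_x, int count) {
    int n = 0;
    for (int i = 0; i < count; ++i)
        n += objects_x[i];      /* cache lines hold nothing but x values */
    return n;
}
```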
 
