The container abstraction and parallel programming


jacob navia

Today, more than 10 years after Intel introduced the first SIMD
instructions, it is still not possible to use those instructions to do
SIMD programming when you use C and two mainstream compilers like gcc or
MSVC.

From the C side there is absolutely no support for this type of
programming, even though plain arrays could obviously do the trick.


The container abstraction, specifically the valarray container, could be
used to *mean* SIMD using operator overloading.

The use case could be

ValArray operator+(ValArray left,ValArray right);

This would allow an operation like adding two arrays element-wise
to be performed using SIMD instructions.
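
To make the idea concrete, here is a minimal sketch of what such an
addition could reduce to on SSE2 hardware. The ValArray layout shown (a
count plus a data pointer) is only an assumption for illustration, not
the actual library layout:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

typedef struct {        /* assumed layout, for illustration only */
    size_t count;
    double *data;
} ValArrayDouble;

/* Element-wise addition: what an overloaded '+' could reduce to. */
void ValArrayDouble_Add(ValArrayDouble *dst,
                        const ValArrayDouble *a,
                        const ValArrayDouble *b)
{
    size_t i = 0, n = dst->count;
    /* SIMD part: two doubles per iteration */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(&a->data[i]);
        __m128d vb = _mm_loadu_pd(&b->data[i]);
        _mm_storeu_pd(&dst->data[i], _mm_add_pd(va, vb));
    }
    /* scalar tail for an odd count */
    for (; i < n; i++)
        dst->data[i] = a->data[i] + b->data[i];
}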


Obviously, the usual "optimizations" can be added to those
basic operations so that

ValArray a,b,c,d;

a = b+c+d;

doesn't have to produce an intermediate array but can be "compiled"
by an overloaded assignment operator to generate

a[i] = b[i]+c[i]+d[i];

in a single loop, as is done in C++. Each overloaded operator returns a
token object containing information about its arguments, and the
overloaded assignment operator compiles correct element-wise code by
reparsing the generated tree.
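
In plain C, without operator overloading, the same deferral can be
sketched with explicit tree-building functions; the names and layout
here are invented for illustration:

#include <stddef.h>

/* Each "operator" builds a node instead of computing a result. */
typedef struct Expr {
    enum { LEAF, ADD } tag;
    const struct Expr *lhs, *rhs;   /* used when tag == ADD  */
    const double *data;             /* used when tag == LEAF */
} Expr;

static double Eval(const Expr *e, size_t i)
{
    return e->tag == LEAF ? e->data[i]
                          : Eval(e->lhs, i) + Eval(e->rhs, i);
}

/* The "assignment" walks the tree once, emitting a single fused
   loop: no temporary array for b+c or (b+c)+d is ever created. */
void Assign(double *a, const Expr *e, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = Eval(e, i);
}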

To prepare for this, the C Containers library proposes the ValArray
container for each of the basic types (char, short, int, long long and
size_t).

In the sample implementation there is a generic module that
allows generating code for ANY type, including those left out like
long (since it is always equal to either int or long long in
all implementations I know of). The type size_t was added to be able
to correctly index any sequential container with a ValArray of size_t's.
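
The generic module could work along the lines of this textbook
include-per-type technique (the macro names are invented; this is not
the actual module source):

/* valarraygen.h -- a sketch. Define VTYPE (element type) and VNAME
   (container name) before including, e.g.:
       #define VTYPE long
       #define VNAME ValArrayLong
       #include "valarraygen.h"                                     */
#include <stddef.h>

#ifndef PASTE
#define PASTE_(a,b) a##b
#define PASTE(a,b)  PASTE_(a,b)
#endif

typedef struct { size_t count; VTYPE *data; } VNAME;

static void PASTE(VNAME,_Add)(VNAME *dst, const VNAME *a, const VNAME *b)
{
    for (size_t i = 0; i < dst->count; i++)
        dst->data[i] = a->data[i] + b->data[i];
}

#undef VTYPE
#undef VNAME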

The code proposed NOW is plain serial code, but it can (and should) be
replaced in implementations more sophisticated than the sample
one to do SIMD programming, one of the most basic forms
of parallelism, available in commodity hardware for more than 10
years now...

Obviously the four basic operations should be there, but other kinds of
array operations could be interesting too, like the following (a serial
C sketch of all four appears after the list):

o REDUCE Apply binary operation to each element and the running total:
REDUCE + 1 2 3 4 5 --> 1+2+3+4+5 = 15

You apply the operator + to the identity element for "+" and to the
first element: 0+1 --> 1.
You apply the operator + to the second element and the result
of the last operation: 1+2 --> 3.
And so on, until the last element is reached.

o EXPAND Apply binary operation producing a ValArray of the same length
containing the intermediate results:
EXPAND + 1 2 3 4 5 --> 1 3 6 10 15


o SELECT Apply a boolean vector to a ValArray, selecting all elements
that correspond to 1 and eliminating those with zeros.

SELECT 1 1 0 1 0 0 / 10 20 30 40 50 60 --> 10 20 40

o EXTEND Apply a boolean vector to a ValArray, leaving all elements
with 1 and introducing the zero or blank (for character arrays) at the
zero positions:

EXTEND 1 1 0 1 0 0 / 10 20 40 --> 10 20 0 40 0 0
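
Here is a serial C sketch of the four operations over int arrays (the
names and signatures are mine, chosen for illustration):

#include <stddef.h>

/* REDUCE: fold all elements with '+' into one value */
int Reduce(const int *v, size_t n)
{
    int acc = 0;                     /* identity element of '+' */
    for (size_t i = 0; i < n; i++) acc += v[i];
    return acc;
}

/* EXPAND: running totals, same length as the input */
void Expand(const int *v, int *out, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; i++) { acc += v[i]; out[i] = acc; }
}

/* SELECT: keep elements whose mask entry is 1; returns new length */
size_t Select(const int *mask, const int *v, int *out, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i]) out[k++] = v[i];
    return k;
}

/* EXTEND: scatter the packed elements back, zeros elsewhere */
void Extend(const int *mask, const int *v, int *out, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        out[i] = mask[i] ? v[k++] : 0;
}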

The idea is to select a set of operations that makes it easy to treat
whole arrays as the objects of operations instead of working with
scalar data.

What new operations should be in the set?
How could those operations be incorporated into mainstream C?


jacob
 

Ben Pfaff

jacob navia said:
Today, more than 10 years after Intel introduced the first SIMD
instructions, it is still not possible to use those instructions to do
SIMD programming when you use C and two mainstream compilers like gcc
or MSVC.

From the C side there is absolutely no support for this type of
programming, even though plain arrays could obviously do the trick.

"absolutely no support"? You exaggerate:
http://gcc.gnu.org/onlinedocs/gcc-4...din-Functions.html#X86-Built_002din-Functions
 

jacob navia

On 07/01/12 00:08, Ben Pfaff wrote:

Hi Ben

Look, those functions exist of course, but that is not standard C;
those functions have no links to the rest of the language.

Besides, they are tied to the Intel architecture (which
is not wrong of course), but the idea behind my post is to try to
abstract a bit from a single architecture to a more *general* SIMD
programming. Note that the notions of container and operations
are NOT tied to Intel, IBM or Motorola vector proposals but could
be used with all three.

Even I, with my limited resources, have some intrinsics for some
operations, but that is really not *language* support, Ben.
 

Ben Pfaff

jacob navia said:
On 07/01/12 00:08, Ben Pfaff wrote:

Look, those functions exist of course, but that is not standard C;
those functions have no links to the rest of the language.

If that's your purpose then I don't know why you started out by
saying that you can't do it with "mainstream compilers like GCC".
 

jacob navia

On 07/01/12 00:30, Ben Pfaff wrote:
If that's your purpose then I don't know why you started out by
saying that you can't do it with "mainstream compilers like GCC".

.... and MSVC, and ANY compiler, for that matter. It wasn't directed
against gcc specifically.


Because all those compilers propose
the same thing as gcc/Intel compilers do: a series of machine-dependent
intrinsics that make your code not portable at all, and maybe because of
that they are not accepted by a large user base.

What I wanted to propose is a more abstract and portable way of doing
SIMD than just using pseudo-functions that map one-to-one to
instructions of the Intel instruction set.

Obviously the problem is there, since all that those intrinsics propose
is very low level operations (which amounts to more or less returning to
assembly language in C, since all the intrinsics map to specific
Intel/AMD instructions).

My aim is to start a discussion about what kind of constructs could be
used to abstract those operations into more general concepts.
 

Kaz Kylheku

Today, more than 10 years after Intel introduced the first SIMD
instructions, it is still not possible to use those instructions to do
SIMD programming when you use C and two mainstream compilers like gcc or
MSVC.

If anything, that may be an indicator that nobody cares that much about
Intel's SIMD instructions. It's not hard to see why.

SIMD instructions do not translate to a tangible benefit for the average
consumer of Intel chips. (Marketing 101: features need to connect with
benefits!)

There was a time when games and video playback stretched the ability of the CPU
to the core, and hardware for accelerating them was next to nonexistent. The
processing-intensive tasks in those kinds of applications are handled by
dedicated hardware these days.

SIMD will not fix the major, user-visible performance issues in a Wintel PC,
like long boot times, or browsers becoming unresponsive for seconds at a time.
 

Ian Collins

Today, more than 10 years after Intel introduced the first SIMD
instructions, it is still not possible to use those instructions to do
SIMD programming when you use C and two mainstream compilers like gcc or
MSVC.

From the C side there is absolutely no support for this type of
programming, even though plain arrays could obviously do the trick.

That may be the case for those compilers, but not for others. For
example the Sun Studio compilers (I'd expect Intel to offer something as
well) have a number of vector processing optimisations which include SIMD.

I think this class of optimisation is best left as a compiler
implementation detail. The range of vector processing support
(including the likes of OpenMP) is simply too large (from nothing to a
GPU) to be considered for standardisation. If the demand is there, the
vendor will deliver.
The container abstraction, specifically the valarray container, could be
used to *mean* SIMD using operator overloading.

A similar approach has been used by some compilers for many years,
typically implemented as some form of "performance library" for vector
type operations.
 

jacob navia

On 07/01/12 01:04, Kaz Kylheku wrote:
If anything, that may be an indicator that nobody cares that much about
Intel's SIMD instructions. It's not hard to see why.

SIMD instructions do not translate to a tangible benefit for the average
consumer of Intel chips. (Marketing 101: features need to connect with
benefits!)

But... how could those benefits be *USED* unless you program in assembly?

That is the whole point of my post. You can't say that users do not
want those features if there are no high level constructs that they
can use to make that happen!

There was a time when games and video playback stretched the ability of the CPU
to the core, and hardware for accelerating them was next to nonexistent. The
processing-intensive tasks in those kinds of applications are handled by
dedicated hardware these days.

But the problem *remains* since you can't address that dedicated
hardware from a high level language anyway!

SIMD will not fix the major, user-visible performance issues in a Wintel PC,
like long boot times, or browsers becoming unresponsive for seconds at a time.

SIMD needs a special kind of programming that is badly supported in
current serial languages like C or C++. Many parallel languages have
been proposed but they are of too limited scope to catch on. What is needed
is a way to address SIMD programming from within a known general purpose
programming *context*.
 

jacob navia

On 07/01/12 01:50, Ian Collins wrote:
That may be the case for those compilers, but not for others. For
example the Sun Studio compilers (I'd expect Intel to offer something as
well) have a number of vector processing optimisations which include SIMD.

Yes, Intel and Sun, as hardware vendors, have a real interest in making
people use the features of their hardware and will try to propose
either automatic SIMD deduction or inline assembly that supports
their features directly.

Problem is, even after dozens of years and millions and millions of
dollars in research, automatic SIMD parallelization remains VERY
difficult for the general case. The problem, basically, is as follows:

Automatic parallelization must PROVE that no dependencies exist
between the array members, no unforeseen side effects can possibly
occur and ONLY THEN can the constructs in question be parallelized.

All this needs extensive program analysis, which is made even more
difficult by the model of separate module compilation...

High level constructs designed to allow the programmer to assert that
he/she wants parallel programming FOR THIS OBJECT simplify the
task of the compiler, allowing for simpler compilers and more
widespread use.
I think this class of optimisation is best left as a compiler
implementation detail. The range of vector processing support (including
the likes of OpenMP) is simply too large (from nothing to a GPU) to be
considered for standardisation. If the demand is there, the vendor will
deliver.

Well, SIMD processing is basically the same, only the SCALE changes
between an Intel SSE4 unit and a GPU with some thousand nodes. Using a
higher level construct you LEAVE the details to the compiler obviously,
assuming a compiler for Intel will make use of SSE4 and a compiler for a
GPU will use the GPU. What you do NOT leave to the compiler is the
ANALYSIS as to which constructs can be used with SIMD, allowing the
compiler to specialize in code generation without burdening it with a
complete analysis of all the possible data paths that could interfere
with the SIMD optimization.

A similar approach has been used by some compilers for many years,
typically implemented as some form of "performance library" for vector
type operations.

Yes, I just wanted to propose that idea not for a single compiler but
for the language in general.
 

jacob navia

On 07/01/12 09:39, Jens Gustedt wrote:
On 01/07/2012 12:02 AM, jacob navia wrote:

macros and _Generic

Jens

Hi Jens

"macros and _Generic"... can you maybe elaborate a bit?
That was far to cryptic for my small brain.

:)
 

Ian Collins

On 07/01/12 01:50, Ian Collins wrote:

Yes, Intel and Sun, as hardware vendors, have a real interest in making
people use the features of their hardware and will try to propose
either automatic SIMD deduction or inline assembly that supports
their features directly.

Problem is, even after dozens of years and millions and millions of
dollars in research, automatic SIMD parallelization remains VERY
difficult for the general case. The problem, basically, is as follows:

Automatic parallelization must PROVE that no dependencies exist
between the array members, no unforeseen side effects can possibly
occur and ONLY THEN can the constructs in question be parallelized.

All this needs extensive program analysis, which is made even more
difficult by the model of separate module compilation...
High level constructs designed to allow the programmer to assert that
he/she wants parallel programming FOR THIS OBJECT simplify the
task of the compiler, allowing for simpler compilers and more
widespread use.

I guess pragmas are another option, as used for OpenMP. We already have
the restrict keyword to simplify analysis with pointer parameters, maybe
something similar for parallelisation?
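
OpenMP is indeed an existing example of exactly that: the programmer
asserts the independence and the implementation decides how (or
whether) to parallelise. A sketch (the function name is invented):

#include <stddef.h>

void add_arrays(double *L, double *R, double *result, size_t n)
{
    /* The pragma asserts that the iterations are independent;
       compiled without OpenMP support it is simply ignored. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        result[i] = L[i] + R[i];
}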
Yes, I just wanted to propose that idea not for a single compiler but
for the language in general.

I do see your point, but there are many CPU/OS specific tricks used to
get the optimum performance on a particular platform, which is why
vendors' compilers generally squeeze the best performance from a system.
Would this on its own do enough to justify its inclusion in the language?
 

Jens Gustedt

Hi Jacob,

On 01/07/2012 09:54 AM, jacob navia wrote:
On 07/01/12 09:39, Jens Gustedt wrote:
"macros and _Generic"... can you maybe elaborate a bit?
That was far to cryptic for my small brain.

probably a bit too terse, yes :)

what I meant is that you need a type generic interface for these things,
not necessarily a language extension

the idea would be to write a whole bunch of type-specific functions that
cover the different cases (different primitive types, different
operations). Then you can put a type-generic glue around that by using
macros and the brand new _Generic from C11. This would only give you
function-like syntax but would make it easily usable.
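
a minimal sketch of such glue, with invented function names:

#include <stddef.h>

/* type-specific workers, one per element type */
static void valarray_add_int(int *dst, const int *a,
                             const int *b, size_t n)
{ for (size_t i = 0; i < n; i++) dst[i] = a[i] + b[i]; }

static void valarray_add_double(double *dst, const double *a,
                                const double *b, size_t n)
{ for (size_t i = 0; i < n; i++) dst[i] = a[i] + b[i]; }

/* C11 type-generic front end: dispatches on the destination type */
#define valarray_add(dst, a, b, n)           \
    _Generic((dst),                          \
        int *:    valarray_add_int,          \
        double *: valarray_add_double)(dst, a, b, n)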

Then, after some time there could perhaps be some convergence on the
functionalities and one could think of adding keywords and/or operator
support.

I don't know how far along you are with implementing _Generic in your
compiler (if you have attacked it at all). Gcc already has equivalent features that
are usable and that I am currently using to implement the set of type
generic atomic_... interfaces of C11. I think the interface complexity
of that is about the same as what would be necessary for vector
operations. (BTW thinking about it, _Pragma can also be quite handy when
programming such things.)

Cheers
Jens
 

Stephen Sprunk

On 07/01/12 01:50, Ian Collins wrote:

Yes, Intel and Sun, as hardware vendors, have a real interest in making
people use the features of their hardware and will try to propose
either automatic SIMD deduction or inline assembly that supports
their features directly.

Problem is, even after dozens of years and millions and millions of
dollars in research, automatic SIMD parallelization remains VERY
difficult for the general case. The problem, basically, is as follows:

Automatic parallelization must PROVE that no dependencies exist
between the array members, no unforeseen side effects can possibly
occur and ONLY THEN can the constructs in question be parallelized.

All this needs extensive program analysis, which is made even more
difficult by the model of separate module compilation...

There are no "dependencies" between array members. There is a possible
aliasing problem when dealing with pointers rather than arrays, but that
is easily enough solved with "restrict" qualifiers.
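
For instance, with restrict-qualified parameters the compiler may assume
the arrays do not overlap and is free to vectorize the loop (a sketch):

#include <stddef.h>

void scale(double * restrict dst, const double * restrict src, size_t n)
{
    /* restrict: dst and src are promised not to overlap */
    for (size_t i = 0; i < n; i++)
        dst[i] = 2.0 * src[i];
}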

I don't understand what you think the problem is.

S
 

Gene

There are no "dependencies" between array members.  There is a possible
aliasing problem when dealing with pointers rather than arrays, but that
is easily enough solved with "restrict" qualifiers.

I don't understand what you think the problem is.

S

He has to be referring to loop dependencies as in

for (i = 0; i < n; i++) a[i] = a[i-3] + 2;

The main reason I'm writing is to point out that recent C
compilers, including gcc, Intel, and (I believe, though am not sure)
Microsoft, all will emit vector instructions for loops coded with
commonsense idioms. I have a signal processing code that speeds up by
a factor of 3 with gcc when this optimization is turned on.

Cheers.
 

jacob navia

On 08/01/12 09:08, Gareth Owen wrote:
Nonsense. As well as the builtins already mentioned, GCC has
-ftree-vectorize which uses these SIMD instructions automatically when
the compiler can deduce that the aliasing rules allow it. Dot product
and matrix multiplication get a sizeable speed up when you use it.
The 'restrict' keyword also helps a lot.

Nonsense, gcc doesn't have a real vectorizer. For instance:

void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=0; i<n;i++) result[n-i]=L[n-i]+R[n-i];
}

This is exactly the same program as

void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=0; i<n;i++) result=L+R;
}

But NO option (including -ftree-vectorize, as you suggest) will
convince gcc to use the parallel instructions in the first case.

Only Intel has a more stable version of a vectorizer, since
only Intel can afford to pay dozens of computer scientists
for dozens of years, full time, to build it.

Gcc has invested a considerable effort and the results are
not very good, as you can see.

But of course, why make it simple when you can make it complicated?

Why is it not possible for the PROGRAMMER to declare a special
data type that should be used in SIMD programming?

That was the main argument of my post and nobody has given a sensible
answer to it. Most of the answers are like yours:

"Nonsense. gcc can do it automatically"
 

jacob navia

On 07/01/12 14:39, Stephen Sprunk wrote:
There are no "dependencies" between array members. There is a possible
aliasing problem when dealing with pointers rather than arrays, but that
is easily enough solved with "restrict" qualifiers.

I don't understand what you think the problem is.

S

The problem, Stephen, is as follows:

gcc will vectorize this:

void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=0; i<n;i++) result=L+R;
}

but will NOT vectorize this:

void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=0; i<n;i++) result[n-i]=L[n-i]+R[n-i];
}

Why?

As you (correctly) say:
There are no "dependencies" between array members.

All this huge machinery for automatic vectorizing has never worked
correctly in dozens of years, since it requires an incredible
amount of program analysis. My proposition is:

Why make complicated something that you can do in a simple way?

Let the programmer declare the data that should be handled in an SIMD
way.

If I write:

c = (a+b)/2;
if (c&1) fn(1);

You could deduce that c must be of integer type, couldn't you?
LET THE COMPILER DO THAT. Let's eliminate declarations for
integers and let the compiler deduce the type of each variable
from the program text.

Looks ridiculous, doesn't it?

Well, for SIMD the problem is the same.
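
For what it's worth, gcc's own vector extension already lets the
programmer declare such a type, although in a compiler-specific way;
a sketch using the documented vector_size attribute:

/* A programmer-declared SIMD type: two doubles in one SSE register */
typedef double v2df __attribute__((vector_size(16)));

v2df add2(v2df a, v2df b)
{
    return a + b;   /* element-wise; typically a single addpd */
}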
 

jacob navia

On 08/01/12 11:41, jacob navia wrote:
void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=0; i<n;i++) result[n-i]=L[n-i]+R[n-i];
}

that should have been:

for(size_t i=0; i<n;i++) result[n-i-1]=L[n-i-1]+R[n-i-1];

The results are the same for gcc with the following options:

gcc -O3 -c -S -std=c99 -ftree-vectorize maybe.c

inner loop:
movsd -8(%rdi), %xmm0
addsd -8(%rsi), %xmm0
movsd %xmm0, -8(%rax)

movsd = MOVe Scalar Double precision, i.e. the scalar instruction;
vectorized code would use the packed forms (movapd/addpd).
 

jacob navia

On 08/01/12 11:41, jacob navia wrote:

If you change the starting index to something other than zero,
the vectorization fails:

void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=1; i<n;i++) result[i]=L[i]+R[i];
}
 

ec429

Nonsense, gcc doesn't have a real vectorizer. For instance:

void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=0; i<n;i++) result[n-i]=L[n-i]+R[n-i];
}

This is exactly the same program as

void doADD(double *L, double *R, double *result,size_t n)
{
for(size_t i=0; i<n;i++) result=L+R;
}

Actually, those programs aren't "exactly the same", due to the lack of
the aforementioned "restrict" keyword. For instance:
double array[3]={0,1,2};
doADD(array+1, array+1, array, 2);
In the first case you will get array={8,4,2}; in the second, array={2,4,2}.
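
A restrict-qualified version, under which such an overlapping call would
be invalid and vectorisation becomes safe, would look like this sketch:

#include <stddef.h>

void doADD(double * restrict L, double * restrict R,
           double * restrict result, size_t n)
{
    /* restrict: the caller promises L, R and result do not overlap */
    for (size_t i = 0; i < n; i++)
        result[i] = L[i] + R[i];
}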
In general, the correct approach to parallel programming is giving the
compiler as few constraints as possible. This can sometimes be achieved
with 'restrict', but not much can be done in procedural languages unless
they have some explicit means of telling the compiler that normal order
constraints can be relaxed (which would be a /much/ more useful
extension to C than processor-specific vectorisation primitives, since
the latter would reduce the portable glory of C to the provincial
insularity of assembler (not that assembler isn't valuable in
appropriate circumstances, of course!)).

In C as it stands, the canonical way to do this is by means of threads
(at least in C11; '99 and its predecessors don't mention threaded
execution, but many implementations support it).

SIMD has its uses, of course; but unless your code is perversely written
(and so long as you have used 'restrict' where appropriate) the compiler
knows better than you do. Besides, your time as a programmer is worth
enough that you generally shouldn't be expending it on non-portable
performance hacks unless you /know/ your code will be run a very large
number of times. YAGNI.
If you really want to write strongly parallelisable programs (rather
than merely hand-optimising with a few vector ops that probably don't
gain you much anyway) try a functional language, such as LISP - with a
properly designed data-flow model parallel processing becomes something
that the implementation can do transparently.

I think you just don't like gcc, because you know (and don't want to
admit to yourself) that it's better than yours.
-e
 
