Performance of hand-optimised assembly

Discussion in 'C Programming' started by Ben Bacarisse, Dec 23, 2011.

  1. I'm starting a new thread, but it is prompted by this remark from 88888
    Dihedral <> about a short binary search
    function I posted:

    | Do you consider to convert 5 to 10 lines of C to assembly?
    | I did that 20 years ago on X86 cpu.
    | It was a piece of cake to gain the 20-40% speed required if paid
    | well.

    I've felt (it's only a feeling) that hand coding for speed (rather than
    to access features not available in C like extra wide multiplies and so
    on) was a thing of the past. But maybe I'm wrong. Is it still possible
    to get a significant speed-up over C by hand coding? How much depends
    on the processor? How good are the modern optimising compilers that
    would have to be beaten?

    Since I don't know enough about modern CPUs to stand much chance of
    improving on gcc, I thought I'd post this for others to try. Maybe
    you'll find you have some time on your hands over the next few days!

    Here is a test program:

    #include <stdio.h>
    #include <stdlib.h>

    size_t binsearch(int a[], size_t length, int key)
    {
         size_t low = 0, high = length;
         while (high > low) {
              size_t middle = low + (high - low)/2;
              if (key == a[middle])
                   return middle;
              else if (key > a[middle])
                   low = middle + 1;
              else high = middle;
         }
         return -1;
    }

    #ifdef TIME_LOOP
    #define binsearch(a, length, key) 1
    #endif

    int main(int argc, char **argv)
    {
         if (argc == 2) {
              int n = atoi(argv[1]);
              /* cast so C++ compiled code can be timed */
              int *data = (int *)malloc(n * sizeof *data);
              if (data) {
                   int i, length, step, key;
                   long int sum;
                   for (i = 0; i < n; i++)
                        data[i] = 2 * i + 1;

                   /* Include one zero-length search for correctness testing */
                   sum = binsearch(data, 0, data[0]);
                   for (length = 1; length <= n; length++)
                        for (step = 1; step <= length; step++)
                             for (key = 0; key <= 2 * length; key += step)
                                  sum += binsearch(data, length, key);
                   printf("%ld\n", sum);
                   return EXIT_SUCCESS;
              }
         }
         return EXIT_FAILURE;
    }

    The pattern of calls it makes is not intended to be representative. If
    cache effects turn out to dominate, so be it -- this is a just for fun
    after all.

    On my laptop[1], gcc's optimiser (version 4.6.1) almost exactly halves the
    times. clang's code is a little better at -O0, but the optimised code
    is a little worse:
                -O0     -O1     -O2     -O3
    gcc      0.1128  0.0633  0.0571  0.0562
    g++      0.1167  0.0633  0.0572  0.0564
    clang    0.1033  0.0669  0.0672  0.0672

    These times are microseconds per call. The loop overhead has been
    subtracted[2]. On a low-power chip[3] I get:

              -O0     -O1     -O2     -O3
    gcc    0.7872  0.4296  0.4166  0.3688
    g++    0.7842  0.4290  0.4160  0.3688

    which says nothing new (except how slow the CPU is).

    What about other compilers and processors? Can anyone produce code that
    is significantly better than gcc -O3 by hand? If you try, how long did
    it take to get it both correct and faster? Does your version handle the
    same range of values as the C version?

    [1] Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz, 3MB cache
    [2] Crude timing script available on request.
    [3] Geode(TM) Integrated Processor by AMD PCS, 500MHz, 128KB cache.
    --
    Ben.
     
    Ben Bacarisse, Dec 23, 2011
    #1

  2. Ben Bacarisse

    BartC Guest

    "Ben Bacarisse" <> wrote in message
    news:...

    > int main(int argc, char **argv)
    > {
    > if (argc == 2) {


    What value of n (from argv) did you use for your test, and what was the
    output?

    --
    Bartc
     
    BartC, Dec 23, 2011
    #2

  3. "BartC" <> writes:

    > "Ben Bacarisse" <> wrote in message
    > news:...
    >
    >> int main(int argc, char **argv)
    >> {
    >> if (argc == 2) {

    >
    > What value of n (from argv) did you use for your test, and what was
    > the output?


    1500 on the laptop and 700 on the slower machine. The output was:

    $ gcc -O2 btime.c
    $ ./a.out 1500
    2297638170

    When compiled with -DTIME_LOOP, the output is the number of calls:

    $ gcc -O2 -DTIME_LOOP btime.c
    $ ./a.out 1500
    17258078

    which helps with calculating the time per call.

    --
    Ben.
     
    Ben Bacarisse, Dec 23, 2011
    #3
  4. Ben Bacarisse

    BartC Guest

    "Ben Bacarisse" <> wrote in message
    news:...
    > "BartC" <> writes:
    >
    >> "Ben Bacarisse" <> wrote in message
    >> news:...
    >>
    >>> int main(int argc, char **argv)
    >>> {
    >>> if (argc == 2) {

    >>
    >> What value of n (from argv) did you use for your test, and what was
    >> the output?

    >
    > 1500 on the laptop and 700 on the slower machine. The output was:
    >
    > $ gcc -O2 btime.c
    > $ ./a.out 1500
    > 2297638170


    That's odd; the output I get across 3 compilers (under Windows)
    was -1997329126 (and using 'long long int sum' results in a totally
    different result again).

    Anyway, the results I got were as follows, with fastest/slowest times in
    seconds, all for x86-32 under Windows (and all fairly old):

    lcc-win32 1.4/1.8 (3.8 June 2010)
    DMC 1.2/1.7 (Digital Mars 8.42n)
    gcc 0.8/2.1 (Mingw 4.5.0, no special processor options used)

    My first assembly version of the body of binsearch() took about 1.0 seconds;
    pretty much anything else I tried, even taking out several instructions,
    made it slower!

    But then I'm not an expert on the internal workings of modern x86
    processors. Someone who is, I'm sure could at least have matched gcc.

    Your example though is probably perfect material for a compiler optimiser to
    work with: half-a-dozen variables which fit neatly into the half-dozen spare
    registers on the x86! Not much scope left to improve on.

    I do remember a few years ago implementing an interpreter in C, which had
    been previously been written in a mix of high-level code plus lots of inline
    assembly; the C code, even compiled under gcc, was half the speed (and my
    assembly code then was probably just as poor as it is now). However
    implementing languages is probably a special case..

    --
    Bartc
     
    BartC, Dec 23, 2011
    #4
  5. Ben Bacarisse

    Joe keane Guest

    In article <>,
    Ben Bacarisse <> wrote:
    >I've felt (it's only a feeling) that hand coding for speed (rather than
    >to access features not available in C like extra wide multiplies and so
    >on) was a thing of the past.


    It's not.

    If you can take the 0.01% of your code (and you're *sure* it is really
    that) and make it 40% faster, then it may be a useful thing to do.

    [although when <chip manufacturer> comes out with a new chip you may
    have actually made it *worse*]
     
    Joe keane, Dec 24, 2011
    #5
  6. "BartC" <> writes:

    > "Ben Bacarisse" <> wrote in message
    > news:...
    >> "BartC" <> writes:
    >>
    >>> "Ben Bacarisse" <> wrote in message
    >>> news:...
    >>>
    >>>> int main(int argc, char **argv)
    >>>> {
    >>>> if (argc == 2) {
    >>>
    >>> What value of n (from argv) did you use for your test, and what was
    >>> the output?

    >>
    >> 1500 on the laptop and 700 on the slower machine. The output was:
    >>
    >> $ gcc -O2 btime.c
    >> $ ./a.out 1500
    >> 2297638170

    >
    > That's odd; the output I get across 3 compilers (under Windows)
    > was -1997329126 (and using 'long long int sum' results in a totally
    > different result again).


    Yes, sadly the result depends on the value of SIZE_MAX and the behaviour
    of adding that to a long. It won't affect timing but it does mean the
    output is not portable. Making binsearch return int rather than size_t
    will, I think, fix that but it may alter the timing. (It has only a tiny
    effect on my system but I'd leave it alone none the less).

    > Anyway, the results I got were as follows, with fastest/slowest times in
    > seconds, all for x86-32 under Windows (and all fairly old):
    >
    > lcc-win32 1.4/1.8 (3.8 June 2010)
    > DMC 1.2/1.7 (Digital Mars 8.42n)
    > gcc 0.8/2.1 (Mingw 4.5.0, no special processor options used)


    Is this variation due to optimised and unoptimised code?

    > My first assembly version of the body of binsearch() took about 1.0 seconds;
    > pretty much anything else I tried, even taking out several instructions,
    > made it slower!


    Well, it's not bad that you can beat 2 of the three compilers.

    > But then I'm not an expert on the internal workings of modern x86
    > processors. Someone who is, I'm sure could at least have matched gcc.
    >
    > Your example though is probably perfect material for a compiler optimiser to
    > work with: half-a-dozen variables which fit neatly into the half-dozen spare
    > registers on the x86! Not much scope left to improve on.


    Yup. That was what rather surprised me about the suggestion of
    converting to assembler for speed. It looked an unlikely candidate.

    > I do remember a few years ago implementing an interpreter in C, which had
    > been previously been written in a mix of high-level code plus lots of inline
    > assembly; the C code, even compiled under gcc, was half the speed (and my
    > assembly code then was probably just as poor as it is now). However
    > implementing languages is probably a special case..


    In some sense everything is a special case. Language implementation is a
    perfectly reasonable role for C, so there's no reason to discard such
    data. The trouble is that without code to review, it's impossible to
    know where the speed was lost.

    --
    Ben.
     
    Ben Bacarisse, Dec 24, 2011
    #6
  7. 88888 Dihedral, Dec 24, 2011
    #7
  8. Ben Bacarisse

    BartC Guest

    "Ben Bacarisse" <> wrote in message
    news:...
    > "BartC" <> writes:


    >> Anyway, the results I got were as follows, with fastest/slowest times in
    >> seconds, all for x86-32 under Windows (and all fairly old):
    >>
    >> lcc-win32 1.4/1.8 (3.8 June 2010)
    >> DMC 1.2/1.7 (Digital Mars 8.42n)
    >> gcc 0.8/2.1 (Mingw 4.5.0, no special processor options used)

    >
    > Is this variation due to optimised and unoptimised code?


    Yes. Either using -O/-o for the first two, or not using it; and using -O3
    or -O0 for gcc.

    >> My first assembly version of the body of binsearch() took about 1.0
    >> seconds;
    >> pretty much anything else I tried, even taking out several instructions,
    >> made it slower!

    >
    > Well, it's not bad that you can beat 2 of the three compilers.


    lcc-win32 is not known for its fast code, so that wasn't so surprising.
    Looking at the fast gcc result in more detail, it eliminates the first call
    to binsearch() completely (but doesn't save so much because it doesn't enter
    the loop), and inlines the second call.

    If I did the same in assembly, and partially unrolled the loop (fiddly in
    this case), I got to within 50ms of the gcc timing. (My assembly code was
    inside a version in my own language and my own non-optimising compiler, so
    probably the rest of the code makes the difference. FWIW the timing in that
    high-level code would have been about 2.1 seconds).

    So for this example, I don't think gcc was demonstrating any special
    knowledge of my processor; maybe there are processor-specific switches that
    could have been used. But it would be interesting to know the results for a
    compiler like Intel's (or rather, AMD's, if they do a compiler, since that's
    what my machine uses).

    >> Your example though is probably perfect material for a compiler optimiser
    >> to
    >> work with: half-a-dozen variables which fit neatly into the half-dozen
    >> spare
    >> registers on the x86! Not much scope left to improve on.

    >
    > Yup. That was what rather surprised me about the suggestion of
    > converting to assembler for speed. It looked an unlikely candidate.


    It all depends. If this wasn't just an array of standard machine words, or
    there was some other unusual context, then there are more coding
    possibilities and perhaps more opportunities to improve on the compiler.

    > However
    >> implementing languages is probably a special case..

    >
    > In some sense everything is a special case. Language implementation is a
    > perfectly reasonably role for C, so there's no reason to discard such
    > data. The trouble is that without code to review, it's impossible to
    > know where the speed was lost.


    In this example, I was making global decisions not easily available to a C
    compiler, such as choosing to use some dedicated registers across a specific
    set of 'conforming' functions, which comprised the main dispatch loop of
    this interpreter (namely, program counter, frame pointer, and stack
    pointer). I needed to save and restore those registers when stepping out of,
    and back into, those conforming functions.

    I was also using threaded code for these functions (jumping from the end of
    one function, to the start of the next, with no call/return mechanism). Now
    gcc has a mechanism for threaded code via indirect labels, but that seems
    restricted to labels within a single function, and I didn't really want a
    single 10,000-line function.

    And in a more recent version, I'm using one dedicated register, but making
    use of the *hardware* stack and frame pointers, by using the system stack
    for the interpreter's data stack. Now I'm sure you can't just tell C to push
    or pop some arbitrary data to the stack, then carry on as normal! (And since
    this is 50% faster, the straight-C version may well have been 1/3 the
    speed.)

    So I think language applications *are* somewhat more special.

    --
    Bartc
     
    BartC, Dec 24, 2011
    #8
  9. Ben Bacarisse

    Willem Guest

    Ben Bacarisse wrote:
    ) I'm starting a new thread, but it is prompted by this remark from 88888
    ) Dihedral <> about a short binary search
    ) function I posted:
    )
    ) <snip>
    ) On my laptop[1], gcc's optimiser (version 4.6.1) almost exactly halves the
    ) times. clang's code is a little better at -O0, but the optimised code
    ) is a little worse:
    ) -O0 -O1 -O2 -O3
    ) gcc 0.1128 0.0633 0.0571 0.0562
    ) g++ 0.1167 0.0633 0.0572 0.0564
    ) clang 0.1033 0.0669 0.0672 0.0672
    )
    ) These times are microseconds per call. The loop overhead has been
    ) subtracted[2].

    Have you tried enabling the processor-specific switches on gcc?

    ) [1] Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz, 3MB cache
    ) [2] Crude timing script available on request.

    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Dec 24, 2011
    #9
  10. On Saturday, December 24, 2011 8:48:05 PM UTC+8, Willem wrote:
    > Ben Bacarisse wrote:
    > ) I'm starting a new thread, but it is prompted by this remark from 88888
    > ) Dihedral <> about a short binary search
    > ) function I posted:
    > )
    > ) <snip>
    > ) On my laptop[1], gcc's optimiser (version 4.6.1) almost exactly halves the
    > ) times. clang's code is a little better at -O0, but the optimised code
    > ) is a little worse:
    > ) -O0 -O1 -O2 -O3
    > ) gcc 0.1128 0.0633 0.0571 0.0562
    > ) g++ 0.1167 0.0633 0.0572 0.0564
    > ) clang 0.1033 0.0669 0.0672 0.0672
    > )
    > ) These times are microseconds per call. The loop overhead has been
    > ) subtracted[2].
    >
    > Have you tried enabling the processor-specific switches on gcc?
    >
    > ) [1] Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz, 3MB cache
    > ) [2] Crude timing script available on request.
    >
    > SaSW, Willem
    > --
    > Disclaimer: I am in no way responsible for any of the statements
    > made in the above text. For all I know I might be
    > drugged or something..
    > No I'm not paranoid. You all think I'm paranoid, don't you !
    > #EOT


    I think pass-by-register is faster. But if just inserting 10 to 20 lines
    of assembly can make the code faster, I'll go for assembly to eliminate
    the function call overheads.

    By the way, swapping two registers in 3 instructions, with or without a
    tmp register, is really hardware dependent.
     
    88888 Dihedral, Dec 24, 2011
    #10
  11. Ben Bacarisse

    BartC Guest

    "88888 Dihedral" <> wrote in message
    news:11437228.11.1324701159954.JavaMail.geo-discussion-forums@prix23...
    > http://ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
    >
    > Check the site to combine C and assembly for GCC.


    Having to use that ghastly AT&T syntax is an argument *against* using
    assembly, even if it is faster...

    --
    Bartc
     
    BartC, Dec 24, 2011
    #11
  12. Ben Bacarisse

    James Harris Guest

    On Dec 24, 1:09 pm, "BartC" <> wrote:

    ....

    > Having to use that ghastly AT&T syntax is an argument *against* using
    > assembly, even if it is faster...


    Agreed!

    Can you give me any pointers on how to link gcc and nasm or another
    separate assembler? I've found a few web references with useful tips
    but IIRC one of them was your first attempt at this a few years ago.
    Maybe you found some more info since then. I know I can do it but
    maybe there are good ways and bad ways.

    James
     
    James Harris, Dec 24, 2011
    #12
  13. Ben Bacarisse

    BartC Guest

    "Ben Bacarisse" <> wrote in message
    news:...
    > "BartC" <> writes:


    > Well, it's not bad that you can beat 2 of the three compilers.


    I've tried one more compiler, and the results, now normalised and for the
    fastest code (and using N=2500 for more accuracy), are:

    gcc -O3 1.00 (using inlining)
    My asm 1.12 (original without inlining or unrolling)
    Pelles C -Ot 1.26
    DMC -o 1.35
    lcc-win32 -O 1.64

    (Adjusted for loop overheads)

    So while modern compilers might be ultra-sophisticated, most of these
    aren't! Admittedly they are all free downloads..

    --
    Bartc
     
    BartC, Dec 24, 2011
    #13
  14. Ben Bacarisse

    James Harris Guest

    On Dec 23, 6:43 pm, Ben Bacarisse <> wrote:

    ....

    I don't understand some of your code. Queries below.

    > Here is a test program:
    >
    > #include <stdio.h>
    > #include <stdlib.h>
    >
    > size_t binsearch(int a[], size_t length, int key)


    Why use size_t instead of unsigned int or unsigned long int,
    especially as you add it to sum in the test harness which is long int?
    Why not use unsigned for sum in the test harness?

    > {
    >      size_t low = 0, high = length;


    Why not high = length - 1?

    >      while (high > low) {
    >           size_t middle = low + (high - low)/2;


    Why not middle = (high + low) / 2?

    >           if (key == a[middle])
    >                return middle;
    >           else if (key > a[middle])
    >                low = middle + 1;
    >           else high = middle;


    Why not high = middle - 1?

    >      }
    >      return -1;
    >
    > }


    The reason for the queries is I wrote an asm binsearch but when
    comparing it with your code realised that you had some differences in
    the approach.

    James
     
    James Harris, Dec 24, 2011
    #14
  15. Ben Bacarisse

    BartC Guest

    "James Harris" <> wrote in message
    news:...
    > On Dec 23, 6:43 pm, Ben Bacarisse <> wrote:


    >> while (high > low) {
    >> size_t middle = low + (high - low)/2;

    >
    > Why not middle = (high + low) / 2?


    That form can cause an overflow when the numbers are high. Otherwise
    (high+low)/2 is marginally faster.

    --
    Bartc
     
    BartC, Dec 24, 2011
    #15
  16. Ben Bacarisse

    BartC Guest


    > Can you give me any pointers on how to link gcc and nasm or another
    > separate assembler? I've found a few web references with useful tips
    > but IIRC one of them was your first attempt at this a few years ago.
    > Maybe you found some more info since then. I know I can do it but
    > maybe there are good ways and bad ways.


    I haven't made any more attempts since then. You still have the problem in
    that the assembly code with the nice syntax has to be in a separate module,
    not inline.

    To combine, you probably know how; it means running Nasm with the right
    object format (nasm -fwin32 under Windows), and presenting the resulting
    .obj file to gcc in its list of files.

    Oh, and you have to put "_" in front of any exported names in the asm
    module, for some reason that I still don't know.

    And you have to sort out calling conventions between the two languages (C
    calling Asm, and Asm calling C).

    So it's still a bit fiddly.

    (I normally use assembly as inline code within my own language. The
    advantages are the use of Intel/Nasm syntax, with no need to write each line
    as a string literal either, and being able to write substantial amounts of
    code within a high-level language framework: functions, declarations, local
    namespaces in each function, etc.

    Also interaction with the high-level code is simple (optimisation doesn't
    get in the way, for a start!). A lot of other issues are taken care of too.

    I would call that a _good_ and easy way of using assembly. If anyone knows
    of a C compiler that offers the same, then I might start using C a lot
    more.)

    --
    bartc
     
    BartC, Dec 24, 2011
    #16
  17. Ben Bacarisse

    Jens Gustedt Guest

    Hello,

    Am 12/24/2011 01:42 PM, schrieb BartC:
    > Yes. Either using -O/-o for the first two, or not using it; and using -O3
    > or -O0 for gcc.
    > ...
    > So for this example, I don't think gcc was demonstrating any special
    > knowledge of my processor; maybe there are processor-specific switches that
    > could have been used.


    with modern gcc you can always use -march=native to optimize for the
    processor on which the compiler is running

    Jens
     
    Jens Gustedt, Dec 24, 2011
    #17
  18. Ben Bacarisse

    James Harris Guest

    On Dec 24, 3:30 pm, "BartC" <> wrote:

    ....

    > To combine, you probably know how; it means running Nasm with the right
    > object format (nasm -fwin32 under Windows), and presenting the resulting
    > .obj file to gcc in its list of files.
    >
    > Oh, and you have to put "_" in front of any exported names in the asm module,
    > for some reason that I still don't know.
    >
    > And you have to sort out calling conventions between the two languages (C
    > calling Asm, and Asm calling C).
    >
    > So it's still a bit fiddly.


    I'm aware there are various options for name mangling and parameter
    passing. I'm just not sure which of the many are best, if there is
    such a thing. I asked in response to Ben's initial post as you had got
    a test working but I'm thinking this is too big an issue for this
    subthread and needs a separate post, and a crosspost at that!
    Something to come back to.

    James
     
    James Harris, Dec 24, 2011
    #18
  19. Ben Bacarisse <> might have writ, in
    news::

    > I'm starting a new thread, but it is prompted by this remark...
    >| Do you consider to convert 5 to 10 lines of C to assembly?
    >| I did that 20 years ago on X86 cpu.
    >| It was a piece of cake to gain the 20-40% speed required if paid
    >| well.
    >
    > I've felt (it's only a feeling) that hand coding for speed (rather
    > than to access features not available in C like extra wide multiplies
    > and so on) was a thing of the past. But maybe I'm wrong. Is it still
    > possible to get a significant speed-up over C by hand coding? How
    > much depends on the processor?


    Don't know, but 20 years ago IIRC, good speedups should have been
    common.

    I did an experiment just yesterday on a snippet posted in
    comp.programming. I'll report my results here. The key line of code
    seems "wonderful" in its right: A bit-twiddler to replace a word with N
    bits set to the next such word in lexicographical order.

    Dave <> wrote, in
    news::

    > On Dec 17, 10:05 am, Andrew Tomazos <> wrote:
    >> I want to loop over every number from 0 to (2^m-1) inclusive whos
    >> binary representation contains exactly n bits.

    >
    > Try something like this:
    >
    > i = (1 << n) - 1;
    > while( !(i >> m) )
    > {
    >     do_something_with(i);
    >     i = (i+(i&(-i)))|(((i^(i+(i&(-i))))>>2)/(i&(-i)));
    > }
    > }
    >
    > The first "i =" statement sets i to the smallest number with n 1-
    > bits.
    >
    > The second "i =" statement is some bit manipulation "magic" that sets
    > i to the next sequential number that has the same number of 1-bits.
    > For an explanation, see
    > http://groups.google.com/group/programming-puzzles/msg/3a05b3c8b4042d5b
    > , which points to a PowerPoint presentation that gives a
    > multi-statement expression of the one-liner that I constructed from
    > it.
    >
    > The "while" statement loops as long as i remains less than 2^m.
    >
    > Dave


    With gcc -O{2,4} on a Pentium I got a speedup over
    > i = (i+(i&(-i)))|(((i^(i+(i&(-i))))>>2)/(i&(-i)));

    just by factoring out common subexpressions and assigning them to temp
    variables. (One might think gcc could do that; perhaps this line is
    just too complicated.)

    Buried in the expression is a '/' operator which, of course, is time
    consuming. I got a big speedup replacing that divide:
         while (!(lb & 15)) i >>= 4, lb >>= 4;
         while (!(lb & 1)) i >>= 1, lb >>= 1;
    Pentium has an opcode, I think, accessible from gcc with `ffs()', which
    seems like it *might* allow better speedup, but what I tried was slower
    than the while's.

    Note that neither of these speedup ideas (while or ffs()) required using
    assembler instead of C. IIRC it's not unusual to be able to speed up C
    by catering to the C compiler's code generator, often in non-obvious
    ways.

    James
     
    James Dow Allen, Dec 24, 2011
    #19
  20. Ben Bacarisse

    BartC Guest

    "James Harris" <> wrote in message
    news:...
    > On Dec 24, 3:30 pm, "BartC" <> wrote:


    >> And you have to sort out calling conventions between the two languages (C
    >> calling Asm, and Asm calling C).
    >>
    >> So it's still a bit fiddly.

    >
    > I'm aware there are various options for name mangling and parameter
    > passing. I'm just not sure which of the many are best, if there is
    > such a thing.


    If you're interfacing with C, the possibilities for parameter passing are
    limited; usually you have to go along with what the specific C compiler
    does, or whatever foreign interfaces are supported.

    Apparently for x86-64, parameters would be passed in registers, which is
    best for this example, but on x86-32, the stack seems to be favoured. In any
    case the 'winning entry' (gcc -O3) bypassed the whole issue by inlining the
    code (which I considered cheating).

    > I asked in response to Ben's initial post as you had got
    > a test working


    The test proposed actually would have worked well written in a separate
    module, but I couldn't remember all the fiddly details I needed to know, and
    used inline assembly in a language I was familiar with.

    > but I'm thinking this is too big an issue for this
    > subthread and needs a separate post, and a crosspost at that!
    > Something to come back to.


    The bigger issue is not so much getting these things to work, as whether it
    is still worthwhile using assembly at all, in these days of
    'super-intelligent' compilers (which I haven't really come across myself). I
    think that was the point of the thread.

    --
    Bartc
     
    BartC, Dec 24, 2011
    #20
