Can anyone improve this any further......

Discussion in 'C Programming' started by Sean Kenwrick, Feb 6, 2004.

  1. I am writing a byte-code interpreter/emulator for a language that
    exclusively uses strings for variables (i.e all variables are pushed onto
    the stack as strings). Therefore for arithmetic operations I need to
    convert the string to a 32 bit integer, carry out the arithmetic operation,
    then convert it back to a string before pushing it back onto the stack.
    This can happen millions of times per second so it is essential that I have
    optimised (for speed) atol() and ltoa() functions.

    Here is my attempt at ltoa()... (The __fastcall keyword is to tell my
    compiler to pass arguments through registers and is used instead of inline
    because it seems to be just as fast and this function gets called from
    various modules ..).

    static char * __fastcall myltoa(long n,char * s)
    {
    register long r, k;
    register int flag = 0;
    register int next;

    next = 0;
    if (n == 0) {
    s[next++] = '0';
    }
    else {
    if (n < 0) {
    s[next++] = '-';
    n = -n;
    }

    if(n < 100) goto label2;
    if(n < 100000) goto label1;

    k = 1000000000;
    r = n/k;
    if(flag) s[next++] = '0' + r;
    else if(r > 0){ s[next++] = '0' + r;flag = 1;}
    n -= r * k;

    k = 100000000;
    r = n/k;
    if(flag) s[next++] = '0' + r;
    else if(r > 0){ s[next++] = '0' + r;flag = 1;}
    n -= r * k;

    k = 10000000;
    r = n/k;
    if(flag) s[next++] = '0' + r;
    else if(r > 0){ s[next++] = '0' + r;flag = 1;}
    n -= r * k;

    k = 1000000;
    r = n/k;
    if(flag) s[next++] = '0' + r;
    else if(r > 0){ s[next++] = '0' + r;flag = 1;}
    n -= r * k;

    k = 100000;
    r = n/k;
    if(flag) s[next++] = '0' + r;
    else if(r > 0){ s[next++] = '0' + r;flag = 1;}
    n -= r * k;
    label1:
    k = 10000;
    r = n/k;
    if(flag) s[next++] = '0' + r;
    else if(r > 0){ s[next++] = '0' + r;flag = 1;}
    n -= r * k;

    k = 1000;
    r = n/k;
    if(flag) s[next++] = '0' + r;
    else if(r > 0){ s[next++] = '0' + r;flag = 1;}
    n -= r * k;

    k = 100;
    r = n/k;
    if(flag) s[next++] = '0' + r;
    else if(r > 0){ s[next++] = '0' + r;flag = 1;}
    n -= r * k;
    label2:
    k = 10;
    r = n/k;
    if(flag) s[next++] = '0' + r;
    else if(r > 0){ s[next++] = '0' + r;flag = 1;}
    n -= r * k;

    r=n;
    s[next++] = '0' + r;
    }
    s[next] = '\0';
    return(s);
    }


    The goto's are there because I recognised that the majority of numbers are
    less than 100 (mainly return codes from functions) and those that are not
    less than 100 tend to be less than 100,000 (this language is not usually
    used for mathemetics). This did give me another 15-20% improvement in
    speed on a tight loop going some arithmetic..

    By the way, I have tried pre-calculating integer values for all strings
    pushed onto the stack (by always appending the 4 byte integer value on the
    stack at the end of the string, but it actually caused a slowdown since the
    I have to do a atol() on every function return value or string concatenation
    operation even if the value is not going to be used in an arithmetic or
    comparison operation.

    Any hints would be welcome...

    Sean
     
    Sean Kenwrick, Feb 6, 2004
    #1
    1. Advertising

  2. Sean Kenwrick

    pete Guest

    Sean Kenwrick wrote:
    >
    > I am writing a byte-code interpreter/emulator for a language that
    > exclusively uses strings for variables (i.e all variables are pushed onto
    > the stack as strings). Therefore for arithmetic operations I need to
    > convert the string to a 32 bit integer, carry out the arithmetic operation,
    > then convert it back to a string before pushing it back onto the stack.
    > This can happen millions of times per second so it is essential that I have
    > optimised (for speed) atol() and ltoa() functions.
    >
    > Here is my attempt at ltoa()... (The __fastcall keyword is to tell my
    > compiler to pass arguments through registers and is used instead of inline
    > because it seems to be just as fast and this function gets called from
    > various modules ..).
    >
    > static char * __fastcall myltoa(long n,char * s)
    > {
    > register long r, k;
    > register int flag = 0;
    > register int next;
    >
    > next = 0;
    > if (n == 0) {
    > s[next++] = '0';
    > }
    > else {
    > if (n < 0) {
    > s[next++] = '-';
    > n = -n;


    Undefined behavior if (-LONG_MAX > n)

    > }
    >
    > if(n < 100) goto label2;
    > if(n < 100000) goto label1;
    >
    > k = 1000000000;
    > r = n/k;
    > if(flag) s[next++] = '0' + r;


    if(flag)
    flag is already known to be zero, at this point.

    > else if(r > 0){ s[next++] = '0' + r;flag = 1;}



    > Any hints would be welcome...


    Please time this for me:

    #include <limits.h>
    char *lto_a(long n, char *s)
    {
    char c, *p, *q;
    int flag = 0;

    q = p = s;
    if (0 > n) {
    *p++ = '-';
    ++s;
    if (-LONG_MAX > n) {
    flag = 1;
    n = LONG_MAX;
    } else {
    n = -n;
    }
    }
    do {
    *p++ = (char)(n % 10 + '0');
    n /= 10;
    } while (n != 0);
    if (flag) {
    ++*s;
    }
    for (*p = '\0'; --p > s; ++s) {
    c = *s;
    *s = *p;
    *p = c;
    }
    return q;
    }

    --
    pete
     
    pete, Feb 6, 2004
    #2
    1. Advertising

  3. "pete" <> wrote in message
    news:...
    > Sean Kenwrick wrote:
    > >
    > > I am writing a byte-code interpreter/emulator for a language that
    > > exclusively uses strings for variables (i.e all variables are pushed

    onto
    > > the stack as strings). Therefore for arithmetic operations I need to
    > > convert the string to a 32 bit integer, carry out the arithmetic

    operation,
    > > then convert it back to a string before pushing it back onto the stack.
    > > This can happen millions of times per second so it is essential that I

    have
    > > optimised (for speed) atol() and ltoa() functions.
    > >
    > > Here is my attempt at ltoa()... (The __fastcall keyword is to tell my
    > > compiler to pass arguments through registers and is used instead of

    inline
    > > because it seems to be just as fast and this function gets called from
    > > various modules ..).
    > >
    > > static char * __fastcall myltoa(long n,char * s)
    > > {
    > > register long r, k;
    > > register int flag = 0;
    > > register int next;
    > >
    > > next = 0;
    > > if (n == 0) {
    > > s[next++] = '0';
    > > }
    > > else {
    > > if (n < 0) {
    > > s[next++] = '-';
    > > n = -n;

    >
    > Undefined behavior if (-LONG_MAX > n)
    >
    > > }
    > >
    > > if(n < 100) goto label2;
    > > if(n < 100000) goto label1;
    > >
    > > k = 1000000000;
    > > r = n/k;
    > > if(flag) s[next++] = '0' + r;

    >
    > if(flag)
    > flag is already known to be zero, at this point.
    >
    > > else if(r > 0){ s[next++] = '0' + r;flag = 1;}

    >
    >
    > > Any hints would be welcome...

    >
    > Please time this for me:
    >
    > #include <limits.h>
    > char *lto_a(long n, char *s)
    > {
    > char c, *p, *q;
    > int flag = 0;
    >
    > q = p = s;
    > if (0 > n)


    > *p++ = '-';
    > ++s;
    > if (-LONG_MAX > n) {
    > flag = 1;
    > n = LONG_MAX;
    > } else {
    > n = -n;
    > }
    > }
    > do {
    > *p++ = (char)(n % 10 + '0');
    > n /= 10;
    > } while (n != 0);
    > if (flag) {
    > ++*s;
    > }
    > for (*p = '\0'; --p > s; ++s) {
    > c = *s;
    > *s = *p;
    > *p = c;
    > }
    > return q;
    > }
    >
    > --
    > pete


    After I change the call to __fastcall and use register variables It's very
    slightly slower (mine took 4650ms, yours took 4860ms for a 1,000,000 times
    loop doing some simple arithmetic). The difference is probably in the
    strrev() required at the end - but I like the solution of going from the
    right to left and I think I have a version that might not need the strrev()
    at the end..... I will post my attempt and the timing later...

    Thanks
    Sean
     
    Sean Kenwrick, Feb 6, 2004
    #3
  4. What about this?

    #include <limits.h>

    #if INT_MAX==0x7FFFFFFF && INT_MIN+2147483647>=-1
    /* 32 bit signed arithmetic assumed */
    char *myltoa(long n, char *s)
    {
    unsigned short j = 1;
    if (n<0)
    { /* handle negatives */
    s[0] = '-'; ++j;
    if (INT_MIN+2147483647==-1 && n==INT_MIN)
    { /* handle -2147483648 the hard way */
    s[1] = '2'; ++j;
    n = -147483648;
    }
    n = -n;
    }
    /* find length using kinda dichotomy */
    s[j += (n<10000)?(n<100)?(n>=10):(n>=1000)+2:
    (n<100000000)?(n<1000000)?(n>=100000)+4:
    (n>=10000000)+6:(n>=1000000000)+8] = '\0';
    /* convert from right to left */
    do s[--j] = n%10+'0'; while ((n/=10)!=0);
    /* all done */
    return(s);
    }
    #endif


    François Grieu
     
    Francois Grieu, Feb 6, 2004
    #4
  5. Sean Kenwrick

    pete Guest

    Sean Kenwrick wrote:

    > After I change the call to __fastcall and use register variables
    > It's very slightly slower (mine took 4650ms,
    > yours took 4860ms for a 1,000,000 times
    > loop doing some simple arithmetic).
    > The difference is probably in the strrev() required at the end
    > - but I like the solution of going from the
    > right to left and I think I have a version that might not need
    > the strrev() at the end.....
    > I will post my attempt and the timing later...


    Don't forget the (-LONG_MAX > n) case.
    Thank you.

    --
    pete
     
    pete, Feb 6, 2004
    #5
  6. "Sean Kenwrick" <> wrote in message
    news:c00ing$4dc$...
    >
    > "pete" <> wrote in message
    > news:...
    > > Sean Kenwrick wrote:
    > > >
    > > > I am writing a byte-code interpreter/emulator for a language that
    > > > exclusively uses strings for variables (i.e all variables are pushed

    > onto
    > > > the stack as strings). Therefore for arithmetic operations I need

    to
    > > > convert the string to a 32 bit integer, carry out the arithmetic

    > operation,
    > > > then convert it back to a string before pushing it back onto the

    stack.
    > > > This can happen millions of times per second so it is essential that I

    > have
    > > > optimised (for speed) atol() and ltoa() functions.
    > > >
    > > > Here is my attempt at ltoa()... (The __fastcall keyword is to tell my
    > > > compiler to pass arguments through registers and is used instead of

    > inline
    > > > because it seems to be just as fast and this function gets called from
    > > > various modules ..).
    > > >
    > > > static char * __fastcall myltoa(long n,char * s)
    > > > {
    > > > register long r, k;
    > > > register int flag = 0;
    > > > register int next;
    > > >
    > > > next = 0;
    > > > if (n == 0) {
    > > > s[next++] = '0';
    > > > }
    > > > else {
    > > > if (n < 0) {
    > > > s[next++] = '-';
    > > > n = -n;

    > >
    > > Undefined behavior if (-LONG_MAX > n)
    > >
    > > > }
    > > >
    > > > if(n < 100) goto label2;
    > > > if(n < 100000) goto label1;
    > > >
    > > > k = 1000000000;
    > > > r = n/k;
    > > > if(flag) s[next++] = '0' + r;

    > >
    > > if(flag)
    > > flag is already known to be zero, at this point.
    > >
    > > > else if(r > 0){ s[next++] = '0' + r;flag = 1;}

    > >
    > >
    > > > Any hints would be welcome...

    > >
    > > Please time this for me:
    > >
    > > #include <limits.h>
    > > char *lto_a(long n, char *s)
    > > {
    > > char c, *p, *q;
    > > int flag = 0;
    > >
    > > q = p = s;
    > > if (0 > n)

    >
    > > *p++ = '-';
    > > ++s;
    > > if (-LONG_MAX > n) {
    > > flag = 1;
    > > n = LONG_MAX;
    > > } else {
    > > n = -n;
    > > }
    > > }
    > > do {
    > > *p++ = (char)(n % 10 + '0');
    > > n /= 10;
    > > } while (n != 0);
    > > if (flag) {
    > > ++*s;
    > > }
    > > for (*p = '\0'; --p > s; ++s) {
    > > c = *s;
    > > *s = *p;
    > > *p = c;
    > > }
    > > return q;
    > > }
    > >
    > > --
    > > pete

    >
    > After I change the call to __fastcall and use register variables It's very
    > slightly slower (mine took 4650ms, yours took 4860ms for a 1,000,000 times
    > loop doing some simple arithmetic). The difference is probably in the
    > strrev() required at the end - but I like the solution of going from the
    > right to left and I think I have a version that might not need the

    strrev()
    > at the end..... I will post my attempt and the timing later...
    >
    > Thanks
    > Sean
    >
    >


    By the way I mis-remembered the timings above, they were actually 2650 and
    2860ms...

    Here is the version I ended up with of the above algorithm. I can be
    certain that the long passed does not exceed +/- LONG_MAX so I don't need
    these tests, and I've done away with the strrev() at the end by filling the
    string backwards from the 11th char and then returning a pointer to where
    the actual number starts in the string..

    char * __fastcall myltoa(long n, char *s)
    {
    register char *p;
    register int flag=0;

    p=s+12;
    *p--='\0'; /* Null terminate */

    // If n is negative
    if (0 > n) {
    flag=1;
    n = -n;
    }


    do {
    // Next char is n%10 (Rightmost digit)
    *p-- = (char)(n % 10 + '0');
    // Strip off the rightmost digit
    n /= 10;
    } while (n != 0); // Until n is 0

    if(flag)
    *p-- = '-';
    // Return the offset into the string where the number starts
    return p+1;
    }

    However there was no significant change in the timings so I suspect that the
    problem with the above is that it is doing two divides per digit (in fact
    one modulo and 1 divide) - and divides are typically alot slower than all
    the other arithmetic operations.. In my version I do 1 divide, 1 multiply
    and 1 subtraction which might be a saving of quite a few cycles per digit...

    I can't see any way of optimising this version any firther unless there is a
    way to get rid of one of the divides...
    Sean
     
    Sean Kenwrick, Feb 6, 2004
    #6
  7. "Francois Grieu" <> wrote in message
    news:...
    > What about this?
    >
    > #include <limits.h>
    >
    > #if INT_MAX==0x7FFFFFFF && INT_MIN+2147483647>=-1
    > /* 32 bit signed arithmetic assumed */
    > char *myltoa(long n, char *s)
    > {
    > unsigned short j = 1;
    > if (n<0)
    > { /* handle negatives */
    > s[0] = '-'; ++j;
    > if (INT_MIN+2147483647==-1 && n==INT_MIN)
    > { /* handle -2147483648 the hard way */
    > s[1] = '2'; ++j;
    > n = -147483648;
    > }
    > n = -n;
    > }
    > /* find length using kinda dichotomy */
    > s[j += (n<10000)?(n<100)?(n>=10):(n>=1000)+2:
    > (n<100000000)?(n<1000000)?(n>=100000)+4:
    > (n>=10000000)+6:(n>=1000000000)+8] = '\0';
    > /* convert from right to left */
    > do s[--j] = n%10+'0'; while ((n/=10)!=0);
    > /* all done */
    > return(s);
    > }
    > #endif
    >
    >
    > François Grieu


    This is similar to my solution expect that I didn't bother checking against
    INT_MIN/MAX (because I know that these can't be exceeded) and I didn't
    bother filling the string from the first character (so no need for the
    length calculation (I just return a pointer to the offset in the string).
    But this suffers from the same problem that I had with my version (See
    previous post) in that it effectively does two divides per digit and this
    seems to be causing the function to be slower than my orignal by about 10%..

    If there is any way to get rid of one of the divides then this would
    probably be faster..

    Sean
     
    Sean Kenwrick, Feb 6, 2004
    #7
  8. Sean Kenwrick

    pete Guest

    Sean Kenwrick wrote:

    > However there was no significant change in the timings
    > so I suspect that the problem with the above is that it
    > is doing two divides per digit
    > (in fact one modulo and 1 divide)
    > - and divides are typically alot slower than all
    > the other arithmetic operations.. In my version I do 1 divide,
    > 1 multiply and 1 subtraction which might be a saving of quite
    > a few cycles per digit...
    >
    > I can't see any way of optimising this version any
    > firther unless there is a way to get rid of one of the divides...


    There is a way, but is it faster ?

    #include <stdlib.h>
    ldiv_t ldiv(long numer, long denom);

    --
    pete
     
    pete, Feb 6, 2004
    #8
  9. "pete" <> wrote in message
    news:...
    > Sean Kenwrick wrote:
    >
    > > However there was no significant change in the timings
    > > so I suspect that the problem with the above is that it
    > > is doing two divides per digit
    > > (in fact one modulo and 1 divide)
    > > - and divides are typically alot slower than all
    > > the other arithmetic operations.. In my version I do 1 divide,
    > > 1 multiply and 1 subtraction which might be a saving of quite
    > > a few cycles per digit...
    > >
    > > I can't see any way of optimising this version any
    > > firther unless there is a way to get rid of one of the divides...

    >
    > There is a way, but is it faster ?
    >
    > #include <stdlib.h>
    > ldiv_t ldiv(long numer, long denom);
    >
    > --
    > pete


    I got rid of the extra divide and replaced it with a multiply and subtract
    (to do the modulo bit). It is now as fast as my original, and even
    slightly faster in some cases since I don't need to short-cut numbers less
    than 100,000 or 100 for example... It is also a much cleaner solution so I
    will use this instead. Here is my final version...

    char * __fastcall myltoa(long n, char *s)
    {
    register int flag=0;
    register int x;

    s+=12;
    *s--='\0'; /* Null terminate */

    /* If n is negative */
    if (0 > n) {
    flag=1;
    n = -n;
    }


    do {
    /* Next char is n%10 (Rightmost digit) */
    x=n/10;
    *s-- = n -(x*10) + '0';
    /* Strip off the rightmost digit */
    n = x;
    } while (n != 0); /* Until n is 0 */

    if(flag)
    *s-- = '-';
    /* Return the offset into the string where the number starts */
    return s+1;
    }


    Sean
     
    Sean Kenwrick, Feb 6, 2004
    #9
  10. In article <c00tva$s3c$>,
    "Sean Kenwrick" <> wrote:

    > I got rid of the extra divide and replaced it with a multiply and subtract
    > (to do the modulo bit). It is now (faster)


    In my experience, is unusual that % is slower than / is for unsigned
    quatities. I guess your system has a problem with modulo of signed
    quantities.
    Since apparently you have no requirement that the string returned
    starts where s does, maybe try:

    /* 32 bit signed arithmetic assumed */
    /* does not handle -2147483648 */
    char *myltoa(long n, char *s)
    {
    char neg;
    *(s += 11) = neg = '\0';
    if (n<0)
    {
    n = -n; /* does not handle -2147483648 */
    neg = '-';
    }
    do *--s = (unsigned)n%10+'0';
    while ((n = (unsigned)n/10)!=0);
    if (neg!='\0')
    *--s = neg;
    return(s);
    }


    François Grieu
     
    Francois Grieu, Feb 7, 2004
    #10
  11. "Francois Grieu" <> wrote in message
    news:...
    > In article <c00tva$s3c$>,
    > "Sean Kenwrick" <> wrote:
    >
    > > I got rid of the extra divide and replaced it with a multiply and

    subtract
    > > (to do the modulo bit). It is now (faster)

    >
    > In my experience, is unusual that % is slower than / is for unsigned
    > quatities. I guess your system has a problem with modulo of signed
    > quantities.
    > Since apparently you have no requirement that the string returned
    > starts where s does, maybe try:
    >
    > /* 32 bit signed arithmetic assumed */
    > /* does not handle -2147483648 */
    > char *myltoa(long n, char *s)
    > {
    > char neg;
    > *(s += 11) = neg = '\0';
    > if (n<0)
    > {
    > n = -n; /* does not handle -2147483648 */
    > neg = '-';
    > }
    > do *--s = (unsigned)n%10+'0';
    > while ((n = (unsigned)n/10)!=0);
    > if (neg!='\0')
    > *--s = neg;
    > return(s);
    > }
    >
    >
    > François Grieu


    The problem is that divide is typically ALOT slower than multiply or any
    other arithmetic operations and this is probably the case for all CPUs.
    And since the modulo operator requires a divide (I've checked the assembler
    code) then it is just as slow as a normal divide. Your algorithm above
    therefore still effectively does 2 divides per digit which is about 10%
    slower than my very similar version of your algorithm that I wrote (see
    previous post) which uses a multiple and a subtract instead of a modulo...

    Sean
     
    Sean Kenwrick, Feb 7, 2004
    #11
  12. Sean Kenwrick

    Johan Lindh Guest

    Sean Kenwrick wrote:

    > "Francois Grieu" <> wrote in message
    > news:...
    >
    >>In article <c00tva$s3c$>,
    >> "Sean Kenwrick" <> wrote:
    >>
    >>
    >>>I got rid of the extra divide and replaced it with a multiply and

    >
    > subtract
    >
    >>>(to do the modulo bit). It is now (faster)

    >>
    >>In my experience, is unusual that % is slower than / is for unsigned
    >>quatities. I guess your system has a problem with modulo of signed
    >>quantities.
    >>Since apparently you have no requirement that the string returned
    >>starts where s does, maybe try:
    >>
    >>/* 32 bit signed arithmetic assumed */
    >>/* does not handle -2147483648 */
    >>char *myltoa(long n, char *s)
    >> {
    >> char neg;
    >> *(s += 11) = neg = '\0';
    >> if (n<0)
    >> {
    >> n = -n; /* does not handle -2147483648 */
    >> neg = '-';
    >> }
    >> do *--s = (unsigned)n%10+'0';
    >> while ((n = (unsigned)n/10)!=0);
    >> if (neg!='\0')
    >> *--s = neg;
    >> return(s);
    >> }
    >>
    >>
    >> François Grieu

    >
    >
    > The problem is that divide is typically ALOT slower than multiply or any
    > other arithmetic operations and this is probably the case for all CPUs.
    > And since the modulo operator requires a divide (I've checked the assembler
    > code) then it is just as slow as a normal divide. Your algorithm above
    > therefore still effectively does 2 divides per digit which is about 10%
    > slower than my very similar version of your algorithm that I wrote (see
    > previous post) which uses a multiple and a subtract instead of a modulo...
    >
    > Sean
    >
    >


    static long mask_list[] = { 1000000000, 100000000, 10000000, 1000000,
    100000, 10000, 1000, 100, 10, 1 };
    char * __fastcall myltoa2(long n, char *s)
    {
    char *org_s = s;
    char c;
    long mask;
    int mask_index = 0;

    if( n <= 0 )
    {
    if( !n )
    {
    *s++ = '0';
    *s = '\0';
    return org_s;
    }
    n = -n;
    *s++ = '-';
    }

    while( mask_list[mask_index] > n )
    mask_index ++;

    do
    {
    c = '0';
    mask = mask_list[mask_index];
    while( n >= mask )
    {
    c++;
    n -= mask;
    }
    *s++ = c;
    mask_index ++;
    } while( n );

    *s = '\0';

    return org_s;
    }


    /Johan
     
    Johan Lindh, Feb 7, 2004
    #12
  13. Sean Kenwrick

    pete Guest

    Sean Kenwrick wrote:

    > Here is my final version...
    >
    > char * __fastcall myltoa(long n, char *s)


    Function identifiers starting with two underscores like that,
    are reserved for the implementation.
    Are you implementing C ?

    N869
    7.1.3 Reserved identifiers
    [#1]

    -- All identifiers that begin with an underscore and
    either an uppercase letter or another underscore are
    always reserved for any use.

    --
    pete
     
    pete, Feb 7, 2004
    #13
  14. Sean Kenwrick

    Old Wolf Guest

    > > char * __fastcall myltoa(long n, char *s)
    >
    > Function identifiers starting with two underscores like that,
    > are reserved for the implementation.
    > Are you implementing C ?


    No, but his compiler is implementing C-with-extensions,
    one of which is the keyword "__fastcall".
     
    Old Wolf, Feb 8, 2004
    #14
  15. "Johan Lindh" <> wrote in message
    news:...
    > Sean Kenwrick wrote:
    >
    > > "Francois Grieu" <> wrote in message
    > > news:...
    > >
    > >>In article <c00tva$s3c$>,
    > >> "Sean Kenwrick" <> wrote:
    > >>
    > >>
    > >>>I got rid of the extra divide and replaced it with a multiply and

    > >
    > > subtract
    > >
    > >>>(to do the modulo bit). It is now (faster)
    > >>
    > >>In my experience, is unusual that % is slower than / is for unsigned
    > >>quatities. I guess your system has a problem with modulo of signed
    > >>quantities.
    > >>Since apparently you have no requirement that the string returned
    > >>starts where s does, maybe try:
    > >>
    > >>/* 32 bit signed arithmetic assumed */
    > >>/* does not handle -2147483648 */
    > >>char *myltoa(long n, char *s)
    > >> {
    > >> char neg;
    > >> *(s += 11) = neg = '\0';
    > >> if (n<0)
    > >> {
    > >> n = -n; /* does not handle -2147483648 */
    > >> neg = '-';
    > >> }
    > >> do *--s = (unsigned)n%10+'0';
    > >> while ((n = (unsigned)n/10)!=0);
    > >> if (neg!='\0')
    > >> *--s = neg;
    > >> return(s);
    > >> }
    > >>
    > >>
    > >> François Grieu

    > >
    > >
    > > The problem is that divide is typically ALOT slower than multiply or any
    > > other arithmetic operations and this is probably the case for all CPUs.
    > > And since the modulo operator requires a divide (I've checked the

    assembler
    > > code) then it is just as slow as a normal divide. Your algorithm

    above
    > > therefore still effectively does 2 divides per digit which is about 10%
    > > slower than my very similar version of your algorithm that I wrote (see
    > > previous post) which uses a multiple and a subtract instead of a

    modulo...
    > >
    > > Sean
    > >
    > >

    >
    > static long mask_list[] = { 1000000000, 100000000, 10000000, 1000000,
    > 100000, 10000, 1000, 100, 10, 1 };
    > char * __fastcall myltoa2(long n, char *s)
    > {
    > char *org_s = s;
    > char c;
    > long mask;
    > int mask_index = 0;
    >
    > if( n <= 0 )
    > {
    > if( !n )
    > {
    > *s++ = '0';
    > *s = '\0';
    > return org_s;
    > }
    > n = -n;
    > *s++ = '-';
    > }
    >
    > while( mask_list[mask_index] > n )
    > mask_index ++;
    >
    > do
    > {
    > c = '0';
    > mask = mask_list[mask_index];
    > while( n >= mask )
    > {
    > c++;
    > n -= mask;
    > }
    > *s++ = c;
    > mask_index ++;
    > } while( n );
    >
    > *s = '\0';
    >
    > return org_s;
    > }
    >
    >
    > /Johan
    >
    >


    Your algorithm looked promising (faster) at first until I realised that I
    was using numbers containing low-digit combinations in my test code (E.g.
    a=1021020/324) . As soon as I used numbers with higher digit combinations
    like 83894758 then the algorithm suffered with performance since it would
    require 9 subtractions for the digit '9' etc. There is also a bug in the
    algorithm as it was presented above. It need to take care of the case
    where the number ends with trailing zeros (E.g 234000), so another loop
    needed to be placed at the end as follows:

    while(mask_index++ < 10)
    *s++='0';


    Although the performance on average is quite similar to the current best
    algorithm, I think I will stick to the version which has consistent
    performance for any number and does not rely on digits being of low value
    which could cause unexpected slowdowns...

    Sean
     
    Sean Kenwrick, Feb 9, 2004
    #15
  16. I now get the "* faster than %" point.
    Can you try this, which I hope more pipeline-friendly?

    #include <limits.h>
    #if LONG_MAX==0x7FFFFFFFu && ULONG_MAX==0xFFFFFFFFu
    /* 32 bit signed arithmetic assumed */
    /* may not handle -2147483648, but will on most modern machines */
    char *myltoa(long n, char *s)
    {
    unsigned long m;
    *(s += 11) = '\0';
    if (n>=0)
    {
    do
    {
    *--s = '0'+n-(m = (unsigned long)n/10u)*10u;
    if (m==0)
    break;
    *--s = '0'+m-(unsigned long)(n = m/10u)*10u;
    }
    while (n!=0);
    return s;
    }
    n = -n; /* may not handle -2147483648 */
    do
    {
    *--s = '0'+n-(m = (unsigned long)n/10u)*10u;
    if (m==0)
    break;
    *--s = '0'+m-(unsigned long)(n = m/10u)*10u;
    }
    while (n!=0);
    *--s = '-';
    return(s);
    }
    #endif


    Next try will use floating point arithmetic.

    François Grieu
     
    Francois Grieu, Feb 10, 2004
    #16
  17. "Francois Grieu" <> wrote in message
    news:...
    > I now get the "* faster than %" point.
    > Can you try this, which I hope more pipeline-friendly?
    >
    > #include <limits.h>
    > #if LONG_MAX==0x7FFFFFFFu && ULONG_MAX==0xFFFFFFFFu
    > /* 32 bit signed arithmetic assumed */
    > /* may not handle -2147483648, but will on most modern machines */
    > char *myltoa(long n, char *s)
    > {
    > unsigned long m;
    > *(s += 11) = '\0';
    > if (n>=0)
    > {
    > do
    > {
    > *--s = '0'+n-(m = (unsigned long)n/10u)*10u;
    > if (m==0)
    > break;
    > *--s = '0'+m-(unsigned long)(n = m/10u)*10u;
    > }
    > while (n!=0);
    > return s;
    > }
    > n = -n; /* may not handle -2147483648 */
    > do
    > {
    > *--s = '0'+n-(m = (unsigned long)n/10u)*10u;
    > if (m==0)
    > break;
    > *--s = '0'+m-(unsigned long)(n = m/10u)*10u;
    > }
    > while (n!=0);
    > *--s = '-';
    > return(s);
    > }
    > #endif
    >


    For some reason the casts to unsigned caused this to slow down by 6% over
    the current fastest. After removing the casts which appeared to be
    unnecessary then it matched the time of my previous best exactly. Thus I
    don't think the attempts to take advantage of pipelining had any effect and
    the code thus became equivalent to my version (but slightly less
    readable)...

    >
    > Next try will use floating point arithmetic.
    >
    > François Grieu


    I would be interested to see a solution based on FP arithmetic - perhaps FP
    divides are quicker than the CPU integer divides?? I would be surprised
    though...

    Sean
     
    Sean Kenwrick, Feb 10, 2004
    #17
  18. Sean Kenwrick

    Chris Torek Guest

    [code using integer division etc]

    In article <news:c0ai4a$5fh$>
    Sean Kenwrick <> writes:
    >For some reason the casts to unsigned caused this to slow down by 6% over
    >the current fastest.


    Some machines only have a "native" signed integer division, so
    unsigned integer division requires "undoing" some work the
    signed-divide instruction performed.

    On other machines, there is no difference, or unsigned integer
    division is slightly faster (e.g., pre-V8 SPARC, where there is
    no divide instruction at all, and the .udiv routine can skip
    the sign-fiddling).

    >After removing the casts which appeared to be
    >unnecessary then it matched the time of my previous best exactly.


    The "first" cast may be necessary, or at least useful, depending
    on the machine. If we ignore bizarre (but apparently legal)
    implementations in which UINT_MAX is strictly less than INT_MAX or
    -INT_MIN or both (where -INT_MIN means the mathematical value, not
    the "C value"), we still have the very common case of two's complement
    machines, in which -INT_MAX is off by one from INT_MIN: e.g.,
    INT_MAX is 32767 while INT_MIN is (numerically) -32768, or INT_MAX
    is 2147483647 while INT_MIN is -2147483648.

    (Aside: if, in this example, INT_MIN is -32768, it has to be
    expressed in C in some other form, such as (-32767 - 1), because
    the integral constant expression -32768 consists of the unary "-"
    followed by the constant 32768. But we just said INT_MAX is
    32767 -- so the integral constant 32768 has type "unsigned int",
    and negating it tells the compiler to compute the value mod 65536,
    which happens to continue to be (unsigned int)32768. So -32768
    and 32768U are the same number, on this machine. For 32-bit int
    two's complement machines, one must write (-2147483647 - 1) or
    similar, due to the same problem. The problem repeats for "long"
    as well -- I have chosen to address only "int" here, even though
    I think the original code uses "long".)

    In any case, getting back to the point at hand, in C89 integer
    division is allowed to round either towards 0 or towards -infinity,
    so that (-3)/2 is either -1 (round towards 0) or -2 (round towards
    -inf). The only constraint here is that ((a / b) * b) + (a % b)
    should produce the original value in "a" (assuming b nonzero),
    which in turn means that if (-3)/2 is -1, (-3)%2 must be -1 as
    well, while if (-3)/2 is -2, (-3)%2 must be 1. (In C99, implementations
    must round towards zero, I believe.)

    Suppose, then, that INT_MIN is (-32767-1) and INT_MAX is 32767 and
    the implementation rounds towards zero. Then suppose i is -32768
    initially, and "i = -i" leaves it set to -32768 (which is quite
    common even if it is undefined):

    int i, j, rest;
    ...
    rest = i / 10;
    j = i % 10; /* or: j = i - (rest * 10); */

    Since i is still -32768, this sets rest to -3276 (rounded towards
    0) and j is -8. j cannot be converted to a digit by adding 0.

    On the other hand, suppose the implementation rounds towards -inf.
    Then rest is set to -3277 and j is 2, and converting j to a digit
    gets '2'. If we do this in a loop, the next digit produced is '3'
    (with rest set to -328), then '2' (-33), then '7' (-4), then '6';
    and the number we will print is "-67232"!

    By using unsigned arithmetic for the first division, we guarantee
    the correct answer:

    int i, j, rest;
    ...
    rest = -(unsigned)i / 10U;
    j = (unsigned)i - (rest * 10U);

    Now we divide 32768U by 10U giving 3276U, and then subtract 32760U
    from 32768U to set j to 8. Converting to a digit gives '8' and
    rest is now a positive integer well out of the problematic range.
    The rest of the arithmetic can be done using signed divides, if
    those are faster.

    Since (I think) C99 guarantees rounding towards zero, the other
    technique that can be used is to leave the negative number negative,
    and simply adjust for the fact that j will be negative:

    int i, j, rest;
    bool negative = false; /* remember to #include <stdbool.h> */
    char *p;
    char buf[X]; /* use the log-base-8 trick to derive X */
    ...
    p = buf + X;
    *--p = '\0';
    if (i < 0) {
    negative = true;
    rest = i / 10;
    j = i - (rest * 10);

    Now "rest" is at most INT_MIN/10 and j is (e.g.) -8 as before, so:

    *--p = '0' + (-j);
    i = -rest;
    }
    do {
    rest = i / 10;
    j = i - rest * 10;
    *--p = '0' + j;
    i = rest;
    } while (i != 0);
    if (negative)
    *--p = '-';
    printf("the result of conversion is %s\n", p);

    This does assume that INT_MIN is less than -9, but the C standards
    guarantee it. :)

    >Thus I don't think the attempts to take advantage of pipelining
    >[by swapping variables inside a doubled-up loop] had any effect and
    >the code thus became equivalent to my version (but slightly less
    >readable)...


    If this kind of register-renaming pipelining will help, a good
    compiler should expand the loop for you automatically anyway.

    >I would be interested to see a solution based on FP arithmetic - perhaps FP
    >divides are quicker than the CPU integer divides?? I would be surprised
    >though...


    Division is fundamentally more difficult than multiplication (see
    comp.arch for gory hardware details), and for various reasons, many
    hardware designers are much more willing to "spend" the transisitors
    required to make hardware FP divide fast, than they are to spend
    the transistors to make integer divide fast. So it can be the case
    that FP divide happens in (say) 10 clocks while integer divide
    takes (say) 100. But there are other tricks to compensate (such
    as "reciprocal multiplication", providing the division is by a
    constant), and again a good compiler really should do them for you.
    The short version of the answer is "the speed of the operation is
    off-topic in comp.lang.c." :) The long version starts with at
    least this paragraph, and goes on for many more to discuss the
    details of the CPU(s) and/or compiler(s) in question.
    --
    In-Real-Life: Chris Torek, Wind River Systems
    Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
    email: forget about it http://web.torek.net/torek/index.html
    Reading email is like searching for food in the garbage, thanks to spammers.
     
    Chris Torek, Feb 11, 2004
    #18
  19. On Fri, 6 Feb 2004 12:24:01 +0000 (UTC), "Sean Kenwrick"
    <> wrote:

    > I am writing a byte-code interpreter/emulator for a language that
    > exclusively uses strings for variables (i.e all variables are pushed onto
    > the stack as strings). Therefore for arithmetic operations I need to
    > convert the string to a 32 bit integer, carry out the arithmetic operation,
    > then convert it back to a string before pushing it back onto the stack.
    > This can happen millions of times per second so it is essential that I have
    > optimised (for speed) atol() and ltoa() functions.

    <snip>
    > By the way, I have tried pre-calculating integer values for all strings
    > pushed onto the stack (by always appending the 4 byte integer value on the
    > stack at the end of the string, but it actually caused a slowdown since the
    > I have to do a atol() on every function return value or string concatenation
    > operation even if the value is not going to be used in an arithmetic or
    > comparison operation.
    >

    If you can store numeric value as well as string value in variables --
    assuming your language has variables (not pure functional or FORTHy)
    -- how about pushing/storing etc. the numeric value *if known* and a
    dummy value (or flag) if not; operations which need numeric value(s)
    do the atol(s) if/when they encounter the dummy; similarly push/store
    a dummy string, such as NULL, if only the numeric value is known, and
    operations needing a string value do the ltoa if/when needed.

    If you really do need to do ltoa an awful lot, your absolute best
    performance is more likely achievable only by writing in assembler.
    Especially since even if you write in C, the microoptimization results
    often vary for different architectures, and even different models --
    especially for x86, where division and different ways of
    multiplication have bounced up and down through its history -- and
    thus while you may have portable code, it isn't actually a portable
    solution to your stated requirements.

    For x86, AIR the last time this was discussed on comp.lang.asm.x86,
    the consensus for (all?) current CPUs was to use fstbp or else split
    into 4-digit (< 16 - epsilon bit) chunks and 16:32 multiply by 64K/10
    to form each digit (within each chunk). Or possibly for some models by
    64K/100 and AAM to form digit pairs.

    - David.Thompson1 at worldnet.att.net
     
    Dave Thompson, Feb 14, 2004
    #19
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Gustavo Campanelli

    Newbie question: Any way to improve this code?

    Gustavo Campanelli, Dec 5, 2003, in forum: Python
    Replies:
    10
    Views:
    445
    Gustavo Campanelli
    Dec 7, 2003
  2. luser- -droog

    any tricks to golf this code further?

    luser- -droog, Aug 3, 2011, in forum: C Programming
    Replies:
    7
    Views:
    906
    Moshbear dot Net
    Sep 11, 2011
  3. luser- -droog

    any tricks to golf this code further?

    luser- -droog, Aug 4, 2011, in forum: C Programming
    Replies:
    0
    Views:
    340
    luser- -droog
    Aug 4, 2011
  4. Chiyuan Zhang
    Replies:
    13
    Views:
    168
    MonkeeSage
    Jan 23, 2008
  5. Stuart Cullum
    Replies:
    4
    Views:
    101
    Jesús Gabriel y Galán
    Aug 14, 2009
Loading...

Share This Page