D.E. Knuth's strlen

J

jacob navia

<quote>
... one of the most common programming tasks is to search
through a long string of characters in order to find a
particular byte value. For example strings are often
represented as a sequence of nonzero bytes terminated
by 0. In order to locate the end of a string quickly,
we need a fast way to determine whether all eight bytes of a
given word x are nonzero (because theu usually are).
<end quote>


I discovered that quote above in the fascicle 1a in

"Bitwise Tricks and Techniques"
http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz

In that document, Knuth explains many boolean tricks, giving
the mathematical background for them too, what many other
books fail to do.

I adapted its algorithm to C. It took me a while because of a
stupid problem, but now it seems to work OK.

The idea is to read 8 bytes at a time and use some
boolean operations that can be done very fast in assembly
to skip over most of the chain, stopping only when a zero
byte appears in one of the eight bytes read.

Knuth's strlen then, looks like this.

#include <stdio.h>
#include <string.h>
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL
size_t myStrlen(char *s)
{
unsigned long long t;
char *save = s;

while (1) {
// This supposes that the input string is aligned
// or that the machine doesn't trap when reading
// a 8 byte integer at a random position like the
// x86
t = *(unsigned long long *)s;
if (H & (t - L) & ~t)
break;
s += sizeof(long long);
}
// This loop will be executed at most 7 times
while (*s) {
s++;
}
return s - save;
}

#ifdef TEST
int main(int argc,char *argv[])
{
char *str = "The lazy fox jumped over the slow dog";

if (argc > 1) {
str = argv[1];
}
printf(
"Strlen of '%s' is %d (%d)\n",
str,strlen(str),myStrlen(str));
}
#endif
 
J

Jean-Marc Bourguet

#include <stdio.h>
#include <string.h>
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL
size_t myStrlen(char *s)
{
unsigned long long t;
char *save = s;

while (1) {
// This supposes that the input string is aligned
// or that the machine doesn't trap when reading
// a 8 byte integer at a random position like the
// x86
t = *(unsigned long long *)s;
if (H & (t - L) & ~t)
break;
s += sizeof(long long);
}

I'd first align and then use that. You may get a trap with unaligned
access even on machine where unaligned access doesn't normally trap if you
stump on a page boundary with the next page not mapped in memory. And some
machine allows unaligned access but at a performance penality, which would
be against the goal of using that trick.

Yours,
 
J

jacob navia

Jean-Marc Bourguet said:
I'd first align and then use that. You may get a trap with unaligned
access even on machine where unaligned access doesn't normally trap if you
stump on a page boundary with the next page not mapped in memory. And some
machine allows unaligned access but at a performance penality, which would
be against the goal of using that trick.

Yours,

Yes, aligning could be done first:
{
intptr_t i = (intptr_t)s;
int n = i&(sizeof(long long)-1);
while (n > 0 && *s)
s++,n--;
}
 
R

robertwessel2

    <quote>
    ... one of the most common programming tasks is to search
    through a long string of characters in order to find a
    particular byte value. For example strings are often
    represented as a sequence of nonzero bytes terminated
    by 0. In order to locate the end of a string quickly,
    we need a fast way to determine whether all eight bytes of a
    given word x are nonzero (because theu usually are).
    <end quote>

I discovered that quote above in the fascicle 1a in

"Bitwise Tricks and Techniques"http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz

In that document, Knuth explains many boolean tricks, giving
the mathematical background for them too, what many other
books fail to do.

I adapted its algorithm to C. It took me a while because of a
stupid problem, but now it seems to work OK.

The idea is to read 8 bytes at a time and use some
boolean operations that can be done very fast in assembly
to skip over most of the chain, stopping only when a zero
byte appears in one of the eight bytes read.

Knuth's strlen then, looks like this.

#include <stdio.h>
#include <string.h>
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL
size_t myStrlen(char *s)
{
        unsigned long long t;
        char *save = s;

        while (1) {
                // This supposes that the input string is aligned
                // or that the machine doesn't trap when reading
                // a 8 byte integer at a random position like the
                // x86
                t = *(unsigned long long *)s;
                if (H & (t - L) & ~t)
                        break;
                s += sizeof(long long);
        }
        // This loop will be executed at most 7 times
        while (*s) {
                s++;
        }
        return s - save;


Nice. And the adaptations to other size operands (32 bit longs for
platforms not supporting anything bigger than that) or wide characters
are obvious. Too bad I don't see a way of generating the constants in
a platform independent way.
 
K

Keith Thompson

jacob navia said:
Yes, aligning could be done first:
{
intptr_t i = (intptr_t)s;
int n = i&(sizeof(long long)-1);
while (n > 0 && *s)
s++,n--;
}

The first line is non-portable, of course. But if your goal is to
implement the fastest possible strlen *for a given system*, that's not
necessarily a problem.

Why is n an int rather than a size_t?
 
K

Keith Thompson

jacob navia said:
Knuth's strlen then, looks like this.

#include <stdio.h>
#include <string.h>
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL

H and L are used only within myStrlen; I'd declare them inside the
function. Yes, I know macro definitions aren't scoped that way, but
it's useful for documentation.
size_t myStrlen(char *s)
{ [snip]
}

#ifdef TEST
int main(int argc,char *argv[])
{
char *str = "The lazy fox jumped over the slow dog";

if (argc > 1) {
str = argv[1];
}
printf(
"Strlen of '%s' is %d (%d)\n",
str,strlen(str),myStrlen(str));

You're using a %d format for a size_t value.

And I'd add a "return 0;" here, even though it isn't strictly
necessary in C99.

As I mentioned elsethread, the algorithm (which I snipped here) is not
portable due to alignment issues, and I believe there's no way to make
it 100% portable. But there are *mostly* portable tricks that can
detect pointer alignment on *most* systems (and testing can detect
systems where the tricks don't work). And, of course, an
implementation of strlen() doesn't have to be written in 100% portable
C; it can use whatever system-specific tricks it likes.
 
I

Ian Collins

Keith said:
H and L are used only within myStrlen; I'd declare them inside the
function. Yes, I know macro definitions aren't scoped that way, but
it's useful for documentation.
So make them const unsigned long long.
 
K

Keith Thompson

Ian Collins said:
So make them const unsigned long long.

Yes, of course. There are cases where a declaration of a const object
can't replace a macro definition; I should have realized this isn't
one of them.
 
J

jacob navia

Eric said:
Because it's the right choice, of course.

In any case it can't be bigger than 7 or smaller
than 0, so it would fit in ANY (char,short,int,long,long long)
 
J

jacob navia

Keith said:
The first line is non-portable, of course. But if your goal is to
implement the fastest possible strlen *for a given system*, that's not
necessarily a problem.

Why is n an int rather than a size_t?

n is an int and not a size_t because if it were a size_t
I would have written an infinite loop!

while (n > 0) n--;

is an infinite loop if n is unsigned and you should know that.
 
C

Chris Dollin

jacob said:
n is an int and not a size_t because if it were a size_t
I would have written an infinite loop!

while (n > 0) n--;

is an infinite loop if n is unsigned and you should know that.

More sleep or more coffee: you need at least one of them.

I think you meant to write `>=`, not `>`.
 
J

jacob navia

Chris said:
More sleep or more coffee: you need at least one of them.

I think you meant to write `>=`, not `>`.

The original loop I wrote is

while (n)

and that will infinite loop. I copied it wrongly
into my message

Sorry about that.
 
C

Chris Dollin

jacob said:
The original loop I wrote is

while (n)

and that will infinite loop. I copied it wrongly
into my message

You're surely not saying that `while (n) n -= 1;` is an infinite
loop for unsigned n?

[Note: my use of `n -= 1` rather than `n--` is purely stylistic
and not part of my point.]

More sleep or more coffee. Perhaps it's me?
 
J

jacob navia

Chris said:
jacob said:
The original loop I wrote is

while (n)

and that will infinite loop. I copied it wrongly
into my message

You're surely not saying that `while (n) n -= 1;` is an infinite
loop for unsigned n?

[Note: my use of `n -= 1` rather than `n--` is purely stylistic
and not part of my point.]

More sleep or more coffee. Perhaps it's me?

You are right. More sleep and less coffee...
 
K

Keith Thompson

jacob navia said:
Eric said:
Keith said:
[...]
Yes, aligning could be done first:
{
intptr_t i = (intptr_t)s;
int n = i&(sizeof(long long)-1);
while (n > 0 && *s)
s++,n--;
}

The first line is non-portable, of course. But if your goal is to
implement the fastest possible strlen *for a given system*, that's not
necessarily a problem.

Why is n an int rather than a size_t?
Because it's the right choice, of course.

In any case it can't be bigger than 7 or smaller
than 0, so it would fit in ANY (char,short,int,long,long long)

You're right, int is the best choice.
 
S

stdazi

jacob said:
<quote>
... one of the most common programming tasks is to search
through a long string of characters in order to find a
particular byte value. For example strings are often
represented as a sequence of nonzero bytes terminated
by 0. In order to locate the end of a string quickly,
we need a fast way to determine whether all eight bytes of a
given word x are nonzero (because theu usually are).
<end quote>


I discovered that quote above in the fascicle 1a in

"Bitwise Tricks and Techniques"
http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz

In that document, Knuth explains many boolean tricks, giving
the mathematical background for them too, what many other
books fail to do.

I adapted its algorithm to C. It took me a while because of a
stupid problem, but now it seems to work OK.

The idea is to read 8 bytes at a time and use some
boolean operations that can be done very fast in assembly
to skip over most of the chain, stopping only when a zero
byte appears in one of the eight bytes read.

Knuth's strlen then, looks like this.

#include <stdio.h>
#include <string.h>
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL
size_t myStrlen(char *s)
{
unsigned long long t;
char *save = s;

while (1) {
// This supposes that the input string is aligned
// or that the machine doesn't trap when reading
// a 8 byte integer at a random position like the
// x86
t = *(unsigned long long *)s;
if (H & (t - L) & ~t)
break;
s += sizeof(long long);
}
// This loop will be executed at most 7 times
while (*s) {
s++;
}
return s - save;
}

#ifdef TEST
int main(int argc,char *argv[])
{
char *str = "The lazy fox jumped over the slow dog";

if (argc > 1) {
str = argv[1];
}
printf(
"Strlen of '%s' is %d (%d)\n",
str,strlen(str),myStrlen(str));
}
#endif



--
jacob navia
jacob at jacob point remcomp point fr
logiciels/informatique
http://www.cs.virginia.edu/~lcc-win32


I'd like to see some benchmarks!
 
N

Nate Eldredge

Eric Sosman <[email protected]> said:
jacob navia wrote: [...]
// This supposes that the input string is aligned
// or that the machine doesn't trap when reading
// a 8 byte integer at a random position like the
// x86

It also supposes that there's no harm in accidentally reading
as many as seven bytes beyond the '\0'. An unaligned string with
its '\0' close to the end of a memory page could case trouble.

Taken together, the suppositions are one reason why most
real-world applications of this technique begin with a one-by-
one loop to byte off (sorry) any unaligned prefix before using
the "big steps" method once alignment is achieved.
[...]

Even reading aligned words would fail when using a debugging memory
allocator that protects bytes just past the end of the allocation, down to
the byte boundary rather than just the page boundary (as one I've written
does, and I'm guessing valgrind would do as well).


Out of curiosity, how did you implement this? Note that the code here
only reads, doesn't write, and this version will never cross a page
boundary unnecessarily (assuming the page size is a multiple of
sizeof(unsigned long)). AFAIK most CPUs don't have a mechanism to trap
such a thing, except maybe via debug watchpoints or something (of which
there may be a limited number), or by single-stepping the program and
disassembling each instruction (which takes a lot of work). valgrind
emulates the CPU, which is essentially the latter.

If I were writing a compiler or standard library, I'd probably be
inclined to go ahead and make such accesses if convenient, on the
grounds that a "normally" running CPU won't notice that it happened. I
wouldn't be surprised if there are implementations that do this. Even
if it could be detected through special instrumentation, I'd say that
was up to the instrumentation suite to deal with, since it's violating
the compiler's assumptions. Maybe I'd provide an option to avoid such
accesses, for such a situation, but it wouldn't be the default.

In some sense, the instrumented code is running on a different machine
than the one the compiler is trying to target. So I would just say that
this strlen implementation is not portable to such a machine, and thus
should not be used *in that case*.
 
P

Paul Hsieh

    <quote>
    ... one of the most common programming tasks is to search
    through a long string of characters in order to find a
    particular byte value. For example strings are often
    represented as a sequence of nonzero bytes terminated
    by 0. In order to locate the end of a string quickly,
    we need a fast way to determine whether all eight bytes of a
    given word x are nonzero (because theu usually are).
    <end quote>

I discovered that quote above in the fascicle 1a in

"Bitwise Tricks and Techniques"http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz

In that document, Knuth explains many boolean tricks, giving
the mathematical background for them too, what many other
books fail to do.

I adapted its algorithm to C. It took me a while because of a
stupid problem, but now it seems to work OK.

The idea is to read 8 bytes at a time and use some
boolean operations that can be done very fast in assembly
to skip over most of the chain, stopping only when a zero
byte appears in one of the eight bytes read.

Knuth's strlen then, looks like this.

#include <stdio.h>
#include <string.h>
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL
size_t myStrlen(char *s)
{
        unsigned long long t;
        char *save = s;

        while (1) {
                // This supposes that the input string is aligned
                // or that the machine doesn't trap when reading
                // a 8 byte integer at a random position like the
                // x86
                t = *(unsigned long long *)s;
                if (H & (t - L) & ~t)
                        break;
                s += sizeof(long long);
        }
        // This loop will be executed at most 7 times
        while (*s) {
                s++;
        }
        return s - save;
}

This does not deal with unaligned strings very well. If its unaligned
in its beginning you abort immediately and go do it the slow way.
What you want to do is do it char by char until you are aligned, then
use the aligned mass byte check until you find the zero, then escape
out and calculate the final bytes.

I demonstrated this a while ago on my page on assembly language:

http://www.pobox.com/~qed/asmexample.html

If you search for xstrlen() I give some pure C code that does the
trick on 32 bit integers (adapting it to 64 bit is obvious.)

But as long as you are using platform specific tricks, its a little
hard to beat the MMX based solution.
 
J

James Dow Allen

#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL
[snip]
               if (H & (t - L) & ~t)

This clever method seemed familiar; I checked FSF's
strlen() and indeed it uses (almost!) the same cleverness.

I compared the routines. One of the differences between
Mr. navia's strlen() and FSF's seemed so amazing I
felt obligated to mention it here. (I'll also mention
a few other differences that might be of interest.)
I've modified the FSF code excerpts for brevity or
consistency with Mr. navia's; in particular I
write "Ulong" where FSF writes "unsigned long int".

+ whereas Mr. navia's code terminates with
while (*s) s++;
fsf unrolls this into (sizeof (Ulong)) tests and
continues in the big loop when all tests fail.

Now the amazing difference:
+ whereas Mr. navia uses
(H & (t - L) & ~t)
as his "magic" test, the fsf version omits the
" & ~t " here! There is some more complicated code,
presumably for a similar purpose, but it's
disabled via "#if 0".

This rather startled me! Were Knuth and navia spendthrifts,
throwing away cycles on some ideological whim? Or had
the millions of FSF strlen() invocations I've done to
build my website all been flawed? Was it all some
cuteness to allow the same code to run on both ones-
and twos-complement machines? Worst of all, since
it takes me at least three cups of coffee to understand
such bit-twiddling code, would I end up with
irritable bowel syndrome again?

I followed the advice occasionally offered in the ng,
and ran a simple test program. The strings I use for
my website are printable and have no "negative" characters
(or greater than 0x7f, for those not too lazy to type
"unsigned"). FSF's abbreviated magic seemed to work OK
with those. Whew!, my website was safe!

It's on "negative" characters that Mr. navia's "& ~t"
demonstrates its utility. Why is FSF's code OK then?
When the magic test fails, it performs the (sizeof (Ulong))
individual tests and if no null-character is found,
it simply resumes the main loop! Why this peculiarity??
I'll leave c.l.c denizens to their own conclusions,
but it *could* be a clever speed optimization, useful
when at least 99.2% of characters passed to strlen() are
"positive"!

In addition to the obvious correction Mr. navia admits to
Yes, aligning could be done first ...
other differences of FSF's code include:

+ fsf uses 'const' in each pointer declaration
+ fsf sets its constants via (essentially) the odd-looking:
Ulong himagic;
himagic = 0x80808080L;
if (sizeof (Ulong) > 4) {
/* Do the shift in two steps to avoid a
* warning if long has 32 bits.
*/
himagic = ((himagic << 16) << 16) | himagic;
}
(I guess the warning suppressed by the double shift
would apply only when that code is unexecuted anyway.)

I suppose Mr. navia didn't check the FSF implementation.
Is this one of those deals where if the lawyers prove
you read certain source code, you're bound by certain
license restrictions?

(BTW, The only reason I had glibc source on my laptop
is that I wanted to inspect their hsearch() and see
if they ever fixed its *very* egregious performance
bug. The only file I really wanted was misc/hsearch_r.c,
but this is a *huge* file, 6410 bytes. Fortunately,
FSF is smart enough to use compression to reduce
download effort and doesn't allow access to this huge
uncompressed file. All I needed to download was
glibc-2.7.tar.gz, which is a mere 21,241,912 bytes.)

Since I've pointed out bugs in his hashlib before, I
hope it will cheer Chuck up to learn that FSF hsearch()
*still* has a design defect so laughable as to make Chuck's
seem almost well-written by comparison! The first probe
after a collision in their fancy "based-on-Knuth" reprobing
is *always* to the last table entry. (Uh, the whole idea
of fancy reprobing is to avoid secondary collisions.)

James Hussein Allen
 
N

Nate Eldredge

blargg said:
I used the MMU (as you might expect), which has a page size of 4096
bytes. If the allocated size is a multiple of 4096, it's trivial. If
not, then the last page of the allocation will need to be partially
protected. To handle that, I have an array of integers, one for each
page, denoting the maximum valid offset for that page. I set the last
page to no-access. Then, in the page violation exception handler, I
compare the page offset of the attempted access with the maximum valid
offset from my array. If the access is valid, I execute the access
normally, then resume normal execution, otherwise I break into the
debugger. This scheme catches both reads and writes past the end and
before the beginning, and hardly slows down the program (unless it's
making lots of accesses in a partially-protected page at the end of an
allocation). It's implemented under Mac OS 9 for the PowerPC.

Aha. Clever. So it sort of single-steps only the instructions that deal
with that last page, and the CPU tells you the address instead of you
having to disassemble it. Very nice.
If you're writing either of these, you can (and should) take advantage
of platform-specific things like this, if they allow significant
performance gains. But if you're writing portable code, you'd rather (at
least I would) not access even one byte past the end of any allocated
memory.

Agreed.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,764
Messages
2,569,567
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top