D.E. Knuth's strlen

jacob navia · Oct 16, 2008

<quote>
... one of the most common programming tasks is to search
through a long string of characters in order to find a
particular byte value. For example strings are often
represented as a sequence of nonzero bytes terminated
by 0. In order to locate the end of a string quickly,
we need a fast way to determine whether all eight bytes of a
given word x are nonzero (because theu usually are).
<end quote>

I discovered that quote above in the fascicle 1a in

"Bitwise Tricks and Techniques"
http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz

In that document, Knuth explains many boolean tricks, giving
the mathematical background for them too, what many other
books fail to do.

I adapted its algorithm to C. It took me a while because of a
stupid problem, but now it seems to work OK.

The idea is to read 8 bytes at a time and use some
boolean operations that can be done very fast in assembly
to skip over most of the chain, stopping only when a zero
byte appears in one of the eight bytes read.

Knuth's strlen then, looks like this.

#include <stdio.h>
#include <string.h>
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL
size_t myStrlen(char *s)
{
unsigned long long t;
char *save = s;

while (1) {
// This supposes that the input string is aligned
// or that the machine doesn't trap when reading
// a 8 byte integer at a random position like the
// x86
t = *(unsigned long long *)s;
if (H & (t - L) & ~t)
break;
s += sizeof(long long);
}
// This loop will be executed at most 7 times
while (*s) {
s++;
}
return s - save;
}

#ifdef TEST
int main(int argc,char *argv[])
{
char *str = "The lazy fox jumped over the slow dog";

if (argc > 1) {
str = argv[1];
}
printf(
"Strlen of '%s' is %d (%d)\n",
str,strlen(str),myStrlen(str));
}
#endif

Jean-Marc Bourguet · Oct 16, 2008

#include <stdio.h>
#include <string.h>
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL
size_t myStrlen(char *s)
{
unsigned long long t;
char *save = s;

while (1) {
// This supposes that the input string is aligned
// or that the machine doesn't trap when reading
// a 8 byte integer at a random position like the
// x86
t = *(unsigned long long *)s;
if (H & (t - L) & ~t)
break;
s += sizeof(long long);
}

I'd first align and then use that. You may get a trap with unaligned
access even on machine where unaligned access doesn't normally trap if you
stump on a page boundary with the next page not mapped in memory. And some
machine allows unaligned access but at a performance penality, which would
be against the goal of using that trick.

Yours,

jacob navia · Oct 16, 2008

Jean-Marc Bourguet said:
I'd first align and then use that. You may get a trap with unaligned
access even on machine where unaligned access doesn't normally trap if you
stump on a page boundary with the next page not mapped in memory. And some
machine allows unaligned access but at a performance penality, which would
be against the goal of using that trick.

Yours,

Yes, aligning could be done first:
{
intptr_t i = (intptr_t)s;
int n = i&(sizeof(long long)-1);
while (n > 0 && *s)
s++,n--;
}

robertwessel2 · Oct 16, 2008

<quote>
... one of the most common programming tasks is to search
through a long string of characters in order to find a
particular byte value. For example strings are often
represented as a sequence of nonzero bytes terminated
by 0. In order to locate the end of a string quickly,
we need a fast way to determine whether all eight bytes of a
given word x are nonzero (because theu usually are).
<end quote>

I discovered that quote above in the fascicle 1a in

"Bitwise Tricks and Techniques"http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz

In that document, Knuth explains many boolean tricks, giving
the mathematical background for them too, what many other
books fail to do.

I adapted its algorithm to C. It took me a while because of a
stupid problem, but now it seems to work OK.

The idea is to read 8 bytes at a time and use some
boolean operations that can be done very fast in assembly
to skip over most of the chain, stopping only when a zero
byte appears in one of the eight bytes read.

Knuth's strlen then, looks like this.

#include <stdio.h>
#include <string.h>
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL
size_t myStrlen(char *s)
{
unsigned long long t;
char *save = s;

while (1) {
// This supposes that the input string is aligned
// or that the machine doesn't trap when reading
// a 8 byte integer at a random position like the
// x86
t = *(unsigned long long *)s;
if (H & (t - L) & ~t)
break;
s += sizeof(long long);
}
// This loop will be executed at most 7 times
while (*s) {
s++;
}
return s - save;

Nice. And the adaptations to other size operands (32 bit longs for
platforms not supporting anything bigger than that) or wide characters
are obvious. Too bad I don't see a way of generating the constants in
a platform independent way.

Keith Thompson · Oct 17, 2008

jacob navia said:
Yes, aligning could be done first:
{
intptr_t i = (intptr_t)s;
int n = i&(sizeof(long long)-1);
while (n > 0 && *s)
s++,n--;
}

The first line is non-portable, of course. But if your goal is to
implement the fastest possible strlen *for a given system*, that's not
necessarily a problem.

Why is n an int rather than a size_t?

Keith Thompson · Oct 17, 2008

jacob navia said:
Knuth's strlen then, looks like this.

#include <stdio.h>
#include <string.h>
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL

H and L are used only within myStrlen; I'd declare them inside the
function. Yes, I know macro definitions aren't scoped that way, but
it's useful for documentation.

size_t myStrlen(char *s)
{ [snip]
}

#ifdef TEST
int main(int argc,char *argv[])
{
char *str = "The lazy fox jumped over the slow dog";

if (argc > 1) {
str = argv[1];
}
printf(
"Strlen of '%s' is %d (%d)\n",
str,strlen(str),myStrlen(str));

You're using a %d format for a size_t value.

And I'd add a "return 0;" here, even though it isn't strictly
necessary in C99.

}
#endif

As I mentioned elsethread, the algorithm (which I snipped here) is not
portable due to alignment issues, and I believe there's no way to make
it 100% portable. But there are *mostly* portable tricks that can
detect pointer alignment on *most* systems (and testing can detect
systems where the tricks don't work). And, of course, an
implementation of strlen() doesn't have to be written in 100% portable
C; it can use whatever system-specific tricks it likes.

Ian Collins · Oct 17, 2008

Keith said:
H and L are used only within myStrlen; I'd declare them inside the
function. Yes, I know macro definitions aren't scoped that way, but
it's useful for documentation.

So make them const unsigned long long.

Keith Thompson · Oct 17, 2008

Ian Collins said:
So make them const unsigned long long.

Yes, of course. There are cases where a declaration of a const object
can't replace a macro definition; I should have realized this isn't
one of them.

jacob navia · Oct 17, 2008

Eric said:
Because it's the right choice, of course.

In any case it can't be bigger than 7 or smaller
than 0, so it would fit in ANY (char,short,int,long,long long)

jacob navia · Oct 17, 2008

Keith said:
The first line is non-portable, of course. But if your goal is to
implement the fastest possible strlen *for a given system*, that's not
necessarily a problem.

Why is n an int rather than a size_t?

n is an int and not a size_t because if it were a size_t
I would have written an infinite loop!

while (n > 0) n--;

is an infinite loop if n is unsigned and you should know that.

Chris Dollin · Oct 17, 2008

jacob said:
n is an int and not a size_t because if it were a size_t
I would have written an infinite loop!

while (n > 0) n--;

is an infinite loop if n is unsigned and you should know that.

More sleep or more coffee: you need at least one of them.

I think you meant to write `>=`, not `>`.

jacob navia · Oct 17, 2008

Chris said:
More sleep or more coffee: you need at least one of them.

I think you meant to write `>=`, not `>`.

The original loop I wrote is

while (n)

and that will infinite loop. I copied it wrongly
into my message

Sorry about that.

Chris Dollin · Oct 17, 2008

jacob said:
The original loop I wrote is

while (n)

and that will infinite loop. I copied it wrongly
into my message

You're surely not saying that `while (n) n -= 1;` is an infinite
loop for unsigned n?

[Note: my use of `n -= 1` rather than `n--` is purely stylistic
and not part of my point.]

More sleep or more coffee. Perhaps it's me?

jacob navia · Oct 17, 2008

Chris said:
jacob said:

The original loop I wrote is

while (n)

and that will infinite loop. I copied it wrongly
into my message

Click to expand...

You're surely not saying that `while (n) n -= 1;` is an infinite
loop for unsigned n?

[Note: my use of `n -= 1` rather than `n--` is purely stylistic
and not part of my point.]

More sleep or more coffee. Perhaps it's me?

You are right. More sleep and less coffee...

Keith Thompson · Oct 17, 2008

jacob navia said:
Eric said:

Keith said:

[...]
Yes, aligning could be done first:
{
intptr_t i = (intptr_t)s;
int n = i&(sizeof(long long)-1);
while (n > 0 && *s)
s++,n--;
}

The first line is non-portable, of course. But if your goal is to
implement the fastest possible strlen *for a given system*, that's not
necessarily a problem.

Why is n an int rather than a size_t?

Click to expand...

Because it's the right choice, of course.

Click to expand...

In any case it can't be bigger than 7 or smaller
than 0, so it would fit in ANY (char,short,int,long,long long)

You're right, int is the best choice.

stdazi · Oct 17, 2008

jacob said:
<quote>
... one of the most common programming tasks is to search
through a long string of characters in order to find a
particular byte value. For example strings are often
represented as a sequence of nonzero bytes terminated
by 0. In order to locate the end of a string quickly,
we need a fast way to determine whether all eight bytes of a
given word x are nonzero (because theu usually are).
<end quote>

I discovered that quote above in the fascicle 1a in

"Bitwise Tricks and Techniques"
http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz

In that document, Knuth explains many boolean tricks, giving
the mathematical background for them too, what many other
books fail to do.

I adapted its algorithm to C. It took me a while because of a
stupid problem, but now it seems to work OK.

The idea is to read 8 bytes at a time and use some
boolean operations that can be done very fast in assembly
to skip over most of the chain, stopping only when a zero
byte appears in one of the eight bytes read.

Knuth's strlen then, looks like this.

#include <stdio.h>
#include <string.h>
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL
size_t myStrlen(char *s)
{
unsigned long long t;
char *save = s;

while (1) {
// This supposes that the input string is aligned
// or that the machine doesn't trap when reading
// a 8 byte integer at a random position like the
// x86
t = *(unsigned long long *)s;
if (H & (t - L) & ~t)
break;
s += sizeof(long long);
}
// This loop will be executed at most 7 times
while (*s) {
s++;
}
return s - save;
}

#ifdef TEST
int main(int argc,char *argv[])
{
char *str = "The lazy fox jumped over the slow dog";

if (argc > 1) {
str = argv[1];
}
printf(
"Strlen of '%s' is %d (%d)\n",
str,strlen(str),myStrlen(str));
}
#endif

--
jacob navia
jacob at jacob point remcomp point fr
logiciels/informatique
http://www.cs.virginia.edu/~lcc-win32

I'd like to see some benchmarks!

Nate Eldredge · Oct 17, 2008

Eric Sosman <[email protected]> said:
Eric Sosman <[email protected]> said:

jacob navia wrote: [...]

// This supposes that the input string is aligned
// or that the machine doesn't trap when reading
// a 8 byte integer at a random position like the
// x86

Click to expand...

It also supposes that there's no harm in accidentally reading
as many as seven bytes beyond the '\0'. An unaligned string with
its '\0' close to the end of a memory page could case trouble.

Taken together, the suppositions are one reason why most
real-world applications of this technique begin with a one-by-
one loop to byte off (sorry) any unaligned prefix before using
the "big steps" method once alignment is achieved.

Click to expand...

[...]

Even reading aligned words would fail when using a debugging memory
allocator that protects bytes just past the end of the allocation, down to
the byte boundary rather than just the page boundary (as one I've written
does, and I'm guessing valgrind would do as well).

Out of curiosity, how did you implement this? Note that the code here
only reads, doesn't write, and this version will never cross a page
boundary unnecessarily (assuming the page size is a multiple of
sizeof(unsigned long)). AFAIK most CPUs don't have a mechanism to trap
such a thing, except maybe via debug watchpoints or something (of which
there may be a limited number), or by single-stepping the program and
disassembling each instruction (which takes a lot of work). valgrind
emulates the CPU, which is essentially the latter.

If I were writing a compiler or standard library, I'd probably be
inclined to go ahead and make such accesses if convenient, on the
grounds that a "normally" running CPU won't notice that it happened. I
wouldn't be surprised if there are implementations that do this. Even
if it could be detected through special instrumentation, I'd say that
was up to the instrumentation suite to deal with, since it's violating
the compiler's assumptions. Maybe I'd provide an option to avoid such
accesses, for such a situation, but it wouldn't be the default.

In some sense, the instrumented code is running on a different machine
than the one the compiler is trying to target. So I would just say that
this strlen implementation is not portable to such a machine, and thus
should not be used *in that case*.

Paul Hsieh · Oct 17, 2008

<quote>
... one of the most common programming tasks is to search
through a long string of characters in order to find a
particular byte value. For example strings are often
represented as a sequence of nonzero bytes terminated
by 0. In order to locate the end of a string quickly,
we need a fast way to determine whether all eight bytes of a
given word x are nonzero (because theu usually are).
<end quote>

I discovered that quote above in the fascicle 1a in

"Bitwise Tricks and Techniques"http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz

In that document, Knuth explains many boolean tricks, giving
the mathematical background for them too, what many other
books fail to do.

I adapted its algorithm to C. It took me a while because of a
stupid problem, but now it seems to work OK.

The idea is to read 8 bytes at a time and use some
boolean operations that can be done very fast in assembly
to skip over most of the chain, stopping only when a zero
byte appears in one of the eight bytes read.

Knuth's strlen then, looks like this.

#include <stdio.h>
#include <string.h>
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL
size_t myStrlen(char *s)
{
unsigned long long t;
char *save = s;

while (1) {
// This supposes that the input string is aligned
// or that the machine doesn't trap when reading
// a 8 byte integer at a random position like the
// x86
t = *(unsigned long long *)s;
if (H & (t - L) & ~t)
break;
s += sizeof(long long);
}
// This loop will be executed at most 7 times
while (*s) {
s++;
}
return s - save;
}

This does not deal with unaligned strings very well. If its unaligned
in its beginning you abort immediately and go do it the slow way.
What you want to do is do it char by char until you are aligned, then
use the aligned mass byte check until you find the zero, then escape
out and calculate the final bytes.

I demonstrated this a while ago on my page on assembly language:

http://www.pobox.com/~qed/asmexample.html

If you search for xstrlen() I give some pure C code that does the
trick on 32 bit integers (adapting it to 64 bit is obvious.)

But as long as you are using platform specific tricks, its a little
hard to beat the MMX based solution.

James Dow Allen · Oct 18, 2008

#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL
[snip]
if (H & (t - L) & ~t)

Click to expand...

Click to expand...

This clever method seemed familiar; I checked FSF's
strlen() and indeed it uses (almost!) the same cleverness.

I compared the routines. One of the differences between
Mr. navia's strlen() and FSF's seemed so amazing I
felt obligated to mention it here. (I'll also mention
a few other differences that might be of interest.)
I've modified the FSF code excerpts for brevity or
consistency with Mr. navia's; in particular I
write "Ulong" where FSF writes "unsigned long int".

+ whereas Mr. navia's code terminates with
while (*s) s++;
fsf unrolls this into (sizeof (Ulong)) tests and
continues in the big loop when all tests fail.

Now the amazing difference:
+ whereas Mr. navia uses
(H & (t - L) & ~t)
as his "magic" test, the fsf version omits the
" & ~t " here! There is some more complicated code,
presumably for a similar purpose, but it's
disabled via "#if 0".

This rather startled me! Were Knuth and navia spendthrifts,
throwing away cycles on some ideological whim? Or had
the millions of FSF strlen() invocations I've done to
build my website all been flawed? Was it all some
cuteness to allow the same code to run on both ones-
and twos-complement machines? Worst of all, since
it takes me at least three cups of coffee to understand
such bit-twiddling code, would I end up with
irritable bowel syndrome again?

I followed the advice occasionally offered in the ng,
and ran a simple test program. The strings I use for
my website are printable and have no "negative" characters
(or greater than 0x7f, for those not too lazy to type
"unsigned"). FSF's abbreviated magic seemed to work OK
with those. Whew!, my website was safe!

It's on "negative" characters that Mr. navia's "& ~t"
demonstrates its utility. Why is FSF's code OK then?
When the magic test fails, it performs the (sizeof (Ulong))
individual tests and if no null-character is found,
it simply resumes the main loop! Why this peculiarity??
I'll leave c.l.c denizens to their own conclusions,
but it *could* be a clever speed optimization, useful
when at least 99.2% of characters passed to strlen() are
"positive"!

In addition to the obvious correction Mr. navia admits to

Yes, aligning could be done first ...

other differences of FSF's code include:

+ fsf uses 'const' in each pointer declaration
+ fsf sets its constants via (essentially) the odd-looking:
Ulong himagic;
himagic = 0x80808080L;
if (sizeof (Ulong) > 4) {
/* Do the shift in two steps to avoid a
* warning if long has 32 bits.
*/
himagic = ((himagic << 16) << 16) | himagic;
}
(I guess the warning suppressed by the double shift
would apply only when that code is unexecuted anyway.)

I suppose Mr. navia didn't check the FSF implementation.
Is this one of those deals where if the lawyers prove
you read certain source code, you're bound by certain
license restrictions?

(BTW, The only reason I had glibc source on my laptop
is that I wanted to inspect their hsearch() and see
if they ever fixed its *very* egregious performance
bug. The only file I really wanted was misc/hsearch_r.c,
but this is a *huge* file, 6410 bytes. Fortunately,
FSF is smart enough to use compression to reduce
download effort and doesn't allow access to this huge
uncompressed file. All I needed to download was
glibc-2.7.tar.gz, which is a mere 21,241,912 bytes.)

Since I've pointed out bugs in his hashlib before, I
hope it will cheer Chuck up to learn that FSF hsearch()
*still* has a design defect so laughable as to make Chuck's
seem almost well-written by comparison! The first probe
after a collision in their fancy "based-on-Knuth" reprobing
is *always* to the last table entry. (Uh, the whole idea
of fancy reprobing is to avoid secondary collisions.)

James Hussein Allen

Nate Eldredge · Oct 18, 2008

blargg said:
I used the MMU (as you might expect), which has a page size of 4096
bytes. If the allocated size is a multiple of 4096, it's trivial. If
not, then the last page of the allocation will need to be partially
protected. To handle that, I have an array of integers, one for each
page, denoting the maximum valid offset for that page. I set the last
page to no-access. Then, in the page violation exception handler, I
compare the page offset of the attempted access with the maximum valid
offset from my array. If the access is valid, I execute the access
normally, then resume normal execution, otherwise I break into the
debugger. This scheme catches both reads and writes past the end and
before the beginning, and hardly slows down the program (unless it's
making lots of accesses in a partially-protected page at the end of an
allocation). It's implemented under Mac OS 9 for the PowerPC.

Aha. Clever. So it sort of single-steps only the instructions that deal
with that last page, and the CPU tells you the address instead of you
having to disassemble it. Very nice.

If you're writing either of these, you can (and should) take advantage
of platform-specific things like this, if they allow significant
performance gains. But if you're writing portable code, you'd rather (at
least I would) not access even one byte past the end of any allocated
memory.

Agreed.

C pipe	1	Dec 9, 2021
Can someone tell me what's wrong with this question on StackOverflow?	0	Aug 19, 2023
'strlen' : Why valgrind reports invalid read ?	12	Jan 18, 2013
array-size/malloc limit and strlen() failure	26	Apr 2, 2014
Command Line Arguments	0	Mar 7, 2023
C language. work with text	3	Dec 10, 2021
strlen	5	Apr 25, 2006
For loop and strlen	8	Sep 20, 2009

D.E. Knuth's strlen

jacob navia

Jean-Marc Bourguet

jacob navia

robertwessel2

Keith Thompson

Keith Thompson

Ian Collins

Keith Thompson

jacob navia

jacob navia

Chris Dollin

jacob navia

Chris Dollin

jacob navia

Keith Thompson

stdazi

Nate Eldredge

Paul Hsieh

James Dow Allen

Nate Eldredge

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads