strtok and strtok_r

CBFalconer · Sep 16, 2007

Tor said:
My argument back then, was similar to yours now. However, to my
big surprise, we had to put this into the micro-optimalization
category, after measurements on quite a number of different
compilers and platforms.

Me too.

Making good measurements are usually a challenge, I have rarely
seen measurement code without some serious flaws or defects. I
can't remember the quality of the benchmark we used years ago,
but I would expect it to be a better starting point, than writing
a new one from scratch.

I don't think I'll bother doing it, but I would make something that
reads a file into a linked list, and then duplicates that linked
list. Time tha duplication, and keep the result. Change
something, such as the dupstr routine, or the malloc package, etc.
and repeat. Ensure that the malloced memory is always in a proper
condition for the timed run. The idea is to eliminate the disk
i/o, since the data is in memory in the first place. Each read
sequence should have similar, if not identical, timing penalties.

By changing the input file you can change the data manipulated, and
detect data sensitivity of the algorithms.

Keith Thompson · Sep 16, 2007

Joe Wright said:
I don't want to check strlen for error. SIZE_MAX may well be
valid. Passing in a NULL should be a NOP in my view.

size_t strlen(const char *s) {
size_t r = 0;
if (s) while (*s++) ++r;
return r;
}

How can it be a NOP? strlen() has to return some value.

I the example you've shown, strlen(NULL) isn't a NOP; it lies to you
and pretends that you passed it a valid empty string.

Tor Rustad · Sep 17, 2007

CBFalconer said:
[...]

Making good measurements are usually a challenge, I have rarely
seen measurement code without some serious flaws or defects. I
can't remember the quality of the benchmark we used years ago,
but I would expect it to be a better starting point, than writing
a new one from scratch.

Click to expand...

I don't think I'll bother doing it, but I would make something that
reads a file into a linked list, and then duplicates that linked
list. Time tha duplication, and keep the result. Change
something, such as the dupstr routine, or the malloc package, etc.
and repeat. Ensure that the malloced memory is always in a proper
condition for the timed run. The idea is to eliminate the disk
i/o, since the data is in memory in the first place. Each read
sequence should have similar, if not identical, timing penalties.

By changing the input file you can change the data manipulated, and
detect data sensitivity of the algorithms.

I don't bother diving into this either, perhaps compiler X has come up
with a great memcpy() optimization trick since last time, perhaps not.

First thing to do, would be to check the asm generated and read the CPU
optimization guides, without understanding whats's going on, a benchmark
would be rather meaningless to me. For example, if pulling in some
floating-point ops, would make the x86 optimization trick via MMX
registers, a disaster.

Pssst... one of the oldest tricks in the book to crush the Java/C++
folks, is to avoid malloc'ed memory. So, strdup() heavy code, should
rather be optimized by minimizing the usage of strdup().

Joachim Schmitz · Sep 17, 2007

Joachim Schmitz said:
Open System Services (OSS), the POSIX personality of NonStop Kernel (a
propriatary OS from HP, formerly Tandem).
Apparently it does not regeard strlen(NULL) an error... wonder what else
it would regard an error, if I don't forget and find the time, I'll check
the source to find out...

Hmm, well, as far as I can deciffer the assemly that is used for this
streln, it calculates the lenth of the string by substracting the start
address from the end address, end address being found by traversing the
tring till \0 is found. No idea how or whether that may ever lead to a
(sise_t)-1 and for sure the function does not set errno, so apparently the
documentaton is plain wrong.
If I have the time, I'll file a bug report.

Bye, Jojo

Joachim Schmitz · Sep 17, 2007

Joachim Schmitz said:
Well, that version of strlen doesn't distinguish between a NULL and an
empty string. This would:
size_t strlen(const char *s) {
size_t r = (size_t)-1;
if (s) while (*s++) ++r;
return ++r;
}

Nonsense, thinko...
if (s) { r=0; while (*s++) ++r; }
return r;
or simpler
size_t strlen(const char *s) {
size_t r = 0;
if (s) while (*s++) ++r;
else r = (size_t)-1; /* and possibly set errno, e.g. to EINVAL */
return r;
}

James Antill · Sep 17, 2007

Well, some 5 years ago, I made a similar comment on your code Richard,
which was using strcpy() at the time. We had a rather "long" argument
about it, and in the end, I tried to make my point by measuring memcpy()
vs strcpy() performance.

IIRC, the result of those tests, was rather humiliating for me, as your
strcpy() performed excellent!

Are you sure you weren't arguing about:

strcpy(dst, src);
vs.
memcpy(dst, src, strlen(src) + 1);

....as I find it hard to believe that Richard was arguing that strcpy
would be faster (possibly on certain platforms it's very close, but
not faster).

Richard Heathfield · Sep 17, 2007

James Antill said:

Are you sure you weren't arguing about:

strcpy(dst, src);
vs.
memcpy(dst, src, strlen(src) + 1);

...as I find it hard to believe that Richard was arguing that strcpy
would be faster (possibly on certain platforms it's very close, but
not faster).

Kind of you, James, and I must admit I find it hard to imagine arguing that
way as well. Unfortunately, I remember that I had some pretty strange
misconceptions about C a decade or so ago, so it's not utterly impossible.
But no, I don't remember this particular debate. Your mod seems plausible,
however. (I can't think of any reason why I'd want to call memcpy in that
way, however.)

Tor Rustad · Sep 17, 2007

James said:
]

Well, some 5 years ago, I made a similar comment on your code Richard,
which was using strcpy() at the time. We had a rather "long" argument
about it, and in the end, I tried to make my point by measuring memcpy()
vs strcpy() performance.

IIRC, the result of those tests, was rather humiliating for me, as your
strcpy() performed excellent!

Click to expand...

Are you sure you weren't arguing about:

strcpy(dst, src);
vs.
memcpy(dst, src, strlen(src) + 1);

Yes, 100% sure.

...as I find it hard to believe that Richard was arguing that strcpy
would be faster (possibly on certain platforms it's very close, but
not faster).

No no, Richard didn't say that strcpy was faster. IIRC, he just said
something along the lines, that my claims would be compiler- and system
dependent. *grr*

The burden of proof, was very much on my side, to show that memcpy
performed better. *duh*

For short strings, it isn't that strange if strcpy outperform memcpy,
since an optimized memcpy implementation, typically will have some
overhead (e.g. alignment code).

Tim Hagan · Sep 17, 2007

Richard Heathfield said:
Willem said:

It is a great shame that computer programming has become such a
commoditised task, with individual excellence being suppressed
by artificial deadlines and ludicrous budgets, so that nobody
who actually cares about the quality of the source code they
produce is able to spend the necessary time on it to get it
right, if they wish to compete in the market-place against
those who are perfectly content to churn out any old junk
as long as it can pass a badly-designed UAT. Our society
gets the bugs it deserves, by failing to insist on only
the highest quality code. This may explain the current
trend towards bozo-friendly languages that eschew all
pretence of high performance in favour of protecting
the programmer against his own silly mistakes. This
is why we are saddled with Gates's Law ("the speed
of software halves every eighteen months"). If we
insisted on programmers knowing their subject as
we insist on brain surgeons knowing theirs, the
whole of our society would have better, faster
software; software on which it could rely. To
buck the market, though, is becoming far too
expensive, and so it is unlikely that we'll
ever get high quality software, unless the
market manages to find a way to encourage
quality across the industry, rather than
punishing those companies that have the
courage and integrity to turn out high
quality programs, albeit at a greater
initial cost and therefore at prices
that seem unattractive to the naive
software purchaser. If we are able
to discover such a way, the whole
of society will be better off as
a result. It will not be simple
to accomplish such a change in
the current market-place, but
if we do not do so, then the
software we continue to use
day by day will remain, as
now, broken by mis-design
and an embarrassing wart
on an advanced society.

It is a great shame that
you could not elaborate
further on the subject
at hand. Continuation
of these well-formed
thoughts on current
trends in software
development would
have brought the
discussion to a
more effective
conclusion. I
am therefore
adding some
lines such
as these.
Yet more
words I
append
so it
will
end
in
a
..

user923005 · Sep 17, 2007

"Richard Heathfield" <[email protected]> a écrit dans le message de (e-mail address removed)...

If he would give a different answer on a different group, one of these
statements would be a lie or a joke.
So he is a fundamentalist, ostracist, extremist...

Not if it is true in one forum and false in another.

CBFalconer · Sep 17, 2007

Tor said:
James Antill wrote:
.... snip ...

For short strings, it isn't that strange if strcpy outperform
memcpy, since an optimized memcpy implementation, typically will
have some overhead (e.g. alignment code).

Which is all I was claiming in my reply two days ago, except I
included the cost of actually doing the memcpy call.

user923005 · Sep 17, 2007

"Richard Heathfield" <[email protected]> a écrit dans le message de (e-mail address removed)...

Except CBFalconer is no John F. Kennedy ;-)
His blunt rethoric does not bolster any one's morale, sarcasm does no good.

You are right, I should not have attributed to malice that which can be
adequately explained by plain ignorance. But I repeat: to not use strtok
anymore, check for availability of strtok_r or implement it locally from the
public domain source that has been posted above.

It's trivial to write one:

#include <string.h>
#include <limits.h>
#include <stdlib.h>
#include <ctype.h>

/* The default delimiters are chosen as some ordinary white space
characters: */
static const char default_delimiters[] =
{' ', '\n', '\t', '\r', '\f', 0};

/*
* The tokenize() function is similar to a reentrant version of
strtok().
* It parses tokens from 's', where tokens are substrings separated
by
* characters from 'delimiter_list'.
* To get the first token from 's', tokenize() is called with 's' as
its first
* parameter.
* Remaining tokens from 's' are obtained by calling tokenize() with
NULL for
* the first parameter.
* The s of delimiters, identified by 'delimiter_list', can change
from call
* to call.
* If the list of delimiters is NULL, then the standard list
'default_delimiters'
* (see above) is used.
* tokenize() modifies the memory pointed to by 's', because it writes
null
* characters into the buffer.
*/
char *tokenize(char *s, const char *delimiter_list, char
**placeholder)
{
if (delimiter_list == NULL)
delimiter_list = default_delimiters;

if (delimiter_list[0] == 0)
delimiter_list = default_delimiters;

if (s == NULL)
s = *placeholder;

if (s == NULL)
return NULL;
/*
* The strspn() function computes the length of the initial segment of
the first
* string that consists entirely of characters contained in the second
string.
*/
s += strspn(s, delimiter_list);
if (!s[0]) {
*placeholder = s;
return NULL;
} else {
char *token;
token = s;
/*
* The strpbrk() function finds the first occurrence of any character
contained in
* the second string found in the first string.
*/
s = strpbrk(token, delimiter_list);
if (s == NULL)
*placeholder = token + strlen(token);
else {
*s++ = 0;
*placeholder = s;
}
return token;
}
}

#ifdef UNIT_TEST
char ts0[] =
"This is a test. This is only a test. If it were an actual emergency,
you would"
" be dead.";
char ts1[] =
"This is a also a test. This is only a test. If it were an actual
emergency, you"
" would be dead. 12345";
char ts2[] =
"The quick brown fox jumped over the lazy dog's back 1234567890
times.";
char ts3[] =
" \t\r\n\fThe quick brown fox jumped over the lazy dog's back
1234567890 times.";
char ts4[] =
"This is a test. This is only a test. If it were an actual emergency,
you would"
" be dead.";
char ts5[] =
"This is a also a test. This is only a test. If it were an actual
emergency, you"
" would be dead. 12345";
char ts6[] =
"The quick brown fox jumped over the lazy dog's back 1234567890
times.";
char ts7[] =
" \t\r\n\fThe quick brown fox jumped over the lazy dog's back
1234567890 times.";

#include <stdio.h>

char whitespace[UCHAR_MAX + 1];

/*
This test will create token separators as any whitespace or any
punctuation
marks:
*/
void init_whitespace()
{
int i;
int index = 0;
for (i = 0; i < UCHAR_MAX; i++) {
if (isspace(i)) {
whitespace[index++] = (char) i;
}
if (ispunct(i)) {
whitespace[index++] = (char) i;
}
}
}

/*
TNX Gerd.
*/
void spin_test(char *ts, char *white)
{
char *p = NULL;
char *token;
token = tokenize(ts, white, &p);
while (token) {
puts(token);
token = tokenize(NULL, white, &p);
}
}

int main(void)
{
init_whitespace();
puts("Whitespace is whitespace+punctuation");
spin_test(ts0, whitespace);
spin_test(ts1, whitespace);
spin_test(ts2, whitespace);
spin_test(ts3, whitespace);
puts("Whitespace is simple whitespace");
spin_test(ts4, NULL);
spin_test(ts5, NULL);
spin_test(ts6, NULL);
spin_test(ts7, NULL);
return 0;
}
#endif

user923005 · Sep 17, 2007

You posted an extraordinarily crappy code example.

True, but his rendition of "Somewhere Over The Rainbow" was
tremendous.

CBFalconer · Sep 17, 2007

Joachim said:
Nonsense, thinko...
if (s) { r=0; while (*s++) ++r; }
return r;
or simpler
size_t strlen(const char *s) {
size_t r = 0;
if (s) while (*s++) ++r;
else r = (size_t)-1; /* and possibly set errno, e.g. to EINVAL */
return r;
}

If you bother to stop and think about it, I think you will see that
all four routines do the same thing. The only difference is the
amount of code executed, and there the original is the best. It
may not be the most obvious. Think about the value of ++r when the
while loop is never executed.

CBFalconer · Sep 18, 2007

user923005 said:
.... big snip ...

It's trivial to write one:

.... snip code ...

Why such a monster? I posted tknsplit (not a strtok clone) in 21
lines last Friday in this thread, and the remainder of the post was
testing code. tknsplit won't overwrite anything, will detect
omitted params, and won't modify the parameter line.

CBFalconer · Sep 18, 2007

user923005 said:
Not if it is true in one forum and false in another.

Mr Gordon seems to be unaware of the fact that this newsgroup deals
_strictly_ with standard C, as defined in the various ISO C
standards, and K&R for pre-standardization versions. Material not
included in those standards is off-topic, unless full std C code to
implement them is included. For system dependent material, go to a
newsgroup that deals with that system.

Ben Bacarisse · Sep 18, 2007

James Antill said:
Are you sure you weren't arguing about:

strcpy(dst, src);
vs.
memcpy(dst, src, strlen(src) + 1);

...as I find it hard to believe that Richard was arguing that strcpy
would be faster (possibly on certain platforms it's very close, but
not faster).

I was surprised to find it was. I decided to write a small test, and
for short strings, strdup written with strcpy is faster than strdup
written with memcpy (they both use strlen first to find out how much
to copy). The cut-off point on my laptop is at 164 bytes without any
optimisation. With -O2 the cutoff goes down to 16 bytes, but for
small strings, strcpy seems faster. The traditional YMMV seems to
fall short of the mark here -- it almost certainly will.

Yucky code available, if anyone else wants to try without writing
their own.

Ben Bacarisse · Sep 18, 2007

CBFalconer said:
If you bother to stop and think about it, I think you will see that
all four routines do the same thing.

Nope. The third and fourth correctly implement the specification
given for the second (which simply re-implements the behaviour of
the first).

user923005 · Sep 18, 2007

... big snip ...

... snip code ...

Why such a monster? I posted tknsplit (not a strtok clone) in 21
lines last Friday in this thread, and the remainder of the post was
testing code. tknsplit won't overwrite anything, will detect
omitted params, and won't modify the parameter line.

Mine is about 21 lines also. The rest of the code is a unit test
driver.
It overwrites the original, but that avoids memory allocation.

CBFalconer · Sep 18, 2007

Ben said:
.... snip ...

I was surprised to find it was. I decided to write a small test,
and for short strings, strdup written with strcpy is faster than
strdup written with memcpy (they both use strlen first to find
out how much to copy). The cut-off point on my laptop is at 164
bytes without any optimisation. With -O2 the cutoff goes down
to 16 bytes, but for small strings, strcpy seems faster. The
traditional YMMV seems to fall short of the mark here -- it
almost certainly will.

Just for fun, since you have the test code available, try comparing
it against the following small piece. Use the same optimization
levels etc.

char *dupstr(const char *s) {
char *p, *q;

if (p = q = malloc(1 + strlen(s)))
while (*p++ = *s++) continue;
return q;
} /* dupstr, untested */

which needs "#include <string.h>" and "#include <stdlib.h".

strtok and strtok_r	6	Sep 14, 2007
Why does strcat mess up the tokens in strtok (and strtok_r)?	92	Jun 11, 2014
strtok	6	Nov 25, 2005
Can't solve problems! please Help	0	Sep 26, 2022
How to compare these variables by use operator 'AND'	5	Sep 22, 2003
Tokenizer Function (plus rant on strtok documentation)	18	Jul 11, 2006
Problem with a login script, SESSION user rights and put this together so it works with the other pages and MySQL. Code examples.	2	May 5, 2023
How Do I Set text on an Image and use the image as a border?	7	Mar 16, 2023

strtok and strtok_r

CBFalconer

Keith Thompson

Tor Rustad

Joachim Schmitz

Joachim Schmitz

James Antill

Richard Heathfield

Tor Rustad

Tim Hagan

user923005

CBFalconer

user923005

user923005

CBFalconer

CBFalconer

CBFalconer

Ben Bacarisse

Ben Bacarisse

user923005

CBFalconer

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads