strtok and strtok_r

Charlie Gordon · Sep 15, 2007

Joachim Schmitz said:
Well, in one implementation that I have at hand strlen is documneted
return -1 on error (and sets errno), which led me to believe that
strlen(NULL) would do that too. Apparently not, just tested: it segfaults.

Gee, I'm curious what implementation documents this behaviour.

Joachim Schmitz · Sep 15, 2007

Charlie Gordon said:
Gee, I'm curious what implementation documents this behaviour.

Open System Services (OSS), the POSIX personality of NonStop Kernel (a
propriatary OS from HP, formerly Tandem).
Apparently it does not regeard strlen(NULL) an error... wonder what else it
would regard an error, if I don't forget and find the time, I'll check the
source to find out...

Bye, Jojo

Charlie Gordon · Sep 15, 2007

CBFalconer said:
Charlie said:

"pete" <[email protected]> a écrit:
... snip ...

Intsead of using realloc in a loop, I think most programmers
would write strdup with one function call to strlen and one to
malloc and one to strcpy.

Click to expand...

Or more efficiently calling memcpy instead of strcpy.

char *strdup(const char *str) {
size_t len;
char *dest = NULL;

if (str) {
len = strlen(str);
dest = malloc(len + 1);
if (dest) {
memcpy(dest, str, len);
dest[len] = '\0';
}
}
return dest;
}

Click to expand...

I challenge the 'more efficient'. It will be highly dependent on
the compiler, but at the simplest you would be trading the effort
of an extra procedure call against the possible efficiency
improvement. Since most strings are short (in my case, probably
under 10 or 20 chars) this 'improvement' is a chimera.

I agree with you, I am just used to memcpy, strlen and strcpy all expanding
inline.
But performance is better measured with a profiler than estimated by mental
projection.

Also bearing in mind that strdup is a system reserved name,

I'm aware of that, most systems I use provide it.

my version
(with a #include <stdlib.h>) is:

char *dupstr(const char *str) {
char *dest, *temp;

if (dest = malloc(1 + strlen(str))) {
temp = dest;
while (*temp++ = *str++) continue;
}
return dest;
}

and I am willing to let it go boom when str is NULL, for early
warning etc. of problems.

That's a valid choice. Specifying htat strdup(NULL) -> NULL would make
sense too.
I think it should be documented either way.

Flash Gordon · Sep 15, 2007

Charlie Gordon wrote, On 15/09/07 13:08:

pete said:
pete said:

Charlie said:

"Sam Harris" <[email protected]> a écrit dans le message de (e-mail address removed)...
On 15 Sep 2007 at 1:28, Charlie Gordon wrote:
You can easily write your own version
of strdup in a couple lines. I use
the following:

char *strdup(char *s)
{
char *r=0;
int i=0;
do {
r=(char *) realloc(r,++i * sizeof(char));
} while(r[i-1]=s[i-1]);
return r;
}
This proves my point.

Adding useful functions like strdup
would prevent newbies and jokers from
re-inventing them in the most cumbersome,
inefficient, ugly error prone ways.

Your function should take a const char *.
sizeof(char) is 1 by definition
Why do you cast the result of realloc ?
Your function invokes undefined behaviour
when running out of memory, it
should return NULL instead.

Click to expand...

Intsead of using realloc in a loop,
I think most programmers would write strdup with
one function call to strlen and one to malloc and one to strcpy.

Click to expand...

Or more efficiently calling memcpy instead of strcpy.

char *strdup(const char *str) {
size_t len;
char *dest = NULL;

if (str) {
len = strlen(str);
dest = malloc(len + 1);
if (dest) {
memcpy(dest, str, len);
dest[len] = '\0';
}
}
return dest;
}

I prefer
char *strdup(const char *str)
{
char *dest = NULL;

if (str) {
size_t size = strlen(str)+1;
dest = malloc(size);
if (dest)
memcpy(dest, str, size);
}
return dest;
}

Or
char *strdup(const char *str)
{
if (!str)
return NULL;
else {
size_t size = strlen(str)+1;
char *dest = malloc(size);
if (dest)
memcpy(dest, str, size);
return dest;
}
}

Joe Wright · Sep 15, 2007

Richard said:
pete said:

memcpy, surely? Why measure the string twice?

Huh? strcpy doesn't measure anything.

Keith Thompson · Sep 15, 2007

Charlie Gordon said:
But I repeat: to not use
strtok anymore, check for availability of strtok_r or implement it
locally from the public domain source that has been posted above.

Well, maybe.

The standard function strtok() is non-reentrant *and* it has some
other -- well, not bugs necessarily, but quirks. For example, the
fact that it merges multiple adjacent delimiters can be inconvenient,
though it might be just what you need. (In practice, you usually want
this behavior if the delimiter is whitespace, but not if it's
something else.)

If you're using strtok() and it already does exactly what you want
except for the lack of reentrancy, then strtok_r() (if it's available
on your system -- and if not, you can compile it yourself) is just the
thing. If your requirements are less specific, then tknsplit() might
turn out to be perfect for you -- or some other non-standard function
might be better.

Richard Heathfield · Sep 15, 2007

Joe Wright said:

Huh? strcpy doesn't measure anything.

Sorry, Joe - I was guilty of truncated exegesis. What I meant was this:
that strcpy must keep going until it hits a null terminator, and it
doesn't know in advance where that null terminator will be found, so it
must test every character. So, although it isn't measuring the string
as such, that's only because it doesn't bother to write down how long
the string is. It's still ploughing through the string, character by
character. But we've already done that with our strlen call. By using
memcpy, we can take advantage of the fact that the string has already
been measured - memcpy can use any number of platform-specific tricks
for copying multiple bytes at a time. Therefore, if the length of the
string to be copied is known in advance, it is (likely to be) more
efficient to use memcpy than strcpy.

Keith Thompson · Sep 15, 2007

CBFalconer said:
Joachim Schmitz wrote:
... snip ...
<

That is not a C system. strlen returns a size_t, which is
unsigned, and thus can never return -1.

Sure it can:

size_t strlen(const char *s)
{
if (s == NULL) return -1;
/* ... */
}

(Of course the value -1 will be converted to size_t.)

Keith Thompson · Sep 15, 2007

Joachim Schmitz said:
Fair enough, but I'd prefer my own functions to do better than that, so I
like Charlie Gordon's implementation better.

Damn, here too...
anwyway: see above

How is returning (size_t)-1 better than a seg fault?

If I pass NULL to strlen(), there's a bug in my program. I'd like to
find out about it as early as possible. If strlen() quietly returns
-1, I might not detect the error until much later.

Chris Torek · Sep 15, 2007

Ask the people who were on the committees.

(Seriously, I do not
know why the non-reentrant versions were retained without at least
some sort of cleanup.)

C99 did not change ANY of the bugs of the standard library

I am not sure I would call all of these "bugs". ("Misfeatures",
perhaps, especially trigraphs

. More seriously, just two
points here

o non reentrant functions like strtok remained and no alternative
was proposed even if POSIX had developed one.

While strtok_r() is an improvement on strtok(), it leaves one of
strtok()'s fundamental flaws in place. If one is going to "improve"
strtok(), one should at least look at the BSD strsep().

Still, importing the whole set of POSIX "_r" functions would, I
think, have been better than doing nothing.

o Buffer overflows were written into the standard itself.

This is, at best, an overstatement.

I had a lengthy discussion in comp.std.c about asctime()
and the fixed buffer of 26 position it says it needs. It
suffices to put some wrong values into the input structure
and you have a buffer overflow.

If you "put some wrong values" in, you have little hope of expecting
*anything* -- what happens in lcc-win32, for instance, if I write:

struct big { int a[1000]; };
struct big main(double oops) {
short x = strlen((char *)0x98766542);
... /* more "wrong values" as inputs as needed */
return *(struct big *)42;
}

? If you want to protect against bad inputs, you need to think
hard about which kinds of "bad inputs" to guard against, and do
some serious cost/benefit analysis.

Moreover, if your objection is that values of .tm_year greater
than 8100 (or less than or equal to some negative number) cause
problems, you can always test for that in your own implementation:

__internal_return_type __internal_worker_function_for_times(...) {
...
if (OUT_OF_RANGE(tm->tm_year)) ... signal error ...
...
}

which might be used as, e.g.:

char *asctime(const struct tm *tm) {
...
if (__internal_worker_function_for_times(...) == ERROR)
__runtime_error_trap_report("invalid parameter to asctime()");
...
}

and thus demonstrate the superior Quality of Implementation of
lcc-win32, with regard to this particular possibility. (Presumably
__runtime_error_trap_report saves the state of the program for use
in the debugger, prints a stack trace, and/or does whatever else
is good for fixing program bugs.)

rafaelc · Sep 15, 2007

How is returning (size_t)-1 better than a seg fault?

If I pass NULL to strlen(), there's a bug in my program. I'd like to
find out about it as early as possible. If strlen() quietly returns
-1, I might not detect the error until much later.

By returning -1 strlen is not being so quiet. If you check function's return
value then you can catch the error at least as good as if it'd just segfault.
It could not segfault for some reason, but you'd always be able to check the
return value and tell yourself there's something wrong.

pete · Sep 15, 2007

Richard said:
pete said:

memcpy, surely? Why measure the string twice?

I suppose I was unduly influenced by the strdup example in K&R2.

jacob navia · Sep 15, 2007

Chris said:
This is, at best, an overstatement.

A buffer overflow happens when a fixed size memory area is defined
but a program writes PAST the fixed size buffer. This is a buffer
overflow.

Now, the standard specifies a buffer length of 26 for the buffer of
asctime.

In the official C standard of 1999 we find the specifications of the
“asctime” function, page 341:
char *asctime(const struct tm *timeptr)
{
static const char wday_name[7][3] = {
"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"
};
static const char mon_name[12][3] = {
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
};
static char result[26]; // <<<<<<<------------------------!!
sprintf(result, "%.3s %.3s%3d %.2d:%.2d:%.2d %d\n",
wday_name[timeptr->tm_wday],
mon_name[timeptr->tm_mon],
timeptr->tm_mday, timeptr->tm_hour,
timeptr->tm_min, timeptr->tm_sec,
1900 + timeptr->tm_year);
return result;
}

Nowhere is specified that the year value should be less than 8900.

If you "put some wrong values" in, you have little hope of expecting
*anything*

Of course. This is exactly the kind of sloppy specifications
attitude where anything goes, and no error analysis is ever done!

Shouldn't a seriously designed function have some way of
indicating an error when some of its inputs are wrong instead of
just making a buffer overflow?

Shouldn't a standard specify either:

o a bigger buffer to accommodate ANY year up to INT_MAX?

o a maximum year where the standard says (at least) that years must be
smaller than 8100 and sets upper and lower bounds for the input
data???

THAT would be a correctly specified function. UB would be clearly
signaled. In the text as it stands in the standard there is NO MENTION
of any limit!!!

I am not the first one to discover this.

Mr Clive Feather submitted a defect report saying in substance the
same thing as I said. The committee answer was:

<quote>
Thus, asctime() may exhibit undefined behavior if any of the members of
timeptr produce undefined behavior in the sample algorithm (for example,
if the timeptr->tm_wday is outside the range 0 to 6 the function may
index beyond the end of an array).

As always, the range of undefined behavior permitted includes:
Corrupting memory
Aborting the program
Range checking the argument and returning a failure indicator (e.g., a
null pointer)
Returning truncated results within the traditional 26 byte buffer.
There is no consensus to make the suggested change or any change along
this line.
<end quote>

You read correctly. Corrupting memory (i.e. a buffer overflow) is
within the range of undefined behavior acceptable!!!!

I have the right then, to name a buffer overflow for what it is, a
buffer overflow in the C standard with all the committee behind it.

-- what happens in lcc-win32, for instance, if I write:

struct big { int a[1000]; };
struct big main(double oops) {
short x = strlen((char *)0x98766542);
... /* more "wrong values" as inputs as needed */
return *(struct big *)42;
}

? If you want to protect against bad inputs, you need to think
hard about which kinds of "bad inputs" to guard against, and do
some serious cost/benefit analysis.

Yes. Let's do this ok?

The number of bytes needed is very easy to calculate. I explained
how to do this in my tutorial about the C language page 122:

<quote>
1.26.1.1 Getting rid of buffer overflows
How much buffer space we would need to protect asctime from buffer
overflows in the worst case?

This is very easy to calculate. We know that in all cases, %d can't
output more characters than the maximum numbers of characters an integer
can hold. This is INT_MAX, and taking into account the possible negative
sign we know that:

Number of digits N = 1 + ceil(log10((double)INT_MAX));

For a 32 bit system this is 11, for a 64 bit system this is 21.

In the asctime specification there are 5 %d format specifications,
meaning that we have as the size for the buffer the expression:
26+5*N bytes

In a 32 bit system this is 26+55=81.

This is a worst case oversized buffer, since we have already counted
some of those digits in the original calculation, where we have allowed
for 3+2+2+2+4 = 13 characters for the digits. A tighter calculation can
be done like this:

Number of characters besides specifications (%d or %s) in the string: 6.
Number of %d specs 5
Total = 6+5*11 = 61 + terminating zero 62.
The correct buffer size for a 32 bit system is 62.
<end quote>

COST and BENEFIT ANALYSIS:
--------------------------
The difference between 62 and 26 is 36. For sparing 36 bytes we have a
buffer overflow. Now do your cost/benefit analysis. Since I paid
120 euros for 2GB (2*1024*1024*1024) bytes, each byte costs

0.0000000004656612873077392578125 euros. Times 36 gives:
0.00000001676380634307861328125 euros.

Is this too expensive for you?

Moreover, if your objection is that values of .tm_year greater
than 8100 (or less than or equal to some negative number) cause
problems, you can always test for that in your own implementation:

A buffer of 62 bytes will handle ANY POSSIBLE INPUT in a 32 bit
implementation, there isn't even any need for testing!!!

Are we developing software in 2007?

Or we are still living in the PDP-11?

Why this myopic attitude towards error analysis, that has led to
people leaving C as a reasonable language forever?

C == buffer overflow...

Many people thing like this already. Do we need to furnish a proof with
a buffer overflow in the text of the C standard?

Yours truly.

jacob

Keith Thompson · Sep 15, 2007

By returning -1 strlen is not being so quiet. If you check function's return
value then you can catch the error at least as good as if it'd just segfault.
It could not segfault for some reason, but you'd always be able to check the
return value and tell yourself there's something wrong.

Since returning (size_t)-1 is non-standard behavior (though it's
allowed), I'm not likely to check for it.

Sam Harris · Sep 15, 2007

Your function should take a const char *.
sizeof(char) is 1 by definition
Why do you cast the result of realloc ?
Your function invokes undefined behaviour when running out of memory, it
should return NULL instead.

Yeah, whatever. I'm a coder at a Fortune 500 company, I think I can just
about write a strdup function that works more than adequately on any
machine I'd ever want to run it on.

Mark McIntyre · Sep 15, 2007

If he would give a different answer on a different group, one of these
statements would be a lie or a joke.

If my teenage son asks me how they work out the price of credit
default swaps, I give one answer.

If junior quant analyst in a bank asks me the same question, I give an
entirely different answer.

I assure you, neither answer is a lie or a joke.

So he is a fundamentalist, ostracist, extremist...

Or he's tailoring his answer to the forum of the question.
--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan

Richard Heathfield · Sep 15, 2007

Sam Harris said:

Yeah, whatever. I'm a coder at a Fortune 500 company, I think I can
just about write a strdup function that works more than adequately on
any machine I'd ever want to run it on.

Claiming that you work for a Fortune 500 company might impress your
aunt, but the fact remains that your implementation of a string
duplication function left a lot to be desired. You would do well to
learn from your mistakes, rather than try to justify them.

P.J. Plauger · Sep 15, 2007

Mr Clive Feather submitted a defect report saying in substance the
same thing as I said. The committee answer was:

<quote>
Thus, asctime() may exhibit undefined behavior if any of the members of
timeptr produce undefined behavior in the sample algorithm (for example,
if the timeptr->tm_wday is outside the range 0 to 6 the function may
index beyond the end of an array).

As always, the range of undefined behavior permitted includes:
Corrupting memory
Aborting the program
Range checking the argument and returning a failure indicator (e.g., a
null pointer)
Returning truncated results within the traditional 26 byte buffer.
There is no consensus to make the suggested change or any change along
this line.
<end quote>

You read correctly. Corrupting memory (i.e. a buffer overflow) is
within the range of undefined behavior acceptable!!!!

I have the right then, to name a buffer overflow for what it is, a
buffer overflow in the C standard with all the committee behind it.

You have the right to misread anything. You have a responsibility, as
a self-proclaimed expert. to think with something other than your
gonads. The response you quoted clearly encompasses a variety
of nicer behaviors than buffer overflow, but you neglect to take
them in.

The committee *accepts* that buffer overflow can occur in a
conforming implementation. The same is true of:

int a[10];
a[300] = 4;

And the committee is "behind" an implementation that overwrites
storage when this ill-formed program executes. (The committee
is also "behind" an implementation that aborts with a diagnostic
message.)

Get over it.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

Mark McIntyre · Sep 15, 2007

Yeah, whatever. I'm a coder at a Fortune 500 company,

Whoopy doo. *Anyone* will tell you that this is absolutely no
guarantee whatsoever of coding quality or ability.

I think I can just
about write a strdup function that works more than adequately on any
machine I'd ever want to run it on.

Pardon me, but you had a few actual errors pointed out, perhaps you
should consider being less arrogant?
--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan

pete · Sep 16, 2007

Sam said:
Yeah, whatever. I'm a coder at a Fortune 500 company,
I think I can just
about write a strdup function that works more than adequately on any
machine I'd ever want to run it on.

You posted an extraordinarily crappy code example.

strtok and strtok_r	6	Sep 14, 2007
Why does strcat mess up the tokens in strtok (and strtok_r)?	92	Jun 11, 2014
strtok	6	Nov 25, 2005
Can't solve problems! please Help	0	Sep 26, 2022
How to compare these variables by use operator 'AND'	5	Sep 22, 2003
Tokenizer Function (plus rant on strtok documentation)	18	Jul 11, 2006
Problem with a login script, SESSION user rights and put this together so it works with the other pages and MySQL. Code examples.	2	May 5, 2023
How Do I Set text on an Image and use the image as a border?	7	Mar 16, 2023

strtok and strtok_r

Charlie Gordon

Joachim Schmitz

Charlie Gordon

Flash Gordon

Joe Wright

Keith Thompson

Richard Heathfield

Keith Thompson

Keith Thompson

Chris Torek

rafaelc

pete

jacob navia

Keith Thompson

Sam Harris

Mark McIntyre

Richard Heathfield

P.J. Plauger

Mark McIntyre

pete

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads