Why does strcat mess up the tokens in strtok (and strtok_r)?

Ian Collins · Jun 14, 2014

BartC said:
Either it makes a copy of the contents of a string, or it doesn't. If not,
then it doesn't do the same thing as the following C code. If it does copy,
and is still nearly as fast as returning a pointer, then how does it do
that?

Return value optimisation (RVO). The copy goes directly into the return
value location to avoid creating (and copying) a temporary object. This
handy little compiler trick makes returning non-trivial objects from
functions efficient.

Ian Collins · Jun 14, 2014

Keith said:
A sufficiently clever optimizing compiler might perform that
optimization, but you shouldn't count on it. The compiler would have to
prove beyond doubt that the length of str doesn't change from one
iteration of the loop to the next. If you write:

for (i = 0; i <= strlen(str); i ++) { ... }

you're telling it to evaluate strlen(str) once per iteration; the
compiler can disregard that only if it can prove that it doesn't change
the programs behavior.

I thought gcc would if str where declared const, but it doesn't which
just goes to show you can't rely on these things! Sun Studio cc inlines
strlen, but again, it doesn't optimise to a single call.

Joe Pfeiffer · Jun 14, 2014

DFS said:
On 06/11/2014 10:45 PM, Ben Bacarisse wrote:
char longestWords[2] = "";

You have 2 bytes (one taken with a null) and you write the whole output
string to it. That scribbles on goodness know what other objects. Make
this array big enough for the intended use (or pass its size to
getLongestWords and use snprintf).

This kind of string-handling and memory mgmt stuff make C less "fun"
than it could be.

Click to expand...

True - but for precisely that reason, it (or C++, which makes memory
management only slightly easier) is often the language used to implement
many of the more "fun" languages.

Click to expand...

I never understood how C was used to write a C compiler, or in fact
how one computer language is developed using a different language.

I always thought you wrote the higher-level language in a lower-level
language.

I assume VB was developed with C, and C was developed with assembler?

It would seem like it, but this does not turn out to be the case. If
you've got a working implementation of any sufficiently capable
language, you can use that language to write a compiler for whatever
language you want. Your compiler will (normally) translate to machine
code or assembly code (relying on an assembler to go to machine code),
but that doesn't mean the compiler itself needs to be written in
assembler.

DFS · Jun 14, 2014

I thought gcc would if str where declared const, but it doesn't which
just goes to show you can't rely on these things!

I'm curious: How do you know gcc evaluates strlen(str) each iteration?

Sun Studio cc inlines strlen, but again, it doesn't optimise to a single call.

What does it mean to 'inline strlen'?

Ben Bacarisse · Jun 14, 2014

Ian Collins said:
I thought gcc would if str where declared const, but it doesn't which
just goes to show you can't rely on these things! Sun Studio cc
inlines strlen, but again, it doesn't optimise to a single call.

gcc here (4.8.2) will do both -- inline it and move it out of the loop,
though at higher levels of optimisation it only does one (more it out of
the loop). Of course, it depends on the fact that the code is simple
and contains no transfers of control outside of the translation unit
(other than strlen -- but the compiler is permitted to know what that
does).

(I don't think const can help here. After all, it says nothing that the
compiler can't check anyway.)

Joe Pfeiffer · Jun 14, 2014

DFS said:
I'm curious: How do you know gcc evaluates strlen(str) each iteration?

It's defined as evaluating the termination condition on every
iteration. That includes making any function calls.

Didn't I already say that?

What does it mean to 'inline strlen'?

Instead of generating code to actually call the function, it takes the
actual code of the function itself and inserts it into the program (so
it puts it in line).

I'm a little surprised that a compiler that could inline something as
simple as strlen() couldn't then optimize to a single calculation.

Ian Collins · Jun 14, 2014

DFS said:
I'm curious: How do you know gcc evaluates strlen(str) each iteration?

By misreading the generated assembly

It does move strlen(str) out of
the loop.

What does it mean to 'inline strlen'?

It includes the strlen cone inline rather than calling the function.

Ben Bacarisse · Jun 14, 2014

DFS said:
I'm curious: How do you know gcc evaluates strlen(str) each iteration?

I do it by adding -S and looking at the code generated (see man gcc).

What does it mean to 'inline strlen'?

It means replacing the call with a copy of the body of the function, or,
more loosely, replacing the call with code that is functionally
equivalent. For example, when optimising for space on x86, gcc might
replace a call to strlen with a "repnz scasb" micro-loop. (I don't want
to call it "an instruction" but it's not a normal loop either.)

James Kuyper · Jun 14, 2014

On 06/13/2014 05:41 PM, Ike Naar wrote: ....

Or, more succinctly,

if (strchr(delim,str))

Click to expand...

Given char delim[4] = " ,-";

How does strchr(delim,str) know which delim character to use?

It uses the entire string stored in delim. It searches that string for
the first match to str. It returns a pointer to the first such match,
or NULL if there is no such match. The if() condition will therefore be
true if str matches any of the delimiters, and false otherwise.

Tim Rentsch · Jun 14, 2014

Ben Bacarisse said:
[snip]

I was thinking of using existing functions that are better
behaved that strtok:

int getMaxWordLength(const char *str, const char *delim)
{
size_t max = 0;
while (*str) {
size_t l = strcspn(str, delim);
if (l > max) max = l;
str += l + strspn(str, delim);
}
return max;
}

Programming interview test question - What is wrong
with this code, and why does it not matter?

Tim Rentsch · Jun 14, 2014

[snip]

Fast is good!

I wrote this last night:

===========================================================================

//FIBONACCI SERIES
//0,1,1,2,3,5,8,13,21,34,55,89...

[...]

int Fib(int num) {
int i;
if(num==0) {i = 0;}
else if(num==1) {i = 1;}
else if(num>=2) {i = Fib(num-1) + Fib(num-2);}
return i;
}

[...]

// Then I found Binet's formula:
// http://mathworld.wolfram.com/BinetsFibonacciNumberFormula.html

int FibBinet(int num) {
int i = (pow(1+sqrt(5),num) - pow(1-sqrt(5),num)) / (pow(2,num) *
sqrt(5));
return i;
}

A few comments...

1. Using floating point may give problems with rounding
behavior.

2. Using double probably limits the results to 50-ish bits
(ie, on common processors).

3. It's easy to write an integer version that gives results
exact to 64 bits, and fairly quickly, by simple iteration, or
three parameter recursion.

4. Using d'Ocagne's identity, an integer four parameter
recursive version can be written that is both 64-bit
exact and runs faster than the floating point version.

I leave 3 and 4 as exercises (but feel free to ask for
help if you get stuck).

Barry Schwarz · Jun 14, 2014

I was previously using strcat since it was easier, but I thought strncat
handled memory better?

"Use the strcat function with caution as it is easy to concatenate more
bytes into your variable using the strcat function, which can cause
unpredictable behavior."

Not sure what you mean:

What I mean is you can define strcat as
char* strcat(char *tgt, const char *src)
{
return strncat(tgt, src, strlen(src));
}

Using strncat this way will always copy exactly strlen(src) bytes to
the end of tgt and then append exactly one '\0' after that. This
happens to be the exact definition of what strcat does.

strtokens.c:86:7: error: too many arguments to function ‘strcat’
strcat(longestWords, addWord, strlen(addWord));

I said it could be an implementation of strcat, not a definition. See
above.

BartC · Jun 14, 2014

Ian Collins said:
Return value optimisation (RVO). The copy goes directly into the return
value location to avoid creating (and copying) a temporary object.

So does it make a copy of that 1000000-character string or not? Or does it
simply save having to copy it twice?

(If I write this in Python:

def copy-of_in_py(s):
return s

then it doesn't even copy it once! And is also *considerably* easier on the
eyes without all that '::std::string' and 'const' clutter!)

Ike Naar · Jun 14, 2014

Regardless, I rewrote it without strtok, and to handle at most 3
delimiters. Seems to work well.

int getMaxWordLength(char *str, char *delim) {
int i=0,j=0,maxWordLen=0;

for(i=0;i<=strlen(str);i++) {
if(str==delim[0] || str==delim[1] || str==delim[2] ||
str=='\0') {

Here lurks a subtle bug:
1) if delim has less than three elements, the behaviour is undefined if
e.g. delim[2] is ever accessed.
2) if delim does have at least three elements but strlen(delim) < 2, the
conditional expression will reference characters that appear after
delim's null terminator, which may lead to surprising results.

Here's an example:

char words[] = "BACH:LISZT:CAGE";
char delimiters[3];
strcpy(delimiters, ":");
int length = getMaxWordLength(words, delimiters);
/* expecting length=5 i.e. length of longest word in {BACH,LISZT,CAGE} */

but what might actually happen:

char words[] = "BACH:LISZT:CAGE";
char delimiters[3];
/* delimiters[] contains three indeterminate values, say {'X','Y','Z'}
strcpy(delimiters, ":");
/* delimiters[] contains {':','\0','Z');
int length = getMaxWordLength(words, delimiters);
/* finding length=4 i.e. length of longest word in {BACH,LIS,T,CAGE} */

Ian Collins · Jun 14, 2014

BartC said:
So does it make a copy of that 1000000-character string or not? Or does it
simply save having to copy it twice?

Given the requirement is to copy the string, then the function wouldn't
be much use if it didn't, would it? The the actual copy of the data
will most likely not happen until the the returned value gets changed
assuming the string uses copy on write.

(If I write this in Python:

def copy-of_in_py(s):
return s

then it doesn't even copy it once!

So how can it be a "copy-of_in"?

And is also *considerably* easier on the
eyes without all that '::std::string' and 'const' clutter!)

Part of that (the initial "::") is unnecessary clutter in C++. The rest
is telling the compiler we are using string from the std namespace, not
another.

BartC · Jun 14, 2014

Ian Collins said:
BartC wrote:

Given the requirement is to copy the string, then the function wouldn't be
much use if it didn't, would it? The the actual copy of the data will
most likely not happen until the the returned value gets changed assuming
the string uses copy on write.

So the million characters (or whatever it might be), *are* copied, it just
gets delayed until a bit later (presumably just so you can claim that the
function itself is fast!). Unless of course the new string never gets
modified, in which case you can probably write a more efficient C version
too (which emulates the Python method).

So how can it be a "copy-of_in"?

It's effectively a copy for all intents and purposes. Although for this to
work, the string has to be immutable...

In all cases, assuming the resulting string needs to be modified (otherwise
there's little point in making a copy), then the 1M-char string of my
example needs to be duplicated.

(There might be a mild advantage in being able to create a 'copy' of a
string, without being sure when or if it will need duplicating.

I assume also that in the C++ example, if the original is written to or
destroyed, that that will trigger a duplication of the copy. But with
potential thousands of copies of the same string that could be in existence,
that implies quite a lot going on behind the scenes.)

Ian Collins · Jun 14, 2014

BartC said:
So the million characters (or whatever it might be), *are* copied, it just
gets delayed until a bit later (presumably just so you can claim that the
function itself is fast!). Unless of course the new string never gets
modified, in which case you can probably write a more efficient C version
too (which emulates the Python method).

If you make a copy and then modify one of the strings, the data
associated with the modified string will have to be changed somehow, in
any language, won't it?

It's effectively a copy for all intents and purposes. Although for this to
work, the string has to be immutable...

In all cases, assuming the resulting string needs to be modified (otherwise
there's little point in making a copy), then the 1M-char string of my
example needs to be duplicated.

(There might be a mild advantage in being able to create a 'copy' of a
string, without being sure when or if it will need duplicating.

I assume also that in the C++ example, if the original is written to or
destroyed, that that will trigger a duplication of the copy.

No, why would it?

But with
potential thousands of copies of the same string that could be in existence,
that implies quite a lot going on behind the scenes.)

No, just the copy that changes.

BartC · Jun 14, 2014

Ian Collins said:
If you make a copy and then modify one of the strings, the data associated
with the modified string will have to be changed somehow, in any language,
won't it?

OK, so the claim originally made by Stefan Ram was a little disingenuous;
the copy function itself is efficient, but only because the actual copy
operation is done later. (In the meantime, any operations done to the string
have to be monitored, to make sure that copy is performed, unlike the C
version where there are no such overheads.)

No, why would it?

No, just the copy that changes.

Suppose there is a string S = "abcdefghijklmnopqrstuvwxyz", and you do:

S2 = copy_of_in_cpp(S);
S3 = copy_of_in_cpp(S);
S4 = copy_of_in_cpp(S);
S5 = copy_of_in_cpp(S);

Presumably, at this point, S2, S3, S4, S5 still point to the same string as
S?

In that case, what happens when I do S[0]='X'; ?

What happens when I do S="Another string"; ?

What you say implies that the original string exists independently of any of
these (is not directly owned by them), and that there is a form of garbage
collection in place that knows have many references are pointing to any
string. I thought C++ didn't use GC.

Ian Collins · Jun 14, 2014

BartC said:
OK, so the claim originally made by Stefan Ram was a little disingenuous;
the copy function itself is efficient, but only because the actual copy
operation is done later. (In the meantime, any operations done to the string
have to be monitored, to make sure that copy is performed, unlike the C
version where there are no such overheads.)

With C "strings" you can't use copy on write, you have to make a copy.
With C++ (and I expect most scripting languages) you can because you can
detect and write operation on the string. That's where the saving
comes: a copy can be as quick as copying two pointers (reference count
and the string data), you only pay the price of copying the data if you
have to.

Suppose there is a string S = "abcdefghijklmnopqrstuvwxyz", and you do:

S2 = copy_of_in_cpp(S);
S3 = copy_of_in_cpp(S);
S4 = copy_of_in_cpp(S);
S5 = copy_of_in_cpp(S);

Presumably, at this point, S2, S3, S4, S5 still point to the same string as
S?

Could and should do, yes.

In that case, what happens when I do S[0]='X'; ?

What happens when I do S="Another string"; ?

Assuming the string uses reference counting, the reference count to the
original data is decremented and a new bloc is assigned to S and the
data copied.

What you say implies that the original string exists independently of any of
these (is not directly owned by them), and that there is a form of garbage
collection in place that knows have many references are pointing to any
string. I thought C++ didn't use GC.

That's correct, typically this will be reference counting. No need for
any of that GC nonsense in C++ thanks to RAII. Once all of the copies
of the string are destroyed, the data will be freed.

Kaz Kylheku · Jun 14, 2014

Ben Bacarisse said:
Ben Bacarisse said:

DFS said:

On 06/11/2014 10:45 PM, Ben Bacarisse wrote:

//GET LENGTH OF LONGEST WORD IN STRING

Click to expand...

[snip]

I was thinking of using existing functions that are better
behaved that strtok:

int getMaxWordLength(const char *str, const char *delim)
{
size_t max = 0;
while (*str) {
size_t l = strcspn(str, delim);
if (l > max) max = l;
str += l + strspn(str, delim);
}
return max;
}

Click to expand...

Programming interview test question - What is wrong
with this code, and why does it not matter?

What's wrong: that it doesn't do str += l + strspn(str + l, delim);

Why it doesn't matter: l will be zero at the top of the next iteration, which
doesn't affect the correctness of the value of max, and allows the strspn to
proceed from the new current str location.

[newbie] strcpy, strtok and strcat problem...	16	Aug 31, 2006
strtok question	6	May 31, 2008
strtok problem	16	Jun 8, 2010
strtok problem - strcmp	4	Mar 24, 2005
Reversing order of words in a given string	2	Apr 26, 2007
In C, the longest palindromic subsequence multithread exists	0	Nov 23, 2022
home grown strtok() function for review	4	Sep 19, 2006
strtok problem	3	Apr 11, 2004

Why does strcat mess up the tokens in strtok (and strtok_r)?

Ian Collins

Ian Collins

Joe Pfeiffer

DFS

Ben Bacarisse

Joe Pfeiffer

Ian Collins

Ben Bacarisse

James Kuyper

Tim Rentsch

Tim Rentsch

Barry Schwarz

BartC

Ike Naar

Ian Collins

BartC

Ian Collins

BartC

Ian Collins

Kaz Kylheku

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads