Why does strcat mess up the tokens in strtok (and strtok_r)?


D

DFS

You might consider re-writing this avoid using strtok. By using strtok,
you force the string to be copied (twice) and counting word lengths does
not need to copy the string at all.

I read various criticisms of strtok, including it destroys/modifies the
original input string, and isn't thread safe (so there's strtok_r).

But it's handy, and will let you use multiple delimiters in one pass:
char delim[4] = " ,.-";


Regardless, I rewrote it without strtok, and to handle at most 3
delimiters. Seems to work well.

int getMaxWordLength(char *str, char *delim) {
int i=0,j=0,maxWordLen=0;

for(i=0;i<=strlen(str);i++) {
if(str==delim[0] || str==delim[1] || str==delim[2] ||
str=='\0') {

//printf("%d\n",i-j);
if((i-j) > maxWordLen) {
maxWordLen = (i-j);
}
j = i + 1;
}
}
return maxWordLen;
}



But strtok is a little faster (gotta save those nanoseconds);
========================================================================
#define BILLION 1E9
struct timespec start1, end1, start2, end2;

clock_gettime(CLOCK_MONOTONIC, &start1);
i = getMaxWordLength(splitStr, delim);
clock_gettime(CLOCK_MONOTONIC, &end1);

clock_gettime(CLOCK_MONOTONIC, &start2);
i = getMaxWordLength_strtok(splitStr, delim);
clock_gettime(CLOCK_MONOTONIC, &end2);


double et = (end1.tv_sec-start1.tv_sec) + ((end1.tv_nsec-start1.tv_nsec)
/ BILLION);

long net = (BILLION * (end1.tv_sec - start1.tv_sec)) + (end1.tv_nsec -
start1.tv_nsec);

printf( "getMaxWordLength took: %.6f seconds (%li nanoseconds)\n", et,
net );


et = (end2.tv_sec-start2.tv_sec) + ((end2.tv_nsec-start2.tv_nsec) /
BILLION);

net = (BILLION * (end2.tv_sec - start2.tv_sec)) + (end2.tv_nsec -
start2.tv_nsec);

printf( "getMaxWordLength_strtok took: %.6f seconds (%li
nanoseconds)\n", et, net );


getMaxWordLength took: 0.000022 seconds (22143 nanoseconds)
getMaxWordLength_strtok took: 0.000011 seconds (10889 nanoseconds)
========================================================================


//GET LIST OF LONGEST WORDS IN STRING (DON'T STORE REPEATS).
int getLongestWords(char *str, char *longestWords, int maxWordLen,
char *delim) {
char *token, wordCounter[3];
int i = 1;
token = strtok(str, delim);
while(token != NULL)
{
if ((int)strlen(token) == maxWordLen) {
if(strstr(longestWords, token) == 0) {
sprintf(wordCounter, "%d", i);
strcat(longestWords, wordCounter);
strcat(longestWords, ".");
strcat(longestWords, token);
strcat(longestWords, " ");

Why not sprintf(longestWords, "%d. %s ", i, token)?

Thanks for tip! Saved a few lines, and is more readable. Decided not
to number the output words. I also went with strncat.

sprintf(addWord, "%s ", token);
strncat(longestWords, addWord, strlen(addWord));


Then consider using
snprintf which can be told how much room is available in the array (it's
up to you to pass the correct size into the function, but that's not
hard.

I'll look into it.

int main(void)
{

char splitStr[200] = "Look at him working, darningg hishersh socks
in the night when there's nobody there";
char splitStrA[200] = "";
char splitStrB[200] = "";
char delim[2] = " ";
char longestWords[2] = "";

You have 2 bytes (one taken with a null) and you write the whole output
string to it. That scribbles on goodness know what other objects. Make
this array big enough for the intended use (or pass its size to
getLongestWords and use snprintf).

This kind of string-handling and memory mgmt stuff make C less "fun"
than it could be.



As has been said, valgrind turns out not to help much here, but don't
let that stop you using it. It will help out soon enough!

I was doing the chapters in Learn C The Hard Way, and the author Zed
Shaw uses valgrind (pronounced valgrinned if you didn't know) early on.
It's a little hardcore for me at the moment. I barely understand
pointers, and am clueless about heap/stack. I just learned a few weeks
ago what int main(int argc, char *argv[]) means.



[[email protected] files]$ valgrind -v ./strtokens

(I don't think I've ever used -v. It seems to say too much as far as I
am concerned.)

<snip>


Thanks for your feedback and help.
 
Ad

Advertisements

B

Barry Schwarz

snip
I read various criticisms of strtok, including it destroys/modifies the
original input string, and isn't thread safe (so there's strtok_r).

But it's handy, and will let you use multiple delimiters in one pass:
char delim[4] = " ,.-";


Regardless, I rewrote it without strtok, and to handle at most 3
delimiters. Seems to work well.

int getMaxWordLength(char *str, char *delim) {
int i=0,j=0,maxWordLen=0;

for(i=0;i<=strlen(str);i++) {
if(str==delim[0] || str==delim[1] || str==delim[2] ||
str=='\0') {


This code requires that strlen(delim) be at least 2. You could
eliminate both this minimum and your maximum length limitation by
replacing the if with (but keeping the braces):
int i2;
for (i2 = strlen(delim)+1; i2 >= 0; i2--)
if (str == delim[i2])
//printf("%d\n",i-j);
if((i-j) > maxWordLen) {
maxWordLen = (i-j);
}
j = i + 1;
}
}
return maxWordLen;
}



But strtok is a little faster (gotta save those nanoseconds);
========================================================================
#define BILLION 1E9
struct timespec start1, end1, start2, end2;

clock_gettime(CLOCK_MONOTONIC, &start1);
i = getMaxWordLength(splitStr, delim);
clock_gettime(CLOCK_MONOTONIC, &end1);

clock_gettime(CLOCK_MONOTONIC, &start2);
i = getMaxWordLength_strtok(splitStr, delim);
clock_gettime(CLOCK_MONOTONIC, &end2);


double et = (end1.tv_sec-start1.tv_sec) + ((end1.tv_nsec-start1.tv_nsec)
/ BILLION);

long net = (BILLION * (end1.tv_sec - start1.tv_sec)) + (end1.tv_nsec -
start1.tv_nsec);

printf( "getMaxWordLength took: %.6f seconds (%li nanoseconds)\n", et,
net );


et = (end2.tv_sec-start2.tv_sec) + ((end2.tv_nsec-start2.tv_nsec) /
BILLION);

net = (BILLION * (end2.tv_sec - start2.tv_sec)) + (end2.tv_nsec -
start2.tv_nsec);

printf( "getMaxWordLength_strtok took: %.6f seconds (%li
nanoseconds)\n", et, net );


getMaxWordLength took: 0.000022 seconds (22143 nanoseconds)
getMaxWordLength_strtok took: 0.000011 seconds (10889 nanoseconds)

Does you clock_gettime retrieve "wall clock time" or processor time
spent actually executing your code?

snip
Thanks for tip! Saved a few lines, and is more readable. Decided not
to number the output words. I also went with strncat.

sprintf(addWord, "%s ", token);
strncat(longestWords, addWord, strlen(addWord));

Using strncat this way provides no benefit over strcat. In fact, this
would be a valid implementation for strcat.
 
J

James Kuyper

char longestWords[2] = "";

You have 2 bytes (one taken with a null) and you write the whole output
string to it. That scribbles on goodness know what other objects. Make
this array big enough for the intended use (or pass its size to
getLongestWords and use snprintf).

This kind of string-handling and memory mgmt stuff make C less "fun"
than it could be.

True - but for precisely that reason, it (or C++, which makes memory
management only slightly easier) is often the language used to implement
many of the more "fun" languages. The implementation in C keeps track of
the relevant details, so that the user of the more "fun" language
doesn't have to.

Don't use C to do anything that can be done more easily in one of those
other languages, without a good reason. For example, a well-designed C
program can often be much faster than a corresponding program in the
"fun" language, which might be a good reason. However, this fact will do
you absolutely no good, unless you're good enough at C to produce a
"well-designed ... program".
 
K

Keith Thompson

James Kuyper said:
Don't use C to do anything that can be done more easily in one of those
other languages, without a good reason. For example, a well-designed C
program can often be much faster than a corresponding program in the
"fun" language, which might be a good reason. However, this fact will do
you absolutely no good, unless you're good enough at C to produce a
"well-designed ... program".

<joke>But I want to get my wrong answers as quickly as possible!</joke>
 
D

DFS

char longestWords[2] = "";

You have 2 bytes (one taken with a null) and you write the whole output
string to it. That scribbles on goodness know what other objects. Make
this array big enough for the intended use (or pass its size to
getLongestWords and use snprintf).

This kind of string-handling and memory mgmt stuff make C less "fun"
than it could be.

True - but for precisely that reason, it (or C++, which makes memory
management only slightly easier) is often the language used to implement
many of the more "fun" languages.


I never understood how C was used to write a C compiler, or in fact how
one computer language is developed using a different language.

I always thought you wrote the higher-level language in a lower-level
language.

I assume VB was developed with C, and C was developed with assembler?



The implementation in C keeps track of
the relevant details, so that the user of the more "fun" language
doesn't have to.

As I'm learning C I'm seeing its influence in VB/VBA, and I'm also
seeing how much of a hack parts of them are. On the other hand, it IS
more fun to use than C because it's easier and you feel more productive.
And it interfaces well with data-handling libraries, so it's great for
business systems like I wrote for years.


Don't use C to do anything that can be done more easily in one of those
other languages, without a good reason. For example, a well-designed C
program can often be much faster than a corresponding program in the
"fun" language, which might be a good reason.

Fast is good!

I wrote this last night:

===========================================================================

//FIBONACCI SERIES
//0,1,1,2,3,5,8,13,21,34,55,89...

#include <stdio.h>
#include <time.h>

#define BILLION 1E9

int Fib(int num) {
int i;
if(num==0) {i = 0;}
else if(num==1) {i = 1;}
else if(num>=2) {i = Fib(num-1) + Fib(num-2);}
return i;
}

int main(void) {
int i,j;
struct timespec start, end;
double et;

for (i = 30; i <= 44; i++) {
clock_gettime(CLOCK_MONOTONIC_RAW, &start);
j = Fib(i);
clock_gettime(CLOCK_MONOTONIC_RAW, &end);
et = (end.tv_sec-start.tv_sec) + ((end.tv_nsec-start.tv_nsec) /
BILLION);
printf("Fibonacci %d = %d (%.2f seconds)\n", i, j, et);
}
}

===========================================================================

Was disappointed in the speed:

[[email protected] misc]$ ./Fibonacci
Fibonacci 30 = 832040 (0.03 seconds)
Fibonacci 31 = 1346269 (0.05 seconds)
Fibonacci 32 = 2178309 (0.08 seconds)
Fibonacci 33 = 3524578 (0.12 seconds)
Fibonacci 34 = 5702887 (0.20 seconds)
Fibonacci 35 = 9227465 (0.32 seconds)
Fibonacci 36 = 14930352 (0.52 seconds)
Fibonacci 37 = 24157817 (0.84 seconds)
Fibonacci 38 = 39088169 (1.36 seconds)
Fibonacci 39 = 63245986 (2.19 seconds)
Fibonacci 40 = 102334155 (3.55 seconds)
Fibonacci 41 = 165580141 (5.75 seconds)
Fibonacci 42 = 267914296 (9.30 seconds)
Fibonacci 43 = 433494437 (15.04 seconds)
Fibonacci 44 = 701408733 (24.34 seconds)

Note how the times to calculate the series follows a Fibonacci series as
well. Which makes sense, given that code.

===========================================================================

Then I found Binet's formula:
http://mathworld.wolfram.com/BinetsFibonacciNumberFormula.html

int FibBinet(int num) {
int i = (pow(1+sqrt(5),num) - pow(1-sqrt(5),num)) / (pow(2,num) *
sqrt(5));
return i;
}


[[email protected] misc]$ ./Fibonacci
Fibonacci 30 = 832040 (0.00002221 seconds)
Fibonacci 31 = 1346269 (0.00000314 seconds)
Fibonacci 32 = 2178309 (0.00000117 seconds)
Fibonacci 33 = 3524578 (0.00000111 seconds)
Fibonacci 34 = 5702887 (0.00000108 seconds)
Fibonacci 35 = 9227465 (0.00000111 seconds)
Fibonacci 36 = 14930352 (0.00000108 seconds)
Fibonacci 37 = 24157817 (0.00000108 seconds)
Fibonacci 38 = 39088169 (0.00000110 seconds)
Fibonacci 39 = 63245986 (0.00000108 seconds)
Fibonacci 40 = 102334155 (0.00000110 seconds)
Fibonacci 41 = 165580141 (0.00000104 seconds)
Fibonacci 42 = 267914296 (0.00000109 seconds)
Fibonacci 43 = 433494437 (0.00000107 seconds)
Fibonacci 44 = 701408733 (0.00000106 seconds)

heh!

===========================================================================


However, this fact will do
you absolutely no good, unless you're good enough at C to produce a
"well-designed ... program".

That's where you clc gurus, Google, and stackoverflow.com come in!

I want to eventually program embedded systems and devices with C. It
will take a while.

First thing I'm tackling is string/text handling. I wrote


VB:
Public Function padToEnd(lineOut As String, lineLen As Byte, delim As
String) As String
Dim padding As String
For i = 0 To (lineLen - (Len(lineOut) + 1))
padding = padding & delim
Next i
padding = padding & "|"
padToEnd = padding
End Function


C:
char *padToEnd(const char *lineToPad, char *paddedLine, int lineLen,
char *delim) {
strcpy(paddedLine,lineToPad);
for (int i = 1; i <= (lineLen - (strlen(lineToPad) + 1)); i++) {
strncat(paddedLine, delim, 1);
}
strncat(paddedLine, "|", 1);
return paddedLine;
}


Very similar. The C code runs a heck of a lot faster, for whatever reason.
 
K

Keith Thompson

DFS said:
for(i=0;i<=strlen(str);i++) {
if(str==delim[0] || str==delim[1] || str==delim[2] ||
str=='\0') {

//printf("%d\n",i-j);
if((i-j) > maxWordLen) {
maxWordLen = (i-j);
}
j = i + 1;
}
}

[...]

You're evaluating strlen(str) each time through the loop, which can be
terribly inefficient for long strings.

For example, if str has a length of 100, then this:

for (i = 0; i <= strlen(str); i ++) {
/* ... */
}

will have to do roughly 5000 character comparisons just in the
strlen() calls. In computer science terms, strlen()'s execution
time is O(n), where n is the length of the string. And since str
doesn't change while the loop is executing, the result of strlen(str)
won't change either, so there's no point in re-evaluating it.

For strings that aren't *too* long, the wasted time likely won't be
noticeable, but it's worth fixing it, or at least being aware of it.

The simplest way to avoid this is to call strlen() once and save
the result:

const size_t len = strlen(str);
for (i = 0; i <= len; i ++) {
/* ... */
}

Another (if you don't need the value of i inside the loop) is to use
pointer arithmetic:

char *ptr = str;
while (*ptr != '\0') {
/* ... */
ptr ++;
}

Your use of "<=" rather than "<" is slightly unusual; it means that,
on the last iteration, str will always be the terminating '\0'
character. I see you're testing for that, I haven't examined your
program's logic closely enough to know whether changing the "<="
to "<" would make more sense.
 
Ad

Advertisements

J

James Kuyper

On 06/13/2014 03:37 PM, DFS wrote:
....
I never understood how C was used to write a C compiler, or in fact how
one computer language is developed using a different language.

I gather that this kind of thing is often taught in one of the classes
that you have to take to get a CS degree. I'd recommend taking such a
class if you're interested. My own BS was in Physics, so I'm not
familiar with the details - almost any other expert monitoring this
group would be a better source of advice about this. However, I do have
some idea how it is done.
I always thought you wrote the higher-level language in a lower-level
language.

I assume VB was developed with C, and C was developed with assembler?

I know nothing of how VB was developed. However, C quite famously
started out as a bootstrapped language. The very first working C
compiler was necessarily written in some other language (possibly
assembly, possibly B - I don't know the details). However, as soon as
that compiler implemented a sufficiently powerful language, each later
version of the compiler was written using the version of C implemented
by an earlier version of the compiler (not necessarily the immediately
preceding version, I wouldn't think).

I suspect that for modern versions of C, the first compiler for any
given platform is usually a cross-compiler written in C on a different
platform, so it's no longer bootstrapped.
 
D

DFS

snip
I read various criticisms of strtok, including it destroys/modifies the
original input string, and isn't thread safe (so there's strtok_r).

But it's handy, and will let you use multiple delimiters in one pass:
char delim[4] = " ,.-";


Regardless, I rewrote it without strtok, and to handle at most 3
delimiters. Seems to work well.

int getMaxWordLength(char *str, char *delim) {
int i=0,j=0,maxWordLen=0;

for(i=0;i<=strlen(str);i++) {
if(str==delim[0] || str==delim[1] || str==delim[2] ||
str=='\0') {


This code requires that strlen(delim) be at least 2. You could
eliminate both this minimum and your maximum length limitation by
replacing the if with (but keeping the braces):
int i2;
for (i2 = strlen(delim)+1; i2 >= 0; i2--)
if (str == delim[i2])
//printf("%d\n",i-j);
if((i-j) > maxWordLen) {
maxWordLen = (i-j);
}
j = i + 1;
}
}
return maxWordLen;
}



I like that. The flexibility is good.


Does you clock_gettime retrieve "wall clock time" or processor time
spent actually executing your code?


Neither. It's MONOTONIC, or time elapsed since some arbitrary period.

'wall clock' time is CLOCK_REALTIME
'CPU time' is CLOCK_PROCESS_CPUTIME_ID (or CLOCK_THREAD_CPUTIME_ID)

Since I made the original post, I've switched from MONOTONIC to
MONOTONIC_RAW.

http://man7.org/linux/man-pages/man2/clock_gettime.2.html

http://www.cs.rutgers.edu/~pxk/416/notes/c-tutorials/gettime.html


snip


Using strncat this way provides no benefit over strcat.

I was previously using strcat since it was easier, but I thought strncat
handled memory better?

"Use the strcat function with caution as it is easy to concatenate more
bytes into your variable using the strcat function, which can cause
unpredictable behavior."


In fact, this would be a valid implementation for strcat.

Not sure what you mean:

strtokens.c:86:7: error: too many arguments to function ‘strcat’
strcat(longestWords, addWord, strlen(addWord));
^
 
S

Stefan Ram

James Kuyper said:
C++ ....
makes memory management only slightly easier

With RAII and rvalue references , memory management is
significantly easier. For example, to return a new copy
of a string (untested):

::std::string copy_of_in_cpp( ::std::string const source )
{ return ::std::string{ source }; }

. This is nearly as efficient as returning a pointer,
because it returns an rvalue, in this case C++ does /not/ copy
the object to return the string! It is much simpler than the
similar (untested):

copy_of_in_c( char const * const source )
{ char * const target = malloc( strlen( source )+ 1 );
if( target )strcpy( target, source );
return target; }

not only here in the definition of the function, but
also for the caller, who does have to check the result
for 0 in C, but not in C++.
program can often be much faster than a corresponding program in the
"fun" language, which might be a good reason. However, this fact will do

I have read "fun" as "functional". In this case, I'd like to
add that on »benchmarksgame.alioth.debian.org« Haskell is in
the group of the fast languages (with Java) and is much
faster that Perl or Python.
 
J

James Kuyper

I was previously using strcat since it was easier, but I thought strncat
handled memory better?

"Use the strcat function with caution as it is easy to concatenate more
bytes into your variable using the strcat function, which can cause
unpredictable behavior."

Using strncat() this way carries precisely the same dangers as calling
strcat(longestWords, addWord). If addWord is too long, it will overwrite
the end of the longestWords array. The third argument of strncat()
should no bigger than the number of remaining bytes in your array;
strlen(addWord) isn't the right value. The correct value would be
(sizeof longestWords - strlen(longestWords)), but unless you can
calculate that value by some method faster than calling strlen(), it's
probably not worthwhile.

Also, if there's any possibility that addWord might not be properly null
terminated, you should also make sure that the third argument to
strncat() is small enough to prevent reading past the end of the array
that it's stored in.

For this kind of situation, I like to keep a pointer to the end of the
output string, in which case I could calculate the appropriate limit
very efficiently as (sizeof longestWords + (word_end - longestWords)).
However, precisely because I would be keeping track of the end of the
string, I wouldn't use strncat(longestWords, ...), I'd use
strncpy(word_end, ...), thereby avoiding the time that strncat() wastes
looking for the end of the string - my program already knows where that is.
Not sure what you mean:

strtokens.c:86:7: error: too many arguments to function ‘strcat’
strcat(longestWords, addWord, strlen(addWord));

He means that an implementation of strcat() could be written as follows:

char *strcat(char * restrict s1,
const char * restrict s2)
{
return strncat(s1, s2, strlen(s2));
}

and that implementation would carry precisely the same dangers you
should be worried about when using strcat().
 
J

James Kuyper

I have read "fun" as "functional".

I was using "fun" to refer back to DFS's comment that
This kind of string-handling and memory mgmt stuff make C less "fun"
than it could be.

I assumed he was using "fun" in it's normal meaning, not as a shortened
version of "functional".
 
Ad

Advertisements

B

Ben Bacarisse

DFS said:
You might consider re-writing this avoid using strtok. By using strtok,
you force the string to be copied (twice) and counting word lengths does
not need to copy the string at all.

I read various criticisms of strtok, including it destroys/modifies
the original input string, and isn't thread safe (so there's
strtok_r).

But it's handy, and will let you use multiple delimiters in one pass:
char delim[4] = " ,.-";


Regardless, I rewrote it without strtok, and to handle at most 3
delimiters. Seems to work well.

int getMaxWordLength(char *str, char *delim) {
int i=0,j=0,maxWordLen=0;

for(i=0;i<=strlen(str);i++) {
if(str==delim[0] || str==delim[1] || str==delim[2]
|| str=='\0') {

//printf("%d\n",i-j);
if((i-j) > maxWordLen) {
maxWordLen = (i-j);
}
j = i + 1;
}
}
return maxWordLen;
}


I was thinking of using existing functions that are better behaved that
strtok:

int getMaxWordLength(const char *str, const char *delim)
{
size_t max = 0;
while (*str) {
size_t l = strcspn(str, delim);
if (l > max) max = l;
str += l + strspn(str, delim);
}
return max;
}

<snip>
 
I

Ike Naar

snip
I read various criticisms of strtok, including it destroys/modifies the
original input string, and isn't thread safe (so there's strtok_r).

But it's handy, and will let you use multiple delimiters in one pass:
char delim[4] = " ,.-";


Regardless, I rewrote it without strtok, and to handle at most 3
delimiters. Seems to work well.

int getMaxWordLength(char *str, char *delim) {
int i=0,j=0,maxWordLen=0;

for(i=0;i<=strlen(str);i++) {
if(str==delim[0] || str==delim[1] || str==delim[2] ||
str=='\0') {


This code requires that strlen(delim) be at least 2. You could
eliminate both this minimum and your maximum length limitation by
replacing the if with (but keeping the braces):
int i2;
for (i2 = strlen(delim)+1; i2 >= 0; i2--)
if (str == delim[i2])


Or, more succinctly,

if (strchr(delim,str))
 
D

DFS

DFS said:
for(i=0;i<=strlen(str);i++) {
if(str==delim[0] || str==delim[1] || str==delim[2] ||
str=='\0') {

//printf("%d\n",i-j);
if((i-j) > maxWordLen) {
maxWordLen = (i-j);
}
j = i + 1;
}
}

[...]

You're evaluating strlen(str) each time through the loop, which can be
terribly inefficient for long strings.

For example, if str has a length of 100, then this:

for (i = 0; i <= strlen(str); i ++) {
/* ... */
}

will have to do roughly 5000 character comparisons just in the
strlen() calls. In computer science terms, strlen()'s execution
time is O(n), where n is the length of the string. And since str
doesn't change while the loop is executing, the result of strlen(str)
won't change either, so there's no point in re-evaluating it.

For strings that aren't *too* long, the wasted time likely won't be
noticeable, but it's worth fixing it, or at least being aware of it.

The simplest way to avoid this is to call strlen() once and save
the result:

const size_t len = strlen(str);
for (i = 0; i <= len; i ++) {
/* ... */
}



Doesn't the compiler calculate strlen(str) just once, when the for loop
is initialized?



Another (if you don't need the value of i inside the loop) is to use
pointer arithmetic:

char *ptr = str;
while (*ptr != '\0') {
/* ... */
ptr ++;
}


I think I need the i inside the loop, or some other way to mark the last
time it found a delimiter.

I can't make the pointer arithmetic work, and I don't understand why it
should be used.


int getMaxWordLength3(char *str, char *delim) {
int i=0;
char *ptr = str;
while (*ptr != '\0') {
printf("%c(%d)\n",ptr,i);
ptr++;
i++;
}
}


Input (*str): According to a news release...

Output is invalid:
A(0)
c(1)
r(2)
i(3)
g(4)
t(5)
(6)
(7)
e(8)
s(9)
r(10)
....



Your use of "<=" rather than "<" is slightly unusual; it means that,
on the last iteration, str will always be the terminating '\0'
character. I see you're testing for that, I haven't examined your
program's logic closely enough to know whether changing the "<="
to "<" would make more sense.


If I don't use '<=' it never evaluates the last word before the
terminator \0.


string: The titanosaur remains were nevereaten.

Longest word should be 11 characters (which incl the period)

Using < never evaluates that last word.
 
J

Joe Pfeiffer

DFS said:
Doesn't the compiler calculate strlen(str) just once, when the for
loop is initialized?

No. The termination condition is evaluated every time through the loop
(including any function calls required for the evaluation).

That said, a sufficiently aggressive optimizing compiler/linker might be
able to notice that strlen(ptr) will give you the same answer every time
through and optimize the actual work away, but that should not be
expected.
 
K

Keith Thompson

DFS said:
DFS said:
for(i=0;i<=strlen(str);i++) {
if(str==delim[0] || str==delim[1] || str==delim[2] ||
str=='\0') {

//printf("%d\n",i-j);
if((i-j) > maxWordLen) {
maxWordLen = (i-j);
}
j = i + 1;
}
}

[...]

You're evaluating strlen(str) each time through the loop, which can be
terribly inefficient for long strings.

For example, if str has a length of 100, then this:

for (i = 0; i <= strlen(str); i ++) {
/* ... */
}

will have to do roughly 5000 character comparisons just in the
strlen() calls. In computer science terms, strlen()'s execution
time is O(n), where n is the length of the string. And since str
doesn't change while the loop is executing, the result of strlen(str)
won't change either, so there's no point in re-evaluating it.

For strings that aren't *too* long, the wasted time likely won't be
noticeable, but it's worth fixing it, or at least being aware of it.

The simplest way to avoid this is to call strlen() once and save
the result:

const size_t len = strlen(str);
for (i = 0; i <= len; i ++) {
/* ... */
}



Doesn't the compiler calculate strlen(str) just once, when the for loop
is initialized?


A sufficiently clever optimizing compiler might perform that
optimization, but you shouldn't count on it. The compiler would have to
prove beyond doubt that the length of str doesn't change from one
iteration of the loop to the next. If you write:

for (i = 0; i <= strlen(str); i ++) { ... }

you're telling it to evaluate strlen(str) once per iteration; the
compiler can disregard that only if it can prove that it doesn't change
the programs behavior.
I think I need the i inside the loop, or some other way to mark the last
time it found a delimiter.

Then using array indexing is probably the best approach in this case.
I can't make the pointer arithmetic work, and I don't understand why it
should be used.

It's probably not worth worrying about at this point. You can iterate
over an array either by incrementing an integer (used as an index), or
by incrementing a pointer that points successively to each element of
the array. Neither approach is better than the other (though in the old
days the pointer approach might have been faster).

[...]
Your use of "<=" rather than "<" is slightly unusual; it means that,
on the last iteration, str will always be the terminating '\0'
character. I see you're testing for that, I haven't examined your
program's logic closely enough to know whether changing the "<="
to "<" would make more sense.


If I don't use '<=' it never evaluates the last word before the
terminator \0.


Right. You could use "<" and have your loop stop when it sees the '\0',
but then you'd probably have to repeat the code that handles the end of
a word.

Just be aware that most of the time, when you're iterating of the
characters of a string, you'll probably want to use "<" rather than
"<=". Again, don't worry about it for this case.
 
Ad

Advertisements

D

DFS

DFS said:
//GET LENGTH OF LONGEST WORD IN STRING
int getMaxWordLength(char *str, char *delim) {
char *token;
token = strtok(str, delim);
int maxWordLen = (int)strlen(token);

while(token != NULL)
{
if ((int)strlen(token) > maxWordLen) {
maxWordLen = (int)strlen(token);
}
token = strtok(NULL, delim);
}
return maxWordLen;
}

You might consider re-writing this avoid using strtok. By using strtok,
you force the string to be copied (twice) and counting word lengths does
not need to copy the string at all.

I read various criticisms of strtok, including it destroys/modifies
the original input string, and isn't thread safe (so there's
strtok_r).

But it's handy, and will let you use multiple delimiters in one pass:
char delim[4] = " ,.-";


Regardless, I rewrote it without strtok, and to handle at most 3
delimiters. Seems to work well.

int getMaxWordLength(char *str, char *delim) {
int i=0,j=0,maxWordLen=0;

for(i=0;i<=strlen(str);i++) {
if(str==delim[0] || str==delim[1] || str==delim[2]
|| str=='\0') {

//printf("%d\n",i-j);
if((i-j) > maxWordLen) {
maxWordLen = (i-j);
}
j = i + 1;
}
}
return maxWordLen;
}


I was thinking of using existing functions that are better behaved that
strtok:

int getMaxWordLength(const char *str, const char *delim)
{
size_t max = 0;
while (*str) {
size_t l = strcspn(str, delim);
if (l > max) max = l;
str += l + strspn(str, delim);
}
return max;
}

<snip>



It's a beautiful thing!

My code is more readable (to me anyway); yours is more better...
 
D

DFS

snip
I read various criticisms of strtok, including it destroys/modifies the
original input string, and isn't thread safe (so there's strtok_r).

But it's handy, and will let you use multiple delimiters in one pass:
char delim[4] = " ,.-";


Regardless, I rewrote it without strtok, and to handle at most 3
delimiters. Seems to work well.

int getMaxWordLength(char *str, char *delim) {
int i=0,j=0,maxWordLen=0;

for(i=0;i<=strlen(str);i++) {
if(str==delim[0] || str==delim[1] || str==delim[2] ||
str=='\0') {


This code requires that strlen(delim) be at least 2. You could
eliminate both this minimum and your maximum length limitation by
replacing the if with (but keeping the braces):
int i2;
for (i2 = strlen(delim)+1; i2 >= 0; i2--)
if (str == delim[i2])


Or, more succinctly,

if (strchr(delim,str))



Given char delim[4] = " ,-";

How does strchr(delim,str) know which delim character to use?
 
B

BartC

DFS said:
On 06/13/2014 02:32 PM, James Kuyper wrote:

I never understood how C was used to write a C compiler, or in fact how
one computer language is developed using a different language.

I always thought you wrote the higher-level language in a lower-level
language.

I assume VB was developed with C, and C was developed with assembler?

You can use whatever other language is available. You can write a C compiler
in a much higher level language. But if you have nothing (this was more
common decades ago when you couldn't just 'download' a language for free)
you might well have to start from basics, eg. machine code keyed in with
switches.
Fibonacci 44 = 701408733 (24.34 seconds)
Fibonacci 44 = 701408733 (0.00000106 seconds)

The first Fibonacci method is often used for benchmarking (testing how well
a language implementation deals with lots of recursion). The second method
is not so useful. So fast isn't always good.

(BTW my C compiler manages that first method in about 7 seconds on a low-end
PC. Have you turned on the optimising switch on your compiler?)
Public Function padToEnd(lineOut As String, lineLen As Byte, delim As
String) As String
Dim padding As String
For i = 0 To (lineLen - (Len(lineOut) + 1))
padding = padding & delim
Next i
padding = padding & "|"
padToEnd = padding
End Function


C:
char *padToEnd(const char *lineToPad, char *paddedLine, int lineLen, char
*delim) {
strcpy(paddedLine,lineToPad);
for (int i = 1; i <= (lineLen - (strlen(lineToPad) + 1)); i++) {
strncat(paddedLine, delim, 1);
}
strncat(paddedLine, "|", 1);
return paddedLine;
}


Very similar. The C code runs a heck of a lot faster, for whatever
reason.

It might well be faster anyway. But the VB code is using a different method:
starting with an empty string (shouldn't it be initialised to LineOut?) then
appending to it a character at a time (then copying the whole thing at the
end), which can inefficient because of the memory management going on behind
the scenes.

In C, the destination string has presumably been preallocated to a known
length.

But then, the C version has also been crippled by:

(1) Not keeping track of where you are in paddedLine, so that you can use
strcpy/memcpy direcly to the end of the string. (strncat has to rescan from
the start of paddedLine each time.)

(2) Not making use of the fact that you are only adding a character at a
time; you don't need string routines. So each loop could just contain
paddedLine[i+k]=*delim; k depends on the initial size of paddedLine. (Don't
forget a 0 char at the end of the string.)
 
Ad

Advertisements

B

BartC

Stefan Ram said:
With RAII and rvalue references , memory management is
significantly easier. For example, to return a new copy
of a string (untested):

::std::string copy_of_in_cpp( ::std::string const source )
{ return ::std::string{ source }; }

. This is nearly as efficient as returning a pointer,
because it returns an rvalue, in this case C++ does /not/ copy
the object to return the string!

Either it makes a copy of the contents of a string, or it doesn't. If not,
then it doesn't do the same thing as the following C code. If it does copy,
and is still nearly as fast as returning a pointer, then how does it do
that?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top