Comment on trim string function please

K

Keith Thompson

Bill Reid said:
Keith Thompson said:
Bill Reid said:
news:[email protected]... [...]
An alternative would be to use memmove() here, so you don't
have to do it byte by byte.

What would be the actual metrics to make a decision here
one way or the other? I myself just assume for this kind of stuff
that it will only be done on relatively small strings, just a few
characters, so it seems that a call to memmove() might be overkill.
Or is it? DOES it depend on the size of the block you'll
be moving? How many cycles does it take to assign one
character to a previous array position, versus the overhead of
a call to memmove(), and the operation of the function itself?

All this is system-specific.

This is not really responsive.

Ok, here's a more responsive answer.

I don't know.

Specifically, I don't know how many cycles any given operation will
take on whatever system you're using, partly because I don't know what
system you're using. Also, I don't know how many cycles it would take
on the system(s) *I'm* using, since I've never bothered to measure it.

I would expect that a call to memmove() would be faster than an
equivalent explicit loop on some systems, and vice versa on others.
memmove() has some overhead to determine whether the operands overlap,
but if the compiler can determine that they don't, or if it knows
which way they overlap and knows how memcpy() behaves, it might be
able to avoid that overhead. memmove() or memcpy() likely has some
overhead due to the function call, but an optimizing compiler might
replace either with inline code in some cases. An implementation of
memmove() or memcpy() might take advantage of copying chunks larger
than 1 byte; whether it can do this might depend on the alignment of
the operands. And so forth.

Perfect knowledge of the C language (something I don't claim to have)
would not enable someone to answer your question; that should be a
clue that this is not the place to ask it. Consider measuring it
yourself, on your own system with your own code.

[...]
But I'm "listening";

I won't speculate on what the quotation marks are supposed to mean.
tell me how many cycles different "systems"
would consume to perform each method...

See above.
or is memmove() kind
of like fread() and fwrite(), you THINK that they're doing something
"special", but really they're using fgetc() and fputc() which in most
cases you probably could have just used yourself and eliminated
the "middleman"...

The behavior of fread() and fwrite() is defined in terms of fgetc()
and fputc(), but there's no requirement that they actually be
implemented that way. Conversely, fgetc() and fputc() typically use
buffering, which means that 1000 calls to fputc() won't necessarily be
faster or slower than an equivalent single call to fwrite(). Use
whatever is clearer.

[...]
The compiler has "told" me which it considers faster, since it is
documented that the default optimization is to replace sub-scripts
with pointers. Therefore I cannot really be preventing the compiler
from doing its "job" (your opinion) by pre-emptively writing my
code to do the same thing as the "optimizer", at least for this
particular "optimization".

Perhaps for this particular compiler. Other compilers for other
systems might behave differently.
 
B

Ben Bacarisse

Bill Reid said:
Yeah, but does it fly past the terminating null character?

Nope. Why would you think that?
I think
you may have out-obfuscated me...

I am happy to oblige. Those were your rules: you wanted minimal code;
your own example suggests clarity does not matter (although I think my
loop is clearer, but that is simply option).
Also, I've not been following why the cast is important here...


Actually, the "boundary case" for this is the size of array; as long
as it is at least one character bigger than the "string" then this will
always work. But you are correct if you replace the word "string" with
the word "array fully-populated by the string" above.

There are actually only five possible test cases, which all should
ASSUME a "array fully-populated by the string" (which I unfortunately
didn't): spaces before text, spaces after the text, spaces before and
after the text, just spaces, empty string (six if you count the stupidity
of passing NULL). Pass those five (or six) cases, the thing is
perfect...

I am getting lost in all the words. I think your code is wrong (in
two of the five cases you are considering). Are you saying it is
correct? Like many "off by one" bugs it may not produce undefined
behaviour. For example passing char test[2] = ""; is fine because of
the extra byte, but it is still a bug.
OK, I need to work on a few things to make it the perfect piece
of obscure ruthlessly efficient code...

I'd want to be sure it is correct first. I would be the first admit I
make mistakes but you have not persuaded me that I am wrong about
this. I tried to be as clear about the bug as possible.
 
B

Bill Reid

CBFalconer said:
(spaces added for legibility above)

Serious bug. isspace(int) requires the integer value of an
unsigned char. Repeat, unsigned.

Repetition is not clarification. By declaration, and all available
"official" documentation, isspace(int) requires a signed integer value...
If the char type on any machine
is signed isspace can receive a negative value, and quite likely
blowup.

On my machine, "char" is "signed char" by default, at least
according to SOME of the documentation. Nothing is "blowing
up". Why so?
That's why the argument to isspace should receive an
explicit cast. The only negative value those functions can receive
is EOF.

Sure you're not conflating "strings" with "streams"?
The cast can be avoided when the integer output of getc() is
passed, because getc returns the appropriately cast value in the
first place.

How do you think the characters wound up in a string in the first
place?
But you are getting those chars from a string, and
whether a raw char is signed or unsigned is always undefined.

I appear to "getting" a signed 8-bit integer value from my string,
with no problems, yet you think it is critically important to cast
it to an unsigned integer value, for a function that takes a signed
integer value as an argument, in order to perform a look-up of
certain integer values. Nope, still not getting it...
 
B

Bill Reid

Ben Bacarisse said:
Nope. Why would you think that?

Oh, you know, the fact that you aren't actually testing for
the null character, you're just relying on isspace() to return
0 for it, which probably is correct in all cases, but just
"looks dangerous"...
I am happy to oblige. Those were your rules: you wanted minimal code;
your own example suggests clarity does not matter (although I think my
loop is clearer, but that is simply option).


I am getting lost in all the words. I think your code is wrong (in
two of the five cases you are considering). Are you saying it is
correct?

No, it is not correct. You're right, it is a TRUE "bug".
Like many "off by one" bugs it may not produce undefined
behaviour. For example passing char test[2] = ""; is fine because of
the extra byte, but it is still a bug.

Of course. The way I allocate arrays for text on THIS system
means you might NEVER encounter it, but it is still wrong...
I'd want to be sure it is correct first. I would be the first admit I
make mistakes but you have not persuaded me that I am wrong about
this. I tried to be as clear about the bug as possible.

Don't worry about it, you're right. Also, I'm giving up on writing
efficient obcure code for this. Aside from the cast "issues", here's
something that is totally "straight-forward" but probably not the
most "efficient" algorithm for all cases, or even any (it is also what
I actually use for small strings):

char *remove_beg_end_non_text(char *text) {
char *curr_char,*text_char=text;
size_t spaces=0;

for(curr_char=text;*curr_char!='\0';curr_char++)
if(!isspace(*curr_char)) break;

while(*curr_char!='\0') {

if(isspace(*curr_char)) spaces++;
else spaces=0;

*text_char++=*curr_char++;
}

*(text_char-spaces)='\0';

return text;
}

ASIDE FROM THE CAST "ISSUES", where's the bug in THAT?
 
S

santosh

Bill said:
Repetition is not clarification. By declaration, and all available
"official" documentation, isspace(int) requires a signed integer
value...

That's what Chuck is saying too. But you are passing it a plain char
value without a cast to unsigned. If plain char happens to be unsigned
on your implementation then things are fine, but is it a risk that you
want to take?

[ ... ]
Sure you're not conflating "strings" with "streams"?

How so? As far as I can see he is talking about neither, but the
interface of the is* and to* functions, which can be used with both
strings and text streams.

<snip>
 
S

sebastian

Doing such things as returning malloc'ed memory, checking for NULL, or
returning a pointer to the first non-space will generally lead to
misuse. Just keep it simple:

char*
trim( char* str )
{
char
* cpy = str,
* seq = str,
* fin = str + strlen( str ) - 1;
while( seq <= fin && isspace( *seq ) )
++seq;
while( seq <= fin && isspace( *fin ) )
--fin;
while( seq <= fin )
*cpy++ = *seq++;
*cpy = 0;
return str;
}

I can't guarantee that it's bug free of course...
 
S

sebastian

Ah, it seems there were quite a few more post than I thought (using a
new feed reader) - in that case my post must seem a little
discombobulated!

Well, insofar as strictly using isspace as an iterative condition,
yes. Otherwise, it really doesn't matter, does it?
 
B

Bill Reid

Eric Sosman said:
"If it works, it's not sufficiently tested." I'll bet you
a cookie that either (1) char is in fact not signed on your
systems, or (2) you haven't tested "Götterdãmmerung" or "Aïda"
or "µsec" or "garçon" or "£1 is worth ¥210" or ...

How many negative values are there in those characters that
are not part of the ASCII character set? Does the concept of
"sign-preserving" come into play here?

As far as whether my "system" uses a default signed char,
in one place in the documentation they state that they USED
to do it that way, IMPLYING they don't any more, but don't worry
about it in any event, then in another place they EXPLICITLY
state that's how they do it currently.

Bastards...
I guess your implication is that they were converted from
int values returned by getc(). Note that "converted from" is
not "copied from."

There's not that many place where I personally get characters,
and explictly avoid foreign currency symbols (not necessarily the
currencies themselves, especially considering the plummeting
value of the dollar, but I there's just a lot of stuff that never makes
it into my system), and if it did, I'd first think about changing
my "locale", except that would probably be a waste of time (see
end of post)...
Evidently. Let's try it in small, simple steps:

1) The <ctype.h> functions take an argument of type int.

2) The value of the int argument must be in the range zero
through UCHAR_MAX (all positive) or else EOF (negative).

WHO TOL YA DAT!!! WHO TOL YA DAT!!! Although it kind of
makes sense, since by definition a "C" type would be in that range...
3) Passing a positive argument >UCHAR_MAX or a negative
argument !=EOF invokes undefined behavior.

Well, maybe... (now THAT'S "undefined behavior"!!!).
With me so far? Okay, stick with it:

4) On an implementation where char is signed, some char
values are negative. (Since the "basic execution set"
characters are all non-negative, lackadaisacal testing
will not reveal this.)

OK, where does "locale" play into this?
5) At least 126 of those negative values are !=EOF.

OK, where does "locale" play into this?
6) Therefore, passing one of those char values to a <ctype.h>
function invokes undefined behavior.

Which of course, includes exactly the expected behavior as one
of the possibilities...

Enter string with air
-> Götterdammerung

original string (19 characters):
Götterdammerung |
new string (15 characters):
Götterdammerung|

(I put that little bar "|" at the end of my printf()s showing the before
and after strings so I could see where the string ends, as well as
the string length.) Well, that one worked...let's keep going, OKAY??!!?!!
ARE YOU WITH ME SO FAR??!??!!

Enter string with air
-> Aïda

original string (8 characters):
Aïda|
new string (4 characters):
Aïda|

I'M LOOKING FORWARD TO THAT COOKIE!!!

Enter string with air
-> µsec

original string (4 characters):
µsec|
new string (4 characters):
µsec|

Enter string with air
-> garçon

original string (24 characters):
garçon |
new string (6 characters):
garçon|

Enter string with air
-> £1 is worth ¥210

original string (24 characters):
£1 is worth ¥210 |
new string (16 characters):
£1 is worth ¥210|

I WANT MY COOKIE!!!!!!

ANY OTHER OF YOU IGNORANT BASTARDS WANT TO GIVE
ME SOME FREE FOOD???!!??!!
 
B

Ben Bacarisse

Bill Reid said:
WHO TOL YA DAT!!! WHO TOL YA DAT!!! Although it kind of
makes sense, since by definition a "C" type would be in that range...


Well, maybe... (now THAT'S "undefined behavior"!!!).


OK, where does "locale" play into this?


OK, where does "locale" play into this?


Which of course, includes exactly the expected behavior as one
of the possibilities...
I WANT MY COOKIE!!!!!!

ANY OTHER OF YOU IGNORANT BASTARDS WANT TO GIVE
ME SOME FREE FOOD???!!??!!

Eric Sosman is about as far from ignorant (on things topical here) as
I can imagine. You asked about an issue. He explained the problem.
You showed that some programs do what you expect on your system, which
in no way invalidates the explanation of the problem. OK,
misunderstandings happen, but then you then go on to be rude to the
person trying to explain this to you.

Why would anyone want to try again?
 
B

Barry Schwarz

How many negative values are there in those characters that
are not part of the ASCII character set? Does the concept of
"sign-preserving" come into play here?

As far as whether my "system" uses a default signed char,
in one place in the documentation they state that they USED
to do it that way, IMPLYING they don't any more, but don't worry
about it in any event, then in another place they EXPLICITLY
state that's how they do it currently.

Bastards...


There's not that many place where I personally get characters,
and explictly avoid foreign currency symbols (not necessarily the
currencies themselves, especially considering the plummeting
value of the dollar, but I there's just a lot of stuff that never makes
it into my system), and if it did, I'd first think about changing
my "locale", except that would probably be a waste of time (see
end of post)...


WHO TOL YA DAT!!! WHO TOL YA DAT!!! Although it kind of
makes sense, since by definition a "C" type would be in that range...

It is in the Standard, in the very first paragraph of the section of
that discusses the ctype.h functions. You know, the document you
don't want to read.
Well, maybe... (now THAT'S "undefined behavior"!!!).


OK, where does "locale" play into this?


OK, where does "locale" play into this?

locale doesn't play at all. EOF is a macro. It substitutes exactly
one int value. There are at least 127 negative signed char values. At
least 126 of them cannot equal EOF.
Which of course, includes exactly the expected behavior as one
of the possibilities...

And sticking your head in the sand may actually ward off disasters. It
is one of the possible outcomes.


Remove del for email
 
B

Barry Schwarz

Ah, it seems there were quite a few more post than I thought (using a
new feed reader) - in that case my post must seem a little
discombobulated!


Well, insofar as strictly using isspace as an iterative condition,
yes. Otherwise, it really doesn't matter, does it?

Undefined behavior always matters if you want you code to work.


Remove del for email
 
C

CBFalconer

Bill said:
.... snip ...

I WANT MY COOKIE!!!!!!

ANY OTHER OF YOU IGNORANT BASTARDS WANT TO GIVE
ME SOME FREE FOOD???!!??!!

Since you are ignoring advice, and being rude about it, I see no
reason for paying any further attention to you. PLONK.
 
K

Keith Thompson

sebastian said:
Ah, it seems there were quite a few more post than I thought (using a
new feed reader) - in that case my post must seem a little
discombobulated!


Well, insofar as strictly using isspace as an iterative condition,
yes. Otherwise, it really doesn't matter, does it?

Yes, it certainly does matter. Passing isspace() an argument that is
not either an int representation of an unsigned char value or the
negative value EOF invokes undefined behavior. If plain char is
signed, then this:

char c = -42;
isspace(c);

invokes undefined behavior. On systems that use an ASCII-based
character set, this can easily happen for, for example, accented
letters.
 
K

Keith Thompson

Bill Reid said:
I WANT MY COOKIE!!!!!!

ANY OTHER OF YOU IGNORANT BASTARDS WANT TO GIVE
ME SOME FREE FOOD???!!??!!

I regret my earlier attempts to help you.
 
N

Nick Keighley

Eric Sosman <[email protected]> wrote in message

As far as whether my "system" uses a default signed char,
in one place in the documentation they state that they USED
to do it that way, IMPLYING they don't any more, but don't worry
about it in any event, then in another place they EXPLICITLY
state that's how they do it currently.

Bastards...

it hardly ever matters...
WHO TOL YA DAT!!!  WHO TOL YA DAT!!!  

"American National Standard
for Information Systems -
Programming Language - C"

known to its friends as ANSI X3.159-1989

also K&R; also Linux man page; also MSDN (though their wording is odd)
google gave 499,000 hits for isspace() I havn't checked them all yet

<snip>
 
N

Nick Keighley

I use memcpy() (or memmove()) unless someone proves it's too slow.
Then I try other things.


it might not. Some compilers insert specially optimised
inline code. Some processors can do memmove() in hardware.


yes it is

I am left wondering how you manage to cope with everyday life
with such poor reading comprehension.  I specifically asked about
"cycles", and you respond with "price of tea"...

I'm left wondering how you manage. Are you usually this rude to
people?

"how many miles to the gallon does it take to travel to London"
"it depends what type of vehicle you use"
"you are unresponsive!"


what systems a Cray? a Z80? PC? Marconi Myriad? IBM 360? DeathStation?
(the DS-POT did actually depend on the price of tea in china).

If you really care then measure it. And be aware the answer will
change when you change processor, OS, compiler, compiler version
or compiler switches or omega (that is something I hadn't thought of).

the rule is write clearly and use appropriate library functions.
If its too slow or fails to meet some other quality attribute *then*
mofify the code. Always rememebering to measure both before and
afterwards.

I was looking for the "Answers Department", but accidentally walked
into the "Department of Begging the Question" (part of the "Clueless
Pedants Department", sharing a common desk with the "Pointless
Arguments Department")...

idiot

<snip>
 
S

sebastian

Yes, it certainly does matter. Passing isspace() an argument that is
not either an int representation of an unsigned char value or the
negative value EOF invokes undefined behavior.

Well then the standard is flawed (either chars should always be
unsigned or isdigit should take into account that they may not be).
Instead of forcing people to use defensive programming techniques, why
not just remap the ctype functions for signed char systems, eg:

#if CHAR_MIN < 0
int
isdigit_char_is_signed( int ch )
{
return isdigit( ch ) && ch >= 0;
}
#undef isdigit
#define isdigit isdigit_char_is_signed
// continue with the rest of the ctype functions...
#endif
 
S

sebastian

or rather:

#if CHAR_MIN < 0
int
isdigit_char_is_signed( int ch )
{
return isdigit( ( unsigned )ch );
}
#undef isdigit
#define isdigit isdigit_char_is_signed
// continue with the rest of the ctype functions...
#endif
 
L

lovecreatesbea...

Just looking for a few eyes on this code other than my own.

void TrimCString(char *str)
{
        // Trim whitespace from beginning:
        size_t i = 0;
        size_t j;

        while(isspace(str))
        {
                i++;
        }
        if(i > 0)
        {
                for(j = 0; i < strlen(str); j++, i++)
            {
                        str[j] = str;
                }
                str[j] = '\0';
        }

        // Trim whitespace from end:
        i = strlen(str) - 1;

        while(isspace(str))
        {
                i--;
        }
        if(i < (strlen(str) - 1))
        {
                str[i + 1] = '\0';
        }



}


this is my code:

$ cat a.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

char *str_reverse(char *p)
{
char *p1, *p2, ch;

for (p1 = p; *(p1 + 1); p1++) ;
for (p2 = p; p2 < p1; p2++, p1--)
ch = *p2, *p2 = *p1, *p1 = ch;
return p;
}

char *str_trim(char *s)
{
char *p;

p = s + strlen(s) - 1;
while (isspace(*p)) *p-- = '\0';
return s;
}

int main (int argc, char *argv[])
{
printf("\"%s\"\n", argv[1]);
printf("\"%s\"\n", str_trim(argv[1]));
printf("\"%s\"\n",
str_reverse(str_trim(str_reverse(argv[1]))));
return EXIT_SUCCESS;
}

$ make
gcc -ansi -pedantic -Wall -W -g -c -o a.o a.c
a.c:25: warning: unused parameter ‘argc’
gcc a.o -o a.out
$ ./a.out " hello world "
" hello world "
" hello world"
"hello world"
$
 
S

santosh

(e-mail address removed) wrote:

this is my code:

$ cat a.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

char *str_reverse(char *p)
{
char *p1, *p2, ch;

for (p1 = p; *(p1 + 1); p1++) ;
for (p2 = p; p2 < p1; p2++, p1--)
ch = *p2, *p2 = *p1, *p1 = ch;
return p;
}

char *str_trim(char *s)
{
char *p;

p = s + strlen(s) - 1;
while (isspace(*p)) *p-- = '\0';
return s;
}

int main (int argc, char *argv[])
{
printf("\"%s\"\n", argv[1]);
printf("\"%s\"\n", str_trim(argv[1]));
printf("\"%s\"\n",
str_reverse(str_trim(str_reverse(argv[1]))));
return EXIT_SUCCESS;
}
[ ... ]

Your program misbehaves when invoked with no arguments or with an empty
string argument.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,763
Messages
2,569,562
Members
45,039
Latest member
CasimiraVa

Latest Threads

Top