Comment on trim string function please

W

Willem

Bill Reid wrote:
) (I turn them ALL off, which means sub-scripting is not converted
) to pointers, which I believe takes longer, so to retain the speed
) I've manually converted most of the stuff like this).

As far as I know, there is no speed difference between sub-scripting
and pointers on most modern CPUs.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
S

swengineer001

I sort of took a different approach based on the comments here.
#include <ctype.h>
#include <string.h>
char* TrimCString(char *str)
{
    // Trim whitespace from beginning:
    size_t i = 0;
    size_t length = strlen(str);
    char* new_start;
    if((str == NULL) || (length == 0))
    {
        return str;
    }
    while(isspace((unsigned char)str))
    {
        i++;
    }

    new_start = str + i;
    length -= i;
    // Trim whitespace from end:
    i = (length ? (length - 1) : length);
    if(i != 0)
    {
        while(isspace((unsigned char)str))
        {
            i--;
            length--;
        }
        str[i + 1] = '\0';
    }
    memmove(str, new_start, (length + 1));
    return str;
}


     You're still working far too hard, IMHO.  Ponder this one:

        #include <ctype.h>

        /* Remove trailing white space from a string, and return
         * a pointer to the string's first non-space character
         * (which might be its terminating '\0').
         */
        char *TrimString(char *str)
        {
            char *ptr;
            char *end;

            /* skip leading spaces */
            while (isspace( (unsigned char)*str ))
                ++str;

            /* find start of trailing spaces (if any) */
            end = str;
            for (ptr = str;  *ptr != '\0';  ++ptr) {
                if (! isspace( (unsigned char)*ptr ))
                    end = ptr + 1;
            }

            /* clip off trailing spaces (if any) */
            *end = '\0';

            return str;
        }

Or you might prefer to write the second loop this way:

        /* find start of trailing spaces (if any) */
        end = ptr = str;
        while (*ptr != '\0') {
            if (! isspace( (unsigned char)*ptr++ ))
                end = ptr;
        }

Use whichever seems clearer to you; I think it's a toss-up.


I think this is more of a stylistic difference than anything else. For
me I feel my version is more understandable. For you maybe not.

I am curious though, did I address your concerns about an empty string
sufficiently?

Thanks for the help.
 
S

swengineer001

I sort of took a different approach based on the comments here.

It probably isn't relevant, but
I profiled 1,000,000 calls of your first and last versions:

index % time    self  children    called     name
[2]     94.2    6.26    0.00 1000000         TrimCString1 [2]
[3]      4.7    0.31    0.00 1000000         TrimCString2 [3]

Increase in speed was ( 6.26 / 0.31 = 20+ ) twentyfold :)

Interesting. I would not have guessed it would be that much of a
difference.

I did not introduce any bugs with this change that you see did I? In
my testing it still performs as I would expect.
 
B

badc0de4

voidpointer said:
another version, i don't know if it's correct, but work fine here

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>

char *stringtrim(char *str)
{
char *res, *ini, *end, *ptr;
int size;

/* safe mode to NULL string */
if (!str)
return NULL;

size = strlen(str);

if (!size)
return NULL;

Why stringtrim("") returns NULL?
"" is a perfectly valid string.
ini = str;

/* go to the first non blank character */
while (isspace((unsigned char)*ini)) {
++ini;
--size;
}

if (size <= 0)
return NULL;

Why stringtrim(" ") returns NULL?
"" is a perfectly valid string.
end = str + strlen(str) - 1; /* point to the last character */
/* go to the last non blank character */
while (isspace((unsigned char)*end)) {
--end;
--size;
}

if (size <= 0)
return NULL;

res = malloc(sizeof(char) * size);
for (ptr = res; ini != (end+1); ini++)
*ptr++ = *ini;
*ptr = '\0';

return res;
}

There's a memory leak in your program.
You never free the memory you malloc'd for the result.

[snip]

Oh ... `stringtrim` is a reserved name
http://www.gnu.org/software/libtool/manual/libc/Reserved-Names.html
 
S

swengineer001

I sort of took a different approach based on the comments here.
#include <ctype.h>
#include <string.h>
char* TrimCString(char *str)
{
    // Trim whitespace from beginning:
    size_t i = 0;
    size_t length = strlen(str);
    char* new_start;
    if((str == NULL) || (length == 0))
    {
        return str;
    }
    while(isspace((unsigned char)str))
    {
        i++;
    }

    new_start = str + i;
    length -= i;
    // Trim whitespace from end:
    i = (length ? (length - 1) : length);
    if(i != 0)
    {
        while(isspace((unsigned char)str))
        {
            i--;
            length--;
        }
        str[i + 1] = '\0';
    }
    memmove(str, new_start, (length + 1));
    return str;
}


     You're still working far too hard, IMHO.  Ponder this one:

        #include <ctype.h>

        /* Remove trailing white space from a string, and return
         * a pointer to the string's first non-space character
         * (which might be its terminating '\0').
         */
        char *TrimString(char *str)
        {
            char *ptr;
            char *end;

            /* skip leading spaces */
            while (isspace( (unsigned char)*str ))
                ++str;

            /* find start of trailing spaces (if any) */
            end = str;
            for (ptr = str;  *ptr != '\0';  ++ptr) {
                if (! isspace( (unsigned char)*ptr ))
                    end = ptr + 1;
            }

            /* clip off trailing spaces (if any) */
            *end = '\0';

            return str;
        }

Or you might prefer to write the second loop this way:

        /* find start of trailing spaces (if any) */
        end = ptr = str;
        while (*ptr != '\0') {
            if (! isspace( (unsigned char)*ptr++ ))
                end = ptr;
        }

Use whichever seems clearer to you; I think it's a toss-up.


Actually, now that I look at this a little closer won't this truncate
a string with a space in the middle. That is not what I want. A space
in the middle is fine I just want to trim off the front and back.
 
B

Ben Bacarisse

On Jul 10, 5:01 pm, Eric Sosman <[email protected]> wrote:

Actually, now that I look at this a little closer won't this truncate
a string with a space in the middle.

Not as far as I can see. In both cases end is set to point just after
something that is know not to be space. Because the loop runs to the
end of the string, end points just after the *last* thing know not to
be a space. This will be the null or the first trailing space.
 
B

badc0de4

I did not introduce any bugs with this change that you see did I? In
my testing it still performs as I would expect.

Hmmm ... if I pass NULL to your TrimCString it'll try to do strlen on
that.
I'm not sure you can do that.

: char* TrimCString(char *str)
: {
: // Trim whitespace from beginning:
: size_t i = 0;
: size_t length = strlen(str); /* <=== strlen(NULL)
is ok? */
: char* new_start;
:
: if((str == NULL) || (length == 0))
: {
: return str;
: }
: [rest of program]

Maybe try this instead

char* TrimCString(char *str)
{
// Trim whitespace from beginning:
size_t i = 0;
size_t length; /* not initialized */
char* new_start;

if(str == NULL) /* only test for NULL */
{
return str;
}
length = strlen(str);
if(length == 0) /* only test for
length == 0 */
{
return str;
}
[rest of program]
 
K

Keith Thompson

Bill Reid said:
What would be the actual metrics to make a decision here
one way or the other? I myself just assume for this kind of stuff
that it will only be done on relatively small strings, just a few
characters, so it seems that a call to memmove() might be overkill.
Or is it? DOES it depend on the size of the block you'll
be moving? How many cycles does it take to assign one
character to a previous array position, versus the overhead of
a call to memmove(), and the operation of the function itself?

All this is system-specific.

[snip]
Can't go wrong there...I will admit that I use to always use array
sub-scripting such as the OP because I found it easier conceptually
to understand, but then abandoned it largely for the equivalent
pointers as part of my paranoia about compiler "optimizations"
(I turn them ALL off, which means sub-scripting is not converted
to pointers, which I believe takes longer, so to retain the speed
I've manually converted most of the stuff like this).

Subscripting might be faster than pointer manipulation on some
systems; pointer manipulation might be faster on others. A decent
optimizing compiler will know which is faster, and can often convert
either form into whichever is best for the target system.

Consider letting the compiler do its job.
 
S

swengineer001

I did not introduce any bugs with this change that you see did I? In
my testing it still performs as I would expect.

Hmmm ... if I pass NULL to your TrimCString it'll try to do strlen on
that.
I'm not sure you can do that.

: char* TrimCString(char *str)
: {
:     // Trim whitespace from beginning:
:     size_t i = 0;
:     size_t length = strlen(str);               /* <=== strlen(NULL)
is ok? */
:     char* new_start;
:
:     if((str == NULL) || (length == 0))
:     {
:         return str;
:     }
:    [rest of program]

Maybe try this instead

char* TrimCString(char *str)
{
    // Trim whitespace from beginning:
    size_t i = 0;
    size_t length;                           /* not initialized */
    char* new_start;

    if(str == NULL)                         /* only test for NULL */
    {
        return str;
    }
    length = strlen(str);
    if(length == 0)                             /* only test for
length == 0 */
    {
        return str;
    }
    [rest of program]

Thanks. That could have been a disaster.
 
B

Bill Reid

Willem said:
Bill Reid wrote:
) (I turn them ALL off, which means sub-scripting is not converted
) to pointers, which I believe takes longer, so to retain the speed
) I've manually converted most of the stuff like this).

As far as I know, there is no speed difference between sub-scripting
and pointers on most modern CPUs.

DAMN!!! More time I've wasted because I JUST WON'T LISTEN!!!

In any event, does the following win the prize for most efficient
implementation of the presumed requirements of the function?

char *remove_beg_end_non_text(char *text) {
char *beg;
size_t length;

for(beg=text;*beg!='\0';beg++)
if(!isspace(*beg)) break;

length=strlen(beg);

while(length>0)
if(!isspace(*(beg+(--length)))) break;

*(beg+(++length))='\0';

return beg==text ? text : memmove(text,beg,length+1);
}

Gotta admit, you couldn't reduce the cycles too much on
that, could you? And it could even win a little bonus prize
for obfuscatory conditions like if(!isspace(*(beg+(--length))))...
 
B

Bill Reid

Keith Thompson said:
All this is system-specific.

This is not really responsive. I make a distinction between doing
an "delete" of about 20 characters from a text block of 300K and moving
10 characters one position forward. I find it difficult to believe that
some "specific" systems make no or counter-intuitive operational
distinctions between the two different methods of accomplishing the
operation.

But I'm "listening"; tell me how many cycles different "systems"
would consume to perform each method...or is memmove() kind
of like fread() and fwrite(), you THINK that they're doing something
"special", but really they're using fgetc() and fputc() which in most
cases you probably could have just used yourself and eliminated
the "middleman"...
[snip]
Can't go wrong there...I will admit that I use to always use array
sub-scripting such as the OP because I found it easier conceptually
to understand, but then abandoned it largely for the equivalent
pointers as part of my paranoia about compiler "optimizations"
(I turn them ALL off, which means sub-scripting is not converted
to pointers, which I believe takes longer, so to retain the speed
I've manually converted most of the stuff like this).

Subscripting might be faster than pointer manipulation on some
systems; pointer manipulation might be faster on others. A decent
optimizing compiler will know which is faster, and can often convert
either form into whichever is best for the target system.

Consider letting the compiler do its job.

The compiler has "told" me which it considers faster, since it is
documented that the default optimization is to replace sub-scripts
with pointers. Therefore I cannot really be preventing the compiler
from doing its "job" (your opinion) by pre-emptively writing my
code to do the same thing as the "optimizer", at least for this
particular "optimization".

Riiiiiiiiiiiiiiight? Now as far as some other optimizations are
concerned, I can't re-write my code to emulate those, but THOSE
tend to be the ones I am really concerned about in the first place,
with GOOD reason...
 
3

31349 83652

Bill said:
In any event, does the following win the prize for most efficient
implementation of the presumed requirements of the function?

No :)
char *remove_beg_end_non_text(char *text) {
char *beg;
size_t length;

for(beg=text;*beg!='\0';beg++)
if(!isspace(*beg)) break;

BUG: missing cast
length=strlen(beg);

while(length>0)
if(!isspace(*(beg+(--length)))) break;

BUG: missing cast
*(beg+(++length))='\0';

return beg==text ? text : memmove(text,beg,length+1);
}

Also
remove_beg_end_non_text(NULL);
crashes on my computer.
And it could even win a little bonus prize
for obfuscatory conditions like if(!isspace(*(beg+(--length))))...

Obfuscatory conditions never win prizes (except at IOCCC).
 
B

Ben Bacarisse

Bill Reid said:
In any event, does the following win the prize for most efficient
implementation of the presumed requirements of the function?

I think you have a bug.
char *remove_beg_end_non_text(char *text) {
char *beg;
size_t length;

for(beg=text;*beg!='\0';beg++)
if(!isspace(*beg)) break;

I'd guess

for (beg = text; isspace((unsigned char)*beg); beg++);

makes shorter code with some compilers.
length=strlen(beg);

while(length>0)
if(!isspace(*(beg+(--length)))) break;

*(beg+(++length))='\0';

If length never is never decremented (because it was zero after the
initial space scan) then this writes outside the string. It always
helps to walk through what your code does in boundary cases like ""
and " ".
return beg==text ? text : memmove(text,beg,length+1);
}

Gotta admit, you couldn't reduce the cycles too much on
that, could you? And it could even win a little bonus prize
for obfuscatory conditions like if(!isspace(*(beg+(--length))))...

You might have obscure it even to yourself!
 
L

lawrence.jones

Bill Reid said:
This is not really responsive. I make a distinction between doing
an "delete" of about 20 characters from a text block of 300K and moving
10 characters one position forward. I find it difficult to believe that
some "specific" systems make no or counter-intuitive operational
distinctions between the two different methods of accomplishing the
operation.

Then you need to practice believing more things. :)

On some systems, a call to memmove() will generate an actual call to a
library function that moves one byte at a time just like your loop, and
thus will always be slower.

On other systems, a call to memmove() will generate an actual call to a
library function that uses a few (maybe even just one) machine
instructions to move the bytes much faster than your byte at a time
loop. Which is faster depends on the number of bytes moved, how much
overhead there is in a function call, and how much faster the machine
instructions are than your loop. It is likely that your loop will be
faster for "short" strings and slower for "long" strings, but exactly
where the break-even point is will vary from system to system. It might
be just a few bytes, it might be tens of bytes, it might even be
hundreds of bytes (although probably not).

On still other systems, a call to memmove() will generate those few (or
maybe even just one) machine instructions inline, which will almost
certainly be faster than your byte at a time loop all the time.
or is memmove() kind
of like fread() and fwrite(), you THINK that they're doing something
"special", but really they're using fgetc() and fputc() which in most
cases you probably could have just used yourself and eliminated
the "middleman"...

Although the C standard describes fread()/fwrite() as working as if they
called fgetc()/fputc() for each byte, I've never seen an implementation
where they actually do that; real implementations nearly always *do* do
something special instead. On the other hand, I've seen actual
implementations use all three of the above methods of implementing
memmove().
 
B

Bill Reid

Ben Bacarisse said:
"Bill Reid" <[email protected]> writes:
I think you have a bug.


I'd guess

for (beg = text; isspace((unsigned char)*beg); beg++);

makes shorter code with some compilers.

Yeah, but does it fly past the terminating null character? I think
you may have out-obfuscated me...

Also, I've not been following why the cast is important here...
If length never is never decremented (because it was zero after the
initial space scan) then this writes outside the string. It always
helps to walk through what your code does in boundary cases like ""
and " ".

Actually, the "boundary case" for this is the size of array; as long
as it is at least one character bigger than the "string" then this will
always work. But you are correct if you replace the word "string" with
the word "array fully-populated by the string" above.

There are actually only five possible test cases, which all should
ASSUME a "array fully-populated by the string" (which I unfortunately
didn't): spaces before text, spaces after the text, spaces before and
after the text, just spaces, empty string (six if you count the stupidity
of passing NULL). Pass those five (or six) cases, the thing is
perfect...
You might have obscure it even to yourself!

OK, I need to work on a few things to make it the perfect piece
of obscure ruthlessly efficient code...
 
B

Bill Reid

31349 83652 said:
Bill Reid wrote:

No :)

Honorable mention?
BUG: missing cast

I've not been following why the cast is important; is this some type
of "error" that has never actually occured on the planet Earth but MIGHT
happen in the unpredictable future?
BUG: missing cast
IBID.

You missed a bug!
Also
remove_beg_end_non_text(NULL);
crashes on my computer.

Of course it does, so why did you do it? Just how far down do
we have to push error checking? If your answer is "to the depths
of hell", this will make you happy, but I'll bet you'll still manage
to crash your computer SOMEHOW if you do stuff like that:

if(text==NULL) return text;
Obfuscatory conditions never win prizes (except at IOCCC).

Even if they actually make things work "better"?
 
B

Bill Reid

Mark McIntyre said:
Bill said:
Keith Thompson said:
[...]
An alternative would be to use memmove() here, so you don't
have to do it byte by byte.
What would be the actual metrics to make a decision here
one way or the other? I myself just assume for this kind of stuff
that it will only be done on relatively small strings, just a few
characters, so it seems that a call to memmove() might be overkill.
Or is it? DOES it depend on the size of the block you'll
be moving? How many cycles does it take to assign one
character to a previous array position, versus the overhead of
a call to memmove(), and the operation of the function itself?
All this is system-specific.

This is not really responsive. ....I find it difficult to believe that
some "specific" systems make no or counter-intuitive operational
distinctions between the two different methods of accomplishing the
operation.

You misunderstand. Keith's point is that how this is done is not defined
by hte C language, it is specific to individual implementations. Your
system may do it differently to mine.

You misunderstand the English language, since I specifically used
the word "system" above.
So to get a good answer, you need to ask in a system-specific news group.

OK, I'll take that as a "I have no idea"...
Between 0 and a billion, depending on how your particular system
implements the function.....

OK, I'll take that as a "I REALLY have no idea", since all you had
to do was pick ANY system (or two or three) that you were familiar with
to answer the question...
 
B

Bill Reid

Eric Sosman said:
Bill said:
Keith Thompson said:
[...]
An alternative would be to use memmove() here, so you don't
have to do it byte by byte.
What would be the actual metrics to make a decision here
one way or the other? I myself just assume for this kind of stuff
that it will only be done on relatively small strings, just a few
characters, so it seems that a call to memmove() might be overkill.
Or is it? DOES it depend on the size of the block you'll
be moving? How many cycles does it take to assign one
character to a previous array position, versus the overhead of
a call to memmove(), and the operation of the function itself?
All this is system-specific.

This is not really responsive.

Valid complaint: Although most of your paragraph was about
system-specific nonsense, the actual question "What would be
the actual metrics?" can be addressed at least in part without
system specificity. Here are some "actual metrics" for your
consideration; others may occur to your thought:

- Brevity
- Clarity
- Maintainability
- Program size
- Execution time
- Price of tea in China

I am left wondering how you manage to cope with everyday life
with such poor reading comprehension. I specifically asked about
"cycles", and you respond with "price of tea"...
All this is system-specific.

I was looking for the "Answers Department", but accidentally walked
into the "Department of Begging the Question" (part of the "Clueless
Pedants Department", sharing a common desk with the "Pointless
Arguments Department")...

In any event, I guess "Jens Thoms Toerring" was totally off-base in
suggesting memmove() over re-writing the string in place, since he
clearly didn't have access to the non-information about the specific
"system"...
 
V

vippstar

You are correct that size_t is an unsigned integer type, and can never
be negative. But that doesn't mean that underflowing it won't cause a
very serious problem. Consider:
Compile and execute this and see what you get.
I don't see underflowing occuring anywhere in that code. Unsigned
integers can not underflow or overflow.
I do see your point, but your terminology was not correct.
Only signed integers can overflow, and when that happends the behavior
is not defined.
 
C

CBFalconer

(spaces added for legibility above)
I've not been following why the cast is important; is this some
type of "error" that has never actually occured on the planet
Earth but MIGHT happen in the unpredictable future?

Serious bug. isspace(int) requires the integer value of an
unsigned char. Repeat, unsigned. If the char type on any machine
is signed isspace can receive a negative value, and quite likely
blowup. That's why the argument to isspace should receive an
explicit cast. The only negative value those functions can receive
is EOF.

The cast can be avoided when the integer output of getc() is
passed, because getc returns the appropriately cast value in the
first place. But you are getting those chars from a string, and
whether a raw char is signed or unsigned is always undefined.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,777
Messages
2,569,604
Members
45,208
Latest member
RandallLay

Latest Threads

Top