trim whitespace v3

T

Tom St Denis

Must you shatter all illusions of those faithful to the creed?

Ah, so you *are* a troll like the rest and your gimmick is you write
overly complicated completely amateur code. I guess your ruse is to
show how incapable and clumsy the C language is by incompetently
developing shit software?

Tom
 
S

Seebs

Trim removes bytes and never adds them, so you need to allocate one
buffer the size of your input... let's see...
char *trim(const char *src)
{
char *tmpdst, *dst = calloc(1, strlen(src) + 1);

Hmm. This does mean that you necessarily do two passes through the
string; an in-place operation only has to do one.
if (!dst) return NULL;
tmpdst = dst;
while (*src) {
switch (*src) {
case ' ':
case '\n':
case '\t':
break;
default:
*tmpdst++ = *src;
}
++src;
}
return dst;
}

This is incorrect. The original trim() trims only left and right whitespace,
NOT whitespace in the middle of the string.
int trim_inplace(char *str)
{
char *tmp = trim(str);
if (!tmp) return -1;
assert(strlen(str) <= strlen(tmp)); // sanity check
strcpy(str, tmp);
free(tmp);
return 0;
}
OMG!!! Hierarchical programming!

And at this point, you're iterating over the whole string once to calculate
its length inside trim, once to copy it, twice for the assert, once for
the strcpy...

http://www.joelonsoftware.com/articles/fog0000000319.html
That's what separates a developer from a coder. In 27 lines of code
[both functions] I replaced what took him 100s of lines. And my
version has error checking, uses standard C functions, is portable,
and is plenty easy to read.

And which doesn't do what it's supposed to do.

If you look at all the examples and test cases in the thread, they're
consistent that " foo bar " => "foo bar", not "foobar".

-s
 
K

Keith Thompson

Tom St Denis said:
Trim removes bytes and never adds them, so you need to allocate one
buffer the size of your input... let's see...

char *trim(const char *src)
{
char *tmpdst, *dst = calloc(1, strlen(src) + 1);
if (!dst) return NULL;
tmpdst = dst;
while (*src) {
switch (*src) {
case ' ':
case '\n':
case '\t':
break;
default:
*tmpdst++ = *src;
}
++src;
}
return dst;
}

Done. Wow that's hard. Didn't involve 100s of lines of code, is
portable, simple to extend/change, etc.
[...]

But it doesn't have the virtue of being correct. It doesn't trim
leading and trailing whitespace; it removes *all* whitespace (or
rather, all occurrences of ' ', '\n', and '\t'; there are other
whitespace characters). I suggest using isspace() rather than
testing for specific characters (unless the specification says to
remove only those characters).

For example, trim(" hello world ") should give you "hello world";
your function gives you "helloworld".
 
S

Seebs

Just saying your method is flawed and nobody seems to be pointing you
on track.

You have a point here.
And you really ought to stop posting it over and over again
until you think the problem through.

This, I think, is good advice. I certainly have the problem of not stopping
to correctly articulate the problem I'm working on sometimes. Five minutes
spent trying to answer the question "what am I actually trying to do" beats
an hour or two of trying to do something.

-s
 
T

Tom St Denis

Hmm.  This does mean that you necessarily do two passes through the
string; an in-place operation only has to do one.

ERROR: Premature optimization leads to sloppy development.

If you're writing a parser that has no use for white space then just
program it to skip over it. You shouldn't have to move any bytes AT
ALL to parse a string with white space in it.
This is incorrect.  The original trim() trims only left and right whitespace,
NOT whitespace in the middle of the string.

Ah, true, I made the assumption that it was meant to trim spaces off
command line args for some reason. Simple enough to fix, basically
you want to trim spaces off the start until you hit a non-space, then
scan backwards. So it becomes

char *trim(const char *src)
{
char *tmpdst, *dst = calloc(1, strlen(src + 1));

if (!dst) return NULL;

tmpdst = dst;
while (isspace(*src)) { ++src; }
strcpy(dst, src);

tmpdst = &dst[strlen(dst)-1];
while (isspace(*tmpdst)) { *tmpdst-- = 0; }

return dst;
}

I'd use a macro for the space test [isspace for instance].
And at this point, you're iterating over the whole string once to calculate
its length inside trim, once to copy it, twice for the assert, once for
the strcpy...

Except where is this being used? I'd argue for the sake of command
line parsing who cares? It's a wasted optimization cycle. If I were
writing a parser I'd just make the FSM handle white space and not do
any sort of formatting.
That's what separates a developer from a coder.  In 27 lines of code
[both functions] I replaced what took him 100s of lines.  And my
version has error checking, uses standard C functions, is portable,
and is plenty easy to read.

And which doesn't do what it's supposed to do.

If you look at all the examples and test cases in the thread, they're
consistent that " foo bar " => "foo bar", not "foobar".

My bad, I misfigured out what he was trying to do. Hell even if done
in place the fix becomes

void trim(const char *str)
{
char *tmpsrc;

// trim leading spaces
tmpsrc = src;
while (isspace(*src)) { ++src; }
if (!*src) {
*tmpsrc = 0;
return;
}
memmove(tmpsrc, src, strlen(src)+1);

// trim trailing spaces
tmpsrc = &tmpsrc[strlen(tmpsrc)-1];
while (isspace(*tmpsrc)) { *tmpsrc-- = 0; }
}

Which with comments is still only a dozen lines long. So it's both
comparable to what he's doing and 6 times smaller.

Tom
 
T

Tom St Denis

[...]


Trim removes bytes and never adds them, so you need to allocate one
buffer the size of your input... let's see...
char *trim(const char *src)
{
   char *tmpdst, *dst = calloc(1, strlen(src) + 1);
   if (!dst) return NULL;
   tmpdst = dst;
   while (*src) {
      switch (*src) {
         case ' ':
         case '\n':
         case '\t':
            break;
         default:
            *tmpdst++ = *src;
      }
      ++src;
   }
   return dst;
}
Done.  Wow that's hard.  Didn't involve 100s of lines of code, is
portable, simple to extend/change, etc.

[...]

But it doesn't have the virtue of being correct.  It doesn't trim
leading and trailing whitespace; it removes *all* whitespace (or
rather, all occurrences of ' ', '\n', and '\t'; there are other
whitespace characters).  I suggest using isspace() rather than
testing for specific characters (unless the specification says to
remove only those characters).

For example, trim("  hello world  ") should give you "hello world";
your function gives you "helloworld".

Yeah my bad, Sorry John my original reply was foobared. However, I
replied to Seebs with a corrected in-place trim function that ought to
do the trick and is still fairly short [and portable and ...].

tom
 
J

John Kelly

I pointed out the two missing macros above just to let you know about an
implementation which needs further attention.

Just for the records I'll report this here:

#define EPERM 1 /* Operation not permitted */
#define ENOFILE 2 /* No such file or directory */ [snip]
#define ENOTEMPTY 41 /* Directory not empty (90 in Cyg?) */
#define EILSEQ 42 /* Illegal byte sequence */


They say "oil is where you find it."

On linux

# find /usr/include -type f -exec grep -Hn EOVERFLOW {} \;
/usr/include/asm-generic/errno.h:48:#define EOVERFLOW 75


On Interix

# find /usr/include -type f -exec grep -on EOVERFLOW {} \;
/usr/include/errno.h:122:#define EOVERFLOW 88


On MinGW I don't know where EOVERFLOW or PTRDIFF_MAX reside. I can only
suggest using tools like "find" and "grep" to locate them.
 
S

Seebs

Ah, true, I made the assumption that it was meant to trim spaces off
command line args for some reason.

I think it is, but that usually means trimming around them, not within
them.
Simple enough to fix, basically
you want to trim spaces off the start until you hit a non-space, then
scan backwards. So it becomes

char *trim(const char *src)
{
char *tmpdst, *dst = calloc(1, strlen(src + 1));

You probably mean
strlen(src) + 1
rather than
strlen(src + 1)
.... this is an off-by-two error.
while (isspace(*tmpdst)) { *tmpdst-- = 0; }

What happens if you reach the beginning of the string? It seems like you
test the character before the allocated region with isspace(). Now, that
should only happen for an empty string, at which point the strlen(src + 1)
probably failed too.

Makes some sense. Hmm. If I were doing allocation, I'd probably do:

char *trim(const char *src)
{
char *dst;
size_t len;

if (!src)
return NULL;
len = strlen(src);
while (isspace(*src)) {
++src;
--len;
}
dst = malloc(len + 1);
if (!dst)
return NULL;
strcpy(dst, src);
--len;
while (isspace(dst[len]) && len != (size_t) -1) {
dst[len--] = '\0';
}
return dst;
}

I moved the allocation lower because why not, and switched from pointers
to offsets so I could test for falling off the beginning.
Except where is this being used? I'd argue for the sake of command
line parsing who cares? It's a wasted optimization cycle. If I were
writing a parser I'd just make the FSM handle white space and not do
any sort of formatting.

Well, we don't know where it's being used; the OP seemed to be suggesting
generic use.

But I think it's good habit to ignore multiple runs through strings, they
do have a noticeable cost over repeated use.
void trim(const char *str)
{
char *tmpsrc;

// trim leading spaces
tmpsrc = src;
while (isspace(*src)) { ++src; }
if (!*src) {
*tmpsrc = 0;
return;
}
memmove(tmpsrc, src, strlen(src)+1);

// trim trailing spaces
tmpsrc = &tmpsrc[strlen(tmpsrc)-1];
while (isspace(*tmpsrc)) { *tmpsrc-- = 0; }
}

I think this one's protected against running off the beginning by the
first test for *src. I'd probably cache the intermediate
strlen(src) result, because it is obvious that strlen(tmpsrc) has to
be the same as strlen(src).
Which with comments is still only a dozen lines long. So it's both
comparable to what he's doing and 6 times smaller.

You will not catch me arguing that Kelly is a very good programmer (yet,
anyway). I do think he has potential, and if he could move past viewing the
world in terms of how things make him look, I think he's got some of the
right instincts.

-s
 
K

Keith Thompson

Tom St Denis said:
My bad, I misfigured out what he was trying to do. Hell even if done
in place the fix becomes

void trim(const char *str)
{
char *tmpsrc;

// trim leading spaces
tmpsrc = src;
while (isspace(*src)) { ++src; }
if (!*src) {
*tmpsrc = 0;
return;
}
memmove(tmpsrc, src, strlen(src)+1);

// trim trailing spaces
tmpsrc = &tmpsrc[strlen(tmpsrc)-1];
while (isspace(*tmpsrc)) { *tmpsrc-- = 0; }
}

Which with comments is still only a dozen lines long. So it's both
comparable to what he's doing and 6 times smaller.

And incorrect. You call the parameter "str" and refer to it as
"src". With that fixed, the assignment "tmpsrc = src;" discards the
"const" qualifier.

Both problems can be fixed without increasing the size of the
code (change "const char *str" to "char *src"), and the result
is probably correct (I haven't thoroughly checked), but if you're
going to demonstrate how easy something is you should probably at
least compile it before posting.

Here's a rough outline of how I'd might do it:

Scan from the start of the string to the end, keeping track of
the positions of the first and last non-whitespace characters.
Then do the memmove() and write the trailing '\0'. Think about all
the corner cases and add special-case code if necessary (I haven't
actually thought about it; it may well be that no special-case code
is needed).

This does a single scan over the string, avoiding the strlen()
and reverse scan, which *might* be significant if you expect to
deal with very long strings.

Of course this isn't, and cannot be, "bullet proof", in the sense
that it will run off the end of an array that contains no '\0'
characters. Include in the documentation a statement that the
behavior is undefined in that case.
 
K

Keith Thompson

Seebs said:
You have a point here.


This, I think, is good advice. I certainly have the problem of not stopping
to correctly articulate the problem I'm working on sometimes. Five minutes
spent trying to answer the question "what am I actually trying to do" beats
an hour or two of trying to do something.

Indeed.

My advice for John: Describe *in English* exactly what trim() is
supposed to do. The description should be sufficiently precise that
two programmers implementing it will produce functions that yield
exactly the same results for valid arguments (including arguments
that are intended to trigger some sort of error indication).
 
S

Seebs

They say "oil is where you find it."

I think you have misunderstood the complaint.
On MinGW I don't know where EOVERFLOW or PTRDIFF_MAX reside. I can only
suggest using tools like "find" and "grep" to locate them.

It is possible that they don't reside, because EOVERFLOW is not a symbol
implementations are obliged to provide, and some don't.

-s
 
S

Seebs

They say "oil is where you find it."

I think you have misunderstood the complaint.
On MinGW I don't know where EOVERFLOW or PTRDIFF_MAX reside. I can only
suggest using tools like "find" and "grep" to locate them.

It is possible that they don't reside, because EOVERFLOW is not a symbol
implementations are obliged to provide, and some don't.

-s
 
K

Keith Thompson

John Kelly said:
They say "oil is where you find it."

On linux

# find /usr/include -type f -exec grep -Hn EOVERFLOW {} \;
/usr/include/asm-generic/errno.h:48:#define EOVERFLOW 75


On Interix

# find /usr/include -type f -exec grep -on EOVERFLOW {} \;
/usr/include/errno.h:122:#define EOVERFLOW 88


On MinGW I don't know where EOVERFLOW or PTRDIFF_MAX reside. I can only
suggest using tools like "find" and "grep" to locate them.

It doesn't matter *where* EOVERFLOW is defined. It could be in
<errno.h>, or in some file directly or indirectly included from
<errno.h> -- or nowhere.

Or it might be defined in some implementation-defined header;
POSIX defines all its non-C-standard E* macros in <errno.h>,
and the C standard encourages this, but other systems might
behave differently. In that case, I'd suggest checking the
documentation first. Just because you happen to find the definition
in /usr/include/asm-generic/errno.h, that doesn't mean you should
write #include <asm-generic/errno.h>.

Similar considerations apply to PTRDIFF_MAX. If #include <stdint.h>
doesn't make PTRDIFF_MAX visible, chances are your system doesn't
define it at all.

#include <errno.h>
#include <stdio.h>
int main(void)
{
#ifdef EOVERFLOW
puts("EOVERFLOW is defined");
#else
puts("EOVERFLOW is not defined");
#endif
return 0;
}
 
T

Tom St Denis

I think it is, but that usually means trimming around them, not within
them.



You probably mean
        strlen(src) + 1
rather than
        strlen(src + 1)
... this is an off-by-two error.


What happens if you reach the beginning of the string?  It seems like you
test the character before the allocated region with isspace().  Now, that
should only happen for an empty string, at which point the strlen(src + 1)
probably failed too.

All this is proves is that my code when writing really quickly and
concentrating on other things is about as good as John Kelly's best
effort.

The 2nd version fixes the "nothing but space" bug.
Well, we don't know where it's being used; the OP seemed to be suggesting
generic use.

Ok, except that as a "library" I'd still provide both copies since you
might not have write access to the source string. You could go the
other way and make a trim_copy that copies the string into a new
buffer then trims it inplace. Really amounts to the same work either
way.
But I think it's good habit to ignore multiple runs through strings, they
do have a noticeable cost over repeated use.

True, except you have to be careful how you optimize. I suppose
optimizing a library makes sense. But if I were writing some off hand
code to parse a command line I wouldn't be so concerned with its
optimization. Mostly because it accounts for so little of the
execution time.
void trim(const char *str)
{
   char *tmpsrc;
   // trim leading spaces
   tmpsrc = src;
   while (isspace(*src)) { ++src; }
   if (!*src) {
     *tmpsrc = 0;
     return;
   }
   memmove(tmpsrc, src, strlen(src)+1);
   // trim trailing spaces
   tmpsrc = &tmpsrc[strlen(tmpsrc)-1];
   while (isspace(*tmpsrc)) { *tmpsrc-- = 0; }
}

I think this one's protected against running off the beginning by the
first test for *src.  I'd probably cache the intermediate
strlen(src) result, because it is obvious that strlen(tmpsrc) has to
be the same as strlen(src).

Fair enough observation.
You will not catch me arguing that Kelly is a very good programmer (yet,
anyway).  I do think he has potential, and if he could move past viewing the
world in terms of how things make him look, I think he's got some of the
right instincts.

I suppose I didn't help myself by not actually testing the code I was
writing. I think mostly I wanted to point out the problem isn't as
nearly as complicated as they're trying to make it out to be. With a
couple minor things touched up the trim function [2nd one] should be
"production worthy" and it's a trivial small piece of code. Compared
to his ridiculous 74 lines of nonsense.

Thing is if I can write a decent approximation of a trim function off
the top of my head in 5 minutes and he takes days to write something
that is so far inferior it speaks volumes to his approach.

What would be nice is if people like John who are obviously new to
computer science would write pseudo code first. Then C. Trying to
tackle the semantics and syntax of a language while simultaneously
trying to learn comp.sci is just a bad idea. That's like trying to
learn how to read shakespeare while still trying to learn how to read
simple English.

Tom
 
S

Seebs

Thing is if I can write a decent approximation of a trim function off
the top of my head in 5 minutes and he takes days to write something
that is so far inferior it speaks volumes to his approach.

I dunno. I think it's probably a combination of approach and experience.
He's clearly a fairly new programmer, and I think he deserves credit for
at least WANTING to deal with invalid inputs gracefully. I think it turns
out that, due to the nature of C, his attempts to do that are causing more
harm than good, but it's a good impulse to have.
What would be nice is if people like John who are obviously new to
computer science would write pseudo code first. Then C. Trying to
tackle the semantics and syntax of a language while simultaneously
trying to learn comp.sci is just a bad idea. That's like trying to
learn how to read shakespeare while still trying to learn how to read
simple English.

I'm not sure about that. I think you have to do SOMETHING that you can
mess with and try it to see how it works. Joel's rant about string
manipulation does point out an issue you sometimes have if you don't start
down at the byte-manipulation level.

-s
 
T

Tom St Denis

[...]


My bad, I misfigured out what he was trying to do.  Hell even if done
in place the fix becomes
void trim(const char *str)
{
   char *tmpsrc;
   // trim leading spaces
   tmpsrc = src;
   while (isspace(*src)) { ++src; }
   if (!*src) {
     *tmpsrc = 0;
     return;
   }
   memmove(tmpsrc, src, strlen(src)+1);
   // trim trailing spaces
   tmpsrc = &tmpsrc[strlen(tmpsrc)-1];
   while (isspace(*tmpsrc)) { *tmpsrc-- = 0; }
}
Which with comments is still only a dozen lines long.  So it's both
comparable to what he's doing and 6 times smaller.

And incorrect.  You call the parameter "str" and refer to it as
"src".  With that fixed, the assignment "tmpsrc = src;" discards the
"const" qualifier.

Both problems can be fixed without increasing the size of the
code (change "const char *str" to "char *src"), and the result
is probably correct (I haven't thoroughly checked), but if you're
going to demonstrate how easy something is you should probably at
least compile it before posting.

Here's a rough outline of how I'd might do it:

Scan from the start of the string to the end, keeping track of
the positions of the first and last non-whitespace characters.
Then do the memmove() and write the trailing '\0'.  Think about all
the corner cases and add special-case code if necessary (I haven't
actually thought about it; it may well be that no special-case code
is needed).

This does a single scan over the string, avoiding the strlen()
and reverse scan, which *might* be significant if you expect to
deal with very long strings.

All valid design/development comments.

My point of posting was to show that it's not that hard to come up
with a clean rough approximation of what he's trying to do. Heck
here's what you're talking about

void trim(char *src)
{
char *src1 = src, *src2, *src3;

// skip leading space
while (isspace(*src)) ++src;
if (!*src) { *src1 = 0; return; }

// find last non-ispace
src3 = src;
while (*src) { if (!isspace(*src)) { src2 = src; } ++src };

// move src3...src2 to src1 and append a NUL
memmove(src1, src3, (size_t)(src2-src3));
src1[(size_t)(src2-src3)] = 0;
}

Single pass, handles all space, handles no space, should be what he
wants. And I spend roughly 4 minutes on writing that C code for this
post.

Tom
 
F

Francesco S. Carta

I pointed out the two missing macros above just to let you know about an
implementation which needs further attention.

Just for the records I'll report this here:

#define EPERM 1 /* Operation not permitted */
#define ENOFILE 2 /* No such file or directory */ [snip]
#define ENOTEMPTY 41 /* Directory not empty (90 in Cyg?) */
#define EILSEQ 42 /* Illegal byte sequence */


They say "oil is where you find it."

On linux

# find /usr/include -type f -exec grep -Hn EOVERFLOW {} \;
/usr/include/asm-generic/errno.h:48:#define EOVERFLOW 75


On Interix

# find /usr/include -type f -exec grep -on EOVERFLOW {} \;
/usr/include/errno.h:122:#define EOVERFLOW 88


On MinGW I don't know where EOVERFLOW or PTRDIFF_MAX reside. I can only
suggest using tools like "find" and "grep" to locate them.

First of all let me clarify that this is not a big issue for me: the
code is yours and it should be up to you to make it portable (assuming
that target falls within your aims).

I'm just doing my best to help in the few spots that I can (testing your
code with a toolchain and a system that you have not at your disposal).

Going on with the same spirit: in my implementation PTRDIFF_MAX resides
in <stdint.h> (it didn't find it because you didn't include that header).

EOVERFLOW is definitely missing.
 
T

Tom St Denis

I dunno.  I think it's probably a combination of approach and experience.
He's clearly a fairly new programmer, and I think he deserves credit for
at least WANTING to deal with invalid inputs gracefully.  I think it turns
out that, due to the nature of C, his attempts to do that are causing more
harm than good, but it's a good impulse to have.

The flaw in his approach is that he's not outlining his solution
before writing the code. I'm a professional developer [been working
for nearly a decade in crypto/swdevel] and I still mock out complex
functions [like say PKCS #8 parsers] in comment blocks before writing
any C statements.
I'm not sure about that.  I think you have to do SOMETHING that you can
mess with and try it to see how it works.  Joel's rant about string
manipulation does point out an issue you sometimes have if you don't start
down at the byte-manipulation level.

Yeah but he doesn't know what he's doing. He needs to think
abstractly first. E.g., what am I doing, what are the discrete steps,
etc.

In a reply to Keith I wrote a trim() function in 4 minutes that does a
single pass, in place trim. My routine is STILL shorter than Johns,
and not because it's obfuscated and hard to read, heck I even have
comments in it.

My solution is shorter because I thought about the problem for a bit
and came up with that.

Tom
 
J

John Kelly

My advice for John: Describe *in English* exactly what trim() is
supposed to do. The description should be sufficiently precise that
two programmers implementing it will produce functions that yield
exactly the same results for valid arguments (including arguments
that are intended to trigger some sort of error indication).

See "Define ideas"

The more you say, the more likely your words will not match the reality
of the code. I try to keep the code clean and the words few. That way,
you get an overview in "Define ideas," but for the details you must read
the code.
 
J

John Kelly

Or it might be defined in some implementation-defined header;
POSIX defines all its non-C-standard E* macros in <errno.h>,
and the C standard encourages this, but other systems might
behave differently. In that case, I'd suggest checking the
documentation first. Just because you happen to find the definition
in /usr/include/asm-generic/errno.h, that doesn't mean you should
write #include <asm-generic/errno.h>.

Right.

I'm not suggesting that he #inlcude low-level headers. But that can
help him learn where the constants reside, and then backtrack to figure
out what main headers are needed, if the constants exist somewhere in
his environment.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

trim whitespace, bullet proof version 63
trim whitespace 194
trim 6
Trim string 42
Request for source code review of simple Ising model 88
Strange bug 65
malloc and maximum size 56
Dead Code? 4

Members online

No members online now.

Forum statistics

Threads
473,775
Messages
2,569,601
Members
45,182
Latest member
alexanderrm

Latest Threads

Top