trim whitespace v3

S

Seebs

The flaw in his approach is that he's not outlining his solution
before writing the code. I'm a professional developer [been working
for nearly a decade in crypto/swdevel] and I still mock out complex
functions [like say PKCS #8 parsers] in comment blocks before writing
any C statements.

I think I agree with this analysis -- especially because it describes
what I tend to do wrong too. I'm wayyyy too inclined to jump in and
start typing.
Yeah but he doesn't know what he's doing. He needs to think
abstractly first. E.g., what am I doing, what are the discrete steps,
etc.

For many people, thinking abstractly about something requires first having
gotten some hands-on time with it. In short, they *can't* do that first,
until they've done something else a few times and developed the necessary
infrastructure. And the only reason I say "many" and not "all" is that there
appear to exist a handful of counterexamples. (I'm probably one of them;
I handle concrete things better if I've been given a purely abstract map
of the space first.)
My solution is shorter because I thought about the problem for a bit
and came up with that.

Also, I think, because you are not trying to solve unsolvable side-problems.

-s
 
S

Seebs

The more you say, the more likely your words will not match the reality
of the code.

True. But! If what you say is correct, then *the code is wrong*. And
that's the thing -- people don't care that your program does exactly what
you coded it to do, they care whether it does what you told them it would
do.

A sufficient description to be able to predict the output of a function
without reading its source is pretty valuable, and I think effectively
essential; without it, the code is useless.

You have mentioned interest in developing a stable library of solutions
like this one. Well, stage one in that will be to come up with definitions
sufficiently useful that people COULD use the functions, because without
those definitions, there's no value to them in the functions.
I try to keep the code clean and the words few. That way,
you get an overview in "Define ideas," but for the details you must read
the code.

Okay, imagine that we are to look at Tom's trim() function which strips
internal whitespace also.

Is that a bug that we should fix? How would you know, without a spec
a little more detailed than "remove whitespace"?

-s
 
J

John Kelly

First of all let me clarify that this is not a big issue for me: the
code is yours and it should be up to you to make it portable (assuming
that target falls within your aims).

I'm just doing my best to help in the few spots that I can (testing your
code with a toolchain and a system that you have not at your disposal).

Going on with the same spirit: in my implementation PTRDIFF_MAX resides
in <stdint.h> (it didn't find it because you didn't include that header).

Have you tried v3?

# if __STDC_VERSION__ >= 199901L
# include <stdint.h>
# endif

<stdint.h> won't be inlcluded, unless the macro above is true. On the
gcc I am using, that means calling gcc with the option:

-std=c99

EOVERFLOW is definitely missing.

That seems strange for a gcc environment.

http://www.opengroup.org/onlinepubs/000095399/basedefs/errno.h.html
 
K

Keith Thompson

John Kelly said:
Right.

I'm not suggesting that he #inlcude low-level headers. But that can
help him learn where the constants reside, and then backtrack to figure
out what main headers are needed, if the constants exist somewhere in
his environment.

I suggest that a recursive grep of /usr/include should not be the
first thing you try when you want to know whether a given identifier
is defined by your system. Start with the documentation. Actually,
start with the C standard (or a good C reference).

For example, on my system a number of headers files are under
/usr/lib/gcc, *not* under /usr/include.

I admit I sometimes grep /usr/include myself -- but I have enough
experience to understand that whatever information I get may
be unreliable. Definitions might be surrounded by #ifdefs whose
condition will never be true on my system.

Read the documentation. That's what it's for.
 
J

John Kelly

My advice for John: Describe *in English* exactly what trim() is
supposed to do. The description should be sufficiently precise that
two programmers implementing it will produce functions that yield
exactly the same results for valid arguments (including arguments
that are intended to trigger some sort of error indication).

I understand the principle you advocate. It's used in corporate
environments where the lawyers insist on clean room development with a
Chinese wall between specification and code. The reason is to guard
against possible copyright infringement.

But I don't have that need. I'm publishing it as open source with a
liberal Apache license, which allow usage with or without publishing
your derived source code.

Do with it what you will. With an Apache license, there are no onerous
restrictions preventing anyone from using it.

That means the clean room Chinese wall scenario is moot. Read the code.
There's no reason not to.
 
N

Nick

Seebs said:
I dunno. I think it's probably a combination of approach and experience.
He's clearly a fairly new programmer, and I think he deserves credit for
at least WANTING to deal with invalid inputs gracefully. I think it turns
out that, due to the nature of C, his attempts to do that are causing more
harm than good, but it's a good impulse to have.

I got a bit bogged down by all that stuff about the limits of ptrdiff_t
though. I find it hard to imagine when that could happen.

In fact, of course, one can write this program with three pointers, a
single (and a fraction) pass through the string and a state enum. You
start in an initial state, with the read and write pointers at the start
of the string.

In the initial state, you advance only the read pointer, until the read
pointer is non-blank, when you step to to state two.

In state two you copy from the read pointer to the write pointer and
advance both. If you see non-blank, instead you copy the read pointer
into the space pointer and move to state 3.

In state 3, you advance the read pointer until you hit non-space or the
end of the string. When this happens, if it's the end of the string you
just write EOS to the write pointer. If not, you copy from the space
pointer to the write pointer and advance both, until the space pointer
is the same as the read pointer, when you go back to state two.

There are details missed there (like early end of strings) and it needs
some testing, but that's the "proper" way to do it.

A nice thing about this is that you can easily pass it some flags to say
which of leading or trailing spaces you want to skip, and if you want to
fold multiple internal spaces into one.

It would take some careful testing to make sure you've got all the
corner cases, which is why I've sketched the solution, not written it.
 
N

Nick

Tom St Denis said:
My point of posting was to show that it's not that hard to come up
with a clean rough approximation of what he's trying to do. Heck
here's what you're talking about

void trim(char *src)
{
char *src1 = src, *src2, *src3;

// skip leading space
while (isspace(*src)) ++src;
if (!*src) { *src1 = 0; return; }

// find last non-ispace
src3 = src;
while (*src) { if (!isspace(*src)) { src2 = src; } ++src };

// move src3...src2 to src1 and append a NUL
memmove(src1, src3, (size_t)(src2-src3));
src1[(size_t)(src2-src3)] = 0;
}

Single pass, handles all space, handles no space, should be what he
wants. And I spend roughly 4 minutes on writing that C code for this
post.

Not quite single pass - you look over the whole string, and memmove
effectively contains another loop over most of the string (in typical
circumstances). OTOH, it could be quicker than my nearer-to-single-pass
sketch, as I manually walk the whole string with copying, while memmove
probably can use better optimised code.
 
J

John Kelly

I suggest that a recursive grep of /usr/include should not be the
first thing you try when you want to know whether a given identifier
is defined by your system. Start with the documentation. Actually,
start with the C standard (or a good C reference).

I try to break things first, and then see if I can fix them. ;-)

Read the documentation. That's what it's for.

Ugh.
 
K

Keith Thompson

John Kelly said:
See "Define ideas"

What? Was that in one of the versions you posted?
The more you say, the more likely your words will not match the reality
of the code. I try to keep the code clean and the words few. That way,
you get an overview in "Define ideas," but for the details you must read
the code.

If I have to read the code to know how to use your function, I'm not
interested in using it.
 
J

John Kelly

I got a bit bogged down by all that stuff about the limits of ptrdiff_t
though. I find it hard to imagine when that could happen.

In fact, of course, one can write this program with three pointers, a
single (and a fraction) pass through the string and a state enum. You
start in an initial state, with the read and write pointers at the start
of the string.

In the initial state, you advance only the read pointer, until the read
pointer is non-blank, when you step to to state two.

In state two you copy from the read pointer to the write pointer and
advance both. If you see non-blank, instead you copy the read pointer
into the space pointer and move to state 3.

In state 3, you advance the read pointer until you hit non-space or the
end of the string. When this happens, if it's the end of the string you
just write EOS to the write pointer. If not, you copy from the space
pointer to the write pointer and advance both, until the space pointer
is the same as the read pointer, when you go back to state two.

There are details missed there (like early end of strings) and it needs
some testing, but that's the "proper" way to do it.

A nice thing about this is that you can easily pass it some flags to say
which of leading or trailing spaces you want to skip, and if you want to
fold multiple internal spaces into one.

It would take some careful testing to make sure you've got all the
corner cases, which is why I've sketched the solution, not written it.

With a 1,000,000 byte string having one space at the front, your fancy
"state machine" performs 999,999 individual reads and stores. My code
moves the whole block all at once. Memmove() may be library optimized
to a single machine instruction.

Blindness to performance considerations is a mark of novice programmers,
Seebs pseudo-analysis notwithstanding.
 
K

Keith Thompson

John Kelly said:
I try to break things first, and then see if I can fix them. ;-)


Ugh.

You ask for my help, then you refuse to follow my advice.

That's your right, of course, but I don't think there's anything more
I can do for you (unless you change your mind and start listening
to people who have more experience than you do).
 
K

Keith Thompson

John Kelly said:
With a 1,000,000 byte string having one space at the front, your fancy
"state machine" performs 999,999 individual reads and stores. My code
moves the whole block all at once. Memmove() may be library optimized
to a single machine instruction.

That single machine instruction, if it exists, almost certainly
has O(N) performance. Unless you've measured the performance,
you don't know which is faster. (There have been cases where
microcoded CPU instructions have turned out to be slower than an
explicit loop that does the same thing.)
Blindness to performance considerations is a mark of novice programmers,
Seebs pseudo-analysis notwithstanding.

Premature obsession with performance considerations is also a mark
of novice programmers.
 
J

John Kelly

You ask for my help, then you refuse to follow my advice.

When I asked a specific question you provided no answer.

That's your right, of course, but I don't think there's anything more
I can do for you (unless you change your mind and start listening
to people who have more experience than you do).

I don't know who has more experience. But when you don't answer
specific questions, it's clear you have an agenda with more focus on
asserting yourself than helping individuals.
 
S

Seebs

I understand the principle you advocate.

No, you don't.
It's used in corporate
environments where the lawyers insist on clean room development with a
Chinese wall between specification and code. The reason is to guard
against possible copyright infringement.

Maybe it is, but that's not why it's useful.
Do with it what you will. With an Apache license, there are no onerous
restrictions preventing anyone from using it.

But there may be policies preventing them anyway.
That means the clean room Chinese wall scenario is moot. Read the code.
There's no reason not to.

Your code is frequently buggy. How should we determine whether your code is
intentionally doing something stupid, or we've found a bug in it? How should
we tell whether other users of your code might be depending on a given
behavior which seems buggy to us?

Open source software has every bit as much of a need for clear documentation
as anything else does, perhaps more so -- because it is INTENDED to have
multiple users and multiple maintainers.

-s
 
S

Seebs

I got a bit bogged down by all that stuff about the limits of ptrdiff_t
though. I find it hard to imagine when that could happen.

It is not entirely obvious to me that it COULD happen.

More importantly, I think it is exceedingly unlikely that, on any machine,
you'll ever see it happen when something else worse hasn't happened first.
Which is why I don't think it really "could" happen.
It would take some careful testing to make sure you've got all the
corner cases, which is why I've sketched the solution, not written it.

Yes, but it's very nicely sketched.

-s
 
J

John Kelly

What? Was that in one of the versions you posted?

If you don't know the answer to that question, you didn't even browse
the top of v3.

If I have to read the code to know how to use your function, I'm not
interested in using it.

Why should I care what you think.
 
I

Ian Collins

The best form for that description is unit tests. Clear, completely
unambiguous and shows how the code is used as well as what it does.
I understand the principle you advocate. It's used in corporate
environments where the lawyers insist on clean room development with a
Chinese wall between specification and code. The reason is to guard
against possible copyright infringement.

It is also used anywhere were quality and/or teamwork is important.
But I don't have that need. I'm publishing it as open source with a
liberal Apache license, which allow usage with or without publishing
your derived source code.

That's true, in your case it appears neither quality nor collaboration
is important.
 
S

Seebs

With a 1,000,000 byte string having one space at the front, your fancy
"state machine" performs 999,999 individual reads and stores. My code
moves the whole block all at once. Memmove() may be library optimized
to a single machine instruction.

It seems unlikely that it could do this in the general case -- I know of
some fast copy routines, but none that can take an arbitrary size argument.

There was at least one machine which had a single machine instruction
equivalent to strcpy(), but on which that instruction was DRAMATICALLY
slower than a plain loop written in C.

It is not necessarily the case that memmove() is enough faster to notice.
In any event, it's not especially hard to update the state machine design
to use memmove for large copies.
Blindness to performance considerations is a mark of novice programmers,
Seebs pseudo-analysis notwithstanding.

There are many states from "novice" to "expert" (google "Dreyfus model").
I have never yet met a novice programmer who was not attentive to
"performance considerations", but they've usually been extremely bad at
it, because they don't have the intuition developed for knowing where
performance matters.

In general, the best advice, which seems common among experts, is:

Write something clear, and then measure its performance.

Now, if you do need to change it to improve performance, you have the
advantage of a good test case to verify that you didn't break it.

-s
 
J

John Kelly

That single machine instruction, if it exists, almost certainly
has O(N) performance. Unless you've measured the performance,
you don't know which is faster. (There have been cases where
microcoded CPU instructions have turned out to be slower than an
explicit loop that does the same thing.)


Premature obsession with performance considerations is also a mark
of novice programmers.

You and Seebs make a good tag team. But I don't play that sport.
 
J

John Kelly

in your case it appears neither quality nor collaboration is important.

I listened to feedback and made corresponding improvements. The c.l.c
truth is, some of you hate anyone who won't kneel to you.

I bow to a higher authority.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

trim whitespace, bullet proof version 63
trim whitespace 194
trim 6
Trim string 42
Request for source code review of simple Ising model 88
Strange bug 65
malloc and maximum size 56
Dead Code? 4

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,586
Members
45,088
Latest member
JeremyMedl

Latest Threads

Top