trim whitespace v3

K

Keith Thompson

Francesco S. Carta said:
]
How did you confirm it?

If this source file:

#include<errno.h>
#ifndef EOVERFLOW
#error "EOVERFLOW is not defined"
#endif

triggers the #error during compilation, you can be pretty sure
that your system doesn't define EOVERFLOW -- though there *might*
be a way to change the compiler's mode so it does some magic to
make it visible.

Already tested that by running the code in the OP of this thread, the
compiler halted at the first instance of EOVERFLOW.

Can you try compiling the above 4-line source file? I'm 90+% sure
you'll get a compiler error message referring to line 3 and containing
the string "EOVERFLOW is not defined".

Actually, a Google search finds numerous references to EOVERFLOW being
missing under MinGW, so make that 99%.

Still if you have a moment, I'd like to make it 100%.
I've searched for that token into all files of the packages, and I found
no (meaningful) occurrence of it. Just once in a comment of an internal
header, but no definition whatsoever (that commented instruction used it
instead of defining it).

If there is any way to force MinGW to magically pull it out from the
compiler binary, that goes beyond my knowledge.

If it were defined, it would probably pull it out of some header file,
not out of the compiler binary.
 
M

Moi

With a 1,000,000 byte string having one space at the front, your fancy
"state machine" performs 999,999 individual reads and stores. My code
moves the whole block all at once. Memmove() may be library optimized
to a single machine instruction.

If you have a 1MB string (or memory area, or "object"), and you want
to call trim() on it, you probably have a design problem.
Let me guess: it starts with an 'X' ?

In any case: getting the string into memory will probably be a bigger problem
than calling trim() in the first place.
(unless you call similar functions repeatedly ;-)

HTH,
AvK
 
J

John Kelly

Trim leading and trailing whitespace from any null terminated string
found in memory, no matter whether data or garbage. Reports failure
and quits if null terminator not found within system defined limit.
Encountering such a specification for code to be written, a programmer
might think that the program should look through memory to find any
null terminated strings, and remove leading and trailing whitespace
from each such string. Some programmers might squeeze out such
whitespace by moving the strings together in memory, others might
zero the bytes where such whitespace is found, etc. Some programmers
might use system calls to find out how much memory has been allocated
to your program so they know how much memory to scan, others might try
to enforce a string registration regime so they know which strings to
work on. All in all, it seems likely that n programmers implementing
it will produce n different, incompatible, and useless functions, vs
the desired case of all producing interchangeable functions.
Minor note, keep verb tenses parallel -- avoid Trim/Reports/quits,
use Trim/Report/quit or Trims/Reports/quits.

Yes, the clarity of that could be improved. I'll see what I can do.

When advice or criticism is specific and constructive, it's valuable.
Otherwise, it's not.

I'll wait for more code fixes before posting v4 though.
 
J

John Kelly

Why do you need to call memmove() at all? IMVHO, a low-level trim function
should only give you an offset and a length of the trimmed string. You can
then use that information to operate on the trimmed string directly without
resorting to shifting any memory around.

Yes you could do it that way. But after the trim, I want to operate on
the original pointer, not a different one. So I move the data in-place
and leave the pointer untouched.


Humm, well, I could understand calling memmove() and realloc() to free up
some space if the trimmed string is significantly smaller than the original
string.

If you allocated the memory you could do that. But trim() is designed
to work on any pointer, not just malloc()ed ones.
 
J

John Kelly

If you have a 1MB string (or memory area, or "object"), and you want
to call trim() on it, you probably have a design problem.
Let me guess: it starts with an 'X' ?

In any case: getting the string into memory will probably be a bigger problem
than calling trim() in the first place.
(unless you call similar functions repeatedly ;-)

Right.

Allowing for strings greater than 1MB has dubious practical value. I
could have chosen an arbitrary limit, but trying to avoid a hard-coded
constant was good exercise.
 
F

Francesco S. Carta

Tom St Denis said:
My point of posting was to show that it's not that hard to come up
with a clean rough approximation of what he's trying to do. Heck
here's what you're talking about

void trim(char *src)
{
char *src1 = src, *src2, *src3;

// skip leading space
while (isspace(*src)) ++src;
if (!*src) { *src1 = 0; return; }

// find last non-ispace
src3 = src;
while (*src) { if (!isspace(*src)) { src2 = src; } ++src };

// move src3...src2 to src1 and append a NUL
memmove(src1, src3, (size_t)(src2-src3));
src1[(size_t)(src2-src3)] = 0;
}

Single pass, handles all space, handles no space, should be what he
wants. And I spend roughly 4 minutes on writing that C code for this
post.

Not quite single pass - you look over the whole string, and memmove
effectively contains another loop over most of the string (in typical
circumstances). OTOH, it could be quicker than my nearer-to-single-pass
sketch, as I manually walk the whole string with copying, while memmove
probably can use better optimised code.

Is a single pass truly worth the effort? I seem to have achieved it, but
I did it at the cost of littering the code with tests... I wonder if
it's a true gain or if I just threw away rocks for mountains - if not
for bugged mountains, too...

Dissections are more than welcome.

//-------

#include <stdio.h>
#include <ctype.h>

/*
single pass in-place trim function
- trims all isspace() chars at both ends
- expects correctly null-terminated modifiable buffers
(except null pointers, which simply pass through)
*/

void trim(char *src) {
if (!src) return;
char* cur;
char* dest;
char* first_trail_char = 0;
int got_start = 0;
for (dest = cur = src; *cur; ++cur) {
int is_space = isspace(*cur);
if (got_start) {
if (is_space) {
if (!first_trail_char) first_trail_char = dest;
} else {
first_trail_char = 0;
}
*dest++ = *cur;
} else if (!is_space) {
got_start = 1;
*dest++ = *cur;
}
}
if (!got_start) {
*src = 0;
} else if (first_trail_char) {
*first_trail_char = 0;
} else {
*dest = 0;
}
}

// test case

char s0[] = " space both sides ";
char s1[] = "trailing space ";
char s2[] = " leading space";
char s3[] = " spaces both sides ";
char s4[] = "\n\t\rspecial trim both sides\n\t\r";
char s5[] = " \n\t\r special trim before";
char s6[] = "special trim after\n\t\r ";
char s7[] = "no trimming";
char s8[] = " ";
char s9[] = "";

char *strings[11] = {
s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, NULL
};

int main (void) {
int i;
for (i = 0; i < 11; i++) {
printf (">%s<\n", strings);
trim (strings);
printf ("[%s]\n", strings);
printf ("---\n");
}
return 0;
}

//-------
 
J

John Kelly

Passing a NULL to %s in printf.

The printf that was originally in trim() is gone.

But if you mean the testing code that others posted, I didn't examine
that closely. My focus was on the trim() code, which I wrote. The code
is available to all interested parties, so anyone who wants to, can fix
it.
 
I

Ian Collins

The printf that was originally in trim() is gone.

But if you mean the testing code that others posted, I didn't examine
that closely. My focus was on the trim() code, which I wrote. The code
is available to all interested parties, so anyone who wants to, can fix
it.

So you are happy to provide code under your name which crashes when run?
 
K

Keith Thompson

Moi said:
Exactly.
I would not call myself a professional developer, but I've read and written
enough code to ask the question: "what is this thing supposed to do".

In the case of the trim() function it is obvious:
* the string is supposed to be a valid object, decently null terminated, etc.

That "etc." is the tricky part. John Kelly has claimed that he's
trying to write a "bullet proof" trim() function. He's ignoring the
fact that, for reasonable meanings of "bullet proof", the task is
not possible, but he should at least specify what kinds of bullets
he expects to handle and how he intends to handle them.
* It is writeable, and should be modified in-place.
* trimming it can never make it larger. At most smaller.

That last isn't part of the specification; it's a consequence of it. (I
know you didn't say you were writing a specification.)

[...]
You know who likes Shakespeare, too ?

Jean-Luc Picard?
 
F

Francesco S. Carta

Francesco S. Carta said:
]
EOVERFLOW is definitely missing. [...]
How did you confirm it?

If this source file:

#include<errno.h>
#ifndef EOVERFLOW
#error "EOVERFLOW is not defined"
#endif

triggers the #error during compilation, you can be pretty sure
that your system doesn't define EOVERFLOW -- though there *might*
be a way to change the compiler's mode so it does some magic to
make it visible.

Already tested that by running the code in the OP of this thread, the
compiler halted at the first instance of EOVERFLOW.

Can you try compiling the above 4-line source file? I'm 90+% sure
you'll get a compiler error message referring to line 3 and containing
the string "EOVERFLOW is not defined".

Actually, a Google search finds numerous references to EOVERFLOW being
missing under MinGW, so make that 99%.

Still if you have a moment, I'd like to make it 100%.

You're welcome:

main.c|3|error: #error "EOVERFLOW is not defined"|
||=== Build finished: 1 errors, 0 warnings ===|
If it were defined, it would probably pull it out of some header file,
not out of the compiler binary.

That is what I thought, but I seem to recall that (at least in C++)
implementations are not required to actually ship headers along, they
just need to behave as if they are there when they get #include-d into
source files. I suppose the same could stand for C implementations.

That's what I meant above, more or less, with "pulling stuff out of the
binary".
 
K

Keith Thompson

Francesco S. Carta said:
You're welcome:

main.c|3|error: #error "EOVERFLOW is not defined"|
||=== Build finished: 1 errors, 0 warnings ===|
Thanks.


That is what I thought, but I seem to recall that (at least in C++)
implementations are not required to actually ship headers along, they
just need to behave as if they are there when they get #include-d into
source files. I suppose the same could stand for C implementations.

That's what I meant above, more or less, with "pulling stuff out of the
binary".

You're right that headers specified by the #include <...> syntax (as
opposed to the #define "..." syntax) needn't be actual files in C;
the standard refers to "headers", not to "files". Then again, the
implementation and/or OS can use the word "files" to mean anything
it likes.

My understanding is that MinGW uses gcc as its compiler, and that
gcc doesn't play this kind of trick.
 
F

Francesco S. Carta

Is a single pass truly worth the effort? I seem to have achieved it, but
I did it at the cost of littering the code with tests... I wonder if
it's a true gain or if I just threw away rocks for mountains - if not
for bugged mountains, too...

Dissections are more than welcome.

This one looks better, separating the leading trimming simplified the code:

//-------

#include <stdio.h>
#include <ctype.h>

/*
single pass in-place trim function
- trims all isspace() chars at both ends
- expects correctly null-terminated modifiable buffers
(except null pointers, which simply pass through)
*/

void trim(char *src) {
if (!src) return;
char* cur = src;
while (isspace(*cur)) cur++;
if (*cur) {
char* first_trail_char = 0;
char* dest;
for (dest = src; *cur; ++cur) {
if (isspace(*cur)) {
if (!first_trail_char) first_trail_char = dest;
} else {
first_trail_char = 0;
}
*dest++ = *cur;
}
if (first_trail_char) {
*first_trail_char = 0;
} else {
*dest = 0;
}
} else {
*src = 0;
}
}

// test case

char s0[] = " space both sides ";
char s1[] = "trailing space ";
char s2[] = " leading space";
char s3[] = " spaces both sides ";
char s4[] = "\n\t\rspecial trim both sides\n\t\r";
char s5[] = " \n\t\r special trim before";
char s6[] = "special trim after\n\t\r ";
char s7[] = "no trimming";
char s8[] = " ";
char s9[] = "";

char *strings[11] = {
s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, NULL
};

int main (void) {
int i;
for (i = 0; i < 11; i++) {
printf (">%s<\n", strings);
trim (strings);
printf ("[%s]\n", strings);
printf ("---\n");
}
return 0;
}

//-------
 
S

Seebs

// test case

char s0[] = " space both sides ";
char s1[] = "trailing space ";
char s2[] = " leading space";
char s3[] = " spaces both sides ";
char s4[] = "\n\t\rspecial trim both sides\n\t\r";
char s5[] = " \n\t\r special trim before";
char s6[] = "special trim after\n\t\r ";
char s7[] = "no trimming";
char s8[] = " ";
char s9[] = "";

char *strings[11] = {
s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, NULL
};

Why not:
char *strings[] = {
" space both sides ",
"trailing space ",
...,
NULL
};
for (i = 0; i < 11; i++) {

and then:
for (i = 0; strings; ++i) {

-s
 
F

Francesco S. Carta

You're right that headers specified by the #include<...> syntax (as
opposed to the #define "..." syntax) needn't be actual files in C;
the standard refers to "headers", not to "files". Then again, the
implementation and/or OS can use the word "files" to mean anything
it likes.

Ah said:
My understanding is that MinGW uses gcc as its compiler, and that
gcc doesn't play this kind of trick.

I couldn't make it sure before because I usually compile via
Code::Blocks without bothering too much about the details of my
tool-chain - the very first times I think I've stomped on some problems
because I've been testing my C in .cpp files, which surely got compiled
by the C++ compiler.

This should dissipate any doubt:

C:\Users\Admin>gcc --version
gcc (GCC) 4.4.0
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is
NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.


C:\Users\Admin>type test.c
#include<errno.h>
#ifndef EOVERFLOW
#error "EOVERFLOW is not defined"
#endif

C:\Users\Admin>gcc -c test.c
test.c:3:2: error: #error "EOVERFLOW is not defined"

C:\Users\Admin>
 
J

James

John Kelly said:
Yes you could do it that way. But after the trim, I want to operate on
the original pointer, not a different one. So I move the data in-place
and leave the pointer untouched.

It seems to me that having a two level API would be most beneficial. A low
level trim_ex() function which returns offset and length of trimmed string,
and a higher level function trim() that actually does in-place
modifications. The high level trim() function could easily be based on the
lower level trim_ex().



If you allocated the memory you could do that.

Yes, of course you are correct here.

But trim() is designed
to work on any pointer, not just malloc()ed ones.


Well, I would not put realloc() in the high level trim() function; memmove()
is fine.
 
J

John Kelly

EOVERFLOW is not defined

Sometimes the visibility depends on another define:

+/* On OSF/1 5.1, when _XOPEN_SOURCE_EXTENDED is not defined, the macros
+ EMULTIHOP, ENOLINK, EOVERFLOW are not defined. */

Maybe you can learn more from the MinGW community.
 
I

Ian Collins

The Open Group Base Specifications Issue 6
IEEE Std 1003.1, 2004 Edition

It lists common symbolic constants. Seems like any gcc environment
would have them.

If I find many environments lacking EOVERFLOW, I may use a different
symbolic constant. But if MinGW is the only platform lacking it, that
problem can wait.

So you go to extraordinary lengths to find a portable way to generate an
impractical maximum size, but use a non-standard error value?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

trim whitespace, bullet proof version 63
trim whitespace 194
trim 6
Trim string 42
Request for source code review of simple Ising model 88
Strange bug 65
malloc and maximum size 56
Dead Code? 4

Members online

No members online now.

Forum statistics

Threads
473,780
Messages
2,569,611
Members
45,286
Latest member
ChristieSo

Latest Threads

Top