trim whitespace v3

John Kelly

/*

Define author
John Kelly, August 20, 2010

Define copyright
Copyright John Kelly, 2010. All rights reserved.

Define license
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this work except in compliance with the License.
You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0

Define symbols and (words)
exam ......... temporary char *
hast ......... temporary char * to alpha !isspace
keep ......... temporary char * to omega !isspace
strlen ....... string length
trimlen ...... trimmed string length
ts ........... temporary string
tu ........... temporary size_t
xu ........... fail safe size_t


Define ideas

Trim leading and trailing whitespace from any null terminated string
found in memory, no matter whether data or garbage. Reports failure
and quits if null terminator not found within system defined limit.

Parameters
char * to string
NULL, or size_t * for trimlen result

On success returns 0

On failure returns -1 after setting errno to one of
EINVAL
EOVERFLOW


*/

# if __STDC_VERSION__ >= 199901L
# include <stdint.h>
# endif
# include <stddef.h>
# include <stdlib.h>
# include <limits.h>
# include <ctype.h>
# include <errno.h>
# include <stdio.h>
# include <string.h>
# include <malloc.h>

# ifndef SIZE_MAX
# define SIZE_MAX ((size_t)-1)
# endif

ptrdiff_t
ptrdiff_max_ (void)
{
    ptrdiff_t last, next;

    for (last = 32767; (next = 2 * (double) last + 1) > last;) {
        last = next;
    }
    return last;
}

static int
trim (char *const ts, size_t *trimlen)
{
    size_t tu;
    unsigned char *exam;
    unsigned char *hast;
    unsigned char *keep;
# ifdef PTRDIFF_MAX
    ptrdiff_t const ptrdiff_max = PTRDIFF_MAX;
# else
    ptrdiff_t const ptrdiff_max = ptrdiff_max_ ();
# endif
    size_t const xu = ptrdiff_max < SIZE_MAX ? ptrdiff_max : SIZE_MAX;

    if (!ts) {
        errno = EINVAL;
        return -1;
    }
    tu = 0;
    exam = (unsigned char *) ts;
    while (++tu < xu && isspace (*exam)) {
        ++exam;
    }
    if (tu == xu) {
        errno = EOVERFLOW;
        return -1;
    }
    tu = 0;
    hast = keep = exam;
    while (++tu < xu && *exam) {
        if (!isspace (*exam)) {
            keep = exam;
        }
        ++exam;
    }
    if (tu == xu) {
        errno = EOVERFLOW;
        return -1;
    }
    if (*keep) {
        *++keep = '\0';
    }
    tu = keep - hast;
    if (hast != (unsigned char *) ts) {
        (void) memmove (ts, hast, tu + 1);
    }
    if (trimlen) {
        *trimlen = tu;
    }
    return 0;
}


/*
Test code courtesy c.l.c contributors
*/

static char s0[] = " I need trimming on both ends. ";
static char s1[] = "I need trimming on far end. ";
static char s2[] = " I need trimming on near end.";
static char s3[] = " I need more trimming on both ends. ";
static char s4[] = "\n\t\rI need special trimming on both ends.\n\t\r";
static char s5[] = " \n\t\r I need special trimming on near end.";
static char s6[] = "I need special trimming on far end. \n\t\r ";
static char s7[] = "I need no trimming";
static char s8[] = " ";
static char s9[] = "";
static char *strings[12] = {
s0, s1, s2, s3, s4, s5, s6, s7, s8, s9, NULL,
};

# define CHUNK 100000

int
main (void)
{
    int i;
    char *cp;
    size_t length;

    /* The list is NULL-terminated, so stop at the sentinel. */
    for (i = 0; strings[i] != NULL; i++) {
        printf ("Original string:[%s]\n", strings[i]);
        trim (strings[i], &length);
        printf ("Trimmed string:[%s]\n", strings[i]);
        puts ("---------------------------------------");
    }

    /* insanity tests */
    printf ("trim returned: %i\n", trim (NULL, NULL));
    printf ("trim returned: %i\n", trim (NULL, &length));

    cp = malloc (CHUNK);
    if (cp) {
        memset (cp, 'x', CHUNK);
        memset (cp + CHUNK - 3, ' ', 1);
        memset (cp + CHUNK - 2, ' ', 1);
        memset (cp + CHUNK - 1, '\0', 1);
    } else {
        printf ("malloc failed\n");
        return (1);
    }

    /* size_t is cast to unsigned long so that %lu matches the argument. */
    printf ("Original big string length %lu\n", (unsigned long) strlen (cp));
    printf ("trim returned: %i\n", trim (cp, &length));
    printf ("New string length %lu\n", (unsigned long) strlen (cp));
    printf ("Double check length returned %lu\n", (unsigned long) length);
    free (cp);
    /* Deliberate misuse: reading a freed block is undefined behavior. */
    printf ("trim returned %i on a freed block!\n", trim (cp, &length));
    printf ("The length returned was %lu\n", (unsigned long) length);
    puts ("\nSUCCESS testing trim function");
    return 0;
}
 
Francesco S. Carta

errno = EOVERFLOW;

EOVERFLOW is missing from my implementation, is that possible?

Besides, PTRDIFF_MAX is missing too, but the code obviously falls back to
ptrdiff_max_().

I cannot compile your code with MinGW (neither 4.4.0 32-bit nor 4.5 64-bit).
 
Tom St Denis

    Trim leading and trailing whitespace from any null terminated string
    found in memory, no matter whether data or garbage.  Reports failure
    and quits if null terminator not found within system defined limit.

<snip>

I'm sure this has been asked but I'll ask anyways

1. Why not just do a trim to a new buffer as opposed to inplace using
memmove?

2. If this is meant to hook up to some sort of parser, why not add
white space to the parser so it just skips over it as opposed to
needing it cut out?

3. Why not use pastebin for code snippets and/or setup a CVS/SVN/GIT/
etc if you plan on versioning it?

Tom
 
John Kelly

<snip>

I'm sure this has been asked but I'll ask anyways

1. Why not just do a trim to a new buffer as opposed to inplace using
memmove?

You could but that wasn't my objective.

2. If this is meant to hook up to some sort of parser, why not add
white space to the parser so it just skips over it as opposed to
needing it cut out?

One way I use it is to trim whitespace from command line arguments in
argv. I don't want to change the argv pointer list, so I change the
data in place.

3. Why not use pastebin for code snippets and/or setup a CVS/SVN/GIT/
etc if you plan on versioning it?

I used the feedback here to improve it. It consumed more time than I
expected though, and it's hard to find the time to do what you suggest.

One thing I wonder about is, why don't c.l.c contributors organize
themselves, develop a code snippets library, and put it up for download.
That could save people a lot of time.


:)
 
John Kelly

EOVERFLOW is missing from my implementation, is that possible?

I don't know much about MinGW.

Besides, also PTRDIFF_MAX is, but the code is obviously falling back to
ptrdiff_max_()

Probably an #include issue.

I cannot compile your code with MinGW (4.4.0 32bit nor 4.5 64bit).

I don't have MinGW to test with, but if you want to pursue this, we can
talk via email. My posted email address is real.
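
On the PTRDIFF_MAX point: the macro comes from <stdint.h>, which the posted code includes only when __STDC_VERSION__ is at least 199901L, so a compiler running in C90 mode may never see it and the ptrdiff_max_() fallback gets used. A minimal sketch of a fallback that avoids floating-point arithmetic follows. The name ptrdiff_limit is hypothetical, and the computation assumes ptrdiff_t is a two's-complement type with no padding bits.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#if defined (__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
# include <stdint.h>
#endif

/* Hypothetical helper: largest ptrdiff_t value. */
ptrdiff_t
ptrdiff_limit (void)
{
#ifdef PTRDIFF_MAX
    return PTRDIFF_MAX;
#else
    /* Assumption: no padding bits in ptrdiff_t.
       half = 2^(bits-2), so half - 1 + half = 2^(bits-1) - 1
       and no intermediate result overflows. */
    ptrdiff_t const half =
        (ptrdiff_t) 1 << (sizeof (ptrdiff_t) * CHAR_BIT - 2);
    return half - 1 + half;
#endif
}
```

Either branch yields a positive limit that can be compared against SIZE_MAX the same way the posted trim() does.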
 
Keith Thompson

Francesco S. Carta said:
EOVERFLOW is missing from my implementation, is that possible?
[...]

Certainly. The only errno macros defined by the C standard are EDOM,
EILSEQ, and ERANGE (though EOVERFLOW is defined by POSIX).
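
A strictly portable program therefore cannot assume EOVERFLOW exists. One possible workaround, sketched below, is to fall back to ERANGE, which the C standard does guarantee. The choice of ERANGE as a substitute and the helper name fail_overflow are assumptions for illustration, not part of the posted code.

```c
#include <assert.h>
#include <errno.h>

/* Assumption: where EOVERFLOW is missing, ERANGE is an acceptable stand-in. */
#ifndef EOVERFLOW
# define EOVERFLOW ERANGE
#endif

/* Hypothetical helper: fail the way the posted trim() does on overflow. */
int
fail_overflow (void)
{
    errno = EOVERFLOW;
    return -1;
}
```

With a guard like this near the top of the file, the rest of the code can keep using EOVERFLOW unchanged on implementations whose errno.h does not define it.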
 
Seebs

1. Why not just do a trim to a new buffer as opposed to inplace using
memmove?

That sounds like it'd be expensive and involve new allocations.

2. If this is meant to hook up to some sort of parser, why not add
white space to the parser so it just skips over it as opposed to
needing it cut out?

My guess is that this is intended to be general-purpose, rather than
for use in only one specific context.

3. Why not use pastebin for code snippets and/or setup a CVS/SVN/GIT/
etc if you plan on versioning it?

Can't answer that one.

-s
 
Francesco S. Carta

I don't know much about MinGW.



Probably an #include issue.



I don't have MinGW to test with, but if you want to pursue this, we can
talk via email. My posted email address is real.

I find it more useful to keep these things public, after all these
groups are meant for the community growth as a whole.

I pointed out the two missing macros above just to let you know about an
implementation which needs further attention.

Just for the record, I'll reproduce the MinGW errno.h here:

/*
* errno.h
* This file has no copyright assigned and is placed in the Public Domain.
* This file is a part of the mingw-runtime package.
* No warranty is given; refer to the file DISCLAIMER within the package.
*
* Error numbers and access to error reporting.
*
*/

#ifndef _ERRNO_H_
#define _ERRNO_H_

/* All the headers include this file. */
#include <_mingw.h>

/*
* Error numbers.
* TODO: Can't be sure of some of these assignments, I guessed from the
* names given by strerror and the defines in the Cygnus errno.h. A lot
* of the names from the Cygnus errno.h are not represented, and a few
* of the descriptions returned by strerror do not obviously match
* their error naming.
*/
#define EPERM 1 /* Operation not permitted */
#define ENOFILE 2 /* No such file or directory */
#define ENOENT 2
#define ESRCH 3 /* No such process */
#define EINTR 4 /* Interrupted function call */
#define EIO 5 /* Input/output error */
#define ENXIO 6 /* No such device or address */
#define E2BIG 7 /* Arg list too long */
#define ENOEXEC 8 /* Exec format error */
#define EBADF 9 /* Bad file descriptor */
#define ECHILD 10 /* No child processes */
#define EAGAIN 11 /* Resource temporarily unavailable */
#define ENOMEM 12 /* Not enough space */
#define EACCES 13 /* Permission denied */
#define EFAULT 14 /* Bad address */
/* 15 - Unknown Error */
#define EBUSY 16 /* strerror reports "Resource device" */
#define EEXIST 17 /* File exists */
#define EXDEV 18 /* Improper link (cross-device link?) */
#define ENODEV 19 /* No such device */
#define ENOTDIR 20 /* Not a directory */
#define EISDIR 21 /* Is a directory */
#define EINVAL 22 /* Invalid argument */
#define ENFILE 23 /* Too many open files in system */
#define EMFILE 24 /* Too many open files */
#define ENOTTY 25 /* Inappropriate I/O control operation */
/* 26 - Unknown Error */
#define EFBIG 27 /* File too large */
#define ENOSPC 28 /* No space left on device */
#define ESPIPE 29 /* Invalid seek (seek on a pipe?) */
#define EROFS 30 /* Read-only file system */
#define EMLINK 31 /* Too many links */
#define EPIPE 32 /* Broken pipe */
#define EDOM 33 /* Domain error (math functions) */
#define ERANGE 34 /* Result too large (possibly too small) */
/* 35 - Unknown Error */
#define EDEADLOCK 36 /* Resource deadlock avoided (non-Cyg) */
#define EDEADLK 36
/* 37 - Unknown Error */
#define ENAMETOOLONG 38 /* Filename too long (91 in Cyg?) */
#define ENOLCK 39 /* No locks available (46 in Cyg?) */
#define ENOSYS 40 /* Function not implemented (88 in Cyg?) */
#define ENOTEMPTY 41 /* Directory not empty (90 in Cyg?) */
#define EILSEQ 42 /* Illegal byte sequence */

/*
* NOTE: ENAMETOOLONG and ENOTEMPTY conflict with definitions in the
* sockets.h header provided with windows32api-0.1.2.
* You should go and put an #if 0 ... #endif around the whole block
* of errors (look at the comment above them).
*/

#ifndef RC_INVOKED

#ifdef __cplusplus
extern "C" {
#endif

/*
* Definitions of errno. For _doserrno, sys_nerr and * sys_errlist, see
* stdlib.h.
*/
#ifdef _UWIN
#undef errno
extern int errno;
#else
_CRTIMP int* __cdecl __MINGW_NOTHROW _errno(void);
#define errno (*_errno())
#endif

#ifdef __cplusplus
}
#endif

#endif /* Not RC_INVOKED */

#endif /* Not _ERRNO_H_ */
 
Francesco S. Carta

Francesco S. Carta said:
EOVERFLOW is missing from my implementation, is that possible?
[...]

Certainly. The only errno macros defined by the C standard are EDOM,
EILSEQ, and ERANGE (though EOVERFLOW is defined by POSIX).

Thank you for the clarification Keith.
 
Tom St Denis

You could but that wasn't my objective.

Might want to rethink that. Sometimes the simpler solution is better
even if it's marginally less efficient.

One way I use it is to trim whitespace from command line arguments in
argv.  I don't want to change the argv pointer list, so I change the
data in place.

Most C runtime startup libs will glob the commandline which includes
trimming white space from arguments. E.g.

$ hello.out param1 param2 param3

will have

argv[0] = "hello.out"
argv[1] = "param1"
argv[2] = "param2"
argv[3] = "param3"

Regardless of the fact there are spaces on the command line.

So unless your user calls it like

$ hello.out " param1" " param2 " ...

There won't be spaces.

More so, I still contend you should write a parser that handles spaces.
To me, though, passing " --config" is not the same as "--config", because
if the user is stupid enough to go out of their way to escape spaces
into the parameter they deserve to get an error message.

I used the feedback here to improve it.  It consumed more time than I
expected though, and it's hard to find the time to do what you suggest.

Thing is, we end up with a lot of lines posted over and over as you
post revisions. Posting a link is easy [pastebin is free!] and gets
you basically the same effect.

One thing I wonder about is, why don't c.l.c contributors organize
themselves, develop a code snippets library, and put it up for download.
That could save people a lot of time.

It's called glibc.

Tom
 
Tom St Denis

That sounds like it'd be expensive and involve new allocations.

allocation. Singular.

Trim removes bytes and never adds them, so you need to allocate one
buffer the size of your input... let's see...

char *trim(const char *src)
{
    char *tmpdst, *dst = calloc(1, strlen(src) + 1);
    if (!dst) return NULL;
    tmpdst = dst;
    while (*src) {
        switch (*src) {
        case ' ':
        case '\n':
        case '\t':
            break;
        default:
            *tmpdst++ = *src;
        }
        ++src;
    }
    return dst;
}

Done. Wow that's hard. Didn't involve 100s of lines of code, is
portable, simple to extend/change, etc.

Inside your argv[] reader you could do this

for (i = 1; i < argc; i++) {
    char *tmp = trim(argv[i]);
    assert(tmp != NULL);
    parse_opt(tmp);
    free(tmp);
}

Incredible!

My guess is that this is intended to be general-purpose, rather than
for use in only one specific context.

I'd still write it that way then do this

int trim_inplace(char *str)
{
    char *tmp = trim(str);
    if (!tmp) return -1;
    assert(strlen(tmp) <= strlen(str)); // sanity check: trimming never grows the string
    strcpy(str, tmp);
    free(tmp);
    return 0;
}

OMG!!! Hierarchical programming!

That's what separates a developer from a coder. In 27 lines of code
[both functions] I replaced what took him 100s of lines. And my
version has error checking, uses standard C functions, is portable,
and is plenty easy to read.

Tom
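
Note that the trim() above drops spaces, newlines, and tabs everywhere in the string, including between words, which is more than the original post asks for. A copy-based variant that strips only the leading and trailing whitespace might look like the sketch below. The name trim_copy is hypothetical, and the caller is assumed to free() the result.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd copy of src without leading and
   trailing whitespace, or NULL if src is NULL or allocation fails. */
char *
trim_copy (const char *src)
{
    const char *end;
    size_t len;
    char *dst;

    if (!src)
        return NULL;
    /* Skip leading whitespace; isspace needs an unsigned char value. */
    while (isspace ((unsigned char) *src))
        ++src;
    /* Back up over trailing whitespace. */
    end = src + strlen (src);
    while (end > src && isspace ((unsigned char) end[-1]))
        --end;
    len = (size_t) (end - src);
    dst = malloc (len + 1);
    if (!dst)
        return NULL;
    memcpy (dst, src, len);
    dst[len] = '\0';
    return dst;
}
```

Interior whitespace is preserved, so an argument such as " a b " comes back as "a b" rather than "ab".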
 
Tom St Denis

That's what separates a developer from a coder.  In 27 lines of code
[both functions] I replaced what took him 100s of lines.  And my
version has error checking, uses standard C functions, is portable,
and is plenty easy to read.

Sorry, his version [minus test cases and preamble comments] is 74
lines long. It's still 3 times longer than mine though.

Tom
 
John Kelly

One way I use it is to trim whitespace from command line arguments in
argv.  I don't want to change the argv pointer list, so I change the
data in place.

Most C runtime startup libs will glob the commandline which includes
trimming white space from arguments. E.g.

$ hello.out param1 param2 param3

will have

argv[0] = "hello.out"
argv[1] = "param1"
argv[2] = "param2"
argv[3] = "param3"

Regardless of the fact there are spaces on the command line.

I know.

So unless your user calls its like

$ hello.out " param1" " param2 " ...

There won't be spaces.

That's right. But however unlikely, I guard against that too, for the
use case I have in mind.

More so, I still contend you should write a parser that handles spaces.
To me, though, passing " --config" is not the same as "--config", because
if the user is stupid enough to go out of their way to escape spaces
into the parameter they deserve to get an error message.

I used the feedback here to improve it.  It consumed more time than I
expected though, and it's hard to find the time to do what you suggest.

Thing is, we end up with a lot of lines posted over and over as you
post revisions. Posting a link is easy [pastebin is free!] and gets
you basically the same effect.

One thing I wonder about is, why don't c.l.c contributors organize
themselves, develop a code snippets library, and put it up for download.
That could save people a lot of time.

It's called glibc.

Microsoft uses glibc? I never knew!
 
Tom St Denis

That's right.  But however unlikely, I guard against that too, for the
use case I have in mind.

You need smarter users.

Microsoft uses glibc?  I never knew!

Why not before you hand over your credit card to buy a copy of MSVC
you ask them to make their C library actually C99 compatible [or heck
at least C90 compatible] and get what you actually pay for.

Tom
 
John Kelly

That's what separates a developer from a coder. In 27 lines of code
[both functions] I replaced what took him 100s of lines. And my
version has error checking, uses standard C functions, is portable,
and is plenty easy to read.

Uh-oh, Rambo is here. I better run and hide.

:-D
 
Tom St Denis

That's what separates a developer from a coder.  In 27 lines of code
[both functions] I replaced what took him 100s of lines.  And my
version has error checking, uses standard C functions, is portable,
and is plenty easy to read.

Uh-oh, Rambo is here.  I better run and hide.

:-D

Just saying your method is flawed and nobody seems to be pointing you
on track. And you really ought to stop posting it over and over again
until you think the problem through. I mean I wrote my version while
on a 5 min break at work [I'm working on a user document for a product
we're writing]. With the exception of missing '\r' in the trim
function it does [from what I can tell] what you want.

You need to think about what the problem is first, then try to come up
with a solution that is optimal not only in execution speed but also in
simplicity and maintainability. My version, based on first making a
formatted copy and then overwriting, allows both use cases: where you
can't overwrite the string [or don't want to] and where you want it in place.

In my example, I put a bit of thought [5 mins worth] into how I'd
solve that problem if I were going to use the solution. That's what a
developer does. In your case you're so focused on making such an
exemplary solution that you're overlooking how it's going to be used.
In the case of parsing command line args it's hardly a performance
hazard to double buffer, and in fact your solution still does a double
buffer anyways [in effect].

Now I get that doing it in place might have been some kind of mental
exercise for your development [of your development skills]. And if so
I apologize, but just the same, why learn to do things the
inappropriate way?

Tom
 
Kenny McCormack

(Channelling the CLC goon squad)

Only if "C runtime startup" is how you pronounce "shell".

--
One of the best lines I've heard lately:

Obama could cure cancer tomorrow, and the Republicans would be
complaining that he had ruined the pharmaceutical business.

(Heard on Stephanie Miller = but the sad thing is that there is an awful lot
of direct truth in it. We've constructed an economy in which eliminating
cancer would be a horrible disaster. There are many other such examples.)
 
Tom St Denis

(Channelling the CLC goon squad)

Only if "C runtime startup" is how you pronounce "shell".

command line globbing is not always a function of the shell. The
DJGPP CRT would do it for you for instance.

CRT startup != defined by C specification.

It's true that in *NIX like environments the shell would do that for
you (e.g. if you exec() something with say *.c it wouldn't expand
that) but that's not always the case.

Irrespective, I'd consider it a user error if they went out of their
way to make sure leading spaces were present in the command line
option. I would handle it gracefully by saying something like
"unknown command: --config".

Tom
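
The strict approach described above could be sketched as follows, assuming a small fixed option table. The names check_opt and known_opts are hypothetical and do not come from any code posted in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical option table, for illustration only. */
static const char *const known_opts[] = { "--config", "--help" };

/* Returns 0 if arg matches a known option exactly; otherwise prints a
   diagnostic to stderr and returns -1.  No trimming is attempted, so
   " --config" is rejected. */
int
check_opt (const char *arg)
{
    size_t i;

    for (i = 0; i < sizeof known_opts / sizeof known_opts[0]; i++) {
        if (strcmp (arg, known_opts[i]) == 0)
            return 0;
    }
    /* Brackets make stray leading or trailing spaces visible. */
    fprintf (stderr, "unknown command: [%s]\n", arg);
    return -1;
}
```

Printing the rejected argument inside brackets lets the user see the stray whitespace that caused the mismatch.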
 
Kenny McCormack

Must you shatter all illusions of those faithful to the creed?

Call me iconoclastic.

--
"The anti-regulation business ethos is based on the charmingly naive notion
that people will not do unspeakable things for money." - Dana Carpender

Quoted by Paul Ciszek (pciszek at panix dot com). But what I want to know
is why is this diet/low-carb food author doing making pithy political/economic
statements?

Nevertheless, the above quote is dead-on, because, the thing is - business
in one breath tells us they don't need to be regulated (which is to say:
that they can morally self-regulate), then in the next breath tells us that
corporations are amoral entities which have no obligations to anyone except
their officers and shareholders, then in the next breath they tell us they
don't need to be regulated (that they can morally self-regulate) ...
 
