String parsing question

Christopher Benson-Manica · Oct 14, 2003

I'm wondering about the best way to do the following:

I have a string delimited by semicolons. The items delimited may be in any of
the following formats:
1) 14 alphanum characters
2) 5 alphanums space 8 alphanums
3) 6 alphanums colon 8 alphanums
4) 5 alphanums colon 8 alphanums

My task is to convert items in the third format to the first format, and items
in the fourth format to the second. Also, I need to count the number of items
in the string, which may or may not have a trailing semicolon.

My plan (which I feel is sub-optimal - hence this post), is to step through
the initial string one character at a time to accomplish these things in one
pass. While I could count semicolons easily with strchr(), deleting the
colons properly means stepping through the whole string anyway (right?) and so
I may as well count semicolons simultaneously. I'd also like to validate the
data format (i.e., 15-character items are not allowed).

int myfunc( const char *list )
{
    int items=0;
    char *cp=strdup( idlist ); /* nonstandard */
    char *newstr=cp;
    int shifts=0;
    int chars=0;

    for( ; *cp ; *cp++ ) {
        if( *cp == ':' ) {
            if( chars == 6 ) {
                shifts++;
                continue;
            }
            if( chars == 5 ) {
                *(cp-shifts)=' ';
                chars++;
                continue;
            }
            return( -1 ); /* error */
        }
        if( *cp == ';' ) {
            items++;
            if( chars != 14 ) {
                return( -1 ); /* error */
            }
            chars=0;
        }
        else if( ++chars > 14 ) {
            return( -1 ); /* error */
        }
        *(cp-shifts)=*cp;
    }
    *(cp-shifts)='\0';
    if( chars == 14 ) {
        items++;
    }
    if( !items || (chars && chars != 14) ) {
        return( -1 ); /* error */
    }
    printf( "The string '%s' has %d items.", newstr, items );
    free( newstr );
    return( 0 ); /* success */
}

Is there a better way?

Dan Pop · Oct 14, 2003

Christopher Benson-Manica said:
Is there a better way?

1. Such a code is a maintenance nightmare (imagine that you'll have to
make some changes, 5 years from now).

2. I may be missing something, but I can't find any attempt to test that
your characters really are alphanums, you're merely looking for your
separators.

I would implement this function using sscanf calls. The result would be
slower, but a lot more readable. The conversion specifier for
alphanumerics can use the following macro:

#define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"
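
As a rough illustration of how that macro plugs into a conversion specifier (a minimal sketch, not Dan's code; the helper name is_format1 is invented for this example), the following checks that a string is exactly one format-1 item. It relies on %n to report how many characters were consumed.

```c
#include <stdio.h>
#include <string.h>

#define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"

/* Hypothetical helper: returns 1 if s is exactly 14 alphanumerics,
   0 otherwise. */
int is_format1(const char *s)
{
    char field[15];
    int n = 0;

    /* "%14[...]" stores at most 14 characters from the set; "%n" records
       how many input characters have been consumed so far. */
    if (sscanf(s, "%14" ALNUM "%n", field, &n) != 1)
        return 0;

    /* Exactly 14 characters matched and nothing follows them. */
    return n == 14 && s[n] == '\0';
}
```

Adjacent string literals are concatenated, so the format string above expands to "%14[abc...789]%n".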

Dan

Thomas Matthews · Oct 14, 2003

Christopher said:
Is there a better way?

Another method would be to parse the string like a language: analyze the
data to find its current format, then apply the conversion.

Let's look closer at the formats. Let A represent any character
from the set of alphanumerics.
[1] AAAAAAAAAAAAAA
[2] AAAAA AAAAAAAA
[3] AAAAAA:AAAAAAAA
[4] AAAAA:AAAAAAAA
Looking at the above lines, the formats differ at the 6th
column (starting with column 1 as the first column).
The variations are:
    6th char    Format Number
    --------    -------------
    ':'         4
    ' '         2
    A           1 or 3
This last value requires looking at column 7:
    7th char    Format Number
    --------    -------------
    ':'         3
    A           1

Based on this analysis, format selection looks easy.
Format conversion is left for the reader & OP.
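
The column analysis above could be sketched roughly as follows. This is only an illustration: the function name item_format is made up for the example, it assumes one item per string with no trailing semicolon, and it selects a format from columns 6 and 7 without validating the rest of the item.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns the format number (1-4) of one item,
   judged only by columns 6 and 7, or 0 if those columns fit no format. */
int item_format(const char *item)
{
    size_t len = strlen(item);

    if (len < 6)
        return 0;                   /* too short to have a column 6 */
    if (item[5] == ':')
        return 4;
    if (item[5] == ' ')
        return 2;
    if (!isalnum((unsigned char)item[5]) || len < 7)
        return 0;                   /* column 6 must be alphanumeric here */
    if (item[6] == ':')
        return 3;
    return isalnum((unsigned char)item[6]) ? 1 : 0;
}
```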

Format1 ::= AlphaNum AlphaNum {...} AlphaNum

Format2 ::= AlphaNum AlphaNum AlphaNum AlphaNum
AlphaNum ' '

Etc. You could try using a lexer tool, such as
Yacc and Lex (Bison and Flex).

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book

Kevin D. Quitt · Oct 14, 2003

Have you looked at strspn and strcspn? The latter will locate the (next)
semi-colon, and the former can verify that the characters from the current
to the semi-colon are all alphanumerics.

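#include <string.h>

const char *alnum = "abcdefghijklmnopqrstuvwxyz"
                    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    "0123456789";

/* Returns the length of the all-alphanumeric token that ends at the
   next semi-colon, or 0 if there is no semi-colon or the token
   contains any other character. */
size_t tokenLength( const char *tkn )
{
    size_t len, semi;

    if ( !tkn )
        return (size_t)0;

    len = strlen( tkn );
    semi = strcspn( tkn, ";" );
    if ( semi == len ) // There's no semi-colon
        return (size_t)0;

    if ( strspn( tkn, alnum ) != semi )
        return (size_t)0; // Not all alpha-num

    return semi;
}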

Christopher Benson-Manica · Oct 14, 2003

Dan Pop said:
1. Such a code is a maintenance nightmare (imagine that you'll have to
make some changes, 5 years from now).

Probably. However, I'd rather not use sscanf, for two reasons: This code is
for a somewhat performance-sensitive application, and (also) the existing code
I'm working with generally uses similarly obtuse but efficient code. I've
added some comments to the source to guide the programmer (presumably
not me) who gets to revisit it 10 years from now.

2. I may be missing something, but I can't find any attempt to test that
your characters really are alphanums, you're merely looking for your
separators.

The functions that call this one are assumed to be well-behaved - I used the
term alphanumeric to distinguish the "other" characters from the delimiters.
Sorry to be unclear.

Christopher Benson-Manica · Oct 14, 2003

Kevin D. Quitt said:
Have you looked at strspn and strcspn? The latter will locate the (next)
semi-colon, and the former can verify that the characters from the current
to the semi-colon are all alphanumerics.

If I didn't have to remove the ':' characters, I might do just that.
Unfortunately I don't have that luxury.

Christopher Benson-Manica · Oct 14, 2003

Thomas Matthews said:
Based on this analysis, format selection looks easy.
Format conversion is left for the reader & OP.

It's true that I can easily validate the string without stepping through the
whole thing; however, I can't think of a good way to delete the semicolons
efficiently without stepping through the string. The conversion issue is just
the one I'm trying to improve upon...

Etc. You could try using a lexer tool, such as
Yacc and Lex (Bison and Flex).

Unfortunately, Lex is really out of the question, since it doesn't really fit
the development paradigm I'm working within.

Sheldon Simms · Oct 14, 2003

I think your approach is reasonable, and I don't agree that it's
a maintenance nightmare. It took me less than 5 minutes to understand
what you are trying to do. I do think your code can be improved a
little bit.

My main two changes would be 1) Don't use strdup(), you can build the
new string while scanning the original one. 2) use array notation
rather than pointer arithmetic to access the characters.

First some small critiques, then I'll show my "improved" version of
your code. First critique, this code does not compile, and when the
obvious correction is made, it doesn't work properly. However, what
you're trying to do is clear enough to continue.

    int myfunc( const char *list )

presumably this should be 'idlist'

    {
        int items=0;
        char *cp=strdup( idlist ); /* nonstandard */

You have to check for a NULL pointer result here.

        char *newstr=cp;
        int shifts=0;

the way you are using this variable makes the code a little
bit harder to understand, IMHO. I would prefer to have two
indices: one for the original string, one for the new string.
You can keep track of each index independently instead of
keeping track of the difference between the 'current' location
in each string.

        int chars=0;

        for( ; *cp ; *cp++ ) {
            if( *cp == ':' ) {
                if( chars == 6 ) {
                    shifts++;
                    continue;

I'm not usually one to gripe about using things like 'continue'
or even 'goto', but here you're just using 'continue' instead of
'else'. Don't do that, just use 'else'.

                }
                if( chars == 5 ) {
                    *(cp-shifts)=' ';
                    chars++;
                    continue;
                }
                return( -1 ); /* error */
            }
            if( *cp == ';' ) {
                items++;
                if( chars != 14 ) {
                    return( -1 ); /* error */
                }
                chars=0;
            }
            else if( ++chars > 14 ) {
                return( -1 ); /* error */
            }
            *(cp-shifts)=*cp;
        }
        *(cp-shifts)='\0';
        if( chars == 14 ) {
            items++;
        }
        if( !items || (chars && chars != 14) ) {
            return( -1 ); /* error */
        }
        printf( "The string '%s' has %d items.", newstr, items );
        free( newstr );
        return( 0 ); /* success */
    }

Is there a better way?

Here's my version:

#include <ctype.h>  /* isalnum() */
#include <stdio.h>  /* printf() */
#include <stdlib.h> /* malloc(), free() */
#include <string.h> /* strlen() */

int myfunc( const char *idlist )
{
    int items = 0;
    int chars = 0;
    int srcidx = 0;
    int dstidx = 0;
    char *newstr;

    newstr = malloc(strlen(idlist)+1);
    if (newstr == NULL)
        return -1;

    while (idlist[srcidx])
    {
        /* debugging trace: current character and item length so far */
        printf("%c (%d)\n", idlist[srcidx], chars);
        fflush(stdout);

        /* the cast keeps isalnum() defined for negative char values */
        if (isalnum((unsigned char)idlist[srcidx]) || idlist[srcidx] == ' ')
        {
            if (chars >= 14)
            {
                free(newstr);
                return -4; /* item too long */
            }
            newstr[dstidx++] = idlist[srcidx];
            ++chars;
        }
        else if (idlist[srcidx] == ':')
        {
            if (chars == 5)
            {
                newstr[dstidx++] = ' ';
                ++chars;
            }
            else if (chars != 6)
            {
                free(newstr);
                return -2;
            }

            /* if chars == 6, just act like the ':' didn't exist */
        }
        else if (idlist[srcidx] == ';')
        {
            if (chars != 14)
            {
                free(newstr);
                return -3;
            }

            newstr[dstidx++] = ';';
            chars = 0;
            ++items;
        }
        else
        {
            free(newstr);
            return -4; /* unexpected character */
        }

        ++srcidx;
    }

    newstr[dstidx] = '\0';

    if (chars == 14)
        ++items;
    else if (items == 0 || chars != 0)
    {
        free(newstr);
        return -5;
    }

    printf("\nThe string '%s' has %d items.\n", newstr, items);
    free(newstr);

    return 0; /* success */
}

int main (void)
{
    int val;
    val = myfunc("abcdefghijklmn;abcde 12345678;"
                 "123456:abcdefgh;abcde:12345678;");
    printf("result: %d\n", val);

    return val;
}

Christopher Benson-Manica · Oct 14, 2003

Sheldon Simms said:
My main two changes would be 1) Don't use strdup(), you can build the
new string while scanning the original one. 2) use array notation
rather than pointer arithmetic to access the characters.

Thank you, those both sound like excellent suggestions. The only problem is
that this code compiles in a C++ environment, so I have to invoke malloc thus:

char *newstr=(char *)malloc( strlen(idlist)+1 ); /* forced cast */

Of course, this is both off-topic and not your problem.

presumably this should be 'idlist'

Yes, typo...

You have to check for a NULL pointer result here.

Wish I could claim *this* one was a typo (translation: whoops!)

the way you are using this variable makes the code a little
bit harder to understand, IMHO. I would prefer to have two
indices: one for the original string, one for the new string.
You can keep track of each index independently instead of
keeping track of the difference between the 'current' location
in each string.

Since this neatly eliminates the fact that I was wasting time copying
characters I didn't need to, I've incorporated this idea into my code.
Thanks.

I'm not usually one to gripe about using things like 'continue'
or even 'goto', but here you're just using 'continue' instead of
'else'. Don't do that, just use 'else'

Good call - done.

while (idlist[srcidx])

I've taken the liberty of using for( ; idlist[srcidx] ; srcidx++ )...

Thanks for your suggestions, they were most helpful.

Irrwahn Grausewitz · Oct 14, 2003

Christopher Benson-Manica said:
It's true that I can easily validate the string without stepping through the
whole thing; however, I can't think of a good way to delete the semicolons
efficiently without stepping through the string.

What exactly do you mean by "delete": "overwrite" or "move all following
chars to the left"?

The conversion issue is just
the one I'm trying to improve upon...

As for the conversion:

#include <string.h>

/*
** Convert:
** - format 4 to format 2: return 4
** - format 3 to format 1: return 3
** conversion impossible: return 0
** String pointed to by s must be writeable and at least
** 6 characters long.
*/
int f4to2_f3to1( char *s )
{
    int ret = 0;

    if ( s[5] == ':' ) /* f4 -> f2 */
    {
        s[5] = ' ';
        ret = 4;
    }
    else if ( s[6] == ':' ) /* f3 -> f1 */
    {
        /* memmove, not memcpy: source and destination overlap */
        memmove( s+6, s+7, strlen(s+7)+1 );
        ret = 3;
    }
    return ret;
}

Hm, still not very efficient, is it?

Ah, and one additional remark: in your original code you used the
non-standard strdup() function, which in turn will very likely perform
some strlen-like operation[1], thus iterating through the string once
more...

[1] Unless the implementation does some kind of magic.

Regards

Christopher Benson-Manica · Oct 14, 2003

Irrwahn Grausewitz said:
What exactly do you mean by "delete": "overwrite" or "move all following
chars to the left"?

Basically,
"AAAAAA:AAAAAAAA;AAAAA:AAAAAAAA;AAAAAAAAAAAAAA;AAAAAA:AAAAAAAA" ->
"AAAAAAAAAAAAAA;AAAAA AAAAAAAA;AAAAAAAAAAAAAA;AAAAAAAAAAAAAA"

else if ( s[6] == ':' ) /* f3 -> f1 */
{
memcpy( s+6, s+7, strlen(s+7)+1 );
ret = 3;
}
return ret;
}

Hm, still not very efficient, is it?

Depends on how efficient memcpy() is relative to what I wrote...

Ah, and one additional remark: in your original code you used the
non-standard strdup() function, which in turn will very likely perform
some strlen-like operation[1], thus iterating through the string once
more...

Indeed, which is why I gratefully used another poster's suggestion for
eliminating strdup()

Irrwahn Grausewitz · Oct 14, 2003

Christopher Benson-Manica said:
Basically,
"AAAAAA:AAAAAAAA;AAAAA:AAAAAAAA;AAAAAAAAAAAAAA;AAAAAA:AAAAAAAA" ->
"AAAAAAAAAAAAAA;AAAAA AAAAAAAA;AAAAAAAAAAAAAA;AAAAAAAAAAAAAA"

Ah, I see, you didn't mean
"I can't think of a good way to delete the semicolons."
but
"I can't think of a good way to delete the colons." [1]

else if ( s[6] == ':' ) /* f3 -> f1 */
{
memcpy( s+6, s+7, strlen(s+7)+1 );
ret = 3;
}
return ret;
}

Hm, still not very efficient, is it?

Depends on how efficient memcpy() is relative to what I wrote...

Well, I'm more concerned about the efficiency of strlen(), though.

Ah, and one additional remark: in your original code you used the
non-standard strdup() function, which in turn will very likely perform
some strlen-like operation[1], thus iterating through the string once
more...

Indeed, which is why I gratefully used another poster's suggestion for
eliminating strdup()

Wise move.

[1] Actually, I wouldn't feel comfortable with my colon deleted. ;-)

Regards

Christopher Benson-Manica · Oct 14, 2003

Irrwahn Grausewitz said:
Ah, I see, you didn't mean
"I can't think of a good way to delete the semicolons."
but
"I can't think of a good way to delete the colons." [1]

Yes. As you can see, however, I did think of a rather dubious way of doing it.

Well, I'm more concerned about the efficiency of strlen(), though.

Well, calling strlen and memcpy together multiple times seems like it'd be a
little on the slow side, compared to both my original and revised versions.

[1] Actually, I wouldn't feel comfortable with my colon deleted. ;-)

I doubt I would either, although I don't think I'd miss my semicolon (it's a
vestigial organ in humans).

Kevin D. Quitt · Oct 14, 2003

If I didn't have to remove the ':' characters, I might do just that.
Unfortunately I don't have that luxury.

Huh? Each call to that function tells you how many characters to move to
your output, excluding the ':'. So use strncpy to move each section to the
area where it belongs.
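
One way to read that suggestion, as a hedged sketch rather than Kevin's actual code: let strcspn() find the next ':' or ';', copy the run before it in one go, then deal with the delimiter. The function name copy_sections is invented here, memcpy stands in for the strncpy mentioned above because the run length is already known, and neither item lengths nor character classes are validated.

```c
#include <string.h>

/* Hypothetical sketch: copy src to dst section by section. A ':' found
   after 5 characters of an item becomes a space; a ':' found after 6
   characters is dropped. dst must have room for strlen(src) + 1 bytes.
   Returns the number of items, or -1 if a ':' is in the wrong place. */
int copy_sections(char *dst, const char *src)
{
    int items = 0;
    size_t in_item = 0;          /* characters emitted for the current item */

    while (*src != '\0') {
        size_t run = strcspn(src, ":;");   /* length up to next delimiter */

        memcpy(dst, src, run);
        dst += run;
        src += run;
        in_item += run;

        if (*src == ':') {
            if (in_item == 5) {
                *dst++ = ' ';    /* format 4 becomes format 2 */
                in_item++;
            } else if (in_item != 6) {
                *dst = '\0';
                return -1;       /* colon in an unexpected column */
            }
            src++;               /* format 3: the colon is simply skipped */
        } else if (*src == ';') {
            *dst++ = ';';
            src++;
            items++;
            in_item = 0;
        }
    }
    *dst = '\0';
    if (in_item != 0)
        items++;                 /* last item had no trailing semicolon */
    return items;
}
```

The string is still visited once overall; the difference from the character-at-a-time loop is that the library routines handle the runs between delimiters.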

CBFalconer · Oct 15, 2003

Dan said:
I have a string delimited by semicolons. The items delimited
may be in any of the following formats:
1) 14 alphanum characters
2) 5 alphanums space 8 alphanums
3) 6 alphanums colon 8 alphanums
4) 5 alphanums colon 8 alphanums

My task is to convert items in the third format to the first
format, and items in the fourth format to the second. Also,
I need to count the number of items in the string, which may
or may not have a trailing semicolon.

.... snips of code ...

I would implement this function using sscanf calls. The result
would be slower, but a lot more readable. The conversion
specifier for alphanumerics can use the following macro:

#define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"

For some reason this problem intrigued me, so ....

Only Dan would recommend sscanf, others find other methods much
easier. I challenge him to match the performance and clarity of
the following code using any scanf() variety. If he succeeds it
will be quite instructive to me.

This accepts records of types 1 and 2, and converts those of types
3 and 4. Anything else is an error and the next \n or semicolon
resynchronizes. After the code there follows a set of test
inputs. After a run its output passes through unchanged.

Released to Public Domain, by C.B. Falconer (just in case).

---------------------cut convert.c here ---------------
#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define MAXTOKEN 14             /* on output */
#define MAXLINE (72 - MAXTOKEN) /* limit output line lgh */

/* allowable return values from getitem() */
typedef enum results {DONE
                     ,OK
                     ,TOOLONG
                     ,TOOSHORT
                     ,BADCHAR
                     ,BADLOGIC} results;

#define SEMI ';' /* record separator */

/* ---------------- */

static int flushrecord(int ch)
{
    while ((SEMI != ch) && (EOF != ch) && ('\n' != ch)) {
        ch = getchar();
    }
    return ch;
} /* flushrecord */

/* ---------------- */

/* Acquire the next record. At exit either EOF has
   occurred or the next SEMI or '\n' has been read so
   that the input stream is ready to start the next item. */
static results getitem(char * buf)
{
    int ch;
    int ix;
    int kind;
    results err;

    ix = 0; err = OK; kind = 1;
    while (EOF != (ch = getchar())
           && (SEMI != ch)
           && ('\n' != ch)
           && (ix < MAXTOKEN)
           && (OK == err)) {
        buf[ix++] = ch;

        if (!isalnum(ch)) { /* check the transforms */
            if (kind != 1) {
                err = BADCHAR;
                break;
            }
            else if ((' ' == ch) && (6 == ix)) {
                kind = 2;
            }
            else if ((':' == ch) && (7 == ix)) {
                /* type 3 becomes type 1 */
                kind = 3;
                --ix;
            }
            else if ((':' == ch) && (6 == ix)) {
                /* type 4 becomes type 2 */
                kind = 4;
                buf[ix - 1] = ' ';
            }
            else {
                err = BADCHAR;
                break;
            }
        } /* if (!isalnum) */
    } /* while */
    buf[ix] = '\0';
    if ((MAXTOKEN == ix) && (OK == err)
        && (('\n' == ch) || (SEMI == ch))) return OK;
    else if (OK != err) /* propagate the already set err. */
        /* Do nothing */;
    else if (EOF == ch) err = DONE;
    else if (MAXTOKEN == ix) err = TOOLONG;
    else if (MAXTOKEN > ix) err = TOOSHORT;
    else /* can't happen */ err = BADLOGIC;

    if (EOF == flushrecord(ch)) err = DONE;
    return err;
} /* getitem */

/* ---------------- */

int main(void)
{
    int linelgh = 0;
    results result;
    char buffer[MAXTOKEN+1];

    /* First, we handle input and output and termination */
    /* We postpone all the nitty-gritty to getitem() */
    while (DONE != (result = getitem(buffer))) {
        if (OK != result)
            fprintf(stderr, "Error %d: %s\n", result, buffer);
        else {
            if (linelgh != 0) putc(SEMI, stdout);
            fputs(buffer, stdout);
            if (MAXLINE < (linelgh += (1 + strlen(buffer)))) {
                putc('\n', stdout);
                linelgh = 0;
            }
        }
    }
    if (linelgh != 0) putc('\n', stdout);
    if (strlen(buffer)) {
        fprintf(stderr, "Orphan data: %s\n", buffer);
    }
    return 0;
} /* main of convert */

-----------------cut convert.c ends here ---------------
-------------------cut convtest.txt here ---------------
01RecordType01
02Rcd TypeNo02
03This:TypeNo03
04Rcd:TypeNo04
05RecordType01
06Rcd TypeNo02
07This:TypeNo03
08Rcd:TypeNo04
09RecordType01
10Rcd TypeNo02
11This:TypeNo03
12Rcd:TypeNo04
13ShortType01
14Rcd Short02
15Recd:Short03
16Rcd:Short04
17LongRcdType1x
18Rcd LongTyp2x
19This:LongTyp3x
20Rcd:LongTyp4x
2101Bad~Type01
2202~ TypeNo02
2303~s:TypeNo03
2404~:TypeNo04
25r01bad~ype01
26r02 Typ~No02
27r03s:Typ~No03
28R04:Type~o04
29RecordType01
30Rcd TypeNo02
31This:TypeNo03
32Rcd:TypeNo04
----------------cut convtest.txt ends here ---------------

Glynne · Oct 15, 2003

Christopher Benson-Manica said:
Is there a better way?

/* how's this? */
#include <stdio.h>

int myfunc( char *src )
{
    char *dst;
    int n, b, i, j, k;

    /* should check for null or empty src string */

    /* i reads from src, k writes to dst (same buffer), j marks item start */
    for( dst=src, b=n=i=j=k=0 ; (dst[k]=src[i]) ; i++, k++ ) {
        if( dst[k]==':' ) {
            if( k-j==6 ) {
                k--; /* eat it */
            }
            else if( k-j==5 ) {
                dst[k]= ' '; /* blank it */
            }
        }
        else if( dst[k]==';' ) {
            n++; /* count it */
            if( k-j > 14+b ) {
                return -1;
            }
            j= k+1; /* start of next item */
            b= 0;
        }
        if( dst[k]==' ' ) {
            b= 1; /* items with a blank are longer */
        }
    }
    n++;

    if( src[i-1]==';' ) {
        n--; /* trailing semicolon */
    }

    printf( "The string '%s' has %d items.", dst, n );
    return 0;
}

Dan Pop · Oct 15, 2003

CBFalconer said:
For some reason this problem intrigued me, so ....

Only Dan would recommend sscanf, others find other methods much
easier. I challenge him to match the performance and clarity of
the following code using any scanf() variety. If he succeeds it
will be quite instructive to me.

As I've already suggested, if performance is an overriding concern, you
probably don't want to use sscanf (unless it happens to be fast enough
for your needs; rejecting it a priori is downright stupid). Its main
merit is that it simplifies the code structure and makes each individual
test a lot clearer than hand-crafted code. That is, assuming that the
reader knows how scanf works (is there any *valid* reason for failing this
assumption? ;-)

Untested and one of the most boring pieces of code I've ever written,
but clear and easily maintainable if new formats have to be supported.
It returns a structure containing both an item count and a pointer to
the converted string. If the pointer is null, the function failed
to allocate memory for output, otherwise the caller must free it.
If count is negative and the pointer is valid, an input error was
detected, but the output string contains all the valid items already
processed.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"

struct foo { int count; char *output; };

struct foo parser(char *p)
{
    char *q = calloc(strlen(p) + 1, 1);
    char c, field1[15], field2[9];
    struct foo ret = { -1, 0 };
    int n = 0, rc, count = 0;

    if (q != NULL) ret.output = q;
    else return ret;

    while (p += n, *p != 0) {
        count++;

        /* formats 2 and 4 */

        rc = sscanf(p, "%5" ALNUM "%*1[ :]%8" ALNUM "%c%n",
                    field1, field2, &c, &n);
        if (rc >= 2) {
            if (rc > 2 && c != ';') return ret;
            if (strlen(field1) != 5 || strlen(field2) != 8) return ret;
            memcpy(q, field1, 5), memcpy(q + 6, field2, 8);
            q[5] = ' ', q += 14;
            if (rc == 2) break;
            *q++ = ';';
            continue;
        }

        /* format 3 */

        rc = sscanf(p, "%6" ALNUM ":%8" ALNUM "%c%n",
                    field1, field2, &c, &n);
        if (rc >= 2) {
            if (rc > 2 && c != ';') return ret;
            if (strlen(field1) != 6 || strlen(field2) != 8) return ret;
            memcpy(q, field1, 6), memcpy(q + 6, field2, 8), q += 14;
            if (rc == 2) break;
            *q++ = ';';
            continue;
        }

        /* format 1 */

        rc = sscanf(p, "%14" ALNUM "%c%n", field1, &c, &n);
        if (rc >= 1) {
            if (rc > 1 && c != ';') return ret;
            if (strlen(field1) != 14) return ret;
            memcpy(q, field1, 14), q += 14;
            if (rc == 1) break;
            *q++ = ';';
            continue;
        }
        return ret;
    }
    ret.count = count;
    return ret;
}

The code is very repetitive, each successful sscanf call being handled in
the same way:

1. Check that the trailing character, if present, is a semicolon.
2. Check that the fields have the correct sizes.
3. Copy the fields to the output.
4. If there was no trailing character, exit the loop.
5. Add the trailing semicolon to the output string.

But the control structure is straightforward, with a single loop and only
one level of nested if's.

Important note: the order of the three sscanf calls is important, because
each of them will also match the format(s) handled by the previous one(s),
but will reject them at the field length test.
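
To make the ordering note concrete, here is a small sketch (not part of the posted parser; the helper name format3_first_field_len is invented) that applies only the format-3 pattern and reports how long the first field turned out to be. A format-4 item also matches this pattern, but with a 5-character first field, which is what the field length test catches.

```c
#include <stdio.h>
#include <string.h>

#define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"

/* Hypothetical helper: applies the format-3 pattern to one item and
   returns the length of the first field, or 0 if both fields could
   not be converted. */
size_t format3_first_field_len(const char *item)
{
    char field1[7], field2[9];

    /* sscanf returns the number of fields stored; 2 means both matched. */
    if (sscanf(item, "%6" ALNUM ":%8" ALNUM, field1, field2) < 2)
        return 0;
    return strlen(field1);
}
```

With this helper, a format-3 item yields 6, a format-4 item yields 5, and a format-1 item yields 0 because the literal ':' fails to match after the first six alphanumerics.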

Dan
