String parsing question

Discussion in 'C Programming' started by Christopher Benson-Manica, Oct 14, 2003.

  1. I'm wondering about the best way to do the following:

    I have a string delimited by semicolons. The items delimited may be in any of
    the following formats:
    1) 14 alphanum characters
    2) 5 alphanums space 8 alphanums
    3) 6 alphanums colon 8 alphanums
    4) 5 alphanums colon 8 alphanums

    My task is to convert items in the third format to the first format, and items
    in the fourth format to the second. Also, I need to count the number of items
    in the string, which may or may not have a trailing semicolon.

    My plan (which I feel is sub-optimal - hence this post), is to step through
    the initial string one character at a time to accomplish these things in one
    pass. While I could count semicolons easily with strchr(), deleting the
    colons properly means stepping through the whole string anyway (right?) and so
    I may as well count semicolons simultaneously. I'd also like to validate the
    data format (i.e., 15-character items are not allowed).

    int myfunc( const char *list )
    {
        int items=0;
        char *cp=strdup( idlist ); /* nonstandard */
        char *newstr=cp;
        int shifts=0;
        int chars=0;

        for( ; *cp ; *cp++ ) {
            if( *cp == ':' ) {
                if( chars == 6 ) {
                    shifts++;
                    continue;
                }
                if( chars == 5 ) {
                    *(cp-shifts)=' ';
                    chars++;
                    continue;
                }
                return( -1 ); /* error */
            }
            if( *cp == ';' ) {
                items++;
                if( chars != 14 ) {
                    return( -1 ); /* error */
                }
                chars=0;
            }
            else if( ++chars > 14 ) {
                return( -1 ); /* error */
            }
            *(cp-shifts)=*cp;
        }
        *(cp-shifts)='\0';
        if( chars == 14 ) {
            items++;
        }
        if( !items || (chars && chars != 14) ) {
            return( -1 ); /* error */
        }
        printf( "The string '%s' has %d items.", newstr, items );
        free( newstr );
        return( 0 ); /* success */
    }

    Is there a better way?

    --
    Christopher Benson-Manica | Upon the wheel thy fate doth turn,
    ataru(at)cyberspace.org | upon the rack thy lesson learn.
     
    Christopher Benson-Manica, Oct 14, 2003
    #1

  2. Dan Pop Guest

    In <bmh0cj$t31$> Christopher Benson-Manica <> writes:

    >I'm wondering about the best way to do the following:
    >
    >I have a string delimited by semicolons. The items delimited may be in any of
    >the following formats:
    >1) 14 alphanum characters
    >2) 5 alphanums space 8 alphanums
    >3) 6 alphanums colon 8 alphanums
    >4) 5 alphanums colon 8 alphanums
    >
    >My task is to convert items in the third format to the first format, and items
    >in the fourth format to the second. Also, I need to count the number of items
    >in the string, which may or may not have a trailing semicolon.
    >
    >My plan (which I feel is sub-optimal - hence this post), is to step through
    >the initial string one character at a time to accomplish these things in one
    >pass. While I could count semicolons easily with strchr(), deleting the
    >colons properly means stepping through the whole string anyway (right?) and so
    >I may as well count semicolons simultaneously. I'd also like to validate the
    >data format (i.e., 15-character items are not allowed).
    >
    >int myfunc( const char *list )
    >{
    > int items=0;
    > char *cp=strdup( idlist ); /* nonstandard */
    > char *newstr=cp;
    > int shifts=0;
    > int chars=0;
    >
    > for( ; *cp ; *cp++ ) {
    > if( *cp == ':' ) {
    > if( chars == 6 ) {
    > shifts++;
    > continue;
    > }
    > if( chars == 5 ) {
    > *(cp-shifts)=' ';
    > chars++;
    > continue;
    > }
    > return( -1 ); /* error */
    > }
    > if( *cp == ';' ) {
    > items++;
    > if( chars != 14 ) {
    > return( -1 ); /* error */
    > }
    > chars=0;
    > }
    > else if( ++chars > 14 ) {
    > return( -1 ); /* error */
    > }
    > *(cp-shifts)=*cp;
    > }
    > *(cp-shifts)='\0';
    > if( chars == 14 ) {
    > items++;
    > }
    > if( !items || (chars && chars != 14) ) {
    > return( -1 ); /* error */
    > }
    > printf( "The string '%s' has %d items.", newstr, items );
    > free( newstr );
    > return( 0 ); /* success */
    >}
    >
    >Is there a better way?


    1. Such a code is a maintenance nightmare (imagine that you'll have to
    make some changes, 5 years from now).

    2. I may be missing something, but I can't find any attempt to test that
    your characters really are alphanums, you're merely looking for your
    separators.

    I would implement this function using sscanf calls. The result would be
    slower, but a lot more readable. The conversion specifier for
    alphanumerics can use the following macro:

    #define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"
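
    A minimal sketch of how the sscanf approach might look for a single
    item, using the ALNUM macro above. The names parse_item and at_item_end
    are hypothetical, and the sketch assumes that a space is only legal as
    the sixth character. Each format gets one sscanf call; %n records how
    many characters were consumed, so the length can be checked exactly.

```c
#include <stdio.h>
#include <string.h>

#define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"

/* Hypothetical helper: an item must be followed by ';' or the end of the string. */
static int at_item_end(const char *p)
{
    return *p == ';' || *p == '\0';
}

/*
 * Hypothetical helper: parse the item at the start of s and store its
 * normalized 14-character form in out. Returns the number of characters
 * consumed from s (14 or 15), or 0 if the item matches none of the formats.
 */
int parse_item(const char *s, char out[15])
{
    char head[7], sep[2], tail[15];
    int n;

    n = 0;  /* format 1: 14 alphanums */
    if (sscanf(s, "%14" ALNUM "%n", tail, &n) == 1
        && n == 14 && at_item_end(s + n)) {
        strcpy(out, tail);
        return n;
    }

    n = 0;  /* formats 2 and 4: 5 alphanums, one ' ' or ':', 8 alphanums */
    if (sscanf(s, "%5" ALNUM "%1[ :]%8" ALNUM "%n", head, sep, tail, &n) == 3
        && n == 14 && at_item_end(s + n)) {
        sprintf(out, "%s %s", head, tail);
        return n;
    }

    n = 0;  /* format 3: 6 alphanums, ':', 8 alphanums; the colon is dropped */
    if (sscanf(s, "%6" ALNUM ":%8" ALNUM "%n", head, tail, &n) == 2
        && n == 15 && at_item_end(s + n)) {
        sprintf(out, "%s%s", head, tail);
        return n;
    }

    return 0;
}
```

    A caller would invoke parse_item repeatedly, advancing by the returned
    count plus one for each ';' it skips. The separator is matched with
    the scanset %1[ :] rather than a literal space in the format string,
    because a space in a scanf format matches any amount of white space,
    including none.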

    Dan
    --
    Dan Pop
    DESY Zeuthen, RZ group
    Email:
     
    Dan Pop, Oct 14, 2003
    #2

  3. Christopher Benson-Manica wrote:

    > I'm wondering about the best way to do the following:
    >
    > I have a string delimited by semicolons. The items delimited may be in any of
    > the following formats:
    > 1) 14 alphanum characters
    > 2) 5 alphanums space 8 alphanums
    > 3) 6 alphanums colon 8 alphanums
    > 4) 5 alphanums colon 8 alphanums
    >
    > My task is to convert items in the third format to the first format, and items
    > in the fourth format to the second. Also, I need to count the number of items
    > in the string, which may or may not have a trailing semicolon.
    >
    > My plan (which I feel is sub-optimal - hence this post), is to step through
    > the initial string one character at a time to accomplish these things in one
    > pass. While I could count semicolons easily with strchr(), deleting the
    > colons properly means stepping through the whole string anyway (right?) and so
    > I may as well count semicolons simultaneously. I'd also like to validate the
    > data format (i.e., 15-character items are not allowed).

    [code snipped]

    >
    > Is there a better way?
    >


    Another method would be to parse the string like a language. Analyze the
    data to find its current format, then apply the conversion.

    Let's look closer at the formats. Let A represent any character
    from the set of alphanumerics.
    [1] AAAAAAAAAAAAAA
    [2] AAAAA AAAAAAAA
    [3] AAAAAA:AAAAAAAA
    [4] AAAAA:AAAAAAAA
    Looking at the above lines, the formats differ at the 6th
    column (starting with column 1 as the first column).
    The variations are:
    6th char    Format Number
    --------    -------------
    ':'         4
    ' '         2
    A           1 or 3
    This last value requires looking at column 7:
    7th char    Format Number
    --------    -------------
    ':'         3
    A           1

    Based on this analysis, format selection looks easy.
    Format conversion is left for the reader & OP.

    Format1 ::= AlphaNum AlphaNum {...} AlphaNum

    Format2 ::= AlphaNum AlphaNum AlphaNum AlphaNum
    AlphaNum ' '

    Etc. You could try using lexer and parser generator tools, such as
    Lex and Yacc (Flex and Bison).
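
    The column analysis above can be sketched in C roughly as follows.
    The names classify_item and all_alnum are hypothetical, not from the
    post. The sketch only identifies the format of the item at the start
    of the string; it does not check what follows the item and performs
    no conversion.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if the first n characters of s are all alphanumeric.
   A '\0' is not alphanumeric, so the scan never runs past the end. */
static int all_alnum(const char *s, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (!isalnum((unsigned char)s[i]))
            return 0;
    return 1;
}

/* Returns the format number (1 to 4) of the item at s, or 0 if it
   matches none of the four formats. */
int classify_item(const char *s)
{
    if (!all_alnum(s, 5))
        return 0;
    if (s[5] == ':')                        /* 6th char decides 4 ... */
        return all_alnum(s + 6, 8) ? 4 : 0;
    if (s[5] == ' ')                        /* ... or 2 */
        return all_alnum(s + 6, 8) ? 2 : 0;
    if (!isalnum((unsigned char)s[5]))
        return 0;
    if (s[6] == ':')                        /* 7th char separates 3 from 1 */
        return all_alnum(s + 7, 8) ? 3 : 0;
    return all_alnum(s + 6, 8) ? 1 : 0;
}
```

    Because each test returns as soon as it sees a terminator or an
    unexpected character, s[5] and s[6] are only read when the preceding
    characters are known to exist.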

    --
    Thomas Matthews

    C++ newsgroup welcome message:
    http://www.slack.net/~shiva/welcome.txt
    C++ Faq: http://www.parashift.com/c++-faq-lite
    C Faq: http://www.eskimo.com/~scs/c-faq/top.html
    alt.comp.lang.learn.c-c++ faq:
    http://www.raos.demon.uk/acllc-c++/faq.html
    Other sites:
    http://www.josuttis.com -- C++ STL Library book
     
    Thomas Matthews, Oct 14, 2003
    #3
  4. Have you looked at strspn and strcspn? The latter will locate the (next)
    semi-colon, and the former can verify that the characters from the current
    to the semi-colon are all alphanumerics.

    char *alnum = "abcdefghijklmnopqrstuvwxyz"
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "0123456789";

    size_t tokenLength( char *tkn )
    {
        size_t len, semi;

        if ( !tkn )
            return (size_t)0;

        len = strlen( tkn );
        semi = strcspn( tkn, ";" );
        if ( semi == len ) // There's no semi-colon
            return (size_t)0;

        if ( strspn( tkn, alnum ) != semi )
            return (size_t)0; // Not all alpha-num

        return semi;
    }

    --
    #include <standard.disclaimer>
    _
    Kevin D Quitt USA 91387-4454 96.37% of all statistics are made up
    Per the FCA, this address may not be added to any commercial mail list
     
    Kevin D. Quitt, Oct 14, 2003
    #4
  5. Dan Pop <> spoke thus:

    > 1. Such a code is a maintenance nightmare (imagine that you'll have to
    > make some changes, 5 years from now).


    Probably. However, I'd rather not use sscanf, for two reasons: This code is
    for a somewhat performance-sensitive application, and (also) the existing code
    I'm working with generally uses similarly obtuse but efficient code. I've
    added some comments to the source to indicate to the programmer (presumably
    not me) who gets to revisit it 10 years from now.

    > 2. I may be missing something, but I can't find any attempt to test that
    > your characters really are alphanums, you're merely looking for your
    > separators.


    The functions that call this one are assumed to be well-behaved - I used the
    term alphanumeric to distinguish the "other" characters from the delimiters.
    Sorry to be unclear.

    --
    Christopher Benson-Manica | Upon the wheel thy fate doth turn,
    ataru(at)cyberspace.org | upon the rack thy lesson learn.
     
    Christopher Benson-Manica, Oct 14, 2003
    #5
  6. Kevin D. Quitt <> spoke thus:

    > Have you looked at strspn and strcspn? The latter will locate the (next)
    > semi-colon, and the former can verify that the characters from the current
    > to the semi-colon are all alphanumerics.


    If I didn't have to remove the ':' characters, I might do just that.
    Unfortunately I don't have that luxury.

    --
    Christopher Benson-Manica | Upon the wheel thy fate doth turn,
    ataru(at)cyberspace.org | upon the rack thy lesson learn.
     
    Christopher Benson-Manica, Oct 14, 2003
    #6
  7. Thomas Matthews <> spoke thus:

    > Based on this analysis, format selection looks easy.
    > Format conversion is left for the reader & OP.


    It's true that I can easily validate the string without stepping through the
    whole thing; however, I can't think of a good way to delete the semicolons
    efficiently without stepping through the string. The conversion issue is just
    the one I'm trying to improve upon...

    > Etc. You could try using a Lexer tool, such as
    > Yacc and Lexx (Bison and Flex).


    Unfortunately, Lexx is really out of the question, since it doesn't really fit
    the development paradigm I'm working within.

    --
    Christopher Benson-Manica | Upon the wheel thy fate doth turn,
    ataru(at)cyberspace.org | upon the rack thy lesson learn.
     
    Christopher Benson-Manica, Oct 14, 2003
    #7
  8. On Tue, 14 Oct 2003 14:14:43 +0000, Christopher Benson-Manica wrote:

    > I'm wondering about the best way to do the following:
    >
    > I have a string delimited by semicolons. The items delimited may be in any of
    > the following formats:
    > 1) 14 alphanum characters
    > 2) 5 alphanums space 8 alphanums
    > 3) 6 alphanums colon 8 alphanums
    > 4) 5 alphanums colon 8 alphanums
    >
    > My task is to convert items in the third format to the first format, and items
    > in the fourth format to the second. Also, I need to count the number of items
    > in the string, which may or may not have a trailing semicolon.
    >
    > My plan (which I feel is sub-optimal - hence this post), is to step through
    > the initial string one character at a time to accomplish these things in one
    > pass. While I could count semicolons easily with strchr(), deleting the
    > colons properly means stepping through the whole string anyway (right?) and so
    > I may as well count semicolons simultaneously. I'd also like to validate the
    > data format (i.e., 15-character items are not allowed).


    I think your approach is reasonable, and I don't agree that it's
    a maintenance nightmare. It took me less than 5 minutes to understand
    what you are trying to do. I do think your code can be improved a
    little bit.

    My main two changes would be 1) Don't use strdup(), you can build the
    new string while scanning the original one. 2) use array notation
    rather than pointer arithmetic to access the characters.

    First some small critiques, then I'll show my "improved" version of
    your code. First critique, this code does not compile, and when the
    obvious correction is made, it doesn't work properly. However, what
    you're trying to do is clear enough to continue.

    > int myfunc( const char *list )


    presumably this should be 'idlist'

    > {
    > int items=0;
    > char *cp=strdup( idlist ); /* nonstandard */


    You have to check for a NULL pointer result here.

    > char *newstr=cp;
    > int shifts=0;


    the way you are using this variable makes the code a little
    bit harder to understand, IMHO. I would prefer to have two
    indices: one for the original string, one for the new string.
    You can keep track of each index independently instead of
    keeping track of the difference between the 'current' location
    in each string.

    > int chars=0;
    >
    > for( ; *cp ; *cp++ ) {
    > if( *cp == ':' ) {
    > if( chars == 6 ) {
    > shifts++;
    > continue;


    I'm not usually one to gripe about using things like 'continue'
    or even 'goto', but here you're just using 'continue' instead of
    'else'. Don't do that, just use 'else'

    > }
    > if( chars == 5 ) {
    > *(cp-shifts)=' ';
    > chars++;
    > continue;
    > }
    > return( -1 ); /* error */
    > }
    > if( *cp == ';' ) {
    > items++;
    > if( chars != 14 ) {
    > return( -1 ); /* error */
    > }
    > chars=0;
    > }
    > else if( ++chars > 14 ) {
    > return( -1 ); /* error */
    > }
    > *(cp-shifts)=*cp;
    > }
    > *(cp-shifts)='\0';
    > if( chars == 14 ) {
    > items++;
    > }
    > if( !items || (chars && chars != 14) ) {
    > return( -1 ); /* error */
    > }
    > printf( "The string '%s' has %d items.", newstr, items );
    > free( newstr );
    > return( 0 ); /* success */
    > }
    >
    > Is there a better way?


    Here's my version:

    #include <ctype.h> /* isalnum() */
    #include <stdio.h> /* printf() */
    #include <stdlib.h> /* malloc(), free() */
    #include <string.h> /* strlen() */

    int myfunc( const char *idlist )
    {
        int items = 0;
        int chars = 0;
        int srcidx = 0;
        int dstidx = 0;
        char *newstr;

        newstr = malloc(strlen(idlist)+1);
        if (newstr == NULL)
            return -1;

        while (idlist[srcidx])
        {
            printf("%c (%d)\n", idlist[srcidx], chars);
            fflush(stdout);

            /* cast: isalnum() needs a value representable as unsigned char */
            if (isalnum((unsigned char)idlist[srcidx]) || idlist[srcidx] == ' ')
            {
                newstr[dstidx++] = idlist[srcidx];
                ++chars;
            }
            else if (idlist[srcidx] == ':')
            {
                if (chars == 5)
                {
                    newstr[dstidx++] = ' ';
                    ++chars;
                }
                else if (chars != 6)
                {
                    free(newstr); /* release the buffer on every error path */
                    return -2;
                }

                /* if chars == 6, just act like the ':' didn't exist */
            }
            else if (idlist[srcidx] == ';')
            {
                if (chars != 14)
                {
                    free(newstr);
                    return -3;
                }

                newstr[dstidx++] = ';';
                chars = 0;
                ++items;
            }
            else if (chars > 14)
            {
                free(newstr);
                return -4;
            }

            ++srcidx;
        }

        newstr[dstidx] = '\0';

        if (chars == 14)
            ++items;
        else if (items == 0 || chars != 0)
        {
            free(newstr);
            return -5;
        }

        printf("\nThe string '%s' has %d items.", newstr, items);
        free(newstr);

        return 0; /* success */
    }

    int main (void)
    {
        int val;
        val = myfunc("abcdefghijklmn;abcde 12345678;"
                     "123456:abcdefgh;abcde:12345678;");
        printf("result: %d\n", val);

        return val;
    }
     
    Sheldon Simms, Oct 14, 2003
    #8
  9. Sheldon Simms <> spoke thus:

    > My main two changes would be 1) Don't use strdup(), you can build the
    > new string while scanning the original one. 2) use array notation
    > rather than pointer arithmetic to access the characters.


    Thank you, those both sound like excellent suggestions :) The only problem is
    that this code compiles in a C++ environment, so I have to invoke malloc thus:

    char *newstr=(char *)malloc( strlen(idlist)+1 ); /* forced cast */

    Of course, this is both off-topic and not your problem ;)

    >> int myfunc( const char *list )


    > presumably this should be 'idlist'


    Yes, typo...

    >> {
    >> int items=0;
    >> char *cp=strdup( idlist ); /* nonstandard */


    > You have to check for a NULL pointer result here.


    Wish I could claim *this* one was a typo ;) (translation: whoops!)

    > the way you are using this variable makes the code a little
    > bit harder to understand, IMHO. I would prefer to have two
    > indices: one for the original string, one for the new string.
    > You can keep track of each index independently instead of
    > keeping track of the difference between the 'current' location
    > in each string.


    Since this neatly eliminates the fact that I was wasting time copying
    characters I didn't need to, I've incorporated this idea into my code.
    Thanks.

    > I'm not usually one to gripe about using things like 'continue'
    > or even 'goto', but here you're just using 'continue' instead of
    > 'else'. Don't do that, just use 'else'


    Good call - done.

    > while (idlist[srcidx])


    I've taken the liberty of using for( ; idlist[srcidx] ; srcidx++ )...

    Thanks for your suggestions, they were most helpful.

    --
    Christopher Benson-Manica | Upon the wheel thy fate doth turn,
    ataru(at)cyberspace.org | upon the rack thy lesson learn.
     
    Christopher Benson-Manica, Oct 14, 2003
    #9
  10. Christopher Benson-Manica <> wrote:

    >It's true that I can easily validate the string without stepping through the
    >whole thing; however, I can't think of a good way to delete the semicolons
    >efficiently without stepping through the string.


    What exactly do you mean by "delete": "overwrite" or "move all following
    chars to the left"?

    >The conversion issue is just
    >the one I'm trying to improve upon...


    As for the conversion:

    #include <string.h>

    /*
    ** Convert:
    ** - format 4 to format 2: return 4
    ** - format 3 to format 1: return 3
    ** conversion impossible: return 0
    ** String pointed to by s must be writeable.
    */
    int f4to2_f3to1( char *s )
    {
        int ret = 0;

        if ( s[5] == ':' ) /* f4 -> f2 */
        {
            s[5] = ' ';
            ret = 4;
        }
        else if ( s[6] == ':' ) /* f3 -> f1 */
        {
            memcpy( s+6, s+7, strlen(s+7)+1 );
            ret = 3;
        }
        return ret;
    }

    Hm, still not very efficient, is it?

    Ah, and one additional remark: in your original code you used the
    non-standard strdup() function, which in turn will very likely perform
    some strlen-like operation[1], thus iterating through the string once
    more...

    [1] Unless the implementation does some kind of magic. :)
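
    One caveat about the memcpy() call in f4to2_f3to1: the source (s+7)
    and destination (s+6) overlap, and the C standard leaves memcpy()
    undefined for overlapping objects. memmove() is the function defined
    for that case. A sketch of the same conversion with that change,
    under a hypothetical name, which also checks the string length before
    indexing:

```c
#include <string.h>

/*
 * Hypothetical variant of f4to2_f3to1: same return values (4, 3 or 0),
 * but uses memmove() because the copied regions overlap, and refuses to
 * index past the end of a short string.
 */
int f4to2_f3to1_safe(char *s)
{
    size_t len = strlen(s);

    if (len > 5 && s[5] == ':') {           /* f4 -> f2: blank the colon */
        s[5] = ' ';
        return 4;
    }
    if (len > 6 && s[6] == ':') {           /* f3 -> f1: close the gap */
        /* len - 6 bytes: everything after the colon plus the '\0' */
        memmove(s + 6, s + 7, len - 6);
        return 3;
    }
    return 0;
}
```

    The strlen() call is still a full pass over the string, so this
    addresses correctness only, not the efficiency concern.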

    Regards
    --
    Irrwahn
    ()
     
    Irrwahn Grausewitz, Oct 14, 2003
    #10
  11. Irrwahn Grausewitz <> spoke thus:

    > What exactly do you mean by "delete": "overwrite" or "move all following
    > chars to the left"?


    Basically,
    "AAAAAA:AAAAAAAA;AAAAA:AAAAAAAA;AAAAAAAAAAAAAA;AAAAAA:AAAAAAAA" ->
    "AAAAAAAAAAAAAA;AAAAA AAAAAAAA;AAAAAAAAAAAAAA;AAAAAAAAAAAAAA"

    > else if ( s[6] == ':' ) /* f3 -> f1 */
    > {
    > memcpy( s+6, s+7, strlen(s+7)+1 );
    > ret = 3;
    > }
    > return ret;
    > }


    > Hm, still not very efficient, is it?


    Depends on how efficient memcpy() is relative to what I wrote...

    > Ah, and one additional remark: in your original code you used the
    > non-standard strdup() function, which in turn will very likely perform
    > some strlen-like operation[1], thus iterating through the string once
    > more...


    Indeed, which is why I gratefully used another poster's suggestion for
    eliminating strdup() :)

    --
    Christopher Benson-Manica | Upon the wheel thy fate doth turn,
    ataru(at)cyberspace.org | upon the rack thy lesson learn.
     
    Christopher Benson-Manica, Oct 14, 2003
    #11
  12. Christopher Benson-Manica <> wrote:

    >Irrwahn Grausewitz <> spoke thus:
    >
    >> What exactly do you mean by "delete": "overwrite" or "move all following
    >> chars to the left"?

    >
    >Basically,
    >"AAAAAA:AAAAAAAA;AAAAA:AAAAAAAA;AAAAAAAAAAAAAA;AAAAAA:AAAAAAAA" ->
    >"AAAAAAAAAAAAAA;AAAAA AAAAAAAA;AAAAAAAAAAAAAA;AAAAAAAAAAAAAA"


    Ah, I see, you didn't mean
    "I can't think of a good way to delete the semicolons."
    but
    "I can't think of a good way to delete the colons." [1]

    >
    >> else if ( s[6] == ':' ) /* f3 -> f1 */
    >> {
    >> memcpy( s+6, s+7, strlen(s+7)+1 );
    >> ret = 3;
    >> }
    >> return ret;
    >> }

    >
    >> Hm, still not very efficient, is it?

    >
    >Depends on how efficient memcpy() relative to what I wrote...


    Well, I'm more concerned about the efficiency of strlen(), though.

    >> Ah, and one additional remark: in your original code you used the
    >> non-standard strdup() function, which in turn will very likely perform
    >> some strlen-like operation[1], thus iterating through the string once
    >> more...

    >
    >Indeed, which is why I gratefully used another poster's suggestion for
    >eliminating strdup() :)


    Wise move. :)

    [1] Actually, I wouldn't feel comfortable with my colon deleted. ;-)

    Regards
    --
    Irrwahn
    ()
     
    Irrwahn Grausewitz, Oct 14, 2003
    #12
  13. Irrwahn Grausewitz <> spoke thus:

    > Ah, I see, you didn't mean
    > "I can't think of a good way to delete the semicolons."
    > but
    >"I can't think of a good way to delete the colons." [1]


    Yes. As you can see, however, I did think of a rather dubious way of doing it
    ;)

    > Well, I'm more concerned about the efficiency of strlen(), though.


    Well, calling strlen and memcpy together multiple times seems like it'd be a
    little on the slow side, compared to both my original and revised versions.

    > [1] Actually, I wouldn't feel comfortable with my colon deleted. ;-)


    I doubt I would either, although I don't think I'd miss my semicolon (it's a
    vestigial organ in humans).

    --
    Christopher Benson-Manica | Upon the wheel thy fate doth turn,
    ataru(at)cyberspace.org | upon the rack thy lesson learn.
     
    Christopher Benson-Manica, Oct 14, 2003
    #13
  14. On Tue, 14 Oct 2003 17:44:15 +0000 (UTC), Christopher Benson-Manica
    <> wrote:


    >Kevin D. Quitt <> spoke thus:
    >
    >> Have you looked at strspn and strcspn? The latter will locate the (next)
    >> semi-colon, and the former can verify that the characters from the current
    >> to the semi-colon are all alphanumerics.

    >
    >If I didn't have to remove the ':' characters, I might do just that.
    >Unfortunately I don't have that luxury.


    Huh? Each call to that function tells you how many characters to move to
    your output, excluding the : (I thought you said ;). So use strncpy to
    move each section to the area it belongs.
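
    A rough sketch of that idea, with assumptions stated: strcspn() finds
    the run of characters up to the next ':' or ';', strspn() checks that
    the run contains only permitted characters, and each run is copied to
    the output in one call (memcpy() here in place of strncpy(), since
    the length is already known). The names convert_list, copy_run and
    ITEM_CHARS are hypothetical. The character check is loose: it accepts
    a space anywhere in an item, not only in the sixth position.

```c
#include <stddef.h>
#include <string.h>

static const char ITEM_CHARS[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789 ";

/* Copy the run of characters before the next ':' or ';' (or the end of
   the string) and advance both pointers. Returns the run length, or
   (size_t)-1 if the run contains a character outside ITEM_CHARS. */
static size_t copy_run(char **dst, const char **src)
{
    size_t len = strcspn(*src, ":;");

    if (strspn(*src, ITEM_CHARS) < len)
        return (size_t)-1;
    memcpy(*dst, *src, len);
    *dst += len;
    *src += len;
    return len;
}

/* Write the converted list to dst, which must have room for
   strlen(src) + 1 bytes. Returns the item count, or -1 on a format
   error (dst may then hold a partial result). */
int convert_list(const char *src, char *dst)
{
    int items = 0;

    while (*src != '\0') {
        size_t head = copy_run(&dst, &src);
        size_t total = head;

        if (head == (size_t)-1)
            return -1;
        if (*src == ':') {
            size_t tail;

            if (head == 5) {
                *dst++ = ' ';               /* format 4 -> format 2 */
                total++;
            } else if (head != 6) {
                return -1;                  /* colon in the wrong column */
            }
            src++;                          /* skip the colon itself */
            tail = copy_run(&dst, &src);
            if (tail != 8 || *src == ':')
                return -1;
            total += tail;
        }
        if (total != 14)
            return -1;
        if (*src == ';')
            *dst++ = *src++;                /* keep the item separator */
        items++;
    }
    *dst = '\0';
    return items > 0 ? items : -1;
}
```

    Each item is handled with at most two library scans and two block
    copies, and a trailing semicolon is accepted because the loop simply
    ends when the input does.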


    --
    #include <standard.disclaimer>
    _
    Kevin D Quitt USA 91387-4454 96.37% of all statistics are made up
    Per the FCA, this address may not be added to any commercial mail list
     
    Kevin D. Quitt, Oct 14, 2003
    #14
  15. CBFalconer Guest

    Dan Pop wrote:
    > <> writes:
    >
    > > I have a string delimited by semicolons. The items delimited
    > > may be in any of the following formats:
    > > 1) 14 alphanum characters
    > > 2) 5 alphanums space 8 alphanums
    > > 3) 6 alphanums colon 8 alphanums
    > > 4) 5 alphanums colon 8 alphanums
    > >
    > > My task is to convert items in the third format to the first
    > > format, and items in the fourth format to the second. Also,
    > > I need to count the number of items in the string, which may
    > > or may not have a trailing semicolon.

    >

    .... snips of code ...
    >
    > I would implement this function using sscanf calls. The result
    > would be slower, but a lot more readable. The conversion
    > specifier for alphanumerics can use the following macro:
    >
    > #define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"


    For some reason this problem intrigued me, so ....

    Only Dan would recommend sscanf, others find other methods much
    easier. I challenge him to match the performance and clarity of
    the following code using any scanf() variety. If he succeeds it
    will be quite instructive to me.

    This accepts records of types 1 and 2, and converts those of types
    3 and 4. Anything else is an error and the next \n or semicolon
    resynchronizes. After the code there follows a set of test
    inputs. After a run its output passes through unchanged.

    Released to Public Domain, by C.B. Falconer (just in case).

    ---------------------cut convert.c here ---------------
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    #define MAXTOKEN 14 /* on output */
    #define MAXLINE (72 - MAXTOKEN) /* limit output line lgh */

    /* allowable return values from getitem() */
    typedef enum results {DONE
    ,OK
    ,TOOLONG
    ,TOOSHORT
    ,BADCHAR
    ,BADLOGIC} results;

    #define SEMI ';' /* record separator */

    /* ---------------- */

    static int flushrecord(int ch)
    {
    while ((SEMI != ch) && (EOF != ch) && ('\n' != ch)) {
    ch = getchar();
    }
    return ch;
    } /* flushrecord */

    /* ---------------- */

    /* Acquire the next record. At exit either EOF has
    occurred or the next SEMI or '\n' has been read so
    that the input stream is ready to start the next item. */
    static results getitem(char * buf)
    {
    int ch;
    int ix;
    int kind;
    results err;

    ix = 0; err = OK; kind = 1;
    while (EOF != (ch = getchar())
    && (SEMI != ch)
    && ('\n' != ch)
    && (ix < MAXTOKEN)
    && (OK == err)) {
    buf[ix++] = ch;

    if (!isalnum(ch)) { /* check the transforms */
    if (kind != 1) {
    err = BADCHAR;
    break;
    }
    else if ((' ' == ch) && (6 == ix)) {
    kind = 2;
    }
    else if ((':' == ch) && (7 == ix)) {
    /* type 3 becomes type 1 */
    kind = 3;
    --ix;
    }
    else if ((':' == ch) && (6 == ix)) {
    /* type 4 becomes type 2 */
    kind = 4;
    buf[ix - 1] = ' ';
    }
    else {
    err = BADCHAR;
    break;
    }
    } /* if (isalnum) */
    } /* while */
    buf[ix] = '\0';
    if ((MAXTOKEN == ix) && (OK == err)
    && (('\n' == ch) || (SEMI == ch))) return OK;
    else if (OK != err) /* propagate the already set err. */
    /* Do nothing */;
    else if (EOF == ch) err = DONE;
    else if (MAXTOKEN == ix) err = TOOLONG;
    else if (MAXTOKEN > ix) err = TOOSHORT;
    else /* can't happen */ err = BADLOGIC;

    if (EOF == flushrecord(ch)) err = DONE;
    return err;
    } /* getitem */

    /* ---------------- */

    int main(void)
    {
    int linelgh = 0;
    results result;
    char buffer[MAXTOKEN+1];

    /* First, we handle input and output and termination */
    /* We postpone all the nitty-gritty to getitem() */
    while (DONE != (result = getitem(buffer))) {
    if (OK != result)
    fprintf(stderr, "Error %d: %s\n", result, buffer);
    else {
    if (linelgh != 0) putc(SEMI, stdout);
    fputs(buffer, stdout);
    if (MAXLINE < (linelgh += (1 + strlen(buffer)))) {
    putc('\n', stdout);
    linelgh = 0;
    }
    }
    }
    if (linelgh != 0) putc('\n', stdout);
    if (strlen(buffer)) {
    fprintf(stderr, "Orphan data: %s\n", buffer);
    }
    return 0;
    } /* main of convert */

    -----------------cut convert.c ends here ---------------
    -------------------cut convtest.txt here ---------------
    01RecordType01
    02Rcd TypeNo02
    03This:TypeNo03
    04Rcd:TypeNo04
    05RecordType01
    06Rcd TypeNo02
    07This:TypeNo03
    08Rcd:TypeNo04
    09RecordType01
    10Rcd TypeNo02
    11This:TypeNo03
    12Rcd:TypeNo04
    13ShortType01
    14Rcd Short02
    15Recd:Short03
    16Rcd:Short04
    17LongRcdType1x
    18Rcd LongTyp2x
    19This:LongTyp3x
    20Rcd:LongTyp4x
    2101Bad~Type01
    2202~ TypeNo02
    2303~s:TypeNo03
    2404~:TypeNo04
    25r01bad~ype01
    26r02 Typ~No02
    27r03s:Typ~No03
    28R04:Type~o04
    29RecordType01
    30Rcd TypeNo02
    31This:TypeNo03
    32Rcd:TypeNo04
    ----------------cut convtest.txt ends here ---------------

    --
    Chuck F () ()
    Available for consulting/temporary embedded and systems.
    <http://cbfalconer.home.att.net> USE worldnet address!
     
    CBFalconer, Oct 15, 2003
    #15
  16. Glynne Guest

    Christopher Benson-Manica <> wrote in message
    news:bmh0cj$t31$...
    > I'm wondering about the best way to do the following:
    >
    > I have a string delimited by semicolons. The items delimited may be in any of
    > the following formats:
    > 1) 14 alphanum characters
    > 2) 5 alphanums space 8 alphanums
    > 3) 6 alphanums colon 8 alphanums
    > 4) 5 alphanums colon 8 alphanums
    >
    > My task is to convert items in the third format to the first format, and items
    > in the fourth format to the second. Also, I need to count the number of items
    > in the string, which may or may not have a trailing semicolon.
    >
    > My plan (which I feel is sub-optimal - hence this post), is to step through
    > the initial string one character at a time to accomplish these things in one
    > pass. While I could count semicolons easily with strchr(), deleting the
    > colons properly means stepping through the whole string anyway (right?) and so
    > I may as well count semicolons simultaneously. I'd also like to validate the
    > data format (i.e., 15-character items are not allowed).
    >
    > int myfunc( const char *list )
    > {
    > int items=0;
    > char *cp=strdup( idlist ); /* nonstandard */
    > char *newstr=cp;
    > int shifts=0;
    > int chars=0;
    >
    > for( ; *cp ; *cp++ ) {
    > if( *cp == ':' ) {
    > if( chars == 6 ) {
    > shifts++;
    > continue;
    > }
    > if( chars == 5 ) {
    > *(cp-shifts)=' ';
    > chars++;
    > continue;
    > }
    > return( -1 ); /* error */
    > }
    > if( *cp == ';' ) {
    > items++;
    > if( chars != 14 ) {
    > return( -1 ); /* error */
    > }
    > chars=0;
    > }
    > else if( ++chars > 14 ) {
    > return( -1 ); /* error */
    > }
    > *(cp-shifts)=*cp;
    > }
    > *(cp-shifts)='\0';
    > if( chars == 14 ) {
    > items++;
    > }
    > if( !items || (chars && chars != 14) ) {
    > return( -1 ); /* error */
    > }
    > printf( "The string '%s' has %d items.", newstr, items );
    > free( newstr );
    > return( 0 ); /* success */
    > }
    >
    > Is there a better way?




    /* how's this? */
    int myfunc( char *src )
    {
        char *dst;
        int n, b, i, j, k;

        /* should check for null or empty src string */

        for( dst=src, b=n=i=j=k=0 ; dst[k]=src[i] ; i++, k++ ) {
            if( dst[k]==':' ) {
                if( k-j==6 ) {
                    k--; /* eat it */
                }
                else if( k-j==5 ) {
                    dst[k]= ' '; /* blank it */
                }
            }
            else if( dst[k]==';' ) {
                n++; /* count it */
                if( k-j > 14+b ) {
                    return -1;
                }
                j= k+1; /* start of next item */
                b= 0;
            }
            if( dst[k]==' ' ) {
                b= 1; /* items with a blank are longer */
            }
        }
        n++;

        if( src[i-1]==';' ) {
            n--; /* trailing semicolon */
        }

        printf( "The string '%s' has %d items.", dst, n );
        return 0;
    }
     
    Glynne, Oct 15, 2003
    #16
  17. Dan Pop Guest

    In <> CBFalconer <> writes:

    >Dan Pop wrote:
    >> <> writes:
    >>
    >> > I have a string delimited by semicolons. The items delimited
    >> > may be in any of the following formats:
    >> > 1) 14 alphanum characters
    >> > 2) 5 alphanums space 8 alphanums
    >> > 3) 6 alphanums colon 8 alphanums
    >> > 4) 5 alphanums colon 8 alphanums
    >> >
    >> > My task is to convert items in the third format to the first
    >> > format, and items in the fourth format to the second. Also,
    >> > I need to count the number of items in the string, which may
    >> > or may not have a trailing semicolon.

    >>

    >... snips of code ...
    >>
    >> I would implement this function using sscanf calls. The result
    >> would be slower, but a lot more readable. The conversion
    >> specifier for alphanumerics can use the following macro:
    >>
    >> #define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"

    >
    >For some reason this problem intrigued me, so ....
    >
    >Only Dan would recommend sscanf, others find other methods much
    >easier. I challenge him to match the performance and clarity of
    >the following code using any scanf() variety. If he succeeds it
    >will be quite instructive to me.


    As I've already suggested, if performance is an overriding concern, you
    probably don't want to use sscanf (unless it happens to be fast enough
    for your needs; rejecting it a priori is downright stupid). Its main
    merit is that it simplifies the code structure and makes each individual
    test a lot clearer than hand-crafted code. That is, assuming that the
    reader knows how scanf works (is there any *valid* reason for failing this
    assumption? ;-)

    Untested and one of the most boring pieces of code I've ever written,
    but clear and easily maintainable if new formats have to be supported.
    It returns a structure containing both an item count and a pointer to
    the converted string. If the pointer is null, the function failed
    to allocate memory for output, otherwise the caller must free it.
    If count is negative and the pointer is valid, an input error was
    detected, but the output string contains all the valid items already
    processed.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"

    struct foo { int count; char *output; };

    struct foo parser(char *p)
    {
    char *q = calloc(strlen(p) + 1, 1);
    char c, field1[15], field2[9];
    struct foo ret = { -1, 0 };
    int n = 0, rc, count = 0;

    if (q != NULL) ret.output = q;
    else return ret;

    while (p += n, *p != 0) {
    count++;

    /* formats 2 and 4 */

    rc = sscanf(p, "%5" ALNUM "%*1[ :]%8" ALNUM "%c%n",
    field1, field2, &c, &n);
    if (rc >= 2) {
    if (rc > 2 && c != ';') return ret;
    if (strlen(field1) != 5 || strlen(field2) != 8) return ret;
    memcpy(q, field1, 5), memcpy(q + 6, field2, 8);
    q[5] = ' ', q += 14;
    if (rc == 2) break;
    *q++ = ';';
    continue;
    }

    /* format 3 */

    rc = sscanf(p, "%6" ALNUM ":%8" ALNUM "%c%n",
    field1, field2, &c, &n);
    if (rc >= 2) {
    if (rc > 2 && c != ';') return ret;
    if (strlen(field1) != 6 || strlen(field2) != 8) return ret;
    memcpy(q, field1, 6), memcpy(q + 6, field2, 8), q += 14;
    if (rc == 2) break;
    *q++ = ';';
    continue;
    }

    /* format 1 */

    rc = sscanf(p, "%14" ALNUM "%c%n", field1, &c, &n);
    if (rc >= 1) {
    if (rc > 1 && c != ';') return ret;
    if (strlen(field1) != 14) return ret;
    memcpy(q, field1, 14), q += 14;
    if (rc == 1) break;
    *q++ = ';';
    continue;
    }
    return ret;
    }
    ret.count = count;
    return ret;
    }
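
    For illustration, a caller might tell the three outcomes described
    above apart as follows. This is only a sketch: the name `report` and its
    return codes are invented here, and struct foo is repeated so that the
    fragment stands alone.

```c
#include <stdio.h>
#include <stdlib.h>

struct foo { int count; char *output; };  /* same layout as in the post */

/* Hypothetical helper: reports a result returned by parser() and frees it.
   Returns 0 on success, 1 on an input error, 2 on allocation failure. */
int report(struct foo r)
{
    if (r.output == NULL) {               /* parser could not allocate */
        fputs("out of memory\n", stderr);
        return 2;
    }
    if (r.count < 0)                      /* output holds the valid items so far */
        printf("bad input; items converted so far: '%s'\n", r.output);
    else
        printf("The string '%s' has %d items.\n", r.output, r.count);
    free(r.output);                       /* the caller owns the buffer */
    return r.count < 0 ? 1 : 0;
}
```

    It would be called as report(parser(list)), where list is the
    semicolon-delimited input string.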

    The code is very repetitive, each successful sscanf call being handled in
    the same way:

    1. Check that the trailing character, if present, is a semicolon.
    2. Check that the fields have the correct sizes.
    3. Copy the fields to the output.
    4. If there was no trailing character, exit the loop.
    5. Add the trailing semicolon to the output string.
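
    Since the five steps are identical for every format, they could be
    factored into one helper. A minimal sketch, not part of the code above:
    the name `emit_item` and all of its parameters are made up for
    illustration, and the output buffer is assumed to be zero-filled (as
    with calloc), so no terminator is written.

```c
#include <string.h>

/* Hypothetical helper covering steps 1-5 for one matched item.
   q:       address of the output cursor, advanced past what is written
   have_c:  nonzero if sscanf also converted the trailing character c
   f1, f2:  the scanned fields (f2 is NULL for format 1)
   len1/2:  the lengths the fields are required to have
   sep:     character to place between the fields, or 0 for none
   Returns -1 on error, 0 if this was the last item, 1 to keep going. */
int emit_item(char **q, int have_c, char c,
              const char *f1, size_t len1,
              const char *f2, size_t len2, char sep)
{
    if (have_c && c != ';') return -1;                /* step 1 */
    if (strlen(f1) != len1) return -1;                /* step 2 */
    if (f2 != NULL && strlen(f2) != len2) return -1;

    memcpy(*q, f1, len1);                             /* step 3 */
    *q += len1;
    if (sep != 0) *(*q)++ = sep;
    if (f2 != NULL) {
        memcpy(*q, f2, len2);
        *q += len2;
    }

    if (!have_c) return 0;                            /* step 4 */
    *(*q)++ = ';';                                    /* step 5 */
    return 1;
}
```

    For formats 2 and 4 the call would be
    emit_item(&q, rc > 2, c, field1, 5, field2, 8, ' '), for format 3
    emit_item(&q, rc > 2, c, field1, 6, field2, 8, 0), and for format 1
    emit_item(&q, rc > 1, c, field1, 14, NULL, 0, 0).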

    But the control structure is straightforward, with a single loop and only
    one level of nested if's.

    Important note: the order of the three sscanf calls matters, because
    each of them will also match the format(s) handled by the previous one(s),
    but will reject them at the field length test.
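
    The overlap can be checked by applying the format 3 pattern to a
    format 4 item. A small sketch; the function name `format3_rc` is
    invented for this illustration, and the %n conversion is left out
    because only the return value matters here.

```c
#include <stdio.h>
#include <string.h>

#define ALNUM "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]"

/* Applies the format 3 pattern to item and returns what sscanf returned;
   the length of the first scanned field is stored in *len1. */
int format3_rc(const char *item, size_t *len1)
{
    char field1[15] = "", field2[9] = "", c;
    int rc = sscanf(item, "%6" ALNUM ":%8" ALNUM "%c", field1, field2, &c);

    *len1 = strlen(field1);
    return rc;
}
```

    For the format 4 item "abcde:12345678" this returns 2 with a first
    field of length 5: the pattern matches, and only the length test
    distinguishes the item from format 3. Had format 3 been tried first,
    the item would have been rejected instead of being converted.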

    Dan
    --
    Dan Pop
    DESY Zeuthen, RZ group
    Email:
     
    Dan Pop, Oct 15, 2003
    #17