strtok()

Discussion in 'C Programming' started by Mark, Aug 3, 2010.

  1. Mark

    Mark Guest

    Hi

    I'm trying to write a simple parser for my application; the purpose is to
    let the application understand command line arguments of the form:

    my_app 1-3,5,9
    or
    my_app 1,4,8-24
    ....

    so it should support both ranges and enumerators. But my function doesn't
    print what I expect:

    int parseLine(char *buf)
    {
        char *token, *subtoken;
        char buftmp[20];

        for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ","))
        {
            printf("%s: ", token);
            strcpy(buftmp, token); /* strtok modifies buffer, so we save a copy */
            for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
                 subtoken = strtok(NULL, "-")) {
                printf("%s ", buf,subtoken);
            }
            putchar('\n');
        }

        return 0;
    }

    For example, buf="1-3,5,8", and I'd expect to have such output:
    1-3: 1 3
    5: 5
    8: 8

    Where is my mistake?
    Thanks!

    --
    Mark
     
    Mark, Aug 3, 2010
    #1

  2. On Aug 3, 12:26 pm, "Mark" <> wrote:
    > Hi
    >
    > I'm trying to write a simple parser for my application, the purpose is to
    > allow application understand the command line arguments in the form:
    >
    > my_app 1-3,5,9
    > or
    > my_app 1,4,8-24
    > ...
    >
    > so it should support both ranges and enumerators. But my function doesn't
    > print what I expect:
    >
    > int parseLine(char *buf)
    > {
    >     char *token, *subtoken;
    >     char buftmp[20];
    >
    >     for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ","))
    > {
    >             for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
    >              subtoken = strtok(NULL, "-")) {
    >             printf("%s ", buf,subtoken);
    >         }
    >         putchar('\n');
    >     }


    >
    > Where is my mistake?
    >

    Nesting strtoks(). The function uses a static to store the current
    pointer position, which you then overwrite with the nested call.
    strtok is basically a bad function. Write your own strsplit() instead,
    returning a list of strings in allocated memory.
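
    A minimal sketch of what such a strsplit() might look like. Only the
    name comes from the post above; the signature, the single-character
    delimiter, the NULL-terminated result and the strsplit_free() helper
    are all assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: free a list returned by strsplit(). */
void strsplit_free(char **list)
{
    size_t i;

    if (list == NULL)
        return;
    for (i = 0; list[i] != NULL; i++)
        free(list[i]);
    free(list);
}

/* Hypothetical helper: split str at each occurrence of delim.
 * Returns a NULL-terminated array of newly allocated strings, or NULL if
 * an allocation fails. Unlike strtok() it keeps no hidden state, leaves
 * str untouched, and keeps empty fields ("1,,3" yields "1", "", "3"). */
char **strsplit(const char *str, char delim)
{
    size_t count = 1, i;
    const char *p;
    char **list;

    /* One field more than the number of delimiters. */
    for (p = str; *p != '\0'; p++)
        if (*p == delim)
            count++;

    list = malloc((count + 1) * sizeof *list);
    if (list == NULL)
        return NULL;

    p = str;
    for (i = 0; i < count; i++) {
        const char *end = strchr(p, delim);
        size_t len = end ? (size_t)(end - p) : strlen(p);

        list[i] = malloc(len + 1);
        if (list[i] == NULL) {
            /* list[i] is NULL, so it terminates the partial list. */
            strsplit_free(list);
            return NULL;
        }
        memcpy(list[i], p, len);
        list[i][len] = '\0';
        p += len + (end ? 1 : 0);   /* step over the delimiter, if any */
    }
    list[count] = NULL;
    return list;
}
```

    Because each call returns its own list, splitting "1-3,5,8" on ','
    and then splitting each element on '-' nests without interference.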
     
    Malcolm McLean, Aug 3, 2010
    #2

  3. "Mark" <> writes:

    > I'm trying to write a simple parser for my application, the purpose is
    > to allow application understand the command line arguments in the
    > form:
    >
    > my_app 1-3,5,9
    > or
    > my_app 1,4,8-24
    > ...
    >
    > so it should support both ranges and enumerators. But my function
    > doesn't print what I expect:
    >
    > int parseLine(char *buf)
    > {
    > char *token, *subtoken;
    > char buftmp[20];
    >
    > for (token = strtok(buf, ","); token != NULL; token = strtok(NULL,
    > ",")) {
    > printf("%s: ", token);
    > strcpy(buftmp, token); /* strtok modifies buffer, so we save
    > a copy */
    > for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
    > subtoken = strtok(NULL, "-")) {
    > printf("%s ", buf,subtoken);


    The problem with strtok has been pointed out, but you can continue to
    use it for the outer loop because you don't really need a nested call
    here. You expect only one pair or maybe a lone number and you can parse
    that using sscanf:

    sscanf(token, "%d-%d", &low, &high)

    will return 1 for lone numbers, 2 for a pair like 1-3 and anything else
    is an error and needs to be reported.

    If you need to check that there are no other characters in the token you
    could do something like this:

    sscanf(token, "%d%n-%d%n", &low, &len1, &high, &len2)

    Now, you need a return of 1 and strlen(token) == len1 or a return of 2
    and strlen(token) == len2. Again, anything else is an error.
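
    The check described above could be wrapped up roughly like this. The
    function name parse_range() and its 0 / -1 return convention are
    assumptions, not part of the original suggestion. Note that %d also
    accepts leading whitespace and a sign, so " 5" and "-3" pass this
    check.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse "N" or "N-M" from token.
 * Returns 0 on success (a lone number sets *high = *low), -1 on error. */
int parse_range(const char *token, int *low, int *high)
{
    int len1 = -1, len2 = -1;   /* stay -1 if the matching %n is not reached */
    int n = sscanf(token, "%d%n-%d%n", low, &len1, high, &len2);

    /* Lone number: one conversion, and it consumed the whole token. */
    if (n == 1 && len1 >= 0 && (size_t)len1 == strlen(token)) {
        *high = *low;
        return 0;
    }
    /* Pair: two conversions, and together they consumed the whole token. */
    if (n == 2 && len2 >= 0 && (size_t)len2 == strlen(token))
        return 0;

    return -1;   /* trailing junk, missing number, or nothing parsed */
}
```

    With this, "1-3" gives low 1 and high 3, "5" gives 5 and 5, and
    tokens such as "1-" or "1x" are rejected.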

    > }
    > putchar('\n');
    > }
    >
    > return 0;
    > }


    <snip>
    --
    Ben.
     
    Ben Bacarisse, Aug 3, 2010
    #3
  4. Eric Sosman

    Eric Sosman Guest

    On 8/3/2010 5:26 AM, Mark wrote:
    > Hi
    >
    > I'm trying to write a simple parser for my application, the purpose is
    > to allow application understand the command line arguments in the form:
    >
    > my_app 1-3,5,9
    > or
    > my_app 1,4,8-24
    > ...
    >
    > so it should support both ranges and enumerators. But my function
    > doesn't print what I expect:
    >
    > int parseLine(char *buf)
    > {
    > char *token, *subtoken;
    > char buftmp[20];
    >
    > for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ",")) {
    > printf("%s: ", token);
    > strcpy(buftmp, token); /* strtok modifies buffer, so we save a copy */
    > for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
    > subtoken = strtok(NULL, "-")) {
    > printf("%s ", buf,subtoken);
    > }
    > putchar('\n');
    > }
    >
    > return 0;
    > }
    >
    > For example, buf="1-3,5,8", and I'd expect to have such output:
    > 1-3: 1 3
    > 5: 5
    > 8: 8
    >
    > Where is my mistake?


    strtok() doesn't "nest:" It can be working on only one source
    string at a time. When you call strtok(buftmp,...), it forgets
    about the "outer" string.

    If your system has the (non-Standard) strtok_r() function, you
    might be able to use that instead of strtok().
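
    A sketch of the original loop rewritten with strtok_r(), assuming a
    POSIX system. The function name parse_line_r(), the save-pointer
    names and the token-count return value are made up for this example.
    Each loop gets its own save pointer, so the inner scan no longer
    disturbs the outer one.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask <string.h> to declare strtok_r() */
#endif
#include <stdio.h>
#include <string.h>

/* Print each comma-separated token followed by its dash-separated parts.
 * buf must be modifiable: strtok_r() writes '\0' over the delimiters.
 * Returns the number of comma-separated tokens seen. */
int parse_line_r(char *buf)
{
    char *token, *subtoken;
    char *outer_save, *inner_save;   /* one restart point per loop */
    int count = 0;

    for (token = strtok_r(buf, ",", &outer_save); token != NULL;
         token = strtok_r(NULL, ",", &outer_save)) {
        /* Print the token before the inner scan splits it in place. */
        printf("%s: ", token);
        for (subtoken = strtok_r(token, "-", &inner_save); subtoken != NULL;
             subtoken = strtok_r(NULL, "-", &inner_save)) {
            printf("%s ", subtoken);
        }
        putchar('\n');
        count++;
    }
    return count;
}
```

    Called on a writable array holding "1-3,5,8", this prints the three
    lines Mark expected (each with a trailing space) and returns 3.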

    --
    Eric Sosman
     
    Eric Sosman, Aug 3, 2010
    #4
  5. On 3 Aug, 10:26, "Mark" <> wrote:
    > Hi
    >
    > I'm trying to write a simple parser for my application, the purpose is to
    > allow application understand the command line arguments in the form:
    >
    > my_app 1-3,5,9
    > or
    > my_app 1,4,8-24
    > ...
    >
    > so it should support both ranges and enumerators. But my function doesn't
    > print what I expect:
    >
    > int parseLine(char *buf)
    > {
    >     char *token, *subtoken;
    >     char buftmp[20];
    >
    >     for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ","))
    > {
    >         printf("%s: ", token);
    >         strcpy(buftmp, token);    /* strtok modifies buffer, so we save a
    > copy */
    >         for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
    >              subtoken = strtok(NULL, "-")) {
    >             printf("%s ", buf,subtoken);
    >         }
    >         putchar('\n');
    >     }
    >
    >     return 0;
    >
    > }
    >
    > For example, buf="1-3,5,8", and I'd expect to have such output:
    > 1-3: 1 3
    > 5: 5
    > 8: 8


    be nice if you told us what it did instead...
    other posters have pointed out the nesting problem.
    also note strtok() modifies the string it's parsing so beware

    parseLine ("1-3,5,6");

    might give a problem (it's actually undefined behaviour to modify a
    string literal)
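
    A small sketch of how to stay clear of that: hand strtok() a writable
    copy rather than the literal. The names count_tokens() and
    count_tokens_const() and the 64-byte buffer are arbitrary choices for
    this example.

```c
#include <string.h>

/* Count comma-separated tokens. strtok() overwrites the commas with
 * '\0', so buf must point to writable storage. */
int count_tokens(char *buf)
{
    int n = 0;
    char *t;

    for (t = strtok(buf, ","); t != NULL; t = strtok(NULL, ","))
        n++;
    return n;
}

/* Wrapper for read-only input such as a string literal: tokenize a
 * private copy instead. Returns -1 if the text does not fit. */
int count_tokens_const(const char *text)
{
    char copy[64];

    if (strlen(text) >= sizeof copy)
        return -1;
    strcpy(copy, text);
    return count_tokens(copy);
}
```

    count_tokens_const("1-3,5,6") is fine, as is calling count_tokens()
    on an array declared as char buf[] = "1-3,5,6"; passing the literal
    itself to count_tokens() is the undefined case.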
     
    Nick Keighley, Aug 3, 2010
    #5
  6. Ben Bacarisse <> writes:
    [snip]
    > The problem with strtok has been pointed out, but you can continue to
    > use it because you don't really need it here. You expect only one pair
    > or maybe a lone number and you can parse that using sscanf:
    >
    > sscanf(token, "%d-%d", &low, &high)
    >
    > will return 1 for lone numbers, 2 for a pair like 1-3 and anything else
    > is an error and needs to be reported.

    [...]

    Keep in mind that sscanf's behavior is undefined if you scan a number
    outside the range of the specified type. For example,
    if INT_MAX==32767, then this:

    sscanf("40000-50000", "%d-%d", &low, &high);

    has undefined behavior. Which is a great pity; it makes the *scanf()
    functions very difficult to use safely for numeric input.

    With a bit of extra work, you can use the strto*() functions instead;
    they're sane enough to tell you if the value is out of range (by
    returning an extreme value and setting errno to ERANGE).
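
    A sketch of that "bit of extra work" using strtol(). The wrapper name
    parse_int() and its return convention are assumptions for the
    example; the errno/ERANGE behaviour is what the C standard specifies
    for strtol().

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: convert the decimal integer at the start of s.
 * On success stores it in *out, leaves *end pointing just past the
 * digits, and returns 0. Returns -1 if there are no digits or the value
 * does not fit in an int. */
int parse_int(const char *s, int *out, char **end)
{
    long v;

    errno = 0;                    /* strtol() only sets errno on failure */
    v = strtol(s, end, 10);

    if (*end == s)
        return -1;                /* nothing was converted */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -1;                /* out of range for long or for int */

    *out = (int)v;
    return 0;
}
```

    For a range such as "8-24" one could call it once, check that *end
    is '-', and call it again on end + 1.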

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Aug 3, 2010
    #6
  7. Mark

    Mark Guest

    Keith Thompson wrote:
    [skip]
    > With a bit of extra work, you can use the strto*() functions instead;
    > they're sane enough to tell you if the value is out of range (by
    > returning an extreme value and setting errno to ERANGE).

    My system's strtok man page (Fedora Core 6) doesn't say anything about
    returning extreme value or setting errno to ERANGE.

    --
    Mark
     
    Mark, Aug 4, 2010
    #7
  8. Mark

    Mark Guest

    Vincenzo Mercuri wrote:
    > I've written a scratch I hope will serve. Beware that maybe I am
    > missing some error checkings, also you couldn't write white spaces
    > between the separators "," , "-" and numbers. I didn't add any checks

    Thanks, I'll give it a try.

    --
    Mark
     
    Mark, Aug 4, 2010
    #8
  9. Mark

    Mark Guest

    Eric Sosman wrote:
    [skip]
    > strtok() doesn't "nest:" It can be working on only one source
    > string at a time. When you call strtok(buftmp,...), it forgets
    > about the "outer" string.
    >
    > If your system has the (non-Standard) strtok_r() function, you
    > might be able to use that instead of strtok().


    So for strtok_r() it's safe to pass the same buffer pointer? Like this:

    for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ",")) {
    printf("%s: ", token);
    /* no need to keep a copy of 'buf' */
    for (subtoken = strtok(buftmp, "-"); subtoken != NULL; subtoken =
    strtok(NULL, "-")) {
    printf("%s ", buf,subtoken);
    }
    }


    --
    Mark
     
    Mark, Aug 4, 2010
    #9
  10. Mark

    Mark Guest

    One more question; when I compile code featuring strtok_r() with
    "gcc -ansi -pedantic -W -Wall" it naturally complains:

    warning: implicit declaration of function 'strtok_r'
    warning: assignment makes pointer from integer without a cast

    First warning is clear, the second refers to strtok_r() call:

    char *token;
    char *saveptr1 = NULL, *saveptr2 = NULL;
    token = strtok_r(buf, ",", &saveptr1);

    I wonder what the compiler's logic is here: in ANSI mode, if a function
    is not prototyped, the compiler assumes that it returns 'int', but it
    actually returns 'char *'. Is that correct?

    These warnings are gone, when compiled with "-posix -W -Wall"

    --
    Mark
     
    Mark, Aug 4, 2010
    #10
  11. "Mark" <> writes:
    > One more question; when I compile code featuring strtok_r() with
    > "gcc -ansi -pedantic -W -Wall" it naturally complains:
    >
    > warning: implicit declaration of function 'strtok_r'
    > warning: assignment makes pointer from integer without a cast
    >
    > First warning is clear, the second refers to strtok_r() call:
    >
    > char *token;
    > char *saveptr1 = NULL, *saveptr2 = NULL;
    > token = strtok_r(buf, ",", &saveptr1);
    >
    > I wonder, what is the compiler's logic here: if in ANSI mode a function is
    > not prototyped, then the compiler considers that such functions return
    > 'int', but it actually return 'char *', is that correct?


    That's correct. In C90, a reference to an undeclared function
    effectively creates an implicit declaration for the function assuming it
    returns int and takes a fixed but unspecified number and type of
    arguments. So writing
    token = strtok_r(buf, ",", &saveptr1);
    implicitly declares
    int strtok_r();

    In C99, a reference to an undeclared function is a constraint violation.
    Even in C90, it's poor style to depend on it; functions should be
    declared, preferably by #include'ing the appropriate header.

    > These warnings are gone, when compiled with "-posix -W -Wall"


    Probably "-posix" causes the declaration of strtok_r to become visible.
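
    One way to make the declaration visible without dropping -ansi,
    assuming a POSIX system whose headers honour feature test macros
    (glibc does): define _POSIX_C_SOURCE before the first #include. The
    wrapper first_field() exists only to give the example something to
    compile.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* must come before any #include */
#endif
#include <string.h>

/* With the declaration in scope the compiler knows strtok_r() returns
 * char *, so neither warning appears. */
char *first_field(char *buf, char **saveptr)
{
    return strtok_r(buf, ",", saveptr);
}
```

    The same macro can be supplied on the command line instead, for
    example with -D_POSIX_C_SOURCE=200112L.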

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Aug 4, 2010
    #11
  12. Ian Collins <> writes:
    > On 08/ 4/10 01:12 PM, Mark wrote:
    >> Keith Thompson wrote:
    >> [skip]
    >>> With a bit of extra work, you can use the strto*() functions instead;
    >>> they're sane enough to tell you if the value is out of range (by
    >>> returning an extreme value and setting errno to ERANGE).

    >> My system's strtok man page (Fedora Core 6) doesn't say anything about
    >> returning extreme value or setting errno to ERANGE.

    >
    > I'm sure Keith was referring to strtol() and strtoll()


    Yes, along with strtoul(), strtoull(), strtod(), strtof(), and
    strtold(). I didn't notice that "strtok" matches the same pattern
    (because the "to" in "strtok" is part of "tok", an abbreviation of
    "token", not the word "to").

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Aug 4, 2010
    #12
  13. Eric Sosman

    Eric Sosman Guest

    On 8/3/2010 9:43 PM, Mark wrote:
    > Eric Sosman wrote:
    > [skip]
    >> strtok() doesn't "nest:" It can be working on only one source
    >> string at a time. When you call strtok(buftmp,...), it forgets
    >> about the "outer" string.
    >>
    >> If your system has the (non-Standard) strtok_r() function, you
    >> might be able to use that instead of strtok().

    >
    > So for strtok_r() it's safe to pass the same buffer pointer? Like this:
    >
    > for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ",")) {
    > printf("%s: ", token);
    > /* no need to keep a copy of 'buf' */
    > for (subtoken = strtok(buftmp, "-"); subtoken != NULL; subtoken = strtok(NULL, "-")) {
    > printf("%s ", buf,subtoken);
    > }
    > }


    I don't see *any* strtok_r() calls here ...

    Ordinary strtok() returns a pointer to the start of a token,
    and remembers where it ends so it knows where to start the next
    search. This is why it doesn't nest: It can only remember one
    restart point in its internal variable.

    The non-Standard strtok_r() function behaves similarly, but
    uses a caller-provided variable to store the restart point. If
    the caller uses one variable for the "outer" calls and another
    for the "inners," the two scanning sequences won't interfere.

    As for the copy, it's perfectly all right to do anything you
    want to a substring located by strtok() or strtok_r(): Once it's
    been located and divided from the surrounding string, they're done
    with it and don't need it any more. (Well, "almost anything:" it
    would be a bad idea to strcat() "Hello" onto its end, because that
    would disrupt the still-unscanned part of the original string. But
    as long as you stay within the bounds of the token string itself,
    you can do whatever you like there.)
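
    To illustrate the point about the copy: since each token holds at
    most one '-', the inner split can be done in place with strchr(),
    with no second strtok() sequence and no buftmp. This is only a
    sketch; parse_line_inplace() and its token-count return value are
    invented for the example.

```c
#include <stdio.h>
#include <string.h>

/* The outer loop uses plain strtok(); each token is then split in place
 * at its first '-'. Writing '\0' there stays inside the token, so the
 * still-unscanned part of buf is not disturbed.
 * Returns the number of comma-separated tokens seen. */
int parse_line_inplace(char *buf)
{
    char *token;
    int count = 0;

    for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ",")) {
        char *dash;

        printf("%s: ", token);          /* print before splitting */
        dash = strchr(token, '-');
        if (dash != NULL) {
            *dash = '\0';               /* token now ends at the dash */
            printf("%s %s\n", token, dash + 1);
        } else {
            printf("%s\n", token);
        }
        count++;
    }
    return count;
}
```

    Given a writable array holding "1-3,5,8" it prints "1-3: 1 3",
    "5: 5" and "8: 8" on separate lines and returns 3.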

    --
    Eric Sosman
     
    Eric Sosman, Aug 4, 2010
    #13
  14. Gene

    Gene Guest

    On Aug 3, 5:26 am, "Mark" <> wrote:
    > Hi
    >
    > I'm trying to write a simple parser for my application, the purpose is to
    > allow application understand the command line arguments in the form:
    >
    > my_app 1-3,5,9
    > or
    > my_app 1,4,8-24
    > ...
    >
    > so it should support both ranges and enumerators. But my function doesn't
    > print what I expect:
    >
    > int parseLine(char *buf)
    > {
    >     char *token, *subtoken;
    >     char buftmp[20];
    >
    >     for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ","))
    > {
    >         printf("%s: ", token);
    >         strcpy(buftmp, token);    /* strtok modifies buffer, so we save a
    > copy */
    >         for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
    >              subtoken = strtok(NULL, "-")) {
    >             printf("%s ", buf,subtoken);
    >         }
    >         putchar('\n');
    >     }
    >
    >     return 0;
    >
    > }
    >
    > For example, buf="1-3,5,8", and I'd expect to have such output:
    > 1-3: 1 3
    > 5: 5
    > 8: 8
    >
    > Where is my mistake?
    > Thanks!
    >
    > --
    > Mark


    I have been through this so many times: hacking up a little parser
    with strtok() and sscanf()/atoi(), then throwing it away when the
    input language gets just a bit more sophisticated. These days I
    always go ahead and implement a traditional scanner and simple EBNF
    parser. Once you have the framework, it's very quick to adapt it to
    new problems, and it's liberating to know this extra power can be
    tapped with no code rewriting. Here's what I'm talking about:

    #include <stdio.h>
    #include <ctype.h>

    // Tokens our scanner can discover.
    typedef enum token_e {
        T_NULL,
        T_ERROR,
        T_END_OF_INPUT,
        T_INT,
        T_COMMA,
        T_DASH,
    } TOKEN;

    // Encapsulated state of an input token scanner.
    typedef struct scanner_state_s {
        char *text;   // Input to scan.
        TOKEN token;  // Last token found.
        int p0, p1;   // Last token string is text[p0..p1).
    } SCANNER_STATE;

    // Initialize a scanner's state.
    void init_scanner_state(SCANNER_STATE *ss, char *text)
    {
        ss->text = text;
        ss->token = T_NULL;
        ss->p0 = ss->p1 = 0;
    }

    // Return current character as an unsigned char value,
    // which is what the ctype.h predicates require.
    static int current_char(SCANNER_STATE *ss)
    {
        return (unsigned char)ss->text[ss->p1];
    }

    // Advance the scanner to the next character.
    static void advance(SCANNER_STATE *ss)
    {
        if (current_char(ss) != '\0')
            ++ss->p1;
    }

    // Return the current token.
    TOKEN current_token(SCANNER_STATE *ss)
    {
        return ss->token;
    }

    // Return the integer value of an INT token.
    int get_int_value(SCANNER_STATE *ss, int *value) {
        if (ss->token == T_INT) {
            sscanf(&ss->text[ss->p0], "%d", value);
            return 0;
        }
        return 1;
    }

    // Mark the beginning of a token.
    static void start_token(SCANNER_STATE *ss, TOKEN token)
    {
        ss->p0 = ss->p1;
        ss->token = token;
    }

    // Action on discovering the end of a token.
    static void end_token(SCANNER_STATE *ss)
    {
        // Do nothing in this scanner.
        (void)ss;
    }

    // Scan a token without advancing the input.
    static void scan_zero_char_token(SCANNER_STATE *ss, TOKEN token)
    {
        start_token(ss, token);
        end_token(ss);
    }

    // Scan a single character token from the input.
    static void scan_one_char_token(SCANNER_STATE *ss, TOKEN token)
    {
        start_token(ss, token);
        advance(ss);
        end_token(ss);
    }

    // Scan the next token from the input.
    void scan(SCANNER_STATE *ss)
    {
        // Skip whitespace.
        while (isspace(current_char(ss))) advance(ss);

        // Use a switch() here if speed is necessary.
        // The if's let us use ctype.h predicates.
        if (isdigit(current_char(ss))) {
            start_token(ss, T_INT);
            do {
                advance(ss);
            } while (isdigit(current_char(ss)));
            end_token(ss);
        }
        else if (current_char(ss) == ',')
            scan_one_char_token(ss, T_COMMA);
        else if (current_char(ss) == '-')
            scan_one_char_token(ss, T_DASH);
        else if (current_char(ss) == '\0')
            scan_zero_char_token(ss, T_END_OF_INPUT);
        else
            scan_zero_char_token(ss, T_ERROR);
    }

    // Match a given token and scan past it to the next
    // or else raise a syntax error if it's not there.
    // It's usually best to longjmp out of the parser on error.
    void match(SCANNER_STATE *ss, TOKEN token)
    {
        if (current_token(ss) == token)
            scan(ss);
        else {
            fprintf(stderr, "syntax error (%d) at end of '%.*s'\n",
                    ss->token, ss->p1 + 1, ss->text);
            ss->token = T_ERROR;
        }
    }

    // Parse the EBNF form: <range> ::= INT [ '-' INT ]
    static void range(SCANNER_STATE *ss)
    {
        // Initialized so the action code below never reads an
        // indeterminate value after a syntax error.
        int lo = 0, hi = 0;

        get_int_value(ss, &lo);
        match(ss, T_INT);

        if (current_token(ss) == T_DASH) {
            scan(ss);
            get_int_value(ss, &hi);
            match(ss, T_INT);
        }
        else
            hi = lo;

        // Action code.
        printf(lo == hi ? "%d\n" : "[%d-%d]\n", lo, hi);
    }

    // Parse the EBNF form:
    // <line> ::= [ <range> { ',' <range> } ] END_OF_INPUT
    void parse_line(char *text)
    {
        SCANNER_STATE ss[1];

        init_scanner_state(ss, text);
        scan(ss); // scan the initial token

        if (current_token(ss) == T_INT) {

            range(ss);

            while (current_token(ss) == T_COMMA) {
                scan(ss);
                range(ss);
            }
        }
        match(ss, T_END_OF_INPUT);
    }

    // Simple test.
    int main(int argc, char *argv[])
    {
        if (argc == 2)
            parse_line(argv[1]);
        return 0;
    }
     
    Gene, Aug 4, 2010
    #14
