strtok behavior with multiple consecutive delimiters

Discussion in 'C Programming' started by Geometer, May 6, 2006.

  1. Geometer

    Geometer Guest

    Hello, and good whatever daytime is at your place..


    please can somebody tell me, what the standard behavior of strtok shall be,
    if it encounters two or more consecutive delimiters like in
    (checks omitted)

    char tst[] = "this\nis\n\nan\nempty\n\n\nline";
    ^^^^ ^^^^^^
    char *tok = strtok(tst, "\n");
    tok = strtok(NULL, "\n");
    and so on..

    will the groups of '\n' marked above be consumed one by one or the whole
    group together?

    Thank you very much
     
    Geometer, May 6, 2006
    #1

  2. Geometer

    Phlip Guest

    Geometer wrote:

    > please can somebody tell me, what the standard behavior of strtok shall
    > be, if it encounters two or more consecutive delimiters like in
    > (checks omitted)
    >
    > char tst[] = "this\nis\n\nan\nempty\n\n\nline";
    > ^^^^ ^^^^^^
    > char *tok = strtok(tst, "\n");
    > tok = strtok(NULL, "\n");
    > and so on..
    >
    > will the groups of '\n' marked above be consumed one by one or the whole
    > group together?


    Yes.

    But why didn't you just write a test case and see?

    Going forward, don't use strtok(). Google for a replacement, possibly
    including a Regex system. Then you can control such details.
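    For reference, a test case along those lines could look like the
    sketch below. The helper name count_tokens is made up for this
    illustration. As far as the C standard's description of strtok goes,
    each call first skips every character that is in the separator string
    before it looks for the start of a token, so a run of delimiters acts
    as one separator. C++ takes strtok over from the C library, so the
    same should hold there.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: tokenizes buf in place with strtok(), prints each
   token, and returns how many tokens were found. buf is modified. */
static int count_tokens(char *buf, const char *delims)
{
    int n = 0;

    for (char *tok = strtok(buf, delims); tok != NULL;
         tok = strtok(NULL, delims)) {
        printf("token %d: \"%s\"\n", n + 1, tok);
        n++;
    }
    return n;
}
```

    Called on a writable copy of the string from the question, this finds
    five tokens (this, is, an, empty, line). The doubled and tripled
    newlines do not produce empty tokens.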

    --
    Phlip
    http://c2.com/cgi/wiki?ZeekLand <-- NOT a blog!!!
     
    Phlip, May 6, 2006
    #2

  3. Geometer

    Geometer Guest

    --
    Geometer
    Dipl.Ing. Erwin Lebloch

    Hauptplatz 39
    2130 Mistelbach - NÖ
    Tel.: 02572/4300

    www.lebloch.at

    "Phlip" <> wrote in message
    news:%d27g.27531$...
    > Geometer wrote:
    >
    >> please can somebody tell me, what the standard behavior of strtok shall
    >> be, if it encounters two or more consecutive delimiters like in
    >> (checks omitted)
    >>
    >> char tst[] = "this\nis\n\nan\nempty\n\n\nline";
    >> ^^^^ ^^^^^^
    >> char *tok = strtok(tst, "\n");
    >> tok = strtok(NULL, "\n");
    >> and so on..
    >>
    >> will the groups of '\n' marked above be consumed one by one or the whole
    >> group together?

    >
    > Yes.
    >
    > But why didn't you just write a test case and see?


    I did :). I just wanted to know if this is the behavior required by the
    standard and whether there is a difference between C and C++.
    Thanks for your response.

    Robert
     
    Geometer, May 6, 2006
    #3
  4. Geometer wrote:
    > Hello, and good whatever daytime is at your place..
    >
    >
    > please can somebody tell me, what the standard behavior of strtok shall be,
    > if it encounters two or more consecutive delimiters like in
    > (checks omitted)
    >
    > char tst[] = "this\nis\n\nan\nempty\n\n\nline";
    > ^^^^ ^^^^^^
    > char *tok = strtok(tst, "\n");
    > tok = strtok(NULL, "\n");
    > and so on..
    >
    > will the groups of '\n' marked above be consumed one by one or the whole
    > group together?
    >
    > Thank you very much
    >
    >


    <quote src="A man-page for strok.">

    Never use these functions. If you do, note that:
    These functions modify their first argument.
    These functions cannot be used on constant strings.
    The identity of the delimiting character is lost.
    The strtok() function uses a static buffer while parsing,
    so it’s not thread safe.

    </quote>

    Regards,

    Peter Jansson
    http://www.p-jansson.com/
    http://www.jansson.net/
     
    Peter Jansson, May 6, 2006
    #4
  5. Geometer

    CBFalconer Guest

    Peter Jansson wrote:
    > Geometer wrote:
    >>
    >> please can somebody tell me, what the standard behavior of
    >> strtok shall be, if it encounters two or more consecutive
    >> delimiters like in (checks omitted)
    >>
    >> char tst[] = "this\nis\n\nan\nempty\n\n\nline";
    >> ^^^^ ^^^^^^
    >> char *tok = strtok(tst, "\n");
    >> tok = strtok(NULL, "\n");
    >> and so on..
    >>
    >> will the groups of '\n' marked above be consumed one by one or
    >> the whole group together?

    >
    > <quote src="A man-page for strok.">
    >
    > Never use these functions. If you do, note that:
    > These functions modify their first argument.
    > These functions cannot be used on constant strings.
    > The identity of the delimiting character is lost.
    > The strtok() function uses a static buffer while parsing,
    > so it’s not thread safe.
    >
    > </quote>


    The OP can simply use the following replacement function, which
    does not have those objectionable features. The testing code is
    longer than the function.

    /* ------- file toksplit.c ----------*/
    #include "toksplit.h"

    /* copy over the next token from an input string, after
    skipping leading blanks (or other whitespace?). The
    token is terminated by the first appearance of tokchar,
    or by the end of the source string.

    The caller must supply sufficient space in token to
    receive any token; otherwise tokens will be truncated.

    Returns: a pointer past the terminating tokchar.

    This will happily return an infinity of empty tokens if
    called with src pointing to the end of a string. Tokens
    will never include a copy of tokchar.

    A better name would be "strtkn", except that is reserved
    for the system namespace. Change to that at your risk.

    released to Public Domain, by C.B. Falconer.
    Published 2006-02-20. Attribution appreciated.
    */

    const char *toksplit(const char *src, /* Source of tokens */
                         char tokchar,    /* token delimiting char */
                         char *token,     /* receiver of parsed token */
                         size_t lgh)      /* length token can receive */
                                          /* not including final '\0' */
    {
        if (src) {
            while (' ' == *src) src++;

            while (*src && (tokchar != *src)) {
                if (lgh) {
                    *token++ = *src;
                    --lgh;
                }
                src++;
            }
            if (*src && (tokchar == *src)) src++;
        }
        *token = '\0';
        return src;
    } /* toksplit */

    #ifdef TESTING
    #include <stdio.h>

    #define ABRsize 6 /* length of acceptable token abbreviations */

    int main(void)
    {
        char teststring[] = "This is a test, ,, abbrev, more";

        const char *t, *s = teststring;
        int i;
        char token[ABRsize + 1];

        puts(teststring);
        t = s;
        for (i = 0; i < 4; i++) {
            t = toksplit(t, ',', token, ABRsize);
            putchar(i + '1'); putchar(':');
            puts(token);
        }

        puts("\nHow to detect 'no more tokens'");
        t = s; i = 0;
        while (*t) {
            t = toksplit(t, ',', token, 3);
            putchar(i + '1'); putchar(':');
            puts(token);
            i++;
        }

        puts("\nUsing blanks as token delimiters");
        t = s; i = 0;
        while (*t) {
            t = toksplit(t, ' ', token, ABRsize);
            putchar(i + '1'); putchar(':');
            puts(token);
            i++;
        }
        return 0;
    } /* main */

    #endif
    /* ------- end file toksplit.c ----------*/

    I have set follow-ups to exclude c.l.c++. Although the above code
    is usable there, it is seldom a good idea to mix the two
    languages. I have not provided a header file with a C++ linkage
    provision.
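    To tie this back to the original question, here is a small sketch that
    runs toksplit over the newline-separated string from the first post.
    The function body is the one posted above, condensed; the helper
    split_lines and its 64-character token buffer are assumptions made for
    this illustration only.

```c
#include <stddef.h>
#include <stdio.h>

/* toksplit as posted above (C.B. Falconer, public domain), condensed. */
const char *toksplit(const char *src, char tokchar, char *token, size_t lgh)
{
    if (src) {
        while (' ' == *src) src++;             /* skip leading blanks */
        while (*src && (tokchar != *src)) {
            if (lgh) {                         /* copy while room remains */
                *token++ = *src;
                --lgh;
            }
            src++;
        }
        if (*src && (tokchar == *src)) src++;  /* step past one delimiter */
    }
    *token = '\0';
    return src;
}

/* Hypothetical helper: prints every field of s, stores the number of
   empty fields in *empty, and returns the total number of fields. */
static int split_lines(const char *s, int *empty)
{
    char token[65];
    int n = 0;

    *empty = 0;
    while (*s) {
        s = toksplit(s, '\n', token, sizeof token - 1);
        printf("%d: \"%s\"\n", ++n, token);
        if (token[0] == '\0')
            ++*empty;
    }
    return n;
}
```

    Unlike strtok, this reports the empty lines: the string from the
    question comes out as eight fields, three of them empty.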

    --
    "If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers." - Keith Thompson
    More details at: <http://cfaj.freeshell.org/google/>
    Also see <http://www.safalra.com/special/googlegroupsreply/>
     
    CBFalconer, May 6, 2006
    #5
  6. Geometer

    Jerry Coffin Guest

    In article <1n27g.55903$>,
    says...

    [ ... ]

    > The strtok() function uses a static buffer while parsing,
    > so it’s not thread safe.


    More accurately, it uses a static pointer while parsing,
    so the vendor has to go to extra work to make it thread
    safe. The same is true with a number of other functions
    as well, though -- much of what's defined in time.h, to
    give only one obvious example.
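    To make the "static pointer" point concrete, here is a rough sketch of
    how a strtok-like function can be written. It is not any vendor's
    actual implementation, and the name sketch_strtok is invented for this
    post. The only state kept between calls is one pointer.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only: a strtok work-alike built on strspn() and
   strcspn(). The static pointer below is the state shared by all callers. */
char *sketch_strtok(char *s, const char *delims)
{
    static char *saved;          /* position remembered between calls */

    if (s == NULL)
        s = saved;               /* continue the previous scan */
    if (s == NULL)
        return NULL;             /* nothing left to scan */

    s += strspn(s, delims);      /* skip a whole run of delimiters */
    if (*s == '\0') {
        saved = NULL;
        return NULL;             /* only delimiters remained */
    }

    char *end = s + strcspn(s, delims);   /* first delimiter after token */
    if (*end != '\0') {
        *end = '\0';             /* overwrite the delimiter */
        saved = end + 1;
    } else {
        saved = NULL;            /* token ran to the end of the string */
    }
    return s;
}
```

    Because saved is an ordinary static object, two threads (or two
    interleaved tokenizing loops) would overwrite each other's position.
    Giving it thread storage duration, for example with C11 _Thread_local,
    is one way a library could avoid the first of those problems.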

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
     
    Jerry Coffin, May 6, 2006
    #6
  7. Geometer

    Ben Pfaff Guest

    "Geometer" <> writes:

    > please can somebody tell me, what the standard behavior of strtok shall be,
    > if it encounters two or more consecutive delimiters like in


    strtok() has at least these problems:

    * It merges adjacent delimiters. If you use a comma as your
    delimiter, then "a,,b,c" will be divided into three tokens,
    not four. This is often the wrong thing to do. In fact, it
    is only the right thing to do, in my experience, when the
    delimiter set contains white space (for dividing a string
    into "words") or it is known in advance that there will be
    no adjacent delimiters.

    * The identity of the delimiter is lost, because it is
    changed to a null terminator.

    * It modifies the string that it tokenizes. This is bad
    because it forces you to make a copy of the string if
    you want to use it later. It also means that you can't
    tokenize a string literal with it; this is not
    necessarily something you'd want to do all the time but
    it is surprising.

    * It can only be used once at a time. If a sequence of
    strtok() calls is ongoing and another one is started,
    the state of the first one is lost. This isn't a
    problem for small programs but it is easy to lose track
    of such things in hierarchies of nested functions in
    large programs. In other words, strtok() breaks
    encapsulation.
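    When adjacent delimiters must produce empty fields, one possible
    approach is to walk the string with strcspn() instead. The sketch
    below is only an illustration; the helper name split_keep_empty is
    made up. It leaves the input untouched, so it also works on a string
    literal, and it can say which delimiter ended each field.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: walks s without modifying it, prints each field
   together with the delimiter that ended it, and returns the number of
   fields. Empty fields between adjacent delimiters are kept. */
static int split_keep_empty(const char *s, const char *delims)
{
    int n = 0;

    for (;;) {
        size_t len = strcspn(s, delims);   /* length of the current field */
        char delim = s[len];               /* '\0' at the end of the string */

        n++;
        if (delim != '\0')
            printf("field %d: \"%.*s\" (ended by '%c')\n",
                   n, (int)len, s, delim);
        else
            printf("field %d: \"%.*s\" (end of string)\n",
                   n, (int)len, s);

        if (delim == '\0')
            break;
        s += len + 1;                      /* step past exactly one delimiter */
    }
    return n;
}
```

    With "a,,b,c" and "," as the delimiter set this yields four fields,
    the second one empty, where strtok() would give three tokens.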

    --
    "What is appropriate for the master is not appropriate for the novice.
    You must understand the Tao before transcending structure."
    --The Tao of Programming
     
    Ben Pfaff, May 6, 2006
    #7
  8. Geometer

    Guest

    CBFalconer wrote:

    > The OP can simply use the following replacement function, which
    > does not have those objectionable features. The testing code is
    > longer than the function.


    OTOH By using C++ life becomes more productive, less error prone,
    less complicated and more elegant:

    #include <sstream>
    #include <string>
    #include <vector>
    #include <iostream>

    int main()
    {

        char tst[] = "this\nis\n\nan\nempty\n\n\nline";

        std::stringstream s;
        s << tst;

        std::vector<std::string> tokens;
        while (! s.eof() ){
            std::string str;
            getline(s,str,'\n');
            tokens.push_back(str);
        }

        for (std::vector<std::string>::const_iterator iter
                 = tokens.begin();
             iter !=tokens.end();
             ++iter){
            std::cout << "token: \""<< *iter <<"\"\n";
        }

    }

    regards
    Andy Little
     
    , May 6, 2006
    #8
  9. Geometer

    Pete Becker Guest

    Peter Jansson wrote:
    >
    > <quote src="A man-page for strok.">


    The name of the function is strtok.

    >
    > Never use these functions. If you do, note that:
    > These functions modify their first argument.
    > These functions cannot be used on constant strings.


    These two say the same thing. Sounds like someone is trying too hard.

    > The identity of the delimiting character is lost.


    Which has nothing to do with the claim that you should never use it. You
    shouldn't use it if you need to know which of the delimiters was
    actually encountered.

    > The strtok() function uses a static buffer while parsing,


    No, it uses a static variable to hold its result BETWEEN calls.

    > so it’s not thread safe.
    >


    Non sequitur. It's easy enough to implement with a per-thread static
    pointer, which is thread safe.

    Yup, definitely trying too hard. strtok is well suited for what it does.
    If you need something more elaborate, go for it.

    --

    Pete Becker
    Roundhouse Consulting, Ltd.
     
    Pete Becker, May 6, 2006
    #9
  10. Geometer

    jacob navia Guest

    You forgot toksplit.h Chuck

    Can you post it too?

    jacob
     
    jacob navia, May 6, 2006
    #10
  11. Geometer

    jacob navia Guest

    wrote:
    > CBFalconer wrote:
    >
    >
    >>The OP can simply use the following replacement function, which
    >>does not have those objectionable features. The testing code is
    >>longer than the function.

    >
    >
    > OTOH By using C++ life becomes more productive, less error prone,
    > less complicated and more elegant:
    >
    > #include <sstream>
    > #include <string>
    > #include <vector>
    > #include <iostream>
    >
    > int main()
    > {
    >
    > char tst[] = "this\nis\n\nan\nempty\n\n\nline";
    >
    > std::stringstream s;
    > s << tst;
    >
    > std::vector<std::string> tokens;
    > while (! s.eof() ){
    > std::string str;
    > getline(s,str,'\n');
    > tokens.push_back(str);
    > }
    >
    > for (std::vector<std::string>::const_iterator iter
    > = tokens.begin();
    > iter !=tokens.end();
    > ++iter){
    > std::cout << "token: \""<< *iter <<"\"\n";
    > }
    >
    > }
    >
    > regards
    > Andy Little
    >


    I compiled your program in C++ using the VS 2005 compiler. The
    executable size of that stuff was 180 224 bytes.

    Then I compiled Chuck's version using his strtok function using the
    lcc-win32 compiler (a C compiler, not a C++ one). The size was 14 645 bytes.

    Then I eliminated output from both programs. Compiled them without any
    optimizations and inserted a loop of 1 million times.

    C++ took 1.234 seconds
    C took 0.375 seconds

    Then I compiled both programs using VS 2005 (64 bits) with full
    optimization:

    C++ took 0.234 seconds
    C took 0.156 seconds

    I do not say that these measurements are important for everybody. But
    maybe they are important for *some* people.

    jacob
     
    jacob navia, May 6, 2006
    #11
  12. Geometer

    Guest

    jacob navia wrote:

    > I compiled your program in C++ using the VS 2005 compiler. The
    > executable size of that stuff was 180 224 bytes.


    It comes out at around 112 K for me. What were your command line
    options?

    > Then I compiled Chuck's version using his strtok function using the
    > lcc-win32 compiler (a C compiler, not a C++ one). The size was 14 645 bytes.
    >
    > Then I eliminated output from both programs. Compiled them without any
    > optimizations and inserted a loop of 1 million times.
    >
    > C++ took 1.234 seconds
    > C took 0.375 seconds
    >
    > Then I compiled both programs using VS 2005 (64 bits) with full
    > optimization:
    >
    > C++ took 0.234 seconds
    > C took 0.156 seconds


    (It would be nice to see the full source code that you were testing
    FWIW). C++ version did rather better than I would expect, good
    optimiser! ...;-)

    > I do not say that these measurements are important for everybody. But
    > maybe they are important for *some* people.


    Sure, C++ will handle the C-style code as well if necessary, but the
    amount of time you need to spend writing, testing and debugging is a
    major factor to some people too.

    And of course ... In what real situation are you going to be spending a
    long time tokenising string literals?

    regards
    Andy Little
     
    , May 6, 2006
    #12
  13. Geometer

    jacob navia Guest

    Command line for non optimized version:
    cl /EHsc toksplit.cpp
    lc toksplit.c

    Command line for the optimized version:
    cl /Ox /EHsc toksplit.cpp
    cl /OX toksplit.c


    Here is the code
    --------------------------------------------------toksplit.h
    #ifndef H_toksplit_h
    # define H_toksplit_h

    # ifdef __cplusplus
    extern "C" {
    # endif

    #include <stddef.h>

    /* copy over the next token from an input string, after
    skipping leading blanks (or other whitespace?). The
    token is terminated by the first appearance of tokchar,
    or by the end of the source string.

    The caller must supply sufficient space in token to
    receive any token; otherwise tokens will be truncated.

    Returns: a pointer past the terminating tokchar.

    This will happily return an infinity of empty tokens if
    called with src pointing to the end of a string. Tokens
    will never include a copy of tokchar.

    released to Public Domain, by C.B. Falconer.
    Published 2006-02-20. Attribution appreciated.
    */

    const char *toksplit(const char *src, /* Source of tokens */
                         char tokchar,    /* token delimiting char */
                         char *token,     /* receiver of parsed token */
                         size_t lgh);     /* length token can receive */
                                          /* not including final '\0' */

    # ifdef __cplusplus
    }
    # endif
    #endif
    --------------------------------------------end of toksplit.h
    Now toksplit.c
    /* ------- file toksplit.c ----------*/
    #include "toksplit.h"

    /* copy over the next token from an input string, after
    skipping leading blanks (or other whitespace?). The
    token is terminated by the first appearance of tokchar,
    or by the end of the source string.

    The caller must supply sufficient space in token to
    receive any token; otherwise tokens will be truncated.

    Returns: a pointer past the terminating tokchar.

    This will happily return an infinity of empty tokens if
    called with src pointing to the end of a string. Tokens
    will never include a copy of tokchar.

    A better name would be "strtkn", except that is reserved
    for the system namespace. Change to that at your risk.

    released to Public Domain, by C.B. Falconer.
    Published 2006-02-20. Attribution appreciated.
    */

    const char *toksplit(const char *src, /* Source of tokens */
                         char tokchar,    /* token delimiting char */
                         char *token,     /* receiver of parsed token */
                         size_t lgh)      /* length token can receive */
                                          /* not including final '\0' */
    {
        if (src) {
            while (' ' == *src) src++;

            while (*src && (tokchar != *src)) {
                if (lgh) {
                    *token++ = *src;
                    --lgh;
                }
                src++;
            }
            if (*src && (tokchar == *src)) src++;
        }
        *token = '\0';
        return src;
    } /* toksplit */

    #include <stdio.h>

    #define ABRsize 64 /* length of acceptable token abbreviations */

    int main(void)
    {
        char teststring[] = "this\nis\n\nan\nempty\n\n\nline";

        const char *t, *s = teststring;
        int i;
        char token[ABRsize + 1];
        int count;

        count=0;
        do {
            t = s; i = 0;
            while (*t) {
                t = toksplit(t, '\n', token, 64);
                //putchar(i + '1'); putchar(':');
                //puts(token);
                i++;
            }
            count++;
        } while (count < 1000000);
        return 0;
    } /* main */

    --------------------------------------------------------------toksplit.c

    Now the C++ version:
    --------------------------------------------------------------toksplit.cpp
    #include <sstream>
    #include <string>
    #include <vector>
    #include <iostream>

    int main()
    {

        char tst[] = "this\nis\n\nan\nempty\n\n\nline";

        std::stringstream s;
        s << tst;

        std::vector<std::string> tokens;
        int count=0;
        do {
            s << tst;
            while (! s.eof() ){
                std::string str;
                getline(s,str,'\n');
                tokens.push_back(str);
            }

            for (std::vector<std::string>::const_iterator iter
                     = tokens.begin();
                 iter !=tokens.end();
                 ++iter){
                //std::cout << "token: \""<< *iter <<"\"\n";
            }
            count++;
        } while (count < 1000000);

    }
    --------------------------------------------------------------end of
    toksplit.cpp
     
    jacob navia, May 6, 2006
    #13
  14. Geometer

    Ben C Guest

    On 2006-05-06, jacob navia <> wrote:
    > [...]
    > I compiled your program in C++ using the VS 2005 compiler. The
    > executable size of that stuff was 180 224 bytes.
    >
    > Then I compiled Chuck's version using his strtok function using the
    > lcc-win32 compiler (a C compiler, not a C++ one). The size was 14 645 bytes.
    >
    > Then I eliminated output from both programs. Compiled them without any
    > optimizations and inserted a loop of 1 million times.
    >
    > C++ took 1.234 seconds
    > C took 0.375 seconds
    >
    > Then I compiled both programs using VS 2005 (64 bits) with full
    > optimization:
    >
    > C++ took 0.234 seconds
    > C took 0.156 seconds
    >
    > I do not say that this measurements are important for everybody. But
    > maybe they are important for *some* people.


    Interesting. Can you do a timing of VS 2005 with full optimizations on
    the C version? I think this would complete the picture.
     
    Ben C, May 6, 2006
    #14
  15. Geometer

    Ian Collins Guest

    jacob navia wrote:
    >
    > Now the C++ version:
    > --------------------------------------------------------------toksplit.cpp
    > #include <sstream>
    > #include <string>
    > #include <vector>
    > #include <iostream>
    >
    > int main()
    > {
    >
    > char tst[] = "this\nis\n\nan\nempty\n\n\nline";
    >
    > std::stringstream s;
    > s << tst;
    >
    > std::vector<std::string> tokens;
    > int count=0;
    > do {
    > s << tst;
    > while (! s.eof() ){
    > std::string str;
    > getline(s,str,'\n');
    > tokens.push_back(str);
    > }
    >
    > for (std::vector<std::string>::const_iterator iter
    > = tokens.begin();
    > iter !=tokens.end();
    > ++iter){
    > //std::cout << "token: \""<< *iter <<"\"\n";
    > }
    > count++;
    > } while (count < 1000000);
    >
    > }
    > --------------------------------------------------------------end of
    > toksplit.cpp
    >

    If you are going to eliminate output for comparison, you should comment
    out the entire last for loop as the C version outputs inline.

    Also, to make things more equal, remove the vector, as this is only used
    to store tokens for output.

    --
    Ian Collins.
     
    Ian Collins, May 6, 2006
    #15
  16. In comp.lang.c wrote:

    > OTOH By using C++ life becomes more productive, less error prone,
    > less complicated and more elegant:


    Not always... (digression warning)

    > std::cout << "token: \""<< *iter <<"\"\n";


    IMHO this is harder for the programmer to read than

    printf( "token: \"%s\"\n", str );

    To a certain extent this is a question of religion, but the difference
    between the prevailing styles becomes more pronounced with heavily
    formatted output:

    printf( "%6s %2.2f %-18s:%u\n", val1, val2, val3, val4 );

    Accomplishing the same thing with std::cout would be messy.
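    For readers who want to see what that format string produces, here is
    a minimal sketch. The helper format_row and its parameter types are
    assumptions chosen to match the conversions (two strings, a double and
    an unsigned int); the post above does not say what val1..val4 are.

```c
#include <stdio.h>

/* Hypothetical helper: formats one record into buf using the format
   string quoted above and returns what snprintf() returns, i.e. the
   number of characters the full output needs (excluding the '\0'). */
static int format_row(char *buf, size_t size, const char *val1, double val2,
                      const char *val3, unsigned val4)
{
    return snprintf(buf, size, "%6s %2.2f %-18s:%u\n",
                    val1, val2, val3, val4);
}
```

    With "abc", 3.14159, "name" and 42 as arguments, val1 is right-aligned
    in 6 columns, val2 is printed with two decimals, val3 is left-aligned
    and padded to 18 columns, and the whole line is 34 characters long.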

    --
    Christopher Benson-Manica | I *should* know what I'm talking about - if I
    ataru(at)cyberspace.org | don't, I need to know. Flames welcome.
     
    Christopher Benson-Manica, May 7, 2006
    #16
  17. In comp.lang.c Ben Pfaff <> wrote:

    > * It can only be used once at a time. If a sequence of
    > strtok() calls is ongoing and another one is started,
    > the state of the first one is lost.


    <ot>For OP, if this is a problem for you, strtok_r() may be
    available, depending on your system and portability constraints.</ot>
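    A minimal sketch of the nested case, assuming a POSIX system where
    strtok_r() is declared in <string.h>. The helper name count_words is
    made up for this example. Each loop keeps its own saved position, so
    the inner tokenizer does not disturb the outer one.

```c
#define _POSIX_C_SOURCE 200809L  /* request the POSIX declaration of strtok_r */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits text into lines and each line into words,
   prints every word, and returns the total word count. text is modified. */
static int count_words(char *text)
{
    char *line_state, *word_state;   /* one saved position per tokenizer */
    int words = 0;

    for (char *line = strtok_r(text, "\n", &line_state); line != NULL;
         line = strtok_r(NULL, "\n", &line_state)) {
        for (char *word = strtok_r(line, " ", &word_state); word != NULL;
             word = strtok_r(NULL, " ", &word_state)) {
            printf("word: \"%s\"\n", word);
            words++;
        }
    }
    return words;
}
```

    Doing the same with two nested strtok() loops would not work, because
    the inner loop would overwrite the position saved for the outer one.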

    For all the pitfalls of strtok(), it is still possible to use it
    correctly for fun and profit, a point which I think has not been
    emphasized in this thread. It may well be the appropriate function
    for the OP, but of course given his question it might also be
    unusable.

    --
    Christopher Benson-Manica | I *should* know what I'm talking about - if I
    ataru(at)cyberspace.org | don't, I need to know. Flames welcome.
     
    Christopher Benson-Manica, May 7, 2006
    #17
  18. Geometer

    Phlip Guest

    Geometer wrote:

    > I did :). I just wanted to know if this is the behavior required by the
    > standard and whether there is a difference between C and C++.


    Can we add to the FAQ "Please don't ask about strtok(), because everyone
    here is ready to complain about it in endless ways"?

    --
    Phlip
    http://c2.com/cgi/wiki?ZeekLand <-- NOT a blog!!!
     
    Phlip, May 7, 2006
    #18
  19. Geometer

    Guest

    jacob navia wrote:
    > Command line for non optimized version:
    > cl /EHsc toksplit.cpp


    (Assuming this is my original as above)
    Using these switches comes out at 120 kb on my system

    > lc toksplit.c
    >
    > Command line for the optimized version:
    > cl /Ox /EHsc toksplit.cpp


    Using these switches, comes out at 112 kb on my system

    regards
    Andy Little
     
    , May 7, 2006
    #19
  20. Geometer

    CBFalconer Guest

    jacob navia wrote:
    >

    .... snip ...
    >
    > I compiled your program in C++ using the VS 2005 compiler. The
    > executable size of that stuff was 180 224 bytes.
    >
    > Then I compiled Chuck's version using his strtok function using the
    > lcc-win32 compiler (a C compiler, not a C++ one). The size was
    > 14 645 bytes.


    I compiled toksplit without the testing code, using gcc -Os, and
    the generated object code was 0x5b bytes long. That's less than
    100 bytes of object code.

    The point is: measure the routine, not the testing program.

    --
    "If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers." - Keith Thompson
    More details at: <http://cfaj.freeshell.org/google/>
    Also see <http://www.safalra.com/special/googlegroupsreply/>
     
    CBFalconer, May 7, 2006
    #20