Splitting a string into an array of words

Discussion in 'C++' started by Simon, Jul 19, 2006.

  1. Simon

    Simon Guest

    Well, the title's pretty descriptive; how would I be able to take a
    line of input like this:

    getline(cin,mostrecentline);

    And split into an (flexible) array of strings. For example: "do this
    action"
    would go to:

    item 0: do
    item 1: this
    item 2: action

    Thanks in advance,
    Simon
     
    Simon, Jul 19, 2006
    #1

  2. Simon

    Marcus Kwok Guest

    Simon <> wrote:
    > Well, the title's pretty descriptive; how would I be able to take a
    > line of input like this:
    >
    > getline(cin,mostrecentline);
    >
    > And split into an (flexible) array of strings. For example: "do this
    > action"
    > would go to:
    >
    > item 0: do
    > item 1: this
    > item 2: action


    If you are splitting the words by whitespace, you could create a
    std::istringstream and push them into a std::vector<std::string>.

    Something like: (untested and uncompiled)

    std::istringstream line(mostrecentline);
    std::vector<std::string> words;
    std::string temp;

    while (line >> temp) {
        words.push_back(temp);
    }

    You will need to #include <sstream>, <string>, and <vector> for this
    method.
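
    For reference, the fragment above can be wrapped into a small
    self-contained function. This is only a sketch: the name
    split_whitespace is made up for illustration and does not appear in
    the original post, and it assumes the words are separated by
    whitespace, which is what operator>> skips.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (name chosen for this example only): splits
// `line` on whitespace using the istringstream approach described above.
std::vector<std::string> split_whitespace(const std::string& line)
{
    std::istringstream in(line);
    std::vector<std::string> words;
    std::string word;

    // operator>> skips leading whitespace and reads one word at a time;
    // the loop ends once the stream has no more input.
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}
```

    Calling split_whitespace("do this action") should yield a vector
    holding "do", "this" and "action". Runs of several spaces do not
    produce empty entries, because operator>> treats them as one gap.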

    --
    Marcus Kwok
    Replace 'invalid' with 'net' to reply
     
    Marcus Kwok, Jul 19, 2006
    #2

  3. Simon

    Daniel T. Guest

    In article <>,
    "Simon" <> wrote:

    > Well, the title's pretty descriptive; how would I be able to take a
    > line of input like this:
    >
    > getline(cin,mostrecentline);
    >
    > And split into an (flexible) array of strings. For example: "do this
    > action"
    > would go to:
    >
    > item 0: do
    > item 1: this
    > item 2: action


    #include <cassert>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <vector>
    // other includes as necessary

    template < typename OutIt >
    void split( const std::string& in, OutIt result )
    {
        // add code here...
    }

    int main() {
        std::string seed = "step1";
        std::vector<std::string> result;
        split( seed, std::back_inserter( result ) );
        assert( result.size() == 1 );
        assert( result[0] == "step1" );
        std::cout << "You did it! Good job!\n";
    }

    Run the above program. Make changes to the part labeled "add code here"
    until the program compiles and prints out "You did it! Good job!".

    When it does, post back here with the code and I'll help you with the
    next step.
     
    Daniel T., Jul 19, 2006
    #3
  4. Simon

    Rolf Magnus Guest

    Simon wrote:

    > Well, the title's pretty descriptive; how would I be able to take a
    > line of input like this:
    >
    > getline(cin,mostrecentline);
    >
    > And split into an (flexible) array of strings.


    What do you mean by "flexible", and which separators do you want to use?

    > For example: "do this
    > action"
    > would go to:
    >
    > item 0: do
    > item 1: this
    > item 2: action


    In this case, I'd use a stringstream and operator>>.
     
    Rolf Magnus, Jul 19, 2006
    #4
  5. Simon

    Mark P Guest

    Simon wrote:
    > Well, the title's pretty descriptive; how would I be able to take a
    > line of input like this:
    >
    > getline(cin,mostrecentline);
    >
    > And split into an (flexible) array of strings. For example: "do this
    > action"
    > would go to:
    >
    > item 0: do
    > item 1: this
    > item 2: action
    >
    > Thanks in advance,
    > Simon
    >


    Here's a little tokenizer fcn I've used before. Not necessarily the
    most elegant or compact way to do this (and criticisms are welcomed):

    // Populates "out" with delimited substrings of "in".
    int tokenize (const string& in, vector<string>& out, const char* delims)
    {
        string::size_type wordStart = 0; // current word start position
        string::size_type wordEnd = 0; // last word end position

        while (true)
        {
            wordStart = in.find_first_not_of(delims,wordEnd);
            if (wordStart == in.npos)
                break;
            wordEnd = in.find_first_of(delims,wordStart);
            if (wordEnd == in.npos)
                wordEnd = in.size();
            out.push_back(in.substr(wordStart,wordEnd - wordStart));
        }
        return out.size();
    }

    Mark
     
    Mark P, Jul 19, 2006
    #5
  6. Simon

    Daniel T. Guest

    In article <Jfwvg.10246$>,
    Mark P <> wrote:

    > Here's a little tokenizer fcn I've used before. Not necessarily the
    > most elegant or compact way to do this (and criticisms are welcomed):


    Well since criticisms are welcomed... :)

    > // Populates "out" with delimited substrings of "in".
    > int tokenize (const string& in, vector<string>& out, const char* delims)
    > {
    > string::size_type wordStart = 0; // current word start position
    > string::size_type wordEnd = 0; // last word end position
    >
    > while (true)
    > {
    > wordStart = in.find_first_not_of(delims,wordEnd);
    > if (wordStart == in.npos)
    > break;
    > wordEnd = in.find_first_of(delims,wordStart);
    > if (wordEnd == in.npos)
    > wordEnd = in.size();
    > out.push_back(in.substr(wordStart,wordEnd - wordStart));
    > }
    > return out.size();
    > }


    From least important to most important:

    1) The while true and break is not a style I prefer.

    2) Returning out.size() isn't very useful since the caller can find out
    what out.size() equals without the function's help.

    3) It only works for vectors, I'd write something that works for deques
    and lists as well.

    4) A cyclomatic complexity of 4 seems a tad excessive for what is
    supposed to be such a simple job. You can drop that to 3 by removing
    the unnecessary "if (wordEnd == in.npos)" logic. Heeding item (1)
    above can reduce the complexity to 2.

    Here's how I would write it:

    template <typename OutIt>
    void tokenize( const string& str, OutIt os, const string& delims = " ")
    {
        string::size_type start = str.find_first_not_of( delims );
        while ( start != string::npos ) {
            string::size_type end = str.find_first_of( delims, start );
            *os++ = str.substr( start, end - start );
            start = str.find_first_not_of( delims, end );
        }
    }
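
    To illustrate item (3), here is a sketch of the same function filling
    containers other than vector. The tokenize body follows the version
    above with std:: spelled out; the wrapper tokenize_to is a
    hypothetical convenience added for this example and assumes a
    container that supports push_back.

```cpp
#include <deque>
#include <iterator>
#include <list>
#include <string>
#include <vector>

// Writes each delimited token of `str` to the output iterator `os`.
template <typename OutIt>
void tokenize(const std::string& str, OutIt os,
              const std::string& delims = " ")
{
    std::string::size_type start = str.find_first_not_of(delims);
    while (start != std::string::npos) {
        std::string::size_type end = str.find_first_of(delims, start);
        *os++ = str.substr(start, end - start);
        start = str.find_first_not_of(delims, end);
    }
}

// Hypothetical wrapper (not from the thread): returns the tokens in any
// container type that std::back_inserter can append to.
template <typename Container>
Container tokenize_to(const std::string& str,
                      const std::string& delims = " ")
{
    Container out;
    tokenize(str, std::back_inserter(out), delims);
    return out;
}
```

    For example, tokenize_to<std::list<std::string> >("a,b;;c", ",;")
    should return a list of three strings, and the same call with
    std::deque or std::vector works without touching tokenize itself.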
     
    Daniel T., Jul 20, 2006
    #6
  7. Simon

    Mark P Guest

    Daniel T. wrote:
    > In article <Jfwvg.10246$>,
    > Mark P <> wrote:
    >
    >> Here's a little tokenizer fcn I've used before. Not necessarily the
    >> most elegant or compact way to do this (and criticisms are welcomed):

    >
    > Well since criticisms are welcomed... :)
    >
    >> // Populates "out" with delimited substrings of "in".
    >> int tokenize (const string& in, vector<string>& out, const char* delims)
    >> {
    >> string::size_type wordStart = 0; // current word start position
    >> string::size_type wordEnd = 0; // last word end position
    >>
    >> while (true)
    >> {
    >> wordStart = in.find_first_not_of(delims,wordEnd);
    >> if (wordStart == in.npos)
    >> break;
    >> wordEnd = in.find_first_of(delims,wordStart);
    >> if (wordEnd == in.npos)
    >> wordEnd = in.size();
    >> out.push_back(in.substr(wordStart,wordEnd - wordStart));
    >> }
    >> return out.size();
    >> }

    >
    > From least important to most important:
    >
    > 1) The while true and break is not a style I prefer.


    Fair enough-- I'm not a fan either, but see my comment to item 4.

    >
    > 2) Returning out.size() isn't very useful since the caller can find out
    > what out.size() equals without the function's help.


    True. In my case, I pulled this function out of some actual code where
    the return value is sometimes used as a check. E.g., when parsing a
    particular file format, I expect a certain number of tokens per line.
    It saves the calling function a line of code by having the size of out
    returned automatically (and of course this fcn is called in multiple
    places).

    >
    > 3) It only works for vectors, I'd write something that works for deques
    > and lists as well.


    Agreed, I very much prefer your templated approach that takes any Output
    Iterator. In my case, using a known type allowed me to return the
    container size (cf. item 2), but this is just my own particular
    situation and at times excessive code parsimony.

    >
    > 4) A cyclomatic complexity of 4 seems a tad excessive for what is
    > supposed to be such a simple job. You can drop that to 3 by removing
    > the unnecessary "if (wordEnd == in.npos)" logic. Heeding item (1)
    > above can reduce the complexity to 2.
    >
    > Here's how I would write it:
    >
    > template <typename OutIt>
    > void tokenize( const string& str, OutIt os, const string& delims = " ")
    > {
    > string::size_type start = str.find_first_not_of( delims );
    > while ( start != string::npos ) {
    > string::size_type end = str.find_first_of( delims, start );
    > *os++ = str.substr( start, end - start );
    > start = str.find_first_not_of( delims, end );
    > }
    > }


    Looks good. In my case it was a bit more complicated because I also
    have an additional parameter for a comment character. When a comment
    character is encountered at the beginning of a token, that token is
    discarded and the loop breaks. (So in my original implementation there
    were multiple breakpoints out of the loop, although I hastily trimmed
    these before I posted my code, thereby leaving some unattractive vestiges.)

    In any event, I appreciate your comments and don't mean to simply make
    excuses and argue all of your points. The only significant hitch to my
    adopting your cleaner implementation is that I really do need support
    for the comment character break. Luckily this is just a bit of a little
    file parser I use for testing, so I don't stress too much about these
    details, but feel free to propose a svelte implementation that supports
    a comment char. :)

    Mark
     
    Mark P, Jul 21, 2006
    #7
  8. Simon

    Daniel T. Guest

    In article <VOVvg.128148$>,
    Mark P <> wrote:

    > > template <typename OutIt>
    > > void tokenize( const string& str, OutIt os, const string& delims = " ")
    > > {
    > > string::size_type start = str.find_first_not_of( delims );
    > > while ( start != string::npos ) {
    > > string::size_type end = str.find_first_of( delims, start );
    > > *os++ = str.substr( start, end - start );
    > > start = str.find_first_not_of( delims, end );
    > > }
    > > }

    >
    > Looks good. In my case it was a bit more complicated because I also
    > have an additional parameter for a comment character. When a comment
    > character is encountered at the beginning of a token, that token is
    > discarded and the loop breaks. (So in my original implementation there
    > were multiple breakpoints out of the loop, although I hastily trimmed
    > these before I posted my code, thereby leaving some unattractive vestiges.)
    >
    > In any event, I appreciate your comments and don't mean to simply make
    > excuses and argue all of your points.


    No problem. Your code was rather good in general, I only saw a few nits
    to pick at.

    > The only significant hitch to my
    > adopting your cleaner implementation is that I really do need support
    > for the comment character break. Luckily this is just a bit of a little
    > file parser I use for testing, so I don't stress too much about these
    > details, but feel free to propose a svelte implementation that supports
    > a comment char. :)


    If I understand what you mean then:

    void tokenize( const string& str, OutIt os, const string& delims = " ",
                   char comment = '\0' )
    {
        string::size_type start = str.find_first_not_of( delims );
        while ( start != string::npos && start[0] != comment ) {
            string::size_type end = str.find_first_of( delims, start );
            *os++ = str.substr( start, end - start );
            start = str.find_first_not_of( delims, end );
        }
    }

    Of course you should probably change the defaults to whatever is most
    common in your code...
     
    Daniel T., Jul 21, 2006
    #8
  9. Simon

    Daniel T. Guest

    In article <VOVvg.128148$>,
    Mark P <> wrote:

    > >
    > > 2) Returning out.size() isn't very useful since the caller can find out
    > > what out.size() equals without the function's help.

    >
    > True. In my case, I pulled this function out of some actual code where
    > the return value is sometimes used as a check. E.g., when parsing a
    > particular file format, I expect a certain number of tokens per line.
    > It saves the calling function a line of code by having the size of out
    > returned automatically (and of course this fcn is called in multiple
    > places).


    Here you go, now it returns the size. :)

    int tokenize( const string& str, OutIt os, const string& delims = " ",
                  char comment = '\0' )
    {
        int result = 0;
        string::size_type start = str.find_first_not_of( delims );
        while ( start != string::npos && start[0] != comment ) {
            string::size_type end = str.find_first_of( delims, start );
            *os++ = str.substr( start, end - start );
            ++result;
            start = str.find_first_not_of( delims, end );
        }
        return result;
    }
     
    Daniel T., Jul 21, 2006
    #9
  10. Simon

    Alex Vinokur Guest

    Simon wrote:
    > Well, the title's pretty descriptive; how would I be able to take a
    > line of input like this:
    >
    > getline(cin,mostrecentline);
    >
    > And split into an (flexible) array of strings.

    [snip]

    See "Splitting string into vector of vectors":
    http://groups.google.com/group/sources/msg/77993fb8841382c8
    http://groups.google.com/group/perfo/msg/9d49a1be3a5c6335
    http://groups.google.com/group/perfo/msg/f3c775cf7e3cdcf0


    Alex Vinokur
    email: alex DOT vinokur AT gmail DOT com
    http://mathforum.org/library/view/10978.html
    http://sourceforge.net/users/alexvn
     
    Alex Vinokur, Jul 21, 2006
    #10
  11. Simon

    Old Wolf Guest

    Daniel T. wrote:
    >
    > int tokenize( const string& str, OutIt os, const string& delims = " ",
    > char comment = '\0' )


    You should return the size type of the output iterator,
    rather than int.

    I am suspicious of the code. Suppose str is "x".

    > {
    > int result = 0;
    > string::size_type start = str.find_first_not_of( delims );


    start == 0.

    > while ( start != string::npos && start[0] != comment ) {


    condition is true

    > string::size_type end = str.find_first_of( delims, start );


    end is string::npos

    > *os++ = str.substr( start, end - start );


    Here you subtract a value from npos. I am not sure if this is a
    legal operation (although it will happen to work on my system).

    > ++result;
    > start = str.find_first_not_of( delims, end );


    is npos a legal argument for the second parameter to find_first_not_of
    ?

    > }
    > return result;
    > }
     
    Old Wolf, Jul 22, 2006
    #11
  12. Simon

    Daniel T. Guest

    In article <>,
    "Old Wolf" <> wrote:

    > Daniel T. wrote:
    > >
    > > int tokenize( const string& str, OutIt os, const string& delims = " ",
    > > char comment = '\0' )

    >
    > You should return the size type of the output iterator,
    > rather than int.


    That would be fine too...

    I had to fix the code, start[0] of course is silly.

    template < typename OutIt >
    int tokenize( const string& str, OutIt os, const string& delims = " ",
                  char comment = '\0' )
    {
        int result = 0;
        string::size_type start = str.find_first_not_of( delims );
        while ( start != string::npos && str[start] != comment ) {
            string::size_type end = str.find_first_of( delims, start );
            *os++ = str.substr( start, end - start );
            ++result;
            start = str.find_first_not_of( delims, end );
        }
        return result;
    }
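
    A rough usage sketch for the corrected function, tying together the
    comment character and the returned token count. The tokenize body is
    the one above with std:: spelled out; parse_fields, the '#' comment
    character and the space/tab delimiters are assumptions made up for
    this example, not something taken from Mark's parser.

```cpp
#include <iterator>
#include <string>
#include <vector>

// Same logic as the corrected function above: stops at the first token
// that begins with `comment` and returns the number of tokens written.
template <typename OutIt>
int tokenize(const std::string& str, OutIt os,
             const std::string& delims = " ", char comment = '\0')
{
    int result = 0;
    std::string::size_type start = str.find_first_not_of(delims);
    while (start != std::string::npos && str[start] != comment) {
        std::string::size_type end = str.find_first_of(delims, start);
        *os++ = str.substr(start, end - start);
        ++result;
        start = str.find_first_not_of(delims, end);
    }
    return result;
}

// Hypothetical caller: fills `fields` from `line` and reports whether
// exactly `expected` tokens appeared before any '#' comment.
bool parse_fields(const std::string& line, int expected,
                  std::vector<std::string>& fields)
{
    fields.clear();
    int count = tokenize(line, std::back_inserter(fields), " \t", '#');
    return count == expected;
}
```

    With this, a line such as "alpha beta # trailing comment" should
    give two fields, while a line that starts with '#' gives none. Note
    that a '#' in the middle of a token is not treated as a comment,
    since only the first character of each token is checked.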


    > end is string::npos
    >
    > > *os++ = str.substr( start, end - start );

    >
    > Here you subtract a value from npos. I am not sure if this is a
    > legal operation (although it will happen to work on my system).


    I have several sources that say that npos is "The largest possible value
    of type size_type." Most importantly, it is *not* a flag but a defined
    value. Subtracting from a value is quite legal.


    > > ++result;
    > > start = str.find_first_not_of( delims, end );

    >
    > is npos a legal argument for the second parameter to find_first_not_of


    Or the broader question: what is the defined result of find_first_not_of
    if the second argument is greater than str.length()?

    It could be that my implementation (and yours) is doing the wrong thing.
    Stroustrup in "The C++ Programming Language" says that in general
    specifying an index >= length() should throw an exception, which would
    require me to add another conditional here:

    start = ( end == string::npos ) ?
        end : str.find_first_not_of( delims, end );

    Maybe someone can check the standard for me?
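
    For what it is worth, both edge cases appear to be well-defined, so
    the extra conditional should not be needed. As far as I can tell,
    the standard describes find_first_not_of as returning npos when no
    matching position at or after the start index exists (it does not
    throw), and substr throws only when its first argument exceeds
    size(), clamping the count otherwise. The small check below is a
    sketch of that reasoning for the "x" case; the function name is
    made up for this example.

```cpp
#include <string>

// Hypothetical check (name invented here): returns true if the two
// edge cases questioned above behave as the tokenize() loop relies on
// for a one-word input such as "x".
bool npos_edge_cases_hold()
{
    const std::string str = "x";
    const std::string delims = " ";

    // First non-delimiter is at index 0; no delimiter follows it.
    std::string::size_type start = str.find_first_not_of(delims);
    std::string::size_type end = str.find_first_of(delims, start);

    // size_type is unsigned, so npos - start is ordinary modular
    // arithmetic; substr() clamps the count to the rest of the string.
    bool substr_ok = (str.substr(start, end - start) == "x");

    // A start position at or past size() simply finds nothing.
    bool find_ok =
        (str.find_first_not_of(delims, end) == std::string::npos);

    return end == std::string::npos && substr_ok && find_ok;
}
```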
     
    Daniel T., Jul 22, 2006
    #12
  13. Simon

    Guest

    Mark P wrote:
    > Simon wrote:
    > > Well, the title's pretty descriptive; how would I be able to take a
    > > line of input like this:
    > >
    > > getline(cin,mostrecentline);
    > >
    > > And split into an (flexible) array of strings. For example: "do this
    > > action"
    > > would go to:
    > >
    > > item 0: do
    > > item 1: this
    > > item 2: action
    > >
    > > Thanks in advance,
    > > Simon
    > >

    >
    > Here's a little tokenizer fcn I've used before. Not necessarily the
    > most elegant or compact way to do this (and criticisms are welcomed):
    >
    > // Populates "out" with delimited substrings of "in".
    > int tokenize (const string& in, vector<string>& out, const char* delims)
    > {
    > string::size_type wordStart = 0; // current word start position
    > string::size_type wordEnd = 0; // last word end position
    >
    > while (true)
    > {
    > wordStart = in.find_first_not_of(delims,wordEnd);
    > if (wordStart == in.npos)
    > break;
    > wordEnd = in.find_first_of(delims,wordStart);
    > if (wordEnd == in.npos)
    > wordEnd = in.size();
    > out.push_back(in.substr(wordStart,wordEnd - wordStart));
    > }
    > return out.size();
    > }
    >
    > Mark


    Along the same lines, here is something from a while back...

    http://groups.google.com/group/comp...st&q=davidrubin split&rnum=1#79258d2ea71e3e03
     
    , Jul 23, 2006
    #13