string tokenizing

Discussion in 'C++' started by David Rubin, Oct 7, 2003.

  1. David Rubin

    David Rubin Guest

    I looked on Google for an answer, but I didn't find anything short of
    using Boost which sufficiently answers my question: what is a good way
    of doing string tokenization? (Note: I cannot use Boost.) For example, I
    have tried this:

    #include <algorithm>
    #include <cctype>
    #include <climits>
    #include <deque>
    #include <iostream>
    #include <iterator>
    #include <string>

    using namespace std;

    int
    main()
    {
        string delim;
        int c;

        /* fill delim */
        for(c=0; c < CHAR_MAX; c++){ // I tried #include <limits>, but failed...
            if((isspace(c) || ispunct(c))
               && !(c == '_' || c == '#'))
                delim += c;
        }

        string buf;
        string::size_type op, np;
        deque<string> tok;

        while(std::getline(cin, buf) && !cin.fail()){
            op = 0;
            while((np=buf.find_first_of(delim, op)) != buf.npos){
                tok.push_back(string(&buf[op], np-op));
                if((op=buf.find_first_not_of(delim, np)) == buf.npos)
                    break;
            }
            tok.push_back(string(&buf[op]));

            cout << buf << endl;
            copy(tok.begin(), tok.end(), ostream_iterator<string>(cout, "\n"));
            cout << endl;
            tok.clear();
        }
        return 0;
    }

    The inner loop basically finds tokens delimited by any character in
    delim where multiple delimiters may appear between tokens (algorithm
    follows some advice found on clc++). However, the method seems a little
    clumsy, especially with respect to temporary objects. (Also, it does not
    seem to work correctly. For example, the last token gets corrupted in
    the second outer loop iteration.)

    Also, it would be very nice to have a function like

    int tokenize(const string& s, container<string>& c);

    which returns the number of tokens inserted into the container.
    However, how do you write this so c is any container model? I'm not sure
    you can since they don't share a base class. Is there any better way?

    Certainly, this is easy to do with a mix of C and C++:

    for(char *t=strtok(buf, delim); t != 0; t=strtok(0, delim))
        tok.push_back(t);

    where buf and delim are essentially char*'s. However, this seems
    unsatisfactory as well.
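
    For reference, a complete version of this C-style approach might look
    like the following (strtok() needs a writable buffer, so buf here is a
    char array rather than a std::string):

    #include <cstring>
    #include <deque>
    #include <string>

    int main()
    {
        char buf[] = "one, two;  three";  // strtok() writes '\0's into this buffer
        const char *delim = " ,;";
        std::deque<std::string> tok;

        // first call takes the buffer, subsequent calls take 0
        for(char *t = std::strtok(buf, delim); t != 0; t = std::strtok(0, delim))
            tok.push_back(t);

        return 0;
    }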

    /david

    --
    Andre, a simple peasant, had only one thing on his mind as he crept
    along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
    -- unknown
    David Rubin, Oct 7, 2003
    #1

  2. Mike Wahler

    Mike Wahler Guest

    "David Rubin" <> wrote in message
    news:...
    > I looked on google for an answer, but I didn't find anything short of
    > using boost which sufficiently answers my question: what is a good way
    > of doing string tokenization (note: I cannot use boost). For example, I
    > have tried this:


    Remarks below.

    >
    > #include <algorithm>
    > #include <cctype>
    > #include <climits>
    > #include <deque>
    > #include <iostream>
    > #include <iterator>
    > #include <string>
    >
    > using namespace std;
    >
    > int
    > main()
    > {
    > string delim;
    > int c;
    >
    > /* fill delim */
    > for(c=0; c < CHAR_MAX; c++)



    Is there a particular reason you're excluding the
    value 'CHAR_MAX' from the loop? (using < instead of <=)

    >{ // I tried #include <limits>, but
    > failed...


    What happened?

    #include <limits>

    std::numeric_limits<char>::max();

    should work.

    More below.

    > if((isspace(c) || ispunct(c))
    > && !(c == '_' || c == '#'))
    > delim += c;
    > }
    >
    > string buf;
    > string::size_type op, np;
    > deque<string> tok;
    >
    > while(std::getline(cin, buf) && !cin.fail()){
    > op = 0;
    > while((np=buf.find_first_of(delim, op)) != buf.npos){
    > tok.push_back(string(&buf[op], np-op));
    > if((op=buf.find_first_not_of(delim, np)) == buf.npos)
    > break;
    > }
    > tok.push_back(string(&buf[op]));
    >
    > cout << buf << endl;
    > copy(tok.begin(), tok.end(), ostream_iterator<string>(cout,
    > "\n"));
    > cout << endl;
    > tok.clear();
    > }
    > return 0;
    > }
    >
    > The inner loop basically finds tokens delimited by any character in
    > delim where multiple delimiters may appear between tokens (algorithm
    > follows some advice found on clc++). However, the method seems a little
    > clumsy, especially with respect to temporary objects. (Also, it does not
    > seem to work correctly. For example, the last token gets corrupted in
    > the second outer loop iteration.)
    >
    > Also, it would be very nice to have a function like
    >
    > int tokenize(const string& s, container<string>& c);
    >
    > which returns the number of tokens, inserted into the container.
    > However, how do you write this so c is any container model? I'm not sure
    > you can since they don't share a base class. Is there any better way?


    I find your code interesting, so I'll probably play around
    with it for a bit, and let you know if I have any ideas.

    But here's some food for thought: one way to 'generalize'
    container access is with iterators, as do the functions in
    <algorithm>.

    template <typename T>
    typename T::size_type tokenize(const std::string& s,
                                   typename T::iterator beg,
                                   typename T::iterator end)
    {
    }

    For inserting new elements, you can use an iterator
    adapter, e.g. std::insert_iterator. You could even
    use an output stream as a 'container' using
    ostream_iterator.
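
    For instance, here is a minimal sketch of that style; the
    whitespace-only tokenize below is just a stand-in for a real
    implementation:

    #include <deque>
    #include <iostream>
    #include <iterator>
    #include <sstream>
    #include <string>

    // Illustration only: split on whitespace, writing each token
    // through an output iterator -- no container type appears here.
    template <typename OutIter>
    void tokenize(const std::string& s, OutIter out)
    {
        std::istringstream is(s);
        std::string word;
        while(is >> word)
            *out++ = word;
    }

    int main()
    {
        std::string s("one two three");
        std::deque<std::string> tok;

        // insert into a container via an iterator adapter
        tokenize(s, std::inserter(tok, tok.begin()));

        // or treat an output stream as the 'container'
        tokenize(s, std::ostream_iterator<std::string>(std::cout, "\n"));
        return 0;
    }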

    >
    > Certainly, this is easy to do with a mix of C and C++:
    >
    > for(char *t=strtok(buf, delim); t != 0; t=strtok(0, delim))
    > tok.push_back(t);


    This contradicts your parameter type of const reference to string,
    since 'strtok()' modifies its argument.

    > where buf and delim are essentially char*'s. However, this seems
    > unsatisfactory as well.


    Yes, 'strtok()' can be problematic, if only for the reason that
    it modifies its argument, necessitating creation of a copy if
    you want to keep the argument const.
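
    If you did want to keep a const interface on top of 'strtok()', a
    sketch of that copy-first approach might look like this (the names
    here are illustrative, not from the original code):

    #include <cstring>
    #include <deque>
    #include <string>
    #include <vector>

    int tokenize(const std::string& s, const std::string& delim,
                 std::deque<std::string>& tok)
    {
        // strtok() needs writable storage, so work on a copy
        // (including the terminating '\0')
        std::vector<char> copy(s.c_str(), s.c_str() + s.size() + 1);
        int n = 0;
        for(char *t = std::strtok(&copy[0], delim.c_str());
            t != 0; t = std::strtok(0, delim.c_str()))
        {
            tok.push_back(t);
            ++n;
        }
        return n;  // number of tokens, as in the original spec
    }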

    HTH,
    -Mike
    Mike Wahler, Oct 7, 2003
    #2

  3. Mike Wahler

    Mike Wahler Guest

    Re: (corr) string tokenizing

    "Mike Wahler" <> wrote in message
    news:FkFgb.1741$...

    > template <typename T>
    > T::size_type tokenize(const std::string& s,
    > T::iterator beg,
    > T::iterator end)
    > {
    > }


    Oops, I meant to make those iterator parameters const refs
    as well

    const T::iterator& beg, const T::iterator& end

    -Mike
    Mike Wahler, Oct 7, 2003
    #3
  4. David Rubin

    David Rubin Guest

    Mike Wahler wrote:

    [snip]
    > >{ // I tried #include <limits>, but
    > > failed...

    >
    > What happened?


    foo.cc:9: limits: No such file or directory

    For some reason, my compiler can't find the file. Otherwise, I agree
    with you...

    > #include <limits>
    >
    > std::numeric_limits<char>::max();
    >
    > should work.


    [snip]
    > But here's some food for thought: one way to 'generalize'
    > container access is with iterators, as do the functions in
    > <algorithm>.


    > template <typename T>
    > T::size_type tokenize(const std::string& s,
    > T::iterator beg,
    > T::iterator end)
    > {
    > }


    This is a nice idea! I knew you should be able to do this, but I
    couldn't see how. Here is the refactored code:

    template <typename InsertIter>
    void
    tokenize(const string& buf, const string& delim, InsertIter ii)
    {
        string word;
        string::size_type sp, ep; // start/end position

        sp = 0;
        do{
            sp = buf.find_first_not_of(delim, sp);
            ep = buf.find_first_of(delim, sp);
            if(sp != ep){
                if(ep == buf.npos)
                    ep = buf.length();
                word = buf.substr(sp, ep-sp);
                *ii++ = lc(word); // lc() is a helper not shown in the post
                sp = buf.find_first_not_of(delim, ep+1);
            }
        }while(sp != buf.npos);

        if(sp != buf.npos){
            word = buf.substr(sp, buf.length()-sp);
            *ii++ = lc(word);
        }
    }

    called as

    tokenize(buf, delim, insert_iterator<deque<string> >(tokens, tokens.begin()));

    The original spec returned the number of tokens parsed. Now I have to
    settle for checking

    if(tokens.size() > 0){ ... }
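
    Though the count is easy enough to recover by comparing the
    container's size before and after the call (assuming the tokenize
    above):

    deque<string>::size_type before = tokens.size();
    tokenize(buf, delim, insert_iterator<deque<string> >(tokens, tokens.begin()));
    deque<string>::size_type ntok = tokens.size() - before;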

    /david

    --
    Andre, a simple peasant, had only one thing on his mind as he crept
    along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
    -- unknown
    David Rubin, Oct 7, 2003
    #4
  5. David Rubin

    David Rubin Guest

    David Rubin wrote:

    template <typename InsertIter>
    void
    tokenize(const string& buf, const string& delim, InsertIter ii)
    {
        string::size_type sp, ep; // start/end position

        ep = -1;

        do{
            sp = buf.find_first_not_of(delim, ep+1);
            ep = buf.find_first_of(delim, sp);
            if(sp != ep){
                if(ep == buf.npos)
                    ep = buf.length();
                *ii++ = buf.substr(sp, ep-sp);
            }
        }while(sp != buf.npos);
    }

    That's better. The 'ep+1' is a small optimization. I'm not sure it makes
    any difference, and really, starting with ep=0 makes the code a bit
    clearer.

    /david

    --
    Andre, a simple peasant, had only one thing on his mind as he crept
    along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
    -- unknown
    David Rubin, Oct 8, 2003
    #5
  6. Mike Wahler

    Mike Wahler Guest

    "David Rubin" <> wrote in message
    news:...
    > Mike Wahler wrote:
    >
    > [snip]
    > > >{ // I tried #include <limits>, but
    > > > failed...

    > >
    > > What happened?

    >
    > foo.cc:9: limits: No such file or directory
    >
    > For some reason, my compiler can't find the file.


    Configuration problem, installation problem, or perhaps
    simply not provided by your implementation. Which one
    are you using? More below.

    > Otherwise, I agree
    > with you...
    >
    > > #include <limits>
    > >
    > > std::numeric_limits<char>::max();
    > >
    > > should work.


    > [snip]


    > > But here's some food for thought: one way to 'generalize'
    > > container access is with iterators, as do the functions in
    > > <algorithm>.

    >
    > > template <typename T>
    > > T::size_type tokenize(const std::string& s,
    > > T::iterator beg,
    > > T::iterator end)
    > > {
    > > }

    >
    > This is a nice idea!



    Yes it is. But not my idea. I "stole" it from
    the standard library, the design of which is imo
    rich with Good Ideas.

    If you don't have the Josuttis book, get it.
    www.josuttis.com/libbook


    >I knew you should be able to do this, but I
    > couldn't see how. Here is the refactored code:


    [snip code] (I didn't look at it very closely, so if there are
    obvious errors, I didn't see them)

    > called as
    >
    > tokenize(buf, delim, insert_iterator<deque<string> >(tokens,
    > tokens.begin()));
    >
    > The original spec returned the number of tokens parsed. Now I have to
    > settle for checking
    >
    > if(tokens.size() > 0){ ... }


    or

    if(!tokens.empty())

    which *might* improve performance, and imo is
    more expressive.

    -Mike
    Mike Wahler, Oct 8, 2003
    #6
  7. Mike Wahler

    Mike Wahler Guest

    "David Rubin" <> wrote in message
    news:...
    > David Rubin wrote:
    >
    > template <typename InsertIter>
    > void
    > tokenize(const string& buf, const string& delim, InsertIter ii)


    You might squeeze out some more performance by making
    that last parameter a const reference.


    > {
    > string::size_type sp, ep; // start/end position
    >
    > ep = -1;
    >
    > do{
    > sp = buf.find_first_not_of(delim, ep+1);
    > ep = buf.find_first_of(delim, sp);
    > if(sp != ep){
    > if(ep == buf.npos)
    > ep = buf.length();
    > *ii++ = buf.substr(sp, ep-sp);
    > }
    > }while(sp != buf.npos);
    > }
    >
    > That's better. The 'ep+1' is a small optimization. I'm not sure it makes
    > any difference, and really, starting with ep=0 makes the code a bit
    > clearer.


    "I love it when a plan comes together."
    -George Peppard, as "Hannibal" in "The A Team"

    :)

    -Mike
    Mike Wahler, Oct 8, 2003
    #7
  8. SomeDumbGuy

    SomeDumbGuy Guest

    Mike Wahler wrote:


    >>tokenize(const string& buf, const string& delim, InsertIter ii)

    >
    >
    > You might squeeze out some more performance by making
    > that last parameter a const reference.


    Sorry to bother you. I was wondering how making it a const reference
    would help performance?
    SomeDumbGuy, Oct 8, 2003
    #8
  9. Mike Wahler

    Mike Wahler Guest

    "SomeDumbGuy" <abuse@127.0.0.1> wrote in message
    news:bbKgb.29244$...
    > Mike Wahler wrote:
    >
    >
    > >>tokenize(const string& buf, const string& delim, InsertIter ii)

    > >
    > >
    > > You might squeeze out some more performance by making
    > > that last parameter a const reference.

    >
    > Sorry to bother you. I was wondering how making it a const reference
    > would help performance?


    Note the word "might". Depending upon the implementation,
    an iterator might be a simple pointer, but it also could
    be an elaborate large structure, in which case passing
    by reference could be faster than passing a copy by
    value, and would still probably be just as fast as
    pass by value if the iterator is represented with a
    pointer type.

    Since OP's quest was to 'generalize' for any container
    type (via a template), we cannot know what the actual
    representation of the iterator will be.

    Also the same iterator type might be implemented in different
    ways among library implementations, some large and complex,
    some not. A reference can prevent possible performance
    degradation, without having to be concerned whether it's
    actually an issue.

    The 'const' part of my suggestion doesn't really have anything
    to do with performance. I say 'const' reference, since the
    function does not need to modify the iterator's value. This
    is the recommendation for any parameter of non-built-in type
    which a function does not modify, especially when the parameter
    type is a template argument, since you can't know how large and
    complex it might be.

    The "traditional wisdom" concerning parameter types is
    essentially:

    For non-built-in types or "unknown" types (specified by
    a template argument), pass by const reference by default.

    If the function needs to modify the caller's argument,
    regardless of type, pass by nonconst reference.

    If the parameter is always a built-in type and the function
    need not modify the caller's argument, pass by value or
    const reference.

    Or something like that. :)
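
    In signature form (function names here are just placeholders), that
    might look like:

    #include <string>

    void f(const std::string& s); // non-built-in, not modified: const reference
    void g(std::string& s);       // modifies caller's argument: nonconst reference
    void h(int n);                // built-in, not modified: by value is fine

    template <typename T>
    void k(const T& t);           // "unknown" template type: const reference by default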

    Does that help?

    -Mike
    Mike Wahler, Oct 8, 2003
    #9
  10. Jerry Coffin

    Jerry Coffin Guest

    In article <>,
    says...
    > I looked on google for an answer, but I didn't find anything short of
    > using boost which sufficiently answers my question: what is a good way
    > of doing string tokenization (note: I cannot use boost). For example, I
    > have tried this:


    This may look a bit odd at first, but it works quite nicely. Its basic
    idea is to hijack the tokenizer built into the standard iostream
    classes, and put it to our purposes. The main thing necessary to do
    that is to create a facet that classifies our delimiters as "space" and
    everything else as not-"space". Once we have a stream using our facet,
    we can simply read tokens from the stream and use them as we please.

    Here's the code:

    #include <iostream>
    #include <deque>
    #include <sstream>
    #include <iterator>

    #include "facet"

    template <class T>
    class delims : public ctype_table<T>
    {
    public:
        delims(size_t refs = 0)
            : ctype_table<T>(ctype_table<T>::empty)
        {
            for (int i=0; i<table_size; i++)
                if((isspace(i) || ispunct(i)) && !(i == '_' || i == '#'))
                    table()[widen(i)] = mask(space);
        }
    };

    int main() {
        std::string buf;
        std::locale d(std::locale::classic(), new delims<char>);

        while ( std::getline(std::cin, buf)) {
            // deque to hold tokens.
            std::deque<std::string> tok;

            // create istringstream from the input string and have it use our facet.
            std::istringstream is(buf);
            is.imbue(d);

            std::istream_iterator<std::string> in(is), end;

            // copy tokens from our stream into a deque
            std::copy(in, end,
                      std::back_inserter<std::deque<std::string> >(tok));

            // show tokens, one per line.
            std::copy(tok.begin(), tok.end(),
                      std::ostream_iterator<std::string>(std::cout, "\n"));
        }

        return 0;
    }

    Note that if we knew we were going to use std::cin for this, we could
    just imbue std::cin with the locale we created, and read directly from
    it into the deque:

    // same facet as above.

    int main() {
        std::locale d(std::locale::classic(), new delims<char>);
        std::cin.imbue(d);

        std::istream_iterator<std::string> in(std::cin), end;
        std::ostream_iterator<std::string> out(std::cout, "\n");

        std::copy(in, end, out);
        return 0;
    }

    Of course, even if you're planning to put the tokens into some
    container, you can still imbue the input stream with the locale instead
    of reading from stream to string, then imbuing a stringstream with the
    locale.

    The facet header I'm using contains some code I posted a while back --
    it looks like this:

    #include <locale>
    #include <algorithm>

    template<class T>
    class table {
        typedef typename std::ctype<T>::mask tmask;

        tmask *t;
    public:
        table() : t(new tmask[std::ctype<T>::table_size]) {}
        ~table() { delete [] t; }
        tmask *the_table() { return t; }
    };

    template<class T>
    class ctype_table : table<T>, public std::ctype<T> {
    protected:
        typedef typename std::ctype<T>::mask tmask;

        enum inits { empty, classic };

        ctype_table(size_t refs = 0, inits init=classic)
            : std::ctype<T>(the_table(), false, refs)
        {
            if (classic == init)
                std::copy(classic_table(),
                          classic_table()+table_size,
                          the_table());
            else
                std::fill_n(the_table(), table_size, mask());
        }
    public:
        tmask *table() {
            return the_table();
        }
    };

    This handles most of the dirty work of creating a facet, so about all
    you're left with is specifying what characters you want treated as
    delimiters, and then telling the stream to use a locale that includes
    the facet.

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
    Jerry Coffin, Oct 8, 2003
    #10
  11. SomeDumbGuy

    SomeDumbGuy Guest

    Mike Wahler wrote:

    >>>You might squeeze out some more performance by making
    >>>that last parameter a const reference.

    >>
    >>Sorry to bother you. I was wondering how making it a const reference
    >>would help performance?

    >
    >
    > Note the word "might". Depending upon the implementation,
    > an iterator might be a simple pointer, but it also could
    > be an elaborate large structure, in which case passing
    > by reference could be faster than passing a copy by
    > value, and would still probably be just as fast as
    > pass by value if the iterator is represented with a
    > pointer type.



    > The "traditional wisdom" concerning parameter types is
    > essentially:
    >
    > For non-built-in types or "unknown" types (specified by
    > a template argument), pass by const reference by default.
    >
    > If the function needs to modify the caller's argument,
    > regardless of type, pass by nonconst reference.
    >
    > If the parameter is always a built-in type and the function
    > need not modify the caller's argument, pass by value or
    > const reference.
    >
    > Or something like that. :)
    >
    > Does that help?


    Yes. :)
    SomeDumbGuy, Oct 8, 2003
    #11
  12. David Rubin

    David Rubin Guest

    Jerry Coffin wrote:

    > This may look a bit odd at first, but it works quite nicely. Its basic
    > idea is to hijack the tokenizer built into the standard iostream
    > classes, and put it to our purposes. The main thing necessary to do
    > that is to create a facet that classifies our delimiters as "space" and
    > everything else as not-"space". Once we have a stream using our facet,
    > we can simply read tokens from the stream and use them as we please.


    Interesting idea. It seems rather complex compared to the code I posted
    though. One nice thing about your solution is that you encapsulate
    delims in a class. I suppose you could extend the class a bit with
    various methods to extend the delimiters (equivalent to isspace,
    ispunct, etc), although this is reasonably accomplished via subclassing.
    What are the other benefits of this approach compared to mine?

    Also, I have a few questions...

    [snip]
    > int main() {
    > std::locale d(std::locale::classic(), new delims<char>);


    Can you explain the role of locales in a little more detail? Can't you
    just skip this part in most cases?

    > std::cin.imbue(d);


    > std::istream_iterator<std::string> in(std::cin), end;
    > std::ostream_iterator<std::string> out(std::cout, "\n");


    > std::copy(in, end, out);


    How does end function here? I've only seen copy specified with iterators
    associated with a container. In this case, in and end seem to have no
    association with each other.

    > return 0;
    > }


    /david

    --
    Andre, a simple peasant, had only one thing on his mind as he crept
    along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
    -- unknown
    David Rubin, Oct 8, 2003
    #12
  13. David Rubin

    David Rubin Guest

    Mike Wahler wrote:
    >
    > "David Rubin" <> wrote in message
    > news:...
    > > David Rubin wrote:
    > >
    > > template <typename InsertIter>
    > > void
    > > tokenize(const string& buf, const string& delim, InsertIter ii)

    >
    > You might squeeze out some more performance by making
    > that last parameter a const reference.


    Actually, I tried defining this function as

    template <typename InsertIter>
    void tokenize(const string& buf, const string& delim, InsertIter& ii);

    (and const InsertIter&) with little success. I got compiler errors both
    times. For example, with the non-const reference, I got:

    ; g++ tokenize.cc
    tokenize.cc: In function `int main()':
    tokenize.cc:31: error: could not convert `
    insert_iterator<std::deque<std::basic_string<char,
    std::char_traits<char>,
    std::allocator<char> >, std::allocator<std::basic_string<char,
    std::char_traits<char>, std::allocator<char> > > > >((&tokens),
    std::deque<_Tp, _Alloc>::begin() [with _Tp = std::string, _Alloc =
    std::allocator<std::string>]())' to `
    std::insert_iterator<std::deque<std::string,
    std::allocator<std::string> >
    >&'

    tokenize.cc:12: error: in passing argument 3 of `void tokenize(const
    std::string&, const std::string&, InsertIter&) [with InsertIter =
    std::insert_iterator<std::deque<std::string,
    std::allocator<std::string> >
    >]'


    when invoked (12) as

    tokenize(buf, delim, insert_iterator<deque<string> >(tokens,
    tokens.begin()));

    (g++ 3.3.1, Solaris 2.6).

    --
    Andre, a simple peasant, had only one thing on his mind as he crept
    along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
    -- unknown
    David Rubin, Oct 8, 2003
    #13
  14. "David Rubin" <> wrote in message
    news:...
    > [...]
    > (note: I cannot use boost).
    > [...]


    Why can't you use Boost? Legal department?

    Dave



    David B. Held, Oct 8, 2003
    #14
  15. Mike Wahler

    Mike Wahler Guest

    "David Rubin" <> wrote in message
    news:...
    > Mike Wahler wrote:
    > >
    > > "David Rubin" <> wrote in message
    > > news:...
    > > > David Rubin wrote:
    > > >
    > > > template <typename InsertIter>
    > > > void
    > > > tokenize(const string& buf, const string& delim, InsertIter ii)

    > >
    > > You might squeeze out some more performance by making
    > > that last parameter a const reference.

    >
    > Actually, I tried defining this function as
    >
    > template <typename InsertIter>
    > void tokenize(const string& buf, const string& delim, InsertIter& ii);
    >
    > (and const InsertIter&) with little success. I got compiler errors both
    > times. For example, with the non-const reference, I got:
    >
    > ; g++ tokenize.cc
    > tokenize.cc: In function `int main()':
    > tokenize.cc:31: error: could not convert `
    > insert_iterator<std::deque<std::basic_string<char,
    > std::char_traits<char>,
    > std::allocator<char> >, std::allocator<std::basic_string<char,
    > std::char_traits<char>, std::allocator<char> > > > >((&tokens),
    > std::deque<_Tp, _Alloc>::begin() [with _Tp = std::string, _Alloc =
    > std::allocator<std::string>]())' to `
    > std::insert_iterator<std::deque<std::string,
    > std::allocator<std::string> >
    > >&'

    > tokenize.cc:12: error: in passing argument 3 of `void tokenize(const
    > std::string&, const std::string&, InsertIter&) [with InsertIter =
    > std::insert_iterator<std::deque<std::string,
    > std::allocator<std::string> >
    > >]'

    >
    > when invoked (12) as
    >
    > tokenize(buf, delim, insert_iterator<deque<string> >(tokens,
    > tokens.begin()));


    If it's really an issue for you, I'll take a look and
    see if I can locate the problem.

    -Mike
    Mike Wahler, Oct 8, 2003
    #15
  16. David Rubin

    David Rubin Guest

    Mike Wahler wrote:

    > > [...] I got compiler errors both
    > > times. For example, with the non-const reference, I got:

    [snip]
    > If it's really an issue for you, I'll take a look and
    > see if I can locate the problem.


    I'd appreciate it. I just don't understand what the compiler is doing.
    Thanks.

    /david

    --
    Andre, a simple peasant, had only one thing on his mind as he crept
    along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
    -- unknown
    David Rubin, Oct 8, 2003
    #16
  17. Mike Wahler

    Mike Wahler Guest

    "David Rubin" <> wrote in message
    news:...
    > Mike Wahler wrote:
    >
    > > > [...] I got compiler errors both
    > > > times. For exapmle, with the non-const refernce, I got:

    > [snip]
    > > If it's really an issue for you, I'll take a look and
    > > see if I can locate the problem.

    >
    > I'd appreciate it. I just don't understand what the compiler is doing.


    I didn't get exactly the same diagnostics, but I did get
    complaints about 'insert_iterator::operator++()', because
    it modifies the iterator, so we cannot make it a const
    reference, but it works for me as a nonconst reference:

    #include <algorithm>
    #include <deque>
    #include <iterator>
    #include <string>

    template <typename InsertIter>
    void tokenize(const std::string& buf,
                  const std::string& delim,
                  InsertIter& ii)
    {
        std::string::size_type sp(0); /* start position */
        std::string::size_type ep(-1); /* end position */

        do
        {
            sp = buf.find_first_not_of(delim, ep + 1);
            ep = buf.find_first_of(delim, sp);

            if(sp != ep)
            {
                if(ep == buf.npos)
                    ep = buf.length();

                *ii++ = buf.substr(sp, ep-sp);
            }

        } while(sp != buf.npos);
    }

    int main()
    {
        std::string buf("We* are/parsing [a---string");
        std::string delim(" */[-");
        std::deque<std::string> tokens;

        tokenize(buf, delim, std::inserter(tokens, tokens.begin()));

        std::copy(tokens.begin(), tokens.end(),
                  std::ostream_iterator<std::string>(std::cout, "\n"));

        return 0;
    }


    Output:

    We
    are
    parsing
    a
    string


    HTH,
    -Mike
    Mike Wahler, Oct 9, 2003
    #17
  18. Mike Wahler

    Mike Wahler Guest

    Re: (corr) string tokenizing

    "Mike Wahler" <> wrote in message
    news:CG1hb.3532$...
    >
    > std::copy(tokens.begin(), tokens.end(),
    > std::ostream_iterator<std::string>(std::cout, "\n"));


    I forgot the obvious #include <iostream> :)

    (I suppose my implementation brought the decls in via
    #include <iterator>, so didn't complain about no 'cout')

    -Mike
    Mike Wahler, Oct 9, 2003
    #18
  19. Jerry Coffin

    Jerry Coffin Guest

    In article <>,
    says...

    [ ... ]

    > Interesting idea. It seems rather complex compared to the code I posted
    > though.


    Yes and no -- the framework for creating a facet is a bit more complex
    than I'd like, but creating a facet is fairly easy and using it is
    pretty nearly dead simple.

    > One nice thing about your solution is that you encapsulate
    > delims in a class. I suppose you could extend the class a bit with
    > various methods to extend the delimiters (equivalent to isspace,
    > ispunct, etc), although this is reasonably accomplished via subclassing.
    > What are the other benefits of this approach compared to mine?


    Ease of use -- it supports normal stream extractors. Since the example
    at hand was taking the tokens from a stream and putting them into a
    deque, I used iterators and std::copy to copy from one to the other. It
    could have been done with normal stream extraction though:

    std::string token;

    while (stream >> token)
        std::cout << token << std::endl;

    would have worked just fine for the job at hand as well. For that
    matter, to read them and put them into the deque, we could have done
    something like:

    std::string token;
    std::deque<std::string> tok;

    while ( stream >> token)
        tok.push_back(token);


    To make a long story short, once you've imbued a stream with the facet
    that defines the delimiters between tokens, you don't have to learn
    anything new at all -- you're just extracting strings from a stream.

    > Also, I have a few questions...
    >
    > [snip]
    > > int main() {
    > > std::locale d(std::locale::classic(), new delims<char>);

    >
    > Can you explain the role of locales in a little more detail? Can't you
    > just skip this part in most cases?


    A locale is basically a collection of facets (one facet each of a number
    of different types). A stream, however, only knows about a complete
    locale, not individual facets. Therefore, we create an otherwise
    normal locale, but have it use our ctype facet. The rest of it probably
    doesn't matter for the job at hand, but (at least TTBOMK) there's no
    function that tells a stream to use one facet at a time.

    > > std::cin.imbue(d);

    >
    > > std::istream_iterator<std::string> in(std::cin), end;
    > std::ostream_iterator<std::string> out(std::cout, "\n");

    >
    > > std::copy(in, end, out);

    >
    > How does end function here? I've only seen copy specified with iterators
    > associated with a container. In this case, in and end seem to have no
    > association with each other.


    An istream iterator that's not associated with a stream is basically the
    equivalent of EOF -- another istream_iterator will become equal to it
    when the end of the stream is encountered.

    That's strictly related to using istream_iterator's though -- it's
    completely independent of changing the facet to get the parsing we want.
    Just for example, if we had a file of numbers (integers for the moment)
    and we wanted to read those numbers into a vector, we could do it
    similarly:

    std::vector<int> v;

    std::istream_iterator<int> in(std::cin), end;

    std::copy(in, end, std::back_inserter<std::vector<int> >(v));

    Similarly, if we have a file of floating point values that we wanted to
    read into a list, we could do:

    std::list<double> ld;

    std::istream_iterator<double> in(std::cin), end;

    std::copy(in, end, std::back_inserter<std::list<double> >(ld));

    and so on.

    To summarize: an istream_iterator allows us to treat the contents of a
    stream like a collection, so we can apply a standard algorithm to its
    contents (it's basically an input_iterator, so we can't, for example,
    apply an algorithm that requires a random access iterator though).
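
    For instance, since input iterators are single-pass, an algorithm
    like std::sort has to work on a copy in a container; a minimal
    sketch:

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <vector>

    int main()
    {
        // read all the ints first -- the sort can't run on the stream itself
        std::vector<int> v;
        std::copy(std::istream_iterator<int>(std::cin),
                  std::istream_iterator<int>(),
                  std::back_inserter(v));
        std::sort(v.begin(), v.end());

        std::copy(v.begin(), v.end(),
                  std::ostream_iterator<int>(std::cout, "\n"));
        return 0;
    }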

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
    Jerry Coffin, Oct 9, 2003
    #19
  20. David Rubin

    David Rubin Guest

    Mike Wahler wrote:

    [snip]
    > template <typename InsertIter>
    > void tokenize(const std::string& buf,
    > const std::string& delim,
    > InsertIter& ii)


    [snip]
    > int main()
    > {
    > std::string buf("We* are/parsing [a---string");
    > std::string delim(" */[-");
    > std::deque<std::string> tokens;
    >
    > tokenize(buf, delim, std::inserter(tokens, tokens.begin()));


    This only works with my compiler if I do

    std::insert_iterator<std::deque<std::string> > ii(tokens, tokens.begin());
    tokenize(buf, delim, ii);

    Otherwise, I get the same errors as before. I guess this means my
    compiler is broken?

    ; g++ --version
    g++ (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)

    /david

    --
    Andre, a simple peasant, had only one thing on his mind as he crept
    along the East wall: 'Andre, creep... Andre, creep... Andre, creep.'
    -- unknown
    David Rubin, Oct 9, 2003
    #20
