remove certain characters from a string

Discussion in 'C++' started by Brad, May 24, 2008.

  1. Brad

    Brad Guest

    I'm writing a function to remove certain characters from strings. For
    example, I often get strings with commas... they look like this:

    "12,384"

    I'd like to take that string, remove the comma and return "12384"

    What is the most efficient, fastest way to approach this?

    Thanks,

    Brad
    Brad, May 24, 2008
    #1
    1. Advertising

  2. Brad

    Ian Collins Guest

    Brad wrote:
    > I'm writing a function to remove certain characters from strings. For
    > example, I often get strings with commas... they look like this:
    >
    > "12,384"
    >
    > I'd like to take that string, remove the comma and return "12384"
    >
    > What is the most efficient, fastest way to approach this?
    >

    Probably a character by character copy, skipping the characters you want
    to remove.

    --
    Ian Collins.
    Ian Collins, May 24, 2008
    #2
    1. Advertising

  3. Brad

    tarakant_sethy

    Joined:
    May 24, 2008
    Messages:
    1
    what i feel is, u can create a temp array, and start searching from the existing array for numbers( ucan use isdigit()) and dump it to the temp array.
    then apply atoi() to the emp array and get the result.

    i dont know if it efficient or not. Do u have any other solution for it ?
    tarakant_sethy, May 24, 2008
    #3
  4. Brad

    Brad Guest

    Paavo Helde wrote:

    >
    > std::string s("12,384,586");
    > std::string::size_type k = 0;
    > while((k=s.find(',',k))!=s.npos) {
    > s.erase(k, 1);
    > }
    >
    > this ok?
    > Paavo


    Thanks, works well with one character... How would I make it work with
    several... like so:

    #include <iostream>

    std::string clean(std::string the_string)
    {
    char bad[] = {'.', ',', ' ', '|', '\0'};
    std::string::size_type n = 0;
    while ((n=the_string.find(bad, n))!=the_string.npos)
    {
    the_string.erase(n,1);
    }
    std::cout << the_string << std::endl;
    return the_string;
    }

    int main()
    {
    clean("12,34.45|78 9");
    return 0;
    }
    Brad, May 24, 2008
    #4
  5. Brad schrieb:
    > Thanks, works well with one character... How would I make it work with
    > several... like so:


    struct CheckForBad
    {
    const string chars;
    CheckForBad(string const& chars)
    : chars(sortUniq(chars))
    {}
    static string sortUniq(string tmp)
    {
    sort(tmp.begin(), tmp.end());
    tmp.erase(
    unique(tmp.begin(), tmp.end()),
    tmp.end()
    );
    return tmp;
    }
    bool operator() (char const c) const
    {
    return binary_search(
    chars.begin(), chars.end(),
    c
    );
    }
    };

    string removeStuff(string text, string const& stuff)
    {
    text.erase(
    remove_if(
    text.begin(), text.end(),
    CheckForBad(stuff)
    ),
    text.end()
    );
    return text;
    }

    //or:
    string removeStd(string text)
    {
    static const CheckForBad check(".,| ");
    text.erase(
    remove_if(
    text.begin(), text.end(),
    check
    ),
    text.end()
    );
    return text;
    }

    not tested

    Regards,
    Frank
    Frank Birbacher, May 24, 2008
    #5
  6. Brad

    Guest

    On May 24, 5:01 pm, Paavo Helde <> wrote:
    > Brad <> kirjutas:
    >
    > > I'm writing a function to remove certain characters from strings. For
    > > example, I often get strings with commas... they look like this:

    >
    > > "12,384"

    >
    > > I'd like to take that string, remove the comma and return "12384"

    >
    > > What is the most efficient, fastest way to approach this?

    >
    > > Thanks,

    >
    > > Brad

    >
    > std::string s("12,384,586");
    > std::string::size_type k = 0;
    > while((k=s.find(',',k))!=s.npos) {
    > s.erase(k, 1);
    >
    > }
    >
    > this ok?
    > Paavo


    This is ugly... based on what Pavvo posted. It works. What do you guys
    think? Bad?

    #include <iostream>

    std::string clean(std::string the_string)
    {
    char bad[] = {'.', ',', ' ', '|', '\0'};
    std::string::size_type n = 0;

    while ((n=the_string.find(bad[0], n))!=the_string.npos)
    {
    the_string.erase(n,1);
    }

    n = 0;
    while ((n=the_string.find(bad[1], n))!=the_string.npos)
    {
    the_string.erase(n,1);
    }

    n = 0;
    while ((n=the_string.find(bad[2], n))!=the_string.npos)
    {
    the_string.erase(n,1);
    }

    n = 0;
    while ((n=the_string.find(bad[3], n))!=the_string.npos)
    {
    the_string.erase(n,1);
    }

    std::cout << the_string << std::endl;
    return the_string;
    }

    int main()
    {
    clean("12,,34..56||78 9");
    return 0;
    }
    , May 24, 2008
    #6
  7. Brad

    Ian Collins Guest

    Brad wrote:
    > Paavo Helde wrote:
    >
    >>
    >> std::string s("12,384,586");
    >> std::string::size_type k = 0;
    >> while((k=s.find(',',k))!=s.npos) {
    >> s.erase(k, 1);
    >> }
    >>
    >> this ok?
    >> Paavo

    >
    > Thanks, works well with one character... How would I make it work with
    > several... like so:
    >
    > #include <iostream>
    >
    > std::string clean(std::string the_string)
    > {
    > char bad[] = {'.', ',', ' ', '|', '\0'};
    > std::string::size_type n = 0;
    > while ((n=the_string.find(bad, n))!=the_string.npos)
    > {
    > the_string.erase(n,1);
    > }
    > std::cout << the_string << std::endl;
    > return the_string;
    > }
    >


    If you want a simple loop:

    std::string clean( const std::string& the_string )
    {
    static const std::string bad("., |");

    std::string result;

    for( std::string::const_iterator c = the_string.begin();
    c != the_string.end(); ++c )
    {
    if( bad.find( *c ) == std::string::npos )
    {
    result += *c;
    }
    }

    return result;
    }

    --
    Ian Collins.
    Ian Collins, May 24, 2008
    #7
  8. Hi!

    Looking at the various solutions to the original problem, I wanted to
    state my design goals so one could make a resonable decision about which
    code to use.

    I try to keep the algorithmic complexity low. I try to reuse code, that
    is I use the STL and therefore I stick to its idioms.

    Regards,
    Frank
    Frank Birbacher, May 25, 2008
    #8
  9. Brad

    Guest

    On May 24, 7:43 pm, Frank Birbacher <> wrote:
    > Hi!
    >
    > Looking at the various solutions to the original problem, I wanted to
    > state my design goals so one could make a resonable decision about which
    > code to use.
    >
    > I try to keep the algorithmic complexity low. I try to reuse code, that
    > is I use the STL and therefore I stick to its idioms.


    Hmm... I see a lot of really complex and strange code here when it's
    not really necessary. Most of what people posted requires multiple
    passes through the string, or a lot of shifting of bytes around (e.g.
    something like Paavo's "while (string contains char) remove_char" is
    going to do -way- more moving of data than necessary -- it shifts the
    entire end of the string back every time through the loop). Sticking
    to generic STL calls for finding and removing characters in the string
    gains you nothing unless you are going to be finding and removing
    elements from generic containers that don't provide random access
    iterators (in which case the generic programming is a benefit). The
    use of remove_if, such as in Frank's example, will get you equal
    performance to the example below (remove_if may very well be
    implemented the same way), except Frank's sort + binary search is
    likely to have more overhead then a simple linear search for your
    original requirements of removing a set of 3 or 4 bad characters only
    (however, for removing large character sets, a binary search will
    perform better, the sort is unnecessary if the input is sorted to
    begin with -- but you can do the search in -constant- time, with no
    pre-sorting either, if you make some assumptions about the max value
    of a char and load valid characters into a lookup table first). You
    know that you are using a string (or any type with random access
    iterators). Just do something like this:

    In-place, single pass through string, no unnecessary copies or moves:

    void remove_chars (const string &bad, string &str) {
    string::iterator s, d;
    for (s = str.begin(), d = s; s != str.end(); ++ s)
    if (bad.find(*s) == string::npos)
    *(d ++) = *s;
    str.resize(d - str.begin());
    }

    That works because 'd' will always be behind or at the same position
    as 's'. That in-place version can be made to work with generic
    iterators as well as random access iterators if you replace the
    resize() call with "erase(d, str.end())". Here is the same thing,
    places result in destination buffer:

    void remove_chars (const string &bad, const string &str, string
    &clean) {
    string::const_iterator s;
    clean = "";
    clean.reserve(str.size()); // do not perform extra realloc + copies.
    for (s = str.begin(); s != str.end(); ++ s)
    if (bad.find(*s) == string::npos)
    clean += *s;
    }

    Example use:

    {
    string s = "a m|e|s|s|y s,t,r,i,n,g", c;
    remove_chars("mesy", s, c);
    remove_chars("|,", s);
    cout << c << endl << s << endl;
    }


    Jason
    , May 25, 2008
    #9
  10. Brad

    Ian Collins Guest

    wrote:
    >
    > void remove_chars (const string &bad, const string &str, string
    > &clean) {
    > string::const_iterator s;
    > clean = "";
    > clean.reserve(str.size()); // do not perform extra realloc + copies.
    > for (s = str.begin(); s != str.end(); ++ s)
    > if (bad.find(*s) == string::npos)
    > clean += *s;
    > }
    >

    Isn't that just about (excluding the reserve) identical to the solution
    I posted?

    --
    Ian Collins.
    Ian Collins, May 25, 2008
    #10
  11. Paavo Helde wrote:
    >> What is the most efficient, fastest way to approach this?

    >
    > std::string s("12,384,586");
    > std::string::size_type k = 0;
    > while((k=s.find(',',k))!=s.npos) {
    > s.erase(k, 1);
    > }


    That's certainly not the most efficient way to do it.
    Juha Nieminen, May 25, 2008
    #11
  12. Brad

    Guest

    On May 25, 12:12 am, Ian Collins <> wrote:
    > Isn't that just about (excluding the reserve) identical to the solution
    > I posted?


    Yes it is identical. I was responding to the big blocks of code I saw
    being posted. The OP wandered down the "while (character found) remove
    character" path instead of using your suggestion right off the bat.
    I'm sorry, I should have mentioned it.

    Jason
    , May 25, 2008
    #12
  13. Hi!

    Daniel T. schrieb:
    > Jason's solution is simply to write the remove_if algorithm out by hand.


    Which doesn't necessarily make the code more readable. But readability
    contraditcs writing code fast/short. And writing code fast/short doesn't
    mean the code will work. <ironic>But anyways, who actually wants code
    that works?</ironic>

    Well, remove_if is tested, the while-loop solution is not. I recently
    was terrified by similar char-shifting code. I dislike it.

    Frank
    Frank Birbacher, May 25, 2008
    #13
  14. Brad

    Jerry Coffin Guest

    In article <3329c5e1-0b62-46da-92e7-714fa83b5b69
    @m44g2000hsc.googlegroups.com>, says...

    [ ... ]

    > In-place, single pass through string, no unnecessary copies or moves:
    >
    > void remove_chars (const string &bad, string &str) {
    > string::iterator s, d;
    > for (s = str.begin(), d = s; s != str.end(); ++ s)
    > if (bad.find(*s) == string::npos)
    > *(d ++) = *s;
    > str.resize(d - str.begin());
    > }


    While this is strictly linear in terms of the manipulation of the target
    string, it's still O(N*N) with respect to searches of the string holding
    characters to remove.

    If your list of characters to remove is very long, you might want to use
    a bitmap indexed by the characters to determine whether a particular
    character is allowed or not:

    bool bitmap[UCHAR_MAX] = { false };

    while (char *pos=bad; *pos; ++pos)
    bitmap[*pos] = true;

    Then instead of:

    if (bad.find(*s) == string::npos)

    you'd use something like:

    if (!bitmap[*s])

    giving constant complexity for each search. If you have a large
    character set, that could waste a lot of storage though. If that's a
    problem, you could sort the "bad" string, and use a binary search
    instead. For a list of only 5 characters (or so) that would be a waste
    of time and effort -- but if (for example) its a few hundred characters
    long, the time saved could be considerable.

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
    Jerry Coffin, May 25, 2008
    #14
  15. Brad

    Guest

    On May 25, 6:33 am, ""
    <> wrote:
    > On May 25, 12:12 am, Ian Collins <> wrote:
    >
    > > Isn't that just about (excluding the reserve) identical to the solution
    > > I posted?

    >
    > Yes it is identical. I was responding to the big blocks of code I saw
    > being posted. The OP wandered down the "while (character found) remove
    > character" path instead of using your suggestion right off the bat.
    > I'm sorry, I should have mentioned it.
    >
    > Jason


    wrote:
    > On May 25, 12:12 am, Ian Collins <> wrote:
    >> Isn't that just about (excluding the reserve) identical to the solution
    >> I posted?

    >
    > Yes it is identical. I was responding to the big blocks of code I saw
    > being posted. The OP wandered down the "while (character found) remove
    > character" path instead of using your suggestion right off the bat.
    > I'm sorry, I should have mentioned it.
    >
    > Jason


    I wanted to thank everyone for this thread. Very good information. Are
    there any C++ books that go into these sorts of considerations...
    specifically performance when considering how to best optimize code or
    does it just come with experience?

    Thanks again,

    Brad
    , May 25, 2008
    #15
  16. Brad

    Guest

    Now we're going in circles :)

    Ian Collins wrote:
    > Probably a character by character copy, skipping the characters you
    > want to remove.


    Frank Birbacher wrote:
    >...
    > sort(tmp.begin(), tmp.end());
    > tmp.erase(
    > unique(tmp.begin(), tmp.end()),
    > tmp.end()
    > );
    >...
    > return binary_search(


    Jason wrote:
    > Frank's sort + binary search is
    > likely to have more overhead then a simple linear search for your
    > original requirements of removing a set of 3 or 4 bad characters only
    > (however, for removing large character sets, a binary search will
    > perform better, the sort is unnecessary if the input is sorted to
    > begin with -- but you can do the search in -constant- time, with no
    > pre-sorting either, if you make some assumptions about the max value
    > of a char and load valid characters into a lookup table first)

    (well; no assumptions, there's UCHAR_MAX, oops)

    Jerry Coffin wrote:
    > If your list of characters to remove is very long, you might want to use
    > a bitmap indexed by the characters to determine whether a particular
    > character is allowed or not:
    >...
    > If that's a
    > problem, you could sort the "bad" string, and use a binary search
    > instead. For a list of only 5 characters (or so) that would be a waste
    > of time and effort -- but if (for example) its a few hundred characters
    > long, the time saved could be considerable.
    , May 25, 2008
    #16
  17. Frank Birbacher, May 25, 2008
    #17
  18. Brad

    William Xu Guest

    "Daniel T." <> writes:

    > struct contains_any_of : unary_function< char, bool >
    > {
    > set<char> str;
    > contains_any_of( const string& s ): str( s.begin(), s.end() ) { }
    > bool operator()( char c ) const {
    > return str.count( c ) == 1;


    Maybe better `return str.find(c) != str.end();' ? That seems more
    natural
    to me.

    > }
    > };


    --
    William

    http://williamxu.net9.org

    I hate small towns because once you've seen the cannon in the park
    there's nothing else to do.
    -- Lenny Bruce
    William Xu, May 26, 2008
    #18
  19. Brad

    Kai-Uwe Bux Guest

    Daniel T. wrote:

    > Jerry Coffin <> wrote:
    >
    >> says...
    >>
    >> [ ... ]
    >>
    >> > In-place, single pass through string, no unnecessary copies or moves:
    >> >
    >> > void remove_chars (const string &bad, string &str) {
    >> > string::iterator s, d;
    >> > for (s = str.begin(), d = s; s != str.end(); ++ s)
    >> > if (bad.find(*s) == string::npos)
    >> > *(d ++) = *s;
    >> > str.resize(d - str.begin());
    >> > }

    >>
    >> While this is strictly linear in terms of the manipulation of the target
    >> string, it's still O(N*N) with respect to searches of the string holding
    >> characters to remove.
    >>
    >> If your list of characters to remove is very long, you might want to use
    >> a bitmap indexed by the characters to determine whether a particular
    >> character is allowed or not:
    >>
    >> bool bitmap[UCHAR_MAX] = { false };
    >>
    >> while (char *pos=bad; *pos; ++pos)
    >> bitmap[*pos] = true;
    >>
    >> Then instead of:
    >>
    >> if (bad.find(*s) == string::npos)
    >>
    >> you'd use something like:
    >>
    >> if (!bitmap[*s])
    >>
    >> giving constant complexity for each search. If you have a large
    >> character set, that could waste a lot of storage though. If that's a
    >> problem, you could sort the "bad" string, and use a binary search
    >> instead. For a list of only 5 characters (or so) that would be a waste
    >> of time and effort -- but if (for example) its a few hundred characters
    >> long, the time saved could be considerable.

    >
    > Good comment! I didn't think of that angle. The nice thing about using
    > the remove_copy_if algorithm and a functor (the solution I presented
    > yesterday) is that the time and effort to convert to a binary search is
    > minimal...
    >
    > struct contains_any_of : unary_function< char, bool >
    > {
    > string str;
    > contains_any_of( const string& s ): str( s ) { }
    > bool operator()( char c ) const {
    > return find( str.begin(), str.end(), c ) != str.end();
    > }
    > };
    >
    > str.erase( remove_if( str.begin(), str.end(),
    > contains_any_of( str2 ) ), str.end() );
    >
    > Changing it simply requires a small chanage in the functor:
    >
    > struct contains_any_of : unary_function< char, bool >
    > {
    > set<char> str;
    > contains_any_of( const string& s ): str( s.begin(), s.end() ) { }
    > bool operator()( char c ) const {
    > return str.count( c ) == 1;
    > }
    > };
    >
    > You are right, much more efficient.


    Hm, this might well be a case where logarithmic is slower than linear for
    all practically relevant cases. Due to its tree-like implementation,
    std::set is not very local in memory and can cause cache misses. For a
    logarithmic solution, I would instead consider something based on
    std::vector, e.g., like so:


    struct matches_any_of : std::unary_function< char, bool > {

    std::vector<char> the_str;

    template < typename Iterator >
    void assign ( Iterator from, Iterator to ) {
    the_str.assign( from, to );
    std::sort( the_str.begin(), the_str.end() );
    the_str.erase( std::unique( the_str.begin(), the_str.end() ),
    the_str.end() );
    }

    template < typename Iterator >
    matches_any_of ( Iterator from, Iterator to ) {
    assign( from, to );
    }

    matches_any_of ( char const * str ) {
    assign( str, str+std::strlen( str ) );
    }

    matches_any_of ( std::string const & str ) {
    assign( str.begin(), str.end() );
    }

    bool operator() ( char chr ) const {
    return( std::binary_search
    ( the_str.begin(), the_str.end(), chr ) );
    }

    };


    Of course, one should measure before betting on any of the solutions to be
    the most efficient.


    Best

    Kai-Uwe Bux
    Kai-Uwe Bux, May 26, 2008
    #19
  20. Brad

    Jerry Coffin Guest

    In article <483aa261$0$25950$>,
    says...
    > Daniel T. wrote:
    >
    > > Jerry Coffin <> wrote:


    [ ... ]

    > >> giving constant complexity for each search. If you have a large
    > >> character set, that could waste a lot of storage though. If that's a
    > >> problem, you could sort the "bad" string, and use a binary search
    > >> instead. For a list of only 5 characters (or so) that would be a waste
    > >> of time and effort -- but if (for example) its a few hundred characters
    > >> long, the time saved could be considerable.


    [ ... ]

    > > You are right, much more efficient.

    >
    > Hm, this might well be a case where logarithmic is slower than linear for
    > all practically relevant cases. Due to its tree-like implementation,
    > std::set is not very local in memory and can cause cache misses. For a
    > logarithmic solution, I would instead consider something based on
    > std::vector, e.g., like so:


    Yes, this is much closer to what I suggested -- I had suggested sorting
    the string (i.e. sorting the characters in the string) but there's
    little practical difference between sorting characters in a string and
    sorting them in a vector instead. In any case, I agree that a set is
    rarely (if ever) likely to be what you really want.

    A set works well if you're going to modify the contents on a regular
    basis, as it allows logarithmic inserts and deletes. When/if the
    collection remains (nearly) static after creation, you're generally
    better off using something with contiguous storage (string, vector,
    deque,...).

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
    Jerry Coffin, May 26, 2008
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. =?Utf-8?B?SklNLkgu?=

    Q: remove a certain string in string

    =?Utf-8?B?SklNLkgu?=, Feb 18, 2005, in forum: ASP .Net
    Replies:
    2
    Views:
    341
    Joshua Flanagan
    Feb 28, 2005
  2. Replies:
    0
    Views:
    644
  3. Replies:
    18
    Views:
    1,675
    Mike Wahler
    Oct 26, 2005
  4. rvino
    Replies:
    0
    Views:
    4,648
    rvino
    Aug 14, 2007
  5. Replies:
    2
    Views:
    516
    bruce barker
    Mar 25, 2008
Loading...

Share This Page