remove certain characters from a string

Brad · May 24, 2008

I'm writing a function to remove certain characters from strings. For
example, I often get strings with commas... they look like this:

"12,384"

I'd like to take that string, remove the comma and return "12384"

What is the most efficient, fastest way to approach this?

Thanks,

Brad

Ian Collins · May 24, 2008

Brad said:
I'm writing a function to remove certain characters from strings. For
example, I often get strings with commas... they look like this:

"12,384"

I'd like to take that string, remove the comma and return "12384"

What is the most efficient, fastest way to approach this?

Probably a character by character copy, skipping the characters you want
to remove.

tarakant_sethy · May 24, 2008

what i feel is, u can create a temp array, and start searching from the existing array for numbers( ucan use isdigit()) and dump it to the temp array.
then apply atoi() to the emp array and get the result.

i dont know if it efficient or not. Do u have any other solution for it ?

Brad · May 24, 2008

Paavo said:
std::string s("12,384,586");
std::string::size_type k = 0;
while((k=s.find(',',k))!=s.npos) {
s.erase(k, 1);
}

this ok?
Paavo

Thanks, works well with one character... How would I make it work with
several... like so:

#include <iostream>

std::string clean(std::string the_string)
{
char bad[] = {'.', ',', ' ', '|', '\0'};
std::string::size_type n = 0;
while ((n=the_string.find(bad, n))!=the_string.npos)
{
the_string.erase(n,1);
}
std::cout << the_string << std::endl;
return the_string;
}

int main()
{
clean("12,34.45|78 9");
return 0;
}

Frank Birbacher · May 24, 2008

Brad said:
Thanks, works well with one character... How would I make it work with
several... like so:

struct CheckForBad
{
const string chars;
CheckForBad(string const& chars)
: chars(sortUniq(chars))
{}
static string sortUniq(string tmp)
{
sort(tmp.begin(), tmp.end());
tmp.erase(
unique(tmp.begin(), tmp.end()),
tmp.end()
);
return tmp;
}
bool operator() (char const c) const
{
return binary_search(
chars.begin(), chars.end(),
c
);
}
};

string removeStuff(string text, string const& stuff)
{
text.erase(
remove_if(
text.begin(), text.end(),
CheckForBad(stuff)
),
text.end()
);
return text;
}

//or:
string removeStd(string text)
{
static const CheckForBad check(".,| ");
text.erase(
remove_if(
text.begin(), text.end(),
check
),
text.end()
);
return text;
}

not tested

Regards,
Frank

byte8bits · May 24, 2008

std::string s("12,384,586");
std::string::size_type k = 0;
while((k=s.find(',',k))!=s.npos) {
s.erase(k, 1);

}

this ok?
Paavo

This is ugly... based on what Pavvo posted. It works. What do you guys
think? Bad?

#include <iostream>

std::string clean(std::string the_string)
{
char bad[] = {'.', ',', ' ', '|', '\0'};
std::string::size_type n = 0;

while ((n=the_string.find(bad[0], n))!=the_string.npos)
{
the_string.erase(n,1);
}

n = 0;
while ((n=the_string.find(bad[1], n))!=the_string.npos)
{
the_string.erase(n,1);
}

n = 0;
while ((n=the_string.find(bad[2], n))!=the_string.npos)
{
the_string.erase(n,1);
}

n = 0;
while ((n=the_string.find(bad[3], n))!=the_string.npos)
{
the_string.erase(n,1);
}

std::cout << the_string << std::endl;
return the_string;
}

int main()
{
clean("12,,34..56||78 9");
return 0;
}

Ian Collins · May 24, 2008

Brad said:
Paavo said:

std::string s("12,384,586");
std::string::size_type k = 0;
while((k=s.find(',',k))!=s.npos) {
s.erase(k, 1);
}

this ok?
Paavo

Click to expand...

Thanks, works well with one character... How would I make it work with
several... like so:

#include <iostream>

std::string clean(std::string the_string)
{
char bad[] = {'.', ',', ' ', '|', '\0'};
std::string::size_type n = 0;
while ((n=the_string.find(bad, n))!=the_string.npos)
{
the_string.erase(n,1);
}
std::cout << the_string << std::endl;
return the_string;
}

If you want a simple loop:

std::string clean( const std::string& the_string )
{
static const std::string bad("., |");

std::string result;

for( std::string::const_iterator c = the_string.begin();
c != the_string.end(); ++c )
{
if( bad.find( *c ) == std::string::npos )
{
result += *c;
}
}

return result;
}

Frank Birbacher · May 25, 2008

Hi!

Looking at the various solutions to the original problem, I wanted to
state my design goals so one could make a resonable decision about which
code to use.

I try to keep the algorithmic complexity low. I try to reuse code, that
is I use the STL and therefore I stick to its idioms.

Regards,
Frank

jason.cipriani · May 25, 2008

Hi!

Looking at the various solutions to the original problem, I wanted to
state my design goals so one could make a resonable decision about which
code to use.

I try to keep the algorithmic complexity low. I try to reuse code, that
is I use the STL and therefore I stick to its idioms.

Hmm... I see a lot of really complex and strange code here when it's
not really necessary. Most of what people posted requires multiple
passes through the string, or a lot of shifting of bytes around (e.g.
something like Paavo's "while (string contains char) remove_char" is
going to do -way- more moving of data than necessary -- it shifts the
entire end of the string back every time through the loop). Sticking
to generic STL calls for finding and removing characters in the string
gains you nothing unless you are going to be finding and removing
elements from generic containers that don't provide random access
iterators (in which case the generic programming is a benefit). The
use of remove_if, such as in Frank's example, will get you equal
performance to the example below (remove_if may very well be
implemented the same way), except Frank's sort + binary search is
likely to have more overhead then a simple linear search for your
original requirements of removing a set of 3 or 4 bad characters only
(however, for removing large character sets, a binary search will
perform better, the sort is unnecessary if the input is sorted to
begin with -- but you can do the search in -constant- time, with no
pre-sorting either, if you make some assumptions about the max value
of a char and load valid characters into a lookup table first). You
know that you are using a string (or any type with random access
iterators). Just do something like this:

In-place, single pass through string, no unnecessary copies or moves:

void remove_chars (const string &bad, string &str) {
string::iterator s, d;
for (s = str.begin(), d = s; s != str.end(); ++ s)
if (bad.find(*s) == string::npos)
*(d ++) = *s;
str.resize(d - str.begin());
}

That works because 'd' will always be behind or at the same position
as 's'. That in-place version can be made to work with generic
iterators as well as random access iterators if you replace the
resize() call with "erase(d, str.end())". Here is the same thing,
places result in destination buffer:

void remove_chars (const string &bad, const string &str, string
&clean) {
string::const_iterator s;
clean = "";
clean.reserve(str.size()); // do not perform extra realloc + copies.
for (s = str.begin(); s != str.end(); ++ s)
if (bad.find(*s) == string::npos)
clean += *s;
}

Example use:

{
string s = "a m|e|s|s|y s,t,r,i,n,g", c;
remove_chars("mesy", s, c);
remove_chars("|,", s);
cout << c << endl << s << endl;
}

Jason

Ian Collins · May 25, 2008

void remove_chars (const string &bad, const string &str, string
&clean) {
string::const_iterator s;
clean = "";
clean.reserve(str.size()); // do not perform extra realloc + copies.
for (s = str.begin(); s != str.end(); ++ s)
if (bad.find(*s) == string::npos)
clean += *s;
}

Isn't that just about (excluding the reserve) identical to the solution
I posted?

Juha Nieminen · May 25, 2008

Paavo said:
std::string s("12,384,586");
std::string::size_type k = 0;
while((k=s.find(',',k))!=s.npos) {
s.erase(k, 1);
}

That's certainly not the most efficient way to do it.

jason.cipriani · May 25, 2008

Isn't that just about (excluding the reserve) identical to the solution
I posted?

Yes it is identical. I was responding to the big blocks of code I saw
being posted. The OP wandered down the "while (character found) remove
character" path instead of using your suggestion right off the bat.
I'm sorry, I should have mentioned it.

Jason

Frank Birbacher · May 25, 2008

Hi!

Jason's solution is simply to write the remove_if algorithm out by hand.

Which doesn't necessarily make the code more readable. But readability
contraditcs writing code fast/short. And writing code fast/short doesn't
mean the code will work. <ironic>But anyways, who actually wants code
that works?</ironic>

Well, remove_if is tested, the while-loop solution is not. I recently
was terrified by similar char-shifting code. I dislike it.

Frank

Jerry Coffin · May 25, 2008

[ ... ]

In-place, single pass through string, no unnecessary copies or moves:

void remove_chars (const string &bad, string &str) {
string::iterator s, d;
for (s = str.begin(), d = s; s != str.end(); ++ s)
if (bad.find(*s) == string::npos)
*(d ++) = *s;
str.resize(d - str.begin());
}

While this is strictly linear in terms of the manipulation of the target
string, it's still O(N*N) with respect to searches of the string holding
characters to remove.

If your list of characters to remove is very long, you might want to use
a bitmap indexed by the characters to determine whether a particular
character is allowed or not:

bool bitmap[UCHAR_MAX] = { false };

while (char *pos=bad; *pos; ++pos)
bitmap[*pos] = true;

Then instead of:

if (bad.find(*s) == string::npos)

you'd use something like:

if (!bitmap[*s])

giving constant complexity for each search. If you have a large
character set, that could waste a lot of storage though. If that's a
problem, you could sort the "bad" string, and use a binary search
instead. For a list of only 5 characters (or so) that would be a waste
of time and effort -- but if (for example) its a few hundred characters
long, the time saved could be considerable.

byte8bits · May 25, 2008

Yes it is identical. I was responding to the big blocks of code I saw
being posted. The OP wandered down the "while (character found) remove
character" path instead of using your suggestion right off the bat.
I'm sorry, I should have mentioned it.

Jason

I wanted to thank everyone for this thread. Very good information. Are
there any C++ books that go into these sorts of considerations...
specifically performance when considering how to best optimize code or
does it just come with experience?

Thanks again,

Brad

jason.cipriani · May 25, 2008

Now we're going in circles

Ian said:
Probably a character by character copy, skipping the characters you
want to remove.

Frank said:
...
sort(tmp.begin(), tmp.end());
tmp.erase(
unique(tmp.begin(), tmp.end()),
tmp.end()
);
...
return binary_search(

Frank's sort + binary search is
likely to have more overhead then a simple linear search for your
original requirements of removing a set of 3 or 4 bad characters only
(however, for removing large character sets, a binary search will
perform better, the sort is unnecessary if the input is sorted to
begin with -- but you can do the search in -constant- time, with no
pre-sorting either, if you make some assumptions about the max value
of a char and load valid characters into a lookup table first)

(well; no assumptions, there's UCHAR_MAX, oops)

Frank Birbacher · May 25, 2008

Hi!

I wanted to thank everyone for this thread. Very good information. Are
there any C++ books that go into these sorts of considerations...
specifically performance when considering how to best optimize code or
does it just come with experience?

Two rules of program optimization:
http://wordaligned.org/articles/the-third-rule-of-program-optimisation

And:
http://en.wikipedia.org/wiki/Optimization_(computer_science)#When_to_optimize

Frank

William Xu · May 26, 2008

Daniel T. said:
struct contains_any_of : unary_function< char, bool >
{
set<char> str;
contains_any_of( const string& s ): str( s.begin(), s.end() ) { }
bool operator()( char c ) const {
return str.count( c ) == 1;

Maybe better `return str.find(c) != str.end();' ? That seems more
natural
to me.

}
};

--
William

http://williamxu.net9.org

I hate small towns because once you've seen the cannon in the park
there's nothing else to do.
-- Lenny Bruce

Kai-Uwe Bux · May 26, 2008

Daniel said:
Jerry Coffin said:

(e-mail address removed) says...

[ ... ]

In-place, single pass through string, no unnecessary copies or moves:

void remove_chars (const string &bad, string &str) {
string::iterator s, d;
for (s = str.begin(), d = s; s != str.end(); ++ s)
if (bad.find(*s) == string::npos)
*(d ++) = *s;
str.resize(d - str.begin());
}

Click to expand...

While this is strictly linear in terms of the manipulation of the target
string, it's still O(N*N) with respect to searches of the string holding
characters to remove.

If your list of characters to remove is very long, you might want to use
a bitmap indexed by the characters to determine whether a particular
character is allowed or not:

bool bitmap[UCHAR_MAX] = { false };

while (char *pos=bad; *pos; ++pos)
bitmap[*pos] = true;

Then instead of:

if (bad.find(*s) == string::npos)

you'd use something like:

if (!bitmap[*s])

giving constant complexity for each search. If you have a large
character set, that could waste a lot of storage though. If that's a
problem, you could sort the "bad" string, and use a binary search
instead. For a list of only 5 characters (or so) that would be a waste
of time and effort -- but if (for example) its a few hundred characters
long, the time saved could be considerable.

Click to expand...

Good comment! I didn't think of that angle. The nice thing about using
the remove_copy_if algorithm and a functor (the solution I presented
yesterday) is that the time and effort to convert to a binary search is
minimal...

struct contains_any_of : unary_function< char, bool >
{
string str;
contains_any_of( const string& s ): str( s ) { }
bool operator()( char c ) const {
return find( str.begin(), str.end(), c ) != str.end();
}
};

str.erase( remove_if( str.begin(), str.end(),
contains_any_of( str2 ) ), str.end() );

Changing it simply requires a small chanage in the functor:

struct contains_any_of : unary_function< char, bool >
{
set<char> str;
contains_any_of( const string& s ): str( s.begin(), s.end() ) { }
bool operator()( char c ) const {
return str.count( c ) == 1;
}
};

You are right, much more efficient.

Hm, this might well be a case where logarithmic is slower than linear for
all practically relevant cases. Due to its tree-like implementation,
std::set is not very local in memory and can cause cache misses. For a
logarithmic solution, I would instead consider something based on
std::vector, e.g., like so:

struct matches_any_of : std::unary_function< char, bool > {

std::vector<char> the_str;

template < typename Iterator >
void assign ( Iterator from, Iterator to ) {
the_str.assign( from, to );
std::sort( the_str.begin(), the_str.end() );
the_str.erase( std::unique( the_str.begin(), the_str.end() ),
the_str.end() );
}

template < typename Iterator >
matches_any_of ( Iterator from, Iterator to ) {
assign( from, to );
}

matches_any_of ( char const * str ) {
assign( str, str+std::strlen( str ) );
}

matches_any_of ( std::string const & str ) {
assign( str.begin(), str.end() );
}

bool operator() ( char chr ) const {
return( std::binary_search
( the_str.begin(), the_str.end(), chr ) );
}

};

Of course, one should measure before betting on any of the solutions to be
the most efficient.

Best

Kai-Uwe Bux

Jerry Coffin · May 26, 2008

[ ... ]

[ ... ]

Hm, this might well be a case where logarithmic is slower than linear for
all practically relevant cases. Due to its tree-like implementation,
std::set is not very local in memory and can cause cache misses. For a
logarithmic solution, I would instead consider something based on
std::vector, e.g., like so:

Yes, this is much closer to what I suggested -- I had suggested sorting
the string (i.e. sorting the characters in the string) but there's
little practical difference between sorting characters in a string and
sorting them in a vector instead. In any case, I agree that a set is
rarely (if ever) likely to be what you really want.

A set works well if you're going to modify the contents on a regular
basis, as it allows logarithmic inserts and deletes. When/if the
collection remains (nearly) static after creation, you're generally
better off using something with contiguous storage (string, vector,
deque,...).

Copy string from 2D array to a 1D array in C	1	Nov 1, 2023
Converting an Array to a String in JavaScript	7	Sep 22, 2023
remove certain words from a c++ string	18	Oct 23, 2005
FAQ 9.4 How do I remove HTML from a string?	0	Apr 10, 2011
Measuring a string of text	1	Sep 15, 2022
Remove Data from basic::string	8	Apr 1, 2009
Remove unwanted characters from column	9	Jan 27, 2014
casting from 'const string' to a 'non-constant string'	2	Oct 19, 2012

remove certain characters from a string

Brad

Ian Collins

tarakant_sethy

Brad

Frank Birbacher

byte8bits

Ian Collins

Frank Birbacher

jason.cipriani

Ian Collins

Juha Nieminen

jason.cipriani

Frank Birbacher

Jerry Coffin

byte8bits

jason.cipriani

Frank Birbacher

William Xu

Kai-Uwe Bux

Jerry Coffin

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads