remove certain characters from a string

B

Brad

I'm writing a function to remove certain characters from strings. For
example, I often get strings with commas... they look like this:

"12,384"

I'd like to take that string, remove the comma and return "12384"

What is the most efficient, fastest way to approach this?

Thanks,

Brad
 
I

Ian Collins

Brad said:
I'm writing a function to remove certain characters from strings. For
example, I often get strings with commas... they look like this:

"12,384"

I'd like to take that string, remove the comma and return "12384"

What is the most efficient, fastest way to approach this?
Probably a character by character copy, skipping the characters you want
to remove.
 
Joined
May 24, 2008
Messages
1
Reaction score
0
what i feel is, u can create a temp array, and start searching from the existing array for numbers( ucan use isdigit()) and dump it to the temp array.
then apply atoi() to the emp array and get the result.

i dont know if it efficient or not. Do u have any other solution for it ?
 
B

Brad

Paavo said:
std::string s("12,384,586");
std::string::size_type k = 0;
while((k=s.find(',',k))!=s.npos) {
s.erase(k, 1);
}

this ok?
Paavo

Thanks, works well with one character... How would I make it work with
several... like so:

#include <iostream>

std::string clean(std::string the_string)
{
char bad[] = {'.', ',', ' ', '|', '\0'};
std::string::size_type n = 0;
while ((n=the_string.find(bad, n))!=the_string.npos)
{
the_string.erase(n,1);
}
std::cout << the_string << std::endl;
return the_string;
}

int main()
{
clean("12,34.45|78 9");
return 0;
}
 
F

Frank Birbacher

Brad said:
Thanks, works well with one character... How would I make it work with
several... like so:

struct CheckForBad
{
const string chars;
CheckForBad(string const& chars)
: chars(sortUniq(chars))
{}
static string sortUniq(string tmp)
{
sort(tmp.begin(), tmp.end());
tmp.erase(
unique(tmp.begin(), tmp.end()),
tmp.end()
);
return tmp;
}
bool operator() (char const c) const
{
return binary_search(
chars.begin(), chars.end(),
c
);
}
};

string removeStuff(string text, string const& stuff)
{
text.erase(
remove_if(
text.begin(), text.end(),
CheckForBad(stuff)
),
text.end()
);
return text;
}

//or:
string removeStd(string text)
{
static const CheckForBad check(".,| ");
text.erase(
remove_if(
text.begin(), text.end(),
check
),
text.end()
);
return text;
}

not tested

Regards,
Frank
 
B

byte8bits

std::string s("12,384,586");
std::string::size_type k = 0;
while((k=s.find(',',k))!=s.npos) {
s.erase(k, 1);

}

this ok?
Paavo

This is ugly... based on what Pavvo posted. It works. What do you guys
think? Bad?

#include <iostream>

std::string clean(std::string the_string)
{
char bad[] = {'.', ',', ' ', '|', '\0'};
std::string::size_type n = 0;

while ((n=the_string.find(bad[0], n))!=the_string.npos)
{
the_string.erase(n,1);
}

n = 0;
while ((n=the_string.find(bad[1], n))!=the_string.npos)
{
the_string.erase(n,1);
}

n = 0;
while ((n=the_string.find(bad[2], n))!=the_string.npos)
{
the_string.erase(n,1);
}

n = 0;
while ((n=the_string.find(bad[3], n))!=the_string.npos)
{
the_string.erase(n,1);
}

std::cout << the_string << std::endl;
return the_string;
}

int main()
{
clean("12,,34..56||78 9");
return 0;
}
 
I

Ian Collins

Brad said:
Paavo said:
std::string s("12,384,586");
std::string::size_type k = 0;
while((k=s.find(',',k))!=s.npos) {
s.erase(k, 1);
}

this ok?
Paavo

Thanks, works well with one character... How would I make it work with
several... like so:

#include <iostream>

std::string clean(std::string the_string)
{
char bad[] = {'.', ',', ' ', '|', '\0'};
std::string::size_type n = 0;
while ((n=the_string.find(bad, n))!=the_string.npos)
{
the_string.erase(n,1);
}
std::cout << the_string << std::endl;
return the_string;
}

If you want a simple loop:

std::string clean( const std::string& the_string )
{
static const std::string bad("., |");

std::string result;

for( std::string::const_iterator c = the_string.begin();
c != the_string.end(); ++c )
{
if( bad.find( *c ) == std::string::npos )
{
result += *c;
}
}

return result;
}
 
F

Frank Birbacher

Hi!

Looking at the various solutions to the original problem, I wanted to
state my design goals so one could make a resonable decision about which
code to use.

I try to keep the algorithmic complexity low. I try to reuse code, that
is I use the STL and therefore I stick to its idioms.

Regards,
Frank
 
J

jason.cipriani

Hi!

Looking at the various solutions to the original problem, I wanted to
state my design goals so one could make a resonable decision about which
code to use.

I try to keep the algorithmic complexity low. I try to reuse code, that
is I use the STL and therefore I stick to its idioms.

Hmm... I see a lot of really complex and strange code here when it's
not really necessary. Most of what people posted requires multiple
passes through the string, or a lot of shifting of bytes around (e.g.
something like Paavo's "while (string contains char) remove_char" is
going to do -way- more moving of data than necessary -- it shifts the
entire end of the string back every time through the loop). Sticking
to generic STL calls for finding and removing characters in the string
gains you nothing unless you are going to be finding and removing
elements from generic containers that don't provide random access
iterators (in which case the generic programming is a benefit). The
use of remove_if, such as in Frank's example, will get you equal
performance to the example below (remove_if may very well be
implemented the same way), except Frank's sort + binary search is
likely to have more overhead then a simple linear search for your
original requirements of removing a set of 3 or 4 bad characters only
(however, for removing large character sets, a binary search will
perform better, the sort is unnecessary if the input is sorted to
begin with -- but you can do the search in -constant- time, with no
pre-sorting either, if you make some assumptions about the max value
of a char and load valid characters into a lookup table first). You
know that you are using a string (or any type with random access
iterators). Just do something like this:

In-place, single pass through string, no unnecessary copies or moves:

void remove_chars (const string &bad, string &str) {
string::iterator s, d;
for (s = str.begin(), d = s; s != str.end(); ++ s)
if (bad.find(*s) == string::npos)
*(d ++) = *s;
str.resize(d - str.begin());
}

That works because 'd' will always be behind or at the same position
as 's'. That in-place version can be made to work with generic
iterators as well as random access iterators if you replace the
resize() call with "erase(d, str.end())". Here is the same thing,
places result in destination buffer:

void remove_chars (const string &bad, const string &str, string
&clean) {
string::const_iterator s;
clean = "";
clean.reserve(str.size()); // do not perform extra realloc + copies.
for (s = str.begin(); s != str.end(); ++ s)
if (bad.find(*s) == string::npos)
clean += *s;
}

Example use:

{
string s = "a m|e|s|s|y s,t,r,i,n,g", c;
remove_chars("mesy", s, c);
remove_chars("|,", s);
cout << c << endl << s << endl;
}


Jason
 
I

Ian Collins

void remove_chars (const string &bad, const string &str, string
&clean) {
string::const_iterator s;
clean = "";
clean.reserve(str.size()); // do not perform extra realloc + copies.
for (s = str.begin(); s != str.end(); ++ s)
if (bad.find(*s) == string::npos)
clean += *s;
}
Isn't that just about (excluding the reserve) identical to the solution
I posted?
 
J

Juha Nieminen

Paavo said:
std::string s("12,384,586");
std::string::size_type k = 0;
while((k=s.find(',',k))!=s.npos) {
s.erase(k, 1);
}

That's certainly not the most efficient way to do it.
 
J

jason.cipriani

Isn't that just about (excluding the reserve) identical to the solution
I posted?

Yes it is identical. I was responding to the big blocks of code I saw
being posted. The OP wandered down the "while (character found) remove
character" path instead of using your suggestion right off the bat.
I'm sorry, I should have mentioned it.

Jason
 
F

Frank Birbacher

Hi!
Jason's solution is simply to write the remove_if algorithm out by hand.

Which doesn't necessarily make the code more readable. But readability
contraditcs writing code fast/short. And writing code fast/short doesn't
mean the code will work. <ironic>But anyways, who actually wants code
that works?</ironic>

Well, remove_if is tested, the while-loop solution is not. I recently
was terrified by similar char-shifting code. I dislike it.

Frank
 
J

Jerry Coffin

[ ... ]
In-place, single pass through string, no unnecessary copies or moves:

void remove_chars (const string &bad, string &str) {
string::iterator s, d;
for (s = str.begin(), d = s; s != str.end(); ++ s)
if (bad.find(*s) == string::npos)
*(d ++) = *s;
str.resize(d - str.begin());
}

While this is strictly linear in terms of the manipulation of the target
string, it's still O(N*N) with respect to searches of the string holding
characters to remove.

If your list of characters to remove is very long, you might want to use
a bitmap indexed by the characters to determine whether a particular
character is allowed or not:

bool bitmap[UCHAR_MAX] = { false };

while (char *pos=bad; *pos; ++pos)
bitmap[*pos] = true;

Then instead of:

if (bad.find(*s) == string::npos)

you'd use something like:

if (!bitmap[*s])

giving constant complexity for each search. If you have a large
character set, that could waste a lot of storage though. If that's a
problem, you could sort the "bad" string, and use a binary search
instead. For a list of only 5 characters (or so) that would be a waste
of time and effort -- but if (for example) its a few hundred characters
long, the time saved could be considerable.
 
B

byte8bits

Yes it is identical. I was responding to the big blocks of code I saw
being posted. The OP wandered down the "while (character found) remove
character" path instead of using your suggestion right off the bat.
I'm sorry, I should have mentioned it.

Jason

Yes it is identical. I was responding to the big blocks of code I saw
being posted. The OP wandered down the "while (character found) remove
character" path instead of using your suggestion right off the bat.
I'm sorry, I should have mentioned it.

Jason

I wanted to thank everyone for this thread. Very good information. Are
there any C++ books that go into these sorts of considerations...
specifically performance when considering how to best optimize code or
does it just come with experience?

Thanks again,

Brad
 
J

jason.cipriani

Now we're going in circles :)

Ian said:
Probably a character by character copy, skipping the characters you
want to remove.

Frank said:
...
sort(tmp.begin(), tmp.end());
tmp.erase(
unique(tmp.begin(), tmp.end()),
tmp.end()
);
...
return binary_search(
Frank's sort + binary search is
likely to have more overhead then a simple linear search for your
original requirements of removing a set of 3 or 4 bad characters only
(however, for removing large character sets, a binary search will
perform better, the sort is unnecessary if the input is sorted to
begin with -- but you can do the search in -constant- time, with no
pre-sorting either, if you make some assumptions about the max value
of a char and load valid characters into a lookup table first)
(well; no assumptions, there's UCHAR_MAX, oops)
 
F

Frank Birbacher

W

William Xu

Daniel T. said:
struct contains_any_of : unary_function< char, bool >
{
set<char> str;
contains_any_of( const string& s ): str( s.begin(), s.end() ) { }
bool operator()( char c ) const {
return str.count( c ) == 1;

Maybe better `return str.find(c) != str.end();' ? That seems more
natural
to me.

--
William

http://williamxu.net9.org

I hate small towns because once you've seen the cannon in the park
there's nothing else to do.
-- Lenny Bruce
 
K

Kai-Uwe Bux

Daniel said:
Jerry Coffin said:
(e-mail address removed) says...

[ ... ]
In-place, single pass through string, no unnecessary copies or moves:

void remove_chars (const string &bad, string &str) {
string::iterator s, d;
for (s = str.begin(), d = s; s != str.end(); ++ s)
if (bad.find(*s) == string::npos)
*(d ++) = *s;
str.resize(d - str.begin());
}

While this is strictly linear in terms of the manipulation of the target
string, it's still O(N*N) with respect to searches of the string holding
characters to remove.

If your list of characters to remove is very long, you might want to use
a bitmap indexed by the characters to determine whether a particular
character is allowed or not:

bool bitmap[UCHAR_MAX] = { false };

while (char *pos=bad; *pos; ++pos)
bitmap[*pos] = true;

Then instead of:

if (bad.find(*s) == string::npos)

you'd use something like:

if (!bitmap[*s])

giving constant complexity for each search. If you have a large
character set, that could waste a lot of storage though. If that's a
problem, you could sort the "bad" string, and use a binary search
instead. For a list of only 5 characters (or so) that would be a waste
of time and effort -- but if (for example) its a few hundred characters
long, the time saved could be considerable.

Good comment! I didn't think of that angle. The nice thing about using
the remove_copy_if algorithm and a functor (the solution I presented
yesterday) is that the time and effort to convert to a binary search is
minimal...

struct contains_any_of : unary_function< char, bool >
{
string str;
contains_any_of( const string& s ): str( s ) { }
bool operator()( char c ) const {
return find( str.begin(), str.end(), c ) != str.end();
}
};

str.erase( remove_if( str.begin(), str.end(),
contains_any_of( str2 ) ), str.end() );

Changing it simply requires a small chanage in the functor:

struct contains_any_of : unary_function< char, bool >
{
set<char> str;
contains_any_of( const string& s ): str( s.begin(), s.end() ) { }
bool operator()( char c ) const {
return str.count( c ) == 1;
}
};

You are right, much more efficient.

Hm, this might well be a case where logarithmic is slower than linear for
all practically relevant cases. Due to its tree-like implementation,
std::set is not very local in memory and can cause cache misses. For a
logarithmic solution, I would instead consider something based on
std::vector, e.g., like so:


struct matches_any_of : std::unary_function< char, bool > {

std::vector<char> the_str;

template < typename Iterator >
void assign ( Iterator from, Iterator to ) {
the_str.assign( from, to );
std::sort( the_str.begin(), the_str.end() );
the_str.erase( std::unique( the_str.begin(), the_str.end() ),
the_str.end() );
}

template < typename Iterator >
matches_any_of ( Iterator from, Iterator to ) {
assign( from, to );
}

matches_any_of ( char const * str ) {
assign( str, str+std::strlen( str ) );
}

matches_any_of ( std::string const & str ) {
assign( str.begin(), str.end() );
}

bool operator() ( char chr ) const {
return( std::binary_search
( the_str.begin(), the_str.end(), chr ) );
}

};


Of course, one should measure before betting on any of the solutions to be
the most efficient.


Best

Kai-Uwe Bux
 
J

Jerry Coffin

[ ... ]

[ ... ]
Hm, this might well be a case where logarithmic is slower than linear for
all practically relevant cases. Due to its tree-like implementation,
std::set is not very local in memory and can cause cache misses. For a
logarithmic solution, I would instead consider something based on
std::vector, e.g., like so:

Yes, this is much closer to what I suggested -- I had suggested sorting
the string (i.e. sorting the characters in the string) but there's
little practical difference between sorting characters in a string and
sorting them in a vector instead. In any case, I agree that a set is
rarely (if ever) likely to be what you really want.

A set works well if you're going to modify the contents on a regular
basis, as it allows logarithmic inserts and deletes. When/if the
collection remains (nearly) static after creation, you're generally
better off using something with contiguous storage (string, vector,
deque,...).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,772
Messages
2,569,591
Members
45,102
Latest member
GregoryGri
Top