Q: most efficient way to remove duplicate spaces in a string?

Mark Watson · Jan 9, 2009

I don't usually worry too much about efficiency unless runtime
performance becomes an issue so I use a_string.split.join(' ') to
remove extra spaces because the code is short and readable.

This is certainly inefficient. What is the fastest idiom for this
operation?

Thanks,
Mark

Chris Shea · Jan 9, 2009

I don't usually worry too much about efficiency unless runtime
performance becomes an issue so I use a_string.split.join(' ') to
remove extra spaces because the code is short and readable.

This is certainly inefficient. What is the fastest idiom for this
operation?

Thanks,
Mark

Haven't benchmarked it, but this sounds like a case for regular
expressions.

a_string.gsub!(/ +/, ' ')

HTH,
Chris

pat eyler · Jan 9, 2009

Haven't benchmarked it,

require 'benchmark'

n = 1_000_000

Benchmark.bm(10) do |x|
x.report("gsub") { n.times do
"This is a test string.".gsub(/ +/, ' ')
end
}
x.report("gsub!") { n.times do
"This is a test string.".gsub!(/ +/, ' ')
end
}
x.report("split.join") { n.times do
"This is a test string.".split.join(' ')
end
}
end

user system total real
gsub 6.310000 0.080000 6.390000 ( 7.135028)
gsub! 6.300000 0.070000 6.370000 ( 7.064374)
split.join 3.940000 0.070000 4.010000 ( 4.160529)

or, with bmbm

pate@pate:~/scratch$ ./stringcleanup.rb
Rehearsal ----------------------------------------------
gsub 6.290000 0.170000 6.460000 ( 7.956565)
gsub! 6.370000 0.170000 6.540000 ( 8.128309)
split.join 3.990000 0.080000 4.070000 ( 4.832332)
------------------------------------ total: 17.070000sec

user system total real
gsub 6.220000 0.130000 6.350000 ( 7.613061)
gsub! 6.390000 0.080000 6.470000 ( 8.284855)
split.join 3.970000 0.120000 4.090000 ( 5.684406)

Please, benchmark first. It's not that hard:

http://on-ruby.blogspot.com/2008/12/benchmarking-makes-it-better.html

Patrick Doyle · Jan 9, 2009

[Note: parts of this message were removed to make it a legal post.]

I don't usually worry too much about efficiency unless runtime
performance becomes an issue so I use a_string.split.join(' ') to
remove extra spaces because the code is short and readable.

This is certainly inefficient. What is the fastest idiom for this
operation?

I was looking at the docs the other day and stumbled across String#squeeze.
You might try benchmarking that.

--wpd

Robert Klemme · Jan 9, 2009

require 'benchmark'

n = 1_000_000

Benchmark.bm(10) do |x|
x.report("gsub") { n.times do
"This is a test string.".gsub(/ +/, ' ')
end
}
x.report("gsub!") { n.times do
"This is a test string.".gsub!(/ +/, ' ')
end
}
x.report("split.join") { n.times do
"This is a test string.".split.join(' ')
end
}
end

Note that these behave differently. A more appropriate comparison would
be to use /\s+/ as regular expression for gsub!.

Interesting figures btw. I would have guessed that gsub! is fastest -
live and learn.

Kind regards

robert

Craig Demyanovich · Jan 9, 2009

[Note: parts of this message were removed to make it a legal post.]

I was just doing that.

require 'benchmark'

n = 1_000_000

Benchmark.bm(10) do |x|
x.report("gsub") { n.times do
"This is a test string.".gsub(/ +/, ' ')
end
}
x.report("gsub!") { n.times do
"This is a test string.".gsub!(/ +/, ' ')
end
}
x.report("split.join") { n.times do
"This is a test string.".split.join(' ')
end
}
x.report("squeeze.strip") { n.times do
"This is a test string.".squeeze(' ').strip
end
}
end

user system total real
gsub 4.470000 0.040000 4.510000 ( 4.578173)
gsub! 4.390000 0.020000 4.410000 ( 4.468695)
split.join 3.500000 0.020000 3.520000 ( 3.556669)
squeeze.strip 1.980000 0.010000 1.990000 ( 2.003622)

Regards,
Craig

Tim Greer · Jan 9, 2009

Craig said:
[Note: parts of this message were removed to make it a legal post.]

I was just doing that.

require 'benchmark'

n = 1_000_000

Benchmark.bm(10) do |x|
x.report("gsub") { n.times do
"This is a test string.".gsub(/ +/, ' ')
end
}
x.report("gsub!") { n.times do
"This is a test string.".gsub!(/ +/, ' ')
end
}
x.report("split.join") { n.times do
"This is a test string.".split.join(' ')
end
}
x.report("squeeze.strip") { n.times do
"This is a test string.".squeeze(' ').strip
end
}
end

user system total real
gsub 4.470000 0.040000 4.510000 ( 4.578173)
gsub! 4.390000 0.020000 4.410000 ( 4.468695)
split.join 3.500000 0.020000 3.520000 ( 3.556669)
squeeze.strip 1.980000 0.010000 1.990000 ( 2.003622)

Regards,
Craig

The ruby version plays a big role in squeeze.strip, as it's much slower
than split.join with an older version on one of my systems.

Also, try the following variations and see the slight speed increase for
gsub/gsub! (small increase, but interesting to note):

I.e.:

n = 1_000_000

string = "This is a test string."

Benchmark.bm(10) do |x|
x.report("gsub") { n.times do
string.gsub(/ +/, ' ')
end
}
x.report("gsub #2") { n.times do
string.gsub(/ {2,}/, ' ')
end
}
x.report("gsub #3") { n.times do
string.gsub(/ +/, ' ')
end
}

old ruby 1.8.1 on one system:

user system total real
gsub 9.090000 0.000000 9.090000 ( 9.111154)
gsub #2 8.600000 0.000000 8.600000 ( 8.643407)
gsub #3 7.560000 0.000000 7.560000 ( 7.579185)
gsub! 8.060000 0.000000 8.060000 ( 8.061861)
gsub! #2 8.110000 0.000000 8.110000 ( 8.115992)
split.join 4.730000 0.000000 4.730000 ( 4.733248)
squeeze.strip 6.140000 0.000000 6.140000 ( 6.132690)

Ruby 1.8.7 on another system:

user system total real
gsub 3.660000 0.000000 3.660000 ( 3.663209)
gsub #2 3.450000 0.000000 3.450000 ( 3.455171)
gsub #3 3.070000 0.000000 3.070000 ( 3.065939)
gsub! 3.500000 0.000000 3.500000 ( 3.517176)
gsub! #2 3.510000 0.000000 3.510000 ( 3.506881)
split.join 2.550000 0.000000 2.550000 ( 2.560460)
squeeze.strip 1.580000 0.000000 1.580000 ( 1.588717)

I'll see if I can get a chance to upgrade ruby on the older system to
see the results.

Mark Watson · Jan 9, 2009

I don't usually worry too much about efficiency unless runtime
performance becomes an issue so I use a_string.split.join(' ') to
remove extra spaces because the code is short and readable.

This is certainly inefficient. What is the fastest idiom for this
operation?

Thanks,
Mark

Thanks to everyone who responded!

Best regards,
Mark

Clifford Heath · Jan 9, 2009

Robert said:
Note that these behave differently...
Interesting figures btw. I would have guessed that gsub! is fastest -
live and learn

It still might be - the benchmark doesn't run long enough to
compare the GC overhead of making dozens of little strings
that get used once each.

pat eyler · Jan 10, 2009

It still might be - the benchmark doesn't run long enough to
compare the GC overhead of making dozens of little strings
that get used once each.

Is this better?

#!/usr/bin/ruby

require 'benchmark'

n = 1_000_000

Benchmark.bm(10) do |x|
x.report("gsub") { n.times do
a_string = "This is a test string."
a_string.gsub(/ +/, ' ')
end
}
x.report("gsub!") { n.times do
a_string = "This is a test string."
a_string.gsub!(/ +/, ' ')
end
}
x.report("split.join") { n.times do
a_string = "This is a test string."
a_string.split.join(' ')
end
}
x.report("squeeze.strip") { n.times do
a_string = "This is a test string."
a_string.squeeze(' ').strip
end
}
end

$ ./stringcleanup.rb
user system total real
gsub 7.170000 0.140000 7.310000 ( 11.034766)
gsub! 6.900000 0.150000 7.050000 ( 14.113575)
split.join 4.450000 0.110000 4.560000 ( 7.518476)
squeeze.strip 3.180000 0.100000 3.280000 ( 4.943830)

Tim Greer · Jan 10, 2009

pat said:
Is this better?

#!/usr/bin/ruby

require 'benchmark'

n = 1_000_000

Benchmark.bm(10) do |x|
x.report("gsub") { n.times do
a_string = "This is a test string."
a_string.gsub(/ +/, ' ')
end
}
x.report("gsub!") { n.times do
a_string = "This is a test string."
a_string.gsub!(/ +/, ' ')
end
}
x.report("split.join") { n.times do
a_string = "This is a test string."
a_string.split.join(' ')
end
}
x.report("squeeze.strip") { n.times do
a_string = "This is a test string."
a_string.squeeze(' ').strip
end
}
end

$ ./stringcleanup.rb
user system total real
gsub 7.170000 0.140000 7.310000 ( 11.034766)
gsub! 6.900000 0.150000 7.050000 ( 14.113575)
split.join 4.450000 0.110000 4.560000 ( 7.518476)
squeeze.strip 3.180000 0.100000 3.280000 ( 4.943830)

For gsub/gsub!, instead of replacing one or more white space with a
white space, speed it up by replacing two or more white space with a
white space. This saves unneeded processing by not replacing single
white space. I.e., instead of gsub(/ +/, ' '), try gsub(/ +/, ' ') or
gsub(/ {2,}/, ' ') and benchmark them (they should be faster). Of
course, it's still not going to be as fast as split.join or
squeeze.strip (at least depending on the version of ruby used, as older
ruby versions may put squeeze.strip markedly slower than split.join.

pat eyler · Jan 10, 2009

For gsub/gsub!, instead of replacing one or more white space with a
white space, speed it up by replacing two or more white space with a
white space. This saves unneeded processing by not replacing single
white space. I.e., instead of gsub(/ +/, ' '), try gsub(/ +/, ' ') or
gsub(/ {2,}/, ' ') and benchmark them (they should be faster). Of
course, it's still not going to be as fast as split.join or
squeeze.strip (at least depending on the version of ruby used, as older
ruby versions may put squeeze.strip markedly slower than split.join.

#!/usr/bin/ruby

require 'benchmark'

n = 1_000_000

Benchmark.bm(15) do |x|
x.report("gsub") { n.times do
a_string = "This is a test string."
a_string.gsub(/ +/, ' ')
end
}
x.report("gsub!") { n.times do
a_string = "This is a test string."
a_string.gsub!(/ +/, ' ')
end
}
x.report("gsub2") { n.times do
a_string = "This is a test string."
a_string.gsub(/ {2,}/, ' ')
end
}
x.report("gsub!2") { n.times do
a_string = "This is a test string."
a_string.gsub!(/ {2,}/, ' ')
end
}
x.report("gsub s") { n.times do
a_string = "This is a test string."
a_string.gsub(/\s+/, ' ')
end
}
x.report("gsub! s") { n.times do
a_string = "This is a test string."
a_string.gsub!(/\s+/, ' ')
end
}
x.report("gsub2 s") { n.times do
a_string = "This is a test string."
a_string.gsub(/\s{2,}/, ' ')
end
}
x.report("gsub!2 s") { n.times do
a_string = "This is a test string."
a_string.gsub!(/\s{2,}/, ' ')
end
}
x.report("split.join") { n.times do
a_string = "This is a test string."
a_string.split.join(' ')
end
}
x.report("squeeze.strip") { n.times do
a_string = "This is a test string."
a_string.squeeze(' ').strip
end
}
end

$ ./stringcleanup.rb
user system total real
gsub 7.250000 0.120000 7.370000 ( 8.386002)
gsub! 7.240000 0.130000 7.370000 ( 9.007479)
gsub2 7.110000 0.140000 7.250000 ( 9.302915)
gsub!2 6.830000 0.150000 6.980000 ( 9.309362)
gsub s 7.410000 0.140000 7.550000 ( 10.864572)
gsub! s 7.400000 0.130000 7.530000 ( 10.286886)
gsub2 s 7.020000 0.100000 7.120000 ( 8.977424)
gsub!2 s 6.860000 0.100000 6.960000 ( 8.421220)
split.join 4.420000 0.110000 4.530000 ( 5.716684)
squeeze.strip 3.120000 0.110000 3.230000 ( 3.651918)

I haven't run it enough to do any statistical analysis, but after
5 runs it seems like the various flavors of gsub/gsub! don't really
perform any differently. splitting and squeezing look like much
better options.

Tim Greer · Jan 10, 2009

pat said:
#!/usr/bin/ruby

require 'benchmark'

n = 1_000_000

Benchmark.bm(15) do |x|
x.report("gsub") { n.times do
a_string = "This is a test string."
a_string.gsub(/ +/, ' ')
end
}
x.report("gsub!") { n.times do
a_string = "This is a test string."
a_string.gsub!(/ +/, ' ')
end
}
x.report("gsub2") { n.times do
a_string = "This is a test string."
a_string.gsub(/ {2,}/, ' ')
end
}
x.report("gsub!2") { n.times do
a_string = "This is a test string."
a_string.gsub!(/ {2,}/, ' ')
end
}
x.report("gsub s") { n.times do
a_string = "This is a test string."
a_string.gsub(/\s+/, ' ')
end
}
x.report("gsub! s") { n.times do
a_string = "This is a test string."
a_string.gsub!(/\s+/, ' ')
end
}
x.report("gsub2 s") { n.times do
a_string = "This is a test string."
a_string.gsub(/\s{2,}/, ' ')
end
}
x.report("gsub!2 s") { n.times do
a_string = "This is a test string."
a_string.gsub!(/\s{2,}/, ' ')
end
}
x.report("split.join") { n.times do
a_string = "This is a test string."
a_string.split.join(' ')
end
}
x.report("squeeze.strip") { n.times do
a_string = "This is a test string."
a_string.squeeze(' ').strip
end
}
end

$ ./stringcleanup.rb
user system total real
gsub 7.250000 0.120000 7.370000 ( 8.386002)
gsub! 7.240000 0.130000 7.370000 ( 9.007479)
gsub2 7.110000 0.140000 7.250000 ( 9.302915)
gsub!2 6.830000 0.150000 6.980000 ( 9.309362)
gsub s 7.410000 0.140000 7.550000 ( 10.864572)
gsub! s 7.400000 0.130000 7.530000 ( 10.286886)
gsub2 s 7.020000 0.100000 7.120000 ( 8.977424)
gsub!2 s 6.860000 0.100000 6.960000 ( 8.421220)
split.join 4.420000 0.110000 4.530000 ( 5.716684)
squeeze.strip 3.120000 0.110000 3.230000 ( 3.651918)

I haven't run it enough to do any statistical analysis, but after
5 runs it seems like the various flavors of gsub/gsub! don't really
perform any differently. splitting and squeezing look like much
better options.

Yeah, it doesn't make a huge difference, but it's just a little faster
is all. Definitely split.join is faster. squeeze.strip is the fastest
(though that will actually take longer on older versions of ruby). My
follow up was really just a reference for gsub/gsub!, not that it would
be faster than the other two.

Clifford Heath · Jan 10, 2009

pat said:
Is this better?

No. Before I elaborate, I'm not saying I don't believe the result.
I'm saying your benchmark figure won't accurately represent the
cost in a long-running application.

All versions create an initial million strings - perhaps 40 MB.
Your computer has what, 2GB of memory? At what point does the GC run?

The gsub version creates a million extra strings.
The gsub! version creates, perhaps no extra strings, perhaps a million.
The split version creates *six* million extra strings (one per word and one from join)
The squeeze version creates two million (from squeeze, and from strip).

Now depending on whether the string in your real-life application
is an HTML document with a thousand white-space runs, how many
extra strings do the respective versions take? split makes a *billion*.

A benchmark environment must consider that the code being tested
will run in the context of an application where there are many
other objects created by all the *other* code - perhaps a thousand
times as many objects. At some point, the garbage collector is
likely to run. That takes time, and the time should be part of the
benchmark.

Try squeezing said HTML document a million times, and run the GC
inside each benchmark timer (after the n.times loop). Then I'll
be happy ;-).

Clifford Heath.

[ANN] SAMSUNG to produce "Ruby on Rails in Silicon" System on a Chip	1	Aug 31, 2011
In the Matter of Herb Schildt: a Detailed Analysis of "C: TheComplete Nonsense"	109	Apr 3, 2010
This is why Ruby 1.8.6 can never be made to run anywhere near as fast as Python 2.5.1	61	Sep 23, 2007
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	15	Apr 1, 2006
PATCH: Speed up direct string concatenation by 20+%!	34	Sep 28, 2006
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Jan 12, 2008
Sencha Touch--Support 2 browsers in just 228K!	64	Jul 15, 2010
[ANN] JRuby 1.2.0 Released	1	Mar 16, 2009

Q: most efficient way to remove duplicate spaces in a string?

Mark Watson

Chris Shea

pat eyler

Patrick Doyle

Robert Klemme

Craig Demyanovich

Tim Greer

Mark Watson

Clifford Heath

pat eyler

Tim Greer

pat eyler

Tim Greer

Clifford Heath

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads