Search string for occurrences of words stored in array

John Butler · Apr 30, 2008

Hi,

I have a sentence "This is my test sentence" and an array ["is", "the",
"my"], and what I need to do is find the occurrence of any of the array
words in the sentence.

I have this working in a loop, but I was wondering whether there is a way
to do it using one of Ruby's string methods.

It's similar to the include? method, but searching for multiple words, not
just one.

"This is my test sentence".include?("This") returns true

but I want something like

"This is my test sentence".include?("This", "is", "my")

Has anyone got a nice way to do this? I only need to find out whether one
of the words occurs, and then I exit.

JB

Phillip Gawlowski · Apr 30, 2008

How about '["is", "the", "my"].each'?

I.e.:

["is", "the", "my"].each do |word|
  break if "the test sentence".include? word
end

--
Phillip Gawlowski
Twitter: twitter.com/cynicalryan
Blog: http://justarubyist.blogspot.com

You know you've been hacking too long when...
...you dream that your SO and yourself are icons in a GUI and you can't
get close to each other because the window manager demands minimum space
between icons...

David A. Black · Apr 30, 2008

Hi --

You could use any?

irb(main):001:0> words = %w{ This is my }
=> ["This", "is", "my"]
irb(main):002:0> sentence = "This is my test sentence"
=> "This is my test sentence"
irb(main):003:0> words.any? {|word| sentence.include?(word) }
=> true
irb(main):004:0> sentence = "Hi"
=> "Hi"
irb(main):005:0> words.any? {|word| sentence.include?(word) }
=> false

Another possibility:

irb(main):009:0> sentence = "This is my test sentence"
=> "This is my test sentence"
irb(main):010:0> re = Regexp.new(words.join('|'))
=> /This|is|my/
irb(main):011:0> sentence =~ re
=> 0

David

--
Rails training from David A. Black and Ruby Power and Light:
INTRO TO RAILS June 9-12 Berlin
ADVANCING WITH RAILS June 16-19 Berlin
INTRO TO RAILS June 24-27 London (Skills Matter)
See http://www.rubypal.com for details and updates!

Jens Wille · Apr 30, 2008

Phillip Gawlowski [2008-04-30 16:09]:

How about '["is", "the", "my"].each'?

I.e.:

["is", "the", "my"].each do |word|
  break if "the test sentence".include? word
end

i'd prefer Enumerable#any?:

sentence, words = "This is my test sentence", ["This", "is", "my"]
words.any? { |word| sentence.include?(word) }

or Regexp:

sentence =~ Regexp.union(*words)

cheers
jens

--
Jens Wille, Dipl.-Bibl. (FH)
prometheus - Das verteilte digitale Bildarchiv für Forschung & Lehre
Kunsthistorisches Institut der Universität zu Köln
Albertus-Magnus-Platz, D-50923 Köln
Tel.: +49 (0)221 470-6668, E-Mail: (e-mail address removed)
http://www.prometheus-bildarchiv.de/

Jens Wille · Apr 30, 2008

ok, i withdraw my post. david's just quicker... ;-)

Jens Wille [2008-04-30 16:18]:

sentence =~ Regexp.union(*words)

one addition regarding the regexp, though. in case words may contain
special characters, it's safer to escape them first:

sentence =~ Regexp.union(*words.map { |word| Regexp.escape(word) })

cheers
jens

Ken Bloom · Apr 30, 2008

Ruby quiz #103: the DictionaryMatcher
http://www.rubyquiz.com/quiz103.html

You may need to do "This is my test sentence".split.any?{...} if it has
to specifically be on words. Note that
"I am running home".include? "run"
returns true, as does "abc def".include? "c d"

--Ken
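
Ken's distinction between substring matches and whole-word matches can be sketched as follows. This is a minimal illustration only; the variable names are made up for the example, and the sentence is the one from his note.

```ruby
# Substring check versus whole-word check.
sentence = "I am running home"
words    = ["run", "walk"]

# String#include? matches substrings, so "run" is found inside "running".
substring_hit = words.any? { |word| sentence.include?(word) }

# Splitting on whitespace first means only whole tokens are compared.
tokens   = sentence.split
word_hit = tokens.any? { |token| words.include?(token) }

puts substring_hit  # true
puts word_hit       # false
```

Note that String#split leaves punctuation attached to tokens ("home." is not equal to "home"), so text with punctuation would need something like scan(/\w+/) instead.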

David A. Black · Apr 30, 2008

Hi --

ok, i withdraw my post. david's just quicker... ;-)

Yeah, but yours is cooler because you remembered Regexp.union

Jens Wille [2008-04-30 16:18]:

sentence =~ Regexp.union(*words)

one addition regarding the regexp, though. in case words may contain
special characters, it's safer to escape them first:

sentence =~ Regexp.union(*words.map { |word| Regexp.escape(word) })

It actually does it for you:

Regexp.union("a",".b")
=> /a|\.b/

David


Jens Wille · Apr 30, 2008

David A. Black [2008-04-30 16:29]:

Jens Wille [2008-04-30 16:18]:

sentence =~ Regexp.union(*words)

one addition regarding the regexp, though. in case words may
contain special characters, it's safer to escape them first:

sentence =~ Regexp.union(*words.map { |word| Regexp.escape(word) })

It actually does it for you:

Regexp.union("a",".b") => /a|\.b/

ha, didn't know that ;-) thank you!
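
For readers who want to check the escaping behaviour themselves, here is a small sketch. The word list is invented for the example; it also passes the array directly to Regexp.union, which accepts an array as well as a splatted list.

```ruby
# Words containing regexp metacharacters (example data only).
words = ["a", ".b", "c+"]

# Regexp.union escapes each string before joining with "|".
re = Regexp.union(words)

puts re.source                             # a|\.b|c\+
puts("x.b" =~ re ? "match" : "no match")   # match (literal ".b")
puts("xb"  =~ re ? "match" : "no match")   # no match ("." is not a wildcard here)
```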

Roger Pack · Apr 30, 2008

I'd write my own:

class String
  def includes_all?(array)
    # stuff
  end
end
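
The post leaves the method body unspecified ("# stuff"), so the following is only one guess at how such a helper might look. Both method bodies, and the companion method includes_any? (which matches the original question more closely), are assumptions made for illustration.

```ruby
class String
  # Hypothetical body: true only if every entry occurs as a substring.
  def includes_all?(array)
    array.all? { |word| include?(word) }
  end

  # Hypothetical companion: true if at least one entry occurs as a substring.
  def includes_any?(array)
    array.any? { |word| include?(word) }
  end
end

sentence = "This is my test sentence"
puts sentence.includes_all?(["is", "the", "my"])  # false, "the" does not occur
puts sentence.includes_any?(["is", "the", "my"])  # true
```

Reopening String affects every string in the program, so a plain helper method may be the more conservative choice in shared code.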

Robert Klemme · Apr 30, 2008

Jens Wille [2008-04-30 16:18]:

i'd prefer Enumerable#any?:

sentence, words = "This is my test sentence", ["This", "is", "my"]
words.any? { |word| sentence.include?(word) }

I'd rather do it the other way round, i.e. iterate over the sentence and
test words since the sentence is potentially longer:

irb(main):001:0> require 'enumerator'
=> true
irb(main):002:0> require 'set'
=> true
irb(main):003:0> words = %w{This is my}.to_set
=> #<Set: {"my", "This", "is"}>
irb(main):004:0> "This is my test sentence".to_enum(:scan, /\w+/).any? {|w| words.include? w}
=> true
irb(main):005:0>

Kind regards

robert

David A. Black · Apr 30, 2008

Hi --

Is there any reason not to just do:

"This is my test sentence".scan(/\w+/).any? {|w| words.include? w }

David


Robert Klemme · Apr 30, 2008

Is there any reason not to just do:

"This is my test sentence".scan(/\w+/).any? {|w| words.include? w }

Yes. I used to_enum(:scan, /\w+/) because in this class of problems the
text (sentence) tends to be large. The approach using to_enum does
the test while traversing, while the scan approach first converts the whole
text into words and then applies the test, thus iterating twice over the
whole text, doing more conversions (to words), and needing more
temporary memory (i.e. for the whole sequence of words, although the
overhead might be small because of internal String memory sharing).

The Set approach scales better for larger sets of words because the Set
lookup is O(1) while an Array based lookup is O(n).

I am not saying that my approach is faster under all circumstances. But
it surely scales better.

Kind regards

robert
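
The early-exit behaviour Robert describes can be observed with a small counter. This is a sketch under the assumption of a reasonably recent Ruby (where Enumerator is built in); the counter variable is added purely for illustration and says nothing about speed.

```ruby
require 'set'

words    = %w[This is my].to_set
sentence = "This is my test sentence"

# to_enum(:scan, ...) yields matches one at a time, so any? can stop
# as soon as the block returns true.
scanned = 0
found = sentence.to_enum(:scan, /\w+/).any? do |w|
  scanned += 1
  words.include?(w)
end

puts found    # true
puts scanned  # 1, only the first word was examined

# Plain scan builds the full array of words before any? runs.
puts sentence.scan(/\w+/).length  # 5
```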

David A. Black · Apr 30, 2008

Hi --

You could use any?

irb(main):001:0> words = %w{ This is my }
=> ["This", "is", "my"]
irb(main):002:0> sentence = "This is my test sentence"
=> "This is my test sentence"
irb(main):003:0> words.any? {|word| sentence.include?(word) }
=> true
irb(main):004:0> sentence = "Hi"
=> "Hi"
irb(main):005:0> words.any? {|word| sentence.include?(word) }
=> false

Actually, sentence.include?(word) isn't good, because it will give
false positives (for substrings).

David
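
One way to avoid the substring false positives David mentions is to anchor the alternatives at word boundaries. The sketch below assumes Ruby 2.4 or later for String#match?; the second test sentence is invented for the example.

```ruby
words = %w[is the my]

# \b anchors the match at word boundaries; Regexp.union escapes each word.
pattern = /\b#{Regexp.union(words)}\b/

whole_word = "This is my test sentence".match?(pattern)
embedded   = "Thistle mythology".match?(pattern)

puts whole_word  # true  ("is" and "my" appear as words)
puts embedded    # false ("is" and "my" only appear inside longer words)

# For comparison, the include?-based check reports a hit here.
puts words.any? { |w| "Thistle mythology".include?(w) }  # true
```

\b relies on word characters, so this sketch suits plain words; entries that begin or end with punctuation would need a different boundary test.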


Robert Klemme · May 1, 2008

Yes. I used to_enum(:scan, /\w+/) because in this class of problems the
text (sentence) tends to be large. The approach using to_enum does
the test while traversing, while the scan approach first converts the whole
text into words and then applies the test, thus iterating twice over the
whole text, doing more conversions (to words), and needing more
temporary memory (i.e. for the whole sequence of words, although the
overhead might be small because of internal String memory sharing).

The Set approach scales better for larger sets of words because the Set
lookup is O(1) while an Array based lookup is O(n).

I am not saying that my approach is faster under all circumstances. But
it surely scales better.

Well, I did a little benchmarking and it turns out that I probably spoke
too soon. As often - assumptions should be verified against measurable
reality.

Here are the numbers. I leave the analysis to the reader, but keep in
mind that the situation might change significantly if the input text
needs to be read via IO (from a file etc.).

Kind regards

robert

robert@fussel /cygdrive/c/Temp
$ ./scan.rb
Rehearsal -------------------------------------------------------
head arr std          7.578000   0.063000   7.641000 (  7.628000)
head arr enum         0.000000   0.000000   0.000000 (  0.000000)
head set std          8.016000   0.031000   8.047000 (  8.043000)
head set enum         0.000000   0.000000   0.000000 (  0.000000)
head rarr std         7.968000   0.016000   7.984000 (  8.041000)
head rarr enum        0.000000   0.000000   0.000000 (  0.002000)
head rx               0.000000   0.000000   0.000000 (  0.000000)
tail arr std         20.203000   0.000000  20.203000 ( 20.390000)
tail arr enum        32.079000   0.000000  32.079000 ( 33.039000)
tail set std         15.421000   0.031000  15.452000 ( 15.616000)
tail set enum        26.672000   0.016000  26.688000 ( 26.721000)
tail rarr std        19.782000   0.031000  19.813000 ( 19.811000)
tail rarr enum       31.281000   0.000000  31.281000 ( 31.360000)
tail rx               0.078000   0.000000   0.078000 (  0.080000)
mid arr std          13.828000   0.031000  13.859000 ( 13.853000)
mid arr enum         15.781000   0.000000  15.781000 ( 15.814000)
mid set std          11.485000   0.063000  11.548000 ( 11.559000)
mid set enum         12.953000   0.000000  12.953000 ( 12.961000)
mid rarr std         14.156000   0.062000  14.218000 ( 14.231000)
mid rarr enum        15.375000   0.016000  15.391000 ( 15.412000)
mid rx                0.031000   0.000000   0.031000 (  0.039000)
-------------------------------------------- total: 253.047000sec

                          user     system      total        real
head arr std          7.031000   0.062000   7.093000 (  7.086000)
head arr enum         0.000000   0.000000   0.000000 (  0.000000)
head set std          7.078000   0.063000   7.141000 (  7.131000)
head set enum         0.000000   0.000000   0.000000 (  0.000000)
head rarr std         7.000000   0.125000   7.125000 (  7.129000)
head rarr enum        0.000000   0.000000   0.000000 (  0.000000)
head rx               0.000000   0.000000   0.000000 (  0.000000)
tail arr std         19.282000   0.031000  19.313000 ( 19.341000)
tail arr enum        30.328000   0.078000  30.406000 ( 30.658000)
tail set std         14.594000   0.000000  14.594000 ( 14.600000)
tail set enum        25.360000   0.000000  25.360000 ( 25.403000)
tail rarr std        19.047000   0.016000  19.063000 ( 19.076000)
tail rarr enum       29.922000   0.000000  29.922000 ( 29.984000)
tail rx               0.078000   0.000000   0.078000 (  0.082000)
mid arr std          13.297000   0.000000  13.297000 ( 13.312000)
mid arr enum         14.453000   0.000000  14.453000 ( 14.451000)
mid set std          10.954000   0.031000  10.985000 ( 11.012000)
mid set enum         12.093000   0.000000  12.093000 ( 12.155000)
mid rarr std         13.312000   0.000000  13.312000 ( 13.346000)
mid rarr enum        14.375000   0.000000  14.375000 ( 14.389000)
mid rx                0.031000   0.000000   0.031000 (  0.037000)

robert@fussel /cygdrive/c/Temp
$ cat scan.rb
#!/bin/env ruby

require 'set'
require 'enumerator'

require 'benchmark'

# Test texts: the only matching word ("a") sits at the front, end, or middle.
TEXT_FRONT = ("a" << (" x" * 1_000_000)).freeze
TEXT_TAIL = (("x " * 1_000_000) << "a").freeze
TEXT_MID = (("x " * 500_000) << "a" << (" x" * 500_000)).freeze
WORDS = %w{a b c d e f}.freeze
REV_WORDS = WORDS.reverse.freeze
SET_WORDS = WORDS.to_set.freeze
RX = Regexp.new("\\b#{Regexp.union(*WORDS)}\\b")

TEXTS = {
  "head" => TEXT_FRONT,
  "mid" => TEXT_MID,
  "tail" => TEXT_TAIL,
}

# Word collections to test membership against.
TESTER = {
  "arr" => WORDS,
  "rarr" => REV_WORDS,
  "set" => SET_WORDS,
}

REPEAT = 5

Benchmark.bmbm 20 do |b|
  TEXTS.each do |tlabel, text|
    TESTER.each do |lab,enum|
      # "std": build the full word array with scan, then test.
      b.report "#{tlabel} #{lab} std" do
        REPEAT.times do
          text.scan(/\w+/).any? {|w| enum.include? w}
        end
      end

      # "enum": test each word while scanning.
      b.report "#{tlabel} #{lab} enum" do
        REPEAT.times do
          text.to_enum(:scan, /\w+/).any? {|w| enum.include? w}
        end
      end
    end

    # "rx": a single regexp match against the whole text.
    b.report "#{tlabel} rx" do
      REPEAT.times do
        RX =~ text
      end
    end
  end
end

robert@fussel /cygdrive/c/Temp
$

Albert Schlef · May 1, 2008

| "This is my test sentence".include?("This") returns true
|
| but i want something like
|
| "This is my test sentence".include?("This", "is", "my")

Yet another solution:

"This is my test sentence".split & ["This", "is", "my"]
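
Applied to the word list from the original question, the intersection approach looks roughly like this. It is a sketch only; the result is an array rather than a boolean, so an extra step is needed for a yes/no answer.

```ruby
sentence = "This is my test sentence"
words    = ["is", "the", "my"]

# Array#& returns the elements common to both arrays,
# in the order they appear in the receiver.
common = sentence.split & words

puts common.inspect  # ["is", "my"]
puts common.any?     # true, at least one word occurs
```

Unlike any? with a block, the intersection always processes both arrays in full, so it does not stop at the first hit.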
