regular expressions help

Vivek · Jul 12, 2008

Hi,
How do I split the string below into words? A word can be either a
consecutive run of non-whitespace characters or anything within " "

'hi hello "hello world" hey yo'

should return
[hi, hello, hello world,hey,yo]

I tried using collect, but I'm not sure there is a way to retain a
variable between two invocations of the block, concatenate the pieces,
and return them as one string.
Of course, if there is a smart way to do it in one shot with a regex,
then I can just do a scan on the string.

phlip · Jul 12, 2008

'hi hello "hello world" hey yo'

should return
[hi, hello, hello world,hey,yo]

'hi hello "hello world" hey yo'.scan(/\w+/)

=> ["hi", "hello", "hello", "world", "hey", "yo"]

Sorry I couldn't find a more verbose way. Maybe there is one!

Axel · Jul 12, 2008

If you can't find anything better, you might want to try:

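str = 'hi hello "hello world" hey yo'
# Strip the quotes from each quoted span and replace its spaces with a placeholder (BEL).
str.gsub!( / \" [^\"]* \" /x ) {|e| e[1..-2].gsub(' ', "\007") }
# Scan for words, then turn the placeholders back into spaces.
result = str.scan( / [\w\007]+ /x ).map {|e| e.gsub("\007", " ") }
p result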

Regards,
Axel

phlip · Jul 12, 2008

Axel said:
str = 'hi hello "hello world" hey yo'
str.gsub!( / \" [^\"]* \" /x ) {|e| e[1..-2].gsub(' ', "\007") }
result = str.scan( / [\w\007]+ /x ).map {|e| e.gsub("\007", " ") }
p result

str = 'hi hello "hello world" hey yo'
p str.scan(/(".*")|(\w+)/).flatten.compact

=> ["hi", "hello", "hello world", "hey", "yo"]

Greedy matching to the rescue!

phlip · Jul 12, 2008

str = 'hi hello "hello world" hey yo'

p str.scan(/(".*")|(\w+)/).flatten.compact

=> ["hi", "hello", "hello world", "hey", "yo"]

Greedy matching to the rescue!

Also, non-capturing groups help us remove the .flatten.compact nonsense:

p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

I'm not sure why one version captured the "" marks and the other did not...

David A. Black · Jul 12, 2008

Hi --

Axel said:

str = 'hi hello "hello world" hey yo'
str.gsub!( / \" [^\"]* \" /x ) {|e| e[1..-2].gsub(' ', "\007") }
result = str.scan( / [\w\007]+ /x ).map {|e| e.gsub("\007", " ") }
p result

str = 'hi hello "hello world" hey yo'
p str.scan(/(".*")|(\w+)/).flatten.compact

=> ["hi", "hello", "hello world", "hey", "yo"]

That's not quite the result, though:
=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

The "'s are returned as part of the string '"hello world"'. Also, you
get the wrong result if you have two quoted strings in a row, because
of the greediness:
=> ["one", "\"two\" \"three\"", "four"] # only three strings

Try this:

str.scan(/"([^"]+)"|(\w+)/).flatten.compact

Of course this assumes no embedded/escaped/nested "'s, etc.

David
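
A note on why .flatten.compact is needed with this pattern, as a minimal sketch (the variable names and the test string are only for illustration): when the regex contains capture groups, String#scan returns one sub-array per match, with one slot per group, and the group on the branch that did not match is nil. Whether the quotes show up in the result depends only on whether they sit inside or outside the parentheses.

```ruby
str = 'one "two" "three" four'

# With capture groups, scan yields one [group1, group2] pair per match;
# the group on the branch that did not take part is nil.
pairs = str.scan(/"([^"]+)"|(\w+)/)
p pairs

# flatten merges the pairs into one list, compact drops the nils.
words = pairs.flatten.compact
p words

# Here the quotes are inside the group, so they are captured, and the
# greedy .* runs on to the last quote in the string.
greedy = str.scan(/(".*")|(\w+)/).flatten.compact
p greedy
```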

David A. Black · Jul 12, 2008

Hi --

str = 'hi hello "hello world" hey yo'
p str.scan(/(".*")|(\w+)/).flatten.compact

=> ["hi", "hello", "hello world", "hey", "yo"]

Greedy matching to the rescue!

Also, non-capturing groups help us remove the .flatten.compact nonsense:

p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

I'm not sure why one version captured the "" marks and the other did not...

They both did

(See my previous post.)

David

Bill Kelly · Jul 12, 2008

phlip said:
p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

Probably want:

str.scan(/(?:"[^"]*")|(?:\w+)/)

...else the greediness will extend over multiple quoted
strings...

'hi hello "hello world" hey yo "marmoset knocked you out" foo bar'
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vs.

'hi hello "hello world" hey yo "marmoset knocked you out" foo bar'
          ^^^^^^^^^^^^^

I'm not sure why one version captured the "" marks and the
other did not...

Strange... They both did, on my system...(?)

BTW, in ruby 1.9, we have lookbehind, so we can avoid picking
up the quotes, with:

str.scan(/(?:(?<=")[^"]*(?="))|(?:\w+)/)

Regards,

Bill
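
The two spans marked with carets above can be checked directly. This is a small sketch using String#[] with a regex, which returns the first matched substring; the variable names are only for illustration.

```ruby
str = 'hi hello "hello world" hey yo "marmoset knocked you out" foo bar'

# .* is greedy: it runs on to the last quote in the string.
greedy = str[/".*"/]

# [^"]* cannot cross a quote, so it stops at the first closing one.
bounded = str[/"[^"]*"/]

puts greedy
puts bounded
```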

phlip · Jul 12, 2008

David said:
=> ["hi", "hello", "hello world", "hey", "yo"]

That's not quite the result, though:

I suspect I copied the wrong line from my transcript!

But...

The "'s are returned as part of the string '"hello world"'. Also, you
get the wrong result if you have two quoted strings in a row, because
of the greediness:

str = 'hi hello "hello world" "hey yo"'
p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\" \"hey yo\""] # bad

p str.scan(/(?:".*?")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "\"hey yo\""] # good!

(-:

str.scan(/"([^"]+)"|(\w+)/).flatten.compact

The non-greedy matcher .*? looks cuter.

Of course this assumes no embedded/escaped/nested "'s, etc.

Using regexps as real language parsers makes certain baby deities cry...
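
One difference between .*? and [^"]* that may matter in practice: by default . does not match a newline in Ruby, so ".*?" will not match a quoted string that spans lines, while "[^"]*" will. On single-line input like the examples in this thread the two behave the same. A small sketch with a made-up input:

```ruby
# %( ) behaves like a double-quoted string, so \n is a real newline here.
text = %(say "hello\nworld" now)

lazy    = text.scan(/".*?"/)    # . stops at the newline, so nothing matches
negated = text.scan(/"[^"]*"/)  # [^"] accepts the newline

p lazy
p negated
```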

Dave Bass · Jul 12, 2008

phlip said:
'hi hello "hello world" hey yo'

should return
[hi, hello, hello world,hey,yo]

'hi hello "hello world" hey yo'.scan(/\w+/)

=> ["hi", "hello", "hello", "world", "hey", "yo"]

But this returns "hello world" as two entries, not one as required.

David A. Black · Jul 12, 2008

Hi --

David said:

=> ["hi", "hello", "hello world", "hey", "yo"]

That's not quite the result, though:

I suspect I copied the wrong line from my transcript!

But...

The "'s are returned as part of the string '"hello world"'. Also, you
get the wrong result if you have two quoted strings in a row, because
of the greediness:

str = 'hi hello "hello world" "hey yo"'
p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\" \"hey yo\""] # bad

p str.scan(/(?:".*?")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "\"hey yo\""] # good!

I don't think the OP wanted the literal quotation marks as part of the
results, though. In other words you'd want the third string to be:

hello world

rather than

"hello world"

David

Harry Kakueki · Jul 12, 2008

Hi,
How do I split the string below into words? A word can be either a
consecutive run of non-whitespace characters or anything within " "

'hi hello "hello world" hey yo'

should return
[hi, hello, hello world,hey,yo]

require 'shellwords'
include Shellwords

str = 'hi hello "hello world" hey yo'

p shellwords(str)

Harry
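
For reference, a minimal sketch of the same library without the include. Shellwords.split is the module-level form of shellwords(str); the extra inputs are made up to show that single quotes and backslash escapes are handled as well, and that an unbalanced quote raises instead of guessing.

```ruby
require 'shellwords'

str = 'hi hello "hello world" hey yo'

# Module-level call, no include needed.
words = Shellwords.split(str)
p words

# Single quotes and backslash-escaped spaces are honoured too.
mixed = Shellwords.split(%q(one 'two three' four\ five))
p mixed

# An unbalanced quote raises ArgumentError.
error = nil
begin
  Shellwords.split('oops "unterminated')
rescue ArgumentError => e
  error = e
end
p error.class
```

Because it follows shell quoting rules, an unquoted apostrophe (as in what's) counts as an opening quote, so free-form text may need the regex approach instead.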

phlip · Jul 12, 2008

should return

[hi, hello, hello world,hey,yo]

But this returns "hello world" as two entries, not one as required.

The "should return" clause is not well-formed anyway...

David A. Black · Jul 12, 2008

Hi --

should return
[hi, hello, hello world,hey,yo]

But this returns "hello world" as two entries, not one as required.

The "should return" clause is not well-formed anyway...

On the (usually misappropriated, but hopefully not here) Occam's Razor
principle[1], I would refrain from positing that there's actually
supposed to be a comma between the second "hello" and "world", or that
the quotation marks that were removed to illustrate the results are
actually supposed to be reinstated as literals. We can wait for a
ruling from Vivek, though; he's now got just about every permutation
to choose from. (Including shellwords, thanks to Harry, and that of
course is the best. Or at least, if Occam is right, then Harry is
right.)

David

[1] http://pespmc1.vub.ac.be/occamraz.html (yes, it's still "Link To
Something Other Than Wikipedia!" Week [barely])

Vivek · Jul 13, 2008

Hi David and others,

On the (usually misappropriated, but hopefully not here) Occam's Razor
principle[1], I would refrain from positing that there's actually
supposed to be a comma between the second "hello" and "world", or that
the quotation marks that were removed to illustrate the results are
actually supposed to be reinstated as literals. We can wait for a
ruling from Vivek, though; he's now got just about every permutation
to choose from (Including shellwords, thanks to Harry, and that of
course is the best. Or at least, if Occam is right, then Harry is
right

Thanks for the replies. Indeed, I don't want the quotes to be part of
the string.
This one, suggested above, works for me:

irb(main):028:0> s
=> "hi there \"hello world\" namaste \"yo man\" \"gutten morgen\" ola \"what's up\" world"
irb(main):029:0> s.scan(/"([^"]+)"|(\w+)/).flatten.compact
=> ["hi", "there", "hello world", "namaste", "yo man", "gutten morgen", "ola", "what's up", "world"]

I presume that should handle pretty much any combination, and I don't
have the case where there are nested " so that looks good (unless
someone can think of a case that breaks it).
Thanks so much. I had hit a dead end trying to do this!

Vivek Krishna
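
On the question of cases that break it, two that come to mind: \w+ splits unquoted words at apostrophes and hyphens, and [^"]+ skips an empty "". The original question described a word as a run of non-whitespace characters, so \S+ may be a closer fit. This is only a sketch with a made-up input, and it still assumes no escaped or nested quotes.

```ruby
str = %q(it's a well-known "quoted phrase" and "" done)

# \w+ breaks at the apostrophe and the hyphen; [^"]+ ignores the empty "".
with_w = str.scan(/"([^"]+)"|(\w+)/).flatten.compact
p with_w

# \S+ keeps punctuation inside unquoted words; [^"]* lets "" through as "".
with_s = str.scan(/"([^"]*)"|(\S+)/).flatten.compact
p with_s
```

The trade-off is that \S+ also swallows a quote glued to a word, so input like foo"bar baz" would no longer be treated as containing a quoted span.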

David A. Black · Jul 13, 2008

Hi --

Hi David and others,

On the (usually misappropriated, but hopefully not here) Occam's Razor
principle[1], I would refrain from positing that there's actually
supposed to be a comma between the second "hello" and "world", or that
the quotation marks that were removed to illustrate the results are
actually supposed to be reinstated as literals. We can wait for a
ruling from Vivek, though; he's now got just about every permutation
to choose from (Including shellwords, thanks to Harry, and that of
course is the best. Or at least, if Occam is right, then Harry is
right

Thanks for the replies. Indeed, I don't want the quotes to be part of
the string.
This one, suggested above, works for me:

irb(main):028:0> s
=> "hi there \"hello world\" namaste \"yo man\" \"gutten morgen\" ola \"what's up\" world"
irb(main):029:0> s.scan(/"([^"]+)"|(\w+)/).flatten.compact
=> ["hi", "there", "hello world", "namaste", "yo man", "gutten morgen", "ola", "what's up", "world"]

I presume that should handle pretty much any combination, and I don't
have the case where there are nested " so that looks good (unless
someone can think of a case that breaks it).
Thanks so much. I had hit a dead end trying to do this!

Don't forget the shellwords library though -- a very convenient way to
do this.

David

humax · Jul 13, 2008

Hi --

Hi David and others,

On the (usually misappropriated, but hopefully not here) Occam's
Razor principle[1], I would refrain from positing that there's
actually supposed to be a comma between the second "hello" and
"world", or that the quotation marks that were removed to
illustrate the results are actually supposed to be reinstated as
literals. We can wait for a ruling from Vivek, though; he's now
got just about every permutation to choose from (Including
shellwords, thanks to Harry, and that of course is the best. Or at
least, if Occam is right, then Harry is right

Thanks for the replies. Indeed, I don't want the quotes to be part of
the string.
This one, suggested above, works for me:

irb(main):028:0> s
=> "hi there \"hello world\" namaste \"yo man\" \"gutten morgen\" ola \"what's up\" world"
irb(main):029:0> s.scan(/"([^"]+)"|(\w+)/).flatten.compact
=> ["hi", "there", "hello world", "namaste", "yo man", "gutten morgen", "ola", "what's up", "world"]

I presume that should handle pretty much any combination, and I don't
have the case where there are nested " so that looks good (unless
someone can think of a case that breaks it).
Thanks so much. I had hit a dead end trying to do this!

Don't forget the shellwords library though -- a very convenient way
to do this.

David

Is there a link for these listed on the web?

Bill Kelly · Jul 13, 2008

humax said:
Is there a link for these listed on the web?

require 'shellwords'

... should work in 1.8 and 1.9 ruby
