regular expressions help

Vivek

Hi,
How do I split the string below into words? A word can be either a
consecutive run of non-whitespace characters or anything within " "

'hi hello "hello world" hey yo'

should return
[hi, hello, hello world,hey,yo]


I tried to do this with collect, but I'm not sure whether there is a way to
retain a variable between two invocations, then concatenate the pieces and
return them as one string.
Of course, if there is a smart way to do it in one shot with a regex,
then I can just do a scan on the string.
 
phlip

'hi hello "hello world" hey yo'
should return
[hi, hello, hello world,hey,yo]

'hi hello "hello world" hey yo'.scan(/\w+/)

=> ["hi", "hello", "hello", "world", "hey", "yo"]

Sorry I couldn't find a more verbose way. Maybe there is one!
 
Axel

If you can't find anything better, you might want to try:

str = 'hi hello "hello world" hey yo'
str.gsub!( / \" [^\"]* \" /x ) {|e| e[1..-2].gsub(' ', "\007") }
result = str.scan( / [\w\007]+ /x ).map {|e| e.gsub("\007", " ") }
p result

Regards,
Axel
 
phlip

Axel said:
str = 'hi hello "hello world" hey yo'
str.gsub!( / \" [^\"]* \" /x ) {|e| e[1..-2].gsub(' ', "\007") }
result = str.scan( / [\w\007]+ /x ).map {|e| e.gsub("\007", " ") }
p result

str = 'hi hello "hello world" hey yo'
p str.scan(/(".*")|(\w+)/).flatten.compact

=> ["hi", "hello", "hello world", "hey", "yo"]

Greedy matching to the rescue!
 
phlip

str = 'hi hello "hello world" hey yo'
p str.scan(/(".*")|(\w+)/).flatten.compact

=> ["hi", "hello", "hello world", "hey", "yo"]

Greedy matching to the rescue!

Also, non-capturing groups help us remove the .flatten.compact nonsense:

p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

I'm not sure why one version captures the "" marks and the other does not...
 
David A. Black

Hi --

Axel said:
str = 'hi hello "hello world" hey yo'
str.gsub!( / \" [^\"]* \" /x ) {|e| e[1..-2].gsub(' ', "\007") }
result = str.scan( / [\w\007]+ /x ).map {|e| e.gsub("\007", " ") }
p result

str = 'hi hello "hello world" hey yo'
p str.scan(/(".*")|(\w+)/).flatten.compact

=> ["hi", "hello", "hello world", "hey", "yo"]

That's not quite the result, though:
=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

The "'s are returned as part of the string '"hello world"'. Also, you
get the wrong result if you have two quoted strings in a row, because
of the greediness:
=> ["one", "\"two\" \"three\"", "four"] # only three strings

Try this:

str.scan(/"([^"]+)"|(\w+)/).flatten.compact

Of course this assumes no embedded/escaped/nested "'s, etc.


David
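
A minimal sketch of the greediness problem described above. The thread shows only the output for the "two quoted strings in a row" case, so the input string here is an assumption reconstructed from that output:

```ruby
# Assumed input: the post shows only the resulting array.
str = 'one "two" "three" four'

# Greedy: ".*" runs from the first quote to the LAST quote on the line.
greedy = str.scan(/(".*")|(\w+)/).flatten.compact

# Negated class: [^"]+ cannot cross a quote, so each pair is matched separately,
# and the capture group sits inside the quotes, so they are dropped.
negated = str.scan(/"([^"]+)"|(\w+)/).flatten.compact

p greedy   # => ["one", "\"two\" \"three\"", "four"]
p negated  # => ["one", "two", "three", "four"]
```

The .flatten.compact is needed because scan returns one [group1, group2] pair per match, with nil in the group that did not take part.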
 
David A. Black

Hi --

str = 'hi hello "hello world" hey yo'
p str.scan(/(".*")|(\w+)/).flatten.compact

=> ["hi", "hello", "hello world", "hey", "yo"]

Greedy matching to the rescue!

Also, non-capturing groups help us remove the .flatten.compact nonsense:

p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

I'm not sure why one version captures the "" marks and the other does not...

They both did :) (See my previous post.)


David
 
Bill Kelly

phlip said:
p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "hey", "yo"]

Probably want:

str.scan(/(?:"[^"]*")|(?:\w+)/)

...else the greediness will extend over multiple quoted
strings...

'hi hello "hello world" hey yo "marmoset knocked you out" foo bar'
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vs.

'hi hello "hello world" hey yo "marmoset knocked you out" foo bar'
          ^^^^^^^^^^^^^

I'm not sure why one version captures the "" marks and the
other does not...

Strange... They both did, on my system...(?)

BTW, in ruby 1.9, we have lookbehind, so we can avoid picking
up the quotes, with:

str.scan(/(?:(?<=")[^"]*(?="))|(?:\w+)/)


Regards,

Bill
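
One caveat with the lookaround version, sketched below with a test string that is not from the thread: the lookbehind only checks that the previous character is a quote, so it cannot tell an opening quote from a closing one. Unquoted text sitting between two quoted strings can therefore be picked up as if it were quoted:

```ruby
lookaround = /(?:(?<=")[^"]*(?="))|(?:\w+)/

# The original example has only one quoted string, so it comes out as intended.
simple = 'hi hello "hello world" hey yo'.scan(lookaround)

# Hypothetical input with text between two quoted strings: " d " is preceded
# by a closing quote and followed by an opening quote, so it matches as well.
adjacent = 'a "b c" d "e f"'.scan(lookaround)

p simple    # => ["hi", "hello", "hello world", "hey", "yo"]
p adjacent  # => ["a", "b c", " d ", "e f"]
```

A capture group inside the quotes, as in /"([^"]+)"|(\w+)/, consumes both quote characters and so does not have this problem.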
 
phlip

David said:
=> ["hi", "hello", "hello world", "hey", "yo"]

That's not quite the result, though:

I suspect I copied the wrong line from my transcript!

But...
The "'s are returned as part of the string '"hello world"'. Also, you
get the wrong result if you have two quoted strings in a row, because
of the greediness:

str = 'hi hello "hello world" "hey yo"'
p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\" \"hey yo\""] # bad

p str.scan(/(?:".*?")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "\"hey yo\""] # good!

(-:
str.scan(/"([^"]+)"|(\w+)/).flatten.compact

The non-greedy matcher .*? looks cuter.
Of course this assumes no embedded/escaped/nested "'s, etc.

Using regexps as real language parsers makes certain baby deities cry...
 
Dave Bass

phlip said:
'hi hello "hello world" hey yo'

should return
[hi, hello, hello world,hey,yo]

'hi hello "hello world" hey yo'.scan(/\w+/)

=> ["hi", "hello", "hello", "world", "hey", "yo"]

But this returns "hello world" as two entries, not one as required.
 
David A. Black

Hi --

David said:
=> ["hi", "hello", "hello world", "hey", "yo"]

That's not quite the result, though:

I suspect I copied the wrong line from my transcript!

But...
The "'s are returned as part of the string '"hello world"'. Also, you
get the wrong result if you have two quoted strings in a row, because
of the greediness:

str = 'hi hello "hello world" "hey yo"'
p str.scan(/(?:".*")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\" \"hey yo\""] # bad

p str.scan(/(?:".*?")|(?:\w+)/)

=> ["hi", "hello", "\"hello world\"", "\"hey yo\""] # good!

I don't think the OP wanted the literal quotation marks as part of the
results, though. In other words you'd want the third string to be:

hello world

rather than

"hello world"


David
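
For what it's worth, the non-greedy form can drop the quotation marks too if the capture group is moved inside them. A small sketch, reusing the two-quoted-strings input from the post above:

```ruby
str = 'hi hello "hello world" "hey yo"'

# Group 1 holds the text between the quotes, group 2 holds a bare word.
# scan returns [group1, group2] pairs, hence flatten.compact.
words = str.scan(/"(.*?)"|(\w+)/).flatten.compact

p words  # => ["hi", "hello", "hello world", "hey yo"]
```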
 
Harry Kakueki

Hi,
How do I split the string below into words? A word can be either a
consecutive run of non-whitespace characters or anything within " "

'hi hello "hello world" hey yo'

should return
[hi, hello, hello world,hey,yo]

require 'shellwords'
include Shellwords

str = 'hi hello "hello world" hey yo'

p shellwords(str)


Harry
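
The same thing can be written without including the module at top level, by calling Shellwords.split directly. The sketch below also shows one thing to be aware of: shellwords follows shell quoting rules, so a stray apostrophe counts as an unmatched quote and raises ArgumentError. The second input string is made up for illustration.

```ruby
require 'shellwords'

str = 'hi hello "hello world" hey yo'
words = Shellwords.split(str)
p words  # => ["hi", "hello", "hello world", "hey", "yo"]

# Hypothetical input with a lone apostrophe: treated as an unclosed single quote.
error_class = nil
begin
  Shellwords.split("don't stop")
rescue ArgumentError => e
  error_class = e.class
end
p error_class  # => ArgumentError
```

An apostrophe inside a double-quoted section (such as "what's up") is fine, since single quotes are literal there.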
 
David A. Black

Hi --

should return
[hi, hello, hello world,hey,yo]
But this returns "hello world" as two entries, not one as required.

The "should return" clause is not well-formed anyway...

On the (usually misappropriated, but hopefully not here) Occam's Razor
principle[1], I would refrain from positing that there's actually
supposed to be a comma between the second "hello" and "world", or that
the quotation marks that were removed to illustrate the results are
actually supposed to be reinstated as literals. We can wait for a
ruling from Vivek, though; he's now got just about every permutation
to choose from :) (Including shellwords, thanks to Harry, and that of
course is the best. Or at least, if Occam is right, then Harry is
right :)


David

[1] http://pespmc1.vub.ac.be/occamraz.html (yes, it's still "Link To
Something Other Than Wikipedia!" Week [barely])
 
Vivek

Hi David and others,
Thanks for the replies. Indeed, I don't want the quotes to be part
of the string.
This one, suggested above, works for me:


irb(main):028:0> s
=> "hi there \"hello world\" namaste \"yo man\" \"gutten morgen\" ola \"what's up\" world"
irb(main):029:0> s.scan(/"([^"]+)"|(\w+)/).flatten.compact
=> ["hi", "there", "hello world", "namaste", "yo man", "gutten morgen", "ola", "what's up", "world"]



I presume that should handle pretty much any combination,
and I don't have the case of nested quotes, so that looks good
(unless someone can think of a case that breaks it).
Thanks so much. I had hit a dead end trying to do this!

Vivek Krishna
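
On the question of a case that breaks it: \w+ matches only letters, digits and underscores, so an unquoted word containing an apostrophe or hyphen is split, whereas the original question asked for runs of non-whitespace characters. A hedged sketch comparing the two; the helper names and the test string are hypothetical, not from the thread:

```ruby
# Version from the thread: bare words are \w+ runs.
def split_words_w(str)
  str.scan(/"([^"]+)"|(\w+)/).flatten.compact
end

# Variant closer to the stated requirement: bare words are non-whitespace runs.
def split_words_s(str)
  str.scan(/"([^"]*)"|(\S+)/).flatten.compact
end

str = %q{don't stop "hello world" well-known}

p split_words_w(str)  # => ["don", "t", "stop", "hello world", "well", "known"]
p split_words_s(str)  # => ["don't", "stop", "hello world", "well-known"]
```

Neither version handles escaped or nested quotes; for shell-style quoting rules the shellwords library mentioned earlier in the thread is the more complete option.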
 
David A. Black

Hi --

Don't forget the shellwords library though -- a very convenient way to
do this.


David
 
humax

Is there a link for these listed on the web?
 
