Splitting strings on spaces, unless inside quotes

Richard Livsey

I want to split a string into words, but group quoted words together
such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

So far I'm drawing a blank on the 'Ruby way' to do this and the only
solutions I can think of are turning out to be fairly ugly.

Any advice would be great. Thanks in advance.
 
Eero Saynatkari

I want to split a string into words, but group quoted words together
such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

So far I'm drawing a blank on the 'Ruby way' to do this and the only
solutions I can think of are turning out to be fairly ugly.

Any advice would be great. Thanks in advance.

Naively, you can try something like this:

s = 'foo bar "baz quux" roo'
s.scan(/(?:"")|(?:"(.*[^\\])")|(\w+)/).flatten.compact

Elaborate as necessary (add support for single quotes or something).


E
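
For what it's worth, here is one way the single-quote support mentioned above might look. It is only a sketch: the method name `split_words` is made up for illustration, and it assumes quotes are balanced and never escaped.

```ruby
# Hypothetical helper (not from the thread): treats "..." and '...' as
# single tokens and splits everything else on whitespace.
# Assumes balanced, unescaped quotes.
def split_words(str)
  # Each scan result is [double_quoted, single_quoted, bare]; exactly one
  # of the three groups is non-nil for any given match.
  str.scan(/"([^"]*)"|'([^']*)'|(\S+)/).map { |groups| groups.compact.first }
end

p split_words(%q{foo bar "baz quux" 'single quoted' roo})
```

An apostrophe in the middle of a bare word (as in can't) stays part of that word, because a quote only opens a group when it sits at the start of a token.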
 
Tim Heaney

Richard Livsey said:
I want to split a string into words, but group quoted words together
such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

How about the csv module? Despite the name, you don't have to use
commas.

require 'csv'
CSV::parse_line('some words "some quoted text" some more words', ' ')

I hope this helps,

Tim
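
A note for anyone trying this on a current Ruby: the positional separator argument above is from the old csv library. As far as I know, the rewritten CSV class takes the separator as the `col_sep` keyword instead. A minimal sketch, assuming the csv library is available (on recent Ruby versions it may need to be installed as a gem):

```ruby
require 'csv'

line = 'some words "some quoted text" some more words'

# col_sep replaces the default comma with a single space as the
# field separator; double quotes still group a field.
fields = CSV.parse_line(line, col_sep: ' ')
p fields
```

One thing to watch: CSV treats every separator as a field boundary, so two consecutive spaces produce an empty field rather than being collapsed.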
 
James Edward Gray II

I want to split a string into words, but group quoted words
together such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

So far I'm drawing a blank on the 'Ruby way' to do this and the
only solutions I can think of are turning out to be fairly ugly.

Any advice would be great. Thanks in advance.

I agree that CSV is the way to go, but here's a direct attempt:
example = %Q{some words "some quoted text" some more words}
=> "some words \"some quoted text\" some more words"
example.scan(/\s+|\w+|"[^"]*"/).
?> reject { |token| token =~ /^\s+$/ }.
?> map { |token| token.sub(/^"/, "").sub(/"$/, "") }
=> ["some", "words", "some quoted text", "some", "more", "words"]

Hope that gives you some fresh ideas.

James Edward Gray II
 
Matthew Moss

some words "some quoted text" some more words
would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

s = 'some words "some quoted text" some more words'

sa = s.split(/"/).collect { |x| x.strip }
(0...sa.size).to_a.zip(sa).collect { |i,x| (i&1).zero? ? x.split : x }.flatten
 
Michael 'entropie' Trommer

* James Edward Gray II said:
example = %Q{some words "some quoted text" some more words}
=> "some words \"some quoted text\" some more words"
example.scan(/\s+|\w+|"[^"]*"/).
?> reject { |token| token =~ /^\s+$/ }.
?> map { |token| token.sub(/^"/, "").sub(/"$/, "") }
=> ["some", "words", "some quoted text", "some", "more", "words"]

impressive


So long
--
Michael 'entropie' Trommer; http://ackro.org

ruby -e "0.upto((a='njduspAhnbjm/dpn').size-1){|x| a[x]-=1}; p 'mailto:'+a"

 
Matthew Moss

(0...sa.size).to_a.zip(sa).collect { |i,x| (i&1).zero? ? x.split : x }.flatten

Just realized that Range responds to zip, so the to_a is unnecessary.

This looks slightly cleaner to me:

(1..sa.size).zip(sa).collect { |i,x| (i&1).zero? ? x : x.split }.flatten
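
The same even/odd idea can also be written without building the index range by hand. This is just a sketch, assuming a Ruby with `flat_map` (1.9.2 or later) and balanced quotes; the name `split_by_quote_parity` is made up for illustration.

```ruby
# Hypothetical helper: after splitting on the quote character, chunks at
# even indexes were outside quotes and chunks at odd indexes were inside.
def split_by_quote_parity(str)
  str.split('"').each_with_index.flat_map do |chunk, i|
    # Outside quotes: split on whitespace. Inside quotes: keep whole.
    i.even? ? chunk.split : [chunk]
  end
end

p split_by_quote_parity('some words "some quoted text" some more words')
```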
 
Xavier Noria

I want to split a string into words, but group quoted words
together such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

Curiously, someone asked exactly that on freenode#perl tonight.

If the input is that simple and is assumed to be well-formed this is
enough:

irb(main):005:0> %q{some words "some quoted text" some "" more words}.scan(/"[^"]*"|\S+/)
=> ["some", "words", "\"some quoted text\"", "some", "\"\"", "more", "words"]

Since nothing was said about this, it does not handle escaped quotes,
and it assumes quotes are always balanced, so a field cannot be
%q{"foo}, for example.

-- fxn
 
dblack

Hi --

I want to split a string into words, but group quoted words together such
that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

So far I'm drawing a blank on the 'Ruby way' to do this and the only
solutions I can think of are turning out to be fairly ugly.

Any advice would be great. Thanks in advance.

I agree that CSV is the way to go, but here's a direct attempt:

Me too (end of disclaimer :)

example = %Q{some words "some quoted text" some more words}
=> "some words \"some quoted text\" some more words"
example.scan(/\s+|\w+|"[^"]*"/).
?> reject { |token| token =~ /^\s+$/ }.
?> map { |token| token.sub(/^"/, "").sub(/"$/, "") }
=> ["some", "words", "some quoted text", "some", "more", "words"]

I think you could do less work:

example.scan(/"[^"]+"|\S+/).map { |word| word.delete('"') }

(Or am I overlooking some reason you'd want to capture sequences of
spaces?)

I changed the \w+ to \S+ (and moved it after the | to avoid having it
sponge up too much) in case the words included non-\w characters.

I guess with zero-width positive lookbehind/ahead one could do it
without the map operation.


David

--
David A. Black

"Ruby for Rails", from Manning Publications, coming April 2006!
http://www.manning.com/books/black
 
James Edward Gray II

example = %Q{some words "some quoted text" some more words}
=> "some words \"some quoted text\" some more words"
example.scan(/\s+|\w+|"[^"]*"/).
?> reject { |token| token =~ /^\s+$/ }.
?> map { |token| token.sub(/^"/, "").sub(/"$/, "") }
=> ["some", "words", "some quoted text", "some", "more", "words"]

I think you could do less work:

example.scan(/"[^"]+"|\S+/).map { |word| word.delete('"') }

(Or am I overlooking some reason you'd want to capture sequences of
spaces?)

I changed the \w+ to \S+ (and moved it after the | to avoid having it
sponge up too much) in case the words included non-\w characters.

You're right, that's better all around.
I guess with zero-width positive lookbehind/ahead one could do it
without the map operation.

You can drop the map(), if you're willing to replace it with two
other calls:
example = %Q{some words "some quoted text" some more words}
=> "some words \"some quoted text\" some more words"
example.scan(/"([^"]+)"|(\S+)/).flatten.compact
=> ["some", "words", "some quoted text", "some", "more", "words"]

James Edward Gray II
 
Florian Groß

Richard said:
I want to split a string into words, but group quoted words together
such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

Try this:
irb(main):001:0> require 'shellwords'; Shellwords.shellwords 'some words "some quoted text" some more words'
=> ["some", "words", "some quoted text", "some", "more", "words"]
 
ara.t.howard

Richard Livsey said:
I want to split a string into words, but group quoted words together
such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

How about the csv module? Despite the name, you don't have to use
commas.

require 'csv'
CSV::parse_line('some words "some quoted text" some more words', ' ')

I hope this helps,

brilliant!

-a
--
===============================================================================
| ara [dot] t [dot] howard [at] noaa [dot] gov
| all happiness comes from the desire for others to be happy. all misery
| comes from the desire for oneself to be happy.
| -- bodhicaryavatara
===============================================================================
 
William James

Richard said:
I want to split a string into words, but group quoted words together
such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

s = 'some words "some quoted text" some more words'
p s.split( / *"(.*?)" *| / )
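
One caveat with the split approach: when the string begins with a quoted group, or contains runs of spaces, split leaves empty strings in the result. A small sketch of one way to tidy that up; the method name `split_outside_quotes` is made up for illustration, and dropping empties also drops a deliberately empty "" field.

```ruby
# Hypothetical wrapper around the split-with-capture idea above.
# split returns the captured quoted text along with the pieces between
# delimiters; empty pieces (leading quote, doubled spaces) are removed.
def split_outside_quotes(str)
  str.split(/ *"(.*?)" *| /).reject(&:empty?)
end

p split_outside_quotes('"leading quoted" then  two spaces')
```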
 
Geoff Jacobsen

Richard said:
I want to split a string into words, but group quoted words together
such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

s = 'some words "some quoted text" some more words'
p s.split( / *"(.*?)" *| / )

Which along with the CSV solution can't handle complex cases:

s='one two" "\'with quotes\' "three "'

s.split( / *"(.*?)" *| / )
=> ["one", "two", " ", "'with", "quotes'", "three "]

require 'csv'
CSV::parse_line(s)
=> []

but Shellwords can:

require 'shellwords'
Shellwords.shellwords(s)
=> ["one", "two with quotes", "three "]
 
Robert Klemme

Geoff said:
Richard said:
I want to split a string into words, but group quoted words together
such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

s = 'some words "some quoted text" some more words'
p s.split( / *"(.*?)" *| / )

Which along with the CSV solution can't handle complex cases:

s='one two" "\'with quotes\' "three "'

s.split( / *"(.*?)" *| / )
=> ["one", "two", " ", "'with", "quotes'", "three "]

require 'csv'
CSV::parse_line(s)
=> []

but Shellwords can:

require 'shellwords'
Shellwords.shellwords(s)
=> ["one", "two with quotes", "three "]

Another option is to use scan instead of split:
'some words "some quoted text" some more words'.scan(%r{"(?:(?:[^"]|\\.)*)"|\S+})
=> ["some", "words", "\"some quoted text\"", "some", "more", "words"]

With some additional effort even the quotes can be removed (using grouping
for example).
r=[];'some words "some quoted text" some more words'.scan(%r{"((?:[^"]|\\.)*)"|(\S+)}) {|m| r << m.detect {|x|x}};r
=> ["some", "words", "some quoted text", "some", "more", "words"]

Kind regards

robert
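
If backslash-escaped quotes inside a quoted group need to survive, the order of the alternatives matters: trying the escape branch first, and excluding the backslash from the character class, makes sure `\"` is consumed as a pair instead of ending the group. A sketch under that assumption; the name `scan_with_escapes` is made up, and the backslashes are left in the result.

```ruby
# Hypothetical helper: quoted groups may contain backslash escapes
# such as \" ; the escapes are returned as written, not removed.
def scan_with_escapes(str)
  # Group 1: contents of a double-quoted group. Group 2: a bare word.
  pattern = /"((?:\\.|[^"\\])*)"|(\S+)/
  str.scan(pattern).map { |quoted, bare| quoted || bare }
end

p scan_with_escapes('say "a \"b\" c" end')
```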
 
William James

Geoff said:
Richard said:
I want to split a string into words, but group quoted words together
such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

s = 'some words "some quoted text" some more words'
p s.split( / *"(.*?)" *| / )

Which along with the CSV solution can't handle complex cases:

s='one two" "\'with quotes\' "three "'

s.split( / *"(.*?)" *| / )
=> ["one", "two", " ", "'with", "quotes'", "three "]

require 'csv'
CSV::parse_line(s)
=> []

but Shellwords can:

require 'shellwords'
Shellwords.shellwords(s)
=> ["one", "two with quotes", "three "]

This is not a "more complex case"; it is an invalid case.
The original poster simply wanted to avoid splitting on spaces
within double quotes, not within single quotes.

The shellwords "solution" is a solution to a different problem, not
to this one. It can't even handle a simple case:

require 'shellwords'
s = "why can't you think?"
Shellwords.shellwords(s)

ArgumentError: Unmatched single quote: 't you think?
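
If the shell-style rules are wanted for well-formed input but a stray apostrophe should not be fatal, one possible compromise is to rescue the error and fall back to the plain scan from earlier in the thread. This is only a sketch: `shell_like_split` is a made-up name, and the fallback knows nothing about single quotes or escapes.

```ruby
require 'shellwords'

# Hypothetical helper: try Shellwords first, and fall back to a simple
# double-quote-aware scan when the input has an unmatched quote.
def shell_like_split(str)
  Shellwords.split(str)
rescue ArgumentError
  # Shellwords rejected the input; treat only "..." as grouping.
  str.scan(/"[^"]*"|\S+/).map { |word| word.delete('"') }
end

p shell_like_split("why can't you think?")
p shell_like_split('some words "some quoted text" some more words')
```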
 
Geoff Jacobsen

Geoff said:
Richard Livsey wrote:
I want to split a string into words, but group quoted words together
such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

s = 'some words "some quoted text" some more words'
p s.split( / *"(.*?)" *| / )

Which along with the CSV solution can't handle complex cases:

s='one two" "\'with quotes\' "three "'

s.split( / *"(.*?)" *| / )
=> ["one", "two", " ", "'with", "quotes'", "three "] ...
but Shellwords can:

require 'shellwords'
Shellwords.shellwords(s)
=> ["one", "two with quotes", "three "]

This is not a "more complex case"; it is an invalid case.
The original poster simply wanted to avoid splitting on spaces
within double quotes, not within single quotes.

The shellwords "solution" is a solution to a different problem, not
to this one. It can't even handle a simple case:

require 'shellwords'
s = "why can't you think?"
Shellwords.shellwords(s)

ArgumentError: Unmatched single quote: 't you think?

I agree my example doesn't match the originator's request but *I think*
there is enough ambiguity about the post to postulate that they may want
more real-world cases such as:

s = %q{symbol "William said: \"why can't you think?\"" 123 "<xml>foo</xml>"}
Shellwords.shellwords(s)

=> ["symbol", "William said: \"why can't you think?\"", "123",
"<xml>foo</xml>"]

So Shellwords may indeed be a solution to this problem but the problem
is not stated precisely enough to know.
 
