Regexp question

Mark Probert

Hi, Rubyists.

What is the best way of attacking field split on ';' when the string looks
like:

s = 'a;b;c\;;d;'
s.split(/???;/)
=> ["a", "b", "c\;", "d"]

Or is it best to use s.each_byte and do it by hand?
 
Simon Strandgaard

Mark Probert said:
Hi, Rubyists.

What is the best way of attacking field split on ';' when the string looks
like:

s = 'a;b;c\;;d;'
s.split(/???;/)
=> ["a", "b", "c\;", "d"]

Or is it best to use s.each_byte and do it by hand?

How about something ala

irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
=> ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]
 
Brian Schröder

Mark said:
Hi, Rubyists.

What is the best way of attacking field split on ';' when the string looks
like:

s = 'a;b;c\;;d;'
s.split(/???;/)
=> ["a", "b", "c\;", "d"]

Or is it best to use s.each_byte and do it by hand?

Normally this would call for a fixed-width negative lookbehind,

/(?<!\\);/

but as far as I know it's not included in the Ruby regexp engine.

But for further clarification:
How should 'a;b\\;;c' be split?
If backslashes can be escaped (and you'd want that, because otherwise you
can't have a field "b\"), it's more difficult.

And maybe the CSV library can help you here.

regards,

Brian
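
For readers on newer interpreters: Ruby 1.9 and later use a regexp engine that does support negative lookbehind, so the pattern above can be tried directly. A minimal sketch, assuming Ruby 1.9 or later (the variable names are only illustrative); the second case shows why an escaped backslash in front of the separator still defeats this approach:

```ruby
# Assumes Ruby 1.9 or later; the older 1.8 engine rejects (?<!...).
s = 'a;b;c\;;d;'                      # single quotes keep \; as backslash + semicolon
fields = s.split(/(?<!\\);/)          # split on ';' not preceded by a backslash
p fields                              # prints ["a", "b", "c\\;", "d"]

# An escaped backslash right before the separator is misread as an escaped ';'.
tricky = 'a;b\\\\;;c'                 # the text a;b\\;;c
tricky_fields = tricky.split(/(?<!\\);/)
p tricky_fields                       # prints ["a", "b\\\\;", "c"]
```

The lookbehind only inspects one character, so it cannot tell `\;` (escaped separator) from `\\;` (escaped backslash followed by a real separator).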
 
Simon Strandgaard

Mark Probert said:
Hi, Rubyists.

What is the best way of attacking field split on ';' when the string
looks like:

s = 'a;b;c\;;d;'
s.split(/???;/)
=> ["a", "b", "c\;", "d"]

Or is it best to use s.each_byte and do it by hand?

How about something ala

irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
=> ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]


Maybe this one is better?

irb(main):001:0> "aa;bbb\\;;abc;;d\\\\;e;f".scan(/(?:\A|;)((?:\\[^.]|[^;])*)/) { p $1 }
"aa"
"bbb\\;"
"abc"
""
"d\\\\"
"e"
"f"
=> "aa;bbb\\;;abc;;d\\\\;e;f"
irb(main):002:0>
 
Mark Probert

Hi ..

Simon Strandgaard said:
How about something ala

irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
=> ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]

Thanks! That is close enough:

irb(main):019:0> s.scan(/(?:\\[^.]|[^;])*/).each do |it|
irb(main):020:1* next if it.empty?
irb(main):021:1> puts " --> #{it}"
irb(main):022:1> end
--> a is a word
--> b is too
--> c\; for fun
--> d -- forget it
=> ["a is a word", "", "b is too", "", "c\\; for fun", "", "d -- forget it", "", ""]
 
Dany Cayouette

But for further clarification:
How should 'a;b\\;;c' be split?
Guess is that it should be
["a", "b\", nil, "c"]

The characters escaped with a backslash are the semicolon, the colon and the backslash itself, i.e.

; => \;
: => \:
\ => \\

If backslashes can be escaped (and you'd want that, because otherwise you
can't have a field "b\"), it's more difficult.

And maybe the CSV library can help you here.

thanks,
Dany
 
Dany Cayouette

But for further clarification:
How should 'a;b\\;;c' be split?
Guess is that it should be
["a", "b\", nil, "c"]
Sorry... I meant
["a", "b\\", nil, "c"], where b\\ would ultimately become b\ when the escape characters are processed in the data portion.
characters escaped by backslash at semi-colon, colon and backslash i.e.

; => \; : => \: \ => \\
Didn't think about that one... I thought this was simple and the problem was my lack of programming experience...

Dany
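
The escaping rules described above (`\;`, `\:` and `\\`) can be undone per field once the split is done. A minimal sketch, assuming exactly those three escapes and nothing else; `unescape_field` is a hypothetical helper name, not something from the thread:

```ruby
# Hypothetical helper: undo the three escapes \; \: and \\ in one field.
# Any other backslash sequence is left untouched.
def unescape_field(field)
  field.gsub(/\\([;:\\])/) { $1 }
end

p unescape_field('b\\\\')   # the text b\\ becomes b\
p unescape_field('c\;')     # the text c\; becomes c;
p unescape_field('x\:y')    # the text x\:y becomes x:y
```

Doing the replacement in a single `gsub` pass matters: two separate passes could turn `\\;` into `\;` and then wrongly into `;`.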
 
Florian Gross

Mark said:
Hi, Rubyists.
Moin!

What is the best way of attacking field split on ';' when the string looks
like:

s = 'a;b;c\;;d;'
s.split(/???;/)
=> ["a", "b", "c\;", "d"]

Or is it best to use s.each_byte and do it by hand?

This works (even with escaped escape characters), but you might be
better off doing it by hand to keep complexity low:
irb(main):025:0> str = "hello;world;foo\\;bar;no escape\\\\;blar"; puts str
hello;world;foo\;bar;no escape\\;blar
=> nil
irb(main):026:0> str.scan(/(?:(?!\\).(?:\\{2})*\\;|[^;])+/).map { |str| str.gsub(/\\(.)/, '\1') }
=> ["hello", "world", "foo;bar", "no escape\\", "blar"]

Regards,
Florian Gross
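
The "by hand" route mentioned above could look roughly like this. It is only a sketch under stated assumptions: `split_escaped` is a hypothetical name, the escape character protects whatever single character follows it, and a dangling escape at the very end of the input is silently dropped.

```ruby
# Hypothetical helper: split on sep, treating esc + any character as a
# literal character. Fields are unescaped while they are collected.
def split_escaped(str, sep = ';', esc = '\\')
  fields = [String.new]
  escaped = false
  str.each_char do |ch|
    if escaped
      fields.last << ch        # previous char was the escape: take ch literally
      escaped = false
    elsif ch == esc
      escaped = true           # remember the escape, emit nothing yet
    elsif ch == sep
      fields << String.new     # unescaped separator: start a new field
    else
      fields.last << ch
    end
  end
  fields
end

p split_escaped("hello;world;foo\\;bar;no escape\\\\;blar")
# prints ["hello", "world", "foo;bar", "no escape\\", "blar"]
```

Unlike the `scan` version, this keeps empty fields, so `'a;b\\\\;;c'` (the text `a;b\\;;c`) yields four fields with an empty third one.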
 
Robert Klemme

Mark Probert said:
Hi ..

Simon Strandgaard said:
How about something ala

irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
=> ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]

Thanks! That is close enough:

irb(main):019:0> s.scan(/(?:\\[^.]|[^;])*/).each do |it|
irb(main):020:1* next if it.empty?
irb(main):021:1> puts " --> #{it}"
irb(main):022:1> end
--> a is a word
--> b is too
--> c\; for fun
--> d -- forget it
=> ["a is a word", "", "b is too", "", "c\\; for fun", "", "d -- forget it", "", ""]

s = "aa;bbb\\;;abc;;d\\\\;e;"
=> "aa;bbb\\;;abc;;d\\\\;e;"
s.scan /(?:\\.|[^\\;])+/
=> ["aa", "bbb\\;", "abc", "d\\\\", "e"]

Regards

robert
 
Robert Klemme

Simon Strandgaard said:
On Friday 01 October 2004 09:45, Robert Klemme wrote:
[snip]
s = "aa;bbb\\;;abc;;d\\\\;e;"
=> "aa;bbb\\;;abc;;d\\\\;e;"
s.scan /(?:\\.|[^\\;])+/
=> ["aa", "bbb\\;", "abc", "d\\\\", "e"]


If it's a CSV file, shouldn't the output then be:

["aa", "bbb\\;", "abc", "", "d\\\\", "e", ""]

Darn! You're right. Unfortunately using "*" instead of "+" is not
sufficient: far too many empty strings are found that way.

robert
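
One possible way around the empty-string problem, offered as a sketch rather than a definitive answer: append a separator to the input and make every match consume its trailing `;`. No match is ever zero-length, so `*` no longer produces spurious empty strings, while genuinely empty fields survive. `split_keeping_empty` is a hypothetical name, and the sketch assumes single-line input with no dangling backslash at the end.

```ruby
# Hypothetical helper: each match is "field body + terminating ';'",
# so every match is at least one character long.
def split_keeping_empty(str)
  (str + ';').scan(/((?:\\.|[^\\;])*);/).flatten
end

s = "aa;bbb\\;;abc;;d\\\\;e;"
p split_keeping_empty(s)
# prints ["aa", "bbb\\;", "abc", "", "d\\\\", "e", ""]
```

The fields are returned still escaped, matching the output format used earlier in the thread.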
 
