Regexp question

Discussion in 'Ruby' started by Mark Probert, Sep 30, 2004.

  1. Mark Probert

    Mark Probert Guest

    Hi, Rubyists.

    What is the best way of attacking field split on ';' when the string looks
    like:

    s = 'a;b;c\;;d;'
    s.split(/???;/)
    => ["a", "b", "c\;", "d"]

    Or is it best to use s.each_byte and do it by hand?

    --
    -mark. (probertm @ acm dot org)
    Mark Probert, Sep 30, 2004
    #1

  2. On Thursday 30 September 2004 23:15, Mark Probert wrote:
    > Hi, Rubyists.
    >
    > What is the best way of attacking field split on ';' when the string looks
    > like:
    >
    > s = 'a;b;c\;;d;'
    > s.split(/???;/)
    > => ["a", "b", "c\;", "d"]
    >
    > Or is it best to use s.each_byte and do it by hand?


    How about something ala

    irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
    => ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]

    --
    Simon Strandgaard
    Simon Strandgaard, Sep 30, 2004
    #2

  3. Mark Probert wrote:
    > Hi, Rubyists.
    >
    > What is the best way of attacking field split on ';' when the string looks
    > like:
    >
    > s = 'a;b;c\;;d;'
    > s.split(/???;/)
    > => ["a", "b", "c\;", "d"]
    >
    > Or is it best to use s.each_byte and do it by hand?
    >


    Normally this would call for a fixed-width negative lookbehind,

    /(?<!\\);/

    but as far as I know it's not included in the Ruby regexp engine.

    But for further clarification:
    How should 'a;b\\;;c' be split?
    If backslashes can be escaped (and you'd want that, because otherwise you
    can't have a field "b\"), it's more difficult.

    And maybe the CSV library can help you here.
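
For readers on a newer interpreter: Ruby 1.9 and later do accept the lookbehind syntax shown above. The following is only a sketch of how it behaves, using the string from the original question plus a second, made-up input with an escaped backslash. The variable names are illustrative.

```ruby
# The string from the question, character for character: a;b;c\;;d;
plain = 'a;b;c\;;d;'

# Split on every ';' that is not directly preceded by a backslash.
# Requires a Ruby whose regexp engine supports lookbehind (1.9 or later).
fields = plain.split(/(?<!\\);/)
# fields is ["a", "b", "c\\;", "d"]; split drops the trailing empty field

# An escaped backslash followed by a real separator: a;b\\;;c
tricky = "a;b\\\\;;c"
tricky_fields = tricky.split(/(?<!\\);/)
# tricky_fields is ["a", "b\\\\;", "c"]: the ';' after the escaped
# backslash is wrongly treated as escaped, so the empty field is lost
```

So the lookbehind is enough only if a backslash can never itself be escaped. Passing -1 as the second argument to split keeps trailing empty fields.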

    regards,

    Brian

    --
    Brian Schröder
    http://ruby.brian-schroeder.de/
    Brian Schröder, Sep 30, 2004
    #3
  4. On Thursday 30 September 2004 23:29, Simon Strandgaard wrote:
    > On Thursday 30 September 2004 23:15, Mark Probert wrote:
    > > Hi, Rubyists.
    > >
    > > What is the best way of attacking field split on ';' when the string
    > > looks like:
    > >
    > > s = 'a;b;c\;;d;'
    > > s.split(/???;/)
    > > => ["a", "b", "c\;", "d"]
    > >
    > > Or is it best to use s.each_byte and do it by hand?

    >
    > How about something ala
    >
    > irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
    > => ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]



    Maybe this one is better?

    irb(main):001:0> "aa;bbb\\;;abc;;d\\\\;e;f".scan(/(?:\A|;)((?:\\[^.]|[^;])*)/) { p $1 }
    "aa"
    "bbb\\;"
    "abc"
    ""
    "d\\\\"
    "e"
    "f"
    => "aa;bbb\\;;abc;;d\\\\;e;f"
    irb(main):002:0>

    --
    Simon Strandgaard
    Simon Strandgaard, Sep 30, 2004
    #4
  5. Mark Probert

    Mark Probert Guest

    Hi ..

    Simon Strandgaard <> wrote:
    >
    > How about something ala
    >
    > irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
    > => ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]
    >


    Thanks! That is close enough:

    irb(main):019:0> s.scan(/(?:\\[^.]|[^;])*/).each do |it|
    irb(main):020:1* next if it.empty?
    irb(main):021:1> puts " --> #{it}"
    irb(main):022:1> end
    --> a is a word
    --> b is too
    --> c\; for fun
    --> d -- forget it
    => ["a is a word", "", "b is too", "", "c\\; for fun", "", "d -- forget it", "", ""]



    --
    -mark. (probertm @ acm dot org)
    Mark Probert, Sep 30, 2004
    #5

  6. > But for further clarification:
    > How should 'a;b\\;;c' be split?

    My guess is that it should be
    ["a", "b\", nil, "c"]

    The characters escaped by a backslash are semicolon, colon and backslash, i.e.

    ; => \;    : => \:    \ => \\
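
The escaping rules listed above could be written down roughly as follows. This is a sketch only; escape_field and unescape_field are hypothetical names, not something from this thread or from a library.

```ruby
# Hypothetical helper: put a backslash in front of ';', ':' and '\'.
def escape_field(field)
  field.gsub(/[;:\\]/) { |ch| "\\#{ch}" }
end

# Hypothetical helper: drop the backslash from every escaped character.
# The /m flag lets an escaped newline pass through as well.
def unescape_field(field)
  field.gsub(/\\(.)/m) { Regexp.last_match(1) }
end

escaped  = escape_field("a;b:c\\")   # raw text a;b:c\ becomes a\;b\:c\\
restored = unescape_field(escaped)   # back to a;b:c\
```

Unescaping has to happen after the record has been split into fields, otherwise the escaped separators turn back into real ones.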

    > If backslashes can be escaped (and you'd want that, because otherwise you
    > can't have a field "b\"), it's more difficult.
    >
    > And maybe the CSV library can help you here.


    thanks,
    Dany
    Dany Cayouette, Sep 30, 2004
    #6
  7. On Thu, 30 Sep 2004 17:57:19 -0400
    Dany Cayouette <> wrote:

    >
    > > But for further clarification:
    > > How should 'a;b\\;;c' be split?

    > Guess is that it should be
    > ["a", "b\", nil, "c"]

    Sorry... I meant
    ["a", "b\\", nil, "c"], where b\\ would ultimately become b\ once the escape characters are processed in the data portion.
    >
    > characters escaped by backslash at semi-colon, colon and backslash i.e.
    >
    > ; => \; : => \: \ => \\
    >
    > > If backslashes can be escaped (and you'd want that, because otherwise you
    > > can't have a field "b\"), it's more difficult.
    > >

    Didn't think about that one... I thought this was simple and the problem was my lack of programming experience...

    Dany
    Dany Cayouette, Sep 30, 2004
    #7
  8. Mark Probert wrote:

    > Hi, Rubyists.


    Moin!

    > What is the best way of attacking field split on ';' when the string looks
    > like:
    >
    > s = 'a;b;c\;;d;'
    > s.split(/???;/)
    > => ["a", "b", "c\;", "d"]
    >
    > Or is it best to use s.each_byte and do it by hand?


    This works (even with escaped escape characters), but you might be
    better off doing it by hand to keep complexity low:

    > irb(main):025:0> str = "hello;world;foo\\;bar;no escape\\\\;blar"; puts str
    > hello;world;foo\;bar;no escape\\;blar
    > => nil
    > irb(main):026:0> str.scan(/(?:(?!\\).(?:\\{2})*\\;|[^;])+/).map { |str| str.gsub(/\\(.)/, '\1') }
    > => ["hello", "world", "foo;bar", "no escape\\", "blar"]
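
A by-hand version, as suggested above, might look roughly like this. It is a sketch under assumptions: the name split_escaped and its default arguments are made up for illustration, a backslash escapes whatever single character follows it, and the escapes are removed while splitting.

```ruby
# Hypothetical helper: split str on sep, treating esc as an escape character.
# Empty fields (including a trailing one) are kept as "".
def split_escaped(str, sep = ';', esc = '\\')
  fields  = [String.new]
  escaped = false
  str.each_char do |ch|
    if escaped
      fields.last << ch        # take the escaped character literally
      escaped = false
    elsif ch == esc
      escaped = true           # remember the escape, emit nothing yet
    elsif ch == sep
      fields << String.new     # unescaped separator: start a new field
    else
      fields.last << ch
    end
  end
  fields.last << esc if escaped  # dangling escape at end of input: keep it
  fields
end

result = split_escaped("hello;world;foo\\;bar;no escape\\\\;blar")
# result is ["hello", "world", "foo;bar", "no escape\\", "blar"]
```

Unlike the scan above, this keeps empty fields, so 'a;b;c\;;d;' comes back with a trailing "" as its fifth element.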


    Regards,
    Florian Gross
    Florian Gross, Oct 1, 2004
    #8
  9. "Mark Probert" <> wrote in message
    news:Xns95749654816D0probertmnospamtelusn@198.161.157.145...
    > Hi ..
    >
    > Simon Strandgaard <> wrote:
    > >
    > > How about something ala
    > >
    > > irb(main):015:0> "aa;bbb\\;;abc;;d\\\\;e;".scan(/(?:\\[^.]|[^;])*;/)
    > > => ["aa;", "bbb\\;;", "abc;", ";", "d\\\\;", "e;"]
    > >

    >
    > Thanks! That is close enough:
    >
    > irb(main):019:0> s.scan(/(?:\\[^.]|[^;])*/).each do |it|
    > irb(main):020:1* next if it.empty?
    > irb(main):021:1> puts " --> #{it}"
    > irb(main):022:1> end
    > --> a is a word
    > --> b is too
    > --> c\; for fun
    > --> d -- forget it
    > => ["a is a word", "", "b is too", "", "c\\; for fun", "", "d -- forget it", "", ""]



    >> s = "aa;bbb\\;;abc;;d\\\\;e;"

    => "aa;bbb\\;;abc;;d\\\\;e;"
    >> s.scan /(?:\\.|[^\\;])+/

    => ["aa", "bbb\\;", "abc", "d\\\\", "e"]

    Regards

    robert
    Robert Klemme, Oct 1, 2004
    #9
  10. On Friday 01 October 2004 09:45, Robert Klemme wrote:
    [snip]
    > >> s = "aa;bbb\\;;abc;;d\\\\;e;"

    > => "aa;bbb\\;;abc;;d\\\\;e;"
    > >> s.scan /(?:\\.|[^\\;])+/

    > => ["aa", "bbb\\;", "abc", "d\\\\", "e"]



    If it's a CSV file, shouldn't the output then be:

    ["aa", "bbb\\;", "abc", "", "d\\\\", "e", ""]

    --
    Simon Strandgaard
    Simon Strandgaard, Oct 1, 2004
    #10
  11. "Simon Strandgaard" <> wrote in message
    news:...
    > On Friday 01 October 2004 09:45, Robert Klemme wrote:
    > [snip]
    >> >> s = "aa;bbb\\;;abc;;d\\\\;e;"

    >> => "aa;bbb\\;;abc;;d\\\\;e;"
    >> >> s.scan /(?:\\.|[^\\;])+/

    >> => ["aa", "bbb\\;", "abc", "d\\\\", "e"]

    >
    >
    > If it's a CSV file, shouldn't the output then be:
    >
    > ["aa", "bbb\\;", "abc", "", "d\\\\", "e", ""]


    Darn! You're right. Unfortunately using "*" instead of "+" is not
    sufficient: far too many empty strings are found that way.
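
One possible way around the extra empty strings, offered only as a sketch: prepend a separator to the input so that every match has to consume one ';'. No match is then ever empty, and "*" can be used safely. The method name scan_fields is made up for illustration.

```ruby
# Hypothetical helper: return all fields of s, empty ones included.
# Each match consumes the ';' in front of the field, so scan never
# produces a zero-length match and never reports a field twice.
def scan_fields(s)
  (';' + s).scan(/;((?:\\.|[^\\;])*)/).flatten
end

s = "aa;bbb\\;;abc;;d\\\\;e;"
fields = scan_fields(s)
# fields is ["aa", "bbb\\;", "abc", "", "d\\\\", "e", ""]
```

The escapes are left in place here, as in the scan above. A lone backslash at the very end of the input is silently dropped by this pattern, which may or may not be acceptable.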

    robert
    Robert Klemme, Oct 1, 2004
    #11