Regexp Guru Needed

  • Thread starter James Edward Gray II
James Edward Gray II

We're having a discussion on Ruby Core about how to speed up CSV.
I'm trying to tune a Regexp that matches CSV fields. However, I'm
seeing something I don't expect. Can someone explain this to me,
please?
",".scan(/(?:^|,)(?:"()"|([^",]*))/)
=> [[nil, ""]]

That's a simplified version of what I'm messing with. My question
is, why does it only match once, when I expect two matches?

The first match should be right at the beginning, and is basically
(?:^ ... )(?: ... ([^",]*)). The second match should begin at the
comma, being (?: ... ,)(?: ... ([^",]*)). What am I missing?

James Edward Gray II
 
Peter Vanbroekhoven

I'm not pretending to be a regexp guru, but nonetheless:

scan moves forward one character whenever the portion of the string that it
matched has length 0. This is to prevent it from going into an infinite
loop. Consider your example: the regexp matches at the start of the string,
and matches 0 characters. If Ruby did not move forward one character before
the next match, the regexp would match at the start of the string again in
exactly the same way, still without consuming any of the string.
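
To make the stepping visible, here is a minimal sketch that replays what scan appears to do on the pattern from the original post. The helper name match_spans is hypothetical (it is not part of Ruby), and the loop is only an approximation of scan's internals written with Regexp#match and a start position.

```ruby
# Simplified field pattern from the original post.
field = /(?:^|,)(?:"()"|([^",]*))/

# Hypothetical helper: re-create scan's stepping by hand and
# record the [start, end] character offsets of each match.
def match_spans(str, re)
  spans = []
  pos = 0
  while pos <= str.length && (m = re.match(str, pos))
    spans << [m.begin(0), m.end(0)]
    # After an empty match, bump along one character;
    # otherwise resume at the end of the match.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  spans
end

p ",".scan(field)          # => [[nil, ""]]
p match_spans(",", field)  # => [[0, 0]]  (empty match at index 0, then the search resumes past the comma)
p "a,".scan(field)         # => [[nil, "a"], [nil, ""]]
```

The last line shows that the comma is only lost when the first field is empty: with a non-empty first field there is no bump-along, so the second match can still start on the comma.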

My suggestion would be to have two regexps, one to strip off the beginning
of the CSV line, and one to split the remainder into parts.
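
One possible reading of that suggestion, as a rough sketch only: the helper split_fields and both regexps below are assumptions made for illustration, and they keep the simplified field pattern from the original post (which only handles empty quoted fields), so this is not a real CSV parser.

```ruby
# Hypothetical helper illustrating the two-regexp idea.
def split_fields(line)
  first_re = /\A(?:"()"|([^",]*))/  # first field, anchored at the start of the line
  rest_re  = /,(?:"()"|([^",]*))/   # each later field, including its leading comma

  first = first_re.match(line)      # always matches, possibly with an empty field
  fields = [first[1] || first[2]]

  # Every rest_re match consumes at least the comma, so scan never
  # produces an empty match here and no bump-along occurs.
  first.post_match.scan(rest_re) { |quoted, plain| fields << (quoted || plain) }
  fields
end

p split_fields(",")    # => ["", ""]
p split_fields("a,b")  # => ["a", "b"]
```

Because the comma is now a required part of every match after the first, the leading empty field no longer swallows the separator.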

Peter

 
James Edward Gray II

I am aware of the infamous "bump-along", but doesn't 0 + 1 == 1? I
expected that to put it on the comma, which would work just fine.

James Edward Gray II
 
James Edward Gray II

Never mind. I get how dumb I'm being now. There's only one
character, the comma at index 0, so the bump-along moves right past
it. Duh. Thanks for the lesson.

James Edward Gray II
 
Stephen Waits

James said:
From Mastering Regular Expressions (2nd Edition).

Check out RegexBuddy. Worth getting access to Win32 just for this if
you're a Mac guy needing to debug some REs.

--Steve
 
Stephen Waits

Or http://www.weitz.de/regex-coach/ - it's the best one I've seen,
and has Linux and Windows ports (sadly no Mac version).

Thanks for the link, Martin. I hadn't found it before. I tried it
out, and it's a nice "free" alternative to RegexBuddy; however, it
pales in comparison to what RB can do. I do wish RB were a little
cheaper - I've bought much richer software for less money.

--Steve
 
