Surprising Regexp Behavior

  • Thread starter James Edward Gray II

James Edward Gray II

I keep running into some surprising points with Ruby's Regexp engine
today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

Here's another surprise, for me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)\Z/) { $1.strip }
=> "<p>one</p>\n\ntwo"

Using an anchor there means that the left-most match doesn't win?

Here's my Ruby version:

$ ruby -v
ruby 1.8.2 (2004-12-25) [powerpc-darwin7.7.0]

Thanks for any wisdom you can impart.

James Edward Gray II
 
Pit Capitain

James said:
I keep running into some surprising points with Ruby's Regexp engine
today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

Here's another surprise, for me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)\Z/) { $1.strip }
=> "<p>one</p>\n\ntwo"

Using an anchor there means that the left-most match doesn't win?

James, what did you expect? Both examples look perfectly valid to me.

Regards,
Pit
 
Ara.T.Howard

I keep running into some surprising points with Ruby's Regexp engine today
and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"

irb(main):002:0> html[ %r| <p> .*? |x ]
=> "<p>"

irb(main):003:0> html[ %r| <p> .*? </p> |x ]
=> "<p>one</p>"

irb(main):004:0> html[ %r| <p> .*? </p> .* |x ]
=> "<p>one</p>"

hmm?

but if we use 'm' to make '.' match newline:

irb(main):005:0> html[ %r| <p> .*? </p> .* |xm ]
=> "<p>one</p>\n\n<p>two</p>"

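alternatively we can name newline explicitly (note that inside a character class '.' is a literal dot, so this picks up the newlines but still stops before the second paragraph):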

irb(main):006:0> html[ %r| <p> .*? </p> [.\n]* |x ]
=> "<p>one</p>\n\n"

probably 'm' is better for html though.
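A small sketch of the difference, assuming the same sample string as above. Inside brackets the dot loses its wildcard meaning, so a non-capturing alternation is one way to spell "any character, newline included" without /m. The variable names are only illustrative.

```ruby
html = "<p>one</p>\n\n<p>two</p>"

# [.\n] is a character class: a literal dot or a newline, nothing else.
literal_dot = html[/<p>.*?<\/p>[.\n]*/]

# (?:.|\n) is "any non-newline character, or a newline".
any_char = html[/<p>.*?<\/p>(?:.|\n)*/]

p literal_dot  # "<p>one</p>\n\n"
p any_char     # "<p>one</p>\n\n<p>two</p>"
```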

irb(main):007:0> html =~ %r| <p> (.*?) </p> (.*) |xm and p [$1, $2]
["one", "\n\n<p>two</p>"]

cheers.

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| Your life dwells among the causes of death
| Like a lamp standing in a strong breeze. --Nagarjuna
===============================================================================
 
David A. Black

Hi --

I keep running into some surprising points with Ruby's Regexp engine today
and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

Here's another surprise, for me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)\Z/) { $1.strip }
=> "<p>one</p>\n\ntwo"

Using an anchor there means that the left-most match doesn't win?

In both cases, if you use the /m modifier, the dot will match \n, and
I think the behavior you want will happen.
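Here is a short sketch of why both results follow from the dot not matching "\n" by default, and what /m changes. It uses the sample string from the original post; the variable names are just for illustration.

```ruby
html = "<p>one</p>\n\n<p>two</p>"

# Without /m, "." stops at "\n", so (.*) can only capture the (empty)
# remainder of the first line.
m = html.match(/<p>(.*?)<\/p>(.*)/)
no_m_rest = m[2]

# With \Z but no /m, a match starting at the first <p> cannot reach the end
# of the string, so the leftmost position where the whole pattern succeeds
# is the second paragraph.
anchored = html.sub(/<p>(.*?)<\/p>(.*)\Z/) { $1.strip }

# With /m the dot also matches "\n", so the match starts at the first <p>
# and (.*) swallows everything after it.
mm = html.match(/<p>(.*?)<\/p>(.*)\Z/m)
with_m = [mm[1], mm[2]]

p no_m_rest  # ""
p anchored   # "<p>one</p>\n\ntwo"
p with_m     # ["one", "\n\n<p>two</p>"]
```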


David
 
Robert Klemme

James said:
I keep running into some surprising points with Ruby's Regexp engine
today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Maybe I overlooked something but I didn't see anybody mention it: the
trailing (.*) seems quite superfluous to me. Why did you put it there?
That way you make the regexp engine match more than you need and if you
change sub! to gsub! at some time, you'll likely still have only one
replacement, because .* matches anything to the end.
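To illustrate the gsub point, here is a minimal sketch with a made-up one-line input (not from the original post). The trailing (.*) consumes the second paragraph as part of the first match, so gsub has nothing left to replace.

```ruby
html = "<p>a</p><p>b</p>"

# The trailing (.*) eats "<p>b</p>" during the first match.
with_tail = html.gsub(/<p>(.*?)<\/p>(.*)/) { $1.strip }

# Without it, each paragraph is matched and replaced on its own.
without_tail = html.gsub(/<p>(.*?)<\/p>/) { $1.strip }

p with_tail     # "a"
p without_tail  # "ab"
```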

Kind regards

robert
 
James Edward Gray II

Maybe I overlooked something but I didn't see anybody mention it: the
trailing (.*) seems quite superfluous to me. Why did you put it
there?

So I could check to see if there was more content after the first
paragraph that I trimmed. The code goes on to replace it with an
ellipsis if there was.

James Edward Gray II
 
Robert Klemme

James said:
So I could check to see if there was more content after the first
paragraph that I trimmed. The code goes on to replace it with an
ellipsis if there was.

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }
html.sub!(/<p>(.*?)<\/p>(.*)/) { "<p>#{$1.strip}</p>" }

Kind regards

robert
 
James Edward Gray II

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }
html.sub!(/<p>(.*?)<\/p>(.*)/) { "<p>#{$1.strip}</p>" }

The method takes a chunk of HTML and pulls the first paragraph out of
it (minus the <p> and </p> tags). But I want to know if there was
other content, so I can add an ellipsis if needed.

Here's the entire method, defined in a Rails helper module:

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  if $2 =~ /\S/
    "#{html} #{link_to '...', :action => :show, :id => id}"
  else
    html
  end
end

It works as expected now.
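For anyone wanting to try the idea outside Rails, here is a rough standalone sketch. The name first_paragraph_excerpt is hypothetical, it takes HTML that has already been rendered, and it appends a plain " ..." where the helper above calls link_to. It also reads the captures from a MatchData object instead of $1 and $2, which is a swapped-in choice and not what the helper does.

```ruby
# Hypothetical stand-in for the Rails helper: no sanitize/textilize/link_to.
def first_paragraph_excerpt(html)
  match = html.match(/<p>(.*?)<\/p>(.*)\z/m)
  return html if match.nil?              # no paragraph: leave input untouched

  # Keep anything before the first <p>, as sub! would.
  excerpt = match.pre_match + match[1].strip
  rest    = match[2]

  # Only add the ellipsis when something other than whitespace follows.
  rest =~ /\S/ ? "#{excerpt} ..." : excerpt
end

p first_paragraph_excerpt("<p>one</p>\n\n<p>two</p>")  # "one ..."
p first_paragraph_excerpt("<p> one </p>\n")            # "one"
```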

James Edward Gray II
 
Robert Klemme

James said:
The method takes a chunk of HTML and pulls the first paragraph out of
it (minus the <p> and </p> tags). But I want to know if there was
other content, so I can add an ellipsis if needed.

Here's the entire method, defined in a Rails helper module:

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  if $2 =~ /\S/
    "#{html} #{link_to '...', :action => :show, :id => id}"
  else
    html
  end
end

It works as expected now.

This might be a bit more efficient (dunno how often you call it):

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  html << link_to( '...', :action => :show, :id => id ) if $2 =~ /\S/
  html
end

An alternative

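def excerpt( textile, id )
  html = sanitize(textilize(textile))
  # \s* skips whitespace so that (\S)? captures only when real content
  # follows; behind a greedy .* the optional group would always be empty
  html.sub!(/<p>(.*?)<\/p>\s*(\S)?.*\Z/m) { $1.strip }
  html << link_to( '...', :action => :show, :id => id ) if $2
  html
end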

Just an idea...

Cheers

robert
 
James Edward Gray II

This might be a bit more efficient (dunno how often you call it):

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  html << link_to( '...', :action => :show, :id => id ) if $2 =~ /\S/
  html
end

That's not equivalent. You're missing a space between html's content
and the ellipsis.

But thanks for the ideas.

James Edward Gray II
 
Ara.T.Howard

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }

it never occurred to me that regexes could be made to be context sensitive in
that way - that usage of the block, i think, makes them recognize more than
the regular languages, doesn't it? something like

string.sub(pat){ $1 =~ /foo/ ? 'bar' : 'baz' }

though i suppose you can only look backward using this unless the pattern was
made quite general to ensure capture forward....

-a
 
Robert Klemme

Ara.T.Howard said:
it never occurred to me that regexes could be made to be context
sensitive in that way - that usage of the block, i think, makes them
recognize more than the regular languages, doesn't it?

No. The block is just for the replacement. It doesn't change anything
for the match.
something like

string.sub(pat){ $1 =~ /foo/ ? 'bar' : 'baz' }

though i suppose you can only look backward using this unless the
pattern was made quite general to ensure capture forward....

I don't see how this is look forward or backward. The group actually has
to be matched to be able to use it as basis for some kind of conditional
replacement. There's no lookahead / lookbehind magic involved - or I
cannot see it. :)
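A small sketch of that point, with a made-up input and a hypothetical method name (relabel). The pattern decides what matches; the block runs afterwards, once per match, and only chooses the replacement text.

```ruby
# Hypothetical example: every "word-digits" token matches regardless of what
# the block does; the block merely picks the replacement for each match.
def relabel(text)
  text.gsub(/(\w+)-\d+/) { $1 =~ /foo/ ? 'bar' : 'baz' }
end

p relabel("foo-1 qux-2")  # "bar baz"
```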

Kind regards

robert
 
