Surprising Regexp Behavior

James Edward Gray II · Sep 13, 2005

I keep running into some surprising points with Ruby's Regexp engine
today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

Here's another surprise, for me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)\Z/) { $1.strip }
=> "<p>one</p>\n\ntwo"

Using an anchor there means that the left-most match doesn't win?

Here's my Ruby version:

$ ruby -v
ruby 1.8.2 (2004-12-25) [powerpc-darwin7.7.0]

Thanks for any wisdom you can impart.

James Edward Gray II

Pit Capitain · Sep 13, 2005

James said:
I keep running into some surprising points with Ruby's Regexp engine
today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

Here's another surprise, for me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)\Z/) { $1.strip }
=> "<p>one</p>\n\ntwo"

Using an anchor there means that the left-most match doesn't win?

James, what did you expect? Both examples look perfectly valid to me.

Regards,
Pit

Ara.T.Howard · Sep 13, 2005

I keep running into some surprising points with Ruby's Regexp engine today
and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"

irb(main):002:0> html[ %r| .*? |x ]
=> ""

irb(main):003:0> html[ %r| <p> .*? </p> |x ]
=> "<p>one</p>"

irb(main):004:0> html[ %r| <p> .*? </p> .* |x ]
=> "<p>one</p>"

hmm?

but if we use 'm' to make '.' match newline:

irb(main):005:0> html[ %r| <p> .*? </p> .* |xm ]
=> "<p>one</p>\n\n<p>two</p>"

alternatively we can name newline explicitly:

irb(main):006:0> html[ %r| <p> .*? </p> [.\n]* |x ]
=> "<p>one</p>\n\n"

probably 'm' is better for html though.

irb(main):007:0> html =~ %r| <p> (.*?) </p> (.*) |xm and p [$1, $2]
["one", "\n\n<p>two</p>"]
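One caveat on the explicit-newline form, as a minimal sketch: inside a character class the dot is a literal ".", so [.\n]* only consumes dots and newlines and stops at the next tag. The sample string is an assumption (two <p> paragraphs separated by a blank line), and the (?:.|\n)* alternation is just one possible way to spell "any character including newline" without /m.

```ruby
# Assumed sample input: two paragraphs separated by a blank line.
html = "<p>one</p>\n\n<p>two</p>"

# Inside [...], "." is a literal dot, so [.\n]* only consumes dots and newlines.
class_form = html[/<p>.*?<\/p>[.\n]*/]

# A non-capturing alternation matches "any character, or a newline".
alt_form = html[/<p>.*?<\/p>(?:.|\n)*/]

p class_form
p alt_form
```

The first form stops right before the second paragraph, which is why /m is usually the simpler choice here.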

cheers.

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| Your life dwells among the causes of death
| Like a lamp standing in a strong breeze. --Nagarjuna
===============================================================================

David A. Black · Sep 13, 2005

Hi --

I keep running into some surprising points with Ruby's Regexp engine today
and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Can anyone explain to me how that isn't a bug?

Here's another surprise, for me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)\Z/) { $1.strip }
=> "<p>one</p>\n\ntwo"

Using an anchor there means that the left-most match doesn't win?

In both cases, if you use the /m modifier, the dot will match \n, and
I think the behavior you want will happen.
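A minimal sketch of the difference, assuming the sample string held two <p> paragraphs separated by a blank line. Without /m the dot stops at "\n", so the trailing group is empty and the \Z-anchored pattern can only match on the last line; with /m both patterns match from the first paragraph.

```ruby
# Assumed sample input: two paragraphs separated by a blank line.
html = "<p>one</p>\n\n<p>two</p>"

# Without /m, "." does not match "\n", so (.*) ends at the first line break.
without_m = html.match(/<p>(.*?)<\/p>(.*)/)

# With /m, "." also matches "\n", so (.*) runs to the end of the string.
with_m = html.match(/<p>(.*?)<\/p>(.*)/m)

# Anchored with \Z: without /m only the last line can satisfy the pattern.
anchored   = html.sub(/<p>(.*?)<\/p>(.*)\Z/)  { $1.strip }
anchored_m = html.sub(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }

p without_m[2]
p with_m[2]
p anchored
p anchored_m
```

The left-most match still wins in the anchored case; it is just that the left-most position where the whole pattern can succeed is the second paragraph.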

David

Robert Klemme · Sep 14, 2005

James said:
I keep running into some surprising points with Ruby's Regexp engine
today and this first one just looks plain wrong to me:

irb(main):001:0> html = "<p>one</p>\n\n<p>two</p>"
=> "<p>one</p>\n\n<p>two</p>"
irb(main):002:0> html.sub!(/<p>(.*?)<\/p>(.*)/) { $1.strip }
=> "one\n\n<p>two</p>"
irb(main):003:0> $2
=> ""

Maybe I overlooked something but I didn't see anybody mention it: the
trailing (.*) seems quite superfluous to me. Why did you put it there?
That way you make the regexp engine match more than you need and if you
change sub! to gsub! at some time, you'll likely still have only one
replacement, because .* matches anything to the end.
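A small sketch of that gsub point, using an assumed single-line input with two paragraphs: the trailing (.*) swallows the rest of the line (or the rest of the string under /m), so there is nothing left for a second match.

```ruby
# Assumed sample input: two paragraphs on one line.
html = "<p>one</p><p>two</p>"

# The trailing (.*) consumes the second paragraph, so gsub replaces only once.
with_tail = html.gsub(/<p>(.*?)<\/p>(.*)/) { $1.strip }

# Without it, each paragraph is matched and replaced in turn.
without_tail = html.gsub(/<p>(.*?)<\/p>/) { $1.strip }

p with_tail
p without_tail
```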

Kind regards

robert

James Edward Gray II · Sep 14, 2005

Maybe I overlooked something but I didn't see anybody mention it: the
trailing (.*) seems quite superfluous to me. Why did you put it
there?

So I could check to see if there was more content after the first
paragraph that I trimmed. The code goes on to replace it with an
ellipsis if there was.

James Edward Gray II

Robert Klemme · Sep 14, 2005

James said:
So I could check to see if there was more content after the first
paragraph that I trimmed. The code goes on to replace it with an
ellipsis if there was.

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }
html.sub!(/<p>(.*?)<\/p>(.*)/) { "<p>#{$1.strip}</p>" }

Kind regards

robert

James Edward Gray II · Sep 14, 2005

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }
html.sub!(/<p>(.*?)<\/p>(.*)/) { "<p>#{$1.strip}</p>" }

The method takes a chunk of HTML and pulls the first paragraph out of
it (minus the <p> and </p> tags). But I want to know if there was
other content, so I can add an ellipsis if needed.

Here's the entire method, defined in a Rails helper module:

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  if $2 =~ /\S/
    "#{html} #{link_to '...', :action => :show, :id => id}"
  else
    html
  end
end

It works as expected now.
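For trying the matching logic outside Rails, here is a minimal plain-Ruby sketch. The name first_paragraph and its return shape are hypothetical, not part of the helper above, and it reads the captures from the MatchData object instead of $1 and $2; sanitize, textilize, and link_to are left out entirely.

```ruby
# Hypothetical helper (not from the thread): returns the stripped text of the
# first paragraph plus a flag saying whether non-blank content follows it.
def first_paragraph(html)
  match = html.match(/<p>(.*?)<\/p>(.*)\Z/m)
  return [html, false] unless match   # no paragraph found: pass input through
  [match[1].strip, !!(match[2] =~ /\S/)]
end

text, more = first_paragraph("<p> one </p>\n\n<p>two</p>")
p text
p more
```

Using the MatchData return value avoids depending on the global match variables still holding the right match by the time they are read.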

James Edward Gray II

Robert Klemme · Sep 14, 2005

James said:
The method takes a chunk of HTML and pulls the first paragraph out of
it (minus the <p> and </p> tags). But I want to know if there was
other content, so I can add an ellipsis if needed.

Here's the entire method, defined in a Rails helper module:

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  if $2 =~ /\S/
    "#{html} #{link_to '...', :action => :show, :id => id}"
  else
    html
  end
end

It works as expected now.

This might be a bit more efficient (dunno how often you call it):

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  html << link_to( '...', :action => :show, :id => id ) if $2 =~ /\S/
  html
end

An alternative:

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>.*(\S)?\Z/m) { $1.strip }
  html << link_to( '...', :action => :show, :id => id ) if $2
  html
end

Just an idea...
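One thing worth checking before relying on that alternative, sketched here under the assumption of a two-paragraph input: the greedy .* runs to the end of the string first, so the optional (\S)? is satisfied by matching nothing and the second group stays nil. A variant that skips whitespace before probing for a non-blank character is shown next to it; it is only one possible adjustment.

```ruby
# Assumed sample input: two paragraphs separated by a blank line.
html = "<p>one</p>\n\n<p>two</p>"

# Greedy .* consumes everything up to the end, leaving (\S)? with nothing to match.
greedy = html.match(/<p>(.*?)<\/p>.*(\S)?\Z/m)

# Skipping whitespace first lets (\S)? capture the first non-blank character.
probe = html.match(/<p>(.*?)<\/p>\s*(\S)?/m)

# With only one paragraph there is nothing non-blank left to capture.
solo = "<p>one</p>\n".match(/<p>(.*?)<\/p>\s*(\S)?/m)

p greedy[2]
p probe[2]
p solo[2]
```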

Cheers

robert

James Edward Gray II · Sep 14, 2005

This might be a bit more efficient (dunno how often you call it):

def excerpt( textile, id )
  html = sanitize(textilize(textile))
  html.sub!(/<p>(.*?)<\/p>(.*)\Z/m) { $1.strip }
  html << link_to( '...', :action => :show, :id => id ) if $2 =~ /\S/
  html
end

That's not equivalent. You're missing a space between html's content
and the ellipsis.

But thanks for the ideas.

James Edward Gray II

Ara.T.Howard · Sep 14, 2005

Ah! In that case I'd do something like:

html.sub!(/(<p>)(.*?)(<\/p>)(.*)/) { $1 << $2.strip << $3 }

it never occurred to me that regexes could be made to be context sensitive in
that way - that usage of the block, i think, makes them recognize more than
the regular languages, doesn't it? something like

string.sub(pat){ $1 =~ /foo/ ? 'bar' : 'baz' }

though i suppose you can only look backward using this unless the pattern was
made quite general to ensure capture forward....

-a

Robert Klemme · Sep 14, 2005

James said:
That's not equivalent. You're missing a space between html's content
and the ellipsis.

Right. But hey, that's an easy change, isn't it?

But thanks for the ideas.

You're welcome!

robert

Robert Klemme · Sep 14, 2005

Ara.T.Howard said:
it never occurred to me that regexes could be made to be context
sensitive in that way - that usage of the block, i think, makes them
recognize more than the regular languages, doesn't it?

No. The block is just for the replacement. It doesn't change anything
for the match.

something like

string.sub(pat){ $1 =~ /foo/ ? 'bar' : 'baz' }

though i suppose you can only look backward using this unless the
pattern was made quite general to ensure capture forward....

I don't see how this is look forward or backward. The group actually has
to be matched to be able to use it as basis for some kind of conditional
replacement. There's no lookahead / lookbehind magic involved - or I
cannot see it.
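A minimal sketch of that point, with a made-up input string: the pattern alone decides which substrings match, and the block only chooses the replacement text for each match that was already found.

```ruby
# Made-up input for illustration only.
words = "foo-1 bar-2 foo-3"

# The pattern finds three matches regardless of what the block does;
# the block merely picks "bar" or "baz" based on the captured word.
result = words.gsub(/(\w+)-(\d)/) { $1 =~ /foo/ ? "bar" : "baz" }

p result
```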

Kind regards

robert
