Need help with a regexp

rpheath

I'm trying to write a regular expression to replace a <pre>...</pre>
block or a <blockquote><p>...</p></blockquote> block with a blank ('').
I can only get the <pre>...</pre> to work correctly. Here's what I
have:

text.gsub(/^<pre>[^<]*<\/pre>$|^<blockquote><p>(.*?)<\/p><\/blockquote>$/,'')

Can someone help me figure out why the blockquote is still showing
up??? Thanks in advance.
 
Daniel Finnie

Why are you doing a gsub but then anchoring the Regexp with ^ and $?
Either use a normal sub or take out all the ^s and $s (except for the
ones inside character class definitions, i.e., in square brackets).

Please post some sample text, not of what you would like to remove but
of what you would like to remove it from.

Dan
 
rpheath

Thanks for the reply. I'm relatively new to regular expressions, and
misinterpreted the ^s and $s. I thought they applied to each specific
check, so the match was either the first pattern "|" (or) the second
pattern.

Here's sample text that would be passed into it.
-----------------------
<p>This is the first sentence. Now I'll post a code snippet:</p>

<pre>
def strip_blocks(text)
text.gsub([regex],'')
end
</pre>

<p>This is another sentence before the block quote.</p>

<blockquote>
<p>This is a quote</p>
</blockquote>

<p>This is one more sentence</p>
----------------------

What I would like to have left is this:

----------------------
<p>This is the first sentence. Now I'll post a code snippet:</p>

<p>This is another sentence before the block quote.</p>

<p>This is one more sentence</p>
----------------------
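
For reference, here is a small sketch of why the pattern from the first post drops the <pre> block but leaves the blockquote. SAMPLE is a shortened stand-in for the text above and strip_with_original is a hypothetical helper name, both assumptions made for illustration. It relies on standard Ruby behavior: ^ and $ match at line boundaries, and '.' does not match a newline unless the m flag is given.

```ruby
# The pattern from the first post, unchanged.
ORIGINAL = /^<pre>[^<]*<\/pre>$|^<blockquote><p>(.*?)<\/p><\/blockquote>$/

# Shortened stand-in for the sample text (an assumption, not the full post).
SAMPLE = <<~HTML
  <p>Intro</p>

  <pre>
  def strip_blocks(text)
  end
  </pre>

  <blockquote>
  <p>This is a quote</p>
  </blockquote>
HTML

# Hypothetical helper: apply the original pattern as-is.
def strip_with_original(text)
  text.gsub(ORIGINAL, '')
end

# [^<]* crosses newlines, so the <pre> block is removed...
puts strip_with_original(SAMPLE).include?('<pre>')         # false
# ...but the pattern expects <blockquote><p> with nothing in between,
# and the sample has a newline there, so the blockquote survives.
puts strip_with_original(SAMPLE).include?('<blockquote>')  # true

# '.' only matches a newline when the m flag is set.
p(/<blockquote>.<p>/  =~ "<blockquote>\n<p>")  # nil
p(/<blockquote>.<p>/m =~ "<blockquote>\n<p>")  # 0
```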
 
Edwin Fine

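Try this. It uses the "non-greedy" modifier '?' together with the 'm'
flag (so that '.' also matches newlines) and the 'i' flag
(case-insensitive matching). Without the non-greedy modifier, '.*' would
run from the first opening tag to the last closing tag of that name and
gobble up everything in between, including nested tags of the same name
and any text between separate blocks. This is probably not what you
would want.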


def remove_tag_block(tag, text)
text.gsub(/<#{tag}>.*?<\/#{tag}>/im, '')
end
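
To see what the '?' buys you, here is a rough side-by-side sketch. The helper names strip_greedy and strip_lazy and the input string are made up for illustration; the only difference between the two patterns is the '?' after '.*'.

```ruby
# Greedy: '.*' stretches to the LAST </pre> in the text.
def strip_greedy(html)
  html.gsub(/<pre>.*<\/pre>/im, '')
end

# Non-greedy: '.*?' stops at the FIRST </pre> it can reach.
def strip_lazy(html)
  html.gsub(/<pre>.*?<\/pre>/im, '')
end

html = "<pre>one</pre><p>keep me</p><pre>two</pre>"

p strip_greedy(html)  # => ""  (the paragraph in the middle is lost too)
p strip_lazy(html)    # => "<p>keep me</p>"
```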

irb(main):054:0> text
=> "<p>This is the first sentence. Now I'll post a code
snippet:</p>\n\n<pre>\ndef strip_blocks(text)\n
text.gsub([regex],'')\nend\n</pre>\n\n<p>This is another sentence before
the block quote.</p>\n\n<blockquote>\n <p>This is a
quote</p>\n</blockquote>\n\n<p>This is one more sentence</p>"

irb(main):055:0> t=remove_tag_block("pre", text)

=> "<p>This is the first sentence. Now I'll post a code
snippet:</p>\n\n\n\n<p>This is another sentence before the block
quote.</p>\n\n<blockquote>\n <p>This is a
quote</p>\n</blockquote>\n\n<p>This is one more sentence</p>"

irb(main):056:0> remove_tag_block("blockquote", t)

=> "<p>This is the first sentence. Now I'll post a code
snippet:</p>\n\n\n\n<p>This is another sentence before the block
quote.</p>\n\n\n\n<p>This is one more sentence</p>"
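
If both tags should go in one pass, and the leftover blank lines matter, something along these lines might do. It is only a sketch: the signature with a default tag list is an assumption, the backreference \1 makes the closing tag match whichever opening tag was found, and the trailing \s* swallows the whitespace that follows each removed block.

```ruby
# Hypothetical one-pass variant: remove every listed block tag and the
# whitespace that follows it.
def strip_blocks(text, tags = %w[pre blockquote])
  names = tags.map { |t| Regexp.escape(t) }.join('|')
  # <(pre|blockquote)> ... </\1>, non-greedy, '.' matching newlines (m),
  # case-insensitive (i).
  text.gsub(/<(#{names})>.*?<\/\1>\s*/im, '')
end

text = "<p>A</p>\n\n<pre>\ncode\n</pre>\n\n<p>B</p>\n\n" \
       "<blockquote>\n <p>Q</p>\n</blockquote>\n\n<p>C</p>"

p strip_blocks(text)
# => "<p>A</p>\n\n<p>B</p>\n\n<p>C</p>"
```

Like remove_tag_block, this still does not handle nested tags of the same name.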

The problem is that this won't work with nested tags, e.g.

<table><tr><td><table>stuff</table></td></tr></table>

irb(main):065:0> x="<table><tr><td><table>stuff</table></td></tr></table>"
=> "<table><tr><td><table>stuff</table></td></tr></table>"
irb(main):066:0> remove_tag_block("table", x)
=> "</td></tr></table>"

This is because *regular* regular expressions :) can't match nested
pairs, such as "((()(())()))". I think I read somewhere the phrase that
regexps can't count. You have to use *recursive* regular expressions,
which are found in PCRE (Perl-compatible REs), but AFAIK not in the
current Ruby regexp engine. Maybe Oniguruma has it - I dunno. I saw a
PCRE extension for Ruby somewhere, but I don't know anything about it.

The Perl RE for matching nested parentheses is apparently as follows
(from
http://www.sitepoint.com/blogs/2006/09/26/the-joy-of-regular-expressions-1/)

\(((?>[^()]+)|(?R))*\)

I believe that to do this correctly without PCRE, you have to resort to
some text parsing or use a SAX parser or similar. Maybe some Ruby guru
(i.e. not me) will be able to pull out an RE or some easy way to do
this.
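
For later readers: the Oniguruma-based engine that ships with Ruby 1.9 and newer does support recursion, through subexpression calls written \g<name>. The sketch below assumes such a Ruby. BALANCED is a rough counterpart of the Perl pattern above, and remove_nested_tag_block is a hypothetical variant of remove_tag_block that only handles bare tags without attributes. For real-world HTML a parser is still the safer route.

```ruby
# Balanced parentheses: the named group "paren" calls itself via \g<paren>.
BALANCED = /(?<paren>\((?:(?>[^()]+)|\g<paren>)*\))/

p "x((()(())()))y"[BALANCED]  # => "((()(())()))"

# Hypothetical nested-aware variant of remove_tag_block.
def remove_nested_tag_block(tag, text)
  t = Regexp.escape(tag)
  # Inside a block, consume either one character that does not start an
  # opening or closing tag of this name, or a complete nested block.
  nested = /(?<block><#{t}>(?:(?!<\/?#{t}>).|\g<block>)*<\/#{t}>)/im
  text.gsub(nested, '')
end

x = "<table><tr><td><table>stuff</table></td></tr></table>"
p remove_nested_tag_block("table", x)  # => ""
```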
 
greg

You are missing the 'm' flag, which allows '.' to match newlines:

pre_match = /<pre>.*?<\/pre>/m
block_match = /<blockquote>.*?<p>.*?<\/p>.*?<\/blockquote>/m
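
One possible way to use two patterns like these in a single gsub is Regexp.union, which keeps each pattern's own flags. This is a sketch that assumes the closing tag is written <\/pre> and that there is no stray ':' in the blockquote pattern; the constant names and strip_pre_and_quote are hypothetical.

```ruby
PRE_MATCH   = /<pre>.*?<\/pre>/m
BLOCK_MATCH = /<blockquote>.*?<p>.*?<\/p>.*?<\/blockquote>/m

# Hypothetical helper: remove whichever of the two patterns matches.
def strip_pre_and_quote(text)
  text.gsub(Regexp.union(PRE_MATCH, BLOCK_MATCH), '')
end

text = "<p>a</p>\n<pre>\nx\n</pre>\n<blockquote>\n<p>q</p>\n</blockquote>\n<p>b</p>"
p strip_pre_and_quote(text)
# => "<p>a</p>\n\n\n<p>b</p>"
```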
 
Rob Biedenharn

rpheath said:
Thanks for the reply. I'm relatively new to regular expressions, and
misinterpreted the ^s and $s. I was thinking they were for that
specific check, so it was either the first string "|" (or) the second
string.

http://groups.google.com/group/rubyonrails-talk/browse_frm/thread/6c75d5d4df368186/2743494eb303014c#2743494eb303014c

And might I suggest picking ONE mailing list on which to ask your
questions (Ruby is actually the better one for this question about
regular expressions), and then JUST ASK ONCE.

-Rob

Rob Biedenharn http://agileconsultingllc.com
 
