Regex question, limit repeats UNLESS within specified tags

Jason C

I'm currently limiting repeated characters like so:

$text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;

I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.

I'm guessing that this would be done with negative lookbehind, like this:

# Note, these aren't tested, just here for the explanation
$text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
$text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;

Neither of these are going to be perfect, though, because:

1. in the first one, I need to test for both an opening <img and an ending >; otherwise, I think it would not catch something like "<img src='aaa.jpg'> bbbbbbbbbb" (since the repeated "b" comes after "<img").

2. in the second one, I also need to test for the ending >, but also for the closing </a>. Even if I fixed the ending >, I could still end up with a confusing "<a href='http://www.aaaaaaaaaa.com'>http://www.aaaaaa.com</a>"
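
For what it's worth, a quick sketch suggests the lookbehind is only checked at the position where a run starts, i.e. against the four characters directly before it. Assuming the made-up sample string below, the run inside the tag is still shortened, and so is the run after it:

```perl
use strict;
use warnings;

# Hypothetical sample: one run inside the tag, one run after it.
my $text = q{<img src='aaaaaaaaaa.jpg'> bbbbbbbbbb};

# (?<!<img) only rejects a run that starts immediately after "<img";
# it knows nothing about being "inside" the tag.
(my $out = $text) =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;

print "$out\n";    # prints: <img src='aaaaaa.jpg'> bbbbbb
```

So a lookbehind on its own does not seem able to express "not inside a tag".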


Any suggestions on how to do either of these better? TIA,

Jason
 
Justin C

Found in /usr/share/perl/5.10/pod/perlfaq6.pod
How do I match XML, HTML, or other nasty, ugly things with a regex?
(contributed by brian d foy)

If you just want to get work done, use a module and forget about the
regular expressions. The "XML::Parser" and "HTML::Parser" modules are
good starts, although each namespace has other parsing modules
specialized for certain tasks and different ways of doing it. Start at
CPAN Search ( http://search.cpan.org ) and wonder at all the work
people have done for you already! :)

Use the modules and use your regex on what's left. Don't try to
write REs for HTML; life is too short.


Justin.
 
Jason C

I've used HTML::Parser at length, but I don't think that it offers anything like what I need. I looked through CPAN and didn't find anything like this.

I might have made the OP seem too complicated. What I really need to figure out is how to run a regex where both the look-behind AND look-ahead match.

Something like this, I guess:

# Not tested
while (($text !~ /<img[^>]*?>/gi) &&
       ($text !~ /<a href[^>]*?>/gi)) {
    $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
}

Or maybe two separate loops, like this:

while ($text !~ /<img[^>]*?>/gi) {
    $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
}

while ($text !~ /<a href([^>]*?)>(.*?)<\/a>/gi) {
    $pattern = $repl = $1;

    $pattern = quotemeta($pattern);
    $repl =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;

    $text =~ s#$pattern#$repl#gsi;
}
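
Another way to express "only outside those tags" without a loop might be to split the text on the tags and run the substitution on the remaining pieces. This is only a rough sketch: squeeze_outside_tags is a made-up helper name, and it assumes well-formed tags with no ">" inside attribute values and no nested links.

```perl
use strict;
use warnings;

# squeeze_outside_tags is a hypothetical helper name, not an existing function.
sub squeeze_outside_tags {
    my ($text) = @_;

    # The separator pattern sits in a capturing group, so split also returns
    # the separators: even indexes hold plain text, odd indexes hold an
    # <img ...> tag or a whole <a href ...>...</a> element.
    my @parts = split m{(<img\b[^>]*>|<a\s+href\b[^>]*>.*?</a>)}is, $text;

    for my $i (0 .. $#parts) {
        next if $i % 2;    # leave tags and links untouched
        $parts[$i] =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
    }
    return join '', @parts;
}

my $in = q{Hi!!!!!!!!!! <img src='aaaaaaaaaa.jpg'> }
       . q{<a href='http://www.aaaaaaaaaa.com'>http://www.aaaaaaaaaa.com</a>};
print squeeze_outside_tags($in), "\n";
```

With this input, only the run of exclamation marks should be shortened; the image tag and the whole link, including its visible text, are passed through as they are.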

Thoughts?
 
Peter J. Holzer

Your use case is exotic. You will not find exactly what you need off the
shelf. You will find ways to break a document up into <IMG>, <A>, and
neither of those when you use a parsing module. Thus broken up, you can
then do your substring regexp.

Agreed.


No, I don't think you made it seem "too complicated", it *is* too
complicated.

I don't know whether it is complicated, but I do know that I don't
understand it. My best guess is that he wants to limit duplicate
characters in the text of the document, but wants to avoid mangling URLs.

So if someone writes:

<p>John is stupid!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!</p>

he wants to change this to

<p>John is stupid!!!!!!</p>

But something like

<img src="/images/img0000000123.jpg" title="Little Johnny and his dog">

should not be changed to

<img src="/images/img000000123.jpg" title="Little Johnny and his dog">

because that would invalidate the link.

But this is just a guess.

Assuming I am right, I would use HTML::Parser to parse the file and then
do those substitutions only in text nodes. This is probably most easily
done with a handler.
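
A minimal sketch of what such a handler setup might look like, using the version 3 API of HTML::Parser. The helper name squeeze_text_nodes and the $in_link counter are made up for illustration, and skipping the text inside <a> elements is an assumption based on the original question rather than something HTML::Parser does by itself.

```perl
use strict;
use warnings;
use HTML::Parser ();

# squeeze_text_nodes is a hypothetical helper name.
sub squeeze_text_nodes {
    my ($html) = @_;
    my $out     = '';
    my $in_link = 0;    # greater than 0 while inside <a>...</a>

    my $p = HTML::Parser->new(
        api_version => 3,
        # Comments, declarations etc. are copied through unchanged.
        default_h => [ sub { $out .= $_[0] }, 'text' ],
        start_h   => [ sub {
            my ($tag, $src) = @_;
            $in_link++ if $tag eq 'a';
            $out .= $src;              # tags are never modified
        }, 'tagname, text' ],
        end_h     => [ sub {
            my ($tag, $src) = @_;
            $in_link-- if $tag eq 'a' && $in_link;
            $out .= $src;
        }, 'tagname, text' ],
        text_h    => [ sub {
            my ($src) = @_;
            # Only text nodes outside links get the substitution.
            $src =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi unless $in_link;
            $out .= $src;
        }, 'text' ],
    );
    $p->unbroken_text(1);    # deliver each text node in one piece
    $p->parse($html);
    $p->eof;
    return $out;
}

print squeeze_text_nodes(
    '<p>John is stupid!!!!!!!!!!!!</p>'
  . '<img src="/images/img0000000123.jpg" title="Little Johnny and his dog">'
), "\n";
```

Note that the contents of <script> and <style> elements also arrive through the text handler, so a real version would probably want to skip those as well.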

hp
 
