Regex question, limit repeats UNLESS within specified tags

Jason C · Nov 2, 2012

I'm currently limiting repeated characters like so:

$text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;

I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.

I'm guessing that this would be done with negative lookbehind, like this:

# Note, these aren't tested, just here for the explanation
$text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
$text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;

Neither of these are going to be perfect, though, because:

1. in the first one, I need to test for both an opening <img and an ending >; otherwise, I think it would not catch something like "<img src='aaa.jpg'> bbbbbbbbbb" (since the repeated "b" comes after "<img").

2. in the second one, I also need to test for the ending >, but also for the closing </a>. Even if I fixed the ending >, I could still end up with a confusing "<a href='http://www.aaaaaaaaaa.com'>http://www.aaaaaa.com</a>"
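
A quick check of the first attempt, as a minimal sketch: (?<!<img) only tests the four characters immediately before the run, so it cannot tell whether the run sits somewhere inside an <img ...> tag. The sample string is made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample: one repeated run inside the tag and another one after it.
my $text = q{<img src='aaaaaaaaaa.jpg'> bbbbbbbbbb};

# The lookbehind attempt from above, applied to a copy of the sample.
(my $out = $text) =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;

# Both runs are shortened: the "a" run is preceded by "src='", not by
# "<img", so the lookbehind never blocks it.
print "$out\n";    # prints: <img src='aaaaaa.jpg'> bbbbbb
```

So the lookbehind protects almost nothing here; it would only skip a run that starts directly after the literal text "<img".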

Any suggestions on how to do either of these better? TIA,

Jason

Justin C · Nov 2, 2012

Found in /usr/share/perl/5.10/pod/perlfaq6.pod
How do I match XML, HTML, or other nasty, ugly things with a regex?
(contributed by brian d foy)

If you just want to get work done, use a module and forget about the
regular expressions. The "XML::Parser" and "HTML::Parser" modules are
good starts, although each namespace has other parsing modules
specialized for certain tasks and different ways of doing it. Start at
CPAN Search ( http://search.cpan.org ) and wonder at all the work
people have done for you already!

Use the modules and use your regex on what's left. Don't try to
write REs for HTML; life is too short.

Justin.

Jason C · Nov 2, 2012

I've used HTML::Parser at length, but I don't think that it offers anything like what I'm needing. I looked through CPAN, and didn't find anything like this.

I might have made the OP seem too complicated. What I really need to figure out is how to run a regex where both the look-behind AND look-ahead match.

Something like this, I guess:

# Not tested
while (($text !~ /<img[^>]*?>/gi) &&
       ($text !~ /<a href[^>]*?>/gi)) {
    $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
}

Or maybe two separate loops, like this:

while ($text !~ /<img[^>]*?>/gi) {
    $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
}

while ($text !~ /<a href([^>]*?)>(.*?)<\/a>/gi) {
    $pattern = $repl = $1;

    $pattern = quotemeta($pattern);
    $repl =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;

    $text =~ s#$pattern#$repl#gsi;
}

Thoughts?
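
One regex-only way to get the "both sides must match" effect without lookarounds is to match the things to skip first and put them back unchanged. The sketch below assumes tags never contain a literal ">" inside an attribute value and that <a> elements are not nested; limit_repeats is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: shorten runs of 7 or more identical characters to 6,
# except inside tags and inside whole <a ...>...</a> elements.
sub limit_repeats {
    my ($text) = @_;
    $text =~ s{
        ( <a\b[^>]*> .*? </a>   # a whole anchor element, link text included
        | <[^>]*> )             # or any other single tag (both are group 1)
      |
        (.) \2{6,}              # a run of seven or more identical characters
    }{
        defined($1) ? $1 : $2 x 6
    }gsixe;
    # Replacement: tags and anchors go back untouched, runs become 6 copies.
    return $text;
}

print limit_repeats(q{<img src='aaaaaaaaaa.jpg'> bbbbbbbbbb}), "\n";
# prints: <img src='aaaaaaaaaa.jpg'> bbbbbb
```

Because the tag alternative is tried first at every position, a run can only be matched once the regex is already outside a tag. This is still pattern matching on HTML, so it will misfire on input that breaks the assumptions above.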

Peter J. Holzer · Nov 3, 2012

Your use case is exotic. You will not find exactly what you need off the
shelf. You will find ways to break a document up into <IMG>, <A>, and
neither of those when you use a parsing module. Thus broken up, you can
then do your substring regexp.
Agreed.

No, I don't think you made it seem "too complicated", it *is* too
complicated.

I don't know whether it is complicated but I do know that I don't
understand it. My best guess is that he wants to limit duplicate
characters in the text of the document, but wants to avoid mangling URLs.

So if someone writes:

<p>John is stupid!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!</p>

he wants to change this to

<p>John is stupid!!!!!!</p>

But something like

<img src="/images/img0000000123.jpg" title="Little Johnny and his dog">

should not be changed to

<img src="/images/img000000123.jpg" title="Little Johnny and his dog">

because that would invalidate the link.

But this is just a guess.

Assuming I am right, I would use HTML::Parser to parse the file and then
do those substitutions only in text nodes. This is probably most easily
done with a handler.
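
A minimal sketch of that handler approach, assuming the HTML::Parser module from CPAN is installed (it is not part of core Perl). The helper name limit_repeats_in_text and the choice to also leave link text inside <a>...</a> alone are assumptions made for illustration, following the original question.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: copy the document through unchanged, except that runs
# of 7+ identical characters in text nodes outside <a>...</a> are cut to 6.
sub limit_repeats_in_text {
    my ($html) = @_;
    my $out  = '';
    my $in_a = 0;    # how many <a> elements we are currently inside

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { my ($tag, $raw) = @_;
                           $in_a++ if $tag eq 'a';
                           $out .= $raw }, 'tagname, text' ],
        end_h   => [ sub { my ($tag, $raw) = @_;
                           $in_a-- if $tag eq 'a' && $in_a > 0;
                           $out .= $raw }, 'tagname, text' ],
        text_h  => [ sub { my ($raw) = @_;
                           $raw =~ s/(.)\1{6,}/$1 x 6/gsie unless $in_a;
                           $out .= $raw }, 'text' ],
        # Comments, declarations and anything else pass through as-is.
        default_h => [ sub { $out .= $_[0] }, 'text' ],
    );
    $p->unbroken_text(1);    # deliver each text node in one piece
    $p->parse($html);
    $p->eof;
    return $out;
}

print limit_repeats_in_text('<p>John is stupid!!!!!!!!!!</p>'), "\n";
```

Attribute values such as src and href never reach the text handler, so the image URL in the example above would come through untouched.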

hp
