Regex question, limit repeats UNLESS within specified tags

Discussion in 'Perl Misc' started by Jason C, Nov 2, 2012.

  1. Jason C

    Jason C Guest

    I'm currently limiting repeated characters like so:

    $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;

    I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.

    I'm guessing that this would be done with negative lookbehind, like this:

    # Note, these aren't tested, just here for the explanation
    $text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
    $text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;

    Neither of these is going to be perfect, though, because:

    1. in the first one, I need to test for both an opening <img and an ending >; otherwise, I think it would not catch something like "<img src='aaa.jpg'> bbbbbbbbbb" (since the repeated "b" comes after "<img").

    2. in the second one, I also need to test for the ending >, but also for the closing </a>. Even if I fixed the ending >, I could still end up with a confusing "<a href='http://www.aaaaaaaaaa.com'>http://www.aaaaaa.com</a>"
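
    To illustrate point 1, here is a minimal sketch (not part of the original post; the sample tag and variable names are made up for the demo). It suggests that the lookbehind only checks the four characters directly in front of the repeated run, so a run further inside an <img> tag is still shortened:

```perl
use strict;
use warnings;

# The first untested pattern from the post, compiled once.
my $limit = qr/(?<!<img)(.)\1{6,}/si;

# Hypothetical sample tag with seven zeros in the file name.
my $tag = q{<img src="/images/img0000000123.jpg">};

# (?<!<img) only looks at the 4 characters right before the run.
# Here those are "/img", not "<img", so the run is still collapsed.
(my $mangled = $tag) =~ s/$limit/$1$1$1$1$1$1/g;

print "before: $tag\n";
print "after:  $mangled\n";
```

    The "after" line loses one zero from the file name, which is exactly the kind of broken link the lookbehind was meant to prevent.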


    Any suggestions on how to do either of these better? TIA,

    Jason
     
    Jason C, Nov 2, 2012
    #1

  2. Justin C

    Justin C Guest

    On 2012-11-02, Jason C <> wrote:
    > I'm currently limiting repeated characters like so:
    >
    > $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
    >
    > I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.
    >
    > I'm guessing that this would be done with negative lookbehind, like this:
    >
    > # Note, these aren't tested, just here for the explanation
    > $text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
    > $text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;



    Found in /usr/share/perl/5.10/pod/perlfaq6.pod
    How do I match XML, HTML, or other nasty, ugly things with a regex?
    (contributed by brian d foy)

    If you just want to get work done, use a module and forget about the
    regular expressions. The "XML::Parser" and "HTML::Parser" modules are
    good starts, although each namespace has other parsing modules
    specialized for certain tasks and different ways of doing it. Start at
    CPAN Search ( http://search.cpan.org ) and wonder at all the work
    people have done for you already! :)

    Use the modules and use your regex on what's left, don't try to
    write REs for HTML, life is too short.


    Justin.

    --
    Justin C, by the sea.
     
    Justin C, Nov 2, 2012
    #2

  3. Jason C

    Jason C Guest

    On Friday, November 2, 2012 6:08:03 AM UTC-4, Justin C wrote:
    > On 2012-11-02, Jason C <> wrote:
    > > I'm currently limiting repeated characters like so:
    > >
    > > $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
    > >
    > > I'm wanting to modify it to only limit repeated characters if they're not within <img...> or <a href=...></a> tags.
    > >
    > > I'm guessing that this would be done with negative lookbehind, like this:
    > >
    > > # Note, these aren't tested, just here for the explanation
    > > $text =~ s#(?<!<img)(.)\1{6,}#$1$1$1$1$1$1#gsi;
    > > $text =~ s#(?<!<a href)(.)\1{6,}#$1$1$1$1$1$1#gsi;
    >
    > Found in /usr/share/perl/5.10/pod/perlfaq6.pod
    > How do I match XML, HTML, or other nasty, ugly things with a regex?
    > (contributed by brian d foy)
    >
    > If you just want to get work done, use a module and forget about the
    > regular expressions. The "XML::Parser" and "HTML::Parser" modules are
    > good starts, although each namespace has other parsing modules
    > specialized for certain tasks and different ways of doing it. Start at
    > CPAN Search ( http://search.cpan.org ) and wonder at all the work
    > people have done for you already! :)
    >
    > Use the modules and use your regex on what's left, don't try to
    > write REs for HTML, life is too short.
    >
    > Justin.
    >
    > --
    > Justin C, by the sea.

    I've used HTML::Parser at length, but I don't think that it offers anything like what I'm needing. I looked through CPAN, and didn't find anything like this.

    I might have made the OP seem too complicated. What I really need to figure out is how to run a regex where both the look-behind AND look-ahead match.

    Something like this, I guess:

    # Not tested
    while (($text !~ /<img[^>]*?>/gi) &&
           ($text !~ /<a href[^>]*?>/gi)) {
        $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
    }

    Or maybe two separate loops, like this:

    while ($text !~ /<img[^>]*?>/gi) {
        $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
    }

    while ($text !~ /<a href([^>]*?)>(.*?)<\/a>/gi) {
        $pattern = $repl = $1;

        $pattern = quotemeta($pattern);
        $repl =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;

        $text =~ s#$pattern#$repl#gsi;
    }

    Thoughts?
     
    Jason C, Nov 2, 2012
    #3
  4. Peter J. Holzer

    On 2012-11-02 21:11, Eli the Bearded <*@eli.users.panix.com> wrote:
    > In comp.lang.perl.misc, Jason C <> wrote:
    >> On Friday, November 2, 2012 6:08:03 AM UTC-4, Justin C wrote:
    >>> Found in /usr/share/perl/5.10/pod/perlfaq6.pod
    >>> How do I match XML, HTML, or other nasty, ugly things with a regex?
    >>> (contributed by brian d foy)
    >>> If you just want to get work done, use a module and forget about the
    >>> regular expressions. The "XML::Parser" and "HTML::Parser" modules
    >>
    >> I've used HTML::Parser at length, but I don't think that it offers anything
    >> like what I'm needing. I looked through CPAN, and didn't find anything like
    >> this.
    >
    > Your use case is exotic. You will not find exactly what you need off the
    > shelf. You will find ways to break a document up into <IMG>, <A>, and
    > neither of those when you use a parsing module. Thus broken up, you can
    > then do your substring regexp.


    Agreed.
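
    As a rough illustration of the "break it up, then substitute" idea, here is a minimal sketch in core Perl. Note that it swaps the parsing module for a plain split() on a regex, so it is only an approximation: it assumes tags never contain a literal ">" inside attribute values and that <a> elements are not nested. The function name is made up for the example.

```perl
use strict;
use warnings;

sub limit_repeats_outside_tags {
    my ($text) = @_;

    # Split into protected chunks (whole <a ...>...</a> elements and any
    # other single tag such as <img ...>) and the plain text between them.
    # Because the pattern is wrapped in a capturing group, split() also
    # returns the protected chunks, at the odd indexes of the list.
    my @chunks = split m{(<a\b[^>]*>.*?</a\s*>|<[^>]*>)}is, $text;

    for my $i (0 .. $#chunks) {
        next if $i % 2;    # odd index: a tag or a whole link, leave as-is
        $chunks[$i] =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi;
    }
    return join '', @chunks;
}

print limit_repeats_outside_tags('<p>John is stupid!!!!!!!!!!</p>'), "\n";
# prints: <p>John is stupid!!!!!!</p>
```

    Treating the whole <a ...>...</a> element as one protected chunk is a design choice made here so that the link text keeps matching the href; drop that alternative from the pattern if link text should be limited too.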

    >> I might have made the OP seem too complicated. What I really need to figure
    >> out is how to run a regex where both the look-behind AND look-ahead match.
    >
    > No, I don't think you made it seem "too complicated", it *is* too
    > complicated.


    I don't know whether it is complicated but I do know that I don't
    understand it. My best guess is that he wants to limit duplicate
    characters in the text of a document, but wants to avoid mangling URLs.

    So if someone writes:

    <p>John is stupid!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!</p>

    he wants to change this to

    <p>John is stupid!!!!!!</p>

    But something like

    <img src="/images/img0000000123.jpg" title="Little Johnny and his dog">

    should not be changed to

    <img src="/images/img000000123.jpg" title="Little Johnny and his dog">

    because that would invalidate the link.

    But this is just a guess.

    Assuming I am right, I would use HTML::Parser to parse the file and then
    do those substitutions only in text nodes. This is probably most easily
    done with a handler.
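
    A minimal sketch of that handler approach, assuming the CPAN module HTML::Parser is installed and using its version 3 handler API. The function name and the choice to skip text inside <a>...</a> are assumptions made for this example, not something stated in the thread; text inside <script> or <style> is not treated specially here.

```perl
use strict;
use warnings;
use HTML::Parser ();

sub limit_repeats_in_text_nodes {
    my ($html) = @_;
    my $out     = '';
    my $in_link = 0;    # number of currently open <a> elements

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Tags (<img ...>, <a href=...>, ...) are copied through verbatim.
        start_h => [ sub {
            my ($tagname, $text) = @_;
            $in_link++ if $tagname eq 'a';
            $out .= $text;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tagname, $text) = @_;
            $in_link-- if $tagname eq 'a' && $in_link > 0;
            $out .= $text;
        }, 'tagname, text' ],
        # Only text nodes outside links get the repeat limit applied.
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s#(.)\1{6,}#$1$1$1$1$1$1#gsi unless $in_link;
            $out .= $text;
        }, 'text' ],
        # Comments, declarations and so on are passed through unchanged.
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    );
    $parser->unbroken_text(1);    # deliver each text node in one piece
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print limit_repeats_in_text_nodes('<p>John is stupid!!!!!!!!!!</p>'), "\n";
```

    Because the tag handlers append the original source text of each tag, attribute values such as src and href never pass through the substitution.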

    hp



    --
    Peter J. Holzer | Sysadmin WSR | http://www.hjp.at/
    "The curse of electronic word processing: you keep filing away at
    your text until the parts of the sentence no longer fits together."
    -- Ralph Babel
     
    Peter J. Holzer, Nov 3, 2012
    #4