RegEx issue

Discussion in 'Perl Misc' started by Dan, Jul 29, 2004.

  1. Dan

    OK, I have a perl script that reads in html files and makes some link
    replacements. Everything works OK except it changes something it
    shouldn't. Here is my line of code:

    @getstaf2[0] =~ s!href=\"([^(/)]+)[\.]+([^@?]+)\"!href=\"_miscfiles\/$1\.$2\"!ig;

    This code rewrites a link of the form <a href="whatever.xxx"> to <a
    href="_miscfiles/whatever.xxx">.

    Now that works fine, but it also changes things it shouldn't,
    namely instances of <a href="mailto:"> to <a
    href="_miscfiles/">.

    Interestingly, if I have two or more mailto references on a page, it
    will nicely not touch the first, but will change the second. More
    interestingly, if I take out the global modifier 'g' from the end of
    the regex, things work fine for the emails (it doesn't touch them), but
    then the actual whatever.xxx replacements don't get done.

    So I don't understand why it would (a) leave one alone but not the
    other since the 'g' should make it do the same for all instances, or
    (b) touch the email references at all. The [^?@] atom should make sure
    it skips over any email address that happens to be of the form
    .

    Any help is greatly appreciated!! I've been trying to get this solved
    for days!

    Thanks,

    Dan
    Dan, Jul 29, 2004
    #1

  2. Paul Lalli

    On Thu, 29 Jul 2004, Dan wrote:

    > OK, I have a perl script that reads in html files and makes some link
    > replacements. Everything works OK except it changes something it
    > shouldn't. Here is my line of code:
    >
    > @getstaf2[0] =~ s!href=\"([^(/)]+)[\.]+([^@?]+)\"!href=\"_miscfiles\/$1\.$2\"!ig;
    >
    > This code replaces a file of the form <a href="whatever.xxx"> to <a
    > href="_miscfiles/whatever.xxx">.
    >
    > Now that works fine, but it seems to change things it shouldn't be,
    > namely instances of <a href="mailto:"> to <a
    > href="_miscfiles/">.
    >
    > Interestingly, if I have two or more mailto references on a page, it
    > will nicely not touch the first, but will change the second. More
    > interestingly, if I take out the global parameter 'g' from the end of
    > the regex, things work fine for the emails (it doesn't touch them), but
    > then the actual whatever.xxx replacements don't get done.
    >
    > So I don't understand why it would (a) leave one alone but not the
    > other since the 'g' should make it do the same for all instances, or


    My guess - one is contained on a single line, another spans multiple
    lines, and your methodology is reading the HTML file line by line.
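
    If line-by-line reading is indeed the cause, a minimal sketch of
    reading the whole document into one string first could look like the
    following. The helper name slurp, the sample markup, and the in-memory
    filehandle standing in for a real file are all assumptions made for
    illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return everything left on a filehandle as one string.
sub slurp {
    my ($fh) = @_;
    local $/;               # no input record separator: read to end of file
    my $content = <$fh>;
    return $content;
}

# An in-memory filehandle stands in for a real HTML file here.
my $source = qq{<a\n   href="page.html">x</a>\n};
open my $fh, '<', \$source or die "open failed: $!";
my $html = slurp($fh);
close $fh;

# A tag that is split across two lines is now visible to a single match.
print "tag found across lines\n" if $html =~ /<a\s+href="page\.html">/;
```

    With the whole file in one scalar, a tag broken across lines is seen
    by the same match as a tag that sits on a single line.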

    > (b) touch the email references at all. The [^?@] atom should make sure
    > it skips over any email address that happen to be of the form
    > .


    That's not what's in the regexp above. What's in the regexp above is
    [^@?] which is looking for any pattern that doesn't match the @?
    variable. @ needs to be escaped in regexps, because they undergo
    double-quotish interpolation.

    > Any help is greatly appreciated!! I've been trying to get this solved
    > for days!


    The canonical answer to this question is: Don't parse HTML with RegExps!
    Use one of the plethora of modules available on CPAN.

    Paul Lalli
    Paul Lalli, Jul 29, 2004
    #2

  3. Dan writes:

    > OK, I have a perl script that reads in html files and makes some link
    > replacements. Everything works OK except it changes something it
    > shouldn't. Here is my line of code:
    >
    > @getstaf2[0] =~ s!href=\"([^(/)]+)[\.]+([^@?]+)\"!href=\"_miscfiles\/$1\.$2\"!ig;
    >
    > This code replaces a file of the form <a href="whatever.xxx"> to <a
    > href="_miscfiles/whatever.xxx">.
    >
    > Now that works fine, but it seems to change things it shouldn't be,
    > namely instances of <a href="mailto:"> to <a
    > href="_miscfiles/">.


    Define "it shouldn't be". That target matches your regex.

    > Interestingly, if I have two or more mailto references on a page, it
    > will nicely not touch the first, but will change the second.


    Actually that's probably not what's happening. Note that the regex
    [^(/)]+ can match quote characters and angle brackets, so it can run
    right out of one tag and into another.
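
    A small sketch makes the run-on visible. The input line is made up
    (the address and file name are hypothetical), the sub name
    capture_groups is invented for the demonstration, and the @ in the
    pattern is escaped only to keep interpolation out of the picture.

```perl
use strict;
use warnings;

# Hypothetical input: a mailto link followed by a file link on one line.
my $html = '<a href="mailto:joe@example.com">Joe</a> <a href="page.html">x</a>';

# Same pattern as in the original post, apart from the escaped @.
sub capture_groups {
    my ($text) = @_;
    return $text =~ m!href="([^(/)]+)[.]+([^\@?]+)"!i ? ($1, $2) : ();
}

my ($first, $second) = capture_groups($html);
print "\$1 = $first\n";     # $1 = mailto:joe@example
print "\$2 = $second\n";    # $2 = com">Joe</a> <a href="page.html
```

    The first group swallows the @, and the second group runs past the
    closing quote of the mailto link into the next tag, so the
    substitution prefixes the mailto link rather than page.html.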

    > So I don't understand why it would (a) leave one alone but not the
    > other since the 'g' should make it do the same for all instances, or
    > (b) touch the email references at all. The [^?@] atom should make sure
    > it skips over any email address that happen to be of the form
    > .


    It does not prevent the @ being matched by the [^(/)].

    > Any help is greatly appreciated!! I've been trying to get this solved
    > for days!


    There is a reason we keep telling everyone who comes here trying to
    parse HTML using simple regex[1] not to do that[2].

    Can you guess what that reason is?

    [1] Typically at least a couple a week.

    [2] And use an HTML parsing module instead.

    --
    \\ ( )
    . _\\__[oo
    .__/ \\ /\@
    . l___\\
    # ll l\\
    ###LL LL\\
    Brian McCauley, Jul 29, 2004
    #3
  4. Paul Lalli wrote:
    > On Thu, 29 Jul 2004, Dan wrote:
    >> (b) touch the email references at all. The [^?@] atom should make
    >> sure it skips over any email address that happen to be of the
    >> form .

    >
    > That's not what's in the regexp above. What's in the regexp above
    > is [^@?]


    That's the same character class.

    > which is looking for any pattern that doesn't match the @?
    > variable. @ needs to be escaped in regexps, because they undergo
    > double-quotish interpolation.


    That's not true when defining a character class, is it?

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Jul 29, 2004
    #4
  5. Paul Lalli

    On Fri, 30 Jul 2004, Gunnar Hjalmarsson wrote:

    > Paul Lalli wrote:
    > > On Thu, 29 Jul 2004, Dan wrote:
    > >> (b) touch the email references at all. The [^?@] atom should make
    > >> sure it skips over any email address that happen to be of the
    > >> form .

    > >
    > > That's not what's in the regexp above. What's in the regexp above
    > > is [^@?]

    >
    > That's the same character class.


    It would seem not.

    > > which is looking for any pattern that doesn't match the @?
    > > variable. @ needs to be escaped in regexps, because they undergo
    > > double-quotish interpolation.

    >
    > That's not true when defining a character class, is it?


    It would seem it is.

    #!/usr/bin/perl
    @f = qw/a-z/;
    print "letters\n" if 'abc' =~ /[@f]/;
    print "numbers\n" if '123' =~ /[@f]/;

    __END__
    letters



    Paul Lalli
    Paul Lalli, Jul 30, 2004
    #5
  6. Paul Lalli wrote:
    > Gunnar Hjalmarsson wrote:
    >> Paul Lalli wrote:
    >>> On Thu, 29 Jul 2004, Dan wrote:
    >>>> (b) touch the email references at all. The [^?@] atom should
    >>>> make sure it skips over any email address that happen to be
    >>>> of the form .
    >>>
    >>> That's not what's in the regexp above. What's in the regexp
    >>> above is [^@?]

    >>
    >> That's the same character class.

    >
    > It would seem not.
    >
    >>> which is looking for any pattern that doesn't match the @?
    >>> variable. @ needs to be escaped in regexps, because they
    >>> undergo double-quotish interpolation.

    >>
    >> That's not true when defining a character class, is it?

    >
    > It would seem it is.
    >
    > #!/usr/bin/perl
    > @f = qw/a-z/;
    > print "letters\n" if 'abc' =~ /[@f]/;
    > print "numbers\n" if '123' =~ /[@f]/;
    >
    > __END__
    > letters


    Hmm... It would seem I stand corrected. :)

    Nevertheless, before posting I did something like this:

    print "No match\n" unless 'abc@def' =~ /^[^@?]+$/;
    print "Match\n" if 'abcdef' =~ /^[^@?]+$/;

    Outputs:
    No match
    Match

    So the case seems not to be *that* obvious...

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Jul 30, 2004
    #6
  7. Gunnar Hjalmarsson wrote:
    >Paul Lalli wrote:
    >> Gunnar Hjalmarsson wrote:
    >>> Paul Lalli wrote:
    >>>> On Thu, 29 Jul 2004, Dan wrote:
    >>>>> (b) touch the email references at all. The [^?@] atom should
    >>>>> make sure it skips over any email address that happen to be
    >>>>> of the form .
    >>>>
    >>>> That's not what's in the regexp above. What's in the regexp
    >>>> above is [^@?]
    >>>
    >>> That's the same character class.

    >>
    >> It would seem not.
    >>
    >>>> which is looking for any pattern that doesn't match the @?
    >>>> variable. @ needs to be escaped in regexps, because they
    >>>> undergo double-quotish interpolation.
    >>>
    >>> That's not true when defining a character class, is it?

    >>
    >> It would seem it is.
    >>
    >> #!/usr/bin/perl
    >> @f = qw/a-z/;
    >> print "letters\n" if 'abc' =~ /[@f]/;
    >> print "numbers\n" if '123' =~ /[@f]/;
    >>
    >> __END__
    >> letters

    >
    >Hmm... It would seem I stand corrected. :)
    >
    >Nevertheless, before posting I did something like this:
    >
    > print "No match\n" unless 'abc@def' =~ /^[^@?]+$/;
    > print "Match\n" if 'abcdef' =~ /^[^@?]+$/;
    >
    >Outputs:
    >No match
    >Match
    >
    >So the case seems not to be *that* obvious...
    >


    Looks like you're right...

    perl -MO=Deparse -wle '/[@?]/'
    /[\@?]/;

    perl -MO=Deparse -wle '/[ab@]/'
    /[ab\@]/;

    perl -MO=Deparse -wle '/[@m]/'
    Possible unintended interpolation of @m in string at -e line 1.
    Name "main::m" used only once: possible typo at -e line 1.
    /[@m]/;


    --
    Charles DeRykus
    Charles DeRykus, Jul 30, 2004
    #7
  8. Dan

    Thanks for the help. I still wasn't able to get that code working,
    however.
    I am interested in using one of the HTML parsers on CPAN, but they all
    seem somewhat confusing to me. I can't seem to figure out how they
    operate and how I might use them to extract and manipulate links from
    some HTML stored in a string. If anyone knows any tutorials or
    dumbed-down examples around the web, I'd very much appreciate a link!

    Dan
    Dan, Aug 4, 2004
    #8
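
    As one possible starting point for the kind of example asked for
    above, here is a minimal sketch using HTML::TokeParser (part of the
    HTML-Parser distribution on CPAN) together with HTML::Entities. The
    sub name prefix_local_links is hypothetical, and the rule for what
    counts as a link to rewrite (an href with a dot and without a slash
    or colon) is an assumption about the intent of the original regex,
    not something stated in the thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(encode_entities);

# Hypothetical helper: prefix plain file links in $html with $prefix.
sub prefix_local_links {
    my ($html, $prefix) = @_;
    my $parser = HTML::TokeParser->new(\$html) or die "cannot parse input";
    my $out = '';

    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {                       # start tag
            my ($tag, $attr, $attrseq, $text) = @{$token}[1 .. 4];
            my $href = $attr->{href};
            # Assumed rule: has a dot, but no slash and no colon (so no mailto:).
            if ($tag eq 'a' && defined $href && $href =~ /\./ && $href !~ m{[/:]}) {
                $attr->{href} = $prefix . $href;
                # Rebuild the tag, keeping the original attribute order.
                $text = "<$tag"
                      . join('', map { qq{ $_="} . encode_entities($attr->{$_}, '<>&"') . '"' } @$attrseq)
                      . '>';
            }
            $out .= $text;
        }
        elsif ($type eq 'E' || $type eq 'PI') {   # end tag, processing instruction
            $out .= $token->[2];
        }
        else {                                    # text, comment, declaration
            $out .= $token->[1];
        }
    }
    return $out;
}

# Hypothetical input, with a made-up address and file name.
my $page = '<a href="mailto:joe@example.com">Joe</a> <a href="page.html">x</a>';
print prefix_local_links($page, '_miscfiles/'), "\n";
```

    Everything that is not a matching start tag is copied through
    unchanged, so only the href values that pass the test are touched.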
  9. Gunnar Hjalmarsson writes:

    > Charles DeRykus wrote:
    > > Gunnar Hjalmarsson wrote:
    > >> print "No match\n" unless 'abc@def' =~ /^[^@?]+$/;
    > >> print "Match\n" if 'abcdef' =~ /^[^@?]+$/;
    > >> Outputs:
    > >> No match
    > >> Match

    > > Looks like you're right...

    >
    > I seem to be right about /[^@?]/, but I apparently jumped to conclusions.
    >
    > > perl -MO=Deparse -wle '/[@?]/'
    > > /[\@?]/;
    > > perl -MO=Deparse -wle '/[ab@]/'
    > > /[ab\@]/;
    > > perl -MO=Deparse -wle '/[@m]/'
    > > Possible unintended interpolation of @m in string at -e line 1.
    > > Name "main::m" used only once: possible typo at -e line 1.
    > > /[@m]/;

    >
    > Those warnings are displayed if strictures are not enabled and you
    > haven't declared the @m variable.
    >
    > So, I'm a little confused. The lesson here is that @ gets interpolated
    > in regexes sometimes. Maybe a good enough reason to always escape that
    > character, but a less ambiguous conclusion would be nice. :)


    I always escape @ that I don't want to interpolate in an interpolative
    context.
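
    A short sketch of the difference, assuming an array @f like the one
    in the test script earlier in the thread (the two sub names are made
    up for the demonstration):

```perl
use strict;
use warnings;

my @f = ('a-z');    # hypothetical array, as in the earlier test script

# Unescaped: @f is interpolated, so the class becomes [a-z].
sub unescaped_matches { my ($s) = @_; return $s =~ /^[@f]+$/ ? 1 : 0 }

# Escaped: the class holds just a literal '@' and a literal 'f'.
sub escaped_matches { my ($s) = @_; return $s =~ /^[\@f]+$/ ? 1 : 0 }

print "unescaped vs 'abc': ", unescaped_matches('abc'), "\n";   # 1
print "escaped   vs 'abc': ", escaped_matches('abc'),   "\n";   # 0
print "escaped   vs '\@f':  ", escaped_matches('@f'),    "\n";   # 1
```

    With the backslash in place the pattern means the same thing whether
    or not an array of that name happens to exist.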

    I cannot find a full explanation of exactly when an unescaped @ in an
    interpolative context will be treated as literal, even in the "Gory
    details of parsing quoted constructs".

    Interpolating arrays into regex doesn't make a lot of sense. The only
    time I see it used is when using the @{[...]} construct.

    --
    \\ ( )
    . _\\__[oo
    .__/ \\ /\@
    . l___\\
    # ll l\\
    ###LL LL\\
    Brian McCauley, Aug 4, 2004
    #9