What's wrong with the following regular expression?

Discussion in 'Perl Misc' started by kun niu, Apr 18, 2009.

  1. kun niu

    kun niu Guest

    Dear all,

    I'm trying to help extract email addresses from a company's website.
    Here's part of my test script.

    $content = "<a href=\"mailto:\"><a class=\"hello\" href=
    \"mailto:?title=hello\">";
    @emails = ($content =~ /<a.*href="mailto:(.*)>"/cgim);
    foreach my $email (@emails)
    {
    print "email:" . $email . "\n";
    }
    But to my surprise, no result is printed.
    I'm working on Debian squeeze.
    My perl version is 5.10.0.
    Would anyone here please help me out?
    Thanks for any hints or advice in advance.
    kun niu, Apr 18, 2009
    #1

  2. kun niu <> wrote:


    > $content = "<a href=\"mailto:\"><a class=\"hello\" href=

    ^^^^^^^
    > \"mailto:?title=hello\">";



    You should always enable warnings when developing Perl code.


    > @emails = ($content =~ /<a.*href="mailto:(.*)>"/cgim);

                                                   ^^
                                                   ^^ these are transposed...

    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
    Tad J McClellan, Apr 18, 2009
    #2

  3. kun niu

    kun niu Guest

    On Apr 19, 4:09 am, Tad J McClellan <> wrote:
    > kun niu <> wrote:
    > > $content = "<a href=\"mailto:\"><a class=\"hello\"href=

    >
    > ^^^^^^^
    >
    > > \"mailto:?title=hello\">";

    >
    > You should always enable warnings when developing Perl code.
    >
    > > @emails = ($content =~ /<a.*href="mailto:(.*)>"/cgim);

    >
    > ^^
    > ^^ these are transposed...
    >
    > --
    > Tad McClellan
    > email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"


    Thank you for your reply.
    But I wonder how @emails = ($content =~ /<a.*href="mailto:(.*)>"/cgim); is transposed.
    I turned on warnings with "perl -w" and I don't see a warning here.
    kun niu, Apr 19, 2009
    #3
  4. kun niu <> wrote:
    > On Apr 19, 4:09 am, Tad J McClellan <> wrote:
    >> kun niu <> wrote:
    >> > $content = "<a href=\"mailto:\"><a class=\"hello\" href=

                                       ^^
                                       ^^

    The data has a quote followed by an angle bracket.


    >> > \"mailto:?title=hello\">";

    >>
    >> You should always enable warnings when developing Perl code.
    >>
    >> > @emails = ($content =~ /<a.*href="mailto:(.*)>"/cgim);

    >>
    >> ^^
    >> ^^ these are transposed...



    The word "transposed" means "order is reversed"...


    >> --
    >> Tad McClellan
    >> email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"



    It is bad manners to quote .sigs.


    > But I wonder how @emails = ($content =~ /<a.*href="mailto:(.*)>"/

                                                                    ^^
                                                                    ^^

    The pattern has an angle bracket followed by a quote.
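
    A minimal sketch of the fix described above, not the poster's own code. The addresses are hypothetical placeholders (the real ones were stripped from this archive), and the helper name extract_mailto is made up for illustration. The pattern puts the quote before the angle bracket, as in the data, and uses a bounded capture so each match stops at the closing quote.

```perl
use strict;
use warnings;

# Hypothetical sample data: the addresses are placeholders, not from the thread.
# Single quotes keep Perl from trying to interpolate "@example" as an array.
my $content = '<a href="mailto:alice@example.com"><a class="hello" href="mailto:bob@example.com?title=hello">';

# Quote first, then angle bracket, in the same order as the data.
# [^>]* stays inside one tag; [^"]* stops at the closing double quote.
sub extract_mailto {
    my ($html) = @_;
    return $html =~ /<a\b[^>]*\bhref="mailto:([^"]*)"/gi;
}

print "email:$_\n" for extract_mailto($content);
```

    With the greedy (.*) of the original pattern, a single match would run to the last possible closing delimiter in the string, so the bounded capture matters as much as the delimiter order.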


    > I turned on warnings with "perl -w" and I don't see a warning here.



    The warning (Possible unintended interpolation) was from the
    assignment line, not the pattern line.
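
    For illustration only, assuming a placeholder address: inside a double-quoted string Perl treats @name as an array to interpolate, which is what triggers that warning (and, as far as I know, a compile-time error under "use strict"). Escaping the at sign or using single quotes avoids it.

```perl
use strict;
use warnings;

# Placeholder address, assumed for illustration.
my $escaped = "mailto:alice\@example.com";   # \@ keeps a literal at sign
my $single  = 'mailto:alice@example.com';    # single quotes never interpolate

print "same\n" if $escaped eq $single;
```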


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
    Tad J McClellan, Apr 19, 2009
    #4
  5. kun niu

    kun niu Guest

    On Apr 19, 10:07 am, Tad J McClellan <> wrote:
    > kun niu <> wrote:
    > > On Apr 19, 4:09 am, Tad J McClellan <> wrote:
    > >> kun niu <> wrote:
    > >> > $content = "<a href=\"mailto:\"><a class=\"hello\" href=

    >
    > ^^
    > ^^
    >
    > The data has a quote followed by an angle bracket.
    >
    > >> > \"mailto:?title=hello\">";

    >
    > >> You should always enable warnings when developing Perl code.

    >
    > >> > @emails = ($content =~ /<a.*href="mailto:(.*)>"/cgim);

    >
    > >> ^^
    > >> ^^ these are transposed....

    >
    > The word "transposed" means "order is reversed"...
    >
    > >> --
    > >> Tad McClellan
    > >> email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

    >
    > It is bad manners to quote .sigs.
    >
    > > But I wonder how @emails = ($content =~ /<a.*href="mailto:(.*)>"/

    >
    > ^^
    > ^^
    >
    > The pattern has an angle bracket followed by a quote.
    >
    > > I turned on warnings with "perl -w" and I don't see a warning here.

    >
    > The warning (Possible unintended interpolation) was from the
    > assignment line, not the pattern line.
    >
    > --
    > Tad McClellan
    > email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"


    I got it. :)
    Sorry for my carelessness.
    And I really appreciate your reply.
    kun niu, Apr 19, 2009
    #5
  6. kun niu <> wrote:
    > On Apr 19, 10:07 am, Tad J McClellan <> wrote:
    >> kun niu <> wrote:
    >> > On Apr 19, 4:09 am, Tad J McClellan <> wrote:



    >> >> --
    >> >> Tad McClellan
    >> >> email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

    >>
    >> It is bad manners to quote .sigs.



    >> --
    >> Tad McClellan
    >> email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"



    OK. That's enough.

    Off to the killfile with you.


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
    Tad J McClellan, Apr 19, 2009
    #6
  7. kun niu

    Guest

    On Sat, 18 Apr 2009 09:01:57 -0700 (PDT), kun niu <> wrote:

    >Dear all,
    >
    >I'm trying to help extract email addresses from a company's website.
    >Here's part of my test script.
    >
    >$content = "<a href=\"mailto:\"><a class=\"hello\" href=
    >\"mailto:?title=hello\">";
    >@emails = ($content =~ /<a.*href="mailto:(.*)>"/cgim);
    >foreach my $email (@emails)
    >{
    > print "email:" . $email . "\n";
    >}
    >But to my surprise, no result is printed.
    >I'm working on Debian squeeze.
    >My perl version is 5.10.0.
    >Would anyone here please help me out?
    >Thanks for any hints or advice in advance.


    Below are better HTML/XML regular expressions for parsing <tag attrib/>,
    which is what you're interested in. They will get all the 'mailto:'s.
    Might as well do it right.
    The sample HTML is in the __DATA__ section, with the output below it.

    -sln

    ## Arxp.pl
    ##
    ## Simple html/xml regexp parser for just <tag attrib/>
    ## No entity conversions, no extras.
    ## Let me know if you want conversions or simple extras
    ## -sln 4/19/09
    ##

    my @UC_Nstart = (
        "\\x{C0}-\\x{D6}",
        "\\x{D8}-\\x{F6}",
        "\\x{F8}-\\x{2FF}",
        "\\x{370}-\\x{37D}",
        "\\x{37F}-\\x{1FFF}",
        "\\x{200C}-\\x{200D}",
        "\\x{2070}-\\x{218F}",
        "\\x{2C00}-\\x{2FEF}",
        "\\x{3001}-\\x{D7FF}",
        "\\x{F900}-\\x{FDCF}",
        "\\x{FDF0}-\\x{FFFD}",
        "\\x{10000}-\\x{EFFFF}",
    );
    my @UC_Nchar = (
        "\\x{B7}",
        "\\x{0300}-\\x{036F}",
        "\\x{203F}-\\x{2040}",
    );
    my $Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
    my $Nchar = "[-\\w:\\.".join ('',@UC_Nchar).join ('',@UC_Nstart)."]";
    my $Name = "(?:$Nstrt$Nchar*)";

    my $qRx = qr/<(?:(?:($Name)(\s+(?:(?:(?:".*?")|(?:'.*?'))|(?:[^>]*?))+)\s*\/?)|--.*?--)>/s;
    #            <(  (  1     12   (  (  (       )|(       ))|(        )) 2      )|       )>

    my $qRxAttr = qr/\G\s+(?:(?:($Name)\s*=\s*("|'|))|($Name))/;
    my $qRxAttr_DL1 = qr/\G(?:([^'&<]*?)|([^'<]*?))'/;
    my $qRxAttr_DL2 = qr/\G(?:([^"&<]*?)|([^"<]*?))"/;
    my $qRxAttr_DL3 = qr/\G([^"'=<\s]+)/;


    my $html = join '', <DATA>;

    while ($html =~ /$qRx/g)
    {
        ## <tag attrib/> or <tag attrib>
        ##
        if (defined $1) # && lc($1) eq 'a'
        {
            my %result = ();
            # get attributes
            $result = _getAttrARRAY ($2, 0, \%result);

            ## do checks
            if (length ($result->{'errstr'})) {
                # missing or extra token, hard error
                print "Error in tag attrib string here ->'$result->{'errstr'}'\n";
                next;
            }
            ## we will consider these acceptable html, not processed for this
            # length ($result->{'dupattrs'})
            # length ($result->{'badattrs'})
            # length ($result->{'noquoteattrs'})

            ## process (scrape) attribute array for 'mailto:'
            my %htmp = @{$result->{'attrsref'}};

            while (my ($atr,$val) = each %htmp)
            {
                push @emails, $1 if ($val =~ /mailto:(.+)/is);
            }
        }
    }
    print $_,"\n" for @emails;

    # -------------------------------

    sub _convertEntities { undef } # intentionally blank

    sub _getAttrARRAY
    {
        my ($attrstr, $conv_ent, $hresult) = @_;
        @{$hresult->{'attrsref'}} = ();
        $hresult->{'badattrs'} = '';
        $hresult->{'dupattrs'} = '';
        $hresult->{'noquoteattrs'} = '';
        $hresult->{'errstr'} = '';
        my %hseen = ();
        my $aref = $hresult->{'attrsref'};
        my ($alt_attval, $attval, $rx, $ndx, $DL3);
        # my $tmpstr = $attrstr;
        my $match = 0;

        while ($attrstr =~ /$qRxAttr/gc)
        {
            $match = 1;
            if (defined $2)
            {
                $ndx = push @{$aref},$1;
                $DL3 = 0;

                if ($2 eq "'") {
                    $rx = \$qRxAttr_DL1;
                }
                elsif ($2 eq '"') {
                    $rx = \$qRxAttr_DL2;
                } else {
                    # no quotes
                    $rx = \$qRxAttr_DL3;
                    $DL3 = 1;
                }
                if (++$hseen{$1} == 2) {
                    $hresult->{'dupattrs'} .= ", $1";
                    $hresult->{'dupattrs'} =~ s/^(?:, )+//;
                }
                if ($attrstr =~ /$$rx/gc) {
                    if (!$DL3)
                    {
                        ## normal quoted value
                        if (defined $1) {
                            push @{$aref},$1;
                            next;
                        }
                        $attval = $2;
                        if ($conv_ent && defined ($alt_attval = _convertEntities (\$attval))) {
                            push @{$aref},$$alt_attval;
                            next;
                        }
                        push @{$aref},$attval;
                        next;
                    }
                    ## bad attrib, value is not quoted
                    $attval = $1;
                    if ($conv_ent && defined ($alt_attval = _convertEntities (\$attval))) {
                        push @{$aref},$$alt_attval;
                    } else {
                        push @{$aref},$attval;
                    }
                    $hresult->{'noquoteattrs'} .= ", ".$aref->[$ndx-1];
                    $hresult->{'noquoteattrs'} =~ s/^(?:, )+//;
                    next;
                }
                ## bad value, it's either '<' or no ["'] closure
                $hresult->{'badattrs'} .= ", ".$aref->[$ndx-1];
                $hresult->{'badattrs'} =~ s/^(?:, )+//;
                push @{$aref},'UNDEF_ATTRVAL';
                # trim up to '<', otherwise it's reported as
                # improperly quoted or missing value
                $attrstr = substr ($attrstr, pos($attrstr));
                $attrstr =~ s/^[^<]+//;
            } else {
                ## attrib with no attrib value
                ## (standalone attribute only)
                $ndx = push @{$aref},$3;
                if (++$hseen{$3} == 2) {
                    $hresult->{'dupattrs'} .= ", $3";
                    $hresult->{'dupattrs'} =~ s/^(?:, )+//;
                }
                $hresult->{'badattrs'} .= ", ".$aref->[$ndx-1];
                $hresult->{'badattrs'} =~ s/^(?:, )+//;
                push @{$aref},'UNDEF_ATTRVAL';
                next;
            }
            # bad, return that part of string which is in error
            $hresult->{'errstr'} = $attrstr;
            return $hresult;
        }
        pos($attrstr) = 0 if (!$match);
        if (length($attrstr) > pos($attrstr)) {
            $attrstr = substr ($attrstr, pos($attrstr));
            $attrstr =~ s/^\s+//; $attrstr =~ s/\s+$//;
            # bad, return that part of string which is in error
            # print "-BAD-:$tmpstr\n";
            $hresult->{'errstr'} = $attrstr if (length($attrstr));
        }
        return $hresult;
    }

    __DATA__

    <-- Don't include <a href='mailto:'> me -->
    <a href='mailto:'>
    <a class="hello" href=
    "mailto:?title='&lt;hello&gt';">
    <tag scrape_me = "mailto:"
    />
    <oh nogood = "mailto:' >


    Output:

    ?title='&lt;hello&gt';
    , Apr 19, 2009
    #7
  8. kun niu

    Guest

    On Sun, 19 Apr 2009 12:24:56 -0700, wrote:

    >On Sat, 18 Apr 2009 09:01:57 -0700 (PDT), kun niu <> wrote:
    >
    >Below are better HTML/XML regular expressions for parsing <tag attrib/>,
    >which is what you're interested in. They will get all the 'mailto:'s.
    >Might as well do it right.
    >The sample HTML is in the __DATA__ section, with the output below it.
    >
    >


    Sorry bout that. Left out a couple of things in the chop.
    -sln

    ------------------------------------------------

    >## Arxp.pl
    >##
    >## Simple html/xml regexp parser for just <tag attrib/>

    [snip]

    use strict;
    use warnings;

    [snip]

    >my @UC_Nstart = (


    [snip]

    >while ($html =~ /$qRx/g)
    >{
    > ## <tag attrib/> or <tag attrib>
    > ##
    > if (defined $1) # && lc($1) eq 'a'
    > {
    > my %result = ();
    > # get attributes

    _getAttrARRAY ($2, 0, \%result);
    >
    > ## do checks

    if (length ($result{'errstr'})) {
    > # missing or extra token, hard error

    print "Error in tag attrib string here ->'$result{'errstr'}'\n";
    > next;
    > }
    > ## we will consider these acceptable html, not processed for this

    # length ($result{'dupattrs'})
    # length ($result{'badattrs'})
    # length ($result{'noquoteattrs'})
    >
    > ## process (scrape) attribute array for 'mailto:'

    my %htmp = @{$result{'attrsref'}};
    >
    > while (my ($atr,$val) = each %htmp)
    > {
    > push @emails, $1 if ($val =~ /mailto:(.+)/is);
    > }
    > }
    >}

    [snip]
    , Apr 19, 2009
    #8
  9. <> wrote:

    ><-- Don't include <a href='mailto:'> me -->



    Why not?

    I thought the point was to find <a> elements, not ignore them.


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
    Tad J McClellan, Apr 19, 2009
    #9
  10. kun niu

    Guest

    On Sun, 19 Apr 2009 17:26:45 -0500, Tad J McClellan <> wrote:

    > <> wrote:
    >
    >><-- Don't include <a href='mailto:'> me -->

    >
    >
    >Why not?
    >
    >I thought the point was to find <a> elements, not ignore them.

    ^^^
    You're close; it's like a password. You have a partial form of the answer.
    A: <tag attrib> or <tag attrib/>

    Why, it's deja vu all over again.

    <a ...> is not an element when it is inside a comment, nor inside CDATA.
    Comments inside CDATA and CDATA inside comments aren't markup either.
    Feel free to recursively parse data, be it attribute values, comments, CDATA or
    special data. Fortunately, you don't have to reparse content. Do you want me to
    tack on CDATA and specials and whip up a little recursion? How about a little
    entity replacement?

    I guess 'mailto:...' doesn't need HTML/XML parsing because it's not HTML.
    But there is always /mailto:(.+)/sgi
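
    To make the trade-off concrete, here is a rough sketch under stated assumptions, not the poster's parser. The helper name strip_non_markup and the sample addresses are hypothetical. It naively removes well-formed comments and CDATA sections before scanning, and it uses a bounded capture; a bare /mailto:(.+)/sgi would, because of /s and the greedy .+, capture everything from the first "mailto:" to the end of the input.

```perl
use strict;
use warnings;

# Naive pre-pass: remove well-formed comments and CDATA sections so that
# anything inside them is not mistaken for markup. It does not handle
# comments nested in CDATA (or the reverse) correctly in every order.
sub strip_non_markup {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;
    $html =~ s/<!\[CDATA\[.*?\]\]>//gs;
    return $html;
}

# Hypothetical input; both addresses are placeholders.
my $html  = q{<!-- <a href="mailto:hidden@example.com"> --><a href="mailto:shown@example.com">};
my $clean = strip_non_markup($html);

# Bounded capture: stop at the closing double quote.
my @found = $clean =~ /href="mailto:([^"]*)"/gi;
print "$_\n" for @found;
```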

    Think you're smart, don't ya?
    A: (recurse reply)

    -sln
    , Apr 20, 2009
    #10
  11. kun niu

    Guest

    On Sun, 19 Apr 2009 16:37:29 -0700, wrote:

    >On Sun, 19 Apr 2009 17:26:45 -0500, Tad J McClellan <> wrote:
    >
    >> <> wrote:
    >>
    >>><-- Don't include <a href='mailto:'> me -->

    >>
    >>
    >>Why not?
    >>
    >>I thought the point was to find <a> elements, not ignore them.

    > ^^^
    >You're close; it's like a password. You have a partial form of the answer.
    >A: <tag attrib> or <tag attrib/>
    >
    >Why, it's deja vu all over again.
    >
    ><a ...> is not an element when it is inside a comment, nor inside CDATA.
    >Comments inside CDATA and CDATA inside comments aren't markup either.
    >Feel free to recursively parse data, be it attribute values, comments, CDATA or
    >special data. Fortunately, you don't have to reparse content.

    ^^^
    I'm going to take that back, you should reparse content as well.
    So I could tack on content.

    > Do you want me to
    >tack on CDATA and specials and whip up a little recursion? How about a little
    >entity replacement?
    >
    >I guess 'mailto:...' doesn't need HTML/XML parsing because it's not HTML.
    >But there is always /mailto:(.+)/sgi
    >
    >Think you're smart, don't ya?
    >A: (recurse reply)
    >


    Let me know.

    -sln
    Hope this helps.
    , Apr 20, 2009
    #11
  12. On 2009-04-19 22:26, Tad J McClellan <> wrote:
    > <> wrote:
    >><-- Don't include <a href='mailto:'> me -->

    >
    >
    > Why not?
    >
    > I thought the point was to find <a> elements, not ignore them.


    This isn't an a element, it is a comment.

    hp
    Peter J. Holzer, Apr 21, 2009
    #12
  13. Peter J. Holzer <> wrote:
    > On 2009-04-19 22:26, Tad J McClellan <> wrote:
    >> <> wrote:
    >>><-- Don't include <a href='mailto:'> me -->

    ^^^
    ^^^ should be <!--
    >>
    >>
    >> Why not?
    >>
    >> I thought the point was to find <a> elements, not ignore them.

    >
    > This isn't an a element, it is a comment.



    It is not a comment. The bang is missing.
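
    A small illustrative check of that point, with a made-up helper name (looks_like_comment): a comment opens with "<!--", so text that starts with "<--" does not qualify. The $qRx pattern in the earlier script appears to accept "--...--" between the angle brackets without the bang, which would explain why the sample line was skipped anyway.

```perl
use strict;
use warnings;

# An HTML/XML comment must open with "<!--"; "<--" is just text.
sub looks_like_comment {
    my ($text) = @_;
    return $text =~ /\A<!--.*?-->\z/s ? 1 : 0;
}

for my $sample ('<!-- note -->', '<-- note -->') {
    printf "%-14s %s\n", $sample,
        looks_like_comment($sample) ? 'comment' : 'not a comment';
}
```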


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
    Tad J McClellan, Apr 21, 2009
    #13
  14. On 2009-04-21 15:50, Tad J McClellan <> wrote:
    > Peter J. Holzer <> wrote:
    >> On 2009-04-19 22:26, Tad J McClellan <> wrote:
    >>> <> wrote:
    >>>><-- Don't include <a href='mailto:'> me -->

    > ^^^
    > ^^^ should be <!--
    >>>
    >>>
    >>> Why not?
    >>>
    >>> I thought the point was to find <a> elements, not ignore them.

    >>
    >> This isn't an a element, it is a comment.

    >
    >
    > It is not a comment. The bang is missing.


    My fault, sorry.

    hp
    Peter J. Holzer, Apr 21, 2009
    #14
  15. kun niu

    Guest

    On Tue, 21 Apr 2009 10:50:15 -0500, Tad J McClellan <> wrote:

    >Peter J. Holzer <> wrote:
    >> On 2009-04-19 22:26, Tad J McClellan <> wrote:
    >>> <> wrote:
    >>>><-- Don't include <a href='mailto:'> me -->

    > ^^^
    > ^^^ should be <!--
    >>>
    >>>
    >>> Why not?
    >>>
    >>> I thought the point was to find <a> elements, not ignore them.

    >>
    >> This isn't an a element, it is a comment.

    >
    >
    >It is not a comment. The bang is missing.


    My apologies.

    -sln
    , Apr 21, 2009
    #15
