What's wrong with the following regular expression?

K

kun niu

Dear all,

I'm trying to help to extrace email from company's website.
Here's part of my test script.

$content = "<a href=\"mailto:[email protected]\"><a class=\"hello\" href=
\"mailto:[email protected]?title=hello\">";
@emails = ($content =~ /<a.*href="mailto:(.*)>"/cgim);
foreach my $email (@emails)
{
print "email:" . $email . "\n";
}
But to my surprise, no result is printed.
I'm working on Debian squeeze.
My perl version is 5.10.0.
Would anyone here please help me out?
Thanks for any hints or advice in advance.
 
K

kun niu

You should always enable warnings when developing Perl code.


^^
^^ these are transposed...

Thank you for your reply.
But I wonder how @emails = ($content =~ /<a.*href="mailto:(.*)>"/
cgim); is transposed.
I turned on warnings with "perl -w" and I don't see a warning here.
 
T

Tad J McClellan

^^
^^

The data has a quote followed by an angle bracket.



The word "transposed" means "order is reversed"...



It is bad manners to quote .sigs.

But I wonder how @emails = ($content =~ /<a.*href="mailto:(.*)>"/
^^
^^

The pattern has an angle bracket followed by a quote.

I turned on warnings with "perl -w" and I don't see a warning here.


The warning (Possible unintended interpolation) was from the
assignment line, not the pattern line.
 
K

kun niu

^^
^^

The data has a quote followed by an angle bracket.

The word "transposed" means "order is reversed"...

It is bad manners to quote .sigs.
But I wonder how @emails = ($content =~ /<a.*href="mailto:(.*)>"/

^^
^^

The pattern has an angle bracket followed by a quote.
I turned on warnings with "perl -w" and I don't see a warning here.

The warning (Possible unintended interpolation) was from the
assignment line, not the pattern line.

I got it.:)
Sorry for my carelessness.
And really appreciate your reply.
 
S

sln

Dear all,

I'm trying to help to extrace email from company's website.
Here's part of my test script.

$content = "<a href=\"mailto:[email protected]\"><a class=\"hello\" href=
\"mailto:[email protected]?title=hello\">";
@emails = ($content =~ /<a.*href="mailto:(.*)>"/cgim);
foreach my $email (@emails)
{
print "email:" . $email . "\n";
}
But to my surprise, no result is printed.
I'm working on Debian squeeze.
My perl version is 5.10.0.
Would anyone here please help me out?
Thanks for any hints or advice in advance.

Below are better html/xml regular expressions to parse <tag attrib/>,
what your interrested in. And it will get all the 'mailto:'s.
Might as well do it right.
Test sample html and output below __DATA__ section

-sln

## Arxp.pl
##
## Simple html/xml regexp parser for just <tag attrib/>
## No entity conversions, no extras.
## Let me know if you want conversions or simple extra's
## -sln 4/19/09
##

my @UC_Nstart = (
"\\x{C0}-\\x{D6}",
"\\x{D8}-\\x{F6}",
"\\x{F8}-\\x{2FF}",
"\\x{370}-\\x{37D}",
"\\x{37F}-\\x{1FFF}",
"\\x{200C}-\\x{200D}",
"\\x{2070}-\\x{218F}",
"\\x{2C00}-\\x{2FEF}",
"\\x{3001}-\\x{D7FF}",
"\\x{F900}-\\x{FDCF}",
"\\x{FDF0}-\\x{FFFD}",
"\\x{10000}-\\x{EFFFF}",
);
my @UC_Nchar = (
"\\x{B7}",
"\\x{0300}-\\x{036F}",
"\\x{203F}-\\x{2040}",
);
my $Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
my $Nchar = "[-\\w:\\.".join ('',@UC_Nchar).join ('',@UC_Nstart)."]";
my $Name = "(?:$Nstrt$Nchar*)";

my $qRx = qr/<(?:(?:($Name)(\s+(?:(?:(?:".*?")|(?:'.*?'))|(?:[^>]*?))+)\s*\/?)|--.*?--)>/s;
# <( ( 1 12 ( ( ( )|( ))|( )) 2 )| )>

my $qRxAttr = qr/\G\s+(?:(?:($Name)\s*=\s*("|'|))|($Name))/;
my $qRxAttr_DL1 = qr/\G(?:([^'&<]*?)|([^'<]*?))'/;
my $qRxAttr_DL2 = qr/\G(?:([^"&<]*?)|([^"<]*?))"/;
my $qRxAttr_DL3 = qr/\G([^"'=<\s]+)/;


my $html = join '', <DATA>;

while ($html =~ /$qRx/g)
{
## <tag attrib/> or <tag attrib>
##
if (defined $1) # && lc($1) eq 'a'
{
my %result = ();
# get attributes
$result = _getAttrARRAY ($2, 0, \%result);

## do checks
if (length ($result->{'errstr'})) {
# missing or extra token, hard error
print "Error in tag attrib string here ->'$result->{'errstr'}'\n";
next;
}
## we will consider these acceptable html, not processed for this
# length ($result->{'dupattrs'})
# length ($result->{'badattrs'})
# length ($result->{'noquoteattrs'})

## process (scrape) attribute array for 'mailto:'
my %htmp = @{$result->{'attrsref'}};

while (my ($atr,$val) = each %htmp)
{
push @emails, $1 if ($val =~ /mailto:(.+)/is);
}
}
}
print $_,"\n" for @emails;

# -------------------------------

sub _convertEntities { undef } # intentionally blank

sub _getAttrARRAY
{
my ($attrstr, $conv_ent, $hresult) = @_;
@{$hresult->{'attrsref'}} = ();
$hresult->{'badattrs'} = '';
$hresult->{'dupattrs'} = '';
$hresult->{'noquoteattrs'} = '';
$hresult->{'errstr'} = '';
my %hseen = ();
my $aref = $hresult->{'attrsref'};
my ($alt_attval, $attval, $rx, $ndx, $DL3);
# my $tmpstr = $attrstr;
my $match = 0;

while ($attrstr =~ /$qRxAttr/gc)
{
$match = 1;
if (defined $2)
{
$ndx = push @{$aref},$1;
$DL3 = 0;

if ($2 eq "'") {
$rx = \$qRxAttr_DL1;
}
elsif ($2 eq '"') {
$rx = \$qRxAttr_DL2;
} else {
# no quotes
$rx = \$qRxAttr_DL3;
$DL3 = 1;
}
if (++$hseen{$1} == 2) {
$hresult->{'dupattrs'} .= ", $1";
$hresult->{'dupattrs'} =~ s/^(?:, )+//;
}
if ($attrstr =~ /$$rx/gc) {
if (!$DL3)
{
## normal quoted value
if (defined $1) {
push @{$aref},$1;
next;
}
$attval = $2;
if ($conv_ent && defined ($alt_attval = _convertEntities (\$attval))) {
push @{$aref},$$alt_attval;
next;
}
push @{$aref},$attval;
next;
}
## bad attrib, value is not quoted
$attval = $1;
if ($conv_ent && defined ($alt_attval = _convertEntities (\$attval))) {
push @{$aref},$$alt_attval;
} else {
push @{$aref},$attval;
}
$hresult->{'noquoteattrs'} .= ", ".$aref->[$ndx-1];
$hresult->{'noquoteattrs'} =~ s/^(?:, )+//;
next;
}
## bad value, its either '<' or no ["'] closure
$hresult->{'badattrs'} .= ", ".$aref->[$ndx-1];
$hresult->{'badattrs'} =~ s/^(?:, )+//;
push @{$aref},'UNDEF_ATTRVAL';
# trim up to '<', otherwise its reported as
# improperly quoted or missing value
$attrstr = substr ($attrstr, pos($attrstr));
$attrstr =~ s/^[^<]+//;
} else {
## attrib with no attrib value
## (standalone atrribute only)
$ndx = push @{$aref},$3;
if (++$hseen{$3} == 2) {
$hresult->{'dupattrs'} .= ", $3";
$hresult->{'dupattrs'} =~ s/^(?:, )+//;
}
$hresult->{'badattrs'} .= ", ".$aref->[$ndx-1];
$hresult->{'badattrs'} =~ s/^(?:, )+//;
push @{$aref},'UNDEF_ATTRVAL';
next;
}
# bad, return that part of string which is in error
$hresult->{'errstr'} = $attrstr;
return $hresult;
}
pos($attrstr) = 0 if (!$match);
if (length($attrstr) > pos($attrstr)) {
$attrstr = substr ($attrstr, pos($attrstr));
$attrstr =~ s/^\s+//; $attrstr =~ s/\s+$//;
# bad, return that part of string which is in error
# print "-BAD-:$tmpstr\n";
$hresult->{'errstr'} = $attrstr if (length($attrstr));
}
return $hresult;
}

__DATA__

<-- Don't include <a href='mailto:[email protected]'> me -->
<a href='mailto:[email protected]'>
<a class="hello" href=
"mailto:[email protected]?title='&lt;hello&gt';">
<tag scrape_me = "mailto:[email protected]"
/>
<oh nogood = "mailto:[email protected]' >


Output:
(e-mail address removed)
(e-mail address removed)?title='&lt;hello&gt';
(e-mail address removed)
 
S

sln

Below are better html/xml regular expressions to parse <tag attrib/>,
what your interrested in. And it will get all the 'mailto:'s.
Might as well do it right.
Test sample html and output below __DATA__ section

Sorry bout that. Left out a couple of things in the chop.
-sln

------------------------------------------------
## Arxp.pl
##
## Simple html/xml regexp parser for just <tag attrib/>
[snip]

use strict;
use warnings;

[snip]
my @UC_Nstart = (
[snip]

while ($html =~ /$qRx/g)
{
## <tag attrib/> or <tag attrib>
##
if (defined $1) # && lc($1) eq 'a'
{
my %result = ();
# get attributes
_getAttrARRAY ($2, 0, \%result);
## do checks
if (length ($result{'errstr'})) {
# missing or extra token, hard error
print "Error in tag attrib string here ->'$result{'errstr'}'\n";
next;
}
## we will consider these acceptable html, not processed for this
# length ($result{'dupattrs'})
# length ($result{'badattrs'})
# length ($result{'noquoteattrs'})
## process (scrape) attribute array for 'mailto:'
my %htmp = @{$result{'attrsref'}};
while (my ($atr,$val) = each %htmp)
{
push @emails, $1 if ($val =~ /mailto:(.+)/is);
}
}
}
[snip]
 
S

sln

Why not?

I thought the point was to find <a> elements, not ignore them.
^^^
Your close, its like password. You have a partial form of the answer.
A: <tag attrib> or <tag attrib/>

Why its Deja vu' all over again.

<a ...> is not an element inside a comment, not inside CDATA either.
Comments inside CDATA, CDATA inside comments aren't markup either.
Feel free to recursively parse data. Be it attribute values, comment, CDATA or
special data. Fortunately, you don't have to reparse content. Do you want me to
tack on CDATA and special's and whip up a little recursion? How bout a little
entity replacement?

I guess 'mailto:...' doesen't neeed html/xml parsing because its not html.
But there is always /mailto:(.+)/sgi

Think your smart don't ya?
A: (recurse reply)

-sln
 
S

sln

^^^
Your close, its like password. You have a partial form of the answer.
A: <tag attrib> or <tag attrib/>

Why its Deja vu' all over again.

<a ...> is not an element inside a comment, not inside CDATA either.
Comments inside CDATA, CDATA inside comments aren't markup either.
Feel free to recursively parse data. Be it attribute values, comment, CDATA or
special data. Fortunately, you don't have to reparse content.
^^^
I'm going to take that back, you should reparse content as well.
So I could tack on content.
Do you want me to
tack on CDATA and special's and whip up a little recursion? How bout a little
entity replacement?

I guess 'mailto:...' doesen't neeed html/xml parsing because its not html.
But there is always /mailto:(.+)/sgi

Think your smart don't ya?
A: (recurse reply)

Let me know.

-sln
Hope this helps.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top