What's wrong with the following regular expression?

kun niu · Apr 18, 2009

Dear all,

I'm trying to help to extrace email from company's website.
Here's part of my test script.

$content = "<a href=\"mailto:[email protected]\"><a class=\"hello\" href=
\"mailto:[email protected]?title=hello\">";
@emails = ($content =~ /<a.*href="mailto

.*)>"/cgim);
foreach my $email (@emails)
{
print "email:" . $email . "\n";
}
But to my surprise, no result is printed.
I'm working on Debian squeeze.
My perl version is 5.10.0.
Would anyone here please help me out?
Thanks for any hints or advice in advance.

Tad J McClellan · Apr 18, 2009

$content = "<a href=\"mailto:[email protected]\"><a class=\"hello\" href= ^^^^^^^
\"mailto:[email protected]?title=hello\">";

You should always enable warnings when developing Perl code.

@emails = ($content =~ /<a.*href="mailto.*)>"/cgim);

^^
^^ these are transposed...

kun niu · Apr 19, 2009

You should always enable warnings when developing Perl code.

^^
^^ these are transposed...

Thank you for your reply.
But I wonder how @emails = ($content =~ /<a.*href="mailto

.*)>"/
cgim); is transposed.
I turned on warnings with "perl -w" and I don't see a warning here.

Tad J McClellan · Apr 19, 2009

^^
^^

The data has a quote followed by an angle bracket.

The word "transposed" means "order is reversed"...

It is bad manners to quote .sigs.

But I wonder how @emails = ($content =~ /<a.*href="mailto.*)>"/

^^
^^

The pattern has an angle bracket followed by a quote.

I turned on warnings with "perl -w" and I don't see a warning here.

The warning (Possible unintended interpolation) was from the
assignment line, not the pattern line.

kun niu · Apr 19, 2009

^^
^^

The data has a quote followed by an angle bracket.

The word "transposed" means "order is reversed"...

It is bad manners to quote .sigs.

But I wonder how @emails = ($content =~ /<a.*href="mailto.*)>"/

Click to expand...

^^
^^

The pattern has an angle bracket followed by a quote.

I turned on warnings with "perl -w" and I don't see a warning here.

Click to expand...

The warning (Possible unintended interpolation) was from the
assignment line, not the pattern line.

I got it.

Sorry for my carelessness.
And really appreciate your reply.

Tad J McClellan · Apr 19, 2009

kun niu said:
It is bad manners to quote .sigs.

Click to expand...

OK. That's enough.

Off to the killfile with you.

sln · Apr 19, 2009

Dear all,

I'm trying to help to extrace email from company's website.
Here's part of my test script.

$content = "<a href=\"mailto:[email protected]\"><a class=\"hello\" href=
\"mailto:[email protected]?title=hello\">";
@emails = ($content =~ /<a.*href="mailto.*)>"/cgim);
foreach my $email (@emails)
{
print "email:" . $email . "\n";
}
But to my surprise, no result is printed.
I'm working on Debian squeeze.
My perl version is 5.10.0.
Would anyone here please help me out?
Thanks for any hints or advice in advance.

Below are better html/xml regular expressions to parse <tag attrib/>,
what your interrested in. And it will get all the 'mailto:'s.
Might as well do it right.
Test sample html and output below __DATA__ section

-sln

## Arxp.pl
##
## Simple html/xml regexp parser for just <tag attrib/>
## No entity conversions, no extras.
## Let me know if you want conversions or simple extra's
## -sln 4/19/09
##

my @UC_Nstart = (
"\\x{C0}-\\x{D6}",
"\\x{D8}-\\x{F6}",
"\\x{F8}-\\x{2FF}",
"\\x{370}-\\x{37D}",
"\\x{37F}-\\x{1FFF}",
"\\x{200C}-\\x{200D}",
"\\x{2070}-\\x{218F}",
"\\x{2C00}-\\x{2FEF}",
"\\x{3001}-\\x{D7FF}",
"\\x{F900}-\\x{FDCF}",
"\\x{FDF0}-\\x{FFFD}",
"\\x{10000}-\\x{EFFFF}",
);
my @UC_Nchar = (
"\\x{B7}",
"\\x{0300}-\\x{036F}",
"\\x{203F}-\\x{2040}",
);
my $Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
my $Nchar = "[-\\w:\\.".join ('',@UC_Nchar).join ('',@UC_Nstart)."]";
my $Name = "(?:$Nstrt$Nchar*)";

my $qRx = qr/<(?

?

$Name)(\s+(?

?

?:".*?")|(?:'.*?'))|(?:[^>]*?))+)\s*\/?)|--.*?--)>/s;
# <( ( 1 12 ( ( ( )|( ))|( )) 2 )| )>

my $qRxAttr = qr/\G\s+(?

?

$Name)\s*=\s*("|'|))|($Name))/;
my $qRxAttr_DL1 = qr/\G(?

[^'&<]*?)|([^'<]*?))'/;
my $qRxAttr_DL2 = qr/\G(?

[^"&<]*?)|([^"<]*?))"/;
my $qRxAttr_DL3 = qr/\G([^"'=<\s]+)/;

my $html = join '', <DATA>;

while ($html =~ /$qRx/g)
{
## <tag attrib/> or <tag attrib>
##
if (defined $1) # && lc($1) eq 'a'
{
my %result = ();
# get attributes
$result = _getAttrARRAY ($2, 0, \%result);

## do checks
if (length ($result->{'errstr'})) {
# missing or extra token, hard error
print "Error in tag attrib string here ->'$result->{'errstr'}'\n";
next;
}
## we will consider these acceptable html, not processed for this
# length ($result->{'dupattrs'})
# length ($result->{'badattrs'})
# length ($result->{'noquoteattrs'})

## process (scrape) attribute array for 'mailto:'
my %htmp = @{$result->{'attrsref'}};

while (my ($atr,$val) = each %htmp)
{
push @emails, $1 if ($val =~ /mailto

.+)/is);
}
}
}
print $_,"\n" for @emails;

# -------------------------------

sub _convertEntities { undef } # intentionally blank

sub _getAttrARRAY
{
my ($attrstr, $conv_ent, $hresult) = @_;
@{$hresult->{'attrsref'}} = ();
$hresult->{'badattrs'} = '';
$hresult->{'dupattrs'} = '';
$hresult->{'noquoteattrs'} = '';
$hresult->{'errstr'} = '';
my %hseen = ();
my $aref = $hresult->{'attrsref'};
my ($alt_attval, $attval, $rx, $ndx, $DL3);
# my $tmpstr = $attrstr;
my $match = 0;

while ($attrstr =~ /$qRxAttr/gc)
{
$match = 1;
if (defined $2)
{
$ndx = push @{$aref},$1;
$DL3 = 0;

if ($2 eq "'") {
$rx = \$qRxAttr_DL1;
}
elsif ($2 eq '"') {
$rx = \$qRxAttr_DL2;
} else {
# no quotes
$rx = \$qRxAttr_DL3;
$DL3 = 1;
}
if (++$hseen{$1} == 2) {
$hresult->{'dupattrs'} .= ", $1";
$hresult->{'dupattrs'} =~ s/^(?:, )+//;
}
if ($attrstr =~ /$$rx/gc) {
if (!$DL3)
{
## normal quoted value
if (defined $1) {
push @{$aref},$1;
next;
}
$attval = $2;
if ($conv_ent && defined ($alt_attval = _convertEntities (\$attval))) {
push @{$aref},$$alt_attval;
next;
}
push @{$aref},$attval;
next;
}
## bad attrib, value is not quoted
$attval = $1;
if ($conv_ent && defined ($alt_attval = _convertEntities (\$attval))) {
push @{$aref},$$alt_attval;
} else {
push @{$aref},$attval;
}
$hresult->{'noquoteattrs'} .= ", ".$aref->[$ndx-1];
$hresult->{'noquoteattrs'} =~ s/^(?:, )+//;
next;
}
## bad value, its either '<' or no ["'] closure
$hresult->{'badattrs'} .= ", ".$aref->[$ndx-1];
$hresult->{'badattrs'} =~ s/^(?:, )+//;
push @{$aref},'UNDEF_ATTRVAL';
# trim up to '<', otherwise its reported as
# improperly quoted or missing value
$attrstr = substr ($attrstr, pos($attrstr));
$attrstr =~ s/^[^<]+//;
} else {
## attrib with no attrib value
## (standalone atrribute only)
$ndx = push @{$aref},$3;
if (++$hseen{$3} == 2) {
$hresult->{'dupattrs'} .= ", $3";
$hresult->{'dupattrs'} =~ s/^(?:, )+//;
}
$hresult->{'badattrs'} .= ", ".$aref->[$ndx-1];
$hresult->{'badattrs'} =~ s/^(?:, )+//;
push @{$aref},'UNDEF_ATTRVAL';
next;
}
# bad, return that part of string which is in error
$hresult->{'errstr'} = $attrstr;
return $hresult;
}
pos($attrstr) = 0 if (!$match);
if (length($attrstr) > pos($attrstr)) {
$attrstr = substr ($attrstr, pos($attrstr));
$attrstr =~ s/^\s+//; $attrstr =~ s/\s+$//;
# bad, return that part of string which is in error
# print "-BAD-:$tmpstr\n";
$hresult->{'errstr'} = $attrstr if (length($attrstr));
}
return $hresult;
}

__DATA__

<-- Don't include <a href='mailto:[email protected]'> me -->
<a href='mailto:[email protected]'>
<a class="hello" href=
"mailto:[email protected]?title='<hello&gt';">
<tag scrape_me = "mailto:[email protected]"
/>
<oh nogood = "mailto:[email protected]' >

Output:
(e-mail address removed)
(e-mail address removed)?title='<hello&gt';
(e-mail address removed)

sln · Apr 19, 2009

Below are better html/xml regular expressions to parse <tag attrib/>,
what your interrested in. And it will get all the 'mailto:'s.
Might as well do it right.
Test sample html and output below __DATA__ section

Sorry bout that. Left out a couple of things in the chop.
-sln

------------------------------------------------

## Arxp.pl
##
## Simple html/xml regexp parser for just <tag attrib/>

[snip]

use strict;
use warnings;

[snip]

my @UC_Nstart = (
[snip]

while ($html =~ /$qRx/g)
{
## <tag attrib/> or <tag attrib>
##
if (defined $1) # && lc($1) eq 'a'
{
my %result = ();
# get attributes

_getAttrARRAY ($2, 0, \%result);

## do checks

if (length ($result{'errstr'})) {

# missing or extra token, hard error

print "Error in tag attrib string here ->'$result{'errstr'}'\n";

next;
}
## we will consider these acceptable html, not processed for this

# length ($result{'dupattrs'})
# length ($result{'badattrs'})
# length ($result{'noquoteattrs'})

## process (scrape) attribute array for 'mailto:'

my %htmp = @{$result{'attrsref'}};

while (my ($atr,$val) = each %htmp)
{
push @emails, $1 if ($val =~ /mailto.+)/is);
}
}
}

[snip]

Tad J McClellan · Apr 19, 2009

<-- Don't include <a href='mailto:[email protected]'> me -->

Why not?

I thought the point was to find <a> elements, not ignore them.

sln · Apr 20, 2009

Why not?

I thought the point was to find <a> elements, not ignore them.

^^^
Your close, its like password. You have a partial form of the answer.
A: <tag attrib> or <tag attrib/>

Why its Deja vu' all over again.

<a ...> is not an element inside a comment, not inside CDATA either.
Comments inside CDATA, CDATA inside comments aren't markup either.
Feel free to recursively parse data. Be it attribute values, comment, CDATA or
special data. Fortunately, you don't have to reparse content. Do you want me to
tack on CDATA and special's and whip up a little recursion? How bout a little
entity replacement?

I guess 'mailto:...' doesen't neeed html/xml parsing because its not html.
But there is always /mailto

.+)/sgi

Think your smart don't ya?
A: (recurse reply)

-sln

sln · Apr 20, 2009

^^^
Your close, its like password. You have a partial form of the answer.
A: <tag attrib> or <tag attrib/>

Why its Deja vu' all over again.

<a ...> is not an element inside a comment, not inside CDATA either.
Comments inside CDATA, CDATA inside comments aren't markup either.
Feel free to recursively parse data. Be it attribute values, comment, CDATA or
special data. Fortunately, you don't have to reparse content.

^^^
I'm going to take that back, you should reparse content as well.
So I could tack on content.

Do you want me to
tack on CDATA and special's and whip up a little recursion? How bout a little
entity replacement?

I guess 'mailto:...' doesen't neeed html/xml parsing because its not html.
But there is always /mailto.+)/sgi

Think your smart don't ya?
A: (recurse reply)

Let me know.

-sln
Hope this helps.

Peter J. Holzer · Apr 21, 2009

Why not?

I thought the point was to find <a> elements, not ignore them.

This isn't an a element, it is a comment.

hp

Tad J McClellan · Apr 21, 2009

^^^
^^^ should be <!--

This isn't an a element, it is a comment.

It is not a comment. The bang is missing.

Peter J. Holzer · Apr 21, 2009

^^^
^^^ should be <!--

It is not a comment. The bang is missing.

My fault, sorry.

hp

sln · Apr 21, 2009

^^^
^^^ should be <!--

It is not a comment. The bang is missing.

My apologies.

-sln

Only one table shows up with the information	2	Mar 29, 2023
Working on mobile css menu with plenty of frustration!	2	Dec 29, 2022
FAQ 6.5 I put a regular expression into $/ but it didn't work. What's wrong?	0	Jan 28, 2011
Help with code	0	Jun 12, 2022
Flip-Cards with Local Images	1	Mar 27, 2023
Having difficulty with the layout of these images / video for this web page	2	Jul 5, 2022
Regular expression segmentation Fault with in-place substitution	1	Jul 29, 2009
What's the best way to write this regular expression?	41	Mar 6, 2012

What's wrong with the following regular expression?

kun niu

Tad J McClellan

kun niu

Tad J McClellan

kun niu

Tad J McClellan

sln

sln

Tad J McClellan

sln

sln

Peter J. Holzer

Tad J McClellan

Peter J. Holzer

sln

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads