RegEx issue

Dan

OK, I have a perl script that reads in html files and makes some link
replacements. Everything works OK except it changes something it
shouldn't. Here is my line of code:

@getstaf2[0] =~ s!href=\"([^(/)]+)[\.]+([^@?]+)\"!href=\"_miscfiles\/$1\.$2\"!ig;

This code replaces a file of the form <a href="whatever.xxx"> to <a
href="_miscfiles/whatever.xxx">.

Now that works fine, but it seems to change things it shouldn't be,
namely instances of <a href="mailto:[email protected]"> to <a
href="_miscfiles/[email protected]">.

Interestingly, if I have two or more mailto references on a page, it
will nicely not touch the first, but will change the second. More
interestingly, if I take out the global parameter 'g' from the end of
the regex, things work fine for the emails (it doesn't touch them), but
then the actual whatever.xxx replacements don't get done.

So I don't understand why it would (a) leave one alone but not the
other since the 'g' should make it do the same for all instances, or
(b) touch the email references at all. The [^?@] atom should make sure
it skips over any email address that happen to be of the form
(e-mail address removed).

Any help is greatly appreciated!! I've been trying to get this solved
for days!

Thanks,

Dan
 
Paul Lalli

OK, I have a perl script that reads in html files and makes some link
replacements. Everything works OK except it changes something it
shouldn't. Here is my line of code:

@getstaf2[0] =~ s!href=\"([^(/)]+)[\.]+([^@?]+)\"!href=\"_miscfiles\/$1\.$2\"!ig;

This code replaces a file of the form <a href="whatever.xxx"> to <a
href="_miscfiles/whatever.xxx">.

Now that works fine, but it seems to change things it shouldn't be,
namely instances of <a href="mailto:[email protected]"> to <a
href="_miscfiles/[email protected]">.

Interestingly, if I have two or more mailto references on a page, it
will nicely not touch the first, but will change the second. More
interestingly, if I take out the global parameter 'g' from the end of
the regex, things work fine for the emails (it doesn't touch them), but
then the actual whatever.xxx replacements don't get done.

So I don't understand why it would (a) leave one alone but not the
other since the 'g' should make it do the same for all instances, or

My guess - one is contained on a single line, another spans multiple
lines, and your methodology is reading the HTML file line by line.

(b) touch the email references at all. The [^?@] atom should make sure
it skips over any email address that happen to be of the form
(e-mail address removed).

That's not what's in the regexp above. What's in the regexp above is
[^@?] which is looking for any pattern that doesn't match the @?
variable. @ needs to be escaped in regexps, because they undergo
double-quotish interpolation.

Any help is greatly appreciated!! I've been trying to get this solved
for days!

The canonical answer to this question is: Don't parse HTML with RegExps!
Use one of the plethora of modules available on CPAN.

Paul Lalli
 
Brian McCauley

OK, I have a perl script that reads in html files and makes some link
replacements. Everything works OK except it changes something it
shouldn't. Here is my line of code:

@getstaf2[0] =~ s!href=\"([^(/)]+)[\.]+([^@?]+)\"!href=\"_miscfiles\/$1\.$2\"!ig;

This code replaces a file of the form <a href="whatever.xxx"> to <a
href="_miscfiles/whatever.xxx">.

Now that works fine, but it seems to change things it shouldn't be,
namely instances of <a href="mailto:[email protected]"> to <a
href="_miscfiles/[email protected]">.

Define "it shouldn't be". That target matches your regex.

Interestingly, if I have two or more mailto references on a page, it
will nicely not touch the first, but will change the second.

Actually that's probably not what's happening. Note that the regex
[^(/)]+ can match quote characters and angle brackets, so it can run
right out of one tag and into another.

So I don't understand why it would (a) leave one alone but not the
other since the 'g' should make it do the same for all instances, or
(b) touch the email references at all. The [^?@] atom should make sure
it skips over any email address that happen to be of the form
(e-mail address removed).

It does not prevent the @ being matched by the [^(/)]

Any help is greatly appreciated!! I've been trying to get this solved
for days!

There is a reason we keep telling everyone who comes here trying to
parse HTML using simple regex[1] not to do that[2].

Can you guess what that reason is?

[1] Typically at least a couple a week.

[2] And use an HTML parsing module instead.
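
The two observations above (the mailto target matches the regex, and [^(/)]+ can run out of one tag into the next) can be checked with a short script. This is only a sketch: the sample HTML strings and the address joe@example.com are made up for illustration, and the @ in the last character class is escaped on the assumption that [^@?] and [^\@?] behave identically here.

```perl
use strict;
use warnings;

# Dan's substitution, unchanged except for the escaped @ in the last
# character class (assumed to be equivalent to the unescaped form).
sub rewrite_links {
    my ($html) = @_;
    $html =~ s!href="([^(/)]+)[.]+([^\@?]+)"!href="_miscfiles/$1.$2"!ig;
    return $html;
}

# The intended case: a bare file name.
print rewrite_links('<a href="whatever.xxx">file</a>'), "\n";

# A mailto link: [^(/)]+ happily swallows "mailto:joe@example", so the
# [^\@?]+ part only ever sees "com" and the whole href is rewritten.
print rewrite_links('<a href="mailto:joe@example.com">Joe</a>'), "\n";

# Two tags with no slash between them: the first capture runs across
# the quote and the angle brackets into the second tag, so the prefix
# lands on the first href and the second one is left alone.
print rewrite_links('<a href="index">Home<br><a href="x.html">X</a>'), "\n";
```

In the third case the first capture is index">Home<br><a href="x, which shows how a single match can span two tags.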

--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\
 
Gunnar Hjalmarsson

Paul said:
(b) touch the email references at all. The [^?@] atom should make
sure it skips over any email address that happen to be of the
form (e-mail address removed).

That's not what's in the regexp above. What's in the regexp above
is [^@?]

That's the same character class.

which is looking for any pattern that doesn't match the @?
variable. @ needs to be escaped in regexps, because they undergo
double-quotish interpolation.

That's not true when defining a character class, is it?
 
Paul Lalli

Paul said:
(b) touch the email references at all. The [^?@] atom should make
sure it skips over any email address that happen to be of the
form (e-mail address removed).

That's not what's in the regexp above. What's in the regexp above
is [^@?]

That's the same character class.

It would seem not.

That's not true when defining a character class, is it?

It would seem it is.

#!/usr/bin/perl
@f = qw/a-z/;
print "letters\n" if 'abc' =~ /[@f]/;
print "numbers\n" if '123' =~ /[@f]/;

__END__
letters



Paul Lalli
 
Gunnar Hjalmarsson

Paul said:
Gunnar said:
Paul said:
On Thu, 29 Jul 2004, Dan wrote:
(b) touch the email references at all. The [^?@] atom should
make sure it skips over any email address that happen to be
of the form (e-mail address removed).

That's not what's in the regexp above. What's in the regexp
above is [^@?]

That's the same character class.

It would seem not.
That's not true when defining a character class, is it?

It would seem it is.

#!/usr/bin/perl
@f = qw/a-z/;
print "letters\n" if 'abc' =~ /[@f]/;
print "numbers\n" if '123' =~ /[@f]/;

__END__
letters

Hmm... It would seem I stand corrected. :)

Nevertheless, before posting I did something like this:

print "No match\n" unless 'abc@def' =~ /^[^@?]+$/;
print "Match\n" if 'abcdef' =~ /^[^@?]+$/;

Outputs:
No match
Match

So the case seems not to be *that* obvious...
 
Charles DeRykus

Paul said:
Gunnar said:
Paul Lalli wrote:
On Thu, 29 Jul 2004, Dan wrote:
(b) touch the email references at all. The [^?@] atom should
make sure it skips over any email address that happen to be
of the form (e-mail address removed).

That's not what's in the regexp above. What's in the regexp
above is [^@?]

That's the same character class.

It would seem not.
which is looking for any pattern that doesn't match the @?
variable. @ needs to be escaped in regexps, because they
undergo double-quotish interpolation.

That's not true when defining a character class, is it?

It would seem it is.

#!/usr/bin/perl
@f = qw/a-z/;
print "letters\n" if 'abc' =~ /[@f]/;
print "numbers\n" if '123' =~ /[@f]/;

__END__
letters

Hmm... It would seem I stand corrected. :)

Nevertheless, before posting I did something like this:

print "No match\n" unless 'abc@def' =~ /^[^@?]+$/;
print "Match\n" if 'abcdef' =~ /^[^@?]+$/;

Outputs:
No match
Match

So the case seems not to be *that* obvious...

Looks like you're right...

perl -MO=Deparse -wle '/[@?]/'
/[\@?]/;

perl -MO=Deparse -wle '/[ab@]/'
/[ab\@]/;

perl -MO=Deparse -wle '/[@m]/'
Possible unintended interpolation of @m in string at -e line 1.
Name "main::m" used only once: possible typo at -e line 1.
/[@m]/;
 
Dan

Thanks for the help - I still wasn't able to get that code working
however.
I am interested in using one of the HTML parsers on CPAN, but they all
seem somewhat confusing to me. I can't seem to figure out how they
operate and how I might use them to extract and manipulate links from
some HTML stored in a string. If anyone knows any tutorials or
dumbed-down examples around the web, I'd very much appreciate a link!

Dan
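
For readers with the same question, here is a minimal sketch of rewriting links in an HTML string with HTML::TokeParser (part of the HTML-Parser distribution on CPAN, so it has to be installed). The function name relocate_links, the prefix argument, and the rule for what counts as a "bare file name" are all assumptions made for illustration, not something taken from this thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(encode_entities);

# Hypothetical helper: prepend $prefix to the href of every <a> tag
# whose href looks like a bare file name (name.ext with no scheme,
# path, query, fragment or @). Everything else is copied through.
sub relocate_links {
    my ($html, $prefix) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "cannot create parser for HTML string";
    my $out = '';
    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' && $token->[1] eq 'a' && defined $token->[2]{href}) {
            my ($attr, $attrseq) = @{$token}[2, 3];
            if ($attr->{href} =~ m{\A[^:/?#\@]+\.[^:/?#\@]+\z}) {
                $attr->{href} = $prefix . $attr->{href};
            }
            # Rebuild the tag from its attributes in original order.
            # Attribute values arrive decoded, so re-encode them.
            $out .= '<a' . join('', map {
                ' ' . $_ . '="' . encode_entities($attr->{$_}, '<>&"') . '"'
            } @$attrseq) . '>';
        }
        elsif ($type eq 'T') {
            $out .= $token->[1];     # text tokens carry the raw text second
        }
        else {
            $out .= $token->[-1];    # other tokens carry the raw source last
        }
    }
    return $out;
}

print relocate_links(
    '<p><a href="whatever.xxx">file</a> <a href="mailto:joe@example.com">Joe</a></p>',
    '_miscfiles/'), "\n";
```

Because the parser hands over each href as a separate value, the test for "is this a plain file name" never sees quotes, angle brackets, or neighbouring tags. Rebuilding the tag normalises its quoting and attribute-name case, which may or may not matter for a given page.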
 
Brian McCauley

Gunnar Hjalmarsson said:
Charles said:
Gunnar said:
print "No match\n" unless 'abc@def' =~ /^[^@?]+$/;
print "Match\n" if 'abcdef' =~ /^[^@?]+$/;
Outputs:
No match
Match
Looks like you're right...

I seem to be right about /[^@?]/, but I apparently jumped to conclusions.
perl -MO=Deparse -wle '/[@?]/'
/[\@?]/;
perl -MO=Deparse -wle '/[ab@]/'
/[ab\@]/;
perl -MO=Deparse -wle '/[@m]/'
Possible unintended interpolation of @m in string at -e line 1.
Name "main::m" used only once: possible typo at -e line 1.
/[@m]/;

Those warnings are displayed if strictures are not enabled and you
haven't declared the @m variable.

So, I'm a little confused. The lesson here is that @ gets interpolated
in regexes sometimes. Maybe a good enough reason to always escape that
character, but a less ambiguous conclusion would be nice. :)

I always escape @ that I don't want to interpolate in an interpolative
context.

I cannot find a full explanation of exactly when an unescaped @ in an
interpolative context will be treated as literal, even in the "Gory
details of parsing quoted constructs".

Interpolating arrays into regex doesn't make a lot of sense. The only
time I see it used is when using the @{[...]} construct.
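
The behaviour discussed in this thread can be put side by side in one small script. It is a sketch only: the array @f mirrors Paul's earlier example, and the other strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my @f = ('a-z');

# Unescaped @f inside a character class is interpolated, so the class
# becomes [a-z]: letters match, digits do not.
my $class_letters = ('abc' =~ /[@f]/) ? 1 : 0;
my $class_digits  = ('123' =~ /[@f]/) ? 1 : 0;

# Escaped, the class is just the two characters "@" and "f".
my $escaped_at = ('@' =~ /^[\@f]$/) ? 1 : 0;
my $escaped_a  = ('a' =~ /^[\@f]$/) ? 1 : 0;

# The @{[ ... ]} construct interpolates the result of an expression,
# so this pattern is effectively /^id=42$/.
my $computed = ('id=42' =~ /^id=@{[ 40 + 2 ]}$/) ? 1 : 0;

print "letters via [\@f] interpolated: $class_letters\n";
print "digits via [\@f] interpolated:  $class_digits\n";
print "'\@' against escaped class:      $escaped_at\n";
print "'a' against escaped class:      $escaped_a\n";
print "computed pattern matched:       $computed\n";
```

Escaping every @ that is meant literally, as suggested above, avoids having to remember which of these cases applies.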

 
