RegEx issue

Dan · Jul 29, 2004

OK, I have a perl script that reads in html files and makes some link
replacements. Everything works OK except it changes something it
shouldn't. Here is my line of code:

@getstaf2[0] =~ s!href=\"([^(/)]+)[\.]+([^@?]+)\"!href=\"_miscfiles\/$1\.$2\"!ig;

This code replaces a file of the form <a href="whatever.xxx"> to <a
href="_miscfiles/whatever.xxx">.

Now that works fine, but it seems to change things it shouldn't be,
namely instances of <a href="mailto:[email protected]"> to <a
href="_miscfiles/[email protected]">.

Interestingly, if I have two or more mailto references on a page, it
will nicely not touch the first, but will change the second. More
interestingly, if I take out the global parameter 'g' from the end of
the regex, things work fine for the emails (it doesn't touch them), but
then the actual whatever.xxx replacements don't get done.

So I don't understand why it would (a) leave one alone but not the
other since the 'g' should make it do the same for all instances, or
(b) touch the email references at all. The [^?@] atom should make sure
it skips over any email address that happen to be of the form
(e-mail address removed).

Any help is greatly appreciated!! I've been trying to get this solved
for days!

Thanks,

Dan

Paul Lalli · Jul 29, 2004

Dan said:
So I don't understand why it would (a) leave one alone but not the
other since the 'g' should make it do the same for all instances, or

My guess - one is contained on a single line, another spans multiple
lines, and your methodology is reading the HTML file line by line.

Dan said:
(b) touch the email references at all. The [^?@] atom should make sure
it skips over any email address that happen to be of the form
(e-mail address removed).

That's not what's in the regexp above. What's in the regexp above is
[^@?] which is looking for any pattern that doesn't match the @?
variable. @ needs to be escaped in regexps, because they undergo
double-quotish interpolation.

Dan said:
Any help is greatly appreciated!! I've been trying to get this solved
for days!

The canonical answer to this question is: Don't parse HTML with RegExps!
Use one of the plethora of modules available on CPAN.

Paul Lalli

Brian McCauley · Jul 29, 2004

Dan said:
Now that works fine, but it seems to change things it shouldn't be,
namely instances of <a href="mailto:[email protected]"> to <a
href="_miscfiles/[email protected]">.

Define "it shouldn't be". That target matches your regex.

Dan said:
Interestingly, if I have two or more mailto references on a page, it
will nicely not touch the first, but will change the second.

Actually that's probably not what's happening. Note that the regex
[^(/)]+ can match quote characters and angle brackets, so it can run
right out of one tag and into another.

Dan said:
So I don't understand why it would (a) leave one alone but not the
other since the 'g' should make it do the same for all instances, or
(b) touch the email references at all. The [^?@] atom should make sure
it skips over any email address that happen to be of the form
(e-mail address removed).

It does not prevent the @ being matched by the [^(/)] class.
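
To make that concrete, here is a minimal sketch (not part of Dan's script) that runs his pattern against a single mailto link. The address someone@example.com and the helper name prefix_local_links are made up for illustration, and the tighter pattern is only one possible way to keep a match inside the attribute value; it assumes double-quoted href values.

```perl
use strict;
use warnings;

# Dan's pattern, with the @ escaped so there is no doubt about interpolation.
my $dan_re = qr!href="([^(/)]+)[\.]+([^\@?]+)"!i;

# Made-up address, standing in for the one removed from the post.
my $mailto = '<a href="mailto:someone@example.com">mail</a>';

if ($mailto =~ $dan_re) {
    # The first class swallows the "@", so the second class never sees it.
    print "1: $1\n";   # 1: mailto:someone@example
    print "2: $2\n";   # 2: com
}

# Hypothetical tighter version: both classes exclude the double quote, so a
# match cannot run out of the attribute value, and excluding : / @ ? skips
# mailto: links, URLs with a scheme or path, and query strings.
sub prefix_local_links {
    my ($html) = @_;
    $html =~ s!href="([^"/:\@?]+\.[^"/:\@?]+)"!href="_miscfiles/$1"!ig;
    return $html;
}

print prefix_local_links('<a href="whatever.xxx">'), "\n";
print prefix_local_links($mailto), "\n";   # printed unchanged
```

Even the tighter pattern only covers the simple case shown here. It is a stopgap, not a replacement for a real HTML parser.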

Dan said:
Any help is greatly appreciated!! I've been trying to get this solved
for days!

There is a reason we keep telling everyone who comes here trying to
parse HTML using simple regex[1] not to do that[2].

Can you guess what that reason is?

[1] Typically at least a couple a week.

[2] And use an HTML parsing module instead.

--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\

Gunnar Hjalmarsson · Jul 29, 2004

Dan said:
(b) touch the email references at all. The [^?@] atom should make
sure it skips over any email address that happen to be of the
form (e-mail address removed).

Paul said:
That's not what's in the regexp above. What's in the regexp above
is [^@?]

That's the same character class.

Paul said:
which is looking for any pattern that doesn't match the @?
variable. @ needs to be escaped in regexps, because they undergo
double-quotish interpolation.

That's not true when defining a character class, is it?

Paul Lalli · Jul 30, 2004

Gunnar said:
That's the same character class.

It would seem not.

Gunnar said:
That's not true when defining a character class, is it?

It would seem it is.

#!/usr/bin/perl
@f = qw/a-z/;
print "letters\n" if 'abc' =~ /[@f]/;
print "numbers\n" if '123' =~ /[@f]/;

__END__
letters

Paul Lalli

Gunnar Hjalmarsson · Jul 30, 2004

Paul said:
It would seem it is.

Hmm... It would seem I stand corrected.

Nevertheless, before posting I did something like this:

print "No match\n" unless 'abc@def' =~ /^[^@?]+$/;
print "Match\n" if 'abcdef' =~ /^[^@?]+$/;

Outputs:
No match
Match

So the case seems not to be *that* obvious...

Charles DeRykus · Jul 30, 2004

Gunnar said:
So the case seems not to be *that* obvious...

Looks like you're right...

perl -MO=Deparse -wle '/[@?]/'
/[\@?]/;

perl -MO=Deparse -wle '/[ab@]/'
/[ab\@]/;

perl -MO=Deparse -wle '/[@m]/'
Possible unintended interpolation of @m in string at -e line 1.
Name "main::m" used only once: possible typo at -e line 1.
/[@m]/;

Dan · Aug 4, 2004

Thanks for the help. I still wasn't able to get that code working,
however.
I am interested in using one of the HTML parsers on CPAN, but they all
seem somewhat confusing to me. I can't seem to figure out how they
operate and how I might use them to extract and manipulate links from
some HTML stored in a string. If anyone knows any tutorials or
dumbed-down examples around the web, I'd very much appreciate a link!

Dan
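
As a starting point, here is a small sketch using HTML::Parser from CPAN (its version 3 handler API). The function name relocate_links and the rule for deciding which links count as "local" are assumptions for illustration, not something taken from Dan's script. Unquoted or unusual attribute syntax may need more care.

```perl
use strict;
use warnings;
use HTML::Parser ();   # CPAN module, not part of core Perl

# Hypothetical helper: prefix bare-filename hrefs with "_miscfiles/".
sub relocate_links {
    my ($html) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tags: tag name, hash of decoded attributes, original tag text.
        start_h => [
            sub {
                my ($tag, $attr, $text) = @_;
                my $href = $attr->{href};
                # Only touch <a href="name.ext"> with no scheme, path, or query.
                if ($tag eq 'a' && defined $href
                    && $href =~ m{\A[^:/\@?#]+\.[^:/\@?#]+\z}) {
                    # Edit this tag's own text only, so nothing can leak
                    # into a neighbouring tag.
                    $text =~ s/(\shref\s*=\s*["']?)/${1}_miscfiles\//i;
                }
                $out .= $text;
            },
            'tagname, attr, text',
        ],
        # Everything else (text, end tags, comments) is copied through as is.
        default_h => [ sub { $out .= shift }, 'text' ],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

print relocate_links(
    '<a href="whatever.xxx">w</a> <a href="mailto:someone@example.com">m</a>'
), "\n";
```

The parser decides where each tag starts and ends, so the decision is made on one href value at a time and a mailto link is never mixed up with its neighbours.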

Brian McCauley · Aug 4, 2004

Gunnar Hjalmarsson said:
I seem to be right about /[^@?]/, but I apparently jumped to conclusions.

Charles said:
perl -MO=Deparse -wle '/[@?]/'
/[\@?]/;
perl -MO=Deparse -wle '/[ab@]/'
/[ab\@]/;
perl -MO=Deparse -wle '/[@m]/'
Possible unintended interpolation of @m in string at -e line 1.
Name "main::m" used only once: possible typo at -e line 1.
/[@m]/;

Gunnar Hjalmarsson said:
Those warnings are displayed if strictures are not enabled and you
haven't declared the @m variable.

So, I'm a little confused. The lesson here is that @ gets interpolated
in regexes sometimes. Maybe a good enough reason to always escape that
character, but a less ambiguous conclusion would be nice.

I always escape @ that I don't want to interpolate in an interpolative
context.

I cannot find a full explanation of exactly when an unescaped @ in an
interpolative context will be treated as literal, even in the "Gory
details of parsing quoted constructs".

Interpolating arrays into regex doesn't make a lot of sense. The only
time I see it used is when using the @{[...]} construct.
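
The experiments in this thread suggest a working rule: @ is interpolated when what follows could start a variable name (as in @f or @{...}) and is taken literally otherwise (as in @? or @]). The sketch below just replays those cases under strict. The helper hit and the arrays @f and @ext are made up for illustration.

```perl
use strict;
use warnings;

my @f = ('a-z');

# Hypothetical helper: 1 if $str matches $re, else 0.
sub hit { my ($str, $re) = @_; return $str =~ $re ? 1 : 0 }

# "@f" names a declared array, so it is interpolated: the class is [a-z].
print hit('abc', qr/[@f]/), hit('123', qr/[@f]/), "\n";                  # 10

# "@?" cannot be a variable name, so this "@" is literal (Gunnar's test).
print hit('abc@def', qr/^[^@?]+$/), hit('abcdef', qr/^[^@?]+$/), "\n";   # 01

# Escaping removes the ambiguity: this class is a literal "@" or "f".
print hit('abc', qr/[\@f]/), hit('a@f', qr/[\@f]/), "\n";                # 01

# The @{[ ... ]} construct interpolates the result of an expression.
my @ext = qw(html htm);
print hit('index.htm', qr/\.(?:@{[ join '|', @ext ]})$/), "\n";          # 1
```

Escaping every literal @ costs nothing and makes the pattern read the same way whether or not an array of that name happens to exist.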

