Regular Expression Question

Börni

Hi

This is probably very easy, but I don't get it.

Example:
#!perl -w
use strict;

my $string = '<meta name="Keywords" content="" lang="fr">';

my ($keywords) = $string =~ /.*?meta name="Keywords".*?content="(.*?)">/;

print "[$keywords]\n";
exit 0;


In the example above I'd expect $keywords to be empty. Instead it is
[" lang="fr].

What is the correct expression to match everything
<meta name="Keywords" content="-->IN HERE<--" lang="fr">
even when it's empty?

Regards Bernard
 
Tim Greer

Börni said:
Hi

This is probably very easy, but I don't get it.

Example:
#!perl -w
use strict;

my $string = '<meta name="Keywords" content="" lang="fr">';

my ($keywords) = $string =~ /.*?meta name="Keywords".*?content="(.*?)">/;

print "[$keywords]\n";
exit 0;


In the Example above I'd expect $keywords to be empty. Instead it is
[" lang="fr].

What is the correct expression to match everything
<meta name="Keywords" content="-->IN HERE<--" lang="fr">
even when it's empty?

Regards Bernard

Your code above is doing exactly what the regex tells it to do. Using
your current example, make the following change:

my ($keywords) = $string =~ /^.*?meta name="Keywords".*?content="([^"]*)"/;

[^"]* matches zero or more characters that are not a double quote, so
the capture starts right after the opening double quote of content="
and stops at the closing double quote. Whatever lies between them,
including nothing at all, ends up in $keywords. You could probably
just write that as: my ($keywords) = $string =~ /^.*?content="([^"]*)"/;
if that's what you want to stick with (note that it no longer checks
the name attribute). Notice I've added ^ to anchor the match at the
start of the string in my examples. If the tag is not going to be at
the start of the string in real code, just adjust accordingly.
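
For what it's worth, here is a minimal sketch of the [^"]* approach described above, run against both an empty and a non-empty content attribute. The helper name keywords_of and the second sample tag are assumptions made for illustration, not something from the original post, and the ^ anchor is left out so the tag may appear anywhere in the string.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the content="..."
# value of a Keywords meta tag, or undef when no such tag is found.
sub keywords_of {
    my ($string) = @_;

    # [^"]* can only consume non-quote characters, so the capture has
    # to end at the first double quote after content=".
    my ($keywords) = $string =~ /meta name="Keywords".*?content="([^"]*)"/;
    return $keywords;
}

for my $tag (
    '<meta name="Keywords" content="" lang="fr">',
    '<meta name="Keywords" content="perl, regex" lang="fr">',
) {
    print "[", keywords_of($tag), "]\n";
}
```

The first tag should print [] and the second [perl, regex]. Because the character class cannot cross a double quote, the result no longer depends on what follows the attribute.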
 
Charlton Wilbur

B> Hi This is probably very easy, but I don't get it.

That's because you're using regular expressions to parse HTML.

You will save yourself considerable pain if you use a parser, such as
HTML::Parser, to parse HTML.
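
A rough sketch of what that could look like with the CPAN module HTML::Parser (version 3 API), assuming the module is installed. The variable names and the choice to compare the name attribute case-insensitively are assumptions for illustration, not something stated in this thread.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $html = '<meta name="Keywords" content="" lang="fr">';
my $keywords;

my $parser = HTML::Parser->new(
    api_version => 3,
    # Call the handler for every start tag, passing the tag name and a
    # hash reference of its attributes.
    start_h => [
        sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'meta';
            return unless lc($attr->{name} // '') eq 'keywords';
            $keywords = $attr->{content};
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;

print "[", $keywords // '', "]\n";
```

The parser takes care of attribute order and quoting, so the same handler should keep working if lang comes before content or the attributes are single-quoted.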

Charlton
 
Börni

Thank you very much for your help everybody! (Of course my problem was the
">" character)
 
Tim Greer

Börni said:
Thank you very much for your help everybody! (Of course my problem was
the ">" character)

(top posting fixed)

Actually, the problem wasn't the ">" character. The problem was that
the match went all the way to the last character, which happened to be
the > character. The actual problem was that (.*?) grabbed everything
after the opening double quote of content=" all the way to the ending
">, which happened to be " lang="fr.
 
Tim McDaniel

Tim Greer said:
(top posting fixed)

Actually, the problem wasn't the ">" character. The problem was that
the match went all the way to the last character, which happened to be
the > character. The actual problem was that (.*?) grabbed everything
after the opening double quote of content=" all the way to the ending
">, which happened to be " lang="fr.

No, he's right: the problem was that '>' was in the regexp. .*? is
non-greedy matching. If the terminal '>' had not been in the regexp,
it would have stopped at the second ".
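
A small sketch to check that claim: the original pattern next to the same pattern with only the trailing '>' removed. The variable names $with_gt and $without_gt are made up for this illustration.

```perl
use strict;
use warnings;

my $string = '<meta name="Keywords" content="" lang="fr">';

# Original pattern: the trailing "> forces (.*?) to keep growing until
# it reaches the only "> in the string.
my ($with_gt) = $string =~ /.*?meta name="Keywords".*?content="(.*?)">/;

# Same pattern without the trailing >: (.*?) can stop at the very next
# double quote.
my ($without_gt) = $string =~ /.*?meta name="Keywords".*?content="(.*?)"/;

print "[$with_gt]\n";     # prints [" lang="fr]
print "[$without_gt]\n";  # prints []
```

Non-greedy only means "as short as possible while the rest of the pattern still matches", so whatever follows the capture group decides where the capture ends.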
 
Tim Greer

Tim said:
No, he's right: the problem was that '>' was in the regexp. .*? is
non-greedy matching. If the terminal '>' had not been in the regexp,
it would have stopped at the second ".

I suppose it's just a matter of wording. I read it as the OP meaning
that the character itself was the problem, rather than its place in
the regex. I just think the preferable way would be to match with
([^"]*), but I suppose it's up to the individual.
 
