Why is this regex not working?

Looking

$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
print "$s\n";

The second regex works. I wonder why the first regex is not working?
I am trying to get whatever is between the first pair of "" or '' after
content=. It is for parsing the head section of HTML pages.

The first regex gave me this:
"this is what i want " asd " sdf " adfa

But I need this:
this is what i want
 
Mark Clements

You may want to check out HTTP::Headers rather than doing this yourself.

With this regex

(this won't work for readers using proportional fonts)

$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
                              ^

The problem is that in order to do a non-greedy match, the question mark
must be immediately adjacent to the *, i.e. you need to remove the
parentheses or put the ? inside them. Also, you don't need the |
(pipe symbol) inside [] character classes.

regards,

Mark
 
Anno Siegel

Looking said:
$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
                       ^         ^
Do you actually want to allow | besides " and ' for quotes? I think
you have conflated character class notation and alternation.
#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
print "$s\n";

The second regex works. I wonder why the first regex is not working?
I am trying to get whatever is between the first pair of "" or '' after
content=. It is for parsing the head section of HTML pages.

Better use a real HTML parser.
The first regex gave me this:
"this is what i want " asd " sdf " adfa

But I need this:
this is what i want

Simple. /.*/ is greedy, it matches the longest string it can while
still having the rest of the pattern match. So it picks up everything
until the last " or ' in the line. The question mark in /(.*)?/
serves no purpose. You probably meant to put it inside the parentheses:
/(.*?)/. In that position the match will be non-greedy.
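
A minimal sketch of the difference, using the sample string from the question. The patterns here are simplified match-only versions of the original substitution (an assumption made for illustration), and the variable names are made up:

```perl
use strict;
use warnings;

my $s = q{sadf content= "this is what i want " asd " sdf " adfa " sdf};

# Greedy: (.*) runs to the end of the string, then backs off only as far
# as the last quote, so the capture spans several quoted sections.
my ($greedy) = $s =~ /content=.*?["'](.*)["']/s;

# Non-greedy: (.*?) stops at the first closing quote it can reach.
my ($lazy) = $s =~ /content=.*?["'](.*?)["']/s;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The brackets in the output make the trailing space in each capture visible.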

Anno
 
John W. Krahn

Looking said:
$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
print "$s\n";

The second regex works. I wonder why the first regex is not working?

That is because *, + and ? are greedy and will match as many characters as
possible so (.*) will match everything to the end until the last ", | or '
character. (Why are you trying to match the | character?) You probably want
something like:

$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;


John
 
Looking

John W. Krahn said:
That is because *, + and ? are greedy and will match as many characters as
possible so (.*) will match everything to the end until the last ", | or '
character. (Why are you trying to match the | character?) You probably want
something like:

$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;

May I ask what \1 is? I tried searching for \1 on Google, but the string
is too short to search for.
I need to get whatever is between the first 2 pairs of "" or '' after
content=
 
Looking

Mark Clements said:
You may want to check out HTTP::Headers rather than doing this yourself.

If you mean HTML::HeadParser, I tried it and it is not working.

This is the sample it gave:
$h = HTTP::Headers->new;
$p = HTML::HeadParser->new($h);
$p->parse(<<EOT);
<title>Stupid example</title>
<base href="http://www.linpro.no/lwp/">
Normal text starts here.
EOT
undef $p;
print $h->title; # should print "Stupid example"

I tried $h->description, but it does not return anything. I am trying to
get the keywords, description, etc., but got nothing.
If you know where the bug is, let me know.
 
Looking

John W. Krahn said:
That is because *, + and ? are greedy and will match as many characters as
possible so (.*) will match everything to the end until the last ", | or '
character. (Why are you trying to match the | character?) You probably want
something like:

$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;

By the way, I assume \1 is the same as $1, but for use on the left side.
Your code is not working: it does not match anything. I think your idea
is right, though.

$s=qq( "sadf content= "this is what i' want " asd " sdf " adfa " sdf' );
#$s =~ s/.*content=.*?["'](.*?)["'].*/$1/si;
$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;
print "$s\n";

I hoped it would return
this is what i' want
but yours returns
"sadf content= "this is what i' want " asd " sdf " adfa " sdf'
so there is no match.
 
Mark Clements

Looking said:
$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;


May I ask what \1 is? I am trying to do a search of \1 on google but this
string is too short.
I need to get whatever is between the first 2 pairs of "" or '' after
content=
You need to read up on regexps. Check out

man perlre

For the record, \1 is a backreference, i.e. it refers to a previously
matched and captured part of the regexp.

so

(["'])([^\1]*)[\1]

matches " or ', followed by any character other than these zero or more
times, followed by whichever of " and ' was matched the first time.

\1, \2 etc are typically used within the regexp itself, and $1, $2 etc
outside it (or in the second part of a s/// operation).
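
A small sketch of that division of labour, using a doubled-word example that is not from this thread (the sample text and variable names are made up for illustration): \1 appears inside the pattern, $1 in the replacement part.

```perl
use strict;
use warnings;

my $text = 'this is is a test';

# \1 inside the pattern: match a word followed by the same word again.
# $1 in the replacement: put back a single copy of the captured word.
(my $fixed = $text) =~ s/\b(\w+) \1\b/$1/g;

print "$fixed\n";
```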

Mark
 
Mark Clements

Looking said:
If you mean HTML::HeadParser
I tried it and it is not working!.
Er - I misread your requirement as parsing HTTP headers rather than the
<HEAD> section of an HTML document. Sorry for leading you down the wrong
path. Try this


use strict;
use warnings;

use HTML::HeadParser;

my $p = HTML::HeadParser->new();
$p->parse(<<EOT);
<title>Stupid example</title>
<base href="http://www.linpro.no/lwp/">
Normal text starts here.
EOT
print $p->header("title");
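
For the keywords and description asked about earlier, a sketch along the same lines may help. It assumes HTML::HeadParser (and HTTP::Headers, which it uses) is installed from CPAN, and relies on the documented behaviour that a <meta name="..."> element is stored as a header prefixed with X-Meta-. The sample HTML is made up for illustration:

```perl
use strict;
use warnings;

use HTML::HeadParser;

# Hypothetical sample document, not taken from a real page.
my $html = <<'EOT';
<html><head>
<title>Stupid example</title>
<meta name="description" content="this is what i want">
<meta name="keywords" content="perl, regex">
</head>
<body>Normal text starts here.</body></html>
EOT

my $p = HTML::HeadParser->new;
$p->parse($html);

# <meta name="description"> ends up under the X-Meta-Description header,
# <meta name="keywords"> under X-Meta-Keywords.
my $title       = $p->header('Title');
my $description = $p->header('X-Meta-Description');
my $keywords    = $p->header('X-Meta-Keywords');

print "$title\n$description\n$keywords\n";
```

There is no description method on the header object, which would explain why $h->description returned nothing.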
 
Jeff 'japhy' Pinyan

For the record, \1 is a backreference ie it refers to a previously
matched and captured part of the regexp.

so

(["'])([^\1]*)[\1]

matches " or ', followed by any character other than these zero or more
times, followed by whichever of " and ' was matched the first time.

No it doesn't. Character classes are created when the regex is compiled,
but \1 is not known until the regex is EXECUTED. Using \1 inside a
character class is the same as using \x01 or \001: it's the ASCII
character whose ordinal value is 1.
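
A quick way to see this for yourself is the sketch below. The test strings and variable names are made up for illustration:

```perl
use strict;
use warnings;

# Inside [...], \1 is an octal escape for the character with ordinal 1,
# not a reference to capture group 1.
my $in_class = "\x01" =~ /^[\1]$/ ? 1 : 0;

# Outside a character class, \1 refers back to what group 1 captured.
my $backref = q{"quoted"} =~ /^(["'])[^"']*\1$/ ? 1 : 0;

# A different closing quote fails, because \1 must equal the opening quote.
my $mismatch = q{"quoted'} =~ /^(["'])[^"']*\1$/ ? 1 : 0;

print "class matches chr(1): $in_class\n";
print "backreference matches same quote: $backref\n";
print "backreference matches different quote: $mismatch\n";
```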

--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
Senior Dean, Fall 2004 % have long ago been overpaid?
RPI Corporation Secretary %
http://japhy.perlmonk.org/ % -- Meister Eckhart
 
Mark Clements

Jeff said:
No it doesn't. Character classes are created when the regex is compiled,
but \1 is not known until the regex is EXECUTED. Using \1 inside a
character class is the same as using \x01 or \001: it's the ASCII
character whose ordinal value is 1.

Thanks - serves me right for not having the good sense to run the code
in order to check, and I didn't know that about character classes.

Mark
 
John W. Krahn

Jeff said:
No it doesn't. Character classes are created when the regex is compiled,
but \1 is not known until the regex is EXECUTED. Using \1 inside a
character class is the same as using \x01 or \001: it's the ASCII
character whose ordinal value is 1.

Oops, don't you hate it when that happens. ;-)
So how come you can put a variable in a character class and have it work at
run-time?


John
 
Jeff 'japhy' Pinyan

John W. Krahn said:
So how come you can put a variable in a character class and have it work at
run-time?

Because when a variable is in a regex, the regex can't be compiled until
run-time[1]. That "law" just doesn't hold for backreferences.

[1] thus the existence of the /o switch which quells more than one
compilation of a regex with variables in it
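
As a sketch of what that allows: the opening quote can be found in one step and then interpolated into a second pattern, where it does work inside a character class. The two-step approach, the sample string and the variable names are assumptions made for illustration, not something proposed in this thread:

```perl
use strict;
use warnings;

my $s = q{content= 'single "inner" quote' rest};

# Step 1: find out which quote character opens the value.
my ($quote) = $s =~ /content=\s*(["'])/;

# Step 2: $quote is interpolated before the pattern is compiled, so
# [^$quote] becomes [^'] here. [^\1] never sees the capture like this.
my ($value) = $s =~ /content=\s*$quote([^$quote]*)$quote/;

print "$value\n";
```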

 
John W. Krahn

Looking said:
By the way, I assume \1 is same as $1 but on the left side. Your code is not
working. It does not match anything. Although, I think your idea is right

$s=qq( "sadf content= "this is what i' want " asd " sdf " adfa " sdf' );
#$s =~ s/.*content=.*?["'](.*?)["'].*/$1/si;
$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;
print "$s\n";

I hope it can return
this is what i' want
but yours return
"sadf content= "this is what i' want " asd " sdf " adfa " sdf'
so, no match.

Yes, as "Japhy" has pointed out, \1 won't work inside of a character class.
This should work a lot better. :)

$s =~ s/.*content=.*?(["'])(.*?)\1.*/$2/si;
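
A short check of that substitution against both sample strings from this thread. The loop, the copy variable and the @results array are scaffolding added for illustration:

```perl
use strict;
use warnings;

my @samples = (
    q{sadf content= "this is what i want " asd " sdf " adfa " sdf},
    q{ "sadf content= "this is what i' want " asd " sdf " adfa " sdf' },
);

my @results;
for my $s (@samples) {
    # (["']) captures the opening quote, (.*?) lazily takes the text,
    # and \1 outside any character class demands the same closing quote.
    (my $copy = $s) =~ s/.*content=.*?(["'])(.*?)\1.*/$2/si;
    push @results, $copy;
    print "[$copy]\n";
}
```

Each result keeps the trailing space that sits before the closing quote in the input, and in the second sample the apostrophe in i' is left alone because the value was opened with a double quote.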



John
 