Why is this regex not working?

Looking · Sep 16, 2004

$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
print "$s\n";

The second regex works. I wonder why the first regex is not working?
I am trying to get whatever is between the first pair of "" or '' after
content=. This is for parsing the header of HTML pages.

The first regex gave me this:
"this is what i want " asd " sdf " adfa

But I need this:
this is what i want

Mark Clements · Sep 16, 2004

Looking said:
$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
print "$s\n";

The second regex works. I wonder why the first regex is not working?
I am trying to get whatever is between the first pair of "" or '' after
content=. It is parsing the header file of HTML pages.

The first regex gave me this:
"this is what i want " asd " sdf " adfa

But I need this:
this is what i want

You may want to check out HTTP::Headers rather than doing this yourself.

With this regex

(this won't work for readers using proportional fonts)

$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
                              ^

The problem is that for a non-greedy match the question mark must come
immediately after the *, i.e. you need to remove the parentheses or put
the ? inside them. As written, (.*)? only makes the group optional. Also,
you don't need the | (pipe symbol) inside [] character classes.

regards,

Mark
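
A small sketch of the character-class point, assuming nothing beyond core Perl. Inside [...] the pipe is an ordinary character, so ["|'] accepts a literal | as well as the two quote characters. The sample list and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Candidate single-character strings to test against each class.
my @samples = ('"', "'", '|', 'x');

# Class as written in the question: the | is a literal member of the class.
my @class_hits = grep { /^["|']$/ } @samples;

# Class with only the two quote characters.
my @plain_hits = grep { /^["']$/ } @samples;

print "with pipe:    @class_hits\n";
print "without pipe: @plain_hits\n";
```

The first list includes the pipe, the second does not.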

Anno Siegel · Sep 16, 2004

Looking said:
$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
                       ^         ^
Do you actually want to allow | besides " and ' for quotes? I think
you have conflated character class notation and alternation.

#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
print "$s\n";

The second regex works. I wonder why the first regex is not working?
I am trying to get whatever is between the first pair of "" or '' after
content=. It is parsing the header file of HTML pages.

Better use a real HTML parser.

The first regex gave me this:
"this is what i want " asd " sdf " adfa

But I need this:
this is what i want

Simple. /.*/ is greedy, it matches the longest string it can while
still having the rest of the pattern match. So it picks up everything
until the last " or ' in the line. The question mark in /(.*)?/
serves no purpose. You probably meant to put it inside the parentheses:
/(.*?)/. In that position the match will be non-greedy.

Anno
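
To make the greedy versus non-greedy difference concrete, here is a minimal sketch using the string from the question. The pipe has been dropped from the character classes, and the variable names are only illustrative.

```perl
use strict;
use warnings;

my $input = q{sadf content= "this is what i want " asd " sdf " adfa " sdf};

# (.*)? : the ? makes the whole group optional; the .* inside is still greedy,
# so the capture runs on to the last quote in the string.
(my $greedy = $input) =~ s/.*content=.*?["'](.*)?["'].*/$1/si;

# (.*?) : the ? directly follows the *, so the capture stops at the first
# closing quote.
(my $lazy = $input) =~ s/.*content=.*?["'](.*?)["'].*/$1/si;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

Both captures keep the space that sits before the closing quote in the input; trim it separately if it is not wanted.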

John W. Krahn · Sep 16, 2004

Looking said:
$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
print "$s\n";

The second regex works. I wonder why the first regex is not working?

That is because *, + and ? are greedy and will match as many characters as
possible so (.*) will match everything to the end until the last ", | or '
character. (Why are you trying to match the | character?) You probably want
something like:

$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;

John

Looking · Sep 16, 2004

Looking said:
$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
print "$s\n";

The second regex works. I wonder why the first regex is not working?

That is because *, + and ? are greedy and will match as many characters as
possible so (.*) will match everything to the end until the last ", | or '
character. (Why are you trying to match the | character?) You probably want
something like:

$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;

May I ask what \1 is? I tried searching for \1 on Google, but the
string is too short.
I need to get whatever is between the first 2 pairs of "" or '' after
content=

Looking · Sep 16, 2004

Looking said:
$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
print "$s\n";

The second regex works. I wonder why the first regex is not working?
I am trying to get whatever is between the first pair of "" or '' after
content=. It is parsing the header file of HTML pages.

The first regex gave me this:
"this is what i want " asd " sdf " adfa

But I need this:
this is what i want

You may want to check out HTTP::Headers rather than doing this yourself.

If you mean HTML::HeadParser, I tried it and it is not working.

This is the sample it gives:
$h = HTTP::Headers->new;
$p = HTML::HeadParser->new($h);
$p->parse(<<EOT);
<title>Stupid example</title>
<base href="http://www.linpro.no/lwp/">
Normal text starts here.
EOT
undef $p;
print $h->title; # should print "Stupid example"

I tried $h->description, but it does not return anything. I am trying to
get the keywords, description, etc., but get nothing.
If you know where the bug is, let me know.

Looking · Sep 16, 2004

$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
print "$s\n";

The second regex works. I wonder why the first regex is not working?

That is because *, + and ? are greedy and will match as many characters as
possible so (.*) will match everything to the end until the last ", | or '
character. (Why are you trying to match the | character?) You probably want
something like:

$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;

By the way, I assume \1 is the same as $1 but on the left side. Your code is
not working: it does not match anything, although I think your idea is right.

$s=qq( "sadf content= "this is what i' want " asd " sdf " adfa " sdf' );
#$s =~ s/.*content=.*?["'](.*?)["'].*/$1/si;
$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;
print "$s\n";

I expected it to return
this is what i' want
but yours returns
"sadf content= "this is what i' want " asd " sdf " adfa " sdf'
so there is no match.

Mark Clements · Sep 16, 2004

Looking said:
$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;

May I ask what \1 is? I tried searching for \1 on Google, but the
string is too short.
I need to get whatever is between the first 2 pairs of "" or '' after
content=

You need to read up on regexps. Check out

man perlre

For the record, \1 is a backreference ie it refers to a previously
matched and captured part of the regexp.

so

(["'])([^\1]*)[\1]

matches " or ', followed by any character other than these zero or more
times, followed by whichever of " and ' was matched the first time.

\1, \2 etc are typically used within the regexp itself, and $1, $2 etc
outside it (or in the second part of a s/// operation).

Mark
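
A short sketch of the \1 versus $1 distinction, with \1 used outside any character class. The doubled-word sentence is an invented example, not something from the original question.

```perl
use strict;
use warnings;

my $text = 'this is is a test test of doubled words';

# \1 inside the pattern must match the same text that group 1 just captured;
# $1 in the replacement part inserts that captured text.
(my $fixed = $text) =~ s/\b(\w+) \1\b/$1/g;

print "$fixed\n";
```

Each doubled word collapses to a single copy, because the pattern only matches where the second word repeats the first.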

Mark Clements · Sep 16, 2004

Looking said:
If you mean HTML::HeadParser, I tried it and it is not working.

Er - I misread your requirement as parsing HTTP headers rather than the
<HEAD> section of an HTML document. Sorry for leading you down the wrong
path. Try this

use strict;
use warnings;

use HTML::HeadParser;

my $p = HTML::HeadParser->new();
$p->parse(<<EOT);
<title>Stupid example</title>
<base href="http://www.linpro.no/lwp/">
Normal text starts here.
EOT
print $p->header("title");

Jeff 'japhy' Pinyan · Sep 16, 2004

For the record, \1 is a backreference ie it refers to a previously
matched and captured part of the regexp.

so

(["'])([^\1]*)[\1]

matches " or ', followed by any character other than these zero or more
times, followed by whichever of " and ' was matched the first time.

No it doesn't. Character classes are created when the regex is compiled,
but \1 is not known until the regex is EXECUTED. Using \1 inside a
character class is the same as using \x01 or \001: it's the ASCII
character whose ordinal value is 1.

--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
Senior Dean, Fall 2004 % have long ago been overpaid?
RPI Corporation Secretary %
http://japhy.perlmonk.org/ % -- Meister Eckhart
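
The point about \1 inside a character class can be checked directly. This is a minimal sketch in core Perl; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $soh = chr(1);    # the control character with ordinal value 1

# Inside a character class, \1 is an octal escape, not a backreference.
my $class_matches_soh   = ($soh =~ /^[\1]$/)    ? 1 : 0;
my $class_matches_quote = ('""' =~ /^(")[\1]$/) ? 1 : 0;

# Outside a character class, \1 refers back to what group 1 captured.
my $backref_matches     = ('""' =~ /^(")\1$/)   ? 1 : 0;

print "[\\1] matches chr(1):         $class_matches_soh\n";
print "[\\1] matches captured quote: $class_matches_quote\n";
print "\\1 matches captured quote:   $backref_matches\n";
```

The class form matches only the chr(1) character, while the bare \1 matches the quote that was captured.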

Mark Clements · Sep 16, 2004

Jeff said:
For the record, \1 is a backreference ie it refers to a previously
matched and captured part of the regexp.

so

(["'])([^\1]*)[\1]

matches " or ', followed by any character other than these zero or more
times, followed by whichever of " and ' was matched the first time.

No it doesn't. Character classes are created when the regex is compiled,
but \1 is not known until the regex is EXECUTED. Using \1 inside a
character class is the same as using \x01 or \001: it's the ASCII
character whose ordinal value is 1.

Thanks - serves me right for not having the good sense to run the code
in order to check, and I didn't know that about character classes.

Mark

John W. Krahn · Sep 17, 2004

Jeff said:
For the record, \1 is a backreference ie it refers to a previously
matched and captured part of the regexp.

so

(["'])([^\1]*)[\1]

matches " or ', followed by any character other than these zero or more
times, followed by whichever of " and ' was matched the first time.

No it doesn't. Character classes are created when the regex is compiled,
but \1 is not known until the regex is EXECUTED. Using \1 inside a
character class is the same as using \x01 or \001: it's the ASCII
character whose ordinal value is 1.

Oops, don't you hate it when that happens. ;-)
So how come you can put a variable in a character class and have it work at
run-time?

John

Jeff 'japhy' Pinyan · Sep 17, 2004

Jeff said:
Jeff said:

For the record, \1 is a backreference ie it refers to a previously
matched and captured part of the regexp.

so

(["'])([^\1]*)[\1]

matches " or ', followed by any character other than these zero or more
times, followed by whichever of " and ' was matched the first time.

No it doesn't. Character classes are created when the regex is compiled,
but \1 is not known until the regex is EXECUTED. Using \1 inside a
character class is the same as using \x01 or \001: it's the ASCII
character whose ordinal value is 1.

Oops, don't you hate it when that happens. ;-)
So how come you can put a variable in a character class and have it work at
run-time?

Because when a variable is in a regex, the regex can't be compiled until
run-time[1]. That "law" just doesn't hold for backreferences.

[1] thus the existence of the /o switch which quells more than one
compilation of a regex with variables in it
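
As a rough illustration of a variable inside a character class: the variable is interpolated into the pattern text first, and only then is the pattern compiled. The helper name between() and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Return the text between the first pair of $quote characters after content=.
# $quote is interpolated before compilation, so [^${quote}] becomes [^"] or [^'].
# (A quote character needs no escaping; other characters might need quotemeta.)
sub between {
    my ($text, $quote) = @_;
    return $text =~ /content=\s*${quote}([^${quote}]*)${quote}/ ? $1 : undef;
}

my $double = between(q{content= "a 'b' c" tail}, q{"});
my $single = between(q{content= 'a "b" c' tail}, q{'});

print "double: [$double]\n";
print "single: [$single]\n";
```

Here the negated class excludes only the quote character that was passed in, which is what [^\1] was meant to do but cannot.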


John W. Krahn · Sep 17, 2004

Looking said:
$s='sadf content= "this is what i want " asd " sdf " adfa " sdf';
$s =~ s/.*content=.*?["|'](.*)?["|'].*/$1/si;
#$s =~ s/.*content=.*?["|']([^"|']*)["|'].*/$1/si;
print "$s\n";

The second regex works. I wonder why the first regex is not working?

That is because *, + and ? are greedy and will match as many characters as
possible so (.*) will match everything to the end until the last ", | or '
character. (Why are you trying to match the | character?) You probably
want something like:

$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;

By the way, I assume \1 is the same as $1 but on the left side. Your code is
not working: it does not match anything, although I think your idea is right.

$s=qq( "sadf content= "this is what i' want " asd " sdf " adfa " sdf' );
#$s =~ s/.*content=.*?["'](.*?)["'].*/$1/si;
$s =~ s/.*content=.*?(["'])([^\1]*)[\1].*/$2/si;
print "$s\n";

I expected it to return
this is what i' want
but yours returns
"sadf content= "this is what i' want " asd " sdf " adfa " sdf'
so there is no match.

Yes, as "Japhy" has pointed out, \1 won't work inside of a character class.
This should work a lot better.

$s =~ s/.*content=.*?(["'])(.*?)\1.*/$2/si;

John
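
For anyone wanting to try the corrected substitution, here is a minimal sketch that runs it on both sample strings from this thread. The loop and variable names are added for illustration.

```perl
use strict;
use warnings;

my @inputs = (
    q{sadf content= "this is what i want " asd " sdf " adfa " sdf},
    q{ "sadf content= "this is what i' want " asd " sdf " adfa " sdf' },
);

my @results;
for my $s (@inputs) {
    # Group 1 captures the opening quote, group 2 lazily captures the value,
    # and \1 (outside any character class) requires the same quote to close it.
    (my $copy = $s) =~ s/.*content=.*?(["'])(.*?)\1.*/$2/si;
    push @results, $copy;
}

print "[$_]\n" for @results;
```

In the second string the apostrophe inside the value is kept, because only a double quote can close a value opened by a double quote. Both results keep the space that precedes the closing quote.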
