Regular Expressions & pattern matching (mis)understanding

Robert Stelmack · Feb 16, 2004

I have embedded some marked text in a large file to indicate chapters and
page numbers. I want to read that file and strip out the page number to be
displayed with the line of text or following text so as to have a page
number reference for the displayed text. I have read and reread the perlre
and looked for examples on the Internet, but I must be missing a basic
concept. I also have O'Reilly's Programming Perl book, but the examples are
sometimes hard to apply to what I am trying to do. Here is what I tried to
get to work (with various syntax changes):

#!/usr/bin/perl
$_= "He had come to pass his experience along to me - if <page>10</page>I
cared to have it.";
$PAGE =~ /[<page>][0-9][<\/page>]/;
s,[<page>][0-9][</page>],,g;
printf "(p.$PAGE) contains [$_]:\n";

The output I expected was:

(p.10) contains [He had come to pass his experience along to me - if I cared
to have it.]

...but instead it displayed:

bash-2.05b$ test.cgi
(p.) contains [He had come to pass his experience along to me - if
<page>10</page>I cared to have it.]:

I really want to get my head around pattern matching and binding since all
my working code looks too much like my old FORTRAN programs.

Cheers,

Bob

Nils Petter Vaskinn · Feb 16, 2004

The output I expected was:

(p.10) contains [He had come to pass his experience along to me - if I cared
to have it.]

...but instead it displayed:

bash-2.05b$ test.cgi
(p.) contains [He had come to pass his experience along to me - if
<page>10</page>I cared to have it.]:

Completely untested and probably wrong, but it might give you a kick in
the right direction:

/^(.*)<page>([0-9]+)</page>(.*)/;
$page = $2;
$rest = $1 . $3;
print "(p.$page) contains $rest\n";

I really want to get my head around pattern matching and binding since all
my working code looks too much like my old FORTRAN programs.

Look for capturing in the regexp chapter of your favorite Perl book.

Gunnar Hjalmarsson · Feb 16, 2004

Robert said:
#!/usr/bin/perl
$_= "He had come to pass his experience along to me -
if <page>10</page>I cared to have it.";
$PAGE =~ /[<page>][0-9][<\/page>]/;

Several mistakes in that line.

- You need parentheses around $PAGE to enforce list context.
- You need the '=' operator to assign the captured string.
- Brackets are for character classes, and shall not be used around
<page> etc.
- You need [0-9]+ so that it matches one or more digits.
- You need to surround the latter with parentheses to capture it.

This is what I suppose you mean:

($PAGE) = /<page>([0-9]+)<\/page>/;

Robert said:
s,[<page>][0-9][</page>],,g;

This is what I suppose you mean:

s,<page>[0-9]+</page>,,;

(not sure why you are using the /g modifier)

printf "(p.$PAGE) contains [$_]:\n";

No need to use printf(). print() is sufficient.

print "(p.$PAGE) contains [$_]:\n";

But the $PAGE variable is redundant. Instead you can do:

s,<page>([0-9]+)</page>,,;
print "(p.$1) contains [$_]:\n";
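The difference between scalar and list context that the list above relies on can be seen in a short sketch. This is only an illustration: the lexical names `$text`, `$matched`, and `$page` are not from the original post, and the string is a shortened copy of the sample sentence.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened copy of the sample sentence (assumed input for this sketch).
my $text = "along to me - if <page>10</page>I cared to have it.";

# Scalar context: the match only reports success or failure.
my $matched = $text =~ /<page>([0-9]+)<\/page>/;

# List context (note the parentheses): the match returns the captured text.
my ($page) = $text =~ /<page>([0-9]+)<\/page>/;

print "scalar context gives: $matched\n";   # 1
print "list context gives:   $page\n";      # 10
```

If the match fails in list context, `$page` is left undefined, which makes it easy to test before printing.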

HTH

Tad McClellan · Feb 16, 2004

Robert Stelmack said:
I have read and reread the perlre

Have you read the regex _tutorial_ too?

perldoc perlretut

but I must be missing a basic
concept.

You are missing multiple concepts simultaneously. See below.

I also have O'Reilly's Programming Perl book,

The best book for extracting the text-processing power from regexes is:

"Mastering Regular Expressions" (2nd edition) O'Reilly

#!/usr/bin/perl

You should ask for all the help you can get:

use strict;
use warnings;

Have you seen the Posting Guidelines that are posted here frequently?

$_= "He had come to pass his experience along to me - if <page>10</page>I
cared to have it.";

It is absolutely essential that we have exactly the same string as
you if we are to help you with matching that string.

Consider wrapping long strings yourself (in valid Perl) so your
newsreader won't break stuff for you.

$PAGE =~ /[<page>][0-9][<\/page>]/;

1) you are trying to match the pattern against the string contained
in $PAGE, but the string is really in $_.
(perl would have warned you about that if you had asked it to...)

2) a "character class" matches a _single_ character.
[<page>] is exactly equivalent to [aegp<>] since the
listed characters are the same.

3) your pattern will match only single-digit numbers, you need to
allow multiple digit characters between the "tags".

4) you need "capturing parentheses" around the page number digits
if you want access to them later.

5) you don't need the m// at all if you are going to s/// with
the same pattern. s/// does nothing if the match fails.

6) the \d shortcut char class matches the same chars as [0-9].
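A small sketch may make points 2 to 4 concrete. It compares the pattern as posted with a corrected one on three test strings. The helper `hits` and the test strings are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $original = qr/[<page>][0-9][<\/page>]/;   # three single-character classes
my $intended = qr/<page>([0-9]+)<\/page>/;    # literal tags around captured digits

# Hypothetical helper: report whether a pattern matches a string.
sub hits {
    my ($re, $str) = @_;
    return $str =~ $re ? "yes" : "no";
}

for my $str ("a1p", "<page>7</page>", "<page>10</page>") {
    printf "%-16s original: %-3s  intended: %s\n",
        $str, hits($original, $str), hits($intended, $str);
}
```

The original pattern matches "a1p" (one listed character, one digit, one listed character) and happens to match the single-digit tag through ">7<", but it cannot match the two-digit tag, which is consistent with the unchanged output in the question.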

s,[<page>][0-9][</page>],,g;

After this statement all of the "tags" will be gone, and it will
be "too late" to apply further processing to them (such as print()).

printf "(p.$PAGE) contains [$_]:\n";

You should use print() unless you make use of the formatting
that printf() provides.

The output I expected was:

(p.10) contains [He had come to pass his experience along to me - if I cared
to have it.]

----------------------------------
#!/usr/bin/perl
use strict;
use warnings;

$_= "He had come to pass his experience along to me - if "
. "<page>10</page>I cared to have it.";

while ( s,<page>(\d+)</page>,,g ) {
print "(p.$1) contains [$_]:\n";
}
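One caveat about the loop above, offered as a hedged note: with the /g modifier, s/// removes every tag in a single call, so the loop body runs once and $1 reflects only the last tag matched. If a string may carry several page tags, a variant without /g handles them one at a time. The two-tag string and the names `$text` and `@pages` below are assumptions made for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed input with two page markers.
my $text = "<page>10</page>First passage. <page>11</page>Second passage.";
my @pages;

# Without /g, each pass removes only the first remaining tag,
# so $1 holds that tag's number inside the loop body.
while ( $text =~ s,<page>(\d+)</page>,, ) {
    push @pages, $1;
}

print "pages seen: @pages\n";    # pages seen: 10 11
print "text left:  [$text]\n";   # text left:  [First passage. Second passage.]
```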

Tad McClellan · Feb 16, 2004

Nils Petter Vaskinn said:
and probably wrong

/^(.*)<page>([0-9]+)</page>(.*)/;
$page = $2;
$rest = $1 . $3;

Using the dollar-digit variables without first ensuring that
the match succeeded is indeed wrong.
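A minimal sketch of why this matters: after a failed match, $1 keeps whatever an earlier successful match in the same scope left in it. The two strings and the variable names are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $with_tag    = "if <page>10</page>I cared to have it.";
my $without_tag = "no page tag on this line";

$with_tag    =~ m{<page>(\d+)</page>};   # succeeds and sets $1
$without_tag =~ m{<page>(\d+)</page>};   # fails and leaves $1 untouched
my $stale = $1;                          # still "10", from the first string

# Checking the result of the match avoids picking up the stale value.
my $page = $without_tag =~ m{<page>(\d+)</page>} ? $1 : undef;

print "unchecked: $stale\n";
print "checked:   ", ( defined $page ? $page : "(no match)" ), "\n";
```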

Uri Guttman · Feb 16, 2004

"GH" == Gunnar Hjalmarsson said:
#!/usr/bin/perl
$_= "He had come to pass his experience along to me -
if <page>10</page>I cared to have it.";
$PAGE =~ /[<page>][0-9][<\/page>]/;

GH> Several mistakes in that line.

GH> - You need parentheses around $PAGE to enforce list context.

why is list context needed?

GH> This is what I suppose you mean:

GH> ($PAGE) = /<page>([0-9]+)<\/page>/;

and why do you have list context there? you never use the grabbed
results in that line.

uri

Gunnar Hjalmarsson · Feb 16, 2004

Uri said:
"GH" == Gunnar Hjalmarsson writes:

GH> Robert Stelmack said:

#!/usr/bin/perl
$_= "He had come to pass his experience along to me -
if <page>10</page>I cared to have it.";
$PAGE =~ /[<page>][0-9][<\/page>]/;

GH> Several mistakes in that line.

GH> - You need parentheses around $PAGE to enforce list context.

why is list context needed?

GH> This is what I suppose you mean:

GH> ($PAGE) = /<page>([0-9]+)<\/page>/;

and why do you have list context there? you never use the grabbed
results in that line.

No, but two lines further down, OP's code presupposes that $PAGE
contains the page number. After having suggested a minimum of changes
to OP's code, I also mentioned that the whole line is redundant, and
that the page number well can be captured in the s/// operator.

What's your message, Uri?

Uri Guttman · Feb 16, 2004

GH> Uri Guttman said:
$PAGE =~ /[<page>][0-9][<\/page>]/;

GH> - You need parentheses around $PAGE to enforce list context.

well, without any grabs nor assignment, list context is meaningless
there. the line has =~.

GH> This is what I suppose you mean:

GH> What's your message, Uri?

i didn't see the change from =~ to = in that line. so it was more than
just your previous comment about list context being needed.

uri

Nils Petter Vaskinn · Feb 17, 2004

Tad McClellan said:
Nils Petter Vaskinn said:

and probably wrong

/^(.*)<page>([0-9]+)</page>(.*)/;
$page = $2;
$rest = $1 . $3;

Using the dollar-digit variables without first ensuring that
the match succeeded is indeed wrong.

And i should probably have escaped that '/'

if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {
$page = $2;
$rest = $1 . $3;
}

Ben Morrow · Feb 17, 2004

Nils Petter Vaskinn said:
And i should probably have escaped that '/'

if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {

Bleech!

if ( m|^(.*)<page>(\d+)</page>(.*)| ) {

Ben

Brian McCauley · Feb 18, 2004

Ben Morrow said:
Nils Petter Vaskinn said:

if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {

Bleech!

if ( m|^(.*)<page>(\d+)</page>(.*)| ) {

I think the ability to use alternate quoting delimiters is often
overrated. I don't like to increase the amount of contextual
information needed to understand what I'm looking at. In the case of
a long regex I don't want to have to remember that it's using some
delimiter other than /. Compared to the tiny effort of the extra
keystroke to escape each / I don't think the small loss of readability
is justified.

Tassilo v. Parseval · Feb 18, 2004

Also sprach Brian McCauley:

Ben Morrow said:

Nils Petter Vaskinn said:

if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {

Bleech!

if ( m|^(.*)<page>(\d+)</page>(.*)| ) {

I think the ability to use alternate quoting delimiters is often
overrated. I don't like to increase the amount of contextual
information needed to understand what I'm looking at. In the case of
a long regex I don't want to have to remember that it's using some
delimiter other than /. Compared to the tiny effort of the extra
keystroke to escape each / I don't think the small loss of readability
is justified.

And particularly, using a delimiter that has special meaning in regexps
(such as '|') is always a bad idea. Under these circumstances I prefer
to use '!' or '#' or in fact any character that visually stands out and
is not meta in its semantics.
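As a rough illustration of this point, the sketch below uses '!' as the delimiter so that '|' stays available for alternation inside the pattern. The `<chapter>` tag is hypothetical (the original question mentions chapter markers but never shows their syntax), and `$line` and `@numbers` are names invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed input; the <chapter> markup is hypothetical.
my $line = "<chapter>3</chapter>It was late. <page>10</page>I cared to have it.";

# '!' is not a regex metacharacter, so '|' keeps its usual meaning of
# alternation and '/' needs no backslash.
my @numbers = $line =~ m!<(?:page|chapter)>(\d+)</(?:page|chapter)>!g;

print "numbers found: @numbers\n";   # numbers found: 3 10
```

With '|' as the delimiter, the same alternation could not be written without escaping, which is the readability cost being described.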

Tassilo

Uri Guttman · Feb 18, 2004

BM> Ben Morrow said:
Nils Petter Vaskinn said:

if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {

Bleech!

if ( m|^(.*)<page>(\d+)</page>(.*)| ) {

BM> I think the ability to use alternate quoting delimiters is often
BM> overrated. I don't like to increase the amount of contextual
BM> information needed to understand what I'm looking at. In the case of
BM> a long regex I don't want to have to remember that it's using some
BM> delimiter other than /. Compared to the tiny effort of the extra
BM> keystroke to escape each / I don't think the small loss of readability
BM> is justified.

i have to disagree. i find \ annoying to see when it is not needed. just
choose a delimiter that works with this regex. i like paired delims like
{} or []. and if the regex gets too long or complex, /x is called
for. and then paired delims work very well:

s{
blah
}
{
replace
}sexi ;
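Applied to the page-tag problem from this thread, the paired-delimiter style with /x might look like the sketch below. It is an illustration only: the shortened sample string and the name `$text` are assumptions, and only the /x modifier is used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened copy of the sample sentence (assumed input for this sketch).
my $text = "along to me - if <page>10</page>I cared to have it.";

# Under /x, whitespace and comments inside the pattern are ignored.
if ( $text =~ s{
        <page>      # opening tag
        ( \d+ )     # page number, first capture group
        </page>     # closing tag, written without a backslash
    }{}x )
{
    print "(p.$1) contains [$text]\n";
}
```

Keeping braces and variables out of the in-pattern comments is deliberate, since the delimiter scan and interpolation happen before the comments are discarded.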

uri
