Regular Expressions & pattern matching (mis)understanding

Robert Stelmack

I have embedded some marked text in a large file to indicate chapters and
page numbers. I want to read that file and strip out the page number to be
displayed with the line of text or following text so as to have a page
number reference for the displayed text. I have read and reread the perlre
and looked for examples on the Internet, but I must be missing a basic
concept. I also have O'Reilly's Programming Perl book, but the examples are
sometimes hard to apply to what I am trying to do. Here is what I tried to
get to work (with various syntax changes):

#!/usr/bin/perl
$_= "He had come to pass his experience along to me - if <page>10</page>I
cared to have it.";
$PAGE =~ /[<page>][0-9][<\/page>]/;
s,[<page>][0-9][</page>],,g;
printf "(p.$PAGE) contains [$_]:\n";

The output I expected was:

(p.10) contains [He had come to pass his experience along to me - if I cared
to have it.]

....but instead it displayed:

bash-2.05b$ test.cgi
(p.) contains [He had come to pass his experience along to me - if
<page>10</page>I cared to have it.]:


I really want to get my head around pattern matching and binding since all
my working code looks too much like my old FORTRAN programs.

Cheers,

Bob
 
Nils Petter Vaskinn

Robert Stelmack said:
The output I expected was:

(p.10) contains [He had come to pass his experience along to me - if I cared
to have it.]

...but instead it displayed:

bash-2.05b$ test.cgi
(p.) contains [He had come to pass his experience along to me - if
<page>10</page>I cared to have it.]:

Completely untested and probably wrong, but it might give you a kick in
the right direction:

/^(.*)<page>([0-9]+)</page>(.*)/;
$page = $2;
$rest = $1 . $3;
print "(p.$page) contains $rest\n";

Robert Stelmack said:
I really want to get my head around pattern matching and binding since all
my working code looks too much like my old FORTRAN programs.

Look for capturing in the regexp chapter of your favorite perl book
 
Gunnar Hjalmarsson

Robert said:
#!/usr/bin/perl
$_= "He had come to pass his experience along to me -
if <page>10</page>I cared to have it.";
$PAGE =~ /[<page>][0-9][<\/page>]/;

Several mistakes in that line.

- You need parentheses around $PAGE to enforce list context.
- You need the '=' operator to assign the captured string.
- Brackets are for character classes, and shall not be used around
<page> etc.
- You need [0-9]+ so that it matches one or more digits.
- You need to surround the latter with parentheses to capture it.

This is what I suppose you mean:

($PAGE) = /<page>([0-9]+)<\/page>/;

Robert said:
s,[<page>][0-9][</page>],,g;

This is what I suppose you mean:

s,<page>[0-9]+</page>,,;

(not sure why you are using the /g modifier)

Robert said:
printf "(p.$PAGE) contains [$_]:\n";

No need to use printf(). print() is sufficient.

print "(p.$PAGE) contains [$_]:\n";

But the $PAGE variable is redundant. Instead you can do:

s,<page>([0-9]+)</page>,,;
print "(p.$1) contains [$_]:\n";
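A minimal sketch of the same idea wrapped in a reusable form. The helper name `extract_page` is hypothetical and not from the thread; it only uses `$1` after confirming that the substitution actually happened.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): strips the first
# <page>N</page> marker from a line and returns the page number
# together with the cleaned text.
sub extract_page {
    my ($line) = @_;
    # s/// returns true only if a substitution happened, so $1 is
    # only read inside this branch.
    if ( $line =~ s,<page>([0-9]+)</page>,, ) {
        my $page = $1;
        return ( $page, $line );
    }
    return ( undef, $line );    # no marker: text comes back unchanged
}

my ( $page, $text ) = extract_page(
    "He had come to pass his experience along to me - if "
  . "<page>10</page>I cared to have it." );

print "(p.$page) contains [$text]\n" if defined $page;
```

Returning `undef` for the page number lets the caller tell "no marker on this line" apart from a real page number.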

HTH
 
Tad McClellan

Robert Stelmack said:
I have read and reread the perlre


Have you read the regex _tutorial_ too?

perldoc perlretut

but I must be missing a basic
concept.


You are missing multiple concepts simultaneously. See below.

I also have O'Reilly's Programming Perl book,


The best book for extracting the text-processing power from regexes is:

"Mastering Regular Expressions" (2nd edition) O'Reilly

#!/usr/bin/perl


You should ask for all the help you can get:

use strict;
use warnings;
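
As a rough illustration of what these two lines buy you, the sketch below repeats the binding mistake from the question with warnings enabled. The `$SIG{__WARN__}` handler is only there to collect the warning text for display; the pattern is the corrected one, so that the only problem left is the wrong variable on the left of `=~`.

```perl
use strict;
use warnings;

my @warnings;
# Collect warnings instead of letting them go to STDERR.
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

$_ = "if <page>10</page>I cared to have it.";

my $PAGE;    # declared (strict insists on that), but never assigned

# The match is bound to $PAGE, not to $_, so it tests an undefined value.
my $matched = $PAGE =~ /<page>[0-9]+<\/page>/;

print "matched: ", ( $matched ? "yes" : "no" ), "\n";
for my $w (@warnings) {
    print "warning: $w";
}
```

Under `use strict` the undeclared `$PAGE` in the original script would not even compile, which is an earlier and louder hint than the empty `(p.)` in the output.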

Have you seen the Posting Guidelines that are posted here frequently?

$_= "He had come to pass his experience along to me - if <page>10</page>I
cared to have it.";


It is absolutely essential that we have exactly the same string as
you if we are to help you with matching that string.

Consider wrapping long strings yourself (in valid Perl) so your
newsreader won't break stuff for you.

$PAGE =~ /[<page>][0-9][<\/page>]/;


1) you are trying to match the pattern against the string contained
in $PAGE, but the string is really in $_.
(perl would have warned you about that if you had asked it to...)

2) a "character class" matches a _single_ character.
[<page>] is exactly equivalent to [aegp<>] since the
listed characters are the same.

3) your pattern will match only single-digit numbers, you need to
allow multiple digit characters between the "tags".

4) you need "capturing parentheses" around the page number digits
if you want access to them later.

5) you don't need the m// at all if you are going to s/// with
the same pattern. s/// does nothing if the match fails.

6) the \d shortcut char class matches the same chars as [0-9].
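
To make points 2) and 3) concrete, here is a small sketch that runs the pattern from the question against a few test strings. The helper `class_match` and the test strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text matched by the pattern from the
# question, or undef if the pattern does not match at all.
sub class_match {
    my ($s) = @_;
    return $s =~ /([<page>][0-9][<\/page>])/ ? $1 : undef;
}

# Each bracketed group matches exactly one character, so the whole
# pattern matches three characters: class char, digit, class char.
for my $s ( "a1e", "<page>7</page>", "<page>10</page>" ) {
    my $hit = class_match($s);
    printf "%-16s %s\n", $s, defined $hit ? "matched '$hit'" : "no match";
}
```

The two-digit case finds nothing because no digit in it is both preceded and followed by a character from the classes, which is why the s///g in the question left the string untouched.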

s,[<page>][0-9][</page>],,g;


After this statement all of the "tags" will be gone, and it will
be "too late" to apply further processing to them (such as print()).

printf "(p.$PAGE) contains [$_]:\n";


You should use print() unless you make use of the formatting
that printf() provides.

The output I expected was:

(p.10) contains [He had come to pass his experience along to me - if I cared
to have it.]


----------------------------------
#!/usr/bin/perl
use strict;
use warnings;

$_= "He had come to pass his experience along to me - if "
. "<page>10</page>I cared to have it.";

while ( s,<page>(\d+)</page>,,g ) {
    print "(p.$1) contains [$_]:\n";
}
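
For a line that carries more than one marker, one possible approach (an assumption about what might be wanted, not something asked for in the thread) is to collect every page number first and strip the markers afterwards. The sample text is invented.

```perl
use strict;
use warnings;

my $text = "Chapter one opens here.<page>9</page> It continues "
         . "on the next page.<page>10</page> And ends.";

# m//g in list context returns every captured page number at once.
my @pages = $text =~ m,<page>(\d+)</page>,g;

# Work on a copy so the original string keeps its markers.
my $clean = $text;
my $removed = ( $clean =~ s,<page>\d+</page>,,g );    # s///g returns the count

print "pages: @pages\n";
print "removed $removed markers\n";
print "[$clean]\n";
```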
 
Uri Guttman

"GH" == Gunnar Hjalmarsson said:
#!/usr/bin/perl
$_= "He had come to pass his experience along to me -
if <page>10</page>I cared to have it.";
$PAGE =~ /[<page>][0-9][<\/page>]/;


GH> Several mistakes in that line.

GH> - You need parentheses around $PAGE to enforce list context.

why is list context needed?

GH> This is what I suppose you mean:

GH> ($PAGE) = /<page>([0-9]+)<\/page>/;

and why do you have list context there? you never use the grabbed
results in that line.

uri
 
Gunnar Hjalmarsson

Uri said:
"GH" == Gunnar Hjalmarsson writes:
GH> Robert Stelmack said:
#!/usr/bin/perl
$_= "He had come to pass his experience along to me -
if <page>10</page>I cared to have it.";
$PAGE =~ /[<page>][0-9][<\/page>]/;

GH> Several mistakes in that line.

GH> - You need parentheses around $PAGE to enforce list context.

why is list context needed?

GH> This is what I suppose you mean:

GH> ($PAGE) = /<page>([0-9]+)<\/page>/;

and why do you have list context there? you never use the grabbed
results in that line.

No, but two lines further down, OP's code presupposes that $PAGE
contains the page number. After having suggested a minimum of changes
to OP's code, I also mentioned that the whole line is redundant, and
that the page number well can be captured in the s/// operator.

What's your message, Uri?
 
Uri Guttman

GH> Uri Guttman said:
$PAGE =~ /[<page>][0-9][<\/page>]/;

GH> - You need parentheses around $PAGE to enforce list context.

well, without any grabs nor assignment, list context is meaningless
there. the line has =~.

GH> This is what I suppose you mean:

GH> What's your message, Uri?

i didn't see the change from =~ to = in that line. so it was more than
just your previous comment about list context being needed.
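
The difference between `=~` and `=` that this exchange turns on can be sketched as follows. The variable names are made up for illustration.

```perl
use strict;
use warnings;

$_ = "if <page>10</page>I cared to have it.";

# '=' with a scalar on the left: the match runs against $_ in scalar
# context and yields a true/false value.
my $found = /<page>([0-9]+)<\/page>/;

# '=' with parentheses on the left: list context, so the match returns
# the captured text.
my ($page) = /<page>([0-9]+)<\/page>/;

# '=~' does not assign anything: it picks the string to match against.
my $other = "no markers here";
my $bound = $other =~ /<page>([0-9]+)<\/page>/;

print "scalar context: $found\n";
print "list context:   $page\n";
print "bound to \$other: ", ( $bound ? "match" : "no match" ), "\n";
```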

uri
 
Nils Petter Vaskinn

Nils Petter Vaskinn said:
and probably wrong
/^(.*)<page>([0-9]+)</page>(.*)/;
$page = $2;
$rest = $1 . $3;


Using the dollar-digit variables without first ensuring that
the match succeeded is indeed wrong.

And i should probably have escaped that '/'

if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {
    $page = $2;
    $rest = $1 . $3;
}
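
A short sketch of why the check matters: a failed match leaves the capture variables as they were, so an unchecked `$1` can silently carry a value from an earlier line. The sample strings and the helper `page_of` are hypothetical.

```perl
use strict;
use warnings;

my $re = qr{<page>([0-9]+)</page>};

my $first  = "first <page>3</page> line";
my $second = "second line, no marker";

$first  =~ $re;     # succeeds: $1 is now "3"
$second =~ $re;     # fails: $1 is left untouched
my $stale = $1;     # still "3", although $second has no marker

# Hypothetical helper: only looks at $1 when the match succeeded.
sub page_of {
    my ($line) = @_;
    return $line =~ $re ? $1 : undef;
}

my $checked = page_of($second);

print "unchecked: $stale\n";
print "checked:   ", ( defined $checked ? $checked : "(none)" ), "\n";
```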
 
Ben Morrow

Nils Petter Vaskinn said:
And i should probably have escaped that '/'

if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {

Bleech!

if ( m|^(.*)<page>(\d+)</page>(.*)| ) {

Ben
 
Brian McCauley

Ben Morrow said:
Nils Petter Vaskinn said:
if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {

Bleech!

if ( m|^(.*)<page>(\d+)</page>(.*)| ) {

I think the ability to use alternate quoting delimiters is often
overrated. I don't like to increase the amount of contextual
information needed to understand what I'm looking at. In the case of
a long regex I don't want to have to remember that it's using some
delimiter other than /. Compared to the tiny effort of the extra
keystroke to escape each / I don't think the small loss of readability
is justified.

--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\
 
Tassilo v. Parseval

Also sprach Brian McCauley:
Ben Morrow said:
Nils Petter Vaskinn said:
if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {

Bleech!

if ( m|^(.*)<page>(\d+)</page>(.*)| ) {

I think the ability to use alternate quoting delimiters is often
overrated. I don't like to increase the amount of contextual
information needed to understand what I'm looking at. In the case of
a long regex I don't want to have to remember that it's using some
delimiter other than /. Compared to the tiny effort of the extra
keystroke to escape each / I don't think the small loss of readability
is justified.

And particularly, using a delimiter that has special meaning in regexps
(such as '|') is always a bad idea. Under these circumstances I prefer
to use '!' or '#' or in fact any character that visually stands out and
is not meta in its semantics.
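
For comparison, a small sketch of the same match written with the delimiters mentioned in this subthread. All four forms are meant to be equivalent for this pattern; which one reads best is exactly the matter of taste being argued here.

```perl
use strict;
use warnings;

my $s = "if <page>10</page>I cared to have it.";

my ($slash)  = $s =~ /<page>(\d+)<\/page>/;    # default delimiter: / must be escaped
my ($bang)   = $s =~ m!<page>(\d+)</page>!;    # '!' has no special meaning on its own
my ($hash)   = $s =~ m#<page>(\d+)</page>#;    # nor does '#', as long as /x is not used
my ($braces) = $s =~ m{<page>(\d+)</page>};    # paired delimiters

print "$slash $bang $hash $braces\n";
```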

Tassilo
 
Uri Guttman

BM> Ben Morrow said:
Nils Petter Vaskinn said:
if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {

Bleech!

if ( m|^(.*)<page>(\d+)</page>(.*)| ) {

BM> I think the ability to use alternate quoting delimiters is often
BM> overrated. I don't like to increase the amount of contextual
BM> information needed to understand what I'm looking at. In the case of
BM> a long regex I don't want to have to remember that it's using some
BM> delimiter other than /. Compared to the tiny effort of the extra
BM> keystroke to escape each / I don't think the small loss of readability
BM> is justified.

i have to disagree. i find \ annoying to see when it is not needed. just
choose a delimiter that works with this regex. i like paired delims like
{} or []. and if the regex gets too long or complex, /x is called
for. and then paired delims work very well:

s{
blah
}
{
replace
}sexi ;
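
Applied to the page markers from the original question, the paired-delimiter style with /x might look like the sketch below. It uses only the /x modifier; the other modifiers in the skeleton above are not needed for this pattern.

```perl
use strict;
use warnings;

my $line = "if <page>10</page>I cared to have it.";

my $clean = $line;    # work on a copy
my $page;

if ( $clean =~ s{
        <page>      # opening marker
        ( \d+ )     # the page number, captured in $1
        </page>     # closing marker, no backslash needed before the slash
    }{}x )
{
    $page = $1;
}

print "(p.$page) contains [$clean]\n" if defined $page;
```

Under /x the whitespace and the `#` comments inside the pattern are ignored by the regex engine. The comments deliberately avoid brace characters, since Perl looks for the closing delimiter before it interprets the pattern.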

uri
 
