Teach me how to fish, regexp

Henry · Oct 7, 2003

Folks:

I've got a bunch of fixed-format text files (< 100k bytes each) to sniff.

Each file is divided into paragraphs. Each para is preceded by at least
three blank lines, and is introduced by a section number of 1 to 6 digits
followed by a period and two spaces, OR, 1 to 6 digits followed by a period
and at least one digit, followed by a period and two spaces, e.g.

------------------------------------------------------
......
<empty>
<empty>
<empty>
12034. Blah, blah, blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah,
blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah.
Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah. Blah Blah Blah.
Blah. ...
------------------------------------------------------

Or, the second format:

------------------------------------------------------
......
<empty>
<empty>
<empty>
12034.1. Blah, blah, blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah,
blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah.
Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah. Blah Blah Blah. ...
.....
------------------------------------------------------

Yes, if you are wondering, these are legal blah-blah-blahs.

Seems the best way to deal with this is to slurp, and use "split" with the
appropriate regexp. Wrinkle: I need to retain the section numbers in the
return strings.

Right! I've been writing trial regular expressions all day, and I have come
to the conclusion that I'm not very good at it. I've also examined examples
and help pages until I'm ... really tired and no wiser.

Well, I _can_ split based on assuming that the three empty lines _always_
appear before a new section, but this doesn't seem very robust. Seems like
I really ought to be able to recognize at least two empties followed by
these two fixed-format alternatives.

Best I've figured out takes a common subset of the two cases:

#@sections = split /\n\n\n[0-9][0-9]+\./,<>;

This works ok, but it eats the match string. Non-capturing parentheses? I
wish I could make heads or tails of this syntax. Look-ahead assertion?
Even more cryptic.

I can't even figure out why I seem to need "[0-9][0-9]+" for my 5 digit test
case when it seems "[0-9]+" ought to suffice. (Yeah, I know my solution
will fail if there's only 1 digit --i.e. the first 9 sections-- but that's
obviously the least of my problems).

Could some wizard teach me to fish: Please don't give me a solution, merely
tell me where I'm going wrong and put me back on the right path.

Or should I go back to my awk hack that works and which I actually
understand?

Thanks,

Henry


Martien Verbruggen · Oct 7, 2003

Folks:

I've got a bunch of fixed-format text files (< 100k bytes each) to sniff.

Each file is divided into paragraphs. Each para is preceded by at least
three blank lines, and is introduced by a section number of 1 to 6 digits
followed by a period and two spaces, OR, 1 to 6 digits followed by a period
and at least one digit, followed by a period and two spaces, e.g.

Is the first paragraph also preceded by three blank lines? And do you
mean three blank lines, or three newlines? I will assume three
newlines (i.e. two blank lines).

[snip of example records, see code, below]

Seems the best way to deal with this is to slurp, and use "split" with the
appropriate regexp. Wrinkle: I need to retain the section numbers in the
return strings.

I would probably set the input record separator ($/, see perlvar) to
"", which will treat two or more consecutive newlines as the record
separator. Then each record starts with the number you're interested
in.

#!/usr/local/bin/perl
use warnings;
use strict;

$/ = "";
while (<DATA>)
{
    chomp;
    if (my ($num, $para) = /^(\d+(?:\.\d)?)\. (.*)/s)
    {
        print "[$num] $para\n";
    }
    else
    {
        print "MALFORMED RECORD\n";
    }
}

__DATA__
12034. Blah, blah, blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah,
blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah.
Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah. Blah Blah Blah.
Blah. ...

12034.1. Blah, blah, blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah,
blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah.
Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah. Blah Blah Blah. ...
.....

12034.2. Foo bar baz

12035. Grubble banana groo
feeble deee ....

=== End example program===

The regular expression in more detail:

/
^ # from the beginning of the record
( # start capture
\d+ # one or more digits
(?: # start grouping, but no capturing
\.\d # A literal . followed by a digit
) # end grouping
? # previous (group) one or zero times, i.e. it's optional
) # end capturing
.\ \ # literal . followed by two spaces
(.*) # capture the rest of the record
/sx

The s modifier makes . match newlines, and the x modifier allows the
comments I put in (which is also why I needed to escape the spaces in
this version, and not in the one above). The first capturing set of
parentheses returns the paragraph number, including the sub-number, if
present, and the second capturing parentheses set returns the "Blah,
blah.." bit up to the end of the record.

Also see the perlvar and perlre documentation for more information.

If two newlines is not a record splitter, and you _have_ to use a
minimum of three, this won't work. You can't even check after reading
a record whether it ends in more than two newlines, since it always
will end in exactly two, no matter how many are in the input (which is
pretty annoying), so you'd probably have to set $/ to "\n\n\n", and
remove any leading and trailing whitespace yourself and skip "empty"
records:

#!/usr/local/bin/perl
use warnings;
use strict;

$/ = "\n\n\n";
while (<DATA>)
{
    s/^\s+//;
    s/\s+$//;
    next if $_ eq "";

    if (my ($num, $para) = /^(\d+(?:\.\d)?)\. (.*)/s)
    {
        print "[$num] $para\n";
    }
    else
    {
        print "ILLEGAL RECORD\n";
    }
}

__DATA__
12034. Blah, blah, blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah,
blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah.
Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah. Blah Blah Blah.
Blah. ...

blah

12034.1. Blah, blah, blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah,
blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah.
Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah. Blah Blah Blah. ...
.....

12034.2. Foo bar baz

12035. Grubble banana groo
feeble deee ....

=== End example program===

Martien

Roy Johnson · Oct 7, 2003

If I may take the last question first:

Henry said:
Or should I go back to my awk hack that works and which I actually
understand?

You could always run your awk hack through a2p to see how it deals
with your situation. Could be ugly, could be enlightening.

If you do this:
my @paras = split(/\n{3}(\d+\.(?:\d+\.)?) /, $whole_file);

You will have your section headers split out as their own paragraphs,
followed by the paragraphs themselves. Then you just have to put them
back together as you go through the list. (Decide for yourself whether
you want the spaces after the section number retained. My split is
throwing them away.)

If you want three or more newlines, change the {3} to {3,}. Season to
taste.
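
The "put them back together" step might look something like the sketch below. It is only an illustration: the sample text in $whole_file is made up, the pattern assumes two spaces after the header (as in Henry's description), and it uses {3,} with an escaped second dot.

```perl
use strict;
use warnings;

# Hypothetical sample: each section is preceded by three newlines.
my $whole_file = "\n\n\n12034.  First para.\nMore text."
               . "\n\n\n12034.1.  Second para."
               . "\n\n\n12035.  Third.\n";

# The capturing group makes split return each header as its own element.
my @parts = split /\n{3,}(\d+\.(?:\d+\.)?)  /, $whole_file;

# $parts[0] is whatever came before the first header (empty here); drop it.
shift @parts;

# What remains alternates header, body, header, body, ...
my @sections;
while (my ($num, $body) = splice(@parts, 0, 2)) {
    push @sections, "$num  $body";
}

print "$_\n---\n" for @sections;
```

Taking two elements at a time with splice keeps each header next to its own text, so every entry in @sections starts with its section number again.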

I can't even figure out why I seem to need "[0-9][0-9]+" for my 5 digit test
case when it seems "[0-9]+" ought to suffice.

I agree (although I recommend \d instead of [0-9]). What was [0-9]+
doing wrong that [0-9][0-9]+ fixed?

Henry · Oct 7, 2003

Martien Verbruggen:

Thank you for your response to my post:

Is the first paragraph also preceded by three blank lines? And do you mean
three blank lines, or three newlines? I will assume three newlines (i.e. two
blank lines).

It appears that there are _usually_ three blank lines, i.e. four newlines
preceding each new section. It appears that all other breaks --the ones I
don't want to find-- are shorter (fewer newlines) but I don't know how
reliable this is.

[snip of example records, see code, below]

Seems the best way to deal with this is to slurp, and use "split" with the
appropriate regexp. Wrinkle: I need to retain the section numbers in the
return strings.


I would probably set the input record separator ($/, see perlvar) to "", which
will treat two or more consecutive newlines as the record separator. Then each
record starts with the number you're interested in.

Right, that's what I finally did, in effect. (I did something similar at the
"split".) But this isn't very robust, I think: it depends on some typist
somewhere _always_ following the rules.

I think you are saying that slurp mode may not be the best choice.

As far as your setting

$/ = "";

This is not exactly intuitive from the point of view of a newcomer.
Sorry, could you help me understand (or give me a blind rule of thumb) how
what looks like setting a variable to an empty string implies "two or more
successive newlines"?

#!/usr/local/bin/perl use warnings; use strict;

$/ = ""; while (<DATA>) { chomp; if (my ($num, $para) = /^(\d+(?:\.\d)?)\.
(.*)/s) { print "[$num] $para\n"; } else { print "MALFORMED RECORD\n"; } }

=== End example program===

Thanks for taking all the trouble to explain the components in detail:

/
^ # from the beginning of the record
Right.

( # start capture

Capture? I guess you mean the mysterious "save the stuff you match"
mechanisms I've found in some perl references. The explanations I've found
are very short and not very useful. Also: I find it hard to discriminate
between parens used for operation grouping and this use.

\d+ # one or more digits
Yes.

(?: # start grouping, but no capturing

Sorry, could you speak more fully about this? Again, I haven't found a good
reference for this stuff.

\.\d # A literal . followed by a digit
Right.

) # end grouping

OK, as above.

? # previous (group) one or zero times, i.e. it's optional
OK.

) # end capturing

OK, as above.

.\ \ # literal . followed by two spaces

Sorry, I don't get that. Could you explain more fully? I think that I
understand that a period, unescaped, matches any character, so I would
expect that you'd have to escape before the period to match a literal
period/decimal point.

(.*) # capture the rest of the record

I think I understand that

.*

means "any character, repeated 0 or more times", but I don't get how the
parens lead to capture (and not operation grouping, as above) and eventual
appearance of the captured data somewhere.

/sx

The s modifier makes . match newlines, and the x modifier allows the comments
I put in (which is also why I needed to escape the spaces in this version, and
not in the one above).

OK. (The modifiers mechanism takes some getting used to.)

The first capturing set of parentheses returns the paragraph number, including
the sub-number, if present, and the second capturing parentheses set returns
the "Blah, blah.." bit up to the end of the record.

Right, as I said above, I can't figure out how this aspect works. This may
seem obvious to you but looks like a hidden (or magical) side-effect to me.

Also see the perlvar and perlre documentation for more information.

My desk and my screen are littered with various references. Thanks for
pointing out these man "subreferences" -- I had not noticed them

If two newlines is not a record splitter, and you _have_ to use a minimum of
three, this won't work.

Sorry, could you speak more fully about this? Is there a restriction I'm
not seeing?

You can't even check after reading a record whether it
ends in more than two newlines, since it always will end in exactly two, no
matter how many are in the input (which is pretty annoying), so you'd have to
probably set $/ to "\n\n\n", and remove any leading and trailing whitespace
yourself and skip "empty" records:

Right. This is exactly where I arrived before I decided I needed help and
posted my original question, except that I stayed with slurping the data
instead of sort-of line-at-a-time processing.

#!/usr/local/bin/perl use warnings; use strict;

$/ = "\n\n\n"; while (<DATA>) { s/^\s+//; s/\s+$//; next if $_ eq "";

if (my ($num, $para) = /^(\d+(?:\.\d)?)\. (.*)/s) { print "[$num] $para\n"; }
else { print "ILLEGAL RECORD\n"; } }

<snip>

Thanks. It would seem that some of the mystifications I asked about above
appear here also.

Thanks for your patience.

Thanks,

Henry


Henry · Oct 7, 2003

Roy Johnson:

Thanks for your response to my post:

If I may take the last question first:

You could always run your awk hack through a2p to see how it deals
with your situation. Could be ugly, could be enlightening.

Good idea. I already did that. It _was_ ugly. (Kind of like seeing
myself on TV. Yech! )

If you do this:
my @paras = split(/\n{3}(\d+\.(?:\d+\.)?) /, $whole_file);

You will have your section headers split out as their own paragraphs,
followed by the paragraphs themselves. Then you just have to put them
back together as you go through the list. (Decide for yourself whether
you want the spaces after the section number retained. My split is
throwing them away.)

If you want three or more newlines, change the {3} to {3,}. Season to
taste.

Thanks! I plugged your expression in to my test scaffolding. The result is
quite workable; I just need to traverse the resulting array appropriately;
which should be no problem. So you've given me a fish (a solution).

I need to learn how to catch my own fish. So I'll backtrack and make sure
I understand your solution. Please have patience with me; I'm (obviously)
new to all this.

#1 What does using "my" mean? (Give me a clue -- a keyword; I'll look it up.
Googling for "my" and "perl" has not been particularly enlightening.)

#2 Is there any difference --except brevity-- between writing

\n{3}

and

\n\n\n

-- both mean "exactly 3 newlines in succession" right?

I understand the fundamental match expression components, excepting your use
of parens and "?:".

#3 To get started, is there a difference between uttering

my @Paras = split(/\n{3}(\d+\.(?:\d+\.)?) /, $whole_file);

and

my @Paras = split /\n{3}(\d+\.(?:\d+\.)?) /, $whole_file;

Both seem to work the same way in my test scaffolding.

#4 .... And I _really_ don't see how this expression leads to getting
alternating saved section numbers and section contents in the output array.
This seems to be based on a bit of assumptions/side effects/magic, and I
haven't yet found the right perl reference to explain it.

I can't even figure out why I seem to need "[0-9][0-9]+" for my 5 digit test
case when it seems "[0-9]+" ought to suffice.


I agree (although I recommend \d instead of [0-9]).

OK, these are equivalent, though, right? Is this a matter of style,
dialect, common usage, modernity, or what?

My preference for [0-9] is only this: that construct seems more versatile
and so perhaps I can do more with less demands on my internal (brain) memory
or fewer references to a perl regexp cheat-sheet.

What was [0-9]+ doing wrong that [0-9][0-9]+ fixed?

Hmmmm, I forget. There were so many...

Thanks,

Henry


Martien Verbruggen · Oct 7, 2003

[rewrapped long lines]

Martien Verbruggen:

Thank you for your response to my post:

Right, that's what I finally did, in effect. (I did something
similar at the "split".) But this isn't very robust, I think: it
depends on some typist somewhere _always_ following the rules.

I think you are saying that slurp mode may not be the best choice.

As far as your setting

$/ = "";

This is not exactly intuitive from the point of view of a newcomer.
Sorry, could you help me understand (or give me a blind rule of
thumb) how what looks like setting a variable to an empty string
implies "two or more successive newlines"?

The perlvar documentation explains what $/ (the input record
separator) does, and that it has a "special" setting of the empty
string, which makes it read "paragraphs", i.e. blocks of text
separated by two or more newlines.

Thanks for taking all the trouble to explain the components in detail:

Capture? I guess you mean the mysterious "save the stuff you match"
mechanisms I've found in some perl references. The explanations
I've found are very short and not very useful. Also: I find it
hard to discriminate between parens used for operation grouping and
this use.

Yes. Capturing parentheses "save" whatever is matched between them,
and return it as a result of the operation, as well as in the named
variables $1, $2, etc. At the same time they group multiple
characters together to form a single subpattern.
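
A small sketch of both routes to the captured text. The sample record is made up and the pattern is a cut-down version of the one discussed above (no sub-number, two literal spaces after the dot):

```perl
use strict;
use warnings;

my $record = "12034.  Blah, blah.";   # hypothetical sample record

# In list context, a successful match returns what each capturing group matched.
my ($num, $para) = $record =~ /^(\d+)\.  (.*)/s;
print "list context: [$num] [$para]\n";

# The same text is also left in $1 and $2 after a successful match.
my ($one, $two);
if ($record =~ /^(\d+)\.  (.*)/s) {
    ($one, $two) = ($1, $2);
    print "numbered variables: [$one] [$two]\n";
}
```

Checking that the match succeeded before reading $1 and $2 matters, because a failed match leaves them holding values from an earlier successful match.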

There is more information about this in the perlre documentation, as
well as in the perlop documentation under the entry for
"m/PATTERN/cgimosx".

Sorry, could you speak more fully about this? Again, I haven't
found a good reference for this stuff.

If you only want to group some stuff together in a subpattern, but you
don't want the match of that subpattern returned as one of the digit
variables, or in the return list, you use (?:PATTERN). Again, see the
perlre documentation for a full explanation.
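
To see the difference side by side, here is a minimal sketch. The header string is a made-up example and the variable names are only for illustration:

```perl
use strict;
use warnings;

my $header = "12034.1.  ";   # hypothetical section header

# Plain parentheses around the sub-number: it becomes a capture of its own.
my @with_capture = $header =~ /^(\d+)(\.\d+)?\.  /;

# (?:...) still groups, so the ? applies to the whole ".1",
# but nothing extra is returned.
my @without_capture = $header =~ /^(\d+)(?:\.\d+)?\.  /;

print scalar(@with_capture), " values: @with_capture\n";
print scalar(@without_capture), " value: @without_capture\n";
```

Both patterns match the same text; only the number of returned values changes.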

Sorry, I don't get that. Could you explain more fully? I think that I
understand that a period, unescaped, matches any character, so I would
expect that you'd have to escape before the period to match a literal
period/decimal point.

You're right. My mistake in transcribing the regular expression. There
should be a backslash in front of the dot.

I think I understand that

.*

means "any character, repeated 0 or more times", but I don't get how the
parens lead to capture (and not operation grouping, as above) and eventual
appearance of the captured data somewhere.

It does both. They group, and as a side effect, the matched subpattern
gets captured and returned (in this case as the second element of the
returned list, as well as in $2).

Right, as I said above, I can't figure out how this aspect works.
This may seem obvious to you but looks like a hidden (or magical)
side-effect to me.

The fact that those grouped subpattern matches get returned (and saved
in $1, $2...) is more an effect of the m// operator (documented in
perlop) than of regular expressions themselves. However, they do get
captured in regular expressions, and you can refer back to them (with
\1, \2...) inside of the same regular expression.
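
As an illustration of the two ways to reach a capture, this sketch looks for a doubled word. The sample sentence is invented for the example:

```perl
use strict;
use warnings;

my $text = "Blah blah. It is is a typo.";   # hypothetical sample

# \1 inside the pattern means "whatever group 1 matched just now".
# Once the match has succeeded, the same text is available outside as $1.
my $doubled;
if ($text =~ /\b(\w+) \1\b/) {
    $doubled = $1;
    print "doubled word: $doubled\n";
}
```

"Blah blah" is not reported because the comparison is case-sensitive; adding the i modifier would change that.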

My desk and my screen are littered with various references. Thanks for
pointing out these man "subreferences" -- I had not noticed them

man perl gives a rather complete list of all the various other manual
pages that are available.

Sorry, could you speak more fully about this? Is there a
restriction I'm not seeing?

If, for example, your text is formatted like:

12345 Some text for paragraph 1

Some more text that belongs in paragraph two

12345.1 This is the second paragraph

Then setting $/ to "" would read the second part of the first
paragraph as a separate read, since it has two newlines between the
first and second bit. If there is text in your documents that is like
that, you can't use the first bunch of code (with $/ set to ""), but
you have to use the second bunch of code (with $/ set to "\n\n\n" or
possibly even "\n\n\n\n") and do a bit more work in removing trailing
and leading newlines.
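
The difference can be checked with a few lines of code. This is a sketch under assumptions: the sample text is made up (one blank line inside the first section, two blank lines between sections), read_records is a hypothetical helper, and it reads from an in-memory filehandle rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical text: a blank line inside section 12345,
# two blank lines between the sections.
my $text = "12345.  Some text for paragraph 1\n\n"
         . "Some more text that belongs to the same section\n\n\n"
         . "12345.1.  This is the second section\n";

# Read all records from an in-memory filehandle with the given separator.
sub read_records {
    my ($separator) = @_;
    local $/ = $separator;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @paragraph_mode = read_records("");        # splits at every blank line
my @three_newlines = read_records("\n\n\n");  # splits only at two blank lines

print scalar(@paragraph_mode), " records with paragraph mode\n";
print scalar(@three_newlines), " records with three newlines\n";
```

Paragraph mode cuts the first section in two, while the three-newline separator keeps it whole.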

[snip rewrapped copy of my code]

That's not what I posted. The newlines are important.

There are also a perlrequick and a perlretut manual page, which are
more gentle introductions to regular expressions than the perlre
reference documentation. You should probably have a bit of a read of
those.

Furthermore: Don't worry too much that some of this stuff looks
magical. It is. Perl is full of things that you just have to learn
about by immersion, and by repeated visits to the same documentation.
It can take a while before some of this stuff becomes automatic.

Martien

Bryan Castillo · Oct 7, 2003

I've got a bunch of fixed-format text files (< 100k bytes each) to sniff.

Each file is divided into paragraphs. Each para is preceded by at least
three blank lines, and is introduced by a section number of 1 to 6 digits
followed by a period and two spaces, OR, 1 to 6 digits followed by a period
and at least one digit, followed by a period and two spaces, e.g.

------------------------------------------------------
.....
<empty>
<empty>
<empty>
12034. Blah, blah, blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah,
blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah.
Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah. Blah Blah Blah.
Blah. ...
------------------------------------------------------

Or, the second format:

------------------------------------------------------
.....
<empty>
<empty>
<empty>
12034.1. Blah, blah, blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah,
blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah.
Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah. Blah Blah Blah. ...
....
------------------------------------------------------

Yes, if you are wondering, these are legal blah-blah-blahs.

Seems the best way to deal with this is to slurp, and use "split" with the
appropriate regexp. Wrinkle: I need to retain the section numbers in the
return strings.

Here is a way to slurp and split with an re, for what you described.
3 things you might want to look at:

1. How the zero-width look-ahead assertion (?=) doesn't
consume the section number

2. How the (?:) grouping doesn't capture the value
(you could have used regular capturing grouping
but there isn't any point to capture in the split)

3. How the qr operator is used. It isn't necessary, but
I thought it made the code more readable.

You should also see that this split might leave an empty
value in the first element of the array. You will have to check for it.

use strict;
use warnings;
use IO::File;

sub readfile {
    my $in = IO::File->new($_[0], "r") || die "Cannot open $_[0]: $!";
    my $text = '';
    $text .= $_ while (<$in>);
    return $text;
}

my $text = readfile('file.txt');

# compile re here for readability
my $re = qr/
    [\r\n]{3,}      # match 3 or more newlines
    (?=             # zero-width positive look-ahead doesn't consume section
        \d{1,6}\.   # match the 1 to 6 digits of the section number
        (?:\d+\.)?  # match optional sub-number and dot (non-capturing)
        \s{2}       # match 2 spaces
    )
/x;

my @t = split $re, $text;

for (my $i=0; $i<=$#t; $i++) {
    print "Para [$i]\n", $t[$i], "\n", "-"x60, "\n";
}

Could some wizard teach me to fish: Please don't give me a solution, merely
tell me where I'm going wrong and put me back on the right path.

Sorry, but it's easier to give a solution and to ask you to
read it and research it to figure out how it works.
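
For experimenting without a data file, the look-ahead idea can be tried on a string. This is a sketch, not the poster's code: the sample text is invented, and the pattern assumes headers of 1 to 6 digits, an optional sub-number, and two spaces, as in the original question.

```perl
use strict;
use warnings;

# Hypothetical sample: three blank lines before each section,
# one blank line inside the first section.
my $text = "Preamble\n\n\n\n12034.  First.\n\nStill first.\n\n\n\n12034.1.  Second.\n";

my $separator = qr/
    \n{3,}            # three or more newlines are consumed as the separator
    (?=               # positive look-ahead: must be followed by ...
        \d{1,6}\.     #   1 to 6 digits and a dot
        (?:\d+\.)?    #   optionally a sub-number and a dot
        [ ]{2}        #   two literal spaces
    )                 # ... but none of that is consumed
/x;

my @sections = split $separator, $text;
print "[$_]\n" for @sections;
```

Because the look-ahead consumes nothing, each element after the preamble still begins with its section number, and the single blank line inside the first section does not cause a split.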

Roy Johnson · Oct 8, 2003

For learning how to fish, I recommend the book "Learning Perl". Randal
has a gift for introducing topics so that they are understandable and
stick with you.

As a versatile fishing pole you should already have, I direct you to
perldoc, and have included some baited hooks below.

Having a copy
of "Programming Perl" is a good alternative to looking everything up
in perldoc and man pages.

Henry said:
#1 What does using "my" mean? (Give me a clue -- a keyword; I'll look it up.
Googling for "my" and "perl" has not been particularly enlightening.)

"my" declares a variable that exists only in the enclosing block. It
is "lexically scoped". When declaring variables, this is usually what
you want. Undeclared variables are package globals, and those given a
temporary value with "local" are "dynamically scoped", meaning the
temporary value is also seen by any functions called from that block.

perldoc -f my
perldoc -f local
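
A short sketch of the difference. All the names here ($level, show, with_local, with_my) are made up for the illustration:

```perl
use strict;
use warnings;

our $level = "global";           # a package (global) variable

sub show { return $level }       # reads the package variable

sub with_local {
    local $level = "localized";  # temporary value, seen by subs called from here
    return show();
}

sub with_my {
    my $level = "lexical";       # a new variable, visible only inside this block
    return show();               # show() still sees the package variable
}

print "local: ", with_local(), "\n";
print "my:    ", with_my(), "\n";
print "after: ", show(), "\n";
```

The value set with local reaches show() and is undone when with_local returns; the my variable never leaves with_my.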

#2 Is there any difference --except brevity-- between writing
\n{3}
and
\n\n\n
-- both mean "exactly 3 newlines in succession" right?

Correct. I like shorter expressions when possible, and prefer not to
have to count how many times I've put something.

I understand the fundamental match expression components, excepting your use
of parens and "?:".

(?:...) indicates that whatever the parentheses match will not be
returned by the match or assigned to $1, etc. They are only there to
group the pattern so that a quantifier can apply to it.

perldoc perlretut
perldoc perlre

#3 To get started, is there a difference between uttering
my @paras = split(/\n{3}(\d+\.(?:\d+\.)?) /, $whole_file);
and
my @paras = split /\n{3}(\d+\.(?:\d+\.)?) /, $whole_file;
Both seem to work the same way in my test scaffolding.

Apart from the extra space in the 2nd pattern, they should be the
same. The parentheses for function calls are generally optional,
unless it becomes ambiguous which function gets an argument.

perldoc -f split

#4 .... And I _really_ don't see how this expression leads to getting
alternating saved section numbers and section contents in the output array.

It's how split works. The pattern you give it is the delimiter, which
is not ordinarily returned, but if you have parenthesized
sub-expression(s) in its pattern, they will be returned in order
amidst the rest of the data. (Except, of course, if you ?: the ()s.)
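
A stripped-down sketch of that behaviour, using a made-up input string instead of the section files:

```perl
use strict;
use warnings;

my $line = "a,b;c";   # hypothetical input

my @plain    = split /[,;]/,    $line;   # delimiters thrown away
my @captured = split /([,;])/,  $line;   # delimiters returned between the fields
my @grouped  = split /(?:,|;)/, $line;   # grouped but not captured: thrown away

print "plain:    @plain\n";
print "captured: @captured\n";
print "grouped:  @grouped\n";
```

With the capturing version the array alternates field, delimiter, field, delimiter, which is the same alternation seen with the section numbers.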

[...]I recommend \d instead of [0-9]).


OK, these are equivalent, though, right? Is this a matter of style,
dialect, common usage, modernity, or what?

Convenience, mostly. They are equivalent for plain ASCII text (on Unicode
strings \d can also match digits from other scripts). I probably shouldn't have
thrown extras like that at you when you're trying to nail down more
fundamental stuff.

Read and practice.

Tad McClellan · Oct 8, 2003

[snip good advice]

"Programming Perl" is a good alternative to looking everything up
in perldoc and man pages.

No it isn't.

The Camel book describes some old version of Perl.

The std docs describe the version of Perl that you
actually have installed on your system.

"Programming Perl" is a good _supplement_ to looking everything
up in perldoc.

[snip more good advice]

Henry · Oct 8, 2003

Martien:

Thanks for your response on this thread:

The perlvar documentation explains what $/ (the input record
separator) does, and that it has a "special" setting of the empty
string, which makes it read "paragraphs", i.e. blocks of text
separated by two or more newlines.

Right. You are too polite, so I'll say it: I need to RTFM.

That said, I have to say I find the man pages the last place I want to look
in terms of convenience. I'll have to find a web based... Done!

Ah, much easier. Here's a relevant extract.

....Setting to "" will treat two or more consecutive empty lines as a single
empty line. Setting to "\n\n" will blindly assume that the next input
character belongs to the next paragraph, even if it's a newline. (Mnemonic:
/ delimits line boundaries when quoting poetry.)

Quoting _poetry_?

The choice of "" as a special is clearly a choice of convenience on the part
of the people doing the internals-- and has no particular mnemonic or symbolic
values. I'm glad that's clear.

This gives me a clue to the magnitude of my task of learning perl to do
useful work.

Yes. Capturing parentheses "save" whatever is matched between them,
and return it as a result of the operation, as well as in the named
variables $1, $2, etc. At the same time they group multiple
characters together to form a single subpattern.

(I have the books "Perl by Example" and "Learning Perl" and what I'm
finding is that those aren't particularly good references when it comes to
details like this.)

Right, now I'm getting the idea. Parens capture the matched stuff.

Thanks for coming out and saying this so directly.

Apparently only specially-marked parens do NOT capture stuff, right? Also, I
see by reading further that capturing is expensive in terms of processing
time, so you might want to limit its use.

There is more information about this in the perlre documentation,

Aha: in a Warning, I find

...(stuff)...

Captures stuff, but

...(?:stuff)...

doesn't. Cute.

well as in the perlop documentation under the entry for
"m/PATTERN/cgimosx".

Hmmm, yet another man page. How _many_ are there? Over 100? Hmmmm...

Yes, that's the best treatment of many issues that I could not understand by
consulting other references. Thanks!

I'm still trying to figure out the best way of referring to what are here
called "flags" in this context, but seem to have other names elsewhere: the
"cgimosx" items. Am I confused, or is this confusing?

If you only want to group some stuff together in a subpattern, but you
don't want the match of that subpattern returned as one of the digit
variables, or in the return list, you use (?:PATTERN). Again, see the
perlre documentation for a full explanation.

You're right. My mistake in transcribing the regular expression. There
should be a backslash in front of the dot.

Sound effect: said:
It does both. They group, and as a side effect, the matched subpattern
gets captured and returned (in this case as the second element of the
returned list, as well as in $2).

Aha!

It seems to me that you can use parens to affect operation grouping, if you
are sufficiently qualified (or ambitious); that's Story #1. Depending on
which documentation you use, you may discover Story #2, which describes how
parens _also_ store data, how to access that data, and how and why to avoid
unnecessary use of this feature.

What I did not find --part of why I'm here-- is a reference that tells both
stories.

The fact that those grouped subpattern matches get returned (and saved
in $1, $2...) is more an effect of the m// operator (documented in
perlop) than of regular expressions themselves. However, they do get
captured in regular expressions, and you can refer back to them (with
\1, \2...) inside of the same regular expression.

I'm sorry you said _that_, because now I have uncertainty about the scope of
this "side effect" in different contexts. I guess your purpose is to alert
me to the fact that parens in regexps anyplace do save data, but the
accessibility of the data varies from context to context. Right?

There are also a perlrequick and a perlretut manual page, which are
more gentle introductions to regular expressions than the perlre
reference documentation. You should probably have a bit of a read of
those.

In my spare time, I'll concatenate all the perlxxxx man pages and see how
the contents compare to a moderate sized book.

That's a lot of stuff, and in a reference format. Fortunately, the examples
are generally quite good, but this isn't exactly the most friendly
environment.

Furthermore: Don't worry too much that some of this stuff looks
magical. It is. Perl is full of things that you just have to learn
about by immersion, and by repeated visits to the same documentation.
It can take a while before some of this stuff becomes automatic.

Perl doesn't seem to be anything one can pick up quickly, that's for sure.

Thanks,

Henry


Sam Holden · Oct 8, 2003

That said, I have to say I find the man pages the last place I want to look
in terms of convenience. I'll have to find a web based... Done!

The risk with that is that the man pages on your system will document
the perl installed on your system. The web versions will document
some version of perl which might not be the one you are using.

Plus the man pages also document the modules you have installed, rather than
a random selection of them...

Ah, much easier. Here's a relevant extract.

...Setting to "" will treat two or more consecutive empty lines as a single
empty line. Setting to "\n\n" will blindly assume that the next input
character belongs to the next paragraph, even if it's a newline. (Mnemonic:
/ delimits line boundaries when quoting poetry.)

Quoting _poetry_?

Why not? It's true after all, and hence a reasonable mnemonic.

The choice of "" as a special is clearly a choice of convenience on the part
of the people doing the internals-- and has no particular mnemonic or symbolic
values. I'm glad that's clear.

"" and undef are the only two values that could be used as something special,
since everything else is a possible literal end of line marker.

undef as "slurp" mode makes sense, since that's more common than paragraph
mode (in my experience anyway) and being undef means a simple 'local $/;'
is enough to enable it.
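
For reference, a minimal slurp sketch along those lines. The helper name slurp_handle and the sample data are made up, and an in-memory filehandle stands in for a real file:

```perl
use strict;
use warnings;

# Return everything that is left on a filehandle as one string.
sub slurp_handle {
    my ($fh) = @_;
    local $/;               # undef inside this sub only: one read returns it all
    my $content = <$fh>;
    return $content;
}

my $data = "line one\n\n\nline two\n";   # hypothetical file contents
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
my $all = slurp_handle($fh);
close $fh;

print length($all), " characters read in one go\n";
```

Because the change is made with local, $/ is back to a newline once slurp_handle returns, so line-by-line reads elsewhere are unaffected.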

That leaves "" for something else, and paragraph mode is a good choice in
my opinion, since it's a reasonably commonly wanted operation.

Of course when references arrived a new possibility became available and
(since it's perl) was used...
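
For what it's worth, here is a minimal sketch of the three special settings of $/ mentioned above: undef for slurp mode, "" for paragraph mode, and a reference to an integer for fixed-size records. The sample text and the read_all helper are made up for illustration, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Hypothetical sample data; an in-memory filehandle stands in for a real file.
my $data = "para one\nstill one\n\n\n\npara two\n";

# Read every record from $data using the given input record separator.
sub read_all {
    my ($sep) = @_;
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    local $/ = $sep;          # restored automatically when the sub returns
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @slurp = read_all(undef);  # slurp mode: the whole input is one record
my @paras = read_all("");     # paragraph mode: runs of empty lines separate records
my @fixed = read_all(\4);     # record mode: read up to 4 bytes at a time

printf "slurp: %d record(s)\n", scalar @slurp;
printf "paragraph: %d record(s)\n", scalar @paras;
printf "first fixed-size record: '%s'\n", $fixed[0];
```

Using local inside the helper keeps the change to $/ from leaking into the rest of the program, which matters because $/ is a global.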

Helgi Briem · Oct 9, 2003

That's a lot of stuff, and in a reference format. Fortunately, the examples
are generally quite good, but this isn't exactly the most friendly
environment.

What do you mean? Plain text, a pageful at a time, readily
searchable, isn't friendly enough for you? How much more
friendly can you get. If you prefer, the Activestate distribution
of Perl for Windows, Linux and Solaris, comes with all the
documentation in html format. I use that a little, but usually
prefer perldoc.

Perl doesn't seem to be anything one can pick up quickly, that's for sure.

I disagree. Programming logic in itself takes a while, but
if you are a programmer, you can pick up enough Perl to do
useful things in a couple of weeks, *if* you learn how to use
perldoc. It is the very core of learning Perl, more important than
anything else.

Roy Johnson · Oct 9, 2003

No it isn't.

The Camel book describes some old version of Perl.

I think a little more significance is being given to the differences
in versions than is warranted. For most things, the Camel book is a
fine reference. Particularly for someone just getting acquainted with
Perl, it is going to be more than adequate. You give the impression
that there are major compatibility problems between versions of Perl
5.

That said, I should have had a caveat that features do change.

David H. Adler · Oct 9, 2003

I think a little more significance is being given to the differences
in versions than is warranted. For most things, the Camel book is a
fine reference. Particularly for someone just getting acquainted with
Perl, it is going to be more than adequate. You give the impression
that there are major compatibility problems between versions of Perl
5.

That said, I should have had a caveat that features do change.

That caveat is exactly what makes it a bad *reference*.

I certainly wouldn't say it's not worth reading, but I wouldn't want to
look something up in a source that may be wrong on whatever I'm trying
to do.

Rather than check the book and then check the current docs to make sure
it wasn't something that had changed, I'd rather just read the docs.

dha

Helgi Briem · Oct 10, 2003

Rather than check the book and then check the current docs to make sure
it wasn't something that had changed, I'd rather just read the docs.

Plus of course, the digital documentation on your hard disk
is infinitely handier and more accessible than any paper
format. Cutting and pasting the superb examples covers
most eventualities.

Buy the book if you must, but leave it in the bathroom
where it belongs.

ko · Oct 10, 2003

Helgi said:
Plus of course, the digital documentation on your hard disk
is infinitely handier and more accessible than any paper
format. Cutting and pasting the superb examples covers
most eventualities.

Buy the book if you must, but leave it in the bathroom
where it belongs.

Am I reading this right?!? Programming Perl, a book by Larry Wall, Tom
Christiansen, and Jon Orwant belongs in the bathroom?!?

Helgi Briem · Oct 10, 2003

Am I reading this right?!? Programming Perl, a book by Larry Wall,
Tom Christiansen, and Jon Orwant belongs in the bathroom?!?

Of course. That's where you have peace and quiet for
reading.

Henry · Oct 10, 2003

Roy Johnson:

Thanks for your response on this thread:

(Sorry for the delay in responding. Ate some bad food, was "down" for a
couple of days.)

For learning how to fish, I recommend the book "Learning Perl". Randal
has a gift for introducing topics so that they are understandable and
stick with you.

Hmmm, that's exactly what I have right here, bought it over a year ago,
didn't get a chance to use it until now. It looked like the right choice
at the time.

Anticipating some other comments: I note now that this book was over 9
years old when I bought it. I might have chosen differently if I had paid
attention to the date on it. On the other hand, maybe "older" perl is
"simpler" perl, and easier to learn.

As a versatile fishing pole you should already have, I direct you to
perldoc, and have included some baited hooks below.

Thanks for following the metaphor, and my further gratitude for not reading
any particular theological meaning into it.

Having a copy of "Programming Perl" is a good alternative to looking
everything up in perldoc and man pages.

Right, I think that's the next purchase, assuming I survive this round of
perl work.

"my" declares a variable that exists only in the enclosing block.

I kind of figured that, but wanted to be sure. I _did_ look for it.

It is "lexically scoped". When declaring variables, this is usually what you
want. Undeclared variables are global, and those declared "local" are
"dynamically scoped", meaning their values will persist into function calls.

Right. (Jeez, yet another permutation on scoping rules to learn.)

perldoc -f my
perldoc -f local
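
The difference between the two kinds of scoping described above can be sketched roughly as follows. The variable and subroutine names are invented for the example; it only assumes a package variable that a called subroutine reads.

```perl
use strict;
use warnings;

our $level = "global";      # a package (global) variable

sub show { return $level }  # sees whatever value $level holds when called

sub with_local {
    local $level = "local"; # temporary value, visible in subs called from here
    return show();
}

sub with_my {
    my $level = "my";       # a new lexical, visible only inside this block
    return show();          # show() still sees the package variable
}

print "with_local: ", with_local(), "\n";
print "with_my: ", with_my(), "\n";
print "afterwards: ", show(), "\n";
```

Inside with_local the called subroutine sees the temporary value, while the lexical in with_my never leaves its block; after both calls the package variable is back to its original value.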

Aha! That's what "perldoc" means. And MacOS X has it. Cool. Strange
controls, though. Uses nroff? Oh, well...

Ummm, this may be entirely obvious to you, but how would a newcomer know
about perldoc? Oh, well, it is clearly shown in the base man page, but
not in the two books I bought.

Correct. I like shorter expressions when possible, and prefer not to
have to count how many times I've put something.

Right.

(?:...) indicates that the enclosing parentheses will not be returned
by the match or assigned to $1, etc. They are only there to group the
pattern so that a quantifier can apply to it.
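
As a small illustration of that point (a sketch only; the sample string is made up to resemble one of the section headings in this thread), compare what a match returns with and without the ?: inside the second pair of parentheses:

```perl
use strict;
use warnings;

my $section = "12034.1.  Blah, blah";

# Capturing group: the optional sub-section is returned as a second value.
my @with_capture = $section =~ /^(\d{1,6})\.(\d+\.)?/;

# Non-capturing group: (?:...) only groups, so just one value comes back.
my @without_capture = $section =~ /^(\d{1,6})\.(?:\d+\.)?/;

print scalar(@with_capture), " captured value(s): @with_capture\n";
print scalar(@without_capture), " captured value(s): @without_capture\n";
```

In both patterns the ? quantifier applies to the whole group, which is the only reason the second pair of parentheses is there.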

perldoc perlretut
perldoc perlre

I've been all over perlre in the man pages. I guess you mean "perlreftut".
Yep, I'll have to read that.

Apart from the extra space in the 2nd pattern, they should be the
same. The parentheses for function calls are generally optional,
unless it becomes ambiguous which function gets an argument.

Wow, yet another variation on syntax requirements. I would just as soon
have parens for function arguments required, but that probably shows I
haven't used perl long enough.

perldoc -f split

It's how split works. The pattern you give it is the delimiter, which
is not ordinarily returned, but if you have parenthesized
sub-expression(s) in its pattern, they will be returned in order
amidst the rest of the data. (Except, of course, if you ?: the ()s.)
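
A minimal sketch of that behaviour, using a made-up two-section sample in the format described at the top of the thread (the exact patterns are only one way to write it):

```perl
use strict;
use warnings;

# Made-up sample: two sections, each preceded by three empty lines.
my $text = "\n\n\n\n12034.  First para.\n\n\n\n12034.1.  Second para.\n";

# Plain delimiter: the section numbers are part of the delimiter and are lost.
my @plain = split /\n{3,}\d{1,6}\.(?:\d+\.)?\s{2}/, $text;

# Capturing parentheses: each section number is returned between the pieces.
my @kept = split /\n{3,}(\d{1,6}\.(?:\d+\.)?)\s{2}/, $text;

print "plain split returned ", scalar(@plain), " element(s)\n";
print "capturing split returned ", scalar(@kept), " element(s)\n";
print "[$_]\n" for @kept;
```

Note that both lists start with an empty element, because the text begins with a delimiter, and that the inner (?:...) group is not returned.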

Oh, yeah, that. I was getting so deep into the re stuff that I forget the
function in which I was operating. True!

[...]I recommend \d instead of [0-9]).

OK, these are equivalent, though, right? Is this a matter of style,
dialect, common usage, modernity, or what?

Convenience, mostly. They are equivalent. I probably shouldn't have
thrown extras like that at you when you're trying to nail down more
fundamental stuff.
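
One hedged footnote to "equivalent": for plain ASCII input such as the files in this thread the two behave the same, but on strings holding Unicode characters \d can also match digits outside 0-9. A minimal sketch, with sample characters chosen only for illustration:

```perl
use strict;
use warnings;

my $ascii  = "7";
my $arabic = "\x{0663}";   # ARABIC-INDIC DIGIT THREE, a non-ASCII digit

for my $char ($ascii, $arabic) {
    my $d     = $char =~ /^\d$/    ? "yes" : "no";
    my $class = $char =~ /^[0-9]$/ ? "yes" : "no";
    printf "U+%04X  \\d: %s  [0-9]: %s\n", ord($char), $d, $class;
}
```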

Indeed. Actually, that minor variation isn't a problem. Some others on
this thread have thrown much other arcane stuff at me.

Read and practice.

Well, eventually, I need to do the job I set out to. The good news is that
I see how all the tools I need to do the job are at my fingertips, in one
environment. I could do this with awk and shell scripting, too, but it
would be messy. The bad news is that perl seems to have infinite
regression of complexity; the further I look, the more complexity I see!

Is there a deep structure to perl that will become clear after study and
practice? Or will I confirm my current impression that it's got extensions
of a good basic idea extending to the horizon in every direction?

Thanks,

Henry

(e-mail address removed) remove 'zzz'

Henry · Oct 10, 2003

Bryan Castillo:

Thanks for your post on this thread:

(Sorry for the delay in responding. Ate some bad food, was "down" for a
couple of days.)

<snip>

Here is a way to slurp and split with an re, for what you described.
3 things you might want to look at:

1. How the zero-width look ahead assertion (?!) doesn't
consume the section number

"Doesn't consume"? Now I'm REALLY confused. According to what I find in
(one) perlre man page, this concept has to do with looking for 'x' not
followed by 'y'. Maybe I have the wrong man page. Is there a place
that

2. How the (?:) grouping doesn't capture the value
(you could have used regular capturing grouping
but there isn't any point to capture in the split)

Right (I think) -- ( ) captures and groups, while (?: ) groups only. Got
it.

3. How the qr operator is used. It isn't necessary, but
I thought it made the code more readable.

I think I get this.

You should also see, that this split might leave an empty
value in first element of the array. You will have to check for it.

Yes, definitely, I've grown accustomed to the empty value in element 0.

use strict;
use warnings;
use IO::File;

Huh? Is this the o-o face of perl? Do I need this?

sub readfile {
my $in = IO::File->new($_[0], "r") || die "Cannot open $_[0]: $!";
my $text = '';
$text.=$_ while (<$in>);
return $text;
}

OK, makes sense, in general; don't assume clean input, right? Fortunately,
I have pretty good assurance my input is all pretty good -- just may not be
consistent.

I like using subroutines, in general. Looks like I'm going to have to learn
how to use them in perl.

my $text = readfile('file.txt');

# compile re here for readability
my $re = qr/
[\r\n]{3,} # match 3 or more newline characters
(?= # zero-width positive look-ahead: doesn't consume the section number
\d{1,6}\. # match the 1 to 6 digits of the section number and its dot
(?:\d+\.)? # match optional sub-section digits and dot (non-capturing)
\s{2} # match 2 spaces
)
/x;

my @t = split $re, $text;

for (my $i=0; $i<=$#t; $i++) {
print "Para [$i]\n", $t[$i], "\n", "-"x60,"\n";
}
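
To make the "doesn't consume" point concrete, here is a self-contained sketch that runs on a made-up in-memory sample instead of a file. It assumes the positive look-ahead form (?=...), which is the one that requires a section number to follow; (?!...) is the negative form. The pattern follows the format described at the top of the thread (1 to 6 digits, an optional sub-section, two spaces).

```perl
use strict;
use warnings;

# Made-up sample text; in practice this would be the slurped file.
my $text = "\n\n\n\n12034.  First para.\nMore text.\n\n\n\n12034.1.  Second para.\n";

# Split on three or more newlines, but only where a section number follows.
# The look-ahead checks for the number without consuming it, so the number
# stays at the start of the next piece.
my $re = qr/
    \n{3,}            # three or more newlines
    (?=               # positive look-ahead, zero width
        \d{1,6}\.     # 1 to 6 digits and a dot
        (?:\d+\.)?    # optional sub-section digits and a dot
        [ ]{2}        # two spaces
    )
/x;

# Drop the empty leading element that split returns when the text
# starts with a delimiter.
my @sections = grep { length } split $re, $text;

print "[$_]\n" for @sections;
```

Because the look-ahead has zero width, only the newlines are treated as the delimiter; each returned piece still begins with its own section number.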

Could some wizard teach me to fish: Please don't give me a solution, merely
tell me where I'm going wrong and put me back on the right path.

Sorry, but it's easier to give a solution and to ask you to
read it and research it to figure out how it works.

That certainly is a reasonable solution, and I appreciate your help and
patience.

Most of the above makes sense to me. (I still have to understand the
oddities of the "print" syntax, but that's for a different thread.)

Thanks,

Henry

(e-mail address removed) remove 'zzz'

Henry · Oct 10, 2003

Tad McClellan:

Thanks for your post on this thread:

[snip good advice]

"Programming Perl" is a good alternative to looking everything up
in perldoc and man pages.

No it isn't.

The Camel book describes some old version of Perl.

The std docs describe the version of Perl that you
actually have installed on your system.

"Programming Perl" is a good _supplement_ to looking everything
up in perldoc.

OK, maybe I _won't_ make this my next purchase. Just what I need to add to
my confusion, a serious case of versionitis.

I'm really struggling to find the right combination of references, both
print and electronic, to support my efforts to learn perl and do a practical
project at the same time.

It appears that between man and perldocs, I _can_ get all the content I
need, and some very good examples.

However, sorry, I don't find these the friendliest documentation
environments. Man pages are great, as far as they go...

Also, I don't find them very well calibrated with respect to
level-of-expertise or detail. Just in the course of this thread, I've found
myself pulled deeper and deeper into what I consider highly-technical
aspects of perl. It is reassuring to know there's plenty of flexibility to
perl, but I sure wish there were "depth" notices (as on the edges of most
swimming pools) on the various materials. I'm finding myself drowned in
detail when all I want to do is get to the other side of the pool.

Is it possible to classify perl material into, say, "beginning",
"intermediate" and "advanced"? Is this done in practice?

Thanks,

Henry

(e-mail address removed) remove 'zzz'

[snip more good advice]
