Need help understanding how a file input block works

Paul

Hello there. Please forgive this newbie question but I don't really
know Perl as I have only been using it for a few days.

I need to understand how a particular script works and am having
difficulty with one particular subroutine.
First, here is a sample of the input file (IN_FILE):
---
OVERVIEW
----------------------------
This is where notes and stuff goes.
And maybe some more notes here too.

MATERIALS
----------------------------
This is where more notes and stuff goes.

etc...
---

Second, here is a section of the Perl code that I am having
difficulties with:
---
sub parsefile
{
1: $overview=0; @overview = ();
2: # open file..
3: while ($line = <IN_FILE>)
4: {
5: if ($line =~ /^\U\w*/)
6: {
7: if ($line =~ /^OVERVIEW/)
8: {
9: $line = <IN_FILE>; next if $line !~ /--------+/;
10: if ($overview== 1) {error("More than one Overview section")}
11: $overview++;
12: $bin = \@overview;
13: next;
14: }
15: }
16: $line =~ s/\"/\'/g;
17: push (@$bin,$line);
18: }
19: if (!$overview) {error("Missing a Overview section")}
20: # etc...
}
---

So here's what I've figured out so far:

- line 3 starts a loop that goes through each line in the input file
- line 5 I had difficulties with, but I think it means "if the line
starts with an uppercase word char" then do this block
- line 7 says "if the line starts with 'OVERVIEW'" then do this block

(Here's where I start to lose it. I think I might be having
difficulties with the 'block' concept.. but let me continue..)

- line 9 starts by saying 'get the next line in the input file'.
*Then* it says "next if" the line doesn't contain a bunch of dashes.

QUESTION 1: Where does "next" go? Does it go to the start of line 9
(i.e. the beginning of the current block) or to line 3?

- line 10 is an error check and calls some other method. (I'm okay
with this line.)
- line 11 increments the $overview count. ok.

- line 12 --> I HAVE NO IDEA!

QUESTION 2: What does line 12 do? Does it assign something to a
variable? If so, I don't get it. at all.

- line 13 - "next" to where? Line 3?

- line 16 does a double-quote substitution. ok.

- line 17 pushes the content of the $line into .. what *is* that
variable? where did it come from? (It's not used anywhere else in
the script!)

QUESTION 3: What is this variable in line 17?

Finally, somewhere along the line I think the @overview array is
assigned the content of each of the lines between the heading rows.
But I'm not sure how that's done. I'm pretty sure it happens in the
block above, but there's some kind of voodoo magic working that is
keeping me from seeing it.

Can some kind person please help me understand how this code works? I
have spent several hours, gone through 2 O'Reilly books and asked
everyone within yelling distance of my desk but so far I haven't been
able to completely understand this.

TIA.
 
John W. Krahn

Paul said:
Hello there. Please forgive this newbie question but I don't really
know Perl as I have only been using it for a few days.

I need to understand how a particular script works and am having
difficulty with one particular subroutine.
First, here is a sample of the input file (IN_FILE):
---
OVERVIEW
----------------------------
This is where notes and stuff goes.
And maybe some more notes here too.

MATERIALS
----------------------------
This is where more notes and stuff goes.

etc...
---

Second, here is a section of the Perl code that I am having
difficulties with:
---
sub parsefile
{
1: $overview=0; @overview = ();
2: # open file..
3: while ($line = <IN_FILE>)
4: {
5: if ($line =~ /^\U\w*/)
6: {
7: if ($line =~ /^OVERVIEW/)
8: {
9: $line = <IN_FILE>; next if $line !~ /--------+/;
10: if ($overview== 1) {error("More than one Overview section")}
11: $overview++;
12: $bin = \@overview;
13: next;
14: }
15: }
16: $line =~ s/\"/\'/g;
17: push (@$bin,$line);
18: }
19: if (!$overview) {error("Missing a Overview section")}
20: # etc...
}
---

So here's what I've figured out so far:

- line 3 starts a loop that goes through each line in the input file

Correct so far.
- line 5 I had difficulties with, but I think it means "if the line
starts with an uppercase word char" then do this block

$ perl -le'my $pattern = qr/^\U\w*/; print $pattern'
(?-xism:^\W*)

So the expression:
5: if ($line =~ /^\U\w*/)

Is the same as:
5: if ($line =~ /^\W*/)

Which says match if the line begins with zero or more *non-word* characters
and since *every* line begins with zero non-word characters then every line
will match. In other words, this test does nothing useful.
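
A minimal sketch of that point, using a handful of made-up sample lines (the
names @samples and $matched are illustrative, not from the script). Whatever
the line contains, the pattern from line 5 accepts it:

```perl
use strict;
use warnings;

# The pattern from line 5 of the script. \U upper-cases the rest of the
# pattern while it is being interpolated, so \w is read as \W.
my $broken = qr/^\U\w*/;

# Hypothetical sample lines: a heading, ordinary text, a rule, a blank line.
my @samples = ("OVERVIEW\n", "plain notes\n", "--------\n", "\n");

# Count how many of the samples the pattern accepts.
my $matched = grep { $_ =~ $broken } @samples;

print "$matched of ", scalar(@samples), " lines matched\n";
```

Because the only mandatory part of the pattern is the start-of-string anchor,
all four samples match.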
- line 7 says "if the line starts with 'OVERVIEW'" then do this block
Correct.

(Here's where I start to lose it. I think I might be having
difficulties with the 'block' concept.. but let me continue..)

- line 9 starts by saying 'get the next line in the input file'.
*Then* it says "next if" the line doesn't contain a bunch of dashes.
Correct.

QUESTION 1: Where does "next" go? Does it go to the start of line 9
(i.e. the beginning of the current block) or to line 3?

next goes to the beginning of the enclosing loop and there is only one loop in
your example.
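
A small sketch of this, assuming an in-memory list of lines in place of the
real file (@lines and @kept are made-up names). The 'next' sits inside two
nested 'if' blocks, but it still applies to the 'while' loop, because 'if'
blocks are not loops:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the input file.
my @lines = ("OVERVIEW\n", "----------\n", "first note\n", "second note\n");
my @kept;

while (defined(my $line = shift @lines)) {
    if ($line =~ /^[[:upper:]]/) {
        if ($line =~ /^OVERVIEW/) {
            # Consume the dashed line that follows the heading.
            my $separator = shift @lines;
            # Leaves both 'if' blocks and starts the next 'while' iteration,
            # so the push below is skipped for the heading.
            next;
        }
    }
    push @kept, $line;
}

print @kept;
```

Only the two note lines end up in @kept; the heading and its separator never
reach the push.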
- line 10 is an error check and calls some other method. (I'm okay
with this line.)
- line 11 increments the $overview count. ok.

- line 12 --> I HAVE NO IDEA!

QUESTION 2: What does line 12 do? Does it assign something to a
variable? If so, I don't get it. at all.

It assigns a reference to the array @overview to the scalar $bin. I have no
idea why this is being done because the code is always referencing the same array.
- line 13 - "next" to where? Line 3?

Again, there is only one loop that 'next' can apply to.
- line 16 does a double-quote substitution. ok.

- line 17 pushes the content of the $line into .. what *is* that
variable? where did it come from? (It's not used anywhere else in
the script!)

QUESTION 3: What is this variable in line 17?

That is the variable that was assigned to on line 12. The reference in $bin
is dereferenced and so the original array @overview has a value pushed onto it.
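
As a minimal sketch of lines 12 and 17 in isolation (the pushed strings are
made up for illustration), pushing through the reference changes the original
array:

```perl
use strict;
use warnings;

my @overview;

# Line 12: $bin holds a reference to @overview, not a copy of it.
my $bin = \@overview;

# Line 17: @$bin means "the array that $bin refers to".
push @$bin, "first line\n";

# The same dereference with the optional braces written out.
push @{$bin}, "second line\n";

# Both pushes landed in @overview itself.
print scalar(@overview), " lines in \@overview\n";
```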




John
 
Ben Morrow

Paul said:
Hello there. Please forgive this newbie question but I don't really
know Perl as I have only been using it for a few days.

Firstly: thank you for your clear problem description :).
I need to understand how a particular script works and am having
difficulty with one particular subroutine.
First, here is a sample of the input file (IN_FILE):
---
OVERVIEW
----------------------------
This is where notes and stuff goes.
And maybe some more notes here too.

MATERIALS

There are several stylistic and code-hygiene issues with this code; I
will deal with them below, after your 'real' questions.
---
sub parsefile
{
1: $overview=0; @overview = ();
2: # open file..
3: while ($line = <IN_FILE>)
4: {
5: if ($line =~ /^\U\w*/)
6: {
7: if ($line =~ /^OVERVIEW/)
8: {
9: $line = <IN_FILE>; next if $line !~ /--------+/;
10: if ($overview== 1) {error("More than one Overview section")}
11: $overview++;
12: $bin = \@overview;
13: next;
14: }
15: }
16: $line =~ s/\"/\'/g;
17: push (@$bin,$line);
18: }
19: if (!$overview) {error("Missing a Overview section")}
20: # etc...
}
---

So here's what I've figured out so far:

- line 3 starts a loop that goes through each line in the input file
- line 5 I had difficulties with, but I think it means "if the line
starts with an uppercase word char" then do this block

Pretty much. $line =~ /.../ is an expression that is 'true'[0] if the
contents of $line match the pattern ('regular expression' or 'regex')
between the slashes. A tutorial for Perl's regexen is available in
'perldoc perlretut'; the full description is in 'perldoc perlre'. Your
pattern has four parts:

^ says 'match if we are at the start of the string',
\U is an error, see below,
\w says 'match any word character', which for various reasons means
'any letter (upper- or lowercase), any digit, or _',
* says 'allow the preceding item to match zero or more times'.

[0] I'm assuming you know enough about programming in general to be
familiar with the idea of expressions being 'true' or 'false', and
constructions like 'if' that test that.

The original author presumably thought /\U/ meant 'match any uppercase
letter', but that is not the case. It *actually* means 'before
attempting to match this pattern, take everything from here to the next
\E and make it uppercase'. So the pattern actually being matched is
/^\W*/, which means 'the start of the string, followed by zero-or-more
characters which are *not* word characters'. This will match anything at
all, as the only non-optional part is 'the start of the string', and
every string has a start :).

What the author meant, instead of /\U/, was /[[:upper:]]/. This means

[...] match any one of the characters inside the brackets
[:upper:] insert a list of all uppercase characters. Note that
this only works inside [...].

So, your pattern (once fixed) will match strings like 'OVERVIEW', but
also strings like 'Overview' and 'Xfff_8885'; and, since you don't
insist the match goes all the way to the end of the string, 'U %#@:'.
Note that the /\w*/ part is completely useless, as it is allowed to
match nothing at all and nothing follows it. I suspect you probably want
something more like /^[[:upper:]]+$/, which means

^ the start of the string
[[:upper:]] any uppercase character...
+ ...one or more times
$ the end of the string

If you want to allow spaces in the string you need something like
/^[[:upper:]\s]+$/, where \s means 'match any space character'.
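
A short sketch of the two corrected patterns, tried against the example
strings mentioned above plus one made-up heading with spaces. The helper names
is_heading and is_heading_with_spaces are hypothetical. One caveat worth
keeping in mind: the version that allows \s also accepts a line consisting
only of whitespace.

```perl
use strict;
use warnings;

# Strict form: nothing but uppercase letters from start to end.
sub is_heading {
    my ($line) = @_;
    return $line =~ /^[[:upper:]]+$/ ? 1 : 0;
}

# Relaxed form: uppercase letters and whitespace.
sub is_heading_with_spaces {
    my ($line) = @_;
    return $line =~ /^[[:upper:]\s]+$/ ? 1 : 0;
}

# Single quotes so that nothing in the samples is interpolated.
my @samples = ('OVERVIEW', 'Overview', 'Xfff_8885', 'U %#@:', 'BILL OF MATERIALS');

for my $sample (@samples) {
    printf "%-20s strict=%d with_spaces=%d\n",
        $sample, is_heading($sample), is_heading_with_spaces($sample);
}
```

Since $ also matches just before a trailing newline, these tests work on lines
that have not been chomped.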
- line 7 says "if the line starts with 'OVERVIEW'" then do this block
Yup.

(Here's where I start to lose it. I think I might be having
difficulties with the 'block' concept.. but let me continue..)

- line 9 starts by saying 'get the next line in the input file'.
*Then* it says "next if" the line doesn't contain a bunch of dashes.

QUESTION 1: Where does "next" go? Does it go to the start of line 9
(i.e. the beginning of the current block) or to line 3?

'next' is documented in perldoc perl, in the section "Loop Control".
What it means is 'start the next iteration of the innermost loop'. In
this case, the innermost loop is the 'while' loop that begins on line 3,
so that's where it goes. Note that it doesn't just jump up to line 3:
that would be 'redo', which restarts the current iteration. Instead,
'next' re-evaluates the loop condition first, so a fresh line is read
from the file.
- line 10 is an error check and calls some other method. (I'm okay
with this line.)
- line 11 increments the $overview count. ok.

- line 12 --> I HAVE NO IDEA!

QUESTION 2: What does line 12 do? Does it assign something to a
variable? If so, I don't get it. at all.

line 12 says 'take a reference to the variable @overview, and assign it
to $bin'. Understanding the concept of references (or 'pointers' in C)
is crucial to any more than basic programming. I would strongly
recommend you read through the whole of perldoc perlreftut, which is a
decent introduction to the idea. (If it *isn't*, please let us know
which bits you don't understand, so we can fix them :).)
- line 13 - "next" to where? Line 3?

Yup: again, the innermost loop.
- line 16 does a double-quote substitution. ok.

- line 17 pushes the content of the $line into .. what *is* that
variable? where did it come from? (It's not used anywhere else in
the script!)

QUESTION 3: What is this variable in line 17?

It's not a variable :). It's a dereference expression. It has two parts:
'$bin' is a normal scalar variable, and '@' as a prefix is the array
deref operator, and says 'find the array that this reference refers to':
in this case the @overview array (because of line 12). Again, read
perlreftut.
Finally, somewhere along the line I think the @overview array is
assigned the content of each of the lines between the heading rows.
But I'm not sure how that's done. I'm pretty sure it happens in the
block above, but there's some kind of voodoo magic working that is
keeping me from seeing it.

It's the reference stuff that's confused you.

So, some more general comments. I'll include the original again, so you
can see what I'm talking about.
sub parsefile
{

This is an unimportant stylistic matter, but it is usual in Perl to put
an opening brace at the end of the line:

sub parsefile {
1: $overview=0; @overview = ();

These two are global variables. Globals are best avoided: if some other
part of your program uses $overview, it could interfere with this sub,
causing a very hard-to-find bug. You should *always* begin your programs
with

use strict;

which will stop you from using globals by accident. Next you should
*always* have

use warnings;

which will warn you about potential problems. Adding 'use strict;' to
this program will likely produce a whole lot of errors about 'Global
symbol $foo requires explicit package name': this is telling you you
have an undeclared global. You can fix these by putting

my $foo;

*just* before you need the variable: this creates a $foo that only
exists in this part of the program, so no one else can interfere with it.
See perldoc perlsub, the section called "Private Variables via my()",
and also http://perl.plover.com/FAQs/Namespaces.html .

In this case, you need the 'my' up here, outside the loop; otherwise
you'll get a fresh copy every time the loop goes around. You don't need
them until just before the loop, though, so they can go below opening
the file.
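
To illustrate the difference, here is a small sketch with made-up data (only
@overview comes from the script; @fresh and the sample lines are
hypothetical). The array declared inside the loop starts out empty on every
pass, while the one declared outside keeps accumulating:

```perl
use strict;
use warnings;

# Declared once, before the loop: keeps its contents across iterations.
my @overview;

for my $line ("one\n", "two\n", "three\n") {
    # Declared inside the loop: a brand-new, empty array each time round.
    my @fresh;

    push @fresh,    $line;
    push @overview, $line;

    print scalar(@fresh), " in \@fresh, ", scalar(@overview), " in \@overview\n";
}
```

@fresh never holds more than one element, whereas @overview ends up with all
three.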

Variables in Perl are always initialized: arrays and hashes to 'empty',
scalars to the special value undef, that behaves like 0 when treated as
a number and like '' when treated as a string. These two initializations
are probably a poor attempt to compensate for the fact the author wasn't
using proper 'my' variables: you don't need them.
2: # open file..
3: while ($line = <IN_FILE>)

This filehandle (IN_FILE) is also global. You don't specify how you're
opening it, but I presume it's something like

open IN_FILE, 'foo' or die "can't open 'foo': $!";

(You *do* check you could open it, don't you? And you *do* include the
reason it couldn't be opened, which can be found in the magic $!
variable?)

This should be replaced with

open my $IN_FILE, '<', 'foo'
or die "can't open 'foo': $!";

This has two differences: firstly, the filehandle is placed in a new
'my' variable, so it's no longer global; and secondly, you're explicitly
telling Perl you want to open the file for reading. This isn't so
important in this case, but get into the habit of doing it now, or one
day you'll write something like

open my $FILE, $filename or die "...";

, some malicious person will manage to make $filename contain
'|rm -rf /', and all your files will be deleted. Perl's 'magic open'
is terribly flexible, but rather dangerous.

The while loop then needs to become

while (my $line = <$IN_FILE>) {

so that $line is declared, of course.
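
Put together, a self-contained sketch might look like the following. To avoid
depending on a real file it opens an in-memory string, which is an assumption
made purely for illustration; with a real file the third argument would be the
filename instead of \$text.

```perl
use strict;
use warnings;

# Hypothetical file contents, held in a string so the sketch runs anywhere.
my $text = "OVERVIEW\n--------\nSome notes.\n";

# Three-argument open with a lexical filehandle; '<' means read-only.
open my $IN_FILE, '<', \$text
    or die "can't open in-memory file: $!";

my @lines;
while (my $line = <$IN_FILE>) {
    chomp $line;            # strip the trailing newline
    push @lines, $line;
}

close $IN_FILE;

print scalar(@lines), " lines read\n";
```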
4: {
5: if ($line =~ /^\U\w*/)

As discussed above, this regex matches everything, so this whole 'if'
block is completely useless.
6: {
7: if ($line =~ /^OVERVIEW/)
8: {
9: $line = <IN_FILE>; next if $line !~ /--------+/;
10: if ($overview== 1) {error("More than one Overview section")}

if ($overview) {

would be more consistent with what comes below. Also, stylistically,
writing an if all bunched up like that is horrid.

if ($overview) {
error(...);
}
11: $overview++;
12: $bin = \@overview;

This line I don't understand the purpose of. Is $bin assigned a ref to
some other array earlier in the program? If not, then unless the first
line in the file is an OVERVIEW section, the push below will quietly
autovivify a new anonymous array in $bin, and any lines before the first
heading end up there instead of in @overview; and the whole exercise is
pointless, as you can just push onto the @overview array directly.
13: next;
14: }
15: }
16: $line =~ s/\"/\'/g;

The backslashes are both unnecessary.

$line =~ s/"/'/g;

This would be clearer and faster with tr///:

$line =~ tr/"/'/;
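
A quick sketch comparing the two on a made-up string (the variable names are
hypothetical). Both give the same result; tr/// also returns the number of
characters it matched, which can be used for counting without changing the
string:

```perl
use strict;
use warnings;

my $original = 'say "hello" to "Perl"';

# Copy, then substitute on the copy.
(my $via_s = $original) =~ s/"/'/g;

# Copy, then transliterate on the copy.
(my $via_tr = $original) =~ tr/"/'/;

# With an empty replacement list and no /d, tr only counts.
my $count = ($original =~ tr/"//);

print "$via_s\n$via_tr\n$count double quotes in the original\n";
```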

Ben
 
Tad McClellan

Paul said:
Hello there. Please forgive this newbie question but I don't really
know Perl as I have only been using it for a few days.

I need to understand how a particular script works
sub parsefile
{
1: $overview=0; @overview = ();
2: # open file..
3: while ($line = <IN_FILE>)
4: {
5: if ($line =~ /^\U\w*/)
6: {
7: if ($line =~ /^OVERVIEW/)
8: {
9: $line = <IN_FILE>; next if $line !~ /--------+/;
10: if ($overview== 1) {error("More than one Overview section")}
11: $overview++;
12: $bin = \@overview;
13: next;
14: }
15: }
16: $line =~ s/\"/\'/g;
17: push (@$bin,$line);
18: }
19: if (!$overview) {error("Missing a Overview section")}
20: # etc...
}


Whoever wrote this code did not know Perl very well either...

QUESTION 1: Where does "next" go?


It goes where the documentation for the next operator says it will go. :)

perldoc -f next

next LABEL
next The "next" command is like the "continue" statement in C; it
starts the next iteration of the loop:

Does it go to the start of line 9
(i.e. the beginning of the current block) or to line 3?


It goes to the beginning of the _loop_ containing the next,
line 3 in this case.

- line 12 --> I HAVE NO IDEA!
QUESTION 2: What does line 12 do? Does it assign something to a
variable? If so, I don't get it. at all.


It is taking a reference to an array:

perldoc perlreftut


- line 13 - "next" to where? Line 3?

Yes.


- line 16 does a double-quote substitution. ok.


Though it does illustrate that this programmer didn't really know
much Perl. Double quotes are not meta in regexes, and single quotes
are not meta in strings, so neither needs to be backslashed:

$line =~ s/"/'/g;

An even better way would be:

$line =~ tr/"/'/;

- line 17 pushes the content of the $line into .. what *is* that
variable?


It is dereferencing the reference that it took earlier.

where did it come from? (It's not used anywhere else in
the script!)


$bin is. The at-sign is dereferencing $bin as an array.
 
Ben Morrow

Quoth Ben Morrow said:
'next' is documented in perldoc perl, in the section "Loop Control".

I meant

perldoc perlsyn, in the section "Loop Control"

of course.

Ben
 
Paul

Wow! Thank you so much for taking the time to respond to my post with
such a detailed response. That was great and *very* helpful.

This line I don't understand the purpose of. Is $bin assigned a ref to
some...

After reading your reply about reference/dereference expressions, I
now understand what the script is trying to do.

In my original post, I included a sample of what the input file looks
like. Basically, it's a normal text file with a bunch of sections
that have headings in ALL CAPS followed by a dash line separator. The
sample code that I posted only had the first 'if' block in it, but the
actual script contains about 5 of these 'if' blocks. (They all look
about the same, just the variables change.)

So, when the $line matches a particular section heading, the $bin
variable is referenced to the desired array variable (e.g. line 12).
Then when it gets to the end, it pushes the $line into the correct
array variable (line 17). That line 12 reference assignment changes
in each section, so basically the script is trying to be efficient in
the way it's dumping each line into the correct, desired array
variable.

Essentially, it parses each section from the input file into a
different array variable. I get it now. Maybe not some of the fiddly
bits still, but enough to know what I'm doing now.
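
The approach described above could be sketched roughly as follows. This is
not the actual script: the lookup hash %bin_for, the @materials array and the
in-memory input are assumptions made for illustration, and only two of the
roughly five sections are shown.

```perl
use strict;
use warnings;

# Sample input from the original post, held in a string for the sketch.
my $text = <<'END';
OVERVIEW
----------------------------
This is where notes and stuff goes.
And maybe some more notes here too.

MATERIALS
----------------------------
This is where more notes and stuff goes.
END

# One array per section, plus a hypothetical heading-to-array lookup.
my (@overview, @materials);
my %bin_for = (OVERVIEW => \@overview, MATERIALS => \@materials);

open my $in, '<', \$text or die "can't open in-memory file: $!";

my $bin;    # reference to whichever array is currently being filled
while (my $line = <$in>) {
    my ($heading) = $line =~ /^([[:upper:]]+)$/;
    if (defined $heading and exists $bin_for{$heading}) {
        # A heading only counts if a dashed line follows it.
        my $rule = <$in>;
        next unless defined $rule and $rule =~ /^-{8,}/;
        $bin = $bin_for{$heading};    # switch bins, as line 12 does
        next;
    }
    next unless $bin;                 # skip anything before the first heading
    $line =~ tr/"/'/;
    push @$bin, $line;                # line 17: lands in the current section
}
close $in;

print "overview: ", scalar(@overview), ", materials: ", scalar(@materials), "\n";
```

Note that the blank line between the two sections is stored as the last
element of @overview, just as the original loop would store it.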

Thank you all for your terrific responses. I'm glad that I decided to
post to this news group.

All the best! Cheers. Paul. =)
 
Tad McClellan

Paul said:
Basically, it's a normal text file with a bunch of sections
that have ....
followed by a dash line separator.


Then you might also want to read up on the $/ variable in:

perldoc perlvar
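
As a rough sketch of what $/ can do here, assuming the sections are separated
by blank lines as in the sample and never contain blank lines themselves:
setting $/ to the empty string puts readline into paragraph mode, so each read
returns a whole section. The hash %section and the in-memory input are made up
for illustration.

```perl
use strict;
use warnings;

my $text = <<'END';
OVERVIEW
----------------------------
This is where notes and stuff goes.
And maybe some more notes here too.

MATERIALS
----------------------------
This is where more notes and stuff goes.
END

open my $in, '<', \$text or die "can't open in-memory file: $!";

my %section;    # heading => reference to an array of body lines
{
    # Paragraph mode: a record ends at one or more blank lines.
    # 'local' restores the old value of $/ when the block ends.
    local $/ = "";

    while (my $paragraph = <$in>) {
        # First line is the heading, second the dashed rule, rest the body.
        my ($heading, $rule, @body) = split /\n/, $paragraph;
        $section{$heading} = \@body;
    }
}
close $in;

for my $heading (sort keys %section) {
    print "$heading: ", scalar(@{ $section{$heading} }), " lines\n";
}
```

If a section body may itself contain blank lines, this shortcut splits it into
several records, and the line-by-line loop from the original script is the
safer choice.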
 
