I don't know what's wrong here !

W

Wes Groleau

I have a script to process a certain file format.
It was working at one time, but it doesn't now.
Obviously I changed something important, but
I have no memory of doing so, nor can I see anything wrong.

The input format: Each line has up to four parts:
Level, ID, Tag, and Text; separated by white space.

White space is optional _before_ the level.

ID is optional; if present, it is the second part.
It is printable characters, starting and ending
with '@'

Tag is always there. It is a single "word" of
letters, no white space. For simplicity, I am
going for any non-whitespace characters.

Text is optional and is everything after the
tag.

Here is what is not working:

@lines = <STDIN>; # Read all lines into array @lines
foreach (@lines) # Each line is placed into predefined scalar $_
{
# Get rid of CR/LF chars for originating platform.
# (So this platform's style can be put back on later.)
chomp;
s/[\r\n]//;

($Level, $ID, $Tag, $XRef, $Text, $Comment) = ("", "", "", "", "", "");

# Get the component parts of the line.
# Can UTF-8 input break this regexp???
#($Level, $ID, $Tag, $XRef, $Text, $Comment) =
# --- ----- --- ----- -- --
#/^\s*(\d+)\s+(@\S+@)?\s*(\S+)s+(@\S+@)?\s+(.*)?({{.*}})?/;
# 1 2 3 4 5 6

# Let's just use four (standard GEDCOM) for now.
# 1 2 3 4
/^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

# Get level of subrecord
if ( ! defined ( $1 ) )
{
$Level = 0;
$LineEOL = "NO LEVEL: $_\n";
}
else
{
$Level = $1;
}
print $Level . "\n"; # This part (Level) seems to work

# Save the label (if specified)
if ( ! defined ( $2 ) )
{
$ID = "";
}
else
{
$ID = $2;
}
print $ID . "\n";

# Uppercase the tag
if ( ! defined ( $3 ) )
{
$Tag = "";
$LineEOL = "NO TAG: $_\n";
}
else
{
$Tag = $3;
}
print $Tag . "\n"; # When input line is "0 HEAD" this should be "HEAD"
# but it is blank instead and the diagnostic "NO TAG"
# is added.

# Save everything else
if ( ! defined ( $4 ) )
{
$Text = "";
}
else
{
$Text = $4;
}
print $Text . "\n";
 
R

Ragnar Hafstað

snipped verbose description of problem

in case like this you should make a short script only containing the part
that seems to be failing
and usually the problem will become clear
/^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

look at this a bit:
(your input is '0 HEAD')
^\s* zero or more whitespace. ok
(\d+) one ore more digits. ok matches '0', the rest is ' HEAD'
\s+ one or more ws. matches ' ', rest is 'HEAD'
(@\S+@)? maches ''
\s* matches ''
(\S+) matches 'HEAD' rest is ''
\s+ does not match . oooooops
(.*)/;

maybe your previous input contained space at the end of the '0 HEAD' line

gnari
 
J

John W. Krahn

Wes said:
I have a script to process a certain file format.
It was working at one time, but it doesn't now.
Obviously I changed something important, but
I have no memory of doing so, nor can I see anything wrong.

The input format: Each line has up to four parts:
Level, ID, Tag, and Text; separated by white space.

White space is optional _before_ the level.

ID is optional; if present, it is the second part.
It is printable characters, starting and ending
with '@'

Tag is always there. It is a single "word" of
letters, no white space. For simplicity, I am
going for any non-whitespace characters.

Text is optional and is everything after the
tag.

Here is what is not working:

use warnings;
use strict;
@lines = <STDIN>; # Read all lines into array @lines
foreach (@lines) # Each line is placed into predefined scalar $_
{

Do you really need to slurp the entire input into an array?

# Get rid of CR/LF chars for originating platform.
# (So this platform's style can be put back on later.)
chomp;
s/[\r\n]//;

chomp() removes the contents of $/ from the end of $_. s/[\r\n]//
removes the FIRST occurrence of either \r or \n from $_. If you want to
remove any trailing whitespace including \r and \n then:

s/\s+\Z//;

($Level, $ID, $Tag, $XRef, $Text, $Comment) = ("", "", "", "", "", "");

# Get the component parts of the line.
# Can UTF-8 input break this regexp???
#($Level, $ID, $Tag, $XRef, $Text, $Comment) =
# --- ----- --- ----- -- --
#/^\s*(\d+)\s+(@\S+@)?\s*(\S+)s+(@\S+@)?\s+(.*)?({{.*}})?/;
# 1 2 3 4 5 6

# Let's just use four (standard GEDCOM) for now.
# 1 2 3 4
/^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

Why not just assign directly to your variables? That way you can
eliminate all the following else clauses. Also, the @ characters have
to be escaped or perl will try to interpolate them as arrays.

my ( $Level, $ID, $Tag, $Text ) =
/^\s*(\d+)\s*(\@\S+\@)?\s*(\S+)\s*(.*)/;

# Get level of subrecord
if ( not defined $Level )
{
$Level = 0;
$LineEOL = "NO LEVEL: $_\n";
}
print "$Level\n"; # This part (Level) seems to work

# Save the label (if specified)
if ( not defined $ID )
{
$ID = '';
}
print "$ID\n";

# Uppercase the tag
if ( not defined $Tag )
{
$Tag = '';
$LineEOL = "NO TAG: $_\n";
}
print "$Tag\n"; # When input line is "0 HEAD" this should be "HEAD"
# but it is blank instead and the diagnostic "NO TAG"
# is added.

# Save everything else
if ( not defined $Text )
{
$Text = '';
}
print "$Text\n";



John
 
G

Gunnar Hjalmarsson

Wes said:
Text is optional and is everything after the tag.

That condition is not reflected in this regex, which requires one or
more whitespace characters after Tag, or else it won't capture the
data as expected:
/^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;
---------------------------------^^^

It's better written as:

/^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;
---------------------------------^^^-------^^

Other comments:

You should declare your variables and run your script with strictures
and warnings enabled.

use strict;
use warnings;
Here is what is not working:

That's a pointless description of your problem, isn't it? You'd better
explain what you expect the script to output, what it actually
outputs, and which error and warning messages you receive (if any).
@lines = <STDIN>; # Read all lines into array @lines

Are you really using the STDIN filehandle for reading the file? Since
STDIN is a special filehandle, you should use some other name.

Basically I think your code can be shortened to:

while (<FILE>) {
chomp;
my ($Level, $ID, $Tag, $Text) =
/^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;
$Level ||= 0;
$ID ||= '';
$Text ||= '';
print "$Level\n$ID\n$Tag\n$Text\n";
}
 
W

Wes Groleau

Ragnar Hafsta�������������������� said:
look at this a bit:
(your input is '0 HEAD')
^\s* zero or more whitespace. ok
(\d+) one ore more digits. ok matches '0', the rest is ' HEAD'
\s+ one or more ws. matches ' ', rest is 'HEAD'
(@\S+@)? maches ''
\s* matches ''
(\S+) matches 'HEAD' rest is ''
\s+ does not match . oooooops
(.*)/;

maybe your previous input contained space at the end of the '0 HEAD' line

OK, I see that. Thanks. Maybe there was a space.

However, if the regexp doesn't match, why was
it able to get the $Level and leave the other
parts undefined?
 
W

Wes Groleau

John said:
Do you really need to slurp the entire input into an array?

Not really. I have used foreach (<>) before
I got this from someone else, and just kept that part.
But the original regexp didn't work. I made it work
but now I've broken it again.
chomp() removes the contents of $/ from the end of $_. s/[\r\n]//
removes the FIRST occurrence of either \r or \n from $_. If you want to
remove any trailing whitespace including \r and \n then:

s/\s+\Z//;

OK, thanks. That wasn't a problem in this particular
environment, but it's nice to know.
Why not just assign directly to your variables? That way you can
eliminate all the following else clauses. Also, the @ characters have
to be escaped or perl will try to interpolate them as arrays.

my ( $Level, $ID, $Tag, $Text ) =
/^\s*(\d+)\s*(\@\S+\@)?\s*(\S+)\s*(.*)/;

That was tried, but there was some problem I don't remember.
However, the '@' has always worked OK. But escaping them
won't hurt--I remember the error messages from that in other
contexts.

So you have helped me quite a bit (thanks), but I still
don't understand how the regexp as written was able to
define the $Level but not the $Tag
 
W

Wes Groleau

Gunnar said:
That condition is not reflected in this regex, which requires one or
more whitespace characters after Tag, or else it won't capture the
data as expected:


---------------------------------^^^

Sorry, I was unclear. The only way to determine
the end of the Tag is by whitespace. But as someone
else pointed out, if there is no text, then there may
be a line-end instead of white space. Perhaps this
is what you are saying?
It's better written as:

/^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;
---------------------------------^^^-------^^

So here, $4 matches the _inner_ parentheses?
I don't know what ?: means in this context.

That's a pointless description of your problem, isn't it? You'd better
explain what you expect the script to output, what it actually
outputs, and which error and warning messages you receive (if any).

Well, I explained that $1 or $Level gets defined,
$2 ($ID) and $4 ($Text) do not (but weren't expected to
on the first line), and $3 ($Tag) did not which was a
surprise and prevented reading any other lines.
After finding the fatal error of $Tag undefined,
it seemed pointless to quote the remaining two pages
of the script.
Are you really using the STDIN filehandle for reading the file? Since
STDIN is a special filehandle, you should use some other name.

Yes, I pipe cat (file) into the script. I figured
I should get the output right before improving the UI.
Basically I think your code can be shortened to:

while (<FILE>) {
chomp;

Oh, I definitely need to convert line ends on files
from other platforms. (Mac/Win/Unix) But as John
pointed out, I did it wrong.
my ($Level, $ID, $Tag, $Text) =
/^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;

I'll give this a try, along with the other two guys' suggestions.
$Level ||= 0;
$ID ||= '';
$Text ||= '';
print "$Level\n$ID\n$Tag\n$Text\n";

OK, but I definitely want to keep the error reporting
if Level or Tag is missing.

Thanks to all three responders for the additional understanding.

For the curious, what I am doing is reformatting a GEDCOM
file, AND splitting it into multiple files. Have to
open a new file every time a Level = 0 line comes in,
with the filename determined by parts of the line.

But all of that is later in the script, and is meaningless
if Tag is undefined or missing.

I will experiment with what you three guys have offered,
and I appreciate it very much. Thanks.

--
Wes Groleau

Nobody believes a theoretical analysis -- except the guy who did it.
Everybody believes an experimental analysis -- except the guy who did it.
-- Unknown
 
J

Jay Tilton

: "Ragnar Hafstað" wrote:
: >> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;
: >
: > look at this a bit:
: > (your input is '0 HEAD')
: > ^\s* zero or more whitespace. ok
: > (\d+) one ore more digits. ok matches '0', the rest is ' HEAD'
: > \s+ one or more ws. matches ' ', rest is 'HEAD'
: > (@\S+@)? maches ''
: > \s* matches ''
: > (\S+) matches 'HEAD' rest is ''
: > \s+ does not match . oooooops
: > (.*)/;
: >
: > maybe your previous input contained space at the end of the '0 HEAD' line
:
: OK, I see that. Thanks. Maybe there was a space.
:
: However, if the regexp doesn't match, why was
: it able to get the $Level and leave the other
: parts undefined?

You are probably seeing a stale value that was captured from a prior
iteration. The values in $1, $2, etc. are not reset when a match fails,
and the code does not verify that the match is successful before using
those variables.
 
G

Gunnar Hjalmarsson

Wes said:
Sorry, I was unclear. The only way to determine the end of the Tag
is by whitespace. But as someone else pointed out, if there is no
text, then there may be a line-end instead of white space. Perhaps
this is what you are saying?

Yes, that is what I mean.
So here, $4 matches the _inner_ parentheses?
Yes.

I don't know what ?: means in this context.
http://www.perldoc.com/perl5.8.0/pod/perlre.html#(--pattern)


Oh, I definitely need to convert line ends on files from other
platforms. (Mac/Win/Unix) But as John pointed out, I did it
wrong.

I just thought that that is better handled when transferring between
file systems (using "ASCII" or "Text" transfer mode). But maybe it's
wise to explicitly remove all trailing whitespace as well...
 
L

LÄÊ»ie Techie

For the curious, what I am doing is reformatting a GEDCOM file, AND
splitting it into multiple files. Have to open a new file every time a
Level = 0 line comes in, with the filename determined by parts of the
line.

That's what I thought when I read your original post. BTW, why reinvent
the wheel? Download the Gedcom module from CPAN. I'm sure that module
does most of what you want.

Hau'oli Makahiki Hou!
La'ie Techie
 
R

Ragnar Hafstað

Wes Groleau said:
However, if the regexp doesn't match, why was
it able to get the $Level and leave the other
parts undefined?

you should try to use print statements to help you debug.
a couple of them in the if/else blocks would have shown you
that $1 was indeed undefined, you set $Level=0, and test that later.
as the value '0' happened to be the value in the line, you just assumed
that it 'got' the $Level.

gnari
 
W

Wes Groleau

Jay said:
: However, if the regexp doesn't match, why was
: it able to get the $Level and leave the other
: parts undefined?

You are probably seeing a stale value that was captured from a prior
iteration. The values in $1, $2, etc. are not reset when a match fails,
and the code does not verify that the match is successful before using
those variables.

Ah, that's useful info, thanks. However, in this case,
it was after the first line in a new invocation.
But I will have to explicitly undefine them each time.
 
W

Wes Groleau

Ragnar Hafsta�������������������� said:
you should try to use print statements to help you debug.
a couple of them in the if/else blocks would have shown you
that $1 was indeed undefined, you set $Level=0, and test that later.
as the value '0' happened to be the value in the line, you just assumed
that it 'got' the $Level.

Ouch. I put the print statements later and "assumed"
it got the level because the diagnostic "no level"
wasn't there. But the reason it wasn't there is the
"no tag" diagnostic overwrote it! :)

I need to find a boook and learn how to use the debugger.
I've used it before, but without the book, I could only
single-step and that was a royal pain.
 
R

Ragnar Hafstað

Wes Groleau said:
I need to find a boook and learn how to use the debugger.
I've used it before, but without the book, I could only
single-step and that was a royal pain.

in this kind of case, simple print statements would do the trick

gnari
 
G

Gunnar Hjalmarsson

Wes said:
Ah, that's useful info, thanks. However, in this case, it was
after the first line in a new invocation. But I will have to
explicitly undefine them each time.

I don't think that you can undefine those variables explicitly. This
is how you typically do it instead:

if (/^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/) {
# It matched - do something
} else {
# It did not match - do something else
}
 
W

Wes Groleau

Gunnar said:
I don't think that you can undefine those variables explicitly. This
is how you typically do it instead:

if (/^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/) {
# It matched - do something
} else {
# It did not match - do something else
}

Isn't "undef" a predefined 'constant' that can be
assigned to a variable? I haven't tried it, but
it seems I've read that somewhere...

I may also try the 'if' but I'd prefer it to
match every line because if the syntax is bad
I'd like to use what I can in the diagnostic
messages.
 
W

Wes Groleau

LÄÊ»ie Techie said:
That's what I thought when I read your original post. BTW, why reinvent
the wheel? Download the Gedcom module from CPAN. I'm sure that module
does most of what you want.

Looked at it long ago, and did a little with it
but eventually rejected it for reasons I can't
remember. Maybe I should take another look.

OTOH, it _feels_ like I'm so close now....
(which is what many programmers say for the
last six months of their project.)

:)
 
T

Tad McClellan

Wes Groleau said:
Jay Tilton wrote:


But I will have to explicitly undefine them each time.


No you won't.

You will have to ensure that the match _succeeds_ before you
make use of the dollar-digit (capture) variables.
 
G

Gunnar Hjalmarsson

Wes said:
Isn't "undef" a predefined 'constant' that can be assigned to a
variable?

Yes, it is. But as regards those special $1, $2, etc. variables, you
cannot assign anything to them. They are populated only through regex
capturing.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,776
Messages
2,569,603
Members
45,190
Latest member
Martindap

Latest Threads

Top