I don't know what's wrong here !

Discussion in 'Perl Misc' started by Wes Groleau, Dec 26, 2003.

  1. Wes Groleau

    Wes Groleau Guest

    I have a script to process a certain file format.
    It was working at one time, but it doesn't now.
    Obviously I changed something important, but
    I have no memory of doing so, nor can I see anything wrong.

    The input format: Each line has up to four parts:
    Level, ID, Tag, and Text; separated by white space.

    White space is optional _before_ the level.

    ID is optional; if present, it is the second part.
    It is printable characters, starting and ending
    with '@'

    Tag is always there. It is a single "word" of
    letters, no white space. For simplicity, I am
    going for any non-whitespace characters.

    Text is optional and is everything after the
    tag.

    Here is what is not working:

    @lines = <STDIN>; # Read all lines into array @lines
    foreach (@lines) # Each line is placed into predefined scalar $_
    {
    # Get rid of CR/LF chars for originating platform.
    # (So this platform's style can be put back on later.)
    chomp;
    s/[\r\n]//;

    ($Level, $ID, $Tag, $XRef, $Text, $Comment) = ("", "", "", "", "", "");

    # Get the component parts of the line.
    # Can UTF-8 input break this regexp???
    #($Level, $ID, $Tag, $XRef, $Text, $Comment) =
    # --- ----- --- ----- -- --
    #/^\s*(\d+)\s+(@\S+@)?\s*(\S+)s+(@\S+@)?\s+(.*)?({{.*}})?/;
    # 1 2 3 4 5 6

    # Let's just use four (standard GEDCOM) for now.
    # 1 2 3 4
    /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

    # Get level of subrecord
    if ( ! defined ( $1 ) )
    {
    $Level = 0;
    $LineEOL = "NO LEVEL: $_\n";
    }
    else
    {
    $Level = $1;
    }
    print $Level . "\n"; # This part (Level) seems to work

    # Save the label (if specified)
    if ( ! defined ( $2 ) )
    {
    $ID = "";
    }
    else
    {
    $ID = $2;
    }
    print $ID . "\n";

    # Uppercase the tag
    if ( ! defined ( $3 ) )
    {
    $Tag = "";
    $LineEOL = "NO TAG: $_\n";
    }
    else
    {
    $Tag = $3;
    }
    print $Tag . "\n"; # When input line is "0 HEAD" this should be "HEAD"
    # but it is blank instead and the diagnostic "NO TAG"
    # is added.

    # Save everything else
    if ( ! defined ( $4 ) )
    {
    $Text = "";
    }
    else
    {
    $Text = $4;
    }
    print $Text . "\n";

    ----------

    In the form above, ID and Text are unknown, since the script
    wants to open and write to a file with name derived from $Tag.
    So it blows up on the first line "0 HEAD"

    --
    Wes Groleau
    "Lewis's case for the existence of God is fallacious."
    "You mean like circular reasoning?"
    "He believes in God. Therefore, he's fallacious."
     
    Wes Groleau, Dec 26, 2003
    #1
    1. Advertising

  2. "Wes Groleau" <> wrote in message
    news:...
    >

    snipped verbose description of problem

    in case like this you should make a short script only containing the part
    that seems to be failing
    and usually the problem will become clear

    > /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;


    look at this a bit:
    (your input is '0 HEAD')
    ^\s* zero or more whitespace. ok
    (\d+) one ore more digits. ok matches '0', the rest is ' HEAD'
    \s+ one or more ws. matches ' ', rest is 'HEAD'
    (@\S+@)? maches ''
    \s* matches ''
    (\S+) matches 'HEAD' rest is ''
    \s+ does not match . oooooops
    (.*)/;

    maybe your previous input contained space at the end of the '0 HEAD' line

    gnari
     
    Ragnar Hafstað, Dec 26, 2003
    #2
    1. Advertising

  3. Wes Groleau wrote:
    >
    > I have a script to process a certain file format.
    > It was working at one time, but it doesn't now.
    > Obviously I changed something important, but
    > I have no memory of doing so, nor can I see anything wrong.
    >
    > The input format: Each line has up to four parts:
    > Level, ID, Tag, and Text; separated by white space.
    >
    > White space is optional _before_ the level.
    >
    > ID is optional; if present, it is the second part.
    > It is printable characters, starting and ending
    > with '@'
    >
    > Tag is always there. It is a single "word" of
    > letters, no white space. For simplicity, I am
    > going for any non-whitespace characters.
    >
    > Text is optional and is everything after the
    > tag.
    >
    > Here is what is not working:


    use warnings;
    use strict;

    > @lines = <STDIN>; # Read all lines into array @lines
    > foreach (@lines) # Each line is placed into predefined scalar $_
    > {


    Do you really need to slurp the entire input into an array?


    > # Get rid of CR/LF chars for originating platform.
    > # (So this platform's style can be put back on later.)
    > chomp;
    > s/[\r\n]//;


    chomp() removes the contents of $/ from the end of $_. s/[\r\n]//
    removes the FIRST occurrence of either \r or \n from $_. If you want to
    remove any trailing whitespace including \r and \n then:

    s/\s+\Z//;


    > ($Level, $ID, $Tag, $XRef, $Text, $Comment) = ("", "", "", "", "", "");
    >
    > # Get the component parts of the line.
    > # Can UTF-8 input break this regexp???
    > #($Level, $ID, $Tag, $XRef, $Text, $Comment) =
    > # --- ----- --- ----- -- --
    > #/^\s*(\d+)\s+(@\S+@)?\s*(\S+)s+(@\S+@)?\s+(.*)?({{.*}})?/;
    > # 1 2 3 4 5 6
    >
    > # Let's just use four (standard GEDCOM) for now.
    > # 1 2 3 4
    > /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;


    Why not just assign directly to your variables? That way you can
    eliminate all the following else clauses. Also, the @ characters have
    to be escaped or perl will try to interpolate them as arrays.

    my ( $Level, $ID, $Tag, $Text ) =
    /^\s*(\d+)\s*(\@\S+\@)?\s*(\S+)\s*(.*)/;

    # Get level of subrecord
    if ( not defined $Level )
    {
    $Level = 0;
    $LineEOL = "NO LEVEL: $_\n";
    }
    print "$Level\n"; # This part (Level) seems to work

    # Save the label (if specified)
    if ( not defined $ID )
    {
    $ID = '';
    }
    print "$ID\n";

    # Uppercase the tag
    if ( not defined $Tag )
    {
    $Tag = '';
    $LineEOL = "NO TAG: $_\n";
    }
    print "$Tag\n"; # When input line is "0 HEAD" this should be "HEAD"
    # but it is blank instead and the diagnostic "NO TAG"
    # is added.

    # Save everything else
    if ( not defined $Text )
    {
    $Text = '';
    }
    print "$Text\n";



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Dec 27, 2003
    #3
  4. Wes Groleau wrote:
    >
    > Text is optional and is everything after the tag.


    That condition is not reflected in this regex, which requires one or
    more whitespace characters after Tag, or else it won't capture the
    data as expected:

    > /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

    ---------------------------------^^^

    It's better written as:

    /^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;
    ---------------------------------^^^-------^^

    Other comments:

    You should declare your variables and run your script with strictures
    and warnings enabled.

    use strict;
    use warnings;

    > Here is what is not working:


    That's a pointless description of your problem, isn't it? You'd better
    explain what you expect the script to output, what it actually
    outputs, and which error and warning messages you receive (if any).

    > @lines = <STDIN>; # Read all lines into array @lines


    Are you really using the STDIN filehandle for reading the file? Since
    STDIN is a special filehandle, you should use some other name.

    Basically I think your code can be shortened to:

    while (<FILE>) {
    chomp;
    my ($Level, $ID, $Tag, $Text) =
    /^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;
    $Level ||= 0;
    $ID ||= '';
    $Text ||= '';
    print "$Level\n$ID\n$Tag\n$Text\n";
    }

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Dec 27, 2003
    #4
  5. Wes Groleau

    Wes Groleau Guest

    "Ragnar Hafsta��������������������" wrote:
    >> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

    >
    > look at this a bit:
    > (your input is '0 HEAD')
    > ^\s* zero or more whitespace. ok
    > (\d+) one ore more digits. ok matches '0', the rest is ' HEAD'
    > \s+ one or more ws. matches ' ', rest is 'HEAD'
    > (@\S+@)? maches ''
    > \s* matches ''
    > (\S+) matches 'HEAD' rest is ''
    > \s+ does not match . oooooops
    > (.*)/;
    >
    > maybe your previous input contained space at the end of the '0 HEAD' line


    OK, I see that. Thanks. Maybe there was a space.

    However, if the regexp doesn't match, why was
    it able to get the $Level and leave the other
    parts undefined?

    --
    Wes Groleau
    Alive and Well
    http://freepages.religions.rootsweb.com/~wgroleau/
     
    Wes Groleau, Dec 27, 2003
    #5
  6. Wes Groleau

    Wes Groleau Guest

    John W. Krahn wrote:
    > Do you really need to slurp the entire input into an array?


    Not really. I have used foreach (<>) before
    I got this from someone else, and just kept that part.
    But the original regexp didn't work. I made it work
    but now I've broken it again.

    > chomp() removes the contents of $/ from the end of $_. s/[\r\n]//
    > removes the FIRST occurrence of either \r or \n from $_. If you want to
    > remove any trailing whitespace including \r and \n then:
    >
    > s/\s+\Z//;


    OK, thanks. That wasn't a problem in this particular
    environment, but it's nice to know.

    >> # 1 2 3 4
    >> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

    >
    > Why not just assign directly to your variables? That way you can
    > eliminate all the following else clauses. Also, the @ characters have
    > to be escaped or perl will try to interpolate them as arrays.
    >
    > my ( $Level, $ID, $Tag, $Text ) =
    > /^\s*(\d+)\s*(\@\S+\@)?\s*(\S+)\s*(.*)/;


    That was tried, but there was some problem I don't remember.
    However, the '@' has always worked OK. But escaping them
    won't hurt--I remember the error messages from that in other
    contexts.

    So you have helped me quite a bit (thanks), but I still
    don't understand how the regexp as written was able to
    define the $Level but not the $Tag

    --
    Wes Groleau
    -----------
    Daily Hoax: http://www.snopes2.com/cgi-bin/random/random.asp
     
    Wes Groleau, Dec 27, 2003
    #6
  7. Wes Groleau

    Wes Groleau Guest

    Gunnar Hjalmarsson wrote:
    >> Text is optional and is everything after the tag.

    >
    > That condition is not reflected in this regex, which requires one or
    > more whitespace characters after Tag, or else it won't capture the
    > data as expected:
    >
    >> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

    >
    > ---------------------------------^^^


    Sorry, I was unclear. The only way to determine
    the end of the Tag is by whitespace. But as someone
    else pointed out, if there is no text, then there may
    be a line-end instead of white space. Perhaps this
    is what you are saying?

    > It's better written as:
    >
    > /^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;
    > ---------------------------------^^^-------^^


    So here, $4 matches the _inner_ parentheses?
    I don't know what ?: means in this context.


    > That's a pointless description of your problem, isn't it? You'd better
    > explain what you expect the script to output, what it actually
    > outputs, and which error and warning messages you receive (if any).


    Well, I explained that $1 or $Level gets defined,
    $2 ($ID) and $4 ($Text) do not (but weren't expected to
    on the first line), and $3 ($Tag) did not which was a
    surprise and prevented reading any other lines.
    After finding the fatal error of $Tag undefined,
    it seemed pointless to quote the remaining two pages
    of the script.

    >> @lines = <STDIN>; # Read all lines into array @lines

    >
    > Are you really using the STDIN filehandle for reading the file? Since
    > STDIN is a special filehandle, you should use some other name.


    Yes, I pipe cat (file) into the script. I figured
    I should get the output right before improving the UI.

    > Basically I think your code can be shortened to:
    >
    > while (<FILE>) {
    > chomp;


    Oh, I definitely need to convert line ends on files
    from other platforms. (Mac/Win/Unix) But as John
    pointed out, I did it wrong.

    > my ($Level, $ID, $Tag, $Text) =
    > /^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;


    I'll give this a try, along with the other two guys' suggestions.

    > $Level ||= 0;
    > $ID ||= '';
    > $Text ||= '';
    > print "$Level\n$ID\n$Tag\n$Text\n";


    OK, but I definitely want to keep the error reporting
    if Level or Tag is missing.

    Thanks to all three responders for the additional understanding.

    For the curious, what I am doing is reformatting a GEDCOM
    file, AND splitting it into multiple files. Have to
    open a new file every time a Level = 0 line comes in,
    with the filename determined by parts of the line.

    But all of that is later in the script, and is meaningless
    if Tag is undefined or missing.

    I will experiment with what you three guys have offered,
    and I appreciate it very much. Thanks.

    --
    Wes Groleau

    Nobody believes a theoretical analysis -- except the guy who did it.
    Everybody believes an experimental analysis -- except the guy who did it.
    -- Unknown
     
    Wes Groleau, Dec 27, 2003
    #7
  8. Wes Groleau

    Jay Tilton Guest

    Wes Groleau <> wrote:
    : "Ragnar Hafstað" wrote:
    : >> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;
    : >
    : > look at this a bit:
    : > (your input is '0 HEAD')
    : > ^\s* zero or more whitespace. ok
    : > (\d+) one ore more digits. ok matches '0', the rest is ' HEAD'
    : > \s+ one or more ws. matches ' ', rest is 'HEAD'
    : > (@\S+@)? maches ''
    : > \s* matches ''
    : > (\S+) matches 'HEAD' rest is ''
    : > \s+ does not match . oooooops
    : > (.*)/;
    : >
    : > maybe your previous input contained space at the end of the '0 HEAD' line
    :
    : OK, I see that. Thanks. Maybe there was a space.
    :
    : However, if the regexp doesn't match, why was
    : it able to get the $Level and leave the other
    : parts undefined?

    You are probably seeing a stale value that was captured from a prior
    iteration. The values in $1, $2, etc. are not reset when a match fails,
    and the code does not verify that the match is successful before using
    those variables.
     
    Jay Tilton, Dec 27, 2003
    #8
  9. Wes Groleau wrote:
    > Gunnar Hjalmarsson wrote:
    >> Wes Groleau wrote:
    >>>
    >>> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

    >> --------------------------------^^^

    >
    > Sorry, I was unclear. The only way to determine the end of the Tag
    > is by whitespace. But as someone else pointed out, if there is no
    > text, then there may be a line-end instead of white space. Perhaps
    > this is what you are saying?


    Yes, that is what I mean.

    >> It's better written as:
    >>
    >> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;
    >> ---------------------------------^^^-------^^

    >
    > So here, $4 matches the _inner_ parentheses?


    Yes.

    > I don't know what ?: means in this context.


    http://www.perldoc.com/perl5.8.0/pod/perlre.html#(--pattern)

    >> Basically I think your code can be shortened to:
    >>
    >> while (<FILE>) {
    >> chomp;

    >
    > Oh, I definitely need to convert line ends on files from other
    > platforms. (Mac/Win/Unix) But as John pointed out, I did it
    > wrong.


    I just thought that that is better handled when transferring between
    file systems (using "ASCII" or "Text" transfer mode). But maybe it's
    wise to explicitly remove all trailing whitespace as well...

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Dec 27, 2003
    #9
  10. On Fri, 26 Dec 2003 23:48:24 -0500, Wes Groleau wrote:

    > For the curious, what I am doing is reformatting a GEDCOM file, AND
    > splitting it into multiple files. Have to open a new file every time a
    > Level = 0 line comes in, with the filename determined by parts of the
    > line.


    That's what I thought when I read your original post. BTW, why reinvent
    the wheel? Download the Gedcom module from CPAN. I'm sure that module
    does most of what you want.

    Hau'oli Makahiki Hou!
    La'ie Techie
     
    LÄÊ»ie Techie, Dec 27, 2003
    #10
  11. "Wes Groleau" <> wrote in message
    news:...
    >
    > However, if the regexp doesn't match, why was
    > it able to get the $Level and leave the other
    > parts undefined?
    >


    you should try to use print statements to help you debug.
    a couple of them in the if/else blocks would have shown you
    that $1 was indeed undefined, you set $Level=0, and test that later.
    as the value '0' happened to be the value in the line, you just assumed
    that it 'got' the $Level.

    gnari
     
    Ragnar Hafstað, Dec 27, 2003
    #11
  12. Wes Groleau

    Wes Groleau Guest

    Jay Tilton wrote:
    > : However, if the regexp doesn't match, why was
    > : it able to get the $Level and leave the other
    > : parts undefined?
    >
    > You are probably seeing a stale value that was captured from a prior
    > iteration. The values in $1, $2, etc. are not reset when a match fails,
    > and the code does not verify that the match is successful before using
    > those variables.


    Ah, that's useful info, thanks. However, in this case,
    it was after the first line in a new invocation.
    But I will have to explicitly undefine them each time.

    --
    Wes Groleau
    Alive and Well
    http://freepages.religions.rootsweb.com/~wgroleau/
     
    Wes Groleau, Dec 27, 2003
    #12
  13. Wes Groleau

    Wes Groleau Guest

    "Ragnar Hafsta��������������������" wrote:
    > you should try to use print statements to help you debug.
    > a couple of them in the if/else blocks would have shown you
    > that $1 was indeed undefined, you set $Level=0, and test that later.
    > as the value '0' happened to be the value in the line, you just assumed
    > that it 'got' the $Level.


    Ouch. I put the print statements later and "assumed"
    it got the level because the diagnostic "no level"
    wasn't there. But the reason it wasn't there is the
    "no tag" diagnostic overwrote it! :)

    I need to find a boook and learn how to use the debugger.
    I've used it before, but without the book, I could only
    single-step and that was a royal pain.

    --
    Wes Groleau
    ----
    The man who reads nothing at all is better educated
    than the man who reads nothing but newspapers.
    -- Thomas Jefferson
     
    Wes Groleau, Dec 27, 2003
    #13
  14. "Wes Groleau" <> wrote in message
    news:...
    >
    > I need to find a boook and learn how to use the debugger.
    > I've used it before, but without the book, I could only
    > single-step and that was a royal pain.


    in this kind of case, simple print statements would do the trick

    gnari
     
    Ragnar Hafstað, Dec 27, 2003
    #14
  15. Wes Groleau wrote:
    > Jay Tilton wrote:
    >> You are probably seeing a stale value that was captured from a
    >> prior iteration. The values in $1, $2, etc. are not reset when a
    >> match fails, and the code does not verify that the match is
    >> successful before using those variables.

    >
    > Ah, that's useful info, thanks. However, in this case, it was
    > after the first line in a new invocation. But I will have to
    > explicitly undefine them each time.


    I don't think that you can undefine those variables explicitly. This
    is how you typically do it instead:

    if (/^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/) {
    # It matched - do something
    } else {
    # It did not match - do something else
    }

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Dec 27, 2003
    #15
  16. Wes Groleau

    Wes Groleau Guest

    Gunnar Hjalmarsson wrote:
    > I don't think that you can undefine those variables explicitly. This
    > is how you typically do it instead:
    >
    > if (/^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/) {
    > # It matched - do something
    > } else {
    > # It did not match - do something else
    > }


    Isn't "undef" a predefined 'constant' that can be
    assigned to a variable? I haven't tried it, but
    it seems I've read that somewhere...

    I may also try the 'if' but I'd prefer it to
    match every line because if the syntax is bad
    I'd like to use what I can in the diagnostic
    messages.

    --
    Wes Groleau
    "Would the prodigal have gone home if
    the elder brother was running the farm?"
    -- James Jordan
     
    Wes Groleau, Dec 27, 2003
    #16
  17. Wes Groleau

    Wes Groleau Guest

    LÄÊ»ie Techie wrote:
    > That's what I thought when I read your original post. BTW, why reinvent
    > the wheel? Download the Gedcom module from CPAN. I'm sure that module
    > does most of what you want.


    Looked at it long ago, and did a little with it
    but eventually rejected it for reasons I can't
    remember. Maybe I should take another look.

    OTOH, it _feels_ like I'm so close now....
    (which is what many programmers say for the
    last six months of their project.)

    :)

    --
    Wes Groleau
    -----------
    Curmudgeon's Complaints on Courtesy:
    http://www.onlinenetiquette.com/courtesy1.html
    (Not necessarily my opinion, but worth reading)
     
    Wes Groleau, Dec 27, 2003
    #17
  18. Wes Groleau <> wrote:
    > John W. Krahn wrote:
    >> Do you really need to slurp the entire input into an array?

    >
    > Not really. I have used foreach (<>) before



    Don't do that, do:

    while ( <> )

    instead.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Dec 27, 2003
    #18
  19. Wes Groleau <> wrote:
    > Jay Tilton wrote:



    >> The values in $1, $2, etc. are not reset when a match fails,



    > But I will have to explicitly undefine them each time.



    No you won't.

    You will have to ensure that the match _succeeds_ before you
    make use of the dollar-digit (capture) variables.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Dec 27, 2003
    #19
  20. Wes Groleau wrote:
    > Isn't "undef" a predefined 'constant' that can be assigned to a
    > variable?


    Yes, it is. But as regards those special $1, $2, etc. variables, you
    cannot assign anything to them. They are populated only through regex
    capturing.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Dec 27, 2003
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. DP
    Replies:
    0
    Views:
    1,144
  2. Replies:
    3
    Views:
    545
    Markus Svilans
    Oct 20, 2006
  3. tranky
    Replies:
    0
    Views:
    306
    tranky
    Oct 15, 2007
  4. Andries

    I know, I know, I don't know

    Andries, Apr 23, 2004, in forum: Perl Misc
    Replies:
    3
    Views:
    240
    Gregory Toomey
    Apr 23, 2004
  5. Marc Miller

    I don't get what's wrong here...

    Marc Miller, Apr 24, 2006, in forum: Javascript
    Replies:
    9
    Views:
    105
    Dr John Stockton
    Apr 25, 2006
Loading...

Share This Page