strategy for parsing text file

Discussion in 'Perl Misc' started by ccc31807, Aug 28, 2009.

  1. ccc31807

    ccc31807 Guest

    I've solved this problem, but I'm just curious as to how my betters
    would approach this.

    The file is a long file, so I have copied only the first seven records
    below as an example. The file is from a table with nine fields, all of
    which are named in the first nine lines. The key is a five digit
    number beginning with either 91 or 92. For each record, sometimes all
    fields are populated (like the first, 91709), but normally only the
    first four are guaranteed to be populated while the remaining five may
    or may not have values. Each datum occupies a line all to itself, and
    the file does not contain record separators.

    The requirement is to capture the first four fields and write to an
    Excel readable file (CSV format).

    My solution was pretty dirty and crude, but I'll share it later (and
    take the hit for stupidity). My question is how others might approach
    the problem. Below are the first seven records of the file and the
    column header.

    Thanks, CC.

    -------------file below--------------------
    Number
    BandName
    Grade
    Branch
    Instr
    PipingInst
    PInstDate
    DrumInst
    DrumInstDate
    91709
    87th Cleveland Pipe Band IV
    PB4
    Ohio Valley
    y
    Tyler Tagliafero, Great Lakes
    01-Mar-09
    Drew Donnelly, Great Lakes
    01-Mar-09
    91068
    Adirondack Pipes & Drums
    PB5
    Northeast
    n
    91212
    Alabama Pipes & Drums
    PB4
    Southern
    n
    91801
    Albany Police P&D
    PB5
    Northeast
    y
    Dan Cole, Oran Mor
    01-Mar-09
    92033
    American Celtic Pipe Band
    PB5
    Metro
    n
    91826
    Anderson Pipe Band
    PB5
    Southwest
    y
    Victor Anderson, Westminster
    01-Mar-09
    Tim Vermillion, Westminster
    01-Mar-09
    91802
    AOH Pipe & Drum Band
    PB5
    Northeast
    n
    ccc31807, Aug 28, 2009
    #1

  2. ccc31807

    ccc31807 Guest

    On Aug 28, 5:59 pm, Tad J McClellan <> wrote:
    > I will assume that you are absolutely certain that none of the other
    > field's values will match that specification...


    Absolutely!

    > -------------------
    > #!/usr/bin/perl
    > use warnings;
    > use strict;
    > use Data::Dumper;
    >
    > while ( <DATA> ) {
    >     next unless /^9[12]\d\d\d$/;  # 5 digits, starts with 91 or 92
    >     my @record = $_;
    >     push @record, scalar(<DATA>) for 1..3;
    >     chomp @record;
    >     print Dumper \@record;
    > }


    I see ... you can access the file within the while loop by using the
    <> in an inner loop. I maybe should have thought of that, but I had to
    produce it quickly and didn't want to experiment.

    Thanks, and here is the guts of my solution. Pretty crude, but it
    worked.

    open INFILE, '<', 'bands.txt';
    while (<INFILE>)
    {
        next unless /\w/;
        print; #debugging
        chomp;
        if (/9[12]\d{3}/)
        {
            $count++;
            $key = $_;
            $flag = 1;
        }
        elsif ($flag == 1)
        {
            $bands{$key}{name} = $_;
            $flag = 2;
        }
        elsif ($flag == 2)
        {
            $bands{$key}{grade} = $_;
            $flag = 3;
        }
        elsif ($flag == 3)
        {
            $bands{$key}{branch} = $_;
            $flag = 0;
        }
    }

    #print the %bands hash to a .csv file
    ccc31807, Aug 28, 2009
    #2

  3. ccc31807

    Steve C Guest

    RedGrittyBrick wrote:
    > ccc31807 wrote:
    >> I've solved this problem, but I'm just curious as to how my betters
    >> would approach this.
    >>
    >> The file is a long file, so I have copied only the first seven records
    >> below as an example. The file is from a table with nine fields, all of
    >> which are named in the first nine lines. The key is a five digit
    >> number beginning with either 91 or 92. For each record, sometimes all
    >> fields are populated (like the first, 91709), but normally only the
    >> first four are guaranteed to be populated while the remaining five may
    >> or may not have values. Each datum occupies a line all to itself, and
    >> the file does not contain record separators.
    >>
    >> The requirement is to capture the first four fields and write to an
    >> Excel readable file (CSV format).
    >>
    >> My solution was pretty dirty and crude, but I'll share it later (and
    >> take the hit for stupidity). My question is how others might approach
    >> the problem. Below are the first seven records of the file and the
    >> column header.
    >>

    >
    > #!perl
    > use strict;
    > use warnings;
    >
    > my @f;
    > while (<DATA>) {
    >   chomp;
    >   if (/^9[12]\d{3}$/) {
    >     print join (',', @f), "\n" if @f;
    >     @f=();
    >   }
    >   push @f, $_;
    > }
    >
    > __DATA__
    >
    >
    >

    I think you are losing the last record.
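
One way to keep the last record, as a minimal sketch under the same key pattern: flush whatever is still in @f once the loop has ended. The sub name records_to_csv and the in-memory list of lines are hypothetical, used only to make the example self-contained. Like the quoted script, it emits every field and would also emit the header lines as a first "record" if they were passed in.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of lines, returns one CSV line per record.
sub records_to_csv {
    my @lines = @_;
    my (@f, @out);
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^9[12]\d{3}$/) {      # a key line starts a new record
            push @out, join(',', @f) if @f;
            @f = ();
        }
        push @f, $line;
    }
    push @out, join(',', @f) if @f;         # flush the final record
    return @out;
}

# Two records taken from the sample data in the first post.
print "$_\n" for records_to_csv(
    "91068\n", "Adirondack Pipes & Drums\n", "PB5\n", "Northeast\n", "n\n",
    "91802\n", "AOH Pipe & Drum Band\n",     "PB5\n", "Northeast\n", "n\n",
);
```

Without the push after the loop, only the 91068 line would come out, because a record is emitted only when the next key is seen.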
    Steve C, Aug 28, 2009
    #3
  4. ccc31807

    Guest

    On Fri, 28 Aug 2009 14:25:53 -0700 (PDT), ccc31807 <> wrote:

    >I've solved this problem, but I'm just curious as to how my betters
    >would approach this.
    >
    >The file is a long file, so I have copied only the first seven records
    >below as an example. The file is from a table with nine fields, all of
    >which are named in the first nine lines. The key is a five digit
    >number beginning with either 91 or 92. For each record, sometimes all
    >fields are populated (like the first, 91709), but normally only the
    >first four are guaranteed to be populated while the remaining five may
    >or may not have values. Each datum occupies a line all to itself, and
    >the file does not contain record separators.
    >
    >The requirement is to capture the first four fields and write to an
    >Excel readable file (CSV format).
    >
    >My solution was pretty dirty and crude, but I'll share it later (and
    >take the hit for stupidity). My question is how others might approach
    >the problem. Below are the first seven records of the file and the
    >column header.
    >
    >Thanks, CC.
    >


    This is gnarly too. It's just the inner while if you
    can slurp the whole file, but somehow I don't think
    you want that.

    -sln

    Output:

    "Number","BandName","Grade","Branch","Instr","PipingInst","PInstDate","DrumInst","DrumInstDate"
    "91709","87th Cleveland Pipe Band IV","PB4","Ohio Valley"
    "91068","Adirondack Pipes & Drums","PB5","Northeast"
    "91212","Alabama Pipes & Drums","PB4","Southern"
    "91801","Albany Police P&D","PB5","Northeast"
    "92033","American Celtic Pipe Band","PB5","Metro"
    "91826","Anderson Pipe Band","PB5","Southwest"
    "91802","AOH Pipe & Drum Band","PB5","Northeast"

    ==========

    use strict;
    use warnings;

    my ($header,$line,$data) = (1);

    while ($line=<DATA>)
    {
        $line = '' if $line =~ /^\s*$/;
        my $end = eof(DATA);
        $data .= $line if $end;

        if ($end || $line =~ /^9[12]\d{3}/)
        {
            # process header
            if ($header) {
                $header = 0;
                my $cnt = 1;
                $data =~ /((?:^.*\n){9})/mg;
                print "\"$_\"".($cnt++ < 9 ? ',':"\n") for (split /\n/, $1);
            }
            # process record
            else {
                while ($data =~ /(^9[12]\d{3}\n(?:^(?!9[12]\d{3}).*\n){4,8})/mg)
                {
                    my $cnt = 1;
                    print "\"$_\"".($cnt++ < 4 ? ',':"\n") for (split /\n/, $1)[0..3];
                }
            }
            $data = $line;
            next;
        }
        $data .= $line;
    }
    , Aug 29, 2009
    #4
  5. ccc31807

    Guest


    >use strict;
    >use warnings;
    >
    >my ($header,$line,$data) = (1);
    >
    >while ($line=<DATA>)
    >{
    > $line = '' if $line =~ /^\s*$/;
    > my $end = eof(DATA);
    > $data .= $line if $end;
    >
    > if ($end || $line =~ /^9[12]\d{3}/)
    > {

            # process header
            if ($header) {
                $header = 0;
                my $cnt = 1;
                if ($data =~ /((?:^.*\n){9})/mg) {
                    print "\"$_\"".($cnt++ < 9 ? ',':"\n") for (split /\n/, $1);
                }
            }
            # process record
            else {
                my $cnt = 1;
                if ($data =~ /(^9[12]\d{3}\n(?:^.*\n){4,8})/mg) {
                    print "\"$_\"".($cnt++ < 4 ? ',':"\n") for (split /\n/, $1)[0..3];
                }
            }
    > $data = $line;
    > next;
    > }
    > $data .= $line;
    >}


    Sorry, here is the short version. The 'while' in the earlier "process record"
    branch was for the case where the file is slurped, and it used a negative
    lookahead. It still works for a single record, but it is not needed.

    -sln
    , Aug 29, 2009
    #5
  6. ccc31807 wrote:
    > I've solved this problem, but I'm just curious as to how my betters
    > would approach this.
    >
    > The file is a long file, so I have copied only the first seven records
    > below as an example. The file is from a table with nine fields, all of
    > which are named in the first nine lines. The key is a five digit
    > number beginning with either 91 or 92. For each record, sometimes all
    > fields are populated (like the first, 91709), but normally only the
    > first four are guaranteed to be populated while the remaining five may
    > or may not have values. Each datum occupies a line all to itself, and
    > the file does not contain record separators.
    >
    > The requirement is to capture the first four fields and write to an
    > Excel readable file (CSV format).
    >
    > My solution was pretty dirty and crude, but I'll share it later (and
    > take the hit for stupidity). My question is how others might approach
    > the problem. Below are the first seven records of the file and the
    > column header.
    >
    > Thanks, CC.



    my @data = [];
    while ( <FILE> ) {
        chomp;
        /^9[12]/ && push @data, [];
        push @{ $data[ -1 ] }, qq/"$_"/;
        if ( @data == 2 || eof ) {
            no warnings 'uninitialized';
            print join( ',', @{ shift @data }[ 0 .. 8 ] ), "\n";
        }
    }




    John
    --
    Those people who think they know everything are a great
    annoyance to those of us who do. -- Isaac Asimov
    John W. Krahn, Aug 29, 2009
    #6
  7. ccc31807

    ccc31807 Guest

    On Aug 28, 9:04 pm, "John W. Krahn" <> wrote:

    John, sorry, but I haven't seen some of what you used. Do you mind
    helping me out?

    [] returns a reference to an anonymous array, right? How does it work
    assigning it to an array type?
    > my @data = [];
    > while ( <FILE> ) {
    >      chomp;


    I understand the use of the conjunctive Boolean, but again, I don't
    understand how pushing [] to the array works.
    >      /^9[12]/ && push @data, [];


    This pushes $_ to the end of the array, but how do you designate the
    value of $_ in this case?
    >      push @{ $data[ -1 ] }, qq/"$_"/;
    >      if ( @data == 2 || eof ) {
    >          no warnings 'uninitialized';


    Why '8'? The problem is that the values can be anywhere from three to
    eight, and you don't know how many or which ones.
    >          print join( ',', @{ shift @data }[ 0 .. 8 ] ), "\n";
    >          }
    >      }


    When I looked at the data file, I saw this pseudocode:
    read each line
        if the line is the key:
            save the value as a key
            read the next three lines
            write each value as the value of a hash element for the key

    Two points -- (1) I didn't take the time to explore accessing the
    lines of the file in an inner loop, although that occurred to me,
    which is why Tad's example made the light bulb light up. (2) It seems
    much more natural to use a hash rather than an array to hold the data
    elements, and now I'm wondering if using an array to hold the records
    is a better solution.

    The output part of my script looks like this:
    foreach my $k (keys %bands)
    {
        print OUTFILE qq("$k","$bands{$k}{name}","$bands{$k}{grade}","$bands{$k}{branch}"\n);
    }

    To me, this looks a lot more intuitive and understandable than some of
    the print statements above, which look convoluted (if not obfuscated)
    to me.

    CC.
    ccc31807, Aug 29, 2009
    #7
  8. ccc31807 wrote:
    > On Aug 28, 9:04 pm, "John W. Krahn" <> wrote:
    >
    > John, sorry, but I haven't seen some of what you used. Do you mind
    > helping me out?


    Ok, I'll try. :)


    > [] returns a reference to an anonymous array, right? How does it work
    > assigning it to an array type?


    Just the same as assigning any scalar to an array. The first element of
    the array now contains a reference to an array.
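
A small sketch of what that means in practice. It uses only core Perl; the value 'PB5' is just a placeholder borrowed from the sample data.

```perl
use strict;
use warnings;

my @data = [];                        # one element: a reference to an empty array
print scalar(@data), "\n";            # prints 1
print ref($data[0]), "\n";            # prints ARRAY

push @data, [];                       # second element, another array reference
push @{ $data[-1] }, 'PB5';           # dereference the last element and append to it
print scalar(@{ $data[-1] }), "\n";   # prints 1
print scalar(@{ $data[0] }), "\n";    # prints 0
```

So @data is an ordinary array whose elements happen to be array references, and $data[-1] is always the record currently being filled.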


    >> my @data = [];
    >> while ( <FILE> ) {
    >> chomp;

    >
    > I understand the use of the conjunctive Boolean, but again, I don't
    > understand how pushing [] to the array works.
    >> /^9[12]/ && push @data, [];


    That adds a scalar value onto the end of the array. In this case the
    scalar value is a reference to an array.


    > This pushes $_ to the end of the array, but how do you designate the
    > value of $_ in this case?


    I don't know what you mean by "designate the value of $_"?


    >> push @{ $data[ -1 ] }, qq/"$_"/;
    >> if ( @data == 2 || eof ) {
    >> no warnings 'uninitialized';

    >
    > Why '8'? The problem is that the values can be anywhere from three to
    > eight, and you don't know how many or which ones.
    >> print join( ',', @{ shift @data }[ 0 .. 8 ] ), "\n";


    I assumed that you meant that each record *should* have 9 fields, but if
    that is not what you want then just remove the '[ 0 .. 8 ]' part.


    >> }
    >> }

    >
    > When I looked at the data file, I saw this pseudocode:
    > read each line
    > if the line is the key:
    > save the value as a key
    > read the next three lines
    > write each value as the value of a hash element for the key
    >
    > Two points -- (1) I didn't take the time to explore accessing the
    > lines of the file in an inner loop, although that occurred to me,
    > which is why Tad's example made the light bulb light up. (2) It seems
    > much more natural to use a hash rather than an array to hold the data
    > elements, and now I'm wondering if using an array to hold the records
    > is a better solution.


    TMTOWTDI ;-)


    > The output part of my script looks like this:
    > foreach my $k (keys %bands)
    > {
    >     print OUTFILE qq("$k","$bands{$k}{name}","$bands{$k}{grade}","$bands{$k}{branch}"\n);
    > }
    >
    > To me, this looks a lot more intuitive and understandable than some of
    > the print statements above, which look convoluted (if not obfuscated)
    > to me.



    John
    --
    Those people who think they know everything are a great
    annoyance to those of us who do. -- Isaac Asimov
    John W. Krahn, Aug 29, 2009
    #8
  9. ccc31807

    Guest

    On Aug 28, 3:50 pm, Steve C <> wrote:
    > RedGrittyBrick wrote:
    >
    > > #!perl
    > > use strict;
    > > use warnings;

    >
    > > my @f;
    > > while (<DATA>) {
    > >   chomp;
    > >   if (/^9[12]\d{3}$/) {
    > >     print join (',', @f), "\n" if @f;
    > >     @f=();
    > >   }
    > >   push @f, $_;
    > > }

    >
    > > __DATA__

    >
    > I think you are losing the last record.
    >


    That script has one more flaw. It prints all the elements of @f,
    whereas the OP wants only the first 4 elements.
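
A minimal sketch of limiting the output to the first four fields with an array slice. The contents of @f are copied from the first sample record, and $csv_line is a hypothetical name. No quoting is applied, which is only safe here because none of the first four fields in the sample contains a comma.

```perl
use strict;
use warnings;

# One accumulated record, as @f would hold it just before printing.
my @f = ('91709', '87th Cleveland Pipe Band IV', 'PB4', 'Ohio Valley', 'y',
         'Tyler Tagliafero, Great Lakes', '01-Mar-09');

my $csv_line = join ',', @f[0 .. 3];   # array slice: first four fields only
print "$csv_line\n";
```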
    , Aug 29, 2009
    #9
  10. ccc31807

    ccc31807 Guest

    On Aug 28, 10:53 pm, Tad J McClellan <> wrote:
    > while ( <DATA> ) {
    >     next unless /^9[12]\d\d\d$/;  # 5 digits, starts with 91 or 92
    >     my %record = (number => $_);
    >     $record{bandname} = <DATA>;
    >     $record{grade} = <DATA>;
    >     $record{branch} = <DATA>;
    >     chomp %record;
    >     print Dumper \%record;
    > }


    Yes. This is almost identical to what I had after I saw your first
    solution, except for a small variation in the hash variable. I chose a
    hash because I anticipated a need to sort by branch and possibly by
    grade.
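
For the sorting mentioned here, a minimal sketch, assuming a %bands hash shaped like the one in the earlier script (key => { name, grade, branch }). The three entries are copied from the sample data; the final tie-break on the numeric key is an assumption added only to make the order deterministic.

```perl
use strict;
use warnings;

my %bands = (
    91709 => { name => '87th Cleveland Pipe Band IV', grade => 'PB4', branch => 'Ohio Valley' },
    91068 => { name => 'Adirondack Pipes & Drums',    grade => 'PB5', branch => 'Northeast'   },
    91801 => { name => 'Albany Police P&D',           grade => 'PB5', branch => 'Northeast'   },
);

# Sort by branch, then by grade, then by key.
my @sorted = sort {
       $bands{$a}{branch} cmp $bands{$b}{branch}
    || $bands{$a}{grade}  cmp $bands{$b}{grade}
    || $a <=> $b
} keys %bands;

for my $k (@sorted) {
    print qq("$k","$bands{$k}{name}","$bands{$k}{grade}","$bands{$k}{branch}"\n);
}
```

With an array of records the same comparison would work on the array elements instead of on hash keys, so the choice between hash and array is mostly about whether lookup by key is needed.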

    This was a throwaway script, that I ran exactly once, so while I agree
    with checking the value of open() and using more meaningful names,
    this was just the first cut and was all I needed.

    Thanks for your help. I now know about using <> in inner loops.

    CC.
    ccc31807, Aug 29, 2009
    #10
  11. ccc31807

    Guest

    On Sat, 29 Aug 2009 03:33:26 -0700 (PDT), ccc31807 <> wrote:

    >On Aug 28, 10:53 pm, Tad J McClellan <> wrote:
    >> while ( <DATA> ) {
    >>     next unless /^9[12]\d\d\d$/;  # 5 digits, starts with 91 or 92
    >>     my %record = (number => $_);
    >>     $record{bandname} = <DATA>;
    >>     $record{grade} = <DATA>;
    >>     $record{branch} = <DATA>;
    >>     chomp %record;
    >>     print Dumper \%record;
    >> }

    >
    >Yes. This is almost identical to what I had after I saw your first
    >solution, except for a small variation in the hash variable. I chose a
    >hash because I anticipated a need to sort by branch and possibly by
    >grade.
    >
    >This was a throwaway script, that I ran exactly once, so while I agree
    >with checking the value of open() and using more meaningful names,
    >this was just the first cut and was all I needed.
    >
    >Thanks for your help. I now know about using <> in inner loops.
    >
    >CC.


    Another difference is that you are accumulating a hash of all the
    records, while he is just making a temporary hash on a
    record-by-record basis.

    Neither way cares about error checking, blank lines, headers,
    field position or any validation whatsoever.
    So, among all the responses here, no method or technique is better
    or worse in this light; it's all just throwaway.


    while (<DATA>) {
        chomp;
        (/^9[12]\d{3}$/ and
            @{$bands{$_}}{'name','grade','branch'}
                = split /\n/, <DATA>.<DATA>.<DATA>)
    }

    or same, but slurp file ..

    $_ = join '',<DATA>;
    while (/(^9[12]\d{3})\n((?:^(?!9[12]\d{3}\n).*\n){3})/mg) {
        @{$bands{$1}}{'name','grade','branch'} = split /\n/, $2;
    }

    -sln
    , Sep 1, 2009
    #11
  12. ccc31807

    ccc31807 Guest

    On Sep 1, 2:05 pm, wrote:
    > So, among all the responses here, no method or technique is better
    > or worse in this light; it's all just throwaway.
    >
    > while (<DATA>) {
    >     chomp;
    >     (/^9[12]\d{3}$/ and
    >     @{$bands{$_}}{'name','grade','branch'}
    >       = split /\n/, <DATA>.<DATA>.<DATA>)
    >
    > }


    Yes! I like this!

    I have developed a habit of using a hash slice when dealing with data
    files that come with their own header, and I use the slice to populate a
    hash for each line to manage and mangle the output.
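
As a minimal sketch of that habit: the header names become the keys and one record's values are assigned through a hash slice. The names @header, @values and %record are hypothetical, and the data is taken from the sample file.

```perl
use strict;
use warnings;

my @header = qw(Number BandName Grade Branch);
my @values = ('91068', 'Adirondack Pipes & Drums', 'PB5', 'Northeast');

my %record;
@record{@header} = @values;    # hash slice: assign all four keys at once

print "$record{BandName} ($record{Grade}), $record{Branch}\n";
```

The slice pairs keys and values by position, so it relies on the values arriving in the same order as the header.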

    Sometimes I have a need to sort the data by some strange and alien
    method, so I have also developed the habit of using a hash for the
    data. Recently I have built several scripts that output PDFs of
    multiple records categorized in various ways, and have found that
    hashes are ideal for this purpose.

    Anyway, the essential insight is that <> can be used to get the next
    record regardless of the level of the braces.

    CC

    >
    > or same, but slurp file ..
    >
    > $_ = join '',<DATA>;
    > while (/(^9[12]\d{3})\n((?:^(?!9[12]\d{3}\n).*\n){3})/mg) {
    >   @{$bands{$1}}{'name','grade','branch'} = split /\n/, $2;
    >
    > }
    >
    > -sln
    ccc31807, Sep 1, 2009
    #12