Trouble with parsing text file and grabbing values needed

Discussion in 'Perl Misc' started by donaldjones@gmail.com, Jul 21, 2006.

  1. Guest

    I have a large text file with many records that I'd like to parse to
    extract data. I'm trying to figure out conceptually how to pull out
    what I need and put it in a CSV file. Here is a snippet of what 2 records
    look like; the beginning of each record always has a "TEST1:" and an
    "NN:" within the first line:

    -------------------snip----------------------

    TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+
    CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
    PT:2005 2004
    DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
    RRRU NONE
    ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
    ZMTH 11/02/02 2 0102 156 156 11
    ZMTH 11/02/02 2 0202 156 156 11
    ZMTH 11/02/02 2 0302 156 156 22
    ZMTH 11/02/02 2 0402 96 96 11
    TEST1: DTP:07/17/06 SSZ4 NN:745-88 REC:01 UN:pZZ PG: 001+
    CCTL FUN:745-88 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
    PT:2005 2004
    DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
    RRRU NONE
    ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
    ZMTH 11/02/02 2 0102 156 156 11
    ZMTH 11/02/02 2 0202 156 156 11
    ZMTH 11/02/02 2 0302 156 156 22
    ZMTH 11/02/02 2 0402 96 96 11


    -------------------snip----------------------

    Here is an example of what I'm looking for to put in a CSV File, with
    the first line being the header of each piece of data I'm trying to
    grab:

    NN,CFL,Name,UI
    007-74,L0E,JOHN DOE,DI

    So as you can see, I want to be able to pull out the value after each
    ##: (for particular ##:'s or all of them if that's easy) where ## is a
    character representation followed by a colon that identifies each piece
    of data in the text file snippet above. I also want to note that the
    name has no delimiter, but always comes at the beginning of the 4th
    line of each record.

    I'm not looking for someone to write the program for me, but I am
    looking for some ideas on how to go about grabbing this data and
    putting it into another file. I use Perl mainly as a system administrator
    to get tasks done as needed and can figure out where to go if pointed
    in the right direction. I'm used to working with the same predictable
    fields on each line for parsing and splitting, but not with values
    that are spread over multiple lines.

    I've googled for this a few times, but I haven't found quite what I'm
    looking for.

    Any ideas? Any help would be greatly appreciated.
     
    , Jul 21, 2006
    #1

  2. Guest

    wrote:
    > I have a large text file with many records I'd like to parse and
    > extract data. I'm trying to conceptually figure out how to pull out
    > what I need and put in a CSV file. Here is a snipped of what 2 records
    > look like, the beginning of each records always has a "TEST1:" and an
    > "NN:" within the first line:


    Will "TEST1:" ever occur *other* than on the first line of a record?
    If not, then you can set $/='TEST1:' to isolate records. (You will have to
    burn the very first read on the file, because TEST1: is considered the end
    rather than the beginning of each record, which causes a spurious first
    record that is empty apart from TEST1: itself.)
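
    A minimal sketch of that idea, assuming the text is already in a string
    and read through an in-memory filehandle. The helper name read_records
    and the shortened sample data are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: split text into records using 'TEST1:' as the
# input record separator, discarding the spurious first read.
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = 'TEST1:';
    my $first = <$fh>;          # burn the first read: it holds only 'TEST1:'
    my @records;
    while (my $record = <$fh>) {
        chomp $record;          # chomp strips a trailing 'TEST1:' (the value of $/)
        push @records, $record;
    }
    close $fh;
    return @records;
}

# Shortened sample data (assumption: real records have more lines).
my $sample = "TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01\nCCTL FUN:007-74 CFL:L0E\n"
           . "TEST1: DTP:07/17/06 SSZ4 NN:745-88 REC:01\nCCTL FUN:745-88 CFL:L0E\n";

my @records = read_records($sample);
print scalar(@records), " records read\n";
```

    Each record then starts with the text right after "TEST1:" (including
    the leading space), so the first line of the record is still line 1.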


    >
    > -------------------snip----------------------
    >
    > TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+
    > CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
    > PT:2005 2004
    > DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
    > RRRU NONE
    > ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
    > ZMTH 11/02/02 2 0102 156 156 11
    > ZMTH 11/02/02 2 0202 156 156 11
    > ZMTH 11/02/02 2 0302 156 156 22
    > ZMTH 11/02/02 2 0402 96 96 11
    > TEST1: DTP:07/17/06 SSZ4 NN:745-88 REC:01 UN:pZZ PG: 001+
    > CCTL FUN:745-88 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
    > PT:2005 2004
    > DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
    > RRRU NONE
    > ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
    > ZMTH 11/02/02 2 0102 156 156 11
    > ZMTH 11/02/02 2 0202 156 156 11
    > ZMTH 11/02/02 2 0302 156 156 22
    > ZMTH 11/02/02 2 0402 96 96 11
    >
    > -------------------snip----------------------
    >
    > Here is an example of what I'm looking for to put in a CSV File, with
    > the first line being the header of each piece of data I'm trying to
    > grab:
    >
    > NN,CFL,Name,UI
    > 007-74,L0E,JOHN DOE,DI
    >
    > So as you can see, I want to be able to pull out the value after each
    > ##: (for particular ##:'s or all of them if that's easy) where ## is a
    > character representation followed by a colon that identifies each piece
    > of data in the text file snippet above.


    Eh, I don't see that. Do you want just the three you specified (NN, CFL,
    UI) or do you want all the others fitting that format (MT, FSS, etc.) as
    well?

    Also, are these always in the same order? Always on the same line?
    Can any of these field labels be an ending substring of any other of the
    labels?

    > Also want to note that the
    > name has no delimiter, but always comes on the 4th line of each record,
    > at the beginning.


    Skip over 3 lines, then skip over the initial whitespace on the 4th line,
    then take everything up to the first PZY: (not taking the whitespace before
    PZY:)

    $record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/ or die;
    my $name=$1;
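
    As a sketch of how this could feed the CSV, the following wraps the
    match in a helper and also reorders "DOE, JOHN" into "JOHN DOE" as in
    the desired output. The helper name extract_name and the reordering
    step are assumptions, not something stated in the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: pull the name from the 4th line of a record and
# turn "LAST, FIRST" into "FIRST LAST". Returns nothing if no match.
sub extract_name {
    my ($record) = @_;
    my ($name) = $record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/
        or return;
    my ($last, $first) = split /\s*,\s*/, $name, 2;
    return defined $first ? "$first $last" : $name;
}

# One record as it looks after splitting on 'TEST1:'.
my $record = join "\n",
    ' DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+',
    'CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00',
    'PT:2005 2004',
    'DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2',
    'RRRU NONE',
    '';

my $name = extract_name($record);
print defined $name ? "$name\n" : "no name found\n";
```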


    Xho

     
    , Jul 21, 2006
    #2

  3. DJ Guest

    wrote:

    > Will "TEST1:" ever occur *other* than on the first line of a record?
    > If not, then you can set $/='TEST1:' to isolate records. (You will have to
    > burn the very first read on the file, because TEST1: is considered the end
    > rather than the beginning of each record, which causes a spurious empty
    > (other than TEST1: itself) first record.


    "TEST1:" Will only occur on the first line of a record

    > Eh, I don't see that. Do you want just the three you specified (NN, CFL,
    > UI) or do you want all the others fitting that format (MT, FSS, etc.) as
    > well?


    I would like to pull out all of them if possible (NN, CFL, MT, FSS) and
    then decide later which ones to put in the CSV file.

    > Also, are these always in the same order? Always on the same line?
    > Can any of these field labels be a ending substring of any other of the
    > labels?


    It is always the same order, however, some of the records have extra
    lines every so often that we are not interested in. The field labels
    aren't an ending substring of other labels so you can pretty much
    consider them always "field labels" when you see "###:"

    > skip over 3 lines, then skip over the initial white space on the 4th line,
    > then take everything upto the first PZY: (not taking the whitespace before
    > PZY:)
    >
    > $record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/ or die;
    > my $name=$1;


    I'll give this part a shot...

    Thanks for your reply!
     
    DJ, Jul 21, 2006
    #3
  4. Mumia W. Guest

    On 07/21/2006 12:18 PM, wrote:
    > I have a large text file with many records I'd like to parse and
    > extract data. I'm trying to conceptually figure out how to pull out
    > what I need and put in a CSV file. Here is a snipped of what 2 records
    > look like, the beginning of each records always has a "TEST1:" and an
    > "NN:" within the first line:
    >
    > -------------------snip----------------------
    >
    > TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+
    > CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
    > PT:2005 2004
    > [...]


    I would follow xhoster's advice and set $/ (the input record
    separator) to "TEST1:". You can also find a regular
    expression that matches your "##:" sequences. Use the match
    operator with the /g option to get all of them.

    Those "##:" sequences look like they are 2-3 alphabetic
    characters followed by a colon followed by several
    non-whitespace characters.

    Read "perldoc perlrequick" and "perldoc perlre" to find out
    how to make the right regular expression for your needs.
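
    A minimal sketch of that description taken literally (2-3 letters, a
    colon, then a run of non-whitespace). The helper name short_tags is
    made up. Note that on the sample data this narrow reading skips
    "PG: 001+" (space after the colon) and "SEQU:2" (four letters), so the
    pattern would need loosening to catch those:

```perl
use strict;
use warnings;

# Hypothetical helper: collect label => value pairs where the label is
# 2-3 letters followed by a colon and a run of non-whitespace.
sub short_tags {
    my ($text) = @_;
    # In list context, /g returns every (label, value) pair it finds.
    my %tags = $text =~ /\b([A-Za-z]{2,3}):(\S+)/g;
    return %tags;
}

my %tags = short_tags('TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+');
print "$_=$tags{$_}\n" for sort keys %tags;
```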

    Good luck.
     
    Mumia W., Jul 21, 2006
    #4
  5. DJ Stunks Guest

    wrote:
    > I have a large text file with many records I'd like to parse and
    > extract data. I'm trying to conceptually figure out how to pull out
    > what I need and put in a CSV file. Here is a snipped of what 2 records
    > look like, the beginning of each records always has a "TEST1:" and an
    > "NN:" within the first line:
    >
    > -------------------snip----------------------
    >
    > TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+
    > CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
    > PT:2005 2004
    > DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
    > RRRU NONE
    > ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
    > ZMTH 11/02/02 2 0102 156 156 11
    > ZMTH 11/02/02 2 0202 156 156 11
    > ZMTH 11/02/02 2 0302 156 156 22
    > ZMTH 11/02/02 2 0402 96 96 11
    > TEST1: DTP:07/17/06 SSZ4 NN:745-88 REC:01 UN:pZZ PG: 001+
    > CCTL FUN:745-88 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
    > PT:2005 2004
    > DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
    > RRRU NONE
    > ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
    > ZMTH 11/02/02 2 0102 156 156 11
    > ZMTH 11/02/02 2 0202 156 156 11
    > ZMTH 11/02/02 2 0302 156 156 22
    > ZMTH 11/02/02 2 0402 96 96 11
    >
    >
    > -------------------snip----------------------


    Ugly. I parse records similar to this using Parse::RecDescent, but
    it's slow, so if you have 10,000,000 of them you might be waiting a
    while...

    Plus the learning curve for P::RD is pretty steep.

    -jp
     
    DJ Stunks, Jul 21, 2006
    #5
  6. wrote:

    > $record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/ or die;
    > my $name=$1;


    I prefer to see that written as:

    my ($name)=$record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/ or die;
     
    Brian McCauley, Jul 22, 2006
    #6
  7. Mumia W. wrote:
    > On 07/21/2006 12:18 PM, wrote:
    > > I have a large text file with many records I'd like to parse and
    > > extract data. I'm trying to conceptually figure out how to pull out
    > > what I need and put in a CSV file. Here is a snipped of what 2 records
    > > look like, the beginning of each records always has a "TEST1:" and an
    > > "NN:" within the first line:
    > >
    > > -------------------snip----------------------
    > >
    > > TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+
    > > CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
    > > PT:2005 2004
    > > [...]

    >
    > I would follow xhoster's advise and set $/ (the record
    > separator sequence) to "TEST1:". You can also find a regular
    > expression that matches your "##:" sequences. Use the match
    > operator with the /g option to get all of them.


    I would consider "\nTEST1:", although this would mean the first record
    had TEST1: at each end rather than having a first record containing only
    "TEST1:".

    >
    > Those "##:" sequences look like they are 2-3 alphabetic
    > characters followed by a colon followed by several
    > non-whitespace characters.
    >
    > Read "perldoc perlrequick" and "perldoc perlre" to find out
    > how to make the right regular expression for your needs.


    I think we can be a little more help:

    my %tagged_data = /(\w+):\s*(\S+)/g;

    Note: this assumes that the data values are not to contain whitespace
    and are never null.

    "DTP:07/17/06 SSZ4" is SSZ4 part of the DTP data item?

    If data can contain whitespace or can be empty then the pattern needs
    to be more complex as you need a lookahead to see if we've reached the
    end of the data item.

    my %tagged_data = /(\w+):\s*(.*?)\s*(?=\n|\w+:)/g; # Untested
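
    A sketch of how the two patterns above might be used on one record and
    turned into a CSV line. The helper names, the explicit $record argument
    (the patterns above match against $_), and the choice of columns are
    assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: values contain no whitespace and are never empty.
sub tagged_data {
    my ($record) = @_;
    my %tagged = $record =~ /(\w+):\s*(\S+)/g;
    return \%tagged;
}

# Hypothetical helper: values may contain whitespace or be empty; a value
# ends at a newline or just before the next "label:".
sub tagged_data_ws {
    my ($record) = @_;
    my %tagged = $record =~ /(\w+):\s*(.*?)\s*(?=\n|\w+:)/g;
    return \%tagged;
}

my $record = join "\n",
    ' DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+',
    'CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00',
    'PT:2005 2004',
    'DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2',
    '';

my $data    = tagged_data($record);
my @columns = qw(NN CFL UI);            # pick any labels you want later
print join(',', @columns), "\n";
print join(',', map { defined $data->{$_} ? $data->{$_} : '' } @columns), "\n";
```

    With the second helper, "SSZ4" ends up as part of the DTP value, which
    is exactly the ambiguity raised above.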
     
    Brian McCauley, Jul 22, 2006
    #7
  8. Dr.Ruud Guest

    wrote:

    > skip over 3 lines, then skip over the initial white space on the 4th
    > line, then take everything upto the first PZY: (not taking the
    > whitespace before PZY:)
    >
    > $record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/ or die;
    > my $name=$1;


    That would also match

    ---------------------
    A
    B
    C

    D


    PZY:
    ---------------------

    but maybe that's OK.

    One should use [[:blank:]] (or [ \t]), and not \s, if one only wants to
    match SP and TAB.

    A nice shortcut for [[:blank:]] would be \h, for horizontal whitespace,
    though CR should probably not be included.
    See http://dev.perl.org/perl6/doc/design/apo/A05.html about \h and \v.
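
    A small sketch of the difference, using the counter-example above as
    the test string. The variable and helper names are made up:

```perl
use strict;
use warnings;

# The counter-example: the 4th line is empty, and "D" and "PZY:" sit on
# later lines.
my $odd_record = "A\nB\nC\n\nD\n\n\nPZY:\n";

# \s also matches newlines, so the pattern can wander across lines.
my $loose = qr/^.*\n.*\n.*\n\s*(.*?)\s+PZY:/;
# [ \t] restricts the match to spaces and tabs on the 4th line itself.
my $tight = qr/^.*\n.*\n.*\n[ \t]*(.*?)[ \t]+PZY:/;

# Returns 1 if the text matches the pattern, 0 otherwise.
sub matches {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}

print "loose pattern matches: ", matches($loose, $odd_record), "\n";
print "tight pattern matches: ", matches($tight, $odd_record), "\n";
```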

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jul 22, 2006
    #8
  9. DJ Guest

    Thanks much for all the responses; I was able to get that data out the
    way I want it. Setting up that hash was really nice because I can
    choose what I want out of there at any time. One more challenge I'm
    having: I want to be able to pull out any lines within each record
    that match a certain pattern, where one line contains a specific number
    at a specific place and the next line has a different specific number in
    a specific place. Here is an example:

    ZMTH 11/02/02 2 0102 156 156 11
    ZMTH 11/02/02 2 0202 156 156 11
    ZMTH 11/02/02 2 0302 156 156 22
    ZMTH 11/02/02 2 0402 96 96 11

    The above is normal output. Say this output changes to this:

    ZMTH 11/02/02 2 0102 156 156 11
    ZMTH 11/02/02 6 0202 156 156 11
    ZMTH 11/02/02 9 0302 156 156 22
    ZMTH 11/02/02 2 0402 96 96 11

    The second line here has a "6" in the third column and the third line
    has a "9" in the third column. I want to know every time this happens
    where lines have the 6 immediately followed by a 9 and extract out the
    values in each column and probably put into a hash for grabbing
    specific output later on through the program. I'm still treating the
    record delimiter as "\nTEST1:".
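
    One possible way to approach this, as a rough sketch only: walk the
    lines of a record, remember the previous ZMTH line, and report a pair
    whenever the third column goes from 6 to 9. The helper name
    six_nine_pairs and the assumption that the lines of interest always
    start with ZMTH are mine:

```perl
use strict;
use warnings;

# Hypothetical helper: return pairs of adjacent ZMTH lines where the
# third column is 6 on one line and 9 on the next. Each line is returned
# as an array ref of its whitespace-separated columns.
sub six_nine_pairs {
    my ($record) = @_;
    my @pairs;
    my $prev;                                    # columns of the previous ZMTH line
    for my $line (split /\n/, $record) {
        my @cols = split ' ', $line;
        my $is_zmth = @cols >= 3 && $cols[0] eq 'ZMTH';
        if ($is_zmth && $prev && $prev->[2] eq '6' && $cols[2] eq '9') {
            push @pairs, [ $prev, [@cols] ];
        }
        $prev = $is_zmth ? [@cols] : undef;      # other lines break adjacency
    }
    return @pairs;
}

my $record = join "\n",
    'ZMTH 11/02/02 2 0102 156 156 11',
    'ZMTH 11/02/02 6 0202 156 156 11',
    'ZMTH 11/02/02 9 0302 156 156 22',
    'ZMTH 11/02/02 2 0402 96 96 11';

for my $pair (six_nine_pairs($record)) {
    my ($six, $nine) = @$pair;
    print "6 line: @$six\n9 line: @$nine\n";
}
```

    The column arrays could then be stored in a hash keyed by NN or any
    other label, depending on what needs to be looked up later.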

    Thanks again!
     
    DJ, Jul 22, 2006
    #9