Trouble with parsing text file and grabbing values needed

donaldjones

I have a large text file with many records that I'd like to parse and
extract data from. I'm trying to figure out conceptually how to pull out
what I need and put it in a CSV file. Here is a snippet of what 2 records
look like; the beginning of each record always has a "TEST1:" and an
"NN:" within the first line:

-------------------snip----------------------

TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+
CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
PT:2005 2004
DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
RRRU NONE
ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
ZMTH 11/02/02 2 0102 156 156 11
ZMTH 11/02/02 2 0202 156 156 11
ZMTH 11/02/02 2 0302 156 156 22
ZMTH 11/02/02 2 0402 96 96 11
TEST1: DTP:07/17/06 SSZ4 NN:745-88 REC:01 UN:pZZ PG: 001+
CCTL FUN:745-88 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
PT:2005 2004
DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
RRRU NONE
ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
ZMTH 11/02/02 2 0102 156 156 11
ZMTH 11/02/02 2 0202 156 156 11
ZMTH 11/02/02 2 0302 156 156 22
ZMTH 11/02/02 2 0402 96 96 11


-------------------snip----------------------

Here is an example of what I'm looking for to put in a CSV File, with
the first line being the header of each piece of data I'm trying to
grab:

NN,CFL,Name,UI
007-74,L0E,JOHN DOE,DI

So as you can see, I want to be able to pull out the value after each
##: (for particular ##:'s or all of them if that's easy) where ## is a
character representation followed by a colon that identifies each piece
of data in the text file snippet above. Also want to note that the
name has no delimiter, but always comes on the 4th line of each record,
at the beginning.

I'm not looking for someone to write the program for me, but was
looking for some ideas on how to go about grabbing this data out and
putting into another file. I use Perl mainly as a system administrator
to get tasks done as needed and can figure out where to go if pointed
in the right direction. I'm used to working with the same predictable
fields on each line for parsing and splitting, but not with values that
can be spread across multiple lines.

I've googled for this a few times, but I haven't found quite what I'm
looking for.

Any ideas? Any help would be greatly appreciated.
 
xhoster

I have a large text file with many records that I'd like to parse and
extract data from. I'm trying to figure out conceptually how to pull out
what I need and put it in a CSV file. Here is a snippet of what 2 records
look like; the beginning of each record always has a "TEST1:" and an
"NN:" within the first line:

Will "TEST1:" ever occur *other* than on the first line of a record?
If not, then you can set $/='TEST1:' to isolate records. (You will have to
burn the very first read on the file, because TEST1: is considered the end
rather than the beginning of each record, which causes a spurious first
record that is empty apart from TEST1: itself.)
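
As a rough sketch of the $/ approach described above: the in-memory sample
data and the variable names here are made up for illustration, and a real
script would open the data file instead.

```perl
use strict;
use warnings;

# Sample input held in a string so the sketch is self-contained.
my $data = <<'END';
TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+
CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
TEST1: DTP:07/17/06 SSZ4 NN:745-88 REC:01 UN:pZZ PG: 001+
CCTL FUN:745-88 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
END

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = 'TEST1:';       # each read now ends at the next "TEST1:"
    my $discard = <$fh>;       # burn the first read: it holds only "TEST1:"
    while (my $record = <$fh>) {
        chomp $record;         # chomp strips a trailing "TEST1:" (the value of $/)
        push @records, $record;
    }
}

printf "%d records\n", scalar @records;
```

Each element of @records then holds one record's text without the leading
"TEST1:" marker.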

-------------------snip----------------------

TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+
CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
PT:2005 2004
DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
RRRU NONE
ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
ZMTH 11/02/02 2 0102 156 156 11
ZMTH 11/02/02 2 0202 156 156 11
ZMTH 11/02/02 2 0302 156 156 22
ZMTH 11/02/02 2 0402 96 96 11
TEST1: DTP:07/17/06 SSZ4 NN:745-88 REC:01 UN:pZZ PG: 001+
CCTL FUN:745-88 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
PT:2005 2004
DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
RRRU NONE
ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
ZMTH 11/02/02 2 0102 156 156 11
ZMTH 11/02/02 2 0202 156 156 11
ZMTH 11/02/02 2 0302 156 156 22
ZMTH 11/02/02 2 0402 96 96 11

-------------------snip----------------------

Here is an example of what I'm looking for to put in a CSV File, with
the first line being the header of each piece of data I'm trying to
grab:

NN,CFL,Name,UI
007-74,L0E,JOHN DOE,DI

So as you can see, I want to be able to pull out the value after each
##: (for particular ##:'s or all of them if that's easy) where ## is a
character representation followed by a colon that identifies each piece
of data in the text file snippet above.

Eh, I don't see that. Do you want just the three you specified (NN, CFL,
UI) or do you want all the others fitting that format (MT, FSS, etc.) as
well?

Also, are these always in the same order? Always on the same line?
Can any of these field labels be an ending substring of any other of the
labels?

Also want to note that the
name has no delimiter, but always comes on the 4th line of each record,
at the beginning.

Skip over 3 lines, then skip over the initial whitespace on the 4th line,
then take everything up to the first PZY: (not taking the whitespace before
PZY:):

$record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/ or die;
my $name=$1;
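
A small sketch of how that match might be used on one record, including
turning "DOE, JOHN" into the "JOHN DOE" form from the desired CSV. The
sample record and the $csv_name variable are assumptions for illustration.

```perl
use strict;
use warnings;

# One record as it would look after splitting on "TEST1:" (assumed layout).
my $record = join "\n",
    ' DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+',
    'CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00',
    'PT:2005 2004',
    'DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2',
    '';

$record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/ or die "No name found in record";
my $name = $1;                       # "DOE, JOHN"

# Swap "LAST, FIRST" into "FIRST LAST"; a name without a comma is left alone.
(my $csv_name = $name) =~ s/^([^,]+),\s*(.+)$/$2 $1/;

print "$csv_name\n";                 # prints "JOHN DOE"
```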


Xho
 
DJ

Will "TEST1:" ever occur *other* than on the first line of a record?
If not, then you can set $/='TEST1:' to isolate records. (You will have to
burn the very first read on the file, because TEST1: is considered the end
rather than the beginning of each record, which causes a spurious first
record that is empty apart from TEST1: itself.)

"TEST1:" will only occur on the first line of a record.

Eh, I don't see that. Do you want just the three you specified (NN, CFL,
UI) or do you want all the others fitting that format (MT, FSS, etc.) as
well?

I would like to pull out all of them if possible (NN, CFL, MT, FSS) and
then decide later which ones to put in the CSV file.

Also, are these always in the same order? Always on the same line?
Can any of these field labels be an ending substring of any other of the
labels?

It is always the same order; however, some of the records have extra
lines every so often that we are not interested in. The field labels
aren't an ending substring of other labels, so you can pretty much
consider them always "field labels" when you see "###:".

Skip over 3 lines, then skip over the initial whitespace on the 4th line,
then take everything up to the first PZY: (not taking the whitespace before
PZY:):

$record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/ or die;
my $name=$1;

I'll give this part a shot...

Thanks for your reply!
 
Mumia W.

I have a large text file with many records that I'd like to parse and
extract data from. I'm trying to figure out conceptually how to pull out
what I need and put it in a CSV file. Here is a snippet of what 2 records
look like; the beginning of each record always has a "TEST1:" and an
"NN:" within the first line:

-------------------snip----------------------

TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+
CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
PT:2005 2004
[...]

I would follow xhoster's advice and set $/ (the input record
separator) to "TEST1:". You can also find a regular
expression that matches your "##:" sequences. Use the match
operator with the /g option to get all of them.

Those "##:" sequences look like they are 2-3 alphabetic
characters followed by a colon followed by several
non-whitespace characters.

Read "perldoc perlrequick" and "perldoc perlre" to find out
how to make the right regular expression for your needs.

Good luck.
 
DJ Stunks

I have a large text file with many records that I'd like to parse and
extract data from. I'm trying to figure out conceptually how to pull out
what I need and put it in a CSV file. Here is a snippet of what 2 records
look like; the beginning of each record always has a "TEST1:" and an
"NN:" within the first line:

-------------------snip----------------------

TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+
CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
PT:2005 2004
DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
RRRU NONE
ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
ZMTH 11/02/02 2 0102 156 156 11
ZMTH 11/02/02 2 0202 156 156 11
ZMTH 11/02/02 2 0302 156 156 22
ZMTH 11/02/02 2 0402 96 96 11
TEST1: DTP:07/17/06 SSZ4 NN:745-88 REC:01 UN:pZZ PG: 001+
CCTL FUN:745-88 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
PT:2005 2004
DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
RRRU NONE
ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
ZMTH 11/02/02 2 0102 156 156 11
ZMTH 11/02/02 2 0202 156 156 11
ZMTH 11/02/02 2 0302 156 156 22
ZMTH 11/02/02 2 0402 96 96 11


-------------------snip----------------------

Ugly. I parse records similar to this using Parse::RecDescent, but
it's slow, so if you have 10,000,000 of them you might be waiting a
while...

Plus, the learning curve for P::RD is pretty steep.

-jp
 
Brian McCauley

$record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/ or die;
my $name=$1;

I prefer to see that written as:

my ($name)=$record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/ or die;
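
For anyone wondering why the "or die" still fires in that form: a list
assignment evaluated in scalar context yields the number of elements on its
right-hand side, which is 0 when the match fails. A minimal sketch of that
behaviour follows; the extract_name helper and the sample strings are made up
for illustration, and it returns instead of dying so both cases can be shown.

```perl
use strict;
use warnings;

my $good = "A\nB\nC\nDOE, JOHN PZY:C01\n";
my $bad  = "A\nB\nC\nno tag on this line\n";

# Returns the captured name, or undef if the pattern does not match.
sub extract_name {
    my ($record) = @_;
    # The list assignment counts 1 element on a match and 0 on a failure,
    # so the "or" branch runs only when nothing was captured.
    my ($name) = $record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/
        or return;
    return $name;
}

for my $text ($good, $bad) {
    my $name = extract_name($text);
    print defined($name) ? "matched: $name\n" : "no match\n";
}
```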
 
Brian McCauley

Mumia said:
I have a large text file with many records that I'd like to parse and
extract data from. I'm trying to figure out conceptually how to pull out
what I need and put it in a CSV file. Here is a snippet of what 2 records
look like; the beginning of each record always has a "TEST1:" and an
"NN:" within the first line:

-------------------snip----------------------

TEST1: DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+
CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
PT:2005 2004
[...]

I would follow xhoster's advice and set $/ (the input record
separator) to "TEST1:". You can also find a regular
expression that matches your "##:" sequences. Use the match
operator with the /g option to get all of them.

I would consider "\nTEST1:", although this would mean the first record
had TEST1: at each end, rather than there being a first record containing
only "TEST1:".

Those "##:" sequences look like they are 2-3 alphabetic
characters followed by a colon followed by several
non-whitespace characters.

Read "perldoc perlrequick" and "perldoc perlre" to find out
how to make the right regular expression for your needs.

I think we can be a little more help:

my %tagged_data = /(\w+):\s*(\S+)/g;

Note: this assumes that the data values are not to contain whitespace
and are never null.
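
A sketch of how that hash might feed the CSV output from the original post,
applied to one record held in $record rather than $_. The sample record and
the @wanted list are assumptions for illustration. Values such as
TMR:DV,DI-03/02 contain commas, so a real script would probably want a CSV
module such as Text::CSV instead of a plain join.

```perl
use strict;
use warnings;

# One record as it would look after splitting on "TEST1:" (assumed layout).
my $record = <<'END';
 DTP:07/17/06 SSZ4 NN:007-74 REC:01 UN:pZZ PG: 001+
CCTL FUN:007-74 CFL:L0E MT:09/11/05-R FSS:L0E MN:L00
PT:2005 2004
DOE, JOHN PZY:C01 TMR:DV,DI-03/02 UI:DI ZZL:07/10/06 SEQU:2
END

# In list context, /g returns every (label, value) pair, which fills the hash.
my %tagged_data = $record =~ /(\w+):\s*(\S+)/g;

# Pick out only the columns wanted for this run.
my @wanted = qw(NN CFL UI);
print join(',', @wanted), "\n";
print join(',', map { defined $tagged_data{$_} ? $tagged_data{$_} : '' } @wanted), "\n";
```

For the sample record this prints the header line NN,CFL,UI followed by
007-74,L0E,DI.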

In "DTP:07/17/06 SSZ4", is SSZ4 part of the DTP data item?

If data can contain whitespace or can be empty then the pattern needs
to be more complex as you need a lookahead to see if we've reached the
end of the data item.

my %tagged_data = /(\w+):\s*(.*?)\s*(?=\n|\w+:)/g; # Untested
 
Dr.Ruud

(e-mail address removed) wrote:
skip over 3 lines, then skip over the initial whitespace on the 4th
line, then take everything up to the first PZY: (not taking the
whitespace before PZY:)

$record =~ /^.*\n.*\n.*\n\s*(.*?)\s+PZY:/ or die;
my $name=$1;

That would also match

---------------------
A
B
C

D


PZY:
---------------------

but maybe that's OK.

One should use [[:blank:]] (or [ \t]), and not \s, if one only wants to
match SP and TAB.

A nice shortcut for [[:blank:]] would be \h, for horizontal whitespace,
though CR should probably not be included.
See http://dev.perl.org/perl6/doc/design/apo/A05.html about \h and \v.
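
To illustrate the difference, here is a sketch of the same name pattern with
[ \t] in place of \s, so the whitespace it skips can no longer cross a line
boundary. The $name_re variable and the two sample strings are made up for
illustration. In Perl 5.10 and later, \h could be used instead of [ \t].

```perl
use strict;
use warnings;

# Same shape as the earlier pattern, but only SP and TAB count as padding.
my $name_re = qr/^.*\n.*\n.*\n[ \t]*(.*?)[ \t]+PZY:/;

my $real  = "L1\nL2\nL3\n  DOE, JOHN PZY:C01 UI:DI\n";
my $stray = "A\nB\nC\n\nD\n\n\nPZY:\n";   # the multi-line case shown above

for my $text ($real, $stray) {
    if ($text =~ $name_re) {
        print "name: $1\n";
    }
    else {
        print "no name on line 4\n";
    }
}
```

With \s the second string would match and capture "D"; with [ \t] it is
rejected because line 4 is empty.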
 
DJ

Thanks much for all the responses, was able to get that data out the
way I want it. Setting up that hash was really nice because I can
choose what I want out of there at any time. One more challenge I'm
having is I want to be able to pull out any lines within each record
that match a certain pattern when one line contains a specific number
at a specific place and the next line has a different specific number in
a specific place. Here is an example:

ZMTH 11/02/02 2 0102 156 156 11
ZMTH 11/02/02 2 0202 156 156 11
ZMTH 11/02/02 2 0302 156 156 22
ZMTH 11/02/02 2 0402 96 96 11

The above is normal output. Say this output changes to this:

ZMTH 11/02/02 2 0102 156 156 11
ZMTH 11/02/02 6 0202 156 156 11
ZMTH 11/02/02 9 0302 156 156 22
ZMTH 11/02/02 2 0402 96 96 11

The second line here has a "6" in the third column and the third line
has a "9" in the third column. I want to know every time this happens
where lines have the 6 immediately followed by a 9 and extract out the
values in each column and probably put into a hash for grabbing
specific output later on through the program. I'm still treating the
record delimiter as "\nTEST1:".

Thanks again!
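
One possible way to approach the 6-then-9 check described above, as a sketch
only: split the record into lines, split each line on whitespace, and compare
the third column of each ZMTH line with that of the line before it. The hash
keys "six" and "nine" and the sample data are assumptions for illustration.

```perl
use strict;
use warnings;

# The ZMTH part of one record, using the changed output from the example.
my $record = <<'END';
ZMTH ZMTH 1 2 3 5 DDAA YXA XXA XXS
ZMTH 11/02/02 2 0102 156 156 11
ZMTH 11/02/02 6 0202 156 156 11
ZMTH 11/02/02 9 0302 156 156 22
ZMTH 11/02/02 2 0402 96 96 11
END

my @pairs;   # each element: { six => [fields], nine => [fields] }
my $prev;    # fields of the previous line, undef before the first line

for my $line (split /\n/, $record) {
    my @f = split ' ', $line;
    if ($prev && @$prev >= 3 && @f >= 3
        && $prev->[0] eq 'ZMTH' && $f[0] eq 'ZMTH'   # both are ZMTH lines
        && $prev->[2] eq '6' && $f[2] eq '9') {      # 6, then 9, in column 3
        push @pairs, { six => $prev, nine => [@f] };
    }
    $prev = [@f];
}

for my $p (@pairs) {
    print "6 then 9: $p->{six}[3] followed by $p->{nine}[3]\n";
}
```

Each entry in @pairs keeps every column of both lines, so specific values can
be pulled out later, much like the tagged-data hash.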
 
