Difficult text file to parse.

richardkreidl

Basically, I have a large input file delimited by the pipe symbol '|'.
Several records in the file can have the same data in field 1 and
field 3.
For example, the first six records below are identical except for field 2.

What I need is to match records on field 1, taking up to 4 matches and
no more than that.
Then take the names from field 2 and create a single new record like the
first one in my Output file below.

If fewer than 4 records match on field 1, as in the second set of
records where there are only two, the missing names are left empty; see
the Output file below for how it should be displayed. I want to show the
delimiters even if there is no data to show.

I hope I explained everything correctly. I think a hash would be the
best way to approach this problem. I'm not good at using hashes.
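
Before any grouping, each record has to be broken into its three fields. A minimal sketch in Ruby, assuming one record per line and fields separated by a pipe with optional spaces around it; the helper name `parse_record` is hypothetical and not part of the original post:

```ruby
# Split one pipe-delimited record into [job, name, description].
# Returns nil for blank or malformed lines (anything without exactly 3 fields).
def parse_record(line)
  fields = line.chomp.split("|", -1).map(&:strip)
  fields.length == 3 ? fields : nil
end

p parse_record("ebsssecrAvotusSyncNT | DON RAINS | SunONE Synchronization Process-NT\n")
# => ["ebsssecrAvotusSyncNT", "DON RAINS", "SunONE Synchronization Process-NT"]
p parse_record("\n")
# => nil
```

Stripping the whitespace here keeps the later comparison on field 1 from being thrown off by stray spaces around the delimiter.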


My sample Input file: Input.txt

agencyKillCFSLegacySync | TOM JONES | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | RICH STEVENS | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | SUE LONG | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | TIM MAYS | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | BOB SMITH | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | STEVE WILLS | This job kills the Legacy Transformer.

ebsssecrDirTxnSubTESTStop | ALEXIS KING | To Stop TEST Directory Transaction Subscriber.
ebsssecrDirTxnSubTESTStop | MIKE JONES | To Stop TEST Directory Transaction Subscriber.

ebsssecrAvotusSyncNT | DON RAINS | SunONE Synchronization Process-NT

ReorgAldaudbRunStatsAldarFmcdb | SCOTT FRANKS | Updates Run Statistics for server.
ReorgAldaudbRunStatsAldarFmcdb | CRAIG GRAVES | Updates Run Statistics for server.
ReorgAldaudbRunStatsAldarFmcdb | DB2UDB | Updates Run Statistics for server.



My desired Output file: Output.txt

agencyKillCFSLegacySync | TOM JONES | RICH STEVENS | SUE LONG | TIM MAYS | This job kills the Legacy Transformer.
ebsssecrDirTxnSubTESTStop | ALEXIS KING | MIKE JONES ||| To Stop TEST Directory Transaction Subscriber.
ebsssecrAvotusSyncNT | DON RAINS |||| SunONE Synchronization Process-NT
ReorgAldaudbRunStatsAldarFmcdb | SCOTT FRANKS | CRAIG GRAVES | DB2UDB || Updates Run Statistics for server.

Thanks
 
Matt Garrish

> I hope I explained everything correctly. I think a hash would be the
> best way to approach this problem. I'm not good at using hashes.

Yes, a hash probably would be the best approach to take. This isn't the
place to be asking other people to write code for you, however.

Please make an effort to solve the problem yourself and if you get stuck on
something in particular you're welcome to post back for help.

Matt
 
axel

> I hope I explained everything correctly. I think a hash would be the
> best way to approach this problem.

Yes it would.

> I'm not good at using hashes.

It's obviously an excellent opportunity to gain experience in using them.
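
For anyone new to hashes, the core idea is a hash whose values are arrays: each distinct field 1 maps to the list of names seen for it. A minimal sketch in Ruby, using made-up sample records; the function name `group_names` and the tag values are illustrative assumptions, not taken from the original data:

```ruby
# Group names by tag using a hash whose values are arrays.
def group_names(records)
  # The block gives every new key its own empty array.
  names_by_tag = Hash.new { |hash, key| hash[key] = [] }
  records.each { |tag, name| names_by_tag[tag] << name }
  names_by_tag
end

records = [
  ["tagA", "TOM JONES"],
  ["tagA", "RICH STEVENS"],
  ["tagB", "ALEXIS KING"],
]

group_names(records).each { |tag, names| puts "#{tag}: #{names.join(', ')}" }
```

Once the names are grouped this way, limiting each group to four names and padding the short ones is a separate, smaller step.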

Axel
 
William James

In Ruby:

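h = Hash.new([])                       # key: [tag, comment]; value: list of names
DATA.each { |line|
  a, b, c = line.chomp.split(/ \| /)   # tag, name, comment
  h[ [a, c] ] += [ b ]                 # += assigns a new array; the shared default is never mutated
}
puts h.map { |k, v|
  v += ["", "", ""]                    # pad so at least four name slots exist
  # Join with ' | ', then delete the two-space gaps left by empty slots.
  [k[0], v[0, 4], k[1]].flatten.join(' | ').gsub(/ {2}/, "")
}.sort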

__END__
tagA | TOM JONES | Comment-1
tagA | RICH STEVENS | Comment-1
tagA | SUE LONG | Comment-1
tagA | TIM MAYS | Comment-1
tagA | BOB SMITH | Comment-1
tagA | STEVE WILLS | Comment-1
tagB | ALEXIS KING | Comment-2
tagB | MIKE JONES | Comment-2
tagC | DON RAINS | Comment-3
tagD | SCOTT FRANKS | Comment-4
tagD | CRAIG GRAVES | Comment-4
tagD | DB2UDB | Comment-4

------------------------------------------------------

Output:

tagA | TOM JONES | RICH STEVENS | SUE LONG | TIM MAYS | Comment-1
tagB | ALEXIS KING | MIKE JONES ||| Comment-2
tagC | DON RAINS |||| Comment-3
tagD | SCOTT FRANKS | CRAIG GRAVES | DB2UDB || Comment-4
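
The step in the Ruby answer that guarantees exactly four name slots is the pad-then-slice. A small sketch of just that step, pulled out into a hypothetical helper `four_slots` (not part of the original post); it pads with four empty strings rather than three so that an empty list is also handled:

```ruby
# Pad a list of names with empty strings, then keep the first four,
# so the result always has exactly four slots.
def four_slots(names)
  (names + ["", "", "", ""])[0, 4]
end

p four_slots(["ALEXIS KING", "MIKE JONES"])  # => ["ALEXIS KING", "MIKE JONES", "", ""]
p four_slots(%w[A B C D E F])                # => ["A", "B", "C", "D"]
p four_slots([])                             # => ["", "", "", ""]
```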
 
John W. Krahn


This will work:

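#!/usr/bin/perl
use warnings;
use strict;


# Names collected for the current group, keyed on field 1 and field 3
# (the two-part key is joined internally with $;).
# Assumes records with the same key are adjacent in the input.
my %data;

while ( <DATA> ) {
    # Skip blank or malformed lines: a record has exactly three fields.
    ( my @fields = split /\|/ ) == 3 or next;

    # A new field 1/field 3 pair means the previous group is complete.
    if ( %data && !exists $data{ $fields[ 0 ], $fields[ 2 ] } ) {
        my ( $first, $last ) = map split( $; ), keys %data;
        # Pad with three empty names, then keep exactly four name slots.
        print join '|', $first, ( map @$_, values %data, [ '', '', '' ] )[ 0 .. 3 ], $last;
        %data = ();
    }

    push @{ $data{ $fields[ 0 ], $fields[ 2 ] } }, $fields[ 1 ];
}

# Print the final group, which the loop itself never flushes.
my ( $first, $last ) = map split( $; ), keys %data;
print join '|', $first, ( map @$_, values %data, [ '', '', '' ] )[ 0 .. 3 ], $last;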


__DATA__
agencyKillCFSLegacySync | TOM JONES | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | RICH STEVENS | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | SUE LONG | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | TIM MAYS | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | BOB SMITH | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | STEVE WILLS | This job kills the Legacy Transformer.

ebsssecrDirTxnSubTESTStop | ALEXIS KING | To Stop TEST Directory Transaction Subscriber.
ebsssecrDirTxnSubTESTStop | MIKE JONES | To Stop TEST Directory Transaction Subscriber.

ebsssecrAvotusSyncNT | DON RAINS | SunONE Synchronization Process-NT

ReorgAldaudbRunStatsAldarFmcdb | SCOTT FRANKS | Updates Run Statistics for server.
ReorgAldaudbRunStatsAldarFmcdb | CRAIG GRAVES | Updates Run Statistics for server.
ReorgAldaudbRunStatsAldarFmcdb | DB2UDB | Updates Run Statistics for server.
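
For comparison, the same flush-when-the-key-changes idea can be sketched in Ruby. This is an illustration, not a line-by-line translation of the Perl above. It assumes, as the Perl does, that records sharing field 1 and field 3 are adjacent in the input. The names `merge_records`, `format_group` and `MAX_NAMES` are hypothetical, and the sample records are shortened placeholders:

```ruby
MAX_NAMES = 4

# Format one merged record; empty slots show up as bare delimiters.
def format_group(job, desc, names)
  kept = names.first(MAX_NAMES)
  "#{job} | #{kept.join(' | ')} #{'|' * (MAX_NAMES + 1 - kept.size)} #{desc}"
end

# Merge adjacent records that share field 1 and field 3.
def merge_records(lines)
  out = []
  key = nil
  names = []
  lines.each do |line|
    fields = line.chomp.split("|", -1).map(&:strip)
    next unless fields.size == 3          # skip blank or malformed lines
    job, name, desc = fields
    if key && key != [job, desc]          # key changed: previous group is complete
      out << format_group(key[0], key[1], names)
      names = []
    end
    key = [job, desc]
    names << name
  end
  out << format_group(key[0], key[1], names) if key   # final group
  out
end

sample = [
  "tagA | TOM JONES | Comment-1",
  "tagA | RICH STEVENS | Comment-1",
  "",
  "tagB | DON RAINS | Comment-2",
]
puts merge_records(sample)
```

Because only the current group is held at any time, this style suits a large input file, provided the file really is grouped by field 1.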





John
 
