Difficult text file to parse.

richardkreidl · Sep 11, 2005

Basically, I have a large input file delimited by the pipe symbol '|'.
Records in the file can have the same data in field 1 and field 3.
For example, the first six records are the same except for field 2.

What I need is to match on field 1, taking at most 4 matching records
and no more than that.
Then take the names from field 2 and create a new record like the first
one in my Output file below.

If fewer than 4 records match on field 1, like the second set of
records, which has only two, look at the output file below to see how
it should be displayed. I want to show the delimiters even if there is
no data to show.

I hope I explained everything correctly. I think a hash would be the
best way to approach this problem. I'm not good at using hashes.

My sample Input file: Input.txt

agencyKillCFSLegacySync | TOM JONES | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | RICH STEVENS | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | SUE LONG | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | TIM MAYS | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | BOB SMITH | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | STEVE WILLS | This job kills the Legacy Transformer.

ebsssecrDirTxnSubTESTStop | ALEXIS KING | To Stop TEST Directory Transaction Subscriber.
ebsssecrDirTxnSubTESTStop | MIKE JONES | To Stop TEST Directory Transaction Subscriber.

ebsssecrAvotusSyncNT | DON RAINS | SunONE Synchronization Process-NT

ReorgAldaudbRunStatsAldarFmcdb | SCOTT FRANKS | Updates Run Statistics for server.
ReorgAldaudbRunStatsAldarFmcdb | CRAIG GRAVES | Updates Run Statistics for server.
ReorgAldaudbRunStatsAldarFmcdb | DB2UDB | Updates Run Statistics for server.

My desired Output file: Output.txt

agencyKillCFSLegacySync | TOM JONES | RICH STEVENS | SUE LONG | TIM MAYS | This job kills the Legacy Transformer.
ebsssecrDirTxnSubTESTStop | ALEXIS KING | MIKE JONES ||| To Stop TEST Directory Transaction Subscriber.
ebsssecrAvotusSyncNT | DON RAINS |||| SunONE Synchronization Process-NT
ReorgAldaudbRunStatsAldarFmcdb | SCOTT FRANKS | CRAIG GRAVES | DB2UDB || Updates Run Statistics for server.

Thanks
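
The rule described in the question (keep at most four names per job and still emit the delimiters when fewer exist) comes down to truncating and padding a list. A minimal sketch in Ruby, where the helper name `pad_names` and the default width of 4 are assumptions made for illustration only:

```ruby
# Hypothetical helper: keep at most `width` names and pad with empty
# strings so every output record has the same number of name fields.
def pad_names(names, width = 4)
  kept = names.first(width)                # drop the fifth name onward
  kept + [""] * (width - kept.size)        # fill the missing slots
end

p pad_names(["ALEXIS KING", "MIKE JONES"])  # two names plus two empty slots
p pad_names(%w[A B C D E F])                # only the first four survive
```

Joining the padded list with the delimiter then yields the empty fields automatically, so no special case is needed for short groups.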

Matt Garrish · Sep 11, 2005

I hope I explained everything correctly. I think a hash would be the
best way to approach this problem. I'm not good on using hashes.

Yes, a hash probably would be the best approach to take. This isn't the
place to be asking other people to write code for you, however.

Please make an effort to solve the problem yourself and if you get stuck on
something in particular you're welcome to post back for help.

Matt

axel · Sep 11, 2005

I hope I explained everything correctly. I think a hash would be the
best way to approach this problem.

Yes it would.

I'm not good on using hashes.

It's obviously an excellent opportunity to gain experience in using them.
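
For readers new to hashes, the core idea is a hash whose values are arrays: each job name maps to the list of names seen for it. A rough sketch in Ruby, assuming one record per line and '|' as the delimiter; `group_names` is a made-up name, not something from the thread:

```ruby
# Group the names (field 2) under their job name (field 1).
# Assumes one record per line with fields separated by '|'.
def group_names(lines)
  groups = Hash.new { |hash, key| hash[key] = [] }  # fresh array per new key
  lines.each do |line|
    job, name, _desc = line.chomp.split("|").map(&:strip)
    next if name.nil?            # skip blank or malformed lines
    groups[job] << name
  end
  groups
end

sample = [
  "tagB | ALEXIS KING | Comment-2",
  "tagB | MIKE JONES | Comment-2",
  "tagC | DON RAINS | Comment-3",
]
p group_names(sample)
```

Once the names are grouped this way, producing the output is a matter of walking the hash and joining each list.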

Axel

William James · Sep 12, 2005

In Ruby:

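h = Hash.new([])
DATA.each { |line|
  a, b, c = line.chomp.split(/ \| /)
  h[ [a,c] ] += [ b ]    # group names under the [tag, comment] pair
}
puts h.map { |k,v|
  v += ["","",""]        # pad so there are always at least 4 name slots
  # join with ' | ', then remove the double spaces left around empty fields
  [k[0],v[0,4],k[1]].flatten.join(' | ').gsub(/ {2}/,"")
}.sort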

__END__
tagA | TOM JONES | Comment-1
tagA | RICH STEVENS | Comment-1
tagA | SUE LONG | Comment-1
tagA | TIM MAYS | Comment-1
tagA | BOB SMITH | Comment-1
tagA | STEVE WILLS | Comment-1
tagB | ALEXIS KING | Comment-2
tagB | MIKE JONES | Comment-2
tagC | DON RAINS | Comment-3
tagD | SCOTT FRANKS | Comment-4
tagD | CRAIG GRAVES | Comment-4
tagD | DB2UDB | Comment-4

------------------------------------------------------

Output:

tagA | TOM JONES | RICH STEVENS | SUE LONG | TIM MAYS | Comment-1
tagB | ALEXIS KING | MIKE JONES ||| Comment-2
tagC | DON RAINS |||| Comment-3
tagD | SCOTT FRANKS | CRAIG GRAVES | DB2UDB || Comment-4
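
One detail of the Ruby script is easy to miss: `Hash.new([])` hands back the same default array for every missing key, so the script only works because it uses `+=` (which stores a new array) and not `<<`. A small sketch of the difference; the two method names are invented for this illustration:

```ruby
# += looks up the default, builds a NEW array, and stores it under the key.
def collect_with_plus_equals(pairs)
  h = Hash.new([])
  pairs.each { |key, value| h[key] += [value] }
  h
end

# << appends to the one shared default array and never stores a key.
def collect_with_shovel(pairs)
  h = Hash.new([])
  pairs.each { |key, value| h[key] << value }
  h
end

pairs = [["tagA", "TOM JONES"], ["tagB", "ALEXIS KING"]]
p collect_with_plus_equals(pairs).keys    # both keys are present
p collect_with_shovel(pairs).keys         # no keys were stored
p collect_with_shovel(pairs)["anything"]  # every lookup sees both names
```

If appending in place is preferred, the block form `Hash.new { |hash, key| hash[key] = [] }` gives each key its own array.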

John W. Krahn · Sep 12, 2005

I hope I explained everything correctly. I think a hash would be the
best way to approach this problem. I'm not good on using hashes.

[ snip data ]

This will work:

#!/usr/bin/perl
use warnings;
use strict;

my %data;

while ( <DATA> ) {
    # skip blank lines and anything that is not a three-field record
    ( my @fields = split /\|/ ) == 3 or next;

    # a new field 1/field 3 combination starts: print the finished group
    if ( %data && !exists $data{ $fields[ 0 ], $fields[ 2 ] } ) {
        my ( $first, $last ) = map split( $; ), keys %data;
        print join '|', $first, ( map @$_, values %data, [ '', '', '' ] )[ 0 .. 3 ], $last;
        %data = ();
    }

    push @{ $data{ $fields[ 0 ], $fields[ 2 ] } }, $fields[ 1 ];
}

# print the last group
my ( $first, $last ) = map split( $; ), keys %data;
print join '|', $first, ( map @$_, values %data, [ '', '', '' ] )[ 0 .. 3 ], $last;

__DATA__
agencyKillCFSLegacySync | TOM JONES | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | RICH STEVENS | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | SUE LONG | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | TIM MAYS | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | BOB SMITH | This job kills the Legacy Transformer.
agencyKillCFSLegacySync | STEVE WILLS | This job kills the Legacy Transformer.

ebsssecrDirTxnSubTESTStop | ALEXIS KING | To Stop TEST Directory Transaction Subscriber.
ebsssecrDirTxnSubTESTStop | MIKE JONES | To Stop TEST Directory Transaction Subscriber.

ebsssecrAvotusSyncNT | DON RAINS | SunONE Synchronization Process-NT

ReorgAldaudbRunStatsAldarFmcdb | SCOTT FRANKS | Updates Run Statistics for server.
ReorgAldaudbRunStatsAldarFmcdb | CRAIG GRAVES | Updates Run Statistics for server.
ReorgAldaudbRunStatsAldarFmcdb | DB2UDB | Updates Run Statistics for server.

John
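
The Perl script above depends on matching records being adjacent in the file: it prints a group as soon as the field 1/field 3 pair changes. The same idea can be sketched in Ruby with `Enumerable#slice_when` (available since Ruby 2.2). This is an illustration under that adjacency assumption, not a translation of the Perl; `merge_records` is a hypothetical name, and empty fields come out as `|  |` because every field is joined with ' | '.

```ruby
# Merge consecutive records that share field 1 and field 3 into one line,
# keeping at most four names and padding with empty fields.
def merge_records(lines)
  records = lines.map { |line| line.chomp.split("|").map(&:strip) }
                 .select { |fields| fields.size == 3 }   # skip blanks and junk
  # start a new group whenever the (field 1, field 3) pair changes
  groups = records.slice_when { |a, b| [a[0], a[2]] != [b[0], b[2]] }
  groups.map do |group|
    names = group.map { |fields| fields[1] }.first(4)
    names += [""] * (4 - names.size)
    [group[0][0], *names, group[0][2]].join(" | ")
  end
end

puts merge_records([
  "tagB | ALEXIS KING | Comment-2",
  "tagB | MIKE JONES | Comment-2",
  "",
  "tagC | DON RAINS | Comment-3",
])
```

If the input is not sorted so that matching records sit together, grouping into a hash first is the safer route.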
