compare 2 data files and extract fields for matched lines

S

shree

Hello Friends,

The best way to describe what I'm trying to do is through an example.
I have 2 pipe delimited input files and want to extract a field from
file 2 and append it to file 1. Note I would like Output file to have
the same number of rows as Input File 1, with an additional field
whose value if present in file 2, should be inserted in this new
field. If its not present, then insert '0000'.

Input File 1 (zipcode, city, state, county)
36003|Autaugaville|AL|AUTAUGA
36006|Billingsley|AL|AUTAUGA
72314|Birdeye|AR|CROSS
72324|Cherry Valley|AR|CROSS
57437|Eureka|SD|MCPHERSON
67460|Mc Pherson|KS|MCPHERSON
67464|Marquette|KS|MCPHERSON
69167|Tryon|NE|MCPHERSON
...
...
Input File 2 (county, state, county population)
AUTAUGA|AL|49730
CROSS|AR|19056
MCPHERSON|KS|29380
...

Desired Output (zipcode, city, state, county, county population)
36003|Autaugaville|AL|AUTAUGA|49730
36006|Billingsley|AL|AUTAUGA|49730
72314|Birdeye|AR|CROSS|19056
72324|Cherry Valley|AR|CROSS|19056
57437|Eureka|SD|MCPHERSON|0000
67460|Mc Pherson|KS|MCPHERSON|29380
67464|Marquette|KS|MCPHERSON|29380
69167|Tryon|NE|MCPHERSON|0000

---
I wrote the program below but it has logic error. Instead of getting
the above, I get the following.

Any guidance with fixing the code or perhaps a better way to do this
is really appreciated. The above is just a few lines from my real
input files, which are considerably larger.

Thank you and best wishes,
Shree


36003|Autaugaville|AL|AUTAUGA|49730
36006|Billingsley|AL|AUTAUGA|49730
72314|Birdeye|AR|CROSS|0000
72324|Cherry Valley|AR|CROSS|0000
57437|Eureka|SD|MCPHERSON|0000
67460|Mc Pherson|KS|MCPHERSON|0000
67464|Marquette|KS|MCPHERSON|0000
69167|Tryon|NE|MCPHERSON|0000
36003|Autaugaville|AL|AUTAUGA|0000
36006|Billingsley|AL|AUTAUGA|0000
72314|Birdeye|AR|CROSS|19056
72324|Cherry Valley|AR|CROSS|19056
57437|Eureka|SD|MCPHERSON|0000
67460|Mc Pherson|KS|MCPHERSON|0000
67464|Marquette|KS|MCPHERSON|0000
69167|Tryon|NE|MCPHERSON|0000
36003|Autaugaville|AL|AUTAUGA|0000
36006|Billingsley|AL|AUTAUGA|0000
72314|Birdeye|AR|CROSS|0000
72324|Cherry Valley|AR|CROSS|0000
57437|Eureka|SD|MCPHERSON|0000
67460|Mc Pherson|KS|MCPHERSON|29380
67464|Marquette|KS|MCPHERSON|29380
69167|Tryon|NE|MCPHERSON|0000

---------------------------------------------------------------------
#!/usr/bin/perl

use strict;
my $File_In1 = "dat1.txt";
my $File_In2 = "dat2.txt";
my (@array1, @array2) = ();
my ($line1, $line2) = "";
my ($zip, $city, $state, $county) = "";
my ($county2, $state2, $pop) = "";

open (FILE_IN1, $File_In1) or die "cannot open file in FILE_IN1 $!";
@array1 = <FILE_IN1>;
close (FILE_IN1);

open (FILE_IN2, $File_In2) or die "cannot open file in FILE_IN2 $!";
@array2 = <FILE_IN2>;
close (FILE_IN2);

foreach $line2 (@array2) {
chomp ($line2);
($county2, $state2, $pop) = split (/\|/, $line2);
foreach $line1 (@array1) {
chomp ($line1);
($zip, $city, $state, $county) = split (/\|/, $line1);
if (($county2 eq $county) && ($state2 eq $state)) {
print "$zip|$city|$state|$county|$pop\n";
} else {
print "$zip|$city|$state|$county|0000\n";
}
}
}
 
D

dn.perl

open (FILE_IN1, $File_In1) or die "cannot open file in FILE_IN1 $!";
@array1 = <FILE_IN1>;
close (FILE_IN1);

Start with two sample files of 2-3 lines each.
Make sure it is okay to read the entire file
in an array by printing the value of @array1
and printing the value of $line1 after running
foreach $line1 (@array1); and read the files
line by line instead of reading them all at once.
Something like :
while ($line1 = <FILE_IN1>) { statements }.
As part of debugging the code, print out the value
of $line1 to see for yourself what strings or arrays
your code is dealing with.
 
N

nolo contendere

Hello Friends,

The best way to describe what I'm trying to do is through an example.
I have 2 pipe delimited input files and want to extract a field from
file 2 and append it to file 1.  Note I would like Output file to have
the same number of rows as Input File 1, with an additional field
whose value if present in file 2, should be inserted in this new
field. If its not present, then insert '0000'.

Input File 1 (zipcode, city, state, county)
36003|Autaugaville|AL|AUTAUGA
36006|Billingsley|AL|AUTAUGA
72314|Birdeye|AR|CROSS
72324|Cherry Valley|AR|CROSS
57437|Eureka|SD|MCPHERSON
67460|Mc Pherson|KS|MCPHERSON
67464|Marquette|KS|MCPHERSON
69167|Tryon|NE|MCPHERSON
..
..
Input File 2 (county, state, county population)
AUTAUGA|AL|49730
CROSS|AR|19056
MCPHERSON|KS|29380
..

stick this info in a hash.

so, start your script:

#!/usr/bin/perl

use strict; use warnings;

my %pop;
open my $fh2, '<', $file2 or die "can't open '$file2': $!\n";
while ( <$fh2> ) {
chomp;
my ( $county, $state, $population ) = split /\|/;
$pop{"$state|$county"} = $population;
}
close $fh2;

# now you just need to open theother file, and do a lookup on the
"$state|$county" key
# if it's not there, append your 0000;

open my $fh1, '<', $file1 or die "can't open '$file1': $!\n";
while ( <$fh1> ) {
chomp;
my ( $zip, $city, $state, $county ) = split /\|/;
if ( $pop{"$state|$county"} ) {
print "$_|".$pop{"$state|$county"}."\n";
}
else {
print "$_|0000\n";
}
}
close $fh1;


__END__

* Note: untested
 
J

Jürgen Exner

The best way to describe what I'm trying to do is through an example.

That helps somewhat.
I have 2 pipe delimited input files and want to extract a field from
file 2 and append it to file 1. Note I would like Output file to have
the same number of rows as Input File 1, with an additional field
whose value if present in file 2, should be inserted in this new
field. If its not present, then insert '0000'.

You forgot to mention and it is not clear from your example _which field_ is
the link between those 2 files.
Input File 1 (zipcode, city, state, county)
36003|Autaugaville|AL|AUTAUGA
36006|Billingsley|AL|AUTAUGA
..
..
Input File 2 (county, state, county population)
AUTAUGA|AL|49730
CROSS|AR|19056
MCPHERSON|KS|29380
..

Desired Output (zipcode, city, state, county, county population)
36003|Autaugaville|AL|AUTAUGA|49730
36006|Billingsley|AL|AUTAUGA|49730
72314|Birdeye|AR|CROSS|19056

---
I wrote the program below but it has logic error. Instead of getting
the above, I get the following.

Any guidance with fixing the code or perhaps a better way to do this
is really appreciated. The above is just a few lines from my real

[attempt with 2 arrays and nested loops snipped]

There is a much easier approach:
- read file 2 into a hash, using the link between the 2 files as the key and
the desired number as the value in each hash entry.
- then read file 1 line by line and if the key exists then write the line
with the hash value to the new file, otherwise write the line with 0000
appended to the new file.

Not only is this much easier to comprehend, it is also much faster with
O(n+m) instead of O(n*m).

jue
 
J

John W. Krahn

shree said:
The best way to describe what I'm trying to do is through an example.
I have 2 pipe delimited input files and want to extract a field from
file 2 and append it to file 1. Note I would like Output file to have
the same number of rows as Input File 1, with an additional field
whose value if present in file 2, should be inserted in this new
field. If its not present, then insert '0000'.

Input File 1 (zipcode, city, state, county)
36003|Autaugaville|AL|AUTAUGA
36006|Billingsley|AL|AUTAUGA
72314|Birdeye|AR|CROSS
72324|Cherry Valley|AR|CROSS
57437|Eureka|SD|MCPHERSON
67460|Mc Pherson|KS|MCPHERSON
67464|Marquette|KS|MCPHERSON
69167|Tryon|NE|MCPHERSON
..
..
Input File 2 (county, state, county population)
AUTAUGA|AL|49730
CROSS|AR|19056
MCPHERSON|KS|29380
..

Desired Output (zipcode, city, state, county, county population)
36003|Autaugaville|AL|AUTAUGA|49730
36006|Billingsley|AL|AUTAUGA|49730
72314|Birdeye|AR|CROSS|19056
72324|Cherry Valley|AR|CROSS|19056
57437|Eureka|SD|MCPHERSON|0000
67460|Mc Pherson|KS|MCPHERSON|29380
67464|Marquette|KS|MCPHERSON|29380
69167|Tryon|NE|MCPHERSON|0000

---
I wrote the program below but it has logic error. Instead of getting
the above, I get the following.

Any guidance with fixing the code or perhaps a better way to do this
is really appreciated. The above is just a few lines from my real
input files, which are considerably larger.

Thank you and best wishes,
Shree


36003|Autaugaville|AL|AUTAUGA|49730
36006|Billingsley|AL|AUTAUGA|49730
72314|Birdeye|AR|CROSS|0000
72324|Cherry Valley|AR|CROSS|0000
57437|Eureka|SD|MCPHERSON|0000
67460|Mc Pherson|KS|MCPHERSON|0000
67464|Marquette|KS|MCPHERSON|0000
69167|Tryon|NE|MCPHERSON|0000
36003|Autaugaville|AL|AUTAUGA|0000
36006|Billingsley|AL|AUTAUGA|0000
72314|Birdeye|AR|CROSS|19056
72324|Cherry Valley|AR|CROSS|19056
57437|Eureka|SD|MCPHERSON|0000
67460|Mc Pherson|KS|MCPHERSON|0000
67464|Marquette|KS|MCPHERSON|0000
69167|Tryon|NE|MCPHERSON|0000
36003|Autaugaville|AL|AUTAUGA|0000
36006|Billingsley|AL|AUTAUGA|0000
72314|Birdeye|AR|CROSS|0000
72324|Cherry Valley|AR|CROSS|0000
57437|Eureka|SD|MCPHERSON|0000
67460|Mc Pherson|KS|MCPHERSON|29380
67464|Marquette|KS|MCPHERSON|29380
69167|Tryon|NE|MCPHERSON|0000

---------------------------------------------------------------------
#!/usr/bin/perl

use strict;
my $File_In1 = "dat1.txt";
my $File_In2 = "dat2.txt";
my (@array1, @array2) = ();

That is the same as:

my @array1 = ();
my @array2;
my ($line1, $line2) = "";

That is the same as:

my $line1 = "";
my $line2;
my ($zip, $city, $state, $county) = "";

That is the same as:

my $zip = "";
my ($city, $state, $county);
my ($county2, $state2, $pop) = "";

That is the same as:

my $county2 = "";
my ($state2, $pop);

Anyway, you should declare your variables in the smallest possible scope
instead of all at the top of the file.

open (FILE_IN1, $File_In1) or die "cannot open file in FILE_IN1 $!";
@array1 = <FILE_IN1>;
close (FILE_IN1);

open (FILE_IN2, $File_In2) or die "cannot open file in FILE_IN2 $!";
@array2 = <FILE_IN2>;
close (FILE_IN2);

foreach $line2 (@array2) {
chomp ($line2);
($county2, $state2, $pop) = split (/\|/, $line2);
foreach $line1 (@array1) {
chomp ($line1);
($zip, $city, $state, $county) = split (/\|/, $line1);
if (($county2 eq $county) && ($state2 eq $state)) {
print "$zip|$city|$state|$county|$pop\n";
} else {
print "$zip|$city|$state|$county|0000\n";
}
}
}

Try it like this:

#!/usr/bin/perl
use warnings;
use strict;

my $File_In1 = 'dat1.txt';
my $File_In2 = 'dat2.txt';

open FILE_IN2, '<', $File_In2 or die "cannot open '$File_In2' $!";

my %population;
while ( <FILE_IN2> ) {
my ( $county, $state, $pop ) = /\A([^|]+)\|([^|]+)\|(\d+)\Z/;
$population{ "$state|$county" } = $pop;
}

close FILE_IN2;

open FILE_IN1, '<', $File_In1 or die "cannot open '$File_In1' $!";

while ( my $line = <FILE_IN1> ) {
chomp $line;
my ( $key ) = $line =~ /\|([^|]+\|[^|]+)\z/;
print "$line|", $population{ $key } || '0000', "\n";
}

close FILE_IN1;

__END__



John
 
S

shree

#!/usr/bin/perl
use warnings;
use strict;

my $File_In1 = 'dat1.txt';
my $File_In2 = 'dat2.txt';

open FILE_IN2, '<', $File_In2 or die "cannot open '$File_In2' $!";

my %population;
while ( <FILE_IN2> ) {
my ( $county, $state, $pop ) = /\A([^|]+)\|([^|]+)\|(\d+)\Z/;
$population{ "$state|$county" } = $pop;
}

close FILE_IN2;

open FILE_IN1, '<', $File_In1 or die "cannot open '$File_In1' $!";

while ( my $line = <FILE_IN1> ) {
chomp $line;
my ( $key ) = $line =~ /\|([^|]+\|[^|]+)\z/;
print "$line|", $population{ $key } || '0000', "\n";
}

close FILE_IN1;

__END__

John

Dear all,

Thanks for showing me how to do this. And an added thanks to John for
teaching good programming techniques in perl.

I was able to literally use the above and it worked like a charm.

Shree
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,754
Messages
2,569,527
Members
44,999
Latest member
MakersCBDGummiesReview

Latest Threads

Top