data file

friend.05

I have a large file in the following format:

ID | Time | IP | Code

I want only the data lines which have a unique IP+Code.

If an IP+Code is repeated, then I don't want the line.
 
friend.05

perldoc -q unique

Ben


Below is code which I have written to extract unique IP+Code from
large file. (File format is ID | Time | IP | code).

I am not sure what the best way to do this would be.

#!/usr/local/bin/perl

print "Welcome\n";

$pri_file = "out_pri.txt";

$cnt = 0;
$flag = 0;

open(INFO_PRI,$pri_file)or die $!;
open(INFO,$pri_file)or die $!;

@pri_lines_ = <INFO>;

while($pri_line = <INFO_PRI>)
{
@primary = split('\|',$pri_line);
$pri_cli_ip = $primary[4];
$pri_id = $primary[7];
print "$pri_id\n";


foreach $p_line (@pri_lines_)
{
@pri = split('\|',$p_line);
$cli_ip = $pri[4];
$id = $pri[7];

if(($pri_cli_ip == $cli_ip) && ($pri__id == $id))
{
$cnt++;
if($cnt == 2){
$cnt = 0;
$flag = 1;
last;
}
}
}
if($flag == 0){
open(FILE,'>>pri_unique.txt');
print FILE "$pri_line\n";
close(FILE);
}else{
$flag = 0;
}
}

close(INFO_PRI);
close(INFO);
 
Jürgen Exner

Below is code which I have written to extract unique IP+Code from
large file. (File format is ID | Time | IP | code).

I am not sure which will be best way to do this.

#!/usr/local/bin/perl
$pri_file = "out_pri.txt";

$cnt = 0;
$flag = 0;

open(INFO_PRI,$pri_file)or die $!;
open(INFO,$pri_file)or die $!;

@pri_lines_ = <INFO>;

while($pri_line = <INFO_PRI>)
[rest of code snipped]

There are many things I don't understand in this code, among them why you
are using 2 file handles to the same file, why you are slurping in the whole
file on one file handle and then processing the file line by line on the
other file handle, why you have a nested loop, etc, etc.

Your requirements seem to be straightforward and easy to translate into
a simple algorithm (warning, sketch only, not tested):

my %idtable;
open (my $F, '<', $myfile) or die "Cannot read $myfile because $!\n";
while (<$F>) { #loop through file and gather all IP | Code combinations
my (undef, undef, $ip, $code) = split /\|/;
$idtable{"$ip|$code"}++; #record this ip-code combination
}
seek $F, 0, 0; #reset file to start (seek needs a WHENCE argument)
while (<$F>) { #loop through file again and ....
my (undef, undef, $ip, $code) = split /\|/;
print if $idtable{"$ip|$code"} == 1;
#... print that line if the ip-code combination
#exists exactly once in the file
}
close $F;

jue
 
Ben Morrow

[don't quote .signatures]

Quoth "(e-mail address removed)":
Below is code which I have written to extract unique IP+Code from
large file. (File format is ID | Time | IP | code).

I am not sure which will be best way to do this.

#!/usr/local/bin/perl

Where is

use warnings;
use strict;

? You have already been told to include this.

print "Welcome\n";

$pri_file = "out_pri.txt";

$cnt = 0;
$flag = 0;

open(INFO_PRI,$pri_file)or die $!;
open(INFO,$pri_file)or die $!;

You have already been told to use lexical filehandles and 3-arg open.
You should make the error message actually useful:

open (my $INFO_PRI, "<", $pri_file)
or die "can't open '$pri_file': $!";

Why are you opening the same file twice? Just iterate over @pri_lines_
instead.

@pri_lines_ = <INFO>;

Why on earth are you using a variable name ending in _?

while($pri_line = <INFO_PRI>)
{
@primary = split('\|',$pri_line);
$pri_cli_ip = $primary[4];
$pri_id = $primary[7];
print "$pri_id\n";


foreach $p_line (@pri_lines_)
{
@pri = split('\|',$p_line);

You keep doing the same split over and over. Split the line first, and
keep the results in a data structure till you need them.

$cli_ip = $pri[4];
$id = $pri[7];

if(($pri_cli_ip == $cli_ip) && ($pri__id == $id))

Did you read perldoc -q unique? It says to use a hash for finding
uniqueness.

{
$cnt++;

You are not resetting $cnt between iterations of the outer loop, so
every other line will be considered a duplicate.

if($cnt == 2){
$cnt = 0;
$flag = 1;
last;

If you give the outer loop a label, you can use next LABEL and avoid
$flag.

}
}
}
if($flag == 0){
open(FILE,'>>pri_unique.txt');
print FILE "$pri_line\n";
close(FILE);

Why do you keep opening and closing this file?
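
Two of the points above (split each line once, and label the outer loop instead of using $flag) could look roughly like the sketch below. It is only an illustration, not the OP's script: the sub name unique_lines_nested is made up, and it assumes the IP and Code are fields 2 and 3 (counting from 0) of the stated ID | Time | IP | Code layout, whereas the posted script reads fields 4 and 7. It keeps the slow nested loop purely to show next LABEL; a hash, as in perldoc -q unique, remains the better tool.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines whose IP|Code pair occurs on no
# other line. Input lines are expected to be chomped already.
sub unique_lines_nested {
    my @lines = @_;

    # Split every line exactly once and remember its IP|Code key.
    my @keys = map { join '|', (split /\s*\|\s*/, $_)[2, 3] } @lines;

    my @unique;
    LINE: for my $i (0 .. $#lines) {
        for my $j (0 .. $#lines) {
            next if $i == $j;                     # don't compare a line with itself
            next LINE if $keys[$i] eq $keys[$j];  # duplicate found: skip this line
        }
        push @unique, $lines[$i];                 # no other line shares the key
    }
    return @unique;
}

print "$_\n" for unique_lines_nested(
    '1 | 10:00 | 10.0.0.1 | 200',
    '2 | 10:01 | 10.0.0.1 | 200',
    '3 | 10:02 | 10.0.0.2 | 200',
);
```

Note the eq: IP addresses are strings, so comparing them with == (as the posted code does) compares their numeric prefixes and gives wrong answers.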

Ben
 
J. Gleixner

Below is code which I have written to extract unique IP+Code from
large file. (File format is ID | Time | IP | code).

I am not sure which will be best way to do this.

Well, it's not the way you posted.

Did you actually read the perldoc Ben mentioned above? You don't use a
hash at all, so I'm guessing not.
#!/usr/local/bin/perl
use strict;
use warnings;

my $pri_file = 'out_pri.txt';   # input file name, taken from the OP's script

open( my $INFO, '<', $pri_file ) or die "Can't open $pri_file: $!";
open( my $OUT, '>', 'unique.out' ) or die "Can't open unique.out: $!";

my %info;
while ( my $line = <$INFO> )
{
chomp( $line );
# split the data.. you can split directly into the variables..
# my ( $v1, $v2 ) = ( split( /\|/, $line ) )[1,2];
# print $line to $OUT if the hash key of $cli_ip and $id doesn't already
# exist.

}
 
Jürgen Exner

J. Gleixner said:
Below is code which I have written to extract unique IP+Code from
large file. (File format is ID | Time | IP | code).

I am not sure which will be best way to do this.

Well, it's not the way you posted.

Did you actually read the perldoc Ben mentioned above? You don't use a
hash at all, so I'm guessing not.
ACK!

while ( my $line = <$INFO> )
{
chomp( $line );
# split the data.. you can split directly into the variables..
# my ( $v1, $v2 ) = ( split( /\|/, $line ) )[1,2];
# print $line to $OUT if the hash key of $cli_ip and $id doesn't already
# exist.

That will print each IP+code exactly once. I think (but I may be
mistaken, the OP isn't clear on that) he wants only those lines that
_are_ unique wrt. the IP+code, i.e. where there is no second line with
the same IP+code.

jue
 
friend.05

Well, it's not the way you posted.
Did you actually read the perldoc Ben mentioned above?  You don't use a
hash at all, so I'm guessing not.
ACK!

while ( my $line = <$INFO> )
{
   chomp( $line );
# split the data.. you can split directly into the variables..
# my ( $v1, $v2 ) = ( split( /\|/, $line ) )[1,2];
# print $line to $OUT if the hash key of $cli_ip and $id doesn't already
# exist.

That will print each IP+code exactly once. I think (but I may be
mistaken, the OPs isn't clear on that) he wants only those lines, that
_are_ unique wrt. the IP+code, i.e. where there is no second line with
the same IP+code.

jue- Hide quoted text -

- Show quoted text -

Thanks to all for the help. That was helpful.

But.

I created the hash of (IP+Code) combinations.

But how do I check whether each combination in this hash occurs exactly
once in the file?
 
Jürgen Exner

I created the hash of (IP+Code) combinations.

But how do I check whether each combination in this hash occurs exactly
once in the file?

You could count the number of occurrences and then compare the count
against 1?

$IDTable{"$IP+$Code"}++;
[......]

if ($IDTable{"$IP+$Code"} == 1) {
print "Look ma, $IP+$Code occurs exactly once in the file\n";
}
 
J. Gleixner

Jürgen Exner said:
J. Gleixner said:
Quoth "(e-mail address removed)":

I have a large file in following format:
ID | Time | IP | Code
I want only data lines which has unique IP+Code.
If IP+Code is repeated then I don't want line.
perldoc -q unique

Ben
Below is code which I have written to extract unique IP+Code from
large file. (File format is ID | Time | IP | code).

I am not sure which will be best way to do this.
Well, it's not the way you posted.

Did you actually read the perldoc Ben mentioned above? You don't use a
hash at all, so I'm guessing not.
ACK!

while ( my $line = <$INFO> )
{
chomp( $line );
# split the data.. you can split directly into the variables..
# my ( $v1, $v2 ) = ( split( /\|/, $line ) )[1,2];
# print $line to $OUT if the hash key of $cli_ip and $id doesn't already
# exist.

That will print each IP+code exactly once. I think (but I may be
mistaken, the OPs isn't clear on that) he wants only those lines, that
_are_ unique wrt. the IP+code, i.e. where there is no second line with
the same IP+code.

You're right, I misunderstood.

A fairly easy to follow solution would be to keep track of the data,
using two hashes.

my (%times, %line );

while(...)
{
# chomp,split,...
# times is the number of times the $cli_ip and $id were found
$times{ $cli_ip . $id }++;
# could 'next' if it is > 1
# and store the line itself, for the $cli_ip and $id
$line{ $cli_ip . $id } = $line;
}

Then, after the while, for each of the keys in %times, print the
value from %line where the value of $times{ $key } is 1, to the output file.

That should be enough to get the OP in the right direction, without
writing the whole darn thing for them.
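
For readers who want to see the shape of the two-hash idea, here is a minimal sketch. It is not J. Gleixner's code: the sub name unique_by_two_hashes is made up, it works on a list of already-chomped lines rather than a filehandle, it assumes the IP and Code are fields 2 and 3 (counting from 0) of the stated ID | Time | IP | Code layout, and it joins the two fields with '|' so that different pairs cannot produce the same key.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines whose IP|Code pair occurs exactly once.
sub unique_by_two_hashes {
    my @lines = @_;
    my (%times, %line);

    for my $line (@lines) {
        my ($cli_ip, $id) = (split /\s*\|\s*/, $line)[2, 3];
        my $key = "$cli_ip|$id";
        $times{$key}++;         # how often this pair has been seen
        $line{$key} = $line;    # last line seen for this pair
    }

    # Keep only the pairs seen once; sorting the keys makes the order stable.
    return map { $line{$_} } grep { $times{$_} == 1 } sort keys %times;
}

print "$_\n" for unique_by_two_hashes(
    '1 | 10:00 | 10.0.0.1 | 200',
    '2 | 10:01 | 10.0.0.1 | 200',
    '3 | 10:02 | 10.0.0.2 | 200',
);
```

The output comes back in key order, not file order. If the original order matters, the two-pass approach over the file keeps it.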
 
friend.05

Below is code which I have written to extract unique IP+Code from
large file. (File format is ID | Time | IP | code).
I am not sure which will be best way to do this.
#!/usr/local/bin/perl
$pri_file = "out_pri.txt";
$cnt = 0;
$flag = 0;
open(INFO_PRI,$pri_file)or die $!;
open(INFO,$pri_file)or die $!;
@pri_lines_ = <INFO>;
while($pri_line = <INFO_PRI>)

[rest of code snipped]

Many things I don't understand in this code, among them why you are
using 2 file handles to the same file, why you are slurping in the whole
file on one file handle and then process the file line by line on the
other file handle, why you have a nested loop, etc, etc.

Your requirements seem to be straight forward and easy to translate into
a simple algorithm (warning, sketch only, not tested):

my %idtable;
open (my $F, '<', $myfile) or die "Cannot read $myfile because $!\n";
while (<$F>) { #loop through file and gather all IP | Code combinations
        my (undef, undef, $ip, $code) = split /\|/;
        $idtable{"$ip|$code"}++; #record this ip-code combination
}
seek $F, 0, 0; #reset file to start (seek needs a WHENCE argument)
while (<$F>) { #loop through file again and ....
        my (undef, undef, $ip, $code) = split /\|/;
        print if $idtable{"$ip|$code"} == 1;
                #... print that line if the ip-code combination
                #exists exactly once in the file
}
close $F;

jue- Hide quoted text -

- Show quoted text -

Hi jue,

IF I use

$idtable{"$ip|$code"}++; #record this ip-code combination

will this not replace the previous value if the same key (ip-code) comes
again?
 
Jürgen Exner

What is this "Hide quoted text - Show quoted text" nonsense?
IF I use

$idtable{"$ip|$code"}++; #record this ip-code combination

will this not replace the previous value if the same key (ip-code) comes
again?

Of course it does, that is the whole purpose. Or how do you suggest
counting the number of occurrences if not by replacing the previous number
with the new number?
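
A tiny sketch of what that increment does, in case it helps. The sub name count_keys and the sample keys are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each key occurs in the argument list.
sub count_keys {
    my %idtable;
    # The first ++ on a missing key turns undef into 1; each later ++
    # replaces the stored number with the next higher one.
    $idtable{$_}++ for @_;
    return \%idtable;
}

my $counts = count_keys('10.0.0.1|200', '10.0.0.2|404', '10.0.0.1|200');
print "$_ seen $counts->{$_} time(s)\n" for sort keys %$counts;
```

So the hash ends up holding one entry per distinct key, and the value is the number of lines that had that key.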

jue
 
friend.05

What is this "Hide quoted text - Show quoted text" nonsense?




Of course it does, that is the whole purpose. Or how do you suggest
counting the number of occurrences if not by replacing the previous number
with the new number?

jue

Got it, thanks.

Sorry about the hide quoted text. I don't know what that is either; I
must have clicked it by mistake while replying.
 
sln

Quoth "(e-mail address removed)":
I have a large file in following format:
ID | Time | IP | Code
I want only data lines which has unique IP+Code.
If IP+Code is repeated then I don't want line.
Below is code which I have written to extract unique IP+Code from
large file. (File format is ID | Time | IP | code).
I am not sure which will be best way to do this.
#!/usr/local/bin/perl
$pri_file = "out_pri.txt";
$cnt = 0;
$flag = 0;
open(INFO_PRI,$pri_file)or die $!;
open(INFO,$pri_file)or die $!;
@pri_lines_ = <INFO>;
while($pri_line = <INFO_PRI>)

[rest of code snipped]

Many things I don't understand in this code, among them why you are
using 2 file handles to the same file, why you are slurping in the whole
file on one file handle and then process the file line by line on the
other file handle, why you have a nested loop, etc, etc.

Your requirements seem to be straight forward and easy to translate into
a simple algorithm (warning, sketch only, not tested):

my %idtable;
open (my $F, '<', $myfile) or die "Cannot read $myfile because $!\n";
while (<$F>) { #loop through file and gather all IP | Code combinations
        my (undef, undef, $ip, $code) = split /\|/;
        $idtable{"$ip|$code"}++; #record this ip-code combination
}
seek $F, 0, 0; #reset file to start (seek needs a WHENCE argument)
while (<$F>) { #loop through file again and ....
        my (undef, undef, $ip, $code) = split /\|/;
        print if $idtable{"$ip|$code"} == 1;
                #... print that line if the ip-code combination
                #exists exactly once in the file
}
close $F;

jue- Hide quoted text -

- Show quoted text -

Hi jue,

IF I use

$idtable{"$ip|$code"}++; #record this ip-code combination

will this not replace the previous value if the same key (ip-code) comes
again?

This may not have been clear....

"$idtable{"$ip|$code"}", in this case, is just a variable used as
a counter. It's no different than incrementing any other counter,
like $cnt++

In that respect, it just uses the IP and Code, concatenated into one
string, as a key into a hash, where the key contains the
encoded data.

In my opinion, this is not the way to go. If there are only a few IPs
and many, many Codes, this could create an inordinately large hash,
resulting in long lookup times.

You could double your money by getting the unique IPs, as well as reducing the
CPU overhead, if you do it this way:

$idtable{$ip}->{$code}++

There is a tradeoff. I don't really know. It depends on whether the number of unique
Codes is predicted to outnumber the number of IPs ... or something like that.
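
A rough sketch of the nested layout, for comparison. The sub name count_nested is made up, the input is a list of already-chomped lines, and the IP and Code are assumed to be fields 2 and 3 (counting from 0) of the stated ID | Time | IP | Code layout. It shows only the data structure; whether it is actually faster than the flat "$ip|$code" key is not something this sketch measures.

```perl
use strict;
use warnings;

# Hypothetical helper: build a hash of hashes, IP => { Code => count }.
sub count_nested {
    my @lines = @_;
    my %idtable;
    for my $line (@lines) {
        my ($ip, $code) = (split /\s*\|\s*/, $line)[2, 3];
        $idtable{$ip}{$code}++;   # inner hash is autovivified on first use
    }
    return \%idtable;
}

my $table = count_nested(
    '1 | 10:00 | 10.0.0.1 | 200',
    '2 | 10:01 | 10.0.0.1 | 200',
    '3 | 10:02 | 10.0.0.2 | 200',
);
print 'distinct IPs: ', scalar(keys %$table), "\n";
```

The top-level keys give the list of distinct IPs as a side effect, which is the "double your money" part.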

sln
 
sln

Jürgen Exner said:
J. Gleixner said:
(e-mail address removed) wrote:
Quoth "(e-mail address removed)":

I have a large file in following format:
ID | Time | IP | Code
I want only data lines which has unique IP+Code.
If IP+Code is repeated then I don't want line.
perldoc -q unique

Ben
Below is code which I have written to extract unique IP+Code from
large file. (File format is ID | Time | IP | code).

I am not sure which will be best way to do this.
Well, it's not the way you posted.

Did you actually read the perldoc Ben mentioned above? You don't use a
hash at all, so I'm guessing not.
ACK!

while ( my $line = <$INFO> )
{
chomp( $line );
# split the data.. you can split directly into the variables..
# my ( $v1, $v2 ) = ( split( /\|/, $line ) )[1,2];
# print $line to $OUT if the hash key of $cli_ip and $id doesn't already
# exist.

That will print each IP+code exactly once. I think (but I may be
mistaken, the OPs isn't clear on that) he wants only those lines, that
_are_ unique wrt. the IP+code, i.e. where there is no second line with
the same IP+code.

You're right, I mis-understood.

A fairly easy to follow solution would be to keep track of the data,
using two hashes.

my (%times, %line );

while(...)
{
# chomp,split,...
# times is the number of times the $cli_ip and $id were found
$times{ $cli_ip . $id }++;
# could 'next' if it is > 1
# and store the line itself, for the $cli_ip and $id
$line{ $cli_ip . $id } = $line;
}

Then, after the while, for each of the keys in %times, print the
value from %line where the value of $times{ $key } is 1, to the output file.

That should be enough to get the OP in the right direction, without
writing the whole darn thing for them.

Doesn't this overwrite what's already there? Not sure.
$line{ $cli_ip . $id } = $line;

sln
 
xhoster

Doesn't this overwrite what's already there? Not sure.
$line{ $cli_ip . $id } = $line;

Yes, of course. But since those lines won't get printed anyway (because
count > 1) then it doesn't matter if they get overwritten.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
xhoster

J. Gleixner said:
You're right, I mis-understood.

A fairly easy to follow solution would be to keep track of the data,
using two hashes.

my (%times, %line );

while(...)
{
# chomp,split,...
# times is the number of times the $cli_ip and $id were found
$times{ $cli_ip . $id }++;
# could 'next' if it is > 1
# and store the line itself, for the $cli_ip and $id
$line{ $cli_ip . $id } = $line;
}

I might go with just a single hash, using undef as a special value to
indicate we already have seen more than one.

my %line;

while(...)
{
# chomp,split,...
if (exists $line{ $cli_ip . $id }) {
$line{ $cli_ip . $id } = undef; #skunked
} else {
$line{ $cli_ip . $id } = $line;
};
}


Then, after the while, for each of the keys in %times, print the
value from %line where the value of $times{ $key } is 1, to the output
file.

Under my method, print the things from %line where the value is defined.
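
Filled in, the single-hash idea might look something like this. It is only a sketch under a few assumptions: the sub name unique_by_single_hash is made up, the input is a list of already-chomped lines, the IP and Code are taken to be fields 2 and 3 (counting from 0) of the stated ID | Time | IP | Code layout, and the key uses a '|' separator rather than plain concatenation.

```perl
use strict;
use warnings;

# Hypothetical helper: one hash, where undef marks a pair seen more than once.
sub unique_by_single_hash {
    my @lines = @_;
    my %line;

    for my $line (@lines) {
        my ($cli_ip, $id) = (split /\s*\|\s*/, $line)[2, 3];
        my $key = "$cli_ip|$id";
        if (exists $line{$key}) {
            $line{$key} = undef;    # skunked: seen before, never print it
        } else {
            $line{$key} = $line;    # first sighting: remember the line
        }
    }

    # Only pairs that were seen exactly once still hold a defined value.
    return grep { defined $_ } map { $line{$_} } sort keys %line;
}

print "$_\n" for unique_by_single_hash(
    '1 | 10:00 | 10.0.0.1 | 200',
    '2 | 10:01 | 10.0.0.1 | 200',
    '3 | 10:02 | 10.0.0.2 | 200',
);
```

The exists test matters here: a skunked key still exists in the hash even though its value is undef, so a third or fourth occurrence stays skunked.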

Xho

 
Jürgen Exner

J. Gleixner said:
$times{ $cli_ip . $id }++;

Careful! This may give wrong results in odd circumstances.
Example:
$cli_ip='foobar', $id='buz';
and
$cli_ip='foo', $id='barbuz';

Better to use the same separator as in the original data set, regardless
of whether such a scenario can actually happen with the OP's data set:

$times{ $cli_ip . '|' . $id }++;
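
The collision is easy to demonstrate. The two sub names below are made up for the illustration; the sample values are the ones from the example above:

```perl
use strict;
use warnings;

# Hypothetical helpers: build a hash key with and without a separator.
sub key_plain { my ($cli_ip, $id) = @_; return $cli_ip . $id; }
sub key_sep   { my ($cli_ip, $id) = @_; return $cli_ip . '|' . $id; }

for my $pair ( [ 'foobar', 'buz' ], [ 'foo', 'barbuz' ] ) {
    printf "plain: %-10s  separated: %s\n", key_plain(@$pair), key_sep(@$pair);
}
```

Both pairs collapse to the same plain key, so they would be counted as one combination, while the separated keys stay distinct. The separator is safe because '|' is the field delimiter and so cannot occur inside a field.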

jue
 
