dup remove - why/how does this work - NEWBIE

jason

The below simple code works at removing dups from a 20k record file.
Looking for somebody to explain how/why.

$db = "workb.txt";
open (FILE,"$db");
@lines=<FILE>;
close(FILE);
foreach $key (@lines){
$lines{$key} = 1;
}
@lines = keys(%lines);
print @lines;


I understand I am adding a key = 1 to every line (is it to every
line?), but when we recreate @lines what exactly is keys(%lines)
doing/saying? (I see that %lines contains 1+unique records in the
file.)

Thanks.
 
Tony Curtis

On 16 Feb 2004 13:44:10 -0800,
The below simple code works at removing dups from a 20k
record file. Looking for somebody to explain how/why.

It's not even close, I'm afraid.

No strict, warnings.
$db = "workb.txt";
open (FILE,"$db");

open() untested. Unnecessary quotes around variable.
@lines=<FILE>;
close(FILE);

Slurp all lines into memory, then below do a 2nd pass. This
is wasteful, you only need to see each line once.

You'll probably want to chomp() the lines too, since the
trailing newline sequence is usually part of the file
representation, not part of the data content per se.
foreach $key (@lines){
$lines{$key} = 1;
}
@lines = keys(%lines);
print @lines;
I understand I am adding a key = 1 to every line (is it to
every line?), but when we recreate @lines what exactly is
keys(%lines) doing/saying? (I see that %lines contains
1+unique records in the file.)

"Adding" is a misleading word here, implying that the value of
the line is being changed. "Associating" would be closer.

Using a hash is the right choice here, but see

perldoc -q duplicate

Essentially you want to, for each line, output the line only
if you haven't seen that same line before (i.e. it's not
already a key of the hash). Output means either print() or save
into an array for later processing, judging from your code.
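A minimal sketch of that single-pass approach, with strict, warnings and a checked open(). This is not the code from perldoc; the helper name unique_lines and the sample data are made up for illustration, and an in-memory filehandle stands in for workb.txt so the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): read a filehandle once and
# return its lines in first-seen order, with duplicates dropped.
sub unique_lines {
    my ($fh) = @_;
    my %seen;
    my @unique;
    while (my $line = <$fh>) {
        chomp $line;                # optional here; drop it to keep the newlines
        next if $seen{$line}++;     # already a key of the hash: skip it
        push @unique, $line;        # first sighting: save for later processing
    }
    return @unique;
}

# Made-up sample data read through an in-memory filehandle.
# For the real file: open(my $fh, '<', $db) or die "Cannot open $db: $!";
my $data = "apple\npear\napple\nfig\npear\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory data: $!";
my @lines = unique_lines($fh);
close($fh);

print "$_\n" for @lines;
```

Each line is looked at exactly once, and the array keeps the order in which lines first appeared.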

hth
t
 
gnari

The below simple code works at removing dups from a 20k record file.
Looking for somebody to explain how/why.

$db = "workb.txt";
open (FILE,"$db");
@lines=<FILE>;
close(FILE);
foreach $key (@lines){
$lines{$key} = 1;
}
@lines = keys(%lines);
print @lines;


I understand I am adding a key = 1 to every line (is it to every
line?), but when we recreate @lines what exactly is keys(%lines)
doing/saying? (I see that %lines contains 1+unique records in the
file.)

this is a common technique using a hash.

a hash is a data structure that maps a set of 'keys' to their
respective 'values'. each key has exactly one value.

in this case the hash is %lines (totally unrelated to the array @lines).
each line of the input file is in turn added as a key to the hash, with
an arbitrary value, in this case 1. as each key can only have 1 value,
when a duplicate is encountered, the value is simply replaced with
the new value, in this case the same value 1.

the function keys() returns a list of the keys of a hash in an
undefined order. in this case, the lines of the input file, with
duplicates removed.
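a small sketch of the above, using made-up sample data instead of the file. the sort is only there to make the output predictable, since keys() promises no particular order.

```perl
use strict;
use warnings;

my @lines = ("a\n", "b\n", "a\n", "c\n", "b\n");   # made-up sample data

my %lines;                      # a different variable from @lines
$lines{$_} = 1 for @lines;      # a repeated key just gets its value set to 1 again

# keys() returns each key once, in no particular order; sort for a stable display.
my @unique = sort keys %lines;

print scalar(@unique), " unique of ", scalar(@lines), " lines\n";
print @unique;
```

five lines go in, and the hash ends up with only three keys, because "a\n" and "b\n" each land on a key that already exists.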

the nice integration of hashes into the language is one of the
distinctive features of Perl, and they are, along with regexes,
usually the key to solving most Perl problems.

perldoc perldata

gnari
 
Ben Morrow

Tony Curtis said:
It's not even close, I'm afraid.

Well, it solves the problem asked. Yes, it has problems, but...
You'll probably want to chomp() the lines too, since the
trailing newline sequence is usually part of the file
representation, not part of the data content per se.

In this case it isn't necessary: the lines are being compared for
uniqueness, so the line with the $/ on the end is just as good as
without. Think before you say things like this.
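A quick sketch of that point, with made-up data and a hypothetical helper count_unique. The second case is an edge case not raised in the thread: if the last line of a file has no trailing newline, it will not match an earlier copy that does.

```perl
use strict;
use warnings;

# Hypothetical helper: number of distinct strings in a list.
sub count_unique {
    my %seen;
    $seen{$_} = 1 for @_;
    return scalar keys %seen;
}

# Every line ends in a newline: chomp() makes no difference to the count.
my @raw = ("x\n", "y\n", "x\n");
chomp(my @chomped = @raw);
my $raw_count     = count_unique(@raw);
my $chomped_count = count_unique(@chomped);

# Last line lacks its newline: "x\n" and "x" are different keys.
my @ragged = ("x\n", "y\n", "x");
chomp(my @ragged_chomped = @ragged);
my $ragged_count         = count_unique(@ragged);
my $ragged_chomped_count = count_unique(@ragged_chomped);

print "terminated: raw=$raw_count chomped=$chomped_count\n";
print "ragged:     raw=$ragged_count chomped=$ragged_chomped_count\n";
```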
"Adding" is a misleading word here, implying that the value of
the line is being changed. "Associating" would be closer.

Indeed. The important point, though, is that each key can only go into
the hash once.
Using a hash is the right choice here, but see

perldoc -q duplicate

Essentially you want to, for each line, output the line only
if you haven't seen that same line before (i.e. it's not
already a key of the hash).

Yes, another WTDI would be to print the lines as you go along: this is
more parsimonious, and outputs the lines in the original order.

while (<F>) {
print unless $lines{$_};
$lines{$_} = 1;
}

This doesn't mean that the script as given is wrong, however.

Ben
 
Tony Curtis

In this case it isn't necessary: the lines are being
compared for uniqueness, so the line with the $/ on the end
is just as good as without. Think before you say things like
this.

Oh, I thought about it :)

The OP posted similar code before that did something slightly
different. It all depends on what is meant to happen later;
this small example is almost certainly not the full story.
Which is why I qualified the suggestion ("probably").

For myself, I'd rather lose the newline as it's read; this way
I have a canonicalised internal representation of my data
immediately. The newline is a sequence that serves to
separate individual data units in a serialisation of the data,
so away it goes.
 
Eric Bohlman

For myself, I'd rather lose the newline as it's read; this way
I have a canonicalised internal representation of my data
immediately. The newline is a sequence that serves to
separate individual data units in a serialisation of the data,
so away it goes.

Except the only thing the OP needed to do with the data was print (part of)
it out again, which means he'd just have to put the newlines back anyway.
IOW, he's not working with his lines as abstract data, just as pure
representations of the serialized form.
 
Tony Curtis

On 17 Feb 2004 00:33:45 GMT,
Except the only thing the OP needed to do with the data was
print (part of) it out again, which means he'd just have to
put the newlines back anyway. IOW, he's not working with
his lines as abstract data, just as pure representations of
the serialized form.

Possibly. But we don't know for sure, do we?

Do it or don't do it; whichever is best for the situation...
 
Mina Naguib

Ben said:
while (<F>) {
print unless $lines{$_};
$lines{$_} = 1;
}

Not for the clarity-seekers (or good-coding-standards learning
purposes), but the whole script can be summarized to:

#!/usr/bin/perl -n

print unless $seen{$_}++;
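
For anyone puzzling over the one-liner: the -n switch wraps the script in a while (<>) { ... } loop, and $seen{$_}++ returns the old count (undefined, so false, the first time a line appears) before incrementing it. A spelled-out sketch of the same logic on made-up input, collecting into an array instead of printing directly:

```perl
use strict;
use warnings;

my %seen;
my @out;

for my $line ("a\n", "b\n", "a\n", "a\n") {     # made-up input lines
    # Post-increment yields the value from BEFORE the increment:
    # false on the first sighting, 1 or more on every repeat.
    push @out, $line unless $seen{$line}++;
}

print @out;
print "first line seen ", $seen{"a\n"}, " times\n";
```

As a side effect, %seen ends up holding how many times each line occurred, not just a flag.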

 
