dup remove - why/how does this work - NEWBIE

Discussion in 'Perl Misc' started by jason@cyberpine.com, Feb 16, 2004.

  1. Guest

    The simple code below works at removing dups from a 20k-record file.
    Looking for somebody to explain how/why.

    $db = "workb.txt";
    open (FILE,"$db");
    @lines=<FILE>;
    close(FILE);
    foreach $key (@lines){
    $lines{$key} = 1;
    }
    @lines = keys(%lines);
    print @lines;


    I understand I am adding a key = 1 for every line (is it for every
    line?), but when we recreate @lines, what exactly is keys(%lines)
    doing/saying? I see that %lines contains the unique records in the
    file, each mapped to 1.

    Thanks.
    , Feb 16, 2004
    #1

  2. Tony Curtis Guest

    >> On 16 Feb 2004 13:44:10 -0800,
    >> said:


    > The below simple code works at removing dups from a 20k
    > record file. Looking for somebody to explain how/why.


    It's not even close, I'm afraid.

    No strict, no warnings.

    > $db = "workb.txt";
    > open (FILE,"$db");


    open() untested. Unnecessary quotes around variable.

    > @lines=<FILE>;
    > close(FILE);


    Slurp all lines into memory, then below do a 2nd pass. This
    is wasteful; you only need to see each line once.

    You'll probably want to chomp() the lines too, since the
    trailing newline sequence is usually part of the file
    representation, not part of the data content per se.

    > foreach $key (@lines){
    > $lines{$key} = 1;
    > }
    > @lines = keys(%lines);
    > print @lines;


    > I understand I am adding a key = 1 to every line (is it to
    > every line?), but when we recreate @lines what exactly is


    "Adding" is a misleading word here, implying that the value of
    the line is being changed. "Associating" would be closer.

    > keys(%lines) doing/saying? I see that %lines contains
    > 1+unique records in the file).


    Using a hash is the right choice here, but see

    perldoc -q duplicate

    Essentially you want to, for each line, output the line only
    if you haven't seen that same line before (i.e. it's not the
    key of a hash). Output means either print() or save into an
    array for later processing, judging from your code.
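
    A single-pass version of that idea might look like this (a sketch,
    assuming strict/warnings and the OP's filename workb.txt):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Print each line of workb.txt only the first time it is seen.
    my %seen;
    open my $fh, '<', 'workb.txt' or die "Cannot open workb.txt: $!";
    while (my $line = <$fh>) {
        # postfix ++ returns the old value: 0 (false) on first sight
        print $line unless $seen{$line}++;
    }
    close $fh;
    ```

    No second pass, no array of all lines in memory: each line is
    tested against the hash and printed (or not) as it is read.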

    hth
    t
    Tony Curtis, Feb 16, 2004
    #2

  3. gnari Guest

    <> wrote in message
    news:...
    > The below simple code works at removing dups from a 20k record file.
    > Looking for somebody to explain how/why.
    >
    > $db = "workb.txt";
    > open (FILE,"$db");
    > @lines=<FILE>;
    > close(FILE);
    > foreach $key (@lines){
    > $lines{$key} = 1;
    > }
    > @lines = keys(%lines);
    > print @lines;
    >
    >
    > I understand I am adding a key = 1 to every line (is it to every
    > line?), but when we recreate @lines what exactly is keys(%lines)
    > doing/saying? I see that %lines contains 1+unique records in the
    > file).


    this is a common technique using a hash.

    a hash is a data structure that maps a set of 'keys' to their
    respective 'values'. each key has one value.

    in this case the hash is %lines (totally unrelated to the array @lines).
    each line of the input file is in turn added as a key to the hash, with
    an arbitrary value, in this case 1. as each key can only have one value,
    when a duplicate is encountered, the value is simply replaced with
    the new value, in this case the same value 1.

    the function keys() returns a list of the keys of a hash in an
    undefined order. in this case, the lines of the input file, with
    duplicates removed.

    the nice integration of hashes into the language is one of the
    distinctive features of Perl, and they are, along with regexes,
    usually the key to solving most perl problems.

    perldoc perldata
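
    the key-replacement behaviour described above can be seen in a few
    lines (a minimal sketch, not tied to the OP's file):

    ```perl
    use strict;
    use warnings;

    my %lines;
    # assigning the same key twice simply overwrites the value,
    # so each distinct line ends up in the hash exactly once
    $lines{$_} = 1 for ("foo\n", "bar\n", "foo\n");

    print scalar(keys %lines), "\n";   # prints 2 (order of keys is undefined)
    ```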

    gnari
    gnari, Feb 16, 2004
    #3
  4. Ben Morrow Guest

    Tony Curtis <tony_curtis32@_SPAMTRAP_yahoo.com> wrote:
    > >> On 16 Feb 2004 13:44:10 -0800,
    > >> said:

    >
    > > The below simple code works at removing dups from a 20k
    > > record file. Looking for somebody to explain how/why.

    >
    > It's not even close, I'm afraid.


    Well, it solves the problem asked. Yes, it has problems, but...

    > You'll probably want to chomp() the lines too, since the
    > trailing newline sequence is usually part of the file
    > representation, not part of the data content per se.


    In this case it isn't necessary: the lines are being compared for
    uniqueness, so the line with the $/ on the end is just as good as
    without. Think before you say things like this.

    > > foreach $key (@lines){
    > > $lines{$key} = 1;
    > > }
    > > @lines = keys(%lines);
    > > print @lines;

    >
    > > I understand I am adding a key = 1 to every line (is it to
    > > every line?), but when we recreate @lines what exactly is

    >
    > "Adding" is a misleading word here, implying that the value of
    > the line is being changed. "Associating" would be closer.


    Indeed. The important point, though, is that each key can only go into
    the hash once.

    > > keys(%lines) doing/saying? I see that %lines contains
    > > 1+unique records in the file).

    >
    > Using a hash is the right choice here, but see
    >
    > perldoc -q duplicate
    >
    > Essentially you want to, for each line, output the line only
    > if you haven't seen that same line before (i.e. it's not the
    > key of a hash).


    Yes, another WTDI would be to print the lines as you go along: this is
    more parsimonious, and outputs the lines in the original order.

    while (<F>) {
        print unless $lines{$_};
        $lines{$_} = 1;
    }

    This doesn't mean that the script as given is wrong, however.

    Ben

    --
    $.=1;*g=sub{print@_};sub r($$\$){my($w,$x,$y)=@_;for(keys%$x){/main/&&next;*p=$
    $x{$_};/(\w)::$/&&(r($w.$1,$x.$_,$y),next);$y eq\$p&&&g("$w$_")}};sub t{for(@_)
    {$f&&($_||&g(" "));$f=1;r"","::",$_;$_&&&g(chr(0012))}};t #
    $J::u::s::t, $a::n::o::t::h::e::r, $P::e::r::l, $h::a::c::k::e::r, $.
    Ben Morrow, Feb 16, 2004
    #4
  5. Tony Curtis Guest

    >> On Mon, 16 Feb 2004 23:49:10 +0000 (UTC),
    >> Ben Morrow <> said:


    >> Me:
    >> You'll probably want to chomp() the lines too, since the
    >> trailing newline sequence is usually part of the file
    >> representation, not part of the data content per se.


    > In this case it isn't necessary: the lines are being
    > compared for uniquness, so the line with the $/ on the end
    > is just as good as without. Think before you say things like
    > this.


    Oh, I thought about it :)

    The OP posted similar code before that did something slightly
    different. It all depends on what is meant to happen later;
    this small example is almost certainly not the full story.
    Which is why I qualified the suggestion ("probably").

    For myself, I'd rather lose the newline as it's read; this way
    I have a canonicalised internal representation of my data
    immediately. The newline is a sequence that serves to
    separate individual data units in a serialisation of the data,
    so away it goes.
    Tony Curtis, Feb 16, 2004
    #5
  6. Eric Bohlman Guest

    Tony Curtis <tony_curtis32@_SPAMTRAP_yahoo.com> wrote in
    news::

    > For myself, I'd rather lose the newline as it's read; this way
    > I have a canonicalised internal representation of my data
    > immediately. The newline is a sequence that serves to
    > separate individual data units in a serialisation of the data,
    > so away it goes.


    Except the only thing the OP needed to do with the data was print (part of)
    it out again, which means he'd just have to put the newlines back anyway.
    IOW, he's not working with his lines as abstract data, just as pure
    representations of the serialized form.
    Eric Bohlman, Feb 17, 2004
    #6
  7. Tony Curtis Guest

    >> On 17 Feb 2004 00:33:45 GMT,
    >> Eric Bohlman <> said:


    > Tony Curtis <tony_curtis32@_SPAMTRAP_yahoo.com> wrote in
    > news::


    >> For myself, I'd rather lose the newline as it's read; this
    >> way I have a canonicalised internal representation of my
    >> data immediately. The newline is a sequence that serves to
    >> separate individual data units in a serialisation of the
    >> data, so away it goes.


    > Except the only thing the OP needed to do with the data was
    > print (part of) it out again, which means he'd just have to
    > put the newlines back anyway. IOW, he's not working with
    > his lines as abstract data, just as pure representations of
    > the serialized form.


    Possibly. But we don't know for sure do we?

    Do it or don't do it; whichever is best for the situation...
    Tony Curtis, Feb 17, 2004
    #7
  8. Mina Naguib Guest


    Ben Morrow wrote:
    > while (<F>) {
    > print unless $lines{$_};
    > $lines{$_} = 1;
    > }


    Not for the clarity-seekers (or good-coding-standards learning
    purposes), but the whole script can be reduced to:

    #!/usr/bin/perl -n

    print unless $seen{$_}++;
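
    The -n switch does the looping: perl wraps the code in an implicit
    read loop, so the one-liner above behaves roughly like this expanded
    sketch (the real -n wrapper uses $_ and no strict):

    ```perl
    use strict;
    use warnings;

    # rough equivalent of what `perl -n` wraps around the one-liner
    my %seen;
    while (defined(my $line = <ARGV>)) {
        print $line unless $seen{$line}++;
    }
    ```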

    Mina Naguib, Feb 17, 2004
    #8
