Re: removing duplication from a huge list.

Discussion in 'Python' started by Chris Rebert, Feb 27, 2009.

  1. Chris Rebert

    Chris Rebert Guest

    On Thu, Feb 26, 2009 at 8:49 PM, Benjamin Peterson <> wrote:
    > Shanmuga Rajan <m.shanmugarajan <at> gmail.com> writes:
    >
    >> If any one suggests a better solution then I will be very happy. Advance thanks
    >> for any help. Shan
    >
    > Use a set.


    To expand on that a bit:

    counted_recs = set(rec[0] for rec in some_fun())
    #or in Python 3.0:
    counted_recs = {rec[0] for rec in some_fun()}

    Cheers,
    Chris

    --
    Follow the path of the Iguana...
    http://rebertia.com
     
    Chris Rebert, Feb 27, 2009
    #1

  2. odeits

    odeits Guest

    On Feb 26, 9:15 pm, Chris Rebert <> wrote:
    > On Thu, Feb 26, 2009 at 8:49 PM, Benjamin Peterson <> wrote:
    > > Shanmuga Rajan <m.shanmugarajan <at> gmail.com> writes:

    >
    > >> If any one suggests a better solution then I will be very happy. Advance thanks
    > >> for any help. Shan

    >
    > > Use a set.

    >
    > To expand on that a bit:
    >
    > counted_recs = set(rec[0] for rec in some_fun())
    > #or in Python 3.0:
    > counted_recs = {rec[0] for rec in some_fun()}
    >
    > Cheers,
    > Chris
    >
    > --
    > Follow the path of the Iguana...
    > http://rebertia.com


    How big of a list are we talking about? If the list is so big that the
    entire list cannot fit in memory at the same time, this approach won't
    work, e.g. removing duplicate lines from a very large file. A couple of
    things come into play at that point. If order does not matter, then I
    would suggest looking at some recipes for first sorting the large file
    and then iterating through the lines, removing duplicates as you go
    (if cur_line != last_line: write cur_line; last_line = cur_line).
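
    A minimal sketch of that second, duplicate-skipping pass, assuming the
    file has already been sorted externally so equal lines sit next to each
    other (the file names here are placeholders):

    # Scan a sorted file and keep only the first line of each run of equal lines.
    def uniq_sorted(infile, outfile):
        last_line = None
        for cur_line in infile:
            if cur_line != last_line:
                outfile.write(cur_line)
                last_line = cur_line

    inf = open('sorted_input.txt')
    outf = open('deduped_output.txt', 'w')
    uniq_sorted(inf, outf)
    inf.close()
    outf.close()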

    If order does matter then let me know and I will post a recipe for
    that.
     
    odeits, Feb 27, 2009
    #2

  3. bearophile

    Guest

    odeits:
    > How big of a list are we talking about? If the list is so big that the
    > entire list cannot fit in memory at the same time this approach wont
    > work e.g. removing duplicate lines from a very large file.


    If the data are lines of a file, and keeping the original order isn't
    important, then the first thing to try may be to use the unix (or
    cygwin on Windows) commands sort and uniq.
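
    A minimal sketch of shelling out to those tools from Python (the file
    names are placeholders, and a Unix-like environment with sort and uniq
    on the PATH is assumed):

    import subprocess
    # Sort the file, collapse adjacent duplicate lines, write the result out.
    subprocess.check_call("sort huge_input.txt | uniq > deduped.txt", shell=True)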

    Bye,
    bearophile
     
    bearophile, Feb 27, 2009
    #3
  4. Stefan Behnel

    Stefan Behnel Guest

    wrote:
    > odeits:
    >> How big of a list are we talking about? If the list is so big that the
    >> entire list cannot fit in memory at the same time this approach wont
    >> work e.g. removing duplicate lines from a very large file.

    >
    > If the data are lines of a file, and keeping the original order isn't
    > important, then the first to try may be to use the unix (or cygwin on
    > Windows) commands sort and uniq.


    or preferably "sort -u", in case that's supported.

    Stefan
     
    Stefan Behnel, Feb 27, 2009
    #4
  5. odeits

    odeits Guest

    On Feb 27, 1:18 am, Stefan Behnel <> wrote:
    > wrote:
    > > odeits:
    > >> How big of a list are we talking about? If the list is so big that the
    > >> entire list cannot fit in memory at the same time this approach wont
    > >> work e.g. removing duplicate lines from a very large file.

    >
    > > If the data are lines of a file, and keeping the original order isn't
    > > important, then the first to try may be to use the unix (or cygwin on
    > > Windows) commands sort and uniq.

    >
    > or preferably "sort -u", in case that's supported.
    >
    > Stefan


    Although this is true, that is more of an answer to the question "How
    do I remove duplicates from a huge list in Unix?".
     
    odeits, Feb 27, 2009
    #5
  6. bearophile

    Guest

    odeits:
    > Although this is true, that is more of an answer to the question "How
    > do i remove duplicates from a huge list in Unix?".


    Don't you like cygwin?

    Bye,
    bearophile
     
    bearophile, Feb 27, 2009
    #6
  7. Tim Rowe

    Tim Rowe Guest

    2009/2/27 odeits <>:

    > How big of a list are we talking about? If the list is so big that the
    > entire list cannot fit in memory at the same time this approach wont
    > work e.g. removing duplicate lines from a very large file.


    We were told in the original question: more than 15 million records,
    and it won't all fit into memory. So your observation is pertinent.

    --
    Tim Rowe
     
    Tim Rowe, Feb 27, 2009
    #7
  8. Falcolas

    Falcolas Guest

    On Feb 27, 8:33 am, Tim Rowe <> wrote:
    > 2009/2/27 odeits <>:
    >
    > > How big of a list are we talking about? If the list is so big that the
    > > entire list cannot fit in memory at the same time this approach wont
    > > work e.g. removing duplicate lines from a very large file.

    >
    > We were told in the original question: more than 15 million records,
    > and it won't all fit into memory. So your observation is pertinent.
    >
    > --
    > Tim Rowe


    If order did matter, and the list itself couldn't be stored in memory,
    I would personally compute some sort of hash of each item (or something
    as simple as the first 5 bytes, last 5 bytes and length), keep a
    reference to which item each hash belongs to, sort and identify
    duplicates in the hashes, and use the references to check whether the
    actual items in question match as well.

    Pretty brutish and slow, but it's the first algorithm which comes to
    mind. Of course, I'm assuming that the list items are long enough to
    warrant using a hash and not the values themselves.
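
    A rough sketch of that approach, under the stated assumptions (the
    fingerprint, helper names and file name below are all made up for
    illustration; records are taken to be newline-delimited, and the
    (fingerprint, offset) pairs are assumed to fit in memory even when the
    records themselves don't):

    # Collect (fingerprint, offset) pairs, sort them so equal fingerprints are
    # adjacent, then re-read only the colliding lines to confirm real duplicates.
    def fingerprint(line):
        # Cheap stand-in for a hash: first 5 bytes, last 5 bytes and length.
        return (line[:5], line[-5:], len(line))

    def find_duplicate_offsets(path):
        f = open(path, 'rb')
        entries = []                        # (fingerprint, byte offset) pairs
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            entries.append((fingerprint(line), offset))
        entries.sort()                      # equal fingerprints are now adjacent
        duplicates = []
        i = 0
        while i < len(entries):
            j = i
            while j < len(entries) and entries[j][0] == entries[i][0]:
                j += 1
            if j - i > 1:                   # same fingerprint: verify for real
                seen = set()
                for fp, off in entries[i:j]:
                    f.seek(off)
                    line = f.readline()
                    if line in seen:
                        duplicates.append(off)
                    else:
                        seen.add(line)
            i = j
        f.close()
        return duplicates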

    ~G
     
    Falcolas, Feb 27, 2009
    #8
  9. Steve Holden

    Steve Holden Guest

    Falcolas wrote:
    > On Feb 27, 8:33 am, Tim Rowe <> wrote:
    >> 2009/2/27 odeits <>:
    >>
    >>> How big of a list are we talking about? If the list is so big that the
    >>> entire list cannot fit in memory at the same time this approach wont
    >>> work e.g. removing duplicate lines from a very large file.

    >> We were told in the original question: more than 15 million records,
    >> and it won't all fit into memory. So your observation is pertinent.
    >>
    >> --
    >> Tim Rowe

    >
    > If order did matter, and the list itself couldn't be stored in memory,
    > I would personally do some sort of hash of each item (or something as
    > simple as first 5 bytes, last 5 bytes and length), keeping a reference
    > to which item the hash belongs, sort and identify duplicates in the
    > hash, and using the reference check to see if the actual items in
    > question match as well.
    >

    Assuming no duplicates, how does this help? You still have to verify
    collisions.

    > Pretty brutish and slow, but it's the first algorithm which comes to
    > mind. Of course, I'm assuming that the list items are long enough to
    > warrant using a hash and not the values themselves.
    >

    The problem is you can't guarantee any hash value will be unique, so
    ultimately you have to check whether list entries hashing to the same
    value are or are not equal. Which would mean either having the whole
    list in memory or having access to it via some other random access method.

    regards
    Steve
    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    Holden Web LLC http://www.holdenweb.com/
     
    Steve Holden, Feb 27, 2009
    #9
  10. Tim Chase

    Tim Chase Guest

    >> How big of a list are we talking about? If the list is so big that the
    >> entire list cannot fit in memory at the same time this approach wont
    >> work e.g. removing duplicate lines from a very large file.

    >
    > We were told in the original question: more than 15 million records,
    > and it won't all fit into memory. So your observation is pertinent.


    Assuming the working set of unique items will still fit within
    memory, it can be done with the following regardless of the
    input-file's size:

    def deduplicator(iterable):
        seen = set()
        for item in iterable:
            if item not in seen:
                seen.add(item)
                yield item

    s = [7,6,5,4,3,6,9,5,4,3,2,5,4,3,2,1]
    print list(deduplicator(s))

    for line in deduplicator(file('huge_test.txt')):
        print line


    It maintains order, emitting only new items as they're encountered.

    -tkc
     
    Tim Chase, Feb 27, 2009
    #10
  11. Tim Rowe

    Tim Rowe Guest

    2009/2/27 Steve Holden <>:

    > Assuming no duplicates, how does this help? You still have to verify
    > collisions.
    >
    >> Pretty brutish and slow, but it's the first algorithm which comes to
    >> mind. Of course, I'm assuming that the list items are long enough to
    >> warrant using a hash and not the values themselves.
    >>

    > The problem is you can't guarantee any hash value will be unique, so
    > ultimately you have to check whether list entries hashing to the same
    > value are or are not equal. Which would mean either having the whole
    > list in memory or having access to it via some other random access method.


    You don't need random access to the whole list, just to the records
    that collide on a hash. Although you do end up using the external
    storage as random access, which can be very slow, you've possibly
    drastically reduced the number of such random accesses, which should
    be a big saving. The best case is if the number of collisions is low
    enough for the random access to be dealt with in buffering, avoiding
    wasteful seeks of the external storage, in which case it's a complete
    win. Remember that if the identical hashes are not collisions but
    genuine duplicates you can throw them away as soon as they're checked,
    so with some clever buffering you can stop them from clogging up the
    buffer. The worst case is if there are a lot of genuine collisions, in
    which case it's probably not a very good hash.

    --
    Tim Rowe
     
    Tim Rowe, Feb 27, 2009
    #11
  12. Falcolas

    Falcolas Guest

    On Feb 27, 10:07 am, Steve Holden <> wrote:
    > Assuming no duplicates, how does this help? You still have to verify
    > collisions.


    Absolutely. But a decent hashing function (particularly since you know
    quite a bit about the data beforehand) will have very few collisions
    (theoretically no collisions, again since we know the data
    beforehand).

    > The problem is you can't guarantee any hash value will be unique, so
    > ultimately you have to check whether list entries hashing to the same
    > value are or are not equal. Which would mean either having the whole
    > list in memory or having access to it via some other random access method.


    Again, absolutely. But as Tim pointed out, with a decent hash the
    number of superfluous checks should be minimal, so your checks will
    mostly occur on valid duplicate values.

    It's worth noting that I picked this algorithm with the assumption
    that any storage to physical memory (such as required by using a set)
    would be out of the question.

    ~G
     
    Falcolas, Feb 27, 2009
    #12
  13. Steve Holden

    Steve Holden Guest

    Falcolas wrote:
    > On Feb 27, 10:07 am, Steve Holden <> wrote:
    >> Assuming no duplicates, how does this help? You still have to verify
    >> collisions.

    >
    > Absolutely. But a decent hashing function (particularly since you know
    > quite a bit about the data beforehand) will have very few collisions
    > (theoretically no collisions, again since we know the data
    > beforehand).
    >
    >> The problem is you can't guarantee any hash value will be unique, so
    >> ultimately you have to check whether list entries hashing to the same
    >> value are or are not equal. Which would mean either having the whole
    >> list in memory or having access to it via some other random access method.

    >
    > Again, absolutely. But as Tim pointed out, with a decent hash the
    > number of superfluous checks should be minimal, so your checks will
    > mostly occur on valid duplicate values.
    >
    > It's worth noting that I picked this algorithm with the assumption
    > that any storage to physical memory (such as required by using a set)
    > would be out of the question.
    >

    I take your point about the relatively rare collisions. Devising a
    zero-collision algorithm is far from trivial, since at the minimum it
    requires storing all hashes in a set to verify each new hash doesn't
    collide with previous ones.

    But unless you can *guarantee* zero collisions you do still need a
    mapping from hash values to the original data, since this is the only
    way to determine whether collisions represent a duplicate value or not.
    That's not a trivial detail for fifteen million values, though it
    depends some on the format of the OP's "records". I suppose a list of
    file positions stored with each hash might be usable assuming either
    fixed-length or properly delimited records.
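
    A sketch of that bookkeeping, under those caveats (newline-delimited
    records, made-up file names, and crc32 standing in for whatever hash is
    actually used): each hash maps to the offsets of the lines that produced
    it, and a line only counts as a duplicate after re-reading and comparing
    the actual text.

    import zlib

    def is_duplicate(f, offsets, line):
        # Re-read every earlier line with the same hash and compare for real.
        here = f.tell()
        try:
            for off in offsets:
                f.seek(off)
                if f.readline() == line:
                    return True
            return False
        finally:
            f.seek(here)                   # restore the read position

    seen = {}                              # hash value -> list of byte offsets
    f = open('records.txt', 'rb')
    out = open('unique_records.txt', 'wb')
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        h = zlib.crc32(line)
        if h in seen and is_duplicate(f, seen[h], line):
            continue                       # genuine duplicate: drop it
        seen.setdefault(h, []).append(offset)
        out.write(line)
    out.close()
    f.close()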

    regards
    Steve
    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    Holden Web LLC http://www.holdenweb.com/
     
    Steve Holden, Feb 27, 2009
    #13
  14. Paul Rubin

    Paul Rubin Guest

    Tim Rowe <> writes:
    > We were told in the original question: more than 15 million records,
    > and it won't all fit into memory. So your observation is pertinent.


    That is not terribly many records by today's standards. The knee-jerk
    approach is to sort them externally, then make a linear pass skipping
    the duplicates. Is the exercise to write an external sort in Python?
    It's worth doing if you've never done it before.
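
    One way to sketch it (chunk size, temp-file handling and file names are
    arbitrary choices here, with heapq.merge doing the merge phase):

    import heapq, itertools, os, tempfile

    def sorted_chunks(infile, chunk_size=1000000):
        # Write sorted runs of at most chunk_size lines to temporary files.
        paths = []
        while True:
            lines = list(itertools.islice(infile, chunk_size))
            if not lines:
                break
            lines.sort()
            fd, path = tempfile.mkstemp()
            with os.fdopen(fd, 'w') as tmp:
                tmp.writelines(lines)
            paths.append(path)
        return paths

    def external_dedup(in_path, out_path):
        with open(in_path) as infile:
            paths = sorted_chunks(infile)
        chunks = [open(p) for p in paths]
        last = None
        with open(out_path, 'w') as out:
            for line in heapq.merge(*chunks):   # globally sorted stream,
                if line != last:                # so duplicates are adjacent
                    out.write(line)
                    last = line
        for c in chunks:
            c.close()
        for p in paths:
            os.remove(p)

    external_dedup('huge_input.txt', 'deduped_output.txt')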
     
    Paul Rubin, Feb 27, 2009
    #14
  15. Adam Olsen

    Adam Olsen Guest

    On Feb 27, 9:55 am, Falcolas <> wrote:
    > If order did matter, and the list itself couldn't be stored in memory,
    > I would personally do some sort of hash of each item (or something as
    > simple as first 5 bytes, last 5 bytes and length), keeping a reference
    > to which item the hash belongs, sort and identify duplicates in the
    > hash, and using the reference check to see if the actual items in
    > question match as well.
    >
    > Pretty brutish and slow, but it's the first algorithm which comes to
    > mind. Of course, I'm assuming that the list items are long enough to
    > warrant using a hash and not the values themselves.


    Might as well move all the duplication checking to sqlite.

    Although it seems tempting to stick a layer in front, you will always
    require either a full comparison or a full update, so there's no
    potential for a fast path.
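
    A minimal sketch of that, with made-up file names and a one-column
    table; a UNIQUE constraint plus INSERT OR IGNORE makes sqlite do the
    duplicate checking on disk:

    import sqlite3

    conn = sqlite3.connect('records.db')
    conn.execute('CREATE TABLE IF NOT EXISTS records (line TEXT UNIQUE)')

    with open('records.txt') as f:
        with conn:                         # one transaction for the whole load
            conn.executemany('INSERT OR IGNORE INTO records (line) VALUES (?)',
                             ((line,) for line in f))

    # ORDER BY rowid roughly preserves first-seen order.
    for (line,) in conn.execute('SELECT line FROM records ORDER BY rowid'):
        print(line.rstrip('\n'))
    conn.close()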
     
    Adam Olsen, Mar 3, 2009
    #15
