[Q] removing array duplicates where a subset is unique

Discussion in 'Ruby' started by Chuck Remes, Jul 17, 2009.

  1. Chuck Remes

    Chuck Remes Guest

    I need to remove duplicates from an array of arrays. I can't use
    Array#uniq because some fields are different and not part of the
    "key." Here's an example where the first 3 elements of each sub array
    are the "key" and determine uniqueness. I want to keep only the first
    one I get.
    => [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

    The return value of deduplicating this array should be: [[1, 2, 3, 4,
    5]]

    Here is my first attempt at solving the problem:

    ?> dupes = ary.select { |row| row[0..2] == line[0..2] }
    ?> dupes.first?> dedup a
    => [[1, 2, 3, 4, 5]]

    This works. However, it is *super slow* when operating on my dataset.
    My arrays contain hundreds of thousands of sub arrays. The unique key
    for each sub array is the first 12 (of 18) elements. It is taking many
    seconds to produce each intermediate array ("dupes" in the example
    above), so deduping the entire thing would likely take days.

    Anyone have a superior and faster solution?

    cr
     
    Chuck Remes, Jul 17, 2009
    #1
    1. Advertisements

  2. Might be faster if your intermediate is a hash, so instead of N**2 time
    it's N.

    a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]

    h = {}
    a.each do |row|
    h[row[0..2]] ||= row # record the first match
    end

    p h.values # ==> [[1, 2, 3, 4, 5]]


    Note that the output may come out in a different order. Does that matter?
     
    Joel VanderWerf, Jul 17, 2009
    #2
    1. Advertisements

  3. Hi --

    See if this speeds it up meaningfully (and make sure I've got the
    logic right):

    def dedup(ary)
    uniq = {}
    ary.each do |line|
    uniq[ary[0..2]] ||= line
    end
    uniq.values
    end


    David

    --
    David A. Black / Ruby Power and Light, LLC
    Ruby/Rails consulting & training: http://www.rubypal.com
    Now available: The Well-Grounded Rubyist (http://manning.com/black2)
    Training! Intro to Ruby, with Black & Kastner, September 14-17
    (More info: http://rubyurl.com/vmzN)
     
    David A. Black, Jul 17, 2009
    #3
  4. Chuck Remes

    Bil Kleb Guest

    Sweet! (I love Ruby; thanks Matz.)

    Regards,
     
    Bil Kleb, Jul 17, 2009
    #4
  5. Chuck Remes

    Chuck Remes Guest

    David and Joel,

    you both provided the same solution. I will test this to see what kind
    of performance I get. It will be hell on memory, but I assumed any
    solution likely would be. (And Joel, I have presorted the array prior
    to removing the dupes so I have already taken care of the ordering
    issue.)

    Thanks for your suggestions.

    cr
     
    Chuck Remes, Jul 18, 2009
    #5
  6. Chuck Remes

    brabuhr Guest

    I think what Joel was referring to was that in Ruby 1.8 a Hash doesn't
    maintain insertion order when traversed (Ruby 1.9 does maintain
    insertion order):

    ruby 1.8.2 (2004-12-25) [powerpc-darwin8.0]:
    irb(main):001:0> h = {}
    => {}
    irb(main):002:0> 5.times{|n| h[n] = n}
    => 5
    irb(main):003:0> h
    => {0=>0, 1=>1, 2=>2, 3=>3, 4=>4}
    irb(main):004:0> h["sadf"] = 3
    => 3
    irb(main):005:0> h
    => {0=>0, 1=>1, "sadf"=>3, 2=>2, 3=>3, 4=>4}
     
    brabuhr, Jul 18, 2009
    #6
  7. Hi --

    I believe the version you had originally, where you do a mapping of
    the whole array, will typically use much more memory than the hash
    version. Let's say your original array has 1000 inner arrays, with 10
    that are considered unique. The mapping will be a new array, also of
    1000 elements. The hash will have 10 key/value pairs -- thus much
    smaller.


    David

    --
    David A. Black / Ruby Power and Light, LLC
    Ruby/Rails consulting & training: http://www.rubypal.com
    Now available: The Well-Grounded Rubyist (http://manning.com/black2)
    Training! Intro to Ruby, with Black & Kastner, September 14-17
    (More info: http://rubyurl.com/vmzN)
     
    David A. Black, Jul 18, 2009
    #7
  8. Hi --

    If you wanted to maintain order you could (for some but probably not
    much performance penalty) do something like:

    def dedup(ary)
    uniq = {}
    res = []
    ary.each do |line|
    key = line[0..2]
    next if uniq[key]
    res << (uniq[key] = line)
    end
    res
    end


    David

    --
    David A. Black / Ruby Power and Light, LLC
    Ruby/Rails consulting & training: http://www.rubypal.com
    Now available: The Well-Grounded Rubyist (http://manning.com/black2)
    Training! Intro to Ruby, with Black & Kastner, September 14-17
    (More info: http://rubyurl.com/vmzN)
     
    David A. Black, Jul 18, 2009
    #8
  9. Chuck Remes

    Chuck Remes Guest

    Oh yes, my version had terrible execution performance and memory
    performance. I was trying to figure out how to use a hash but did not
    make the leap to the ||= construction on my own. I knew I was missing
    something obvious... all of your rapid responses proved it.

    FYI, the dedup code you provided performs quite admirably. I'll take a
    look at its memory footprint when I get in the office Monday and
    report back.

    cr
     
    Chuck Remes, Jul 18, 2009
    #9
  10. Chuck Remes

    s.ross Guest

    Hi---

    I missed the beginning of this thread, but here is an implementation
    I've used successfully:

    def uniq_by(subject, &block)
    h = {}
    a = []
    subject.each do |s|
    comparator = yield(s)
    unless h[comparator]
    a.push(s)
    h[comparator] = s
    end
    end
    a
    end

    Usage:

    u = uniq_by(ary|{ |item| item.element }

    Basically, what this allows you to do is specify what exactly about an
    array item must be unique. It also preserves the original array order,
    with a "first entry wins" approach to duplicate elimination.

    Hope this is useful.
     
    s.ross, Jul 19, 2009
    #10
  11. Chuck Remes

    7stud -- Guest

    That's well and good, but in the process of using a hash to remove the
    duplicates, the result will be out of order. See example below.

    A simple if statement will achieve the same result:

    a = [
    [1, 2, 3, 10],
    [1, 2, 3, 20],
    [1, 2, 3, 30],
    [2, 2, 3, 40]
    ]

    h = {}

    a.each do |suba|
    key = suba.slice(0, 3)

    if h[key]
    next
    else
    h[key] = suba
    end

    end

    p h.values

    --output:--
    [[2, 2, 3, 40], [1, 2, 3, 10]]
     
    7stud --, Jul 19, 2009
    #11
  12. Hi --

    I'm not sure what that buys you, though. The ||= idiom should work
    fine, unless you need to do something extra during the if statement.


    David

    --
    David A. Black / Ruby Power and Light, LLC
    Ruby/Rails consulting & training: http://www.rubypal.com
    Now available: The Well-Grounded Rubyist (http://manning.com/black2)
    Training! Intro to Ruby, with Black & Kastner, September 14-17
    (More info: http://rubyurl.com/vmzN)
     
    David A. Black, Jul 19, 2009
    #12
  13. Facets has enumerable#uniq_by implemented like this:

    def uniq_by #:yield:
    h = {}; inject([]) {|a,x| h[yield(x)] ||= a << x}
    end

    So you can do:

    require 'facets'
    a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]
    res = a.uniq_by{|el| el[0..2]}

    Don't know about performance.

    hth,

    Siep
     
    Siep Korteling, Jul 19, 2009
    #13
  14. Chuck Remes

    7stud -- Guest

    Simplicity of comprehension over brevity.
    Of course.
     
    7stud --, Jul 19, 2009
    #14
  15. Hi --

    I guess I like the ||= idiom because it has both. But the if statement
    will definitely work, of course.


    David

    --
    David A. Black / Ruby Power and Light, LLC
    Ruby/Rails consulting & training: http://www.rubypal.com
    Now available: The Well-Grounded Rubyist (http://manning.com/black2)
    Training! Intro to Ruby, with Black & Kastner, September 14-17
    (More info: http://rubyurl.com/vmzN)
     
    David A. Black, Jul 19, 2009
    #15
  16. Chuck Remes

    7stud -- Guest


    Just remember this:

    inject() = shite
     
    7stud --, Jul 19, 2009
    #16
  17. Chuck Remes

    7stud -- Guest

    For you, yes. But apparently not for the op:
    ..until it doesn't (I think you know what I'm refering to).
     
    7stud --, Jul 19, 2009
    #17
  18. Hi --

    I wouldn't recommend freezing one's knowledge or leap abilities at a
    particular point, though :) I'm certainly in sympathy with being
    suspicious of punctuation-heavy stuff, but in the case of ||= it's
    such a common idiom, and so easy to learn, that it seems like a bit of
    an artificial hardship not to learn it. Still, if will work too :)
    The hash default thing? I don't think that comes into play here, does
    it?


    David

    --
    David A. Black / Ruby Power and Light, LLC
    Ruby/Rails consulting & training: http://www.rubypal.com
    Now available: The Well-Grounded Rubyist (http://manning.com/black2)
    Training! Intro to Ruby, with Black & Kastner, September 14-17
    (More info: http://rubyurl.com/vmzN)
     
    David A. Black, Jul 19, 2009
    #18
  19. Chuck Remes

    7stud -- Guest

    No, but whenever I see ||= now, I get scared. :(
     
    7stud --, Jul 19, 2009
    #19
  20. I do to! It looks like a very angry gnome.
     
    Joel VanderWerf, Jul 20, 2009
    #20
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.