Merging ordered lists

etal

Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?

My code:

def merge_to_unique(sources):
    """Merge the unique elements from each list in sources into a new list.

    Using the longest input list as a reference, merges in the elements
    from each of the smaller or equal-length lists, and removes duplicates.

    @return: Combined list of elements.
    """
    sources.sort(None, len, True)  # Descending length
    ref = sources[0]
    for src in sources[1:]:
        for i, s in enumerate(src):
            if s and (ref != s) and s not in ref:
                ref.insert(ref.index(src[i-1])+1, s)
    # Remove duplicates
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]


This comes up when using the CSV module's DictWriter class to merge a
set (list, here) of not-quite-perfect CSV sources. The DictWriter
constructor needs a list of field names so that it can convert
dictionaries into rows of the CSV file it writes. Some of the input
CSV files are missing columns, some might have extras -- all of this
should be accepted, and the order of the columns in the merged file
should match the order of the input files as much as possible (not
alphabetical). All of the list elements are strings, in this case, but
it would be nice if the function didn't require it.

Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?
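
For concreteness, here's roughly how the merged field list would feed
DictWriter (a sketch only; the helper name and file handling are made up
for illustration):

import csv

def combine_csv(in_paths, out_path):
    # Collect each file's header row (first line) -- hypothetical paths.
    headers = []
    for path in in_paths:
        headers.append(csv.reader(open(path, "rb")).next())
    # Merge the headers into one ordered, duplicate-free field list.
    fieldnames = merge_to_unique(headers)
    writer = csv.DictWriter(open(out_path, "wb"), fieldnames)
    writer.writerow(dict(zip(fieldnames, fieldnames)))  # header row
    # Copy the rows; DictWriter fills missing columns with '' (restval).
    for path in in_paths:
        for row in csv.DictReader(open(path, "rb")):
            writer.writerow(row)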
 
Raymond Hettinger

Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?

I would do it in two steps. There are a number of ways to merge, depending
on whether everything is pulled into memory or not:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/491285
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305269

After merging, the groupby itertool is good for removing duplicates:

result = [k for k, g in groupby(imerge(*sources))]
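
A concrete sketch of that pipeline, substituting heapq.merge (Python 2.6+)
for the Cookbook imerge() recipe -- both steps assume each input list is
itself already sorted:

from heapq import merge as imerge
from itertools import groupby

def merge_sorted_unique(sources):
    # Merge the individually sorted inputs, then collapse runs of equal values.
    return [k for k, g in groupby(imerge(*sources))]

print merge_sorted_unique([[1, 2, 4, 9], [1, 3, 4], [2, 4, 10]])
# -> [1, 2, 3, 4, 9, 10]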


Raymond
 
Peter Otten

etal said:
Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?

My code:

def merge_to_unique(sources):
    """Merge the unique elements from each list in sources into a new list.

    Using the longest input list as a reference, merges in the elements
    from each of the smaller or equal-length lists, and removes duplicates.

    @return: Combined list of elements.
    """
    sources.sort(None, len, True)  # Descending length
    ref = sources[0]
    for src in sources[1:]:
        for i, s in enumerate(src):
            if s and (ref != s) and s not in ref:
                ref.insert(ref.index(src[i-1])+1, s)
    # Remove duplicates
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]


This comes up when using the CSV module's DictWriter class to merge a
set (list, here) of not-quite-perfect CSV sources. The DictWriter
constructor needs a list of field names so that it can convert
dictionaries into rows of the CSV file it writes. Some of the input
CSV files are missing columns, some might have extras -- all of this
should be accepted, and the order of the columns in the merged file
should match the order of the input files as much as possible (not
alphabetical). All of the list elements are strings, in this case, but
it would be nice if the function didn't require it.

Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?


#untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        else:
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return reduce(merge, sorted(sources, key=len, reverse=True))

Peter
 
Méta-MCI (MVP)

Hi!

Use set (union).
Example:

la=[2,1,3,5,4,6]
lb=[2,8,6,4,12]

#compact:
print list(set(la).union(set(lb)))

#detail:
s1 = set(la)
s2 = set(lb)
s3 = s1.union(s2)
print list(s3)


@-salutations

Michel Claveau
 
Peter Otten

Peter said:
#untested

Already found two major blunders :(

# still untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        elif op == "replace":
            yield a[a1:a2]
            yield b[b1:b2]
        else:  # delete, equal
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return unique(reduce(merge, sorted(sources, key=len, reverse=True)))

def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result
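
A quick usage sketch with made-up field names; when the inputs mostly
agree, the difflib opcodes splice each extra column in next to its
neighbours:

sources = [["id", "name", "email", "notes"],
           ["id", "name", "phone", "notes"]]
print merge_to_unique(sources)
# -> ['id', 'name', 'email', 'phone', 'notes']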
 
Taekyon

etal said:
Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?

Looks as if set does it for you.
 
etal

Peter said:
#untested

Already found two major blunders :(

# still untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        elif op == "replace":
            yield a[a1:a2]
            yield b[b1:b2]
        else: # delete, equal
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return unique(reduce(merge, sorted(sources, key=len, reverse=True)))

difflib.SequenceMatcher looks promising; I'll try it. Thanks!

def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result

You did right by preserving the original (non-alphabetical) ordering,
but I'm less enthusiastic about the shape of this function. My
original function used 7 lines of code, and only 1 for the unique()
step. This uses up to three container objects. Is it really an
improvement?

(Secret: the reference list (or, any of the sources) is unlikely to be
more than a few dozen elements long. The data set that puts
merge_to_unique through a workout will be a giant list of
comparatively short lists, so the unique() part just needs to be short
and conceptually clean, while merge() should attempt sane behavior for
large len(sources).)
 
etal

I would do it in two steps. There are a number of ways to merge, depending
on whether everything is pulled into memory or not:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/491285
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305269

After merging, the groupby itertool is good for removing duplicates:

   result = [k for k, g in groupby(imerge(*sources))]

Raymond

Thanks for the tip; itertools never ceases to amaze. One issue:
groupby doesn't seem to remove all duplicates, just consecutive ones
(for lists of strings and integers, at least):
>>> [k for k, g in itertools.groupby(list("asdfdfffdf"))]
['a', 's', 'd', 'f', 'd', 'f', 'd', 'f']


Another issue: dropping everything into a heap and pulling it back out
looks like it loses the original ordering, which isn't necessarily
alphabetical, but "however the user wants to organize the
spreadsheet". That's why I originally avoided using
sorted(set(itertools.chain(*sources))). Do you see another way around
this?
 
Raymond Hettinger

Thanks for the tip; itertools never ceases to amaze. One issue:
groupby doesn't seem to remove all duplicates, just consecutive ones
(for lists of strings and integers, at least):
>>> [k for k, g in itertools.groupby(list("asdfdfffdf"))]
['a', 's', 'd', 'f', 'd', 'f', 'd', 'f']

That's why the example ran imerge() first. That takes several sorted
inputs and gives a single sorted output. Then, groupby() eliminates
the dups from the sorted input where the equal valued entries have
been brought together.

Another issue: dropping everything into a heap and pulling it back out
looks like it loses the original ordering, which isn't necessarily
alphabetical, but "however the user wants to organize the
spreadsheet". That's why I originally avoided using
sorted(set(itertools.chain(*sources))). Do you see another way around
this?

If your inputs were already sorted, then the above algorithms maintain
that order and eliminate duplicates.

If the inputs were not sorted, then I don't think you have a precise
idea of what it means to merge them while preserving order. For
example if the inputs are XYZPDQ and bYlmPz, then what does a merged
sequence look like once the Y and P duplicates are removed? Is it
XZPDQblmz or some intermix of the two like XbZlPmDzQ? If it is the
first of those, then the answer is simple:

seen = set()
for elem in chain(*sources):
    if elem in seen:
        continue
    yield elem
    seen.add(elem)
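
Wrapped up as a self-contained function (a sketch; the wrapper and its name
are mine), showing the behaviour on the example above -- duplicates keep
their first position:

from itertools import chain

def merge_to_unique(sources):
    # Keep the first occurrence of each element, in chained order.
    seen = set()
    result = []
    for elem in chain(*sources):
        if elem not in seen:
            result.append(elem)
            seen.add(elem)
    return result

print merge_to_unique([list("XYZPDQ"), list("bYlmPz")])
# -> ['X', 'Y', 'Z', 'P', 'D', 'Q', 'b', 'l', 'm', 'z']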

Raymond
 
Peter Otten

etal said:
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result

You did right by preserving the original (non-alphabetical) ordering,
but I'm less enthusiastic about the shape of this function. My
original function used 7 lines of code, and only 1 for the unique()
step. This uses up to three container objects. Is it really an
improvement?

Yes :)

Seriously, you are using O(n) containers and O(n) lookup where mine uses
O(1). For short lists it doesn't matter, but as the list length grows the
difference gets huge:

$ cat unique.py
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result


def unique_lc(ref):
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]


sample_a = range(1000)
sample_b = range(1000)
import random
for i in random.sample(sample_b, 10):
    sample_b[i] = 0

$ python -m timeit -s"from unique import unique as u, sample_a as s" "u(s)"
10000 loops, best of 3: 52.8 usec per loop
$ python -m timeit -s"from unique import unique_lc as u, sample_a as s" "u(s)"
10 loops, best of 3: 44.2 msec per loop

3 orders of magnitude for the shortcut.

$ python -m timeit -s"from unique import unique as u, sample_b as s" "u(s)"
1000 loops, best of 3: 646 usec per loop
$ python -m timeit -s"from unique import unique_lc as u, sample_b as s" "u(s)"
10 loops, best of 3: 39 msec per loop

Still 2 orders of magnitude if my unique() does some actual work.

Peter
 
etal

If the inputs were not sorted, then I don't think you have a precise
idea of what it means to merge them while preserving order.   For
example if the inputs are XYZPDQ and bYlmPz, then what does a merged
sequence look like once the Y and P duplicates are removed? Is it
XZPDQblmz or some intermix of the two like XbZlPmDzQ?  If it is the
first of those, then the answer is simple:

I was looking at Bram Cohen's description of his diff algorithm,
implemented in Bazaar:
http://bramcohen.livejournal.com/37690.html
"Instead of doing a longest common subsequence on everything, it does
a longest common subsequence on lines which occur exactly once on both
sides, then recurses between lines which got matched on that pass."

So, if sources looks like:
[list("XYZPDQ"), list("bYlmPz")]

Then the proper merge would use Y and P as the delimiters, putting X
and b before Y; Z, l and m before P; and D, Q and z after P -- like
your second example (but keeping an instance of Y). Right? Leaving us
with the context-dependent issue of ordering between the delimiters,
probably depending on which list is used as the reference initially.
So, if the reference list's elements go first, the result would be
XbYZlmPDQz. If the elements from later sources go after the last
common element (my first approach), it's bXYlmZPzDQ.

However, in the cold light of morning it looks like that was the wrong
approach -- it makes sense to treat any data that doesn't occur in the
reference list as semi-obsolete and append it instead -- so I think
I'll use your much simpler algorithm. Thanks!
 
etal

Yes :)

Seriously, you are using O(n) containers and O(n) lookup where mine uses
O(1). For short lists it doesn't matter, but as the list length grows the
difference gets huge:

$ cat unique.py
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result

Yikes... and the list comprehension looked so innocent. I had resigned
myself to O(n) lookup, but you and Raymond appear to agree on the
basic concept for unique() -- check set membership, then generate the
final sequence and update the set based on that.
 
