Merging ordered lists

etal

Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?

My code:

def merge_to_unique(sources):
    """Merge the unique elements from each list in sources into a new list.

    Using the longest input list as a reference, merges in the elements
    from each of the smaller or equal-length lists, and removes duplicates.

    @return: Combined list of elements.
    """
    sources.sort(None, len, True)  # Descending length
    ref = sources[0]
    for src in sources[1:]:
        for i, s in enumerate(src):
            if s and (ref != s) and s not in ref:
                ref.insert(ref.index(src[i-1])+1, s)
    # Remove duplicates
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]


This comes up when using the CSV module's DictWriter class to merge a
set (list, here) of not-quite-perfect CSV sources. The DictWriter
constructor needs a list of field names so that it can convert
dictionaries into rows of the CSV file it writes. Some of the input
CSV files are missing columns, some might have extras -- all of this
should be accepted, and the order of the columns in the merged file
should match the order of the input files as much as possible (not
alphabetical). All of the list elements are strings, in this case, but
it would be nice if the function didn't require it.

Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?
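
For concreteness, here's roughly how the merged field list would feed
DictWriter (a sketch only; the helper name and file handling are made up
for illustration):

import csv

def combine_csv(in_paths, out_path):
    # Collect each file's header row (first line) -- hypothetical paths.
    headers = []
    for path in in_paths:
        headers.append(csv.reader(open(path, "rb")).next())
    # Merge the headers into one ordered, duplicate-free field list.
    fieldnames = merge_to_unique(headers)
    writer = csv.DictWriter(open(out_path, "wb"), fieldnames)
    writer.writerow(dict(zip(fieldnames, fieldnames)))  # header row
    # Copy the rows; DictWriter fills missing columns with '' (restval).
    for path in in_paths:
        for row in csv.DictReader(open(path, "rb")):
            writer.writerow(row)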
 
Raymond Hettinger

Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?

I would do it in two steps. There are a number of ways to merge, depending
on whether everything is pulled into memory or not:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/491285
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305269

After merging, the groupby itertool is good for removing duplicates:

result = [k for k, g in groupby(imerge(*sources))]
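
A concrete sketch of that pipeline, substituting heapq.merge (Python 2.6+)
for the Cookbook imerge() recipe -- both steps assume each input list is
itself already sorted:

from heapq import merge as imerge
from itertools import groupby

def merge_sorted_unique(sources):
    # Merge the individually sorted inputs, then collapse runs of equal values.
    return [k for k, g in groupby(imerge(*sources))]

print merge_sorted_unique([[1, 2, 4, 9], [1, 3, 4], [2, 4, 10]])
# -> [1, 2, 3, 4, 9, 10]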


Raymond
 
Peter Otten

etal said:
Here's an algorithm question: How should I efficiently merge a
collection of mostly similar lists, with different lengths and
arbitrary contents, while eliminating duplicates and preserving order
as much as possible?

My code:

def merge_to_unique(sources):
    """Merge the unique elements from each list in sources into a new list.

    Using the longest input list as a reference, merges in the elements
    from each of the smaller or equal-length lists, and removes duplicates.

    @return: Combined list of elements.
    """
    sources.sort(None, len, True)  # Descending length
    ref = sources[0]
    for src in sources[1:]:
        for i, s in enumerate(src):
            if s and (ref != s) and s not in ref:
                ref.insert(ref.index(src[i-1])+1, s)
    # Remove duplicates
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]


This comes up when using the CSV module's DictWriter class to merge a
set (list, here) of not-quite-perfect CSV sources. The DictWriter
constructor needs a list of field names so that it can convert
dictionaries into rows of the CSV file it writes. Some of the input
CSV files are missing columns, some might have extras -- all of this
should be accepted, and the order of the columns in the merged file
should match the order of the input files as much as possible (not
alphabetical). All of the list elements are strings, in this case, but
it would be nice if the function didn't require it.

Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?


#untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        else:
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return reduce(merge, sorted(sources, key=len, reverse=True))

Peter
 
Méta-MCI (MVP)

Hi!

Use set (union).
Example:

la=[2,1,3,5,4,6]
lb=[2,8,6,4,12]

#compact:
print list(set(la).union(set(lb)))

#detail:
s1 = set(la)
s2 = set(lb)
s3 = s1.union(s2)
print list(s3)


@-salutations

Michel Claveau
 
Peter Otten

Peter said:
#untested

Already found two major blunders :(

# still untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        elif op == "replace":
            yield a[a1:a2]
            yield b[b1:b2]
        else:  # delete, equal
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return unique(reduce(merge, sorted(sources, key=len, reverse=True)))

def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result
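
A quick usage sketch with made-up field names; when the inputs mostly
agree, the difflib opcodes splice each extra column in next to its
neighbours:

sources = [["id", "name", "email", "notes"],
           ["id", "name", "phone", "notes"]]
print merge_to_unique(sources)
# -> ['id', 'name', 'email', 'phone', 'notes']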
 
Taekyon

etal said:
Speed actually isn't a problem yet; it might matter some day, but for
now it's just an issue of conceptual aesthetics. Any suggestions?

Looks as if set does it for you.
 
etal

Peter said:
#untested

Already found two major blunders :(

# still untested
import difflib

def _merge(a, b):
    sm = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        if op == "insert":
            yield b[b1:b2]
        elif op == "replace":
            yield a[a1:a2]
            yield b[b1:b2]
        else: # delete, equal
            yield a[a1:a2]

def merge(a, b):
    return sum(_merge(a, b), [])

def merge_to_unique(sources):
    return unique(reduce(merge, sorted(sources, key=len, reverse=True)))

difflib.SequenceMatcher looks promising; I'll try it. Thanks!

def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result

You did right by preserving the original (non-alphabetical) ordering,
but I'm less enthusiastic about the shape of this function. My
original function used 7 lines of code, and only 1 for the unique()
step. This uses up to three container objects. Is it really an
improvement?

(Secret: the reference list (or, any of the sources) is unlikely to be
more than a few dozen elements long. The data set that puts
merge_to_unique through a workout will be a giant list of
comparatively short lists, so the unique() part just needs to be short
and conceptually clean, while merge() should attempt sane behavior for
large len(sources).)
 
etal

I would do it in two steps. There are a number of ways to merge, depending
on whether everything is pulled into memory or not:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/491285
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305269

After merging, the groupby itertool is good for removing duplicates:

   result = [k for k, g in groupby(imerge(*sources))]

Raymond

Thanks for the tip; itertools never ceases to amaze. One issue:
groupby doesn't seem to remove all duplicates, just consecutive ones
(for lists of strings and integers, at least):
>>> [k for k, g in itertools.groupby(list("asdfdfffdf"))]
['a', 's', 'd', 'f', 'd', 'f', 'd', 'f']


Another issue: dropping everything into a heap and pulling it back out
looks like it loses the original ordering, which isn't necessarily
alphabetical, but "however the user wants to organize the
spreadsheet". That's why I originally avoided using
sorted(set(itertools.chain(*sources))). Do you see another way around
this?
 
Raymond Hettinger

Thanks for the tip; itertools never ceases to amaze. One issue:
groupby doesn't seem to remove all duplicates, just consecutive ones
(for lists of strings and integers, at least):
>>> [k for k, g in itertools.groupby(list("asdfdfffdf"))]
['a', 's', 'd', 'f', 'd', 'f', 'd', 'f']

That's why the example ran imerge() first. That takes several sorted
inputs and gives a single sorted output. Then, groupby() eliminates
the dups from the sorted input where the equal valued entries have
been brought together.

Another issue: dropping everything into a heap and pulling it back out
looks like it loses the original ordering, which isn't necessarily
alphabetical, but "however the user wants to organize the
spreadsheet". That's why I originally avoided using
sorted(set(itertools.chain(*sources))). Do you see another way around
this?

If your inputs were already sorted, then the above algorithms maintain
that order and eliminate duplicates.

If the inputs were not sorted, then I don't think you have a precise
idea of what it means to merge them while preserving order. For
example if the inputs are XYZPDQ and bYlmPz, then what does a merged
sequence look like once the Y and P duplicates are removed? Is it
XZPDQblmz or some intermix of the two like XbZlPmDzQ? If it is the
first of those, then the answer is simple:

seen = set()
for elem in chain(*sources):
    if elem in seen:
        continue
    yield elem
    seen.add(elem)
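
Wrapped up as a self-contained function (a sketch; the wrapper and its name
are mine), showing the behaviour on the example above -- duplicates keep
their first position:

from itertools import chain

def merge_to_unique(sources):
    # Keep the first occurrence of each element, in chained order.
    seen = set()
    result = []
    for elem in chain(*sources):
        if elem not in seen:
            result.append(elem)
            seen.add(elem)
    return result

print merge_to_unique([list("XYZPDQ"), list("bYlmPz")])
# -> ['X', 'Y', 'Z', 'P', 'D', 'Q', 'b', 'l', 'm', 'z']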

Raymond
 
Peter Otten

etal said:
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result

You did right by preserving the original (non-alphabetical) ordering,
but I'm less enthusiastic about the shape of this function. My
original function used 7 lines of code, and only 1 for the unique()
step. This uses up to three container objects. Is it really an
improvement?

Yes :)

Seriously, you are using O(n) containers and O(n) lookup where mine uses
O(1). For short lists it doesn't matter, but as the list length grows the
difference gets huge:

$ cat unique.py
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result


def unique_lc(ref):
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]


sample_a = range(1000)
sample_b = range(1000)
import random
for i in random.sample(sample_b, 10):
    sample_b[i] = 0

$ python -m timeit -s"from unique import unique as u, sample_a as s" "u(s)"
10000 loops, best of 3: 52.8 usec per loop
$ python -m timeit -s"from unique import unique_lc as u, sample_a as s" "u(s)"
10 loops, best of 3: 44.2 msec per loop

3 orders of magnitude for the shortcut.

$ python -m timeit -s"from unique import unique as u, sample_b as s" "u(s)"
1000 loops, best of 3: 646 usec per loop
$ python -m timeit -s"from unique import unique_lc as u, sample_b as s" "u(s)"
10 loops, best of 3: 39 msec per loop

Still 2 orders of magnitude if my unique() does some actual work.

Peter
 
etal

If the inputs were not sorted, then I don't think you have a precise
idea of what it means to merge them while preserving order.   For
example if the inputs are XYZPDQ and bYlmPz, then what does a merged
sequence look like once the Y and P duplicates are removed? Is it
XZPDQblmz or some intermix of the two like XbZlPmDzQ?  If it is the
first of those, then the answer is simple:

I was looking at Bram Cohen's description of his diff algorithm,
implemented in Bazaar:
http://bramcohen.livejournal.com/37690.html
"Instead of doing a longest common subsequence on everything, it does
a longest common subsequence on lines which occur exactly once on both
sides, then recurses between lines which got matched on that pass."

So, if sources looks like:
[list("XYZPDQ"), list("bYlmPz")]

Then the proper merge would use Y and P as the delimiters, putting X
and b before Y; Z, l and m before P; and D, Q and z after P -- like
your second example (but keeping an instance of Y). Right? Leaving us
with the context-dependent issue of ordering between the delimiters,
probably depending on which list is used as the reference initially.
So, if the reference list's elements go first, the result would be
XbYZlmPDQz. If the elements from later sources go after the last
common element (my first approach), it's bXYlmZPzDQ.

However, in the cold light of morning it looks like that was the wrong
approach -- it makes sense to treat any data that doesn't occur in the
reference list as semi-obsolete and append it instead -- so I think
I'll use your much simpler algorithm. Thanks!
 
etal

Yes :)

Seriously, you are using O(n) containers and O(n) lookup where mine uses
O(1). For short lists it doesn't matter, but as the list length grows the
difference gets huge:

$ cat unique.py
def unique(items):
    u = set(items)
    if len(u) == len(items):
        return items
    result = []
    for item in items:
        if item in u:
            result.append(item)
            u.remove(item)
    return result

Yikes... and the list comprehension looked so innocent. I had resigned
myself to O(n) lookup, but you and Raymond appear to agree on the
basic concept for unique() -- check set membership, then generate the
final sequence and update the set based on that.
 
