Dict to "flat" list of (key,value)

Nicolas Girard · Jul 30, 2003

Hi,

Forgive me if the answer is trivial, but could you tell me how to achieve
the following:

{k1:[v1,v2],k2:v3,...} --> [[k1,v1],[k1,v2],[k2,v3],...]

The subtle point (at least to me) is to "flatten" values that are lists.

Thanks in advance,
Nicolas

Bengt Richter · Jul 30, 2003

Hi,

Forgive me if the answer is trivial, but could you tell me how to achieve
the following:

{k1:[v1,v2],k2:v3,...} --> [[k1,v1],[k1,v2],[k2,v3],...]

The subtle point (at least to me) is to "flatten" values that are lists.

Assuming v3 and others like it can't be references to lists
(or how could you tell it from the format of k1's list?):

(I quoted your names to avoid haing to define them separately)

>>> d = {'k1':['v1','v2'],'k2':'v3'}
>>> flat = []
>>> for k,v in d.items():

Click to expand...

Click to expand...

... if isinstance(v,list):
... for v2 in v:
... flat.append([k,v2])
... else:
... flat.append([k,v])
... [['k2', 'v3'], ['k1', 'v1'], ['k1', 'v2']]

Or would you prefer an inscrutable one-liner ;-)

Note that there is no guarantee of any particular order in d.items(),
though you could sort them:

>>> d = {'k1':['v1','v2'],'k2':'v3'}
>>> flat = []
>>> items = d.items()
>>> items.sort() # on keys
>>> for k,v in items:

Click to expand...

Click to expand...

... if isinstance(v,list):
# Note that v is a list here, but does have pre-existing order you might want to keep.
# If you wanted to guarantee sorted output of the v2's, you could
# do v.sort() here, though you might surprise others holding references
# to those lists, since they would see the sorting. To avoid that, make a copy first, e.g.,
# v = v[:]; v.sort() #(untested, but I think it should work ;-)

... for v2 in v:
... flat.append([k,v2])
... else:
... flat.append([k,v])
... [['k1', 'v1'], ['k1', 'v2'], ['k2', 'v3']]

Regards,
Bengt Richter

Brandon Beck · Jul 30, 2003

Someone will definitely come up with a better solution than this, but
this could work for you at least as a first try

import types
def flatten(d, iterables=(types.ListType, types.TupleType)):
rv = []
for k, v in d.items():
if type(v) in iterables:
rv += [[k,x] for x in v]
else:
rv += [[k,v]]
return rv

If you want to change the types of things that get iterated over, just
pass in a second parameter of your own.

Hope that helps,
Brandon

Nicolas Girard said:
Hi,

Forgive me if the answer is trivial, but could you tell me how to achieve
the following:

{k1:[v1,v2],k2:v3,...} --> [[k1,v1],[k1,v2],[k2,v3],...]

The subtle point (at least to me) is to "flatten" values that are lists.

Thanks in advance,
Nicolas

----== Posted via Newsfeed.Com - Unlimited-Uncensored-Secure Usenet News==----
http://www.newsfeed.com The #1 Newsgroup Service in the World!
100,000 Newsgroups
---= 19 East/West-Coast Specialized Servers - Total Privacy via

Encryption =---

David C. Fox · Jul 30, 2003

Nicolas said:
Hi,

Forgive me if the answer is trivial, but could you tell me how to achieve
the following:

{k1:[v1,v2],k2:v3,...} --> [[k1,v1],[k1,v2],[k2,v3],...]

The subtle point (at least to me) is to "flatten" values that are lists.

Well, I can get you at least part way, but there is an ambiguity in your
original data structure which causes problems:

1. d.items() will get you

[(k1, [v1, v2]), (k2, v3), ...]

2. Now, you want to expand out the list elements where the second part
of each tuple is a list, which you can do with.

flatten = lambda u: map(lambda x: (u[0], x), u[1])

or, if the double lambda expression is confusing:

def flatten(u):
return map(lambda x: (u[0], x), u[1])

(I've used a tuple (u[0], x) instead of a list, because your [k1, vn]
are always pairs, but you can use a list if you prefer)

Then flatten will take the nested element

(k1, [v1, v2])

and convert it to a list of tuples

[(k1, v1), [k2, v2])]

3. You want to apply flatten to each element of d.items(), which you
can do with

lol = map(flatten, d.items())

which will give you a list of lists of tuples,

4. and then reduce that to a list of tuples with

reduce(operator.add, lol)

Unfortunately, this doesn't quite work with your example above, because
flatten won't work properly when applied to (k2, v3). If v3 is a
sequence, it will instead give you [(k2, v3[0]), (k2, v3[1]), ...],
which is probably not what you want (especially if v3 is a string - try
flatten manually and see what happens). If v3 is not a sequence, you'll
get a TypeError saying that argument 2 to map() must support iteration.

If you *know* that v3 will NEVER be a list, you can modify flatten to
handle the special case like so:

def flatten(u):
if isinstance(u[1], type([])):
return (u[1], [u[2]])
return map(lambda x: (u[0], x), u[1])

Alternatively, if you can modify how the initial dictionary is generated
to ensure that cases where a key has a single value always appear as
k2: [v3] instead of k2: v3, then you can use the steps above without
modification. This is probably better if it is feasible, because your
original data structure is inherently ambiguous unless you know that v3
will never be a list.

David

David C. Fox · Jul 30, 2003

David C. Fox wrote:

Oops - a mistake and an omission

4. and then reduce that to a list of tuples with

reduce(operator.add, lol)

You need to import the operator module first.

def flatten(u):
if isinstance(u[1], type([])):
return (u[1], [u[2]])
return map(lambda x: (u[0], x), u[1])

I got my cases reversed - that should be "if not isinstance...

Mike Rovner · Jul 31, 2003

Nicolas said:
Forgive me if the answer is trivial, but could you tell me how to
achieve the following:

{k1:[v1,v2],k2:v3,...} --> [[k1,v1],[k1,v2],[k2,v3],...]

[list(i) for i in d.items()]

Mike

Mike Rovner · Jul 31, 2003

Mike said:
Nicolas said:

Forgive me if the answer is trivial, but could you tell me how to
achieve the following:

{k1:[v1,v2],k2:v3,...} --> [[k1,v1],[k1,v2],[k2,v3],...]

Click to expand...

[list(i) for i in d.items()]

Sorry, I missed v1,v2 flattening.

Raymond Hettinger · Aug 2, 2003

Nicolas Girard said:
Hi,

Forgive me if the answer is trivial, but could you tell me how to achieve
the following:

{k1:[v1,v2],k2:v3,...} --> [[k1,v1],[k1,v2],[k2,v3],...]

The subtle point (at least to me) is to "flatten" values that are lists.

The other posters answered the question (as asked)
by showing a loop that differentiated the two cases
of inner lists vs single values. One further thought,
is that the original data structure could be improved
by building it so that every value is in a list:

{k1:[v1,v2],k2:[v3],...}

This is typically done by building the values with setdefault:

index = {}
for pagenum in range(len(pages)):
page = pages[pagenum]
for word in page:
index.setdefault(word, []).append(pagenum)

The new structure is much more easily flattened:

[(k,v) for k, values in newstruct.iteritems() for v in values]

it-all-starts-with-a-good-data-structure-ly yours,

Raymond Hettinger

John Machin · Aug 3, 2003

it-all-starts-with-a-good-data-structure-ly yours,

Amen, brother.

Even worse: I recall seeing code somewhere that had a data structure
like this:

{k1: [v1,v2], k2: v3, k3: None, ...}
instead of
{k1: [v1,v2], k2: [v3], k3: [], ...}

I liked the elegant code example for building a book index. However in
practice the user requirement would be not to have duplicated page
numbers when a word occurs more than once on the same page. If you can
achieve that elegantly, please post it!

Cheers,
John

Scott David Daniels · Aug 3, 2003

John said:
[a perfectly fine chunk of code]
it-all-starts-with-a-good-data-structure-ly yours,

Click to expand...

I liked the elegant code example for building a book index. However in
practice the user requirement would be not to have duplicated page
numbers when a word occurs more than once on the same page. If you can
achieve that elegantly, please post it!

OK, I'll bite:
def genindex(pages):
index = {}
for pagenum in range(len(pages)):
page = pages[pagenum]
for word in page:
index.setdefault(word, {})[pagenum] = None

or with 2.3:
def genindex(pages):
index = {}
for pagenum, page in enumerate(pages):
for word in page:
index.setdefault(word, {})[pagenum] = None
return index

With either one, flattening looks like:
def flatten(index):
flattened = [(word, list(pages)) for word, pages
in index.items()]
for word, pages in flattened:
pages.sort()
flattened.sort()
return flatten

For fun, try:

book = ['this is a test'.split(),
'this is only a test'.split(),
'had this been a real function, you care a bit'.split()]
for word, pages in flatten(index(book)):
print word, pages

-Scott David Daniels
(e-mail address removed)

Bengt Richter · Aug 3, 2003

it-all-starts-with-a-good-data-structure-ly yours,

Click to expand...

Amen, brother.

Even worse: I recall seeing code somewhere that had a data structure
like this:

{k1: [v1,v2], k2: v3, k3: None, ...}
instead of
{k1: [v1,v2], k2: [v3], k3: [], ...}

Page 1

I liked the elegant code example for building a book index. However in
practice the user requirement would be not to have duplicated page
numbers when a word occurs more than once on the same page. If you can
achieve that elegantly, please post it!

Page 2

Don't know about elegance, but I wanted to play with my new 2.3 version ;-)

Assuming the Timbot has special-cased small sets efficiently, and
assuming pageSeq below is a generator can yield a sequence of word-iterable
page objects (e.g. lists of words left after cleaning out punctuation
etc of a page, however page boundaries are detected, etc etc):

(FTHOI, the following pageSeq generator expects to find a "Page xxx"
line ending each page, with nothing else on the line). I'm going to use
this post as a "book" ;-)

Page 3

====< bookindex.py >=================================================

keepalnum = ''.join([c.isalnum() and c or ' ' for c in [chr(i) for i in range(256)]])

def pageSeq(bookFile): # expect open bookFile ready to read
pageWords = []
pageId = '<prologue>'
while 1:
line = bookFile.readline()
if not line: break
lineToks = line.translate(keepalnum).split()
if not lineToks: continue
# expect single footer line with only "Page pgid" at end of page
if len(lineToks)==2 and lineToks[0]=='Page':
pageId = lineToks[1]
yield pageId, pageWords
pageWords = []; pageId = '(after %s)'% pageId
else: pageWords.extend([tok for tok in lineToks if tok.isalpha()]) # rej digits in words
if pageWords:
yield pageId, pageWords

def bookIndex(bookFile):
# (untested, requires 2.3 features)
from sets import Set
wordIndex = {}
for pgId, page in pageSeq(bookFile):
for word in page:
try:
wordIndex[word].add(pgId)
except KeyError:
wordIndex[word] = Set([pgId])

words = wordIndex.keys()
words.sort()
for word in words:
pageIds = list(wordIndex[word])
pnMaxWidth = max(map(len,pageIds))
pageIds = [pid.rjust(pnMaxWidth) for pid in pageIds]
pageIds.sort()
pageIds = map(str.strip, pageIds)
yield (word, ', '.join(pageIds))

if __name__ == '__main__':
import sys
args = sys.argv[1:]
if args:
arg = args.pop(0)
if arg == '-': f = sys.stdin
else: f = file(arg)
iCol=0
for wordOccLine in bookIndex(f):
print ('%14s: %-10s' % wordOccLine),
if iCol%3 == 2: print
iCol += 1
print
else:
raise SystemExit(
'Usage: python bookindex.py bookfilepath\n'
' a single minus for bookfilepath uses sys.stdin')
=====================================================================
Page CODE

I started to post just the bookIndex generator (still marked untested, which is
almost right still ;-) Then one think led to another ;-)

Page LAST

I'll copy above this line to the clip board and filter it through. Result:

[22:22] C:\pywk\clp>getclip | python bookindex.py - |more
Amen: 1 Assuming: 3 Aug: 1
Don: 3 Even: 1 FTHOI: 3
GMT: 1 Hettinger: 1 However: 2
I: 1, 2, 3, LAST If: 2 John: 1
KeyError: CODE Machin: 1 None: 1
On: 1 Page: 3, CODE Raymond: 1
Sat: 1 Set: CODE Sun: 1
SystemExit: CODE Then: LAST Timbot: 3
Usage: CODE a: 1, 2, 3, CODE about: 3
achieve: 2 add: CODE after: 3, CODE
all: 1 almost: LAST and: 3, CODE
another: LAST are: 3 arg: CODE
args: CODE argv: CODE as: 3
assuming: 3 at: CODE be: 2
below: 3 book: 2, 3 bookFile: CODE
bookIndex: CODE, LAST bookfilepath: CODE bookindex: CODE
boundaries: 3 break: CODE brother: 1
building: 2 but: 3 c: CODE
can: 2, 3 cased: 3 chr: CODE
cleaning: 3 code: 1, 2 continue: CODE
data: 1 def: CODE detected: 3
digits: CODE duplicated: 2 e: 3
each: 3 efficiently: 3 elegance: 3
elegant: 2 elegantly: 2 else: 3, CODE
end: CODE ending: 3 etc: 3
example: 2 except: CODE expect: CODE
expects: 3 extend: CODE f: CODE
features: CODE file: CODE find: 3
following: 3 footer: CODE for: 2, CODE
from: CODE g: 3 generator: 3, LAST
going: 3 good: 1 had: 1
has: 3 have: 2 however: 3
i: CODE iCol: CODE if: CODE
import: CODE in: 2, CODE index: 2
instead: 1 is: 3, LAST isalnum: CODE
isalpha: CODE it: 1, 2 iterable: 3
join: CODE just: LAST keepalnum: CODE
keys: CODE know: 3 led: LAST
left: 3 len: CODE lexicon: 1
like: 1 liked: 2 line: 3, CODE
lineToks: CODE list: CODE lists: 3
ly: 1 m: 3 main: CODE
map: CODE marked: LAST max: CODE
minus: CODE more: 2 my: 3
n: CODE name: CODE net: 1
new: 3 not: 2, CODE nothing: 3
numbers: 2 objects: 3 occurs: 2
of: 1, 3, CODE on: 2, 3 once: 2
one: LAST only: CODE open: CODE
or: CODE out: 3 page: 2, 3, CODE
pageId: CODE pageIds: CODE pageSeq: 3, CODE
pageWords: CODE pgId: CODE pgid: CODE
pid: CODE play: 3 please: 2
pnMaxWidth: CODE pop: CODE post: 2, 3, LAST
practice: 2 print: CODE prologue: CODE
punctuation: 3 py: CODE python: CODE
raise: CODE range: CODE read: CODE
readline: CODE ready: CODE recall: 1
rej: CODE requirement: 2 requires: CODE
right: LAST rjust: CODE s: CODE
same: 2 seeing: 1 sequence: 3
sets: 3, CODE single: CODE sjmachin: 1
small: 3 somewhere: 1 sort: CODE
special: 3 split: CODE started: LAST
starts: 1 stdin: CODE still: LAST
str: CODE strip: CODE structure: 1
sys: CODE t: 3 than: 2
that: 1, 2 the: 2, 3, LAST think: LAST
this: 1, 3 to: 2, 3, CODE, LAST tok: CODE
translate: CODE try: CODE untested: CODE, LAST
use: 3 user: 2 uses: CODE
verizon: 1 version: 3 wanted: 3
when: 2 which: LAST while: CODE
with: 1, 3, CODE word: 2, 3, CODE wordIndex: CODE
wordOccLine: CODE words: 3, CODE worse: 1
would: 2 wrote: 1 xxx: 3
yield: 3, CODE you: 2 yours: 1

Regards,
Bengt Richter

Raymond Hettinger · Aug 4, 2003

[Raymond]

index = {}
for pagenum in range(len(pages)):
page = pages[pagenum]
for word in page:
index.setdefault(word, []).append(pagenum)

it-all-starts-with-a-good-data-structure-ly yours,

Click to expand...

[John Machin]

Amen, brother.

Even worse: I recall seeing code somewhere that had a data structure
like this:

{k1: [v1,v2], k2: v3, k3: None, ...}
instead of
{k1: [v1,v2], k2: [v3], k3: [], ...}

I liked the elegant code example for building a book index. However in
practice the user requirement would be not to have duplicated page
numbers when a word occurs more than once on the same page. If you can
achieve that elegantly, please post it!

How about a simple Py2.3 clean-up pass:

# Uniquify and sort each list of page numbers
for word, pagelist in index.iteritems():
newlist = dict.fromkeys(pagelist).keys()
newlist.sort()
index[word] = newlist

Raymond Hettinger

Chanelling Guido - dict subclasses	11	Jan 15, 2014
__delitem__ "feature"	5	Dec 26, 2010
dict slice in python (translating perl to python)	12	Sep 10, 2008
Duplicate keys in dict?	6	Mar 7, 2010
Lowest Value in List	5	Oct 2, 2013
Assigning a list to a key of a dict	3	May 14, 2009
Tuple of coordinates	4	May 29, 2008
Arrays and Hashes	4	May 24, 2009

Dict to "flat" list of (key,value)

Nicolas Girard

Bengt Richter

Brandon Beck

David C. Fox

David C. Fox

Mike Rovner

Mike Rovner

Raymond Hettinger

John Machin

Scott David Daniels

Bengt Richter

Raymond Hettinger

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads