it-all-starts-with-a-good-data-structure-ly yours,
Amen, brother.
Even worse: I recall seeing code somewhere that had a data structure
like this:
{k1: [v1,v2], k2: v3, k3: None, ...}
instead of
{k1: [v1,v2], k2: [v3], k3: [], ...}
Page 1
I liked the elegant code example for building a book index. However in
practice the user requirement would be not to have duplicated page
numbers when a word occurs more than once on the same page. If you can
achieve that elegantly, please post it!
Page 2
Don't know about elegance, but I wanted to play with my new 2.3 version ;-)
Assuming the Timbot has special-cased small sets efficiently, and
assuming pageSeq below is a generator can yield a sequence of word-iterable
page objects (e.g. lists of words left after cleaning out punctuation
etc of a page, however page boundaries are detected, etc etc):
(FTHOI, the following pageSeq generator expects to find a "Page xxx"
line ending each page, with nothing else on the line). I'm going to use
this post as a "book" ;-)
Page 3
====< bookindex.py >=================================================
keepalnum = ''.join([c.isalnum() and c or ' ' for c in [chr(i) for i in range(256)]])
def pageSeq(bookFile): # expect open bookFile ready to read
pageWords = []
pageId = '<prologue>'
while 1:
line = bookFile.readline()
if not line: break
lineToks = line.translate(keepalnum).split()
if not lineToks: continue
# expect single footer line with only "Page pgid" at end of page
if len(lineToks)==2 and lineToks[0]=='Page':
pageId = lineToks[1]
yield pageId, pageWords
pageWords = []; pageId = '(after %s)'% pageId
else: pageWords.extend([tok for tok in lineToks if tok.isalpha()]) # rej digits in words
if pageWords:
yield pageId, pageWords
def bookIndex(bookFile):
# (untested, requires 2.3 features)
from sets import Set
wordIndex = {}
for pgId, page in pageSeq(bookFile):
for word in page:
try:
wordIndex[word].add(pgId)
except KeyError:
wordIndex[word] = Set([pgId])
words = wordIndex.keys()
words.sort()
for word in words:
pageIds = list(wordIndex[word])
pnMaxWidth = max(map(len,pageIds))
pageIds = [pid.rjust(pnMaxWidth) for pid in pageIds]
pageIds.sort()
pageIds = map(str.strip, pageIds)
yield (word, ', '.join(pageIds))
if __name__ == '__main__':
import sys
args = sys.argv[1:]
if args:
arg = args.pop(0)
if arg == '-': f = sys.stdin
else: f = file(arg)
iCol=0
for wordOccLine in bookIndex(f):
print ('%14s: %-10s' % wordOccLine),
if iCol%3 == 2: print
iCol += 1
print
else:
raise SystemExit(
'Usage: python bookindex.py bookfilepath\n'
' a single minus for bookfilepath uses sys.stdin')
=====================================================================
Page CODE
I started to post just the bookIndex generator (still marked untested, which is
almost right still ;-) Then one think led to another ;-)
Page LAST
I'll copy above this line to the clip board and filter it through. Result:
[22:22] C:\pywk\clp>getclip | python bookindex.py - |more
Amen: 1 Assuming: 3 Aug: 1
Don: 3 Even: 1 FTHOI: 3
GMT: 1 Hettinger: 1 However: 2
I: 1, 2, 3, LAST If: 2 John: 1
KeyError: CODE Machin: 1 None: 1
On: 1 Page: 3, CODE Raymond: 1
Sat: 1 Set: CODE Sun: 1
SystemExit: CODE Then: LAST Timbot: 3
Usage: CODE a: 1, 2, 3, CODE about: 3
achieve: 2 add: CODE after: 3, CODE
all: 1 almost: LAST and: 3, CODE
another: LAST are: 3 arg: CODE
args: CODE argv: CODE as: 3
assuming: 3 at: CODE be: 2
below: 3 book: 2, 3 bookFile: CODE
bookIndex: CODE, LAST bookfilepath: CODE bookindex: CODE
boundaries: 3 break: CODE brother: 1
building: 2 but: 3 c: CODE
can: 2, 3 cased: 3 chr: CODE
cleaning: 3 code: 1, 2 continue: CODE
data: 1 def: CODE detected: 3
digits: CODE duplicated: 2 e: 3
each: 3 efficiently: 3 elegance: 3
elegant: 2 elegantly: 2 else: 3, CODE
end: CODE ending: 3 etc: 3
example: 2 except: CODE expect: CODE
expects: 3 extend: CODE f: CODE
features: CODE file: CODE find: 3
following: 3 footer: CODE for: 2, CODE
from: CODE g: 3 generator: 3, LAST
going: 3 good: 1 had: 1
has: 3 have: 2 however: 3
i: CODE iCol: CODE if: CODE
import: CODE in: 2, CODE index: 2
instead: 1 is: 3, LAST isalnum: CODE
isalpha: CODE it: 1, 2 iterable: 3
join: CODE just: LAST keepalnum: CODE
keys: CODE know: 3 led: LAST
left: 3 len: CODE lexicon: 1
like: 1 liked: 2 line: 3, CODE
lineToks: CODE list: CODE lists: 3
ly: 1 m: 3 main: CODE
map: CODE marked: LAST max: CODE
minus: CODE more: 2 my: 3
n: CODE name: CODE net: 1
new: 3 not: 2, CODE nothing: 3
numbers: 2 objects: 3 occurs: 2
of: 1, 3, CODE on: 2, 3 once: 2
one: LAST only: CODE open: CODE
or: CODE out: 3 page: 2, 3, CODE
pageId: CODE pageIds: CODE pageSeq: 3, CODE
pageWords: CODE pgId: CODE pgid: CODE
pid: CODE play: 3 please: 2
pnMaxWidth: CODE pop: CODE post: 2, 3, LAST
practice: 2 print: CODE prologue: CODE
punctuation: 3 py: CODE python: CODE
raise: CODE range: CODE read: CODE
readline: CODE ready: CODE recall: 1
rej: CODE requirement: 2 requires: CODE
right: LAST rjust: CODE s: CODE
same: 2 seeing: 1 sequence: 3
sets: 3, CODE single: CODE sjmachin: 1
small: 3 somewhere: 1 sort: CODE
special: 3 split: CODE started: LAST
starts: 1 stdin: CODE still: LAST
str: CODE strip: CODE structure: 1
sys: CODE t: 3 than: 2
that: 1, 2 the: 2, 3, LAST think: LAST
this: 1, 3 to: 2, 3, CODE, LAST tok: CODE
translate: CODE try: CODE untested: CODE, LAST
use: 3 user: 2 uses: CODE
verizon: 1 version: 3 wanted: 3
when: 2 which: LAST while: CODE
with: 1, 3, CODE word: 2, 3, CODE wordIndex: CODE
wordOccLine: CODE words: 3, CODE worse: 1
would: 2 wrote: 1 xxx: 3
yield: 3, CODE you: 2 yours: 1
Regards,
Bengt Richter