table (ascii text) line layout recognition

vbfoobar · Sep 13, 2006

Hello,

I am looking for Python code to process
tables that are in ASCII text. The code must
determine where the columns (fields) are.
The tables in my application vary,
but their columns are not very hard
for a human to locate because, even
when ignoring the meaning of the words,
our eyes see the vertical alignments.

Here is a sample table (must be viewed
with fixed-width font to see alignments):
=================================

44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope

=================================

I want Python code that builds a representation
of this table (for example a list of lists, where each list
represents a table line, each element of the list
being a field value).

Any hints?
thanks

James Stroud · Sep 13, 2006

I have to catch a bus, but quickly: the algorithm is to code non-space
as one and space as zero, then 'or' down the columns. Zeros will
indicate a high probability of a gap between columns. Code tomorrow if
no one else posts.
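
A minimal sketch of that idea, written for Python 3 rather than the Python 2 used elsewhere in this thread. The function name `occupancy_mask` and the four-row `sample` are made up for illustration (the sample reuses the first three columns of a few rows from the table above, with assumed spacing):

```python
def occupancy_mask(lines):
    """Return one 0/1 flag per character column.

    1 means at least one line has a non-space character in that column,
    0 means every line is blank there (a likely gap between fields).
    """
    width = max(len(line) for line in lines)
    # Pad short lines so that every line contributes to every column.
    padded = [line.ljust(width) for line in lines]
    # zip(*padded) walks the text column by column; any() is the 'or'.
    return [int(any(ch != " " for ch in column)) for column in zip(*padded)]


# Hypothetical sample with assumed spacing.
sample = [
    "44544      ipod          apple",
    "GFGFHHF-12 unknown thing bizar",
    "45fjk      do not know   + is less",
    "           disk          seagate",
]

mask = occupancy_mask(sample)
print("".join(str(bit) for bit in mask))
```

With too few rows a space inside a multi-word field can line up across every row and show up as a false gap, which is why the zeros only indicate a high probability.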

Must run...

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

James Stroud · Sep 13, 2006

As promised. I call this the "cast a shadow" algorithm for table
discovery. This is about as obfuscated as I could make it. It will be up
to you to explain it to your teacher ;-)

Assuming the lines are all equal width (padded right with space) e.g.:

def rpadd(lines):
    """
    Pass in the lines as a list of lines.
    """
    lines = [line.rstrip() for line in lines]
    maxlen = max([len(line) for line in lines])
    return [line + ' ' * (maxlen - len(line)) for line in lines]

In which case, you can:

binary = [[((s==' ' and 2) or 1) for s in line] for line in lines]
shadow = [1 in c for c in zip(*binary)]

isit = False
indices = []
for i,v in enumerate(shadow):
    if v is not isit:
        indices.append(i)
        isit = not isit

indices.append(i+1)

indices = [t for t in zip(indices[::2],indices[1::2])]

columns = [[line[t[0]:t[1]].strip() for line in lines] for t in indices]

In case you want rows:

rows = zip(*columns)
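
For readers on Python 3, where `zip` returns an iterator (so `rows` would need `list(zip(*columns))`), here is a possible restatement of the same shadow idea. It is only a sketch: `shadow_split` is a hypothetical name, `itertools.groupby` replaces the toggling loop above, and the `sample` rows use assumed spacing.

```python
from itertools import groupby


def shadow_split(lines):
    """Split fixed-width text rows on columns that are blank in every row."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # shadow[i] is True when some row has a non-space character at column i.
    shadow = [any(ch != " " for ch in column) for column in zip(*padded)]

    # Collect (start, stop) for each run of consecutive occupied columns.
    spans = []
    pos = 0
    for occupied, run in groupby(shadow):
        length = len(list(run))
        if occupied:
            spans.append((pos, pos + length))
        pos += length

    return [[line[a:b].strip() for a, b in spans] for line in padded]


# Hypothetical sample with assumed spacing.
sample = [
    "44544      ipod          apple",
    "GFGFHHF-12 unknown thing bizar",
    "45fjk      do not know   + is less",
    "           disk          seagate",
]

for row in shadow_split(sample):
    print(row)
```

This version returns rows directly, so no final transpose is needed.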

James

James Stroud · Sep 13, 2006

James said:
indices = [t for t in zip(indices[::2],indices[1::2])]

(Artefact of cut-and-paste.)

Make that:

indices = zip(indices[::2],indices[1::2])

James

bearophileHUGS · Sep 13, 2006

My version, not much tested. It probably doesn't work well for tables
with few rows. It finds the most frequent word beginnings, and then
splits the data according to them.

data = """\
44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope
"""

import re, pprint
# import collections # For Python 2.5

# RE to find the beginning of words
tpatt = re.compile(r"\b[^ ]")

# Remove empty lines
lines = filter(None, data.splitlines())

# Find the positions of all word beginnings
# This finds: treshs = [0, 11, 25, 35, 49, ...
# 44544      ipod          apple     black         102
# ^          ^             ^         ^             ^
treshs = [ob.start() for li in lines for ob in tpatt.finditer(li)]

# Find treshs frequencies
freqs = {}
for el in treshs:
    freqs[el] = freqs.get(el, 0) + 1

# Find treshs frequencies, alternative for Python V.2.5
# freqs = collections.defaultdict(int)
# for el in treshs:
#     freqs[el] += 1

# Find a big enough frequency
bigf = max(freqs.itervalues()) * 0.6

# Find the most common column beginnings
cols = sorted(k for k,v in freqs.iteritems() if v>bigf)

def xpairs(alist):
    "xpairs(xrange(n)) ==> (0,1), (1,2), (2,3), ..., (n-2, n-1)"
    for i in xrange(len(alist)-1):
        yield alist[i:i+2]

result = [[li[x:y].strip() for x,y in xpairs(cols+[None])] for li in
          lines]

print data
pprint.pprint(result)

"""
Output:

44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope

[['44544', 'ipod', 'apple', 'black', '102'],
 ['GFGFHHF-12', 'unknown thing', 'bizar', 'brick mortar', 'tbc'],
 ['45fjk', 'do not know', '+ is less', '', 'biac'],
 ['', 'disk', 'seagate', '250GB', '130'],
 ['5G_gff', '', 'tbd', 'tbd', ''],
 ['gjgh88hgg', 'media record', 'a and b', '', '12'],
 ['hjj', 'foo', 'bar', 'hop', 'zip'],
 ['hg uy oi', 'hj uuu ii a', 'qqq ccc v', 'ZZZ Ughj', ''],
 ['qdsd', 'zert', '', 'nope', 'nope']]
"""

Bye,
bearophile
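
For anyone trying this on Python 3, the frequency-counting step might look like the sketch below. It is an adaptation, not the code posted above: `frequent_starts_split` and `WORD_START` are hypothetical names, `collections.Counter` stands in for the hand-built dict, and the regular expression is swapped for a lookbehind that marks any non-space character not preceded by a non-space character. The `sample` rows use assumed spacing.

```python
import re
from collections import Counter

# A word start: a non-space character that is not preceded by a non-space one.
WORD_START = re.compile(r"(?<![^ ])[^ ]")


def frequent_starts_split(lines, ratio=0.6):
    """Split rows at the character positions where words most often begin."""
    starts = Counter(
        match.start() for line in lines for match in WORD_START.finditer(line)
    )
    # Keep positions whose count exceeds a fraction of the highest count.
    threshold = max(starts.values()) * ratio
    cols = sorted(pos for pos, count in starts.items() if count > threshold)
    # Pair each column start with the next one; None means "to end of line".
    bounds = list(zip(cols, cols[1:] + [None]))
    return [[line[a:b].strip() for a, b in bounds] for line in lines]


# Hypothetical sample with assumed spacing.
sample = [
    "44544      ipod          apple",
    "GFGFHHF-12 unknown thing bizar",
    "45fjk      do not know   + is less",
    "           disk          seagate",
]

for row in frequent_starts_split(sample):
    print(row)
```

The `ratio` default of 0.6 mirrors the factor used above. Word starts inside multi-word fields (such as "thing" or "know") occur too rarely to pass the threshold, which is also why the approach needs a reasonable number of rows.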

Paul McGuire · Sep 13, 2006

James Stroud said:
As promised. I call this the "cast a shadow" algorithm for table
discovery. This is about as obfuscated as I could make it. It will be up
to you to explain it to your teacher ;-)

James -

I used your same algorithm, but I guess I used more brute force (and didn't
use pyparsing, either!).

-- Paul

data = """\
44544      ipod          apple     black         102
GFGFHHF-12 unknown thing bizar     brick mortar  tbc
45fjk      do not know   + is less               biac
           disk          seagate   250GB         130
5G_gff                   tbd       tbd
gjgh88hgg  media record  a and b                 12
hjj        foo           bar       hop           zip
hg uy oi   hj uuu ii a   qqq ccc v ZZZ Ughj
qdsd       zert                    nope          nope""".split('\n')

# find rightmost space characters delimiting text columns
spaceCols = set(range(max(map(len, data)))) - \
            set( [col for line in data
                      for col,c in enumerate(line.expandtabs())
                      if not c.isspace() ] )
spaceCols -= set( [c for c in spaceCols if c+1 in spaceCols ] )

# convert to sorted list of leading col characters
spaceCols = map(lambda x:x+1, sorted(list(spaceCols)))

# get and pretty-print data fields
dataFields = \
    [ [line.expandtabs()[start:stop] for (start,stop) in
       zip([0]+spaceCols,spaceCols+[None])] for line in data ]
import pprint
pprint.pprint( dataFields )

Gives:

[['44544      ', 'ipod          ', 'apple     ', 'black         ', '102'],
 ['GFGFHHF-12 ', 'unknown thing ', 'bizar     ', 'brick mortar  ', 'tbc'],
 ['45fjk      ', 'do not know   ', '+ is less ', '              ', 'biac'],
 ['           ', 'disk          ', 'seagate   ', '250GB         ', '130'],
 ['5G_gff     ', '              ', 'tbd       ', 'tbd', ''],
 ['gjgh88hgg  ', 'media record  ', 'a and b   ', '              ', '12'],
 ['hjj        ', 'foo           ', 'bar       ', 'hop           ', 'zip'],
 ['hg uy oi   ', 'hj uuu ii a   ', 'qqq ccc v ', 'ZZZ Ughj', ''],
 ['qdsd       ', 'zert          ', '          ', 'nope          ', 'nope']]

bearophileHUGS · Sep 14, 2006

Here you can find an improved version:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/498093
