Data Manipulation - Rows to Columns

Tess · Feb 5, 2008

Hello All,

I have a text file with marked up data that I need to convert into a
text tab separated file.

The structure of the input file is listed below (see file 1) and the
desired output file is below as well (see file 2).

I am a complete novice with python and would appreciate any tips you
may be able to provide.

Best,

Tess

file 1:
<item>TABLE</table>
<color>black</color>
<color>blue</color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>
<color>black</color>
<color>red</color>
<item>TABLE</table>
<color>white</color>
<color>gray</color>
<color>pink</color>

file 2 (tab separated):
TABLE black blue red
CHAIR yellow black red
TABLE white gray pink

Gabriel Genellina · Feb 5, 2008

En Wed said:
I have a text file with marked up data that I need to convert into a
text tab separated file.

The structure of the input file is listed below (see file 1) and the
desired output file is below as well (see file 2).

I am a complete novice with python and would appreciate any tips you
may be able to provide.

file 1:
<item>TABLE</table>
<color>black</color>
<color>blue</color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>
<color>black</color>
<color>red</color>
<item>TABLE</table>
<color>white</color>
<color>gray</color>
<color>pink</color>

Are you sure it says <item>...</table>?
Are ALWAYS three colors per item, as in your example? If this is the case,
just read groups of 4 lines and ignore the tags.

file 2 (tab separated):
TABLE black blue red
CHAIR yellow black red
TABLE white gray pink

The best way to produce this output is using the csv module:
http://docs.python.org/lib/module-csv.html
So we need a list of rows, being each row a list of column data. A simple
way of building such structure from the input file would be:

rows = []
row = None
for line in open('file1.txt'):
line = line.strip() # remove leading and trailing whitespace
if line.startswith('<item>'):
if row: rows.append(row)
j = row.index("</")
item = row[6:j]
row = [item]
elif line.startswith('<color>'):
j = row.index("</")
color = row[7:j]
row.append(color)
else:
raise ValueError, "can't understand line: %r" % line
if row: rows.append(row)

This allows for a variable number of "color" lines per item. Once the
`rows` list is built, we only have to create a csv writer for the right
dialect ('excel_tab' looks promising) and feed the rows to it:

import csv
fout = open('file2.txt', 'wb')
writer = csv.writer(fout, dialect='excel_tab')
writer.writerows(rows)

That's all folks!

Gabriel Genellina · Feb 5, 2008

En Wed, 06 Feb 2008 01:52:08 -0200, Gabriel Genellina

I have a text file with marked up data that I need to convert into a
text tab separated file.

Sorry, I sent the code too early:

rows = []
row = None
for line in file1:
line = line.strip() # remove leading and trailing whitespace
if line.startswith('<item>'):
if row: rows.append(row)
j = line.index("</")
item = line[6:j]
row = [item]
elif line.startswith('<color>'):
j = line.index("</")
color = line[7:j]
row.append(color)
else:
raise ValueError, "can't understand line: %r" % line
if row: rows.append(row)

(some `row` should have been `line`)
In the second part, the right dialect name is 'excel-tab' instead of
'excel_tab'.

bearophileHUGS · Feb 6, 2008

This is smells of homework. Here are few alternative solutions of mine
that I don't like. I presume a good teacher will refuse them all,
because no one of them uses the right tool

And every one of them
has some small problem (even if they work here).

data = """\
<item>TABLE</table>
<color>black</color>
<color>blue</color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>
<color>black</color>
<color>red</color>
<item>SOFA</table>
<color>white</color>
<color>gray</color>
<color>pink</color>
"""

data2 = data.replace("<color>","").replace("</color>","").replace("</
table>","")
groups = [b.split() for b in data2.split("<item>") if b]
print groups
print

import re
data2 = re.sub(r"<color>|</color>|</table>", "", data)
groups = [b.split() for b in data2.split("<item>") if b]
print groups
print

import re
def splitter(data):
patt = re.compile(r"(?:<item>(.*)</table>)|(?:<color>(.*)</
color>)")
parts = []
for mo in patt.finditer(data):
p1, p2 = mo.groups()
if p1 is None:
parts.append(p2)
else:
if parts:
yield parts
parts = [p1]
if parts:
yield parts
print list(splitter(data))
print

def splitter2(items, predicate):
parts = []
for el in items:
if predicate(el):
parts.append(el)
else:
if parts:
yield parts
parts = [el]
if parts:
yield parts
import re
patt = re.compile(r"(?:<item>(.*)</table>)|(?:<color>(.*)</color>)")
xmobjects = (mo.groups() for mo in patt.finditer(data))
process = lambda group: [group[0][0]] + [part[1] for part in
group[1:]]
isstart = lambda (p1,p2): p1 is None
xgroups = (process(g) for g in splitter2(xmobjects, isstart))
print list(xgroups)
print

data2 = """
<item>TABLE</table>
<color>black</color>
<color>blue< / color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>

<color>black</color>
<color>red</color>
<item>SOFA</table>
<color>white</color>
<color>gray</color>
< color > pink < / color >
"""

import re
patt = re.compile(r"""
\s* < \s* (item|color) \s* > \s*
(.*)
\s* < \s* / \s* (?:table|color) \s* > \s*
""", re.VERBOSE)
groups = []
for mo in patt.finditer(data2):
p1, p2 = mo.groups()
if p1 == "item":
groups.append([p2])
else:
groups[-1].append(p2)
print groups
print

Bye,
bearophile

Dennis Lee Bieber · Feb 6, 2008

file 1:
<item>TABLE</table>

As commented by others, the <item>....</table> pairing looks WRONG;

shouldn't it be said:
<color>black</color>
<color>blue</color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>
<color>black</color>
<color>red</color>
<item>TABLE</table>
<color>white</color>
<color>gray</color>
<color>pink</color>

Or even more completely...

<table><item>...</item><color>...</color><color>...</color></table>

Ah well... Given all the various re-based solutions (I have yet to
find a use for regular expressions) let's toss in a different type of
complexity. You'll notice the work-arounds that were needed to handle
that mismatch of <item> and </table>

-=-=-=-=-=-=-

import sgmllib
import csv

# Format still looks wrong... I'd expect </item>, not </table>
SAMPLE = """\
<item>TABLE</table>
<color>black</color>
<color>blue</color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>
<color>black</color>
<color>red</color>
<item>TABLE</table>
<color>white</color>
<color>gray</color>
<color>pink</color>
"""

class HardWay(sgmllib.SGMLParser):
def __init__(self):
sgmllib.SGMLParser.__init__(self)
self.record = []
self.inTag = False
self.tsvfid = "PARSED%s.tsv" % id(self)
self.tsv = open(self.tsvfid, "wb")
self.writer = csv.writer(self.tsv, delimiter="\t")

def do_item(self, attrs): #if </item> is correct, this should be
start_item(...)
if self.record:
self.writer.writerow(self.record)
self.record = []
self.inTag = True

# parser reports an unbalanced closing tag for </table>!
# if, OTOH, </item> is proper, use end_item()
## def end_table(self):
## self.inTag = False
def report_unbalanced(self, tag):
self.inTag = False

def start_color(self, attrs):
self.inTag = True

def end_color(self):
self.inTag = False

def handle_data(self, text):
if self.inTag and text.strip(): #don't write stray glitches from
unmatched
self.record.append(text.strip())

def close(self):
if self.record:
self.writer.writerow(self.record)
self.tsv.close()
sgmllib.SGMLParser.close(self)

if __name__ == "__main__":
parser = HardWay()
parser.feed(SAMPLE)
parser.close()
-=-=-=-=-=-=-=-

Output file:

-=-=-=-=-=-
TABLE black blue red
CHAIR yellow black red
TABLE white gray pink
-=-=-=-=-=-
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/

Need help , Having problem with Drag n Drop ?	9	Dec 21, 2022
Need help , Having problem with Drag n Drop...	1	Dec 21, 2022
I Need Help with making a function that draws in a canvas using location data.	1	Dec 17, 2021
Database Manager: A C++ Console Application	14	May 12, 2025
Help me finding the xslt error please	0	Nov 18, 2011
Text box simply do not stand out against the wall paper.	3	Feb 7, 2025
How can I get my menu inside of a menu to function properly?	1	Jan 19, 2023
Iam having trouble adding a level editor to my platformer	0	Nov 4, 2025

Data Manipulation - Rows to Columns

Tess

Gabriel Genellina

Gabriel Genellina

bearophileHUGS

Dennis Lee Bieber

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads