Data Manipulation - Rows to Columns

T

Tess

Hello All,

I have a text file with marked up data that I need to convert into a
text tab separated file.

The structure of the input file is listed below (see file 1) and the
desired output file is below as well (see file 2).

I am a complete novice with python and would appreciate any tips you
may be able to provide.

Best,

Tess



file 1:
<item>TABLE</table>
<color>black</color>
<color>blue</color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>
<color>black</color>
<color>red</color>
<item>TABLE</table>
<color>white</color>
<color>gray</color>
<color>pink</color>


file 2 (tab separated):
TABLE black blue red
CHAIR yellow black red
TABLE white gray pink
 
G

Gabriel Genellina

I have a text file with marked up data that I need to convert into a
text tab separated file.

The structure of the input file is listed below (see file 1) and the
desired output file is below as well (see file 2).

I am a complete novice with python and would appreciate any tips you
may be able to provide.


file 1:
<item>TABLE</table>
<color>black</color>
<color>blue</color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>
<color>black</color>
<color>red</color>
<item>TABLE</table>
<color>white</color>
<color>gray</color>
<color>pink</color>

Are you sure it says <item>...</table>?
Are ALWAYS three colors per item, as in your example? If this is the case,
just read groups of 4 lines and ignore the tags.
file 2 (tab separated):
TABLE black blue red
CHAIR yellow black red
TABLE white gray pink

The best way to produce this output is using the csv module:
http://docs.python.org/lib/module-csv.html
So we need a list of rows, being each row a list of column data. A simple
way of building such structure from the input file would be:

rows = []
row = None
for line in open('file1.txt'):
line = line.strip() # remove leading and trailing whitespace
if line.startswith('<item>'):
if row: rows.append(row)
j = row.index("</")
item = row[6:j]
row = [item]
elif line.startswith('<color>'):
j = row.index("</")
color = row[7:j]
row.append(color)
else:
raise ValueError, "can't understand line: %r" % line
if row: rows.append(row)

This allows for a variable number of "color" lines per item. Once the
`rows` list is built, we only have to create a csv writer for the right
dialect ('excel_tab' looks promising) and feed the rows to it:

import csv
fout = open('file2.txt', 'wb')
writer = csv.writer(fout, dialect='excel_tab')
writer.writerows(rows)

That's all folks!
 
G

Gabriel Genellina

En Wed, 06 Feb 2008 01:52:08 -0200, Gabriel Genellina
I have a text file with marked up data that I need to convert into a
text tab separated file.

Sorry, I sent the code too early:

rows = []
row = None
for line in file1:
line = line.strip() # remove leading and trailing whitespace
if line.startswith('<item>'):
if row: rows.append(row)
j = line.index("</")
item = line[6:j]
row = [item]
elif line.startswith('<color>'):
j = line.index("</")
color = line[7:j]
row.append(color)
else:
raise ValueError, "can't understand line: %r" % line
if row: rows.append(row)

(some `row` should have been `line`)
In the second part, the right dialect name is 'excel-tab' instead of
'excel_tab'.
 
B

bearophileHUGS

This is smells of homework. Here are few alternative solutions of mine
that I don't like. I presume a good teacher will refuse them all,
because no one of them uses the right tool :) And every one of them
has some small problem (even if they work here).

data = """\
<item>TABLE</table>
<color>black</color>
<color>blue</color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>
<color>black</color>
<color>red</color>
<item>SOFA</table>
<color>white</color>
<color>gray</color>
<color>pink</color>
"""

data2 = data.replace("<color>","").replace("</color>","").replace("</
table>","")
groups = [b.split() for b in data2.split("<item>") if b]
print groups
print


import re
data2 = re.sub(r"<color>|</color>|</table>", "", data)
groups = [b.split() for b in data2.split("<item>") if b]
print groups
print


import re
def splitter(data):
patt = re.compile(r"(?:<item>(.*)</table>)|(?:<color>(.*)</
color>)")
parts = []
for mo in patt.finditer(data):
p1, p2 = mo.groups()
if p1 is None:
parts.append(p2)
else:
if parts:
yield parts
parts = [p1]
if parts:
yield parts
print list(splitter(data))
print


def splitter2(items, predicate):
parts = []
for el in items:
if predicate(el):
parts.append(el)
else:
if parts:
yield parts
parts = [el]
if parts:
yield parts
import re
patt = re.compile(r"(?:<item>(.*)</table>)|(?:<color>(.*)</color>)")
xmobjects = (mo.groups() for mo in patt.finditer(data))
process = lambda group: [group[0][0]] + [part[1] for part in
group[1:]]
isstart = lambda (p1,p2): p1 is None
xgroups = (process(g) for g in splitter2(xmobjects, isstart))
print list(xgroups)
print



data2 = """
<item>TABLE</table>
<color>black</color>
<color>blue< / color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>

<color>black</color>
<color>red</color>
<item>SOFA</table>
<color>white</color>
<color>gray</color>
< color > pink < / color >
"""

import re
patt = re.compile(r"""
\s* < \s* (item|color) \s* > \s*
(.*)
\s* < \s* / \s* (?:table|color) \s* > \s*
""", re.VERBOSE)
groups = []
for mo in patt.finditer(data2):
p1, p2 = mo.groups()
if p1 == "item":
groups.append([p2])
else:
groups[-1].append(p2)
print groups
print

Bye,
bearophile
 
D

Dennis Lee Bieber

file 1:
<item>TABLE</table>

As commented by others, the <item>....</table> pairing looks WRONG;
shouldn't it be said:
<color>black</color>
<color>blue</color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>
<color>black</color>
<color>red</color>
<item>TABLE</table>
<color>white</color>
<color>gray</color>
<color>pink</color>
Or even more completely...

<table><item>...</item><color>...</color><color>...</color></table>

Ah well... Given all the various re-based solutions (I have yet to
find a use for regular expressions) let's toss in a different type of
complexity. You'll notice the work-arounds that were needed to handle
that mismatch of <item> and </table>

-=-=-=-=-=-=-

import sgmllib
import csv

# Format still looks wrong... I'd expect </item>, not </table>
SAMPLE = """\
<item>TABLE</table>
<color>black</color>
<color>blue</color>
<color>red</color>
<item>CHAIR</table>
<color>yellow</color>
<color>black</color>
<color>red</color>
<item>TABLE</table>
<color>white</color>
<color>gray</color>
<color>pink</color>
"""

class HardWay(sgmllib.SGMLParser):
def __init__(self):
sgmllib.SGMLParser.__init__(self)
self.record = []
self.inTag = False
self.tsvfid = "PARSED%s.tsv" % id(self)
self.tsv = open(self.tsvfid, "wb")
self.writer = csv.writer(self.tsv, delimiter="\t")

def do_item(self, attrs): #if </item> is correct, this should be
start_item(...)
if self.record:
self.writer.writerow(self.record)
self.record = []
self.inTag = True

# parser reports an unbalanced closing tag for </table>!
# if, OTOH, </item> is proper, use end_item()
## def end_table(self):
## self.inTag = False
def report_unbalanced(self, tag):
self.inTag = False

def start_color(self, attrs):
self.inTag = True

def end_color(self):
self.inTag = False

def handle_data(self, text):
if self.inTag and text.strip(): #don't write stray glitches from
unmatched
self.record.append(text.strip())

def close(self):
if self.record:
self.writer.writerow(self.record)
self.tsv.close()
sgmllib.SGMLParser.close(self)

if __name__ == "__main__":
parser = HardWay()
parser.feed(SAMPLE)
parser.close()
-=-=-=-=-=-=-=-

Output file:

-=-=-=-=-=-
TABLE black blue red
CHAIR yellow black red
TABLE white gray pink
-=-=-=-=-=-
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,043
Latest member
CannalabsCBDReview

Latest Threads

Top