TSV to HTML

Brian

I was wondering if anyone here on the group could point me in a
direction that would explain how to use Python to convert a TSV file
to HTML. I have been searching for a resource but have only seen
information on converting CSV to TSV. Specifically, I want
to take the values and insert them into an HTML table.

I have been trying to figure it out myself, and in essence, this is
what I have come up with. Am I on the right track? I really have the
feeling that I am re-inventing the wheel here.

1) in the code define a css
2) use a regex to extract the info between tabs
3) wrap the values in the appropriate tags and insert into table.
4) write the .html file

Thanks again for your patience,
Brian
 
Tim Chase

> I was wondering if anyone here on the group could point me
> in a direction that would explain how to use python to
> convert a tsv file to html. I have been searching for a
> resource but have only seen information on dealing with
> converting csv to tsv. Specifically I want to take the
> values and insert them into an html table.
>
> I have been trying to figure it out myself, and in
> essence, this is what I have come up with. Am I on the
> right track? I really have the feeling that I am
> re-inventing the wheel here.
>
> 1) in the code define a css
> 2) use a regex to extract the info between tabs
> 3) wrap the values in the appropriate tags and insert into
> table.
> 4) write the .html file

Sounds like you just want to do something like

print "<table>"
for line in file("in.tsv"):
    print "<tr>"
    items = line.split("\t")
    for item in items:
        print "<td>%s</td>" % item
    print "</tr>"
print "</table>"

It gets a little more complex if you need to clean each item
for HTML entities/scripts/etc...but that's usually just a
function that you'd wrap around the item:

print "<td>%s</td>" % escapeEntity(item)

using whatever "escapeEntity" function you have on hand.
E.g.

from xml.sax.saxutils import escape
:
:
print "<td>%s</td>" % escape(item)

It doesn't gracefully attempt to define headers using
<thead>, <tbody>, and <th> sorts of rows, but a little
toying should solve that.

-tim
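
The header handling Tim leaves as an exercise could look roughly like the sketch below. It is not from the thread: it is written for Python 3 (the code above is Python 2), the function name `tsv_lines_to_table` is made up for illustration, and it assumes the first line of the input is the header row. It uses `html.escape` from the standard library in place of `xml.sax.saxutils.escape`.

```python
from html import escape


def tsv_lines_to_table(lines):
    """Build an HTML table string; the first line becomes the <thead> row."""
    # Split every line on tabs, dropping only the trailing newline.
    rows = [line.rstrip("\n").split("\t") for line in lines]
    if not rows:
        return "<table></table>"
    header, body = rows[0], rows[1:]
    parts = ["<table>", "<thead>", "<tr>"]
    parts += ["<th>%s</th>" % escape(cell) for cell in header]
    parts += ["</tr>", "</thead>", "<tbody>"]
    for row in body:
        parts.append("<tr>")
        parts += ["<td>%s</td>" % escape(cell) for cell in row]
        parts.append("</tr>")
    parts += ["</tbody>", "</table>"]
    return "\n".join(parts)


html_out = tsv_lines_to_table(["name\tqty\n", "a<b\t1\n"])
print(html_out)
```

Any iterable of lines works as input, so an open file object can be passed in directly.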
 
Leif K-Brooks

Brian said:
> I was wondering if anyone here on the group could point me in a
> direction that would explain how to use python to convert a tsv file
> to html. I have been searching for a resource but have only seen
> information on dealing with converting csv to tsv. Specifically I want
> to take the values and insert them into an html table.

import csv
from xml.sax.saxutils import escape

def tsv_to_html(input_file, output_file):
    output_file.write('<table><tbody>\n')
    for row in csv.reader(input_file, 'excel-tab'):
        output_file.write('<tr>')
        for col in row:
            output_file.write('<td>%s</td>' % escape(col))
        output_file.write('</tr>\n')
    output_file.write('</tbody></table>')

Usage example:
<table><tbody>
<tr><td>foo</td><td>bar</td><td>baz</td></tr>
<tr><td>qux</td><td>quux</td><td>quux</td></tr>
</tbody></table>
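
The call that produced this output is missing from the post. One way to reproduce it, as a sketch only, is shown below for Python 3. The function is Leif's, restated unchanged so the example is self-contained; the in-memory `io.StringIO` objects and the sample data are assumptions chosen to match the output above.

```python
import csv
import io
from xml.sax.saxutils import escape


def tsv_to_html(input_file, output_file):
    output_file.write('<table><tbody>\n')
    for row in csv.reader(input_file, 'excel-tab'):
        output_file.write('<tr>')
        for col in row:
            output_file.write('<td>%s</td>' % escape(col))
        output_file.write('</tr>\n')
    output_file.write('</tbody></table>')


# In-memory stand-ins for real files; newline="" is what the csv docs
# recommend for files handed to csv.reader.
source = io.StringIO("foo\tbar\tbaz\nqux\tquux\tquux\n", newline="")
result = io.StringIO()
tsv_to_html(source, result)
print(result.getvalue())
```

With real files, open the input with `open(path, newline="")` and pass any writable text file as the output.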
 
Brian

First let me say that I appreciate the responses that everyone has
given.

A friend of mine is a Ruby programmer but knows nothing about Python.
He gave me the script below and it does exactly what I want, only it is
in Ruby. Not knowing Ruby, this is Greek to me, and I would like to
rewrite it in Python.

I ask then, is this essentially what others here have shown me to do,
or is it in a different vein altogether?

Code:

class TsvToHTML
  @@styleBlock = <<-ENDMARK
    <style type='text/css'>
    td {
      border-left:1px solid #000000;
      padding-right:4px;
      padding-left:4px;
      white-space: nowrap;
    }
    .cellTitle {
      border-bottom:1px solid #000000;
      background:#ffffe0;
      font-weight: bold;
      text-align: center;
    }
    .cell0 { background:#eff1f1; }
    .cell1 { background:#f8f8f8; }
    </style>
  ENDMARK

  def TsvToHTML::wrapTag(data,tag,modifier = "")
    return "<#{tag} #{modifier}>" + data + "</#{tag}>\n"
  end # wrapTag

  def TsvToHTML::makePage(source)
    page = ""
    rowNum = 0
    source.readlines.each { |record|
      row = ""
      record.chomp.split("\t").each { |field|
        # replace blank fields with &nbsp;
        field.sub!(/^$/,"&nbsp;")
        # wrap in TD tag, specify style
        row += wrapTag(field,"td","class=\"" +
          ((rowNum == 0)?"cellTitle":"cell#{rowNum % 2}") +
          "\"")
      }
      rowNum += 1
      # wrap in TR tag, add row to page
      page += wrapTag(row,"tr") + "\n"
    }
    # finish page formatting
    [ [ "table","cellpadding=0 cellspacing=0 border=0" ], "body","html"
    ].each { |tag|
      page = wrapTag(@@styleBlock,"head") + page if tag == "html"
      page = wrapTag(page,*tag)
    }
    return page
  end # makePage
end # class

# stdin -> convert -> stdout
print TsvToHTML.makePage(STDIN)
 
Paddy

Brian said:
> First let me say that I appreciate the responses that everyone has
> given.
>
> A friend of mine is a ruby programmer but knows nothing about python.
> He gave me the script below and it does exactly what I want, only it is
> in Ruby. Not knowing ruby this is greek to me, and I would like to
> re-write it in python.
>
> I ask then, is this essentially what others here have shown me to do,
> or is it in a different vein altogether?

Leif's Python example uses the csv module, which understands a lot more
about the peculiarities of the CSV/TSV formats.
The Ruby example prepends a <style>...</style> block.

The Ruby example splits the input into lines to form the table rows, and
splits each line on tabs to form the cells.

The thing about TSV/CSV formats is that there is no one format. You
need to check how your TSV creator generates the TSV file:
Does it put quotes around text fields?
What kind of quotes?
How does it represent null fields?
Might you get fields that include newlines?

- P.S. I'm not a Ruby programmer, just read the source ;-)
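
To make the difference concrete, here is a small Python 3 sketch that is not from the thread. The sample data is invented: it assumes a producer that wraps text fields in double quotes and allows tabs and newlines inside them, which is exactly the kind of thing the questions above are meant to uncover.

```python
import csv
import io

# Hypothetical export: the second record has a quoted field that
# contains both a newline and a tab.
data = 'id\tnote\n1\t"first line\nsecond\tline"\n2\tplain\n'

# Naive approach: one row per physical line, one cell per tab.
naive_rows = [line.split("\t") for line in data.splitlines()]

# csv module with the built-in "excel-tab" dialect.
csv_rows = list(csv.reader(io.StringIO(data, newline=""), dialect="excel-tab"))

print(len(naive_rows))   # the quoted field is broken across two rows
print(len(csv_rows))     # header plus two records
print(csv_rows[1])
```

If the producer never quotes fields and never emits embedded tabs or newlines, a plain split on tabs gives the same rows and is perfectly adequate.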
 
Dennis Lee Bieber

Brian said:
> Code:
>
> class TsvToHTML
> @@styleBlock = <<-ENDMARK
> ...
> print TsvToHTML.makePage(STDIN)

Given that no "instances" are created, there's no real need to use a
class (in Python, at least -- I don't know if Ruby is like Java, where
everything is embedded in a class). A simple module (file) is
sufficient.

I took a few liberties -- like splitting out the table generation
from the rest of the page, and adding argument parsing for input files
(so this version will create multiple tables if multiple files were
supplied). Be careful, one or two lines were wrapped by the news client.

-=-=-=-=-=-=-=-
# tsv2html.py
# function module

import sys

# define CSS style definition
STYLEBLOCK = """
<style type="text/css">
td {
    border-left:1px solid #000000;
    padding-right:4px;
    padding-left:4px;
    white-space: nowrap; }
.cellTitle {
    border-bottom:1px solid #000000;
    background:#ffffe0;
    font-weight: bold;
    text-align: center; }
.cell0 { background:#3ff1f1; }
.cell1 { background:#f8f8f8; }
</style>
"""

# utility function to wrap "data" within
# <tag modifier> data </tag>
def wrapTag(data, tag, modifier = ""):
    if type(tag) != type(""): #check for complex (tag, modifier) tuple
        tag, modifier = tag
    return "<%s %s>%s</%s>\n" % (tag, modifier, data, tag)

# utility function to produce an HTML table
# from tab-separated data read from
# iterable source material
def makeTable(source):
    tableParts = []
    rowNum = 0
    # get each line of source
    for record in source:
        rowParts = []
        # get each field of source, splitting on tabs; remove only the
        # line ending so that empty leading/trailing fields are kept
        for field in record.rstrip("\r\n").split("\t"):
            # convert empty fields to a non-breaking space
            if not field: field = "&nbsp;"
            if rowNum:
                # past the first row, alternate cell style
                tagged = wrapTag(field, "td",
                                 'class="cell%s"' % (rowNum % 2))
            else:
                # first row, use "title" style
                tagged = wrapTag(field, "td",    #I'd use "th"
                                 'class="cellTitle"')
            # collect the tagged field as a list of row parts
            rowParts.append(tagged)
        rowNum += 1
        # join the row parts, and wrap as a row, collecting rows in list
        tableParts.append(wrapTag("".join(rowParts), "tr"))
    # join the rows with a new-line separator
    return wrapTag("\n".join(tableParts),
                   ("table",
                    'align="center" cellpadding="0" cellspacing="0" border="0"'))

def makePage(data):
    # wrap the tables in the rest of the HTML tags: body, html
    for tag in ["body", "html"]:
        # if current tag is the <html>, insert a <head> block with
        # the CSS style definition
        if tag == "html":
            data = wrapTag(STYLEBLOCK, "head") + data
        data = wrapTag(data, tag)
    return data

if __name__ == "__main__":
    # if command line arguments supplied, treat as file names
    if len(sys.argv) > 1:
        fout = open("TSV2HTML.html", "w")
        tables = []
        # for each file supplied
        for fid in sys.argv[1:]:
            # open for read
            fin = open(fid, "r")
            # generate table from file data, collect it
            tables.append(makeTable(fin))
            fin.close()
        # wrap all tables in one page, write the output file
        fout.write(makePage("\n".join(tables)))
        fout.close()
    else:
        # no arguments, read stdin, write stdout
        sys.stdout.write(makePage(makeTable(sys.stdin)))    #could use print

NOTE: no HTML escaping is done, and my test data sometimes caused
problems.
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
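
Regarding the note about HTML escaping: a possible fix is to escape each field before wrapping it. The sketch below is an assumption about how that could be done, not part of Dennis's module. It is written for Python 3, uses the standard library's `html.escape`, and the helper name `render_cell` is hypothetical.

```python
from html import escape


def render_cell(field, css_class):
    """Return one <td>; the field is escaped, empty fields become &nbsp;."""
    # Escape real data only; the &nbsp; placeholder must stay an entity.
    text = escape(field) if field else "&nbsp;"
    return '<td class="%s">%s</td>' % (css_class, text)


print(render_cell("R&D <beta>", "cell0"))
print(render_cell("", "cell1"))
```

The order matters: substituting `&nbsp;` first and escaping afterwards would turn the placeholder into the literal text `&amp;nbsp;`.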
 
Brian

Dennis,

Thank you for that response. Your code was very helpful to me. I
think that actually seeing how it should be done in Python was a lot
more educational than spending hours with trial and error.

One question (and this is a topic that I still have trouble getting my
arms around). Why is the text in STYLEBLOCK triple quoted?

Thanks again,
Brian
 
Scott David Daniels

Brian said:
> One question (and this is a topic that I still have trouble getting my
> arms around). Why is the text in STYLEBLOCK triple quoted?

Because triple-quoted strings can span lines and include single quotes
and double quotes.
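
A short illustration of both properties, as a Python 3 sketch with a made-up CSS snippet (the variable name `snippet` is arbitrary):

```python
# One string literal that spans three lines and contains both kinds of
# quote characters without any escaping.
snippet = """<style type='text/css'>
td { font-family: "Courier New"; }
</style>"""

lines = snippet.split("\n")
print(len(lines))
print("'" in snippet and '"' in snippet)
```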
 
Dennis Lee Bieber

> Thank you for that response. Your code was very helpful to me. I
> think that actually seeing how it should be done in Python was a lot
> more educational than spending hours with trial and error.

It's not the best code around -- I hacked it together pretty much
line-for-line from an assumption of what the Ruby was doing (I don't do
Ruby -- too much Perl idiom in it).

> One question (and this is a topic that I still have trouble getting my
> arms around). Why is the text in STYLEBLOCK triple quoted?

Triple quotes allow: 1) use of single and double quote characters within
the block without needing to escape them; 2) the string to span multiple
lines. A plainly quoted string must fit on one logical line for the parser.

I've practically never seen anyone use a line continuation character
in Python. And triple quoting looks cleaner than parser concatenation.

The alternatives would have been:

Line Continuation:
STYLEBLOCK = '\n\
<style type="text/css">\n\
td {\n\
border-left:1px solid #000000;\n\
padding-right:4px;\n\
padding-left:4px;\n\
white-space: nowrap; }\n\
.cellTitle {\n\
border-bottom:1px solid #000000;\n\
background:#ffffe0;\n\
font-weight: bold;\n\
text-align: center; }\n\
.cell0 { background:#3ff1f1; }\n\
.cell1 { background:#f8f8f8; }\n\
</style>\n\
'
Note the \n\ as the end of each line; the \n is to keep the
formatting on the generated HTML (otherwise everything would be one long
line) and the final \ (which must be the physical end of line)
signifying "this line is continued". Also note that I used ' rather than
" to avoid escaping the " on text/css.

Parser Concatenation:
STYLEBLOCK = (
    '<style type="text/css">\n'
    "td {\n"
    " border-left:1px solid #000000;\n"
    " padding-right:4px;\n"
    " padding-left:4px;\n"
    " white-space: nowrap; }\n"
    ".cellTitle {\n"
    " border-bottom:1px solid #000000;\n"
    " background:#ffffe0;\n"
    " font-weight: bold;\n"
    " text-align: center; }\n"
    ".cell0 { background:#3ff1f1; }\n"
    ".cell1 { background:#f8f8f8; }\n"
    "</style>\n"
)

Note the use of ( ) where the original had """ """. Also note that
each line has quotes at start/end (the first has ' to avoid escaping
text/css). There are no commas separating each line (and the \n is still
for formatting). Using the ( ) creates an expression, and Python is nice
enough to let one split expressions inside () or [lists], {dicts}, over
multiple lines (I used that feature in a few spots to put call arguments
on multiple lines). Two strings that are next to each other

"string1" "string2"

are parsed as one string

"string1string2"

Using """ (or ''') is the cleanest of those choices, especially if
you want to do preformatted layout of the text. It works similarly to the
Ruby/Perl "here document" construct (the <<-ENDMARK in the Ruby script),
which basically says: copy all text up to the next occurrence of
MARKER_STRING.
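
As a quick check that the three spellings really build the same string, here is a shortened Python 3 sketch. It is not from the original post: the CSS is cut down to one rule, and the concatenated form is given a leading '\n' so that it matches the other two exactly.

```python
# 1) Triple-quoted: the line breaks in the source become "\n" in the string.
triple = """
<style type="text/css">
td { padding-left:4px; }
</style>
"""

# 2) Line continuation: each backslash-newline pair is dropped by the
#    parser, so every line break has to be written out as \n.
continued = '\n\
<style type="text/css">\n\
td { padding-left:4px; }\n\
</style>\n\
'

# 3) Adjacent string literals inside parentheses are joined at compile time.
concatenated = (
    '\n'
    '<style type="text/css">\n'
    "td { padding-left:4px; }\n"
    "</style>\n"
)

print(triple == continued == concatenated)
```

Note that in the line-continuation form any indentation at the start of a continued line would become part of the string, which is one more reason it is rarely used.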



 
Brian

Thank you for your explanation, now it makes sense.

Brian
 
