HTMLParser problems.

Sean Cody · Oct 30, 2003

I'm trying to take a webpage that has an n×n table of entries (bus times) and
convert it to a 2D array (list of lists). Initially this was simple, but I
need to be able to access whole 'columns' of data, so the 2D array cannot be
sparse. In the HTML file I'm parsing, however, there can be sparse entries, which
are represented in the table as &nbsp; entities. The sparse output breaks my
ability to use entire columns and have entries correspond properly.

Is there a simple way to tell the parser, whenever it sees a &nbsp; in table
data, to return say... "-1" or "NaN"?
The HTMLParser documentation is a bit... terse. I was considering using
the handle_entityref() method, but I would assume the data has already been
parsed at that point.

I could try:
def handle_entityref(self,entity):
    if self.in_td == 1:
        if entity == "nbsp":
            self.row.append(-1)

But that seems ugly... (comments?).

As an example here is some code I'm using and partial output:

#!/usr/local/bin/python
import htmllib,os,string,urllib
from HTMLParser import HTMLParser

class foo(HTMLParser):
    def __init__(self):
        self.in_td = 0
        self.in_tr = 0
        self.matrix = []
        self.row = []
        self.reset()

    def handle_starttag(self,tag,attrs):
        if tag == "td":
            self.in_td = 1
        elif tag == "tr":
            self.in_tr = 1

    def handle_data(self,data):
        if self.in_td == 1:
            data = string.lstrip(data)
            if data != "":
                self.row.append(data)

    def handle_endtag(self,tag):
        if tag == "td":
            self.in_td = 0
        elif tag == "tr":
            self.in_tr = 0
            if self.row != []:
                self.matrix.append(self.row)
                self.row=[]

parser = foo()
socket = urllib.urlopen("http://winnipegtransit.com/TIMETABLE/TODAY/STOPS/105413bottom.html")
parser.feed(socket.read())
socket.close()
parser.close()
for row in parser.matrix:
    print row

A partial output of the above code is:
['5:12 C', '5:52 W']
['5:34 C']
['5:50 P']
['6:01 P', '6:10 G', '6:09 S', '6:59 U']
['6:10 P', '6:26 G', '6:23 C']
['6:23 P', '6:42 G', '6:35 W']
['6:34 P', '6:54 G', '6:47 S']
['6:46 P', '6:59 C']

Any tips or suggestions or comments would be greatly appreciated,

Terry Reedy · Oct 31, 2003

Sean Cody said:
I could try:
def handle_entityref(self,entity):
    if self.in_td == 1:
        if entity == "nbsp":
            self.row.append(-1)

But that seems ugly... (comments?).

Does this work? For me, that comes first.

tjr
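
For readers on Python 3, here is a minimal sketch of the handle_entityref idea quoted above. It assumes the modern html.parser module and passes convert_charrefs=False, because with the default setting entities in text are converted before handle_entityref would ever be called. The class name EntityRefTable and the SAMPLE markup are made up for illustration; they are not taken from the real timetable page.

```python
from html.parser import HTMLParser


class EntityRefTable(HTMLParser):
    """Collect <td> text row by row; record -1 for each &nbsp; cell."""

    def __init__(self):
        # convert_charrefs=False keeps entity references as separate events.
        super().__init__(convert_charrefs=False)
        self.in_td = False
        self.matrix = []
        self.row = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
        elif tag == "tr":
            if self.row:
                self.matrix.append(self.row)
            self.row = []

    def handle_data(self, data):
        if self.in_td:
            data = data.strip()
            if data:
                self.row.append(data)

    def handle_entityref(self, name):
        # Called with the entity name, without the leading '&' or trailing ';'.
        if self.in_td and name == "nbsp":
            self.row.append(-1)


# Hypothetical sample, loosely modelled on the output shown in the thread.
SAMPLE = (
    "<table>"
    "<tr><td>5:12 C</td><td>&nbsp;</td><td>5:52 W</td></tr>"
    "<tr><td>&nbsp;</td><td>5:34 C</td><td>&nbsp;</td></tr>"
    "</table>"
)

entity_parser = EntityRefTable()
entity_parser.feed(SAMPLE)
entity_parser.close()
for row in entity_parser.matrix:
    print(row)
```

Every row now has one entry per cell, so columns line up. The sketch only works if an empty cell contains nothing but the entity; a cell with both text and &nbsp; would get two entries.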

Sean Cody · Oct 31, 2003

I could try:

Does this work? For me, that comes first.

Actually yes it does.

I wonder if there is a better way, as I'm just stumbling through the
HTMLParser class.
The best thing about Python is that stumbling through getting things done is
not as painful as it would be in other languages.

I use a lot of member variables. Is there a way to not have to reference
members by self.member? Back in the day in Pascal you could do stuff like
"with self begin do_stuff(member_variable); end;" which was extremely useful
for large 'records.'

Peter Otten · Oct 31, 2003

Sean said:
I'm trying to take a webpage that has an n×n table of entries (bus times) and
convert it to a 2D array (list of lists). Initially this was simple, but I
need to be able to access whole 'columns' of data, so the 2D array cannot be
sparse. In the HTML file I'm parsing, however, there can be sparse entries, which
are represented in the table as &nbsp; entities. The sparse output breaks my
ability to use entire columns and have entries correspond properly.

Is there a simple way to tell the parser, whenever it sees a &nbsp; in table
data, to return say... "-1" or "NaN"?
The HTMLParser documentation is a bit... terse. I was considering using
the handle_entityref() method, but I would assume the data has already been
parsed at that point.

I could try:
def handle_entityref(self,entity):
    if self.in_td == 1:
        if entity == "nbsp":
            self.row.append(-1)

But that seems ugly... (comments?).

As an example here is some code I'm using and partial output:
[...]

parser.feed(socket.read())

The simplest solution is to replace the above line with

parser.feed(socket.read().replace("&nbsp;", "NaN"))

Below is an only slightly more robust solution. It implements a rudimentary
"what table are we in?" check and can handle table cells with multiple data
chunks.

import htmllib,os,string,urllib
from HTMLParser import HTMLParser

class foo(HTMLParser):
    def __init__(self):
        self.matrix = []
        self.row = None
        self.cell = None
        self.in_table = 0
        self.empty = "NaN"
        self.reset()

    def handle_starttag(self,tag,attrs):
        if tag == "table":
            self.in_table += 1
        elif self.in_table == 2:
            if tag == "td":
                assert self.cell is None
                self.cell = []
            elif tag == "tr":
                self.row = []
                self.matrix.append(self.row)

    def handle_data(self,data):
        if self.in_table == 2:
            if self.cell is not None:
                data = string.strip(data)
                if data or True:
                    self.cell.append(data)

    def handle_endtag(self,tag):
        if tag == "table":
            self.in_table -= 1
        elif self.in_table == 2:
            if tag == "td":
                s = " ".join(self.cell).replace("\n", " ")
                if s == "":
                    s = self.empty
                self.row.append(s)
                self.cell = None
            elif tag == "tr":
                self.row = None

parser = foo()
if 0:
    instream = urllib.urlopen(
        "http://winnipegtransit.com/TIMETABLE/TODAY/STOPS/105413bottom.html")
else:
    instream = file("105413bottom.html")
data = instream.read()
parser.feed(data)
instream.close()
parser.close()
for row in parser.matrix:
    assert len(row) == 4
    print row

I've replaced the urlopen() call with access to a local file as you might
want to run your tests with a local copy of the time table, too.

Peter
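
A rough Python 3 sketch of the per-cell idea above, for anyone reading this later. It assumes html.parser with its default convert_charrefs=True, under which &nbsp; reaches handle_data as the character U+00A0 and is discarded by whitespace splitting. The names CellTable and SAMPLE are hypothetical, and the nested-table check from the post is left out to keep it short.

```python
from html.parser import HTMLParser


class CellTable(HTMLParser):
    """Build one list entry per <td>, using a placeholder for empty cells."""

    def __init__(self, empty="NaN"):
        super().__init__()  # convert_charrefs defaults to True
        self.empty = empty
        self.matrix = []
        self.row = None
        self.cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
            self.matrix.append(self.row)
        elif tag == "td" and self.row is not None:
            self.cell = []

    def handle_data(self, data):
        # A cell may arrive in several chunks; collect them all.
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.cell is not None:
            # split() with no argument also treats U+00A0 as whitespace.
            text = " ".join("".join(self.cell).split())
            self.row.append(text or self.empty)
            self.cell = None
        elif tag == "tr":
            self.row = None


# Hypothetical sample markup, not the real timetable page.
SAMPLE = (
    "<table>"
    "<tr><td>5:12 C</td><td>&nbsp;</td><td>5:52 W</td></tr>"
    "<tr><td>&nbsp;</td><td>5:34 C</td><td>&nbsp;</td></tr>"
    "</table>"
)

cell_parser = CellTable()
cell_parser.feed(SAMPLE)
cell_parser.close()

# With no missing entries, whole columns fall out of a transpose.
columns = list(zip(*cell_parser.matrix))
print(columns[0])
```

Because each row has the same length, zip(*matrix) gives the column access the original question asked for.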

John J. Lee · Oct 31, 2003

Sean Cody said:
Actually yes it does.

I wonder if there is a better way as I'm just stumbling through the
HTMLParser class.

[...]

Seems OK to me.

I use a lot of member variables. Is there a way to not have to reference
members by self.member. Back in the day in pascal you could do stuff like
"with self begin do_stuff(member_variable); end;" which was extremely useful
for large 'records.'

Well, obviously, there's:

mv = self.member_variable
do_stuff(mv)

or if you have lots of names that are annoying you, things like:

for name in "foo", "bar", "baz":
    do_stuff(getattr(self, name))

can help.

John

John J. Lee · Oct 31, 2003

Peter Otten said:
Sean Cody wrote: [...]
The simplest solution is to replace the above line with

parser.feed(socket.read().replace("&nbsp;", "NaN"))

[...]

That's platform-dependent, if you're relying on float("NaN").

John

Paul Clinch · Oct 31, 2003

You could patch it with:

def handle_starttag(self,tag,attrs):
    if tag == "td":
        self.in_td = 1
        self.row.append("")
    elif tag == "tr":
        self.in_tr = 1

def handle_data(self,data):
    if self.in_td == 1:
        data = string.lstrip(data)
        if data != "":
            self.row[-1]=data

i.e. create the element and then later possibly replace it.

BTW, True can be used as 1, an empty string is false, strings have
methods, and "if self.in_td:" is OK, so:

def handle_data(self,data):
    if self.in_td:
        data = data.lstrip()
        if data:
            self.row[-1]=data

is equivalent.

Regards, Paul Clinch
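
The patch above can be sketched as a complete, runnable class. This is only an illustration under some assumptions: it targets Python 3's html.parser (default convert_charrefs=True, so &nbsp; arrives as U+00A0 and is removed by strip()), and the names PlaceholderTable, placeholder and SAMPLE are invented here.

```python
from html.parser import HTMLParser


class PlaceholderTable(HTMLParser):
    """Append a placeholder on <td>, then overwrite it if real text shows up."""

    def __init__(self, placeholder=""):
        super().__init__()
        self.placeholder = placeholder
        self.in_td = False
        self.matrix = []
        self.row = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.row.append(self.placeholder)  # create the element up front

    def handle_data(self, data):
        if self.in_td:
            data = data.strip()
            if data:
                self.row[-1] = data  # replace the placeholder

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
        elif tag == "tr":
            if self.row:
                self.matrix.append(self.row)
            self.row = []


# Hypothetical sample markup, not the real timetable page.
SAMPLE = (
    "<table>"
    "<tr><td>5:12 C</td><td>&nbsp;</td><td>5:52 W</td></tr>"
    "<tr><td>&nbsp;</td><td>5:34 C</td><td>&nbsp;</td></tr>"
    "</table>"
)

placeholder_parser = PlaceholderTable()
placeholder_parser.feed(SAMPLE)
placeholder_parser.close()
for row in placeholder_parser.matrix:
    print(row)
```

Since the entry exists as soon as the cell opens, the row length no longer depends on whether any data event fires for that cell.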

Terry Reedy · Nov 1, 2003

1. call the parameter s instead of self; then it is s.member.
But best not to post code with that, lest you upset some readers ;-).

There have been proposals something like that, but they do not seem to
fit Python too well.

2.

Well, obviously, there's:

mv = self.member_variable
do_stuff(mv)

In case you think this a hack, it is not. Copying things into the
local variable space (from builtins, globals, attributes) is a fairly
common idiom. When a value is used repeatedly (like in a loop), the
copying is paid for by faster repeated access.

Terry J. Reedy

Peter Otten · Nov 1, 2003

John said:
[...]

The simplest solution is to replace the above line with

parser.feed(socket.read().replace("&nbsp;", "NaN"))

[...]

That's platform-dependent, if you're relying on float("NaN").

Actually, I'm not, any non-empty string would have done as well, given the
original poster's parser implementation.

Peter

Peter Otten · Nov 1, 2003

Peter said:
Actually, I'm not, any non-empty string would have done as well, given the
original poster's parser implementation.

Nitpicking myself: any string containing at least one non-white character.

Peter

Alex Martelli · Nov 1, 2003

Terry said:
There have been proposals something like that, but they do not seem to
fit Python too well.

No, but, for the record: just last week in python-dev Guido rejected
a syntax proposal using a leading dot to strop a variablename in some
circumstances (writing '.var' rather than 'var' in those cases) for
the stated reason that, and I quote:
"I want to reserve .var for the "with" statement (a la VB)."

So, something like "with self: dostuff(.member_variable)" MIGHT be
in Python's future (the leading dot, like in VB, at least does make
things more explicit than leaving it implied like Pascal does).

In case you think this a hack, it is not. Copying things into the
local variable space (from builtins, globals, attributes) is a fairly
common idiom. When a value is used repeatedly (like in a loop), the
copying is paid for by faster repeated access.

Sure, good point. It _is_ a hack by some definitions of the word,
but that's not necessarily a bad thing. A quibble: the optimization
may be worth it when the NAME is used repeatedly -- repeated uses
of the VALUE, e.g. within the body of do_stuff, accessing the value
through another name [e.g. the parametername for do_stuff] do not
count, because what you're optimizing is specifically name lookup.

E.g., one silly example:

[alex@lancelot bo]$ timeit.py -c -s'x=range(999)' 'for i in range(999): x=id(x)'
1000 loops, best of 3: 590 usec per loop

[alex@lancelot bo]$ timeit.py -c -s'x=range(999)' -s'lid=id' 'for i in range(999): x=lid(x)'
1000 loops, best of 3: 490 usec per loop

the repeated lookups of builtin name 'id' in the first case accounted
for almost 17% of the CPU time, so the simple optimization leading to
the second case may be worth it if this code is in a bottleneck. In a
way, this is a special case of the general principle that Python does
NOT get into the thorny business of hoisting constant subexpressions
(requiring it to prove that something IS constant...), so, when you're
looking at a major bottleneck, you have to consider doing such hoisting
yourself manually. Name lookup for anything but local bare names IS
"a subexpression", so if you KNOW it's a constant subex. in some case
where every cycle matters, you can hoist it.

(Or, you can use psyco, which among many other things can do the
hoisting on your behalf...:

[alex@lancelot bo]$ timeit.py -c -s'x=range(999)' -s'import psyco; psyco.full()' 'for i in range(999): x=id(x)'
10000 loops, best of 3: 43 usec per loop

[alex@lancelot bo]$ timeit.py -c -s'x=range(999); lid=id' -s'import psyco; psyco.full()' 'for i in range(999): x=lid(x)'
10000 loops, best of 3: 43 usec per loop

as you can see, with psyco, this manual hoisting gives no further
benefit -- so you can use whatever construct you find clearer and
not worry about performance effects, just enjoying the
order-of-magnitude speedup that psyco achieves in this case either way.)

Alex
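
The manual hoisting described above can also be tried from a script with the timeit module instead of the command line. This is just a sketch: the function names builtin_lookup and local_alias are made up, and the timings will vary with interpreter version and machine, so no particular ratio should be expected.

```python
import timeit


def builtin_lookup(n=999):
    """Look up the builtin name 'id' on every iteration."""
    x = 0
    for _ in range(n):
        x = id(x)
    return x


def local_alias(n=999):
    """Hoist the builtin into a local name once, then reuse it."""
    lid = id
    x = 0
    for _ in range(n):
        x = lid(x)
    return x


LOOPS = 200
# Take the best of three runs, as the timeit.py output in the post does.
t_builtin = min(timeit.repeat(builtin_lookup, number=LOOPS, repeat=3))
t_alias = min(timeit.repeat(local_alias, number=LOOPS, repeat=3))

print("builtin lookup: %.1f usec per loop" % (t_builtin / LOOPS * 1e6))
print("local alias:    %.1f usec per loop" % (t_alias / LOOPS * 1e6))
```

Only the name lookup is being measured here; both functions do the same work otherwise.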

John J. Lee · Nov 1, 2003

Peter Otten said:
John said:

[...]

The simplest solution is to replace the above line with

parser.feed(socket.read().replace("&nbsp;", "NaN"))

[...]

That's platform-dependent, if you're relying on float("NaN").

Actually, I'm not, any non-empty string would have done as well, given the
original poster's parser implementation.

I was referring to the fact that (despite what he said), his code did
.append(-1), not .append("-1"). But only the OP knows what he really
meant to do.

John
