HTMLParser not parsing whole html file

josh logan · Oct 24, 2010

Hello,

I wanted to use Python to scrape an HTML file for score data, but I'm
having trouble.
I'm using HTMLParser, and the parsing seems to fizzle out around line
192 or so. None of the event handlers (handle_starttag, handle_endtag,
etc.) are called after that point, and I don't understand why,
because it is an HTML page of over 1000 lines.

Could someone tell me if this is a bug or simply a misunderstanding of
how HTMLParser works? I'd really appreciate some help in
understanding.

I am using Python 3.1.2 on Windows 7 (hopefully shouldn't matter).

I put the HTML file on pastebin, because I couldn't think of anywhere
better to put it:
http://pastebin.com/wu6Pky2W

The source code has been pared down to the simplest form to exhibit
the problem. It is displayed below, and is also on pastebin for
download (http://pastebin.com/HxwRTqrr):

import sys
import re
import os.path
import itertools as it
import urllib.request
from html.parser import HTMLParser
import operator as op

base_url = 'http://www.dci.org'

class TestParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print('position {}, starting tag {} with attrs {}'.format(
            self.getpos(), tag, attrs))

    def handle_endtag(self, tag):
        print('ending tag {}'.format(tag))

def do_parsing_from_file_stream(fname):
    parser = TestParser()

    with open(fname) as f:
        for num, line in enumerate(f, start=1):
            # print('Sending line {} through parser'.format(num))
            parser.feed(line)

if __name__ == '__main__':
    do_parsing_from_file_stream(sys.argv[1])
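
One way to narrow down where the handlers stop firing is to record the position of the last start tag the parser reported. The following is only a minimal sketch, assuming the document fits in memory; `PositionTracker`, `last_tag_position`, and the sample markup are hypothetical names and data for illustration, not part of `html.parser` or of the original script.

```python
from html.parser import HTMLParser


class PositionTracker(HTMLParser):
    """Remember how many start tags were seen and where the last one began."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.last_pos = None

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line number, column offset) of the current tag.
        self.count += 1
        self.last_pos = self.getpos()


def last_tag_position(text):
    """Parse text and return (number of start tags, position of the last one)."""
    parser = PositionTracker()
    parser.feed(text)
    parser.close()  # ask the parser to process anything still buffered
    return parser.count, parser.last_pos


# Hypothetical stand-in for the real page.
sample = "<html>\n<body>\n<p>score</p>\n</body>\n</html>\n"
count, pos = last_tag_position(sample)
print("start tags seen:", count, "last one at (line, column):", pos)
```

If the reported line is far short of the end of the file, the markup just after that position is a good place to look for a problem.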

josh logan · Oct 24, 2010

Sorry, the group doesn't like how I surrounded the Python code's
pastebin URL with parentheses:

http://pastebin.com/HxwRTqrr

josh logan · Oct 24, 2010

I found the error. The HTML file I'm parsing has invalid HTML at line
193.
It has something like:

<a href="mystuff "class = "stuff">

Note there is no space between the closing quote of the "href"
attribute value and the class attribute. I guess I'll go through each
file and correct these issues as I parse them.

Thanks for reading, anyways.
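
To check how a given Python version copes with this kind of tag, the malformed markup can be fed to the parser on its own. This is a rough sketch rather than a diagnosis of the 3.1.2 behaviour: `TagCollector`, `collect_start_tags`, and the surrounding markup are made up for illustration. As far as I know, later Python 3 releases made `html.parser` lenient about malformed markup, so there both tags should be reported, while 3.1.2 may behave differently.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the names of all start tags the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_start_tags(html_text):
    """Return the list of start tag names found in html_text."""
    collector = TagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.tags


# The tag from the post (no space before "class"), followed by ordinary markup.
snippet = '<a href="mystuff "class = "stuff">link</a><p>after</p>'
tags = collect_start_tags(snippet)
print(tags)
```

If the tag after the malformed one is missing from the printed list, the parser has stalled on the bad attribute.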

Stefan Behnel · Oct 25, 2010

josh logan, 25.10.2010 04:14:

I found the error. The HTML file I'm parsing has invalid HTML at line
193. It has something like:

<a href="mystuff "class = "stuff">

Note there is no space between the closing quote for the "href" tag
and the class attribute. I guess I'll go through each file and correct
these issues as I parse them.

HTMLParser is not made to deal with non-HTML input. You can take a look at
lxml.html or BeautifulSoup (up to 3.0), which handle these problems a lot
better.

Stefan

John Nagle · Oct 26, 2010

josh logan, 25.10.2010 04:14:

HTMLparser is not made to deal with non-HTML input. You can take a look
at lxml.html or BeautifulSoup (up to 3.0), which handle these problems a
lot better.

Stefan

You might try HTML5lib:

http://code.google.com/p/html5lib/

The HTML 5 spec formalizes the concept of "bad HTML". Really. There's
a specified way to parse the most common HTML errors. Most browsers
are far more tolerant of bad HTML than they should be, and not in a
consistent way. The HTML 5 spec tries to fix that.

I use BeautifulSoup, but it's being abandoned for the Python 3
transition.
"http://www.crummy.com/software/BeautifulSoup/3.1-problems.html"

John Nagle
