How to *extract* data from XHTML Transitional web pages? Got xml.dom.minidom troubles

seberino

I'm trying to extract some data from an XHTML Transitional web page.

What is the best way to do this?

xml.dom.minidom.parseString("text of web page") gives errors about it
not being well-formed XML.

Do I just need to add something like <?xml ...?> or what?

Chris
 
Paul Boddie

> I'm trying to extract some data from an XHTML Transitional web page.
>
> What is the best way to do this?

An XML parser should be sufficient. However...

> xml.dom.minidom.parseString("text of web page") gives errors about it
> not being well-formed XML.
>
> Do I just need to add something like <?xml ...?> or what?

If the page isn't well-formed then it isn't proper XHTML since the
XHTML specification [1] says...

4.1. Documents must be well-formed

Yes, it's a heading, albeit in an "informative" section describing how
XHTML differs from HTML 4. See "3.2. User Agent Conformance" for a
"normative" mention of well-formedness.
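
To make the well-formedness point concrete, here is a minimal sketch using only the standard library. The two sample strings and the helper name try_parse are made up for illustration; the only difference between the strings is an unclosed <br> tag, which is acceptable in HTML but not in XML.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample documents: identical except for the <br> tag.
WELL_FORMED = "<html><body><p>Hello<br/>world</p></body></html>"
NOT_WELL_FORMED = "<html><body><p>Hello<br>world</p></body></html>"


def try_parse(text):
    """Return a minidom Document, or None if text is not well-formed XML."""
    try:
        return minidom.parseString(text)
    except ExpatError as err:
        # minidom delegates to expat, which reports the first violation.
        print("not well-formed:", err)
        return None


good = try_parse(WELL_FORMED)
bad = try_parse(NOT_WELL_FORMED)
```

If the real page fails like the second string, adding an <?xml ...?> declaration will not help; the markup itself has to be repaired or parsed with an HTML-tolerant parser.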

You could try libxml2dom (or other libxml2-based solutions) for some
fairly effective HTML parsing:

libxml2dom.parseString("text of document here", html=1)

See http://www.python.org/pypi/libxml2dom for more details.

Paul

[1] http://www.w3.org/TR/xhtml1/
 
Thomas Dybdahl Ahle

On Fri, 02 Mar 2007 15:32:58 -0800, (e-mail address removed) wrote:
> I'm trying to extract some data from an XHTML Transitional web page.
> xml.dom.minidom.parseString("text of web page") gives errors about it
> not being well-formed XML.
> Do I just need to add something like <?xml ...?> or what?

As many HTML Transitional pages are very badly formed, you can't really
build a DOM from them.

I've written several grabbers that fetch TV data from HTML pages and
parse it into XML.

Basically, there are three ways to get the info:

# Use find(): If you are only searching for a few pieces of data, you
might be able to find some HTML code that always appears before the data
you need.

# Use regular expressions: This can very quickly get all the data from a
table or similar into a nice list. The only problem is that regular
expressions have a somewhat steep learning curve.

# Use a SAX parser: This will iterate through all HTML items, not
caring whether they validate or not. You define a method to be called
each time it finds a tag, a piece of text, etc.
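
The first two approaches can be sketched roughly as follows. The sample markup, the helper extract_after and the pattern ROW are invented for illustration; a real page would need its own marker strings and pattern.

```python
import re

# Hypothetical fragment of a TV listing page.
PAGE = (
    '<table><tr><td class="title">News</td><td class="time">18:30</td></tr>'
    '<tr><td class="title">Film</td><td class="time">20:00</td></tr></table>'
)


def extract_after(text, marker, terminator="<"):
    """Return the text between the first marker and the next terminator."""
    start = text.find(marker)
    if start == -1:
        return None
    start += len(marker)
    end = text.find(terminator, start)
    if end == -1:
        return None
    return text[start:end]


# find(): good enough when only one or two values are needed.
first_title = extract_after(PAGE, '<td class="title">')

# Regular expression: collects every (title, time) pair in one pass.
ROW = re.compile(r'<td class="title">(.*?)</td><td class="time">(.*?)</td>')
programmes = ROW.findall(PAGE)
```

Both variants depend on the exact markup around the data, so they tend to break when the page layout changes.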
> What is the best way to do this?

In the beginning I mostly went the SAX way, but it really generates a lot
of code, which is not necessarily more readable than the regular
expressions.
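
For comparison, here is a sketch of the callback style. Note the swap: Python's xml.sax expects well-formed XML, so this example uses html.parser.HTMLParser from the standard library instead, which is event-driven in the same way and tolerates sloppy markup. The class name CellCollector and the sample input are made up.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, even in sloppy markup."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # A new <td> starts a new cell, whether or not the last one was closed.
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text outside of a cell is ignored.
        if self.in_cell:
            self.cells[-1] += data


collector = CellCollector()
# The first <td> is never closed and the <p> is left open.
collector.feed("<table><tr><td>News<td>18:30</td></tr><p>unclosed</table>")
collector.close()
cells = collector.cells
```

Even this small case needs a class with three methods and some state, which illustrates the point about the amount of code.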
 
James Graham

> I'm trying to extract some data from an XHTML Transitional web page.
>
> What is the best way to do this?

May I suggest html5lib [1]? It's based on the parsing section of the
WHATWG "HTML5" spec [2] which is in turn based on the behavior of major
web browsers so it should parse more or less* any invalid markup you
throw at it. Despite the name "html5lib" it works with any (X)HTML
document. By default, you have the option of producing a minidom tree,
an ElementTree, or a "simpletree" - a lightweight DOM-like
html5lib-specific tree.

If you are happy to pull from SVN I recommend that version; it has a few
bug fixes over the 0.2 release as well as improved features including
better error reporting and detection of encoding from <meta> elements
(the next release is imminent).

[1] http://code.google.com/p/html5lib/
[2] http://whatwg.org/specs/web-apps/current-work/#parsing

* There might be a problem if e.g. the document uses a character
encoding that python does not support, otherwise it should parse anything.
 
Bruno Desthuilliers

(e-mail address removed) wrote:
> I'm trying to extract some data from an XHTML Transitional web page.
>
> What is the best way to do this?
>
> xml.dom.minidom.

As a side note, cElementTree is probably a better choice. Or even a
simple SAX parser.

> parseString("text of web page") gives errors about it
> not being well-formed XML.

If it's not well-formed XML, most - if not all - XML parsers will choke
on it.
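
A small sketch of the ElementTree route, assuming the page really is well-formed XHTML. It uses xml.etree.ElementTree, the standard-library module that picks up the C implementation automatically on current Python versions. The sample page and the function extract_links are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical well-formed XHTML page.
PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<a href="/one">One</a><a href="/two">Two</a>'
    "</body></html>"
)


def extract_links(text):
    """Return (href, text) pairs for every XHTML <a> element."""
    root = ET.fromstring(text)
    # Element names carry the XHTML namespace in {uri}name form.
    return [(a.get("href"), a.text) for a in root.iter("{%s}a" % XHTML_NS)]


links = extract_links(PAGE)

# Dropping a closing tag makes the same parser reject the document.
try:
    extract_links(PAGE.replace("</body>", ""))
    choked = False
except ET.ParseError:
    choked = True
```

The namespace prefix on element names is easy to forget; without it, iter("a") finds nothing in an XHTML document.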
> Do I just need to add something like <?xml ...?> or what?

How could we tell without looking at the XML?

But anyway, even if the XHTML is crappy, BeautifulSoup may do the job...

HTH
 
