How to extract data from XHTML Transitional web pages? Got xml.dom.minidom troubles.

seberino · Mar 2, 2007

I'm trying to extract some data from an XHTML Transitional web page.

What is the best way to do this?

xml.dom.minidom.parseString("text of web page") gives errors about it
not being well formed XML.

Do I just need to add something like <?xml ...?> or what?

Chris

Paul Boddie · Mar 2, 2007

I'm trying to extract some data from an XHTML Transitional web page.

What is the best way to do this?

An XML parser should be sufficient. However...

xml.dom.minidom.parseString("text of web page") gives errors about it
not being well formed XML.

Do I just need to add something like <?xml ...?> or what?

If the page isn't well-formed then it isn't proper XHTML since the
XHTML specification [1] says...

4.1. Documents must be well-formed

Yes, it's a heading, albeit in an "informative" section describing how
XHTML differs from HTML 4. See "3.2. User Agent Conformance" for a
"normative" mention of well-formedness.

You could try libxml2dom (or other libxml2-based solutions) for some
fairly effective HTML parsing:

libxml2dom.parseString("text of document here", html=1)

See http://www.python.org/pypi/libxml2dom for more details.

Paul

[1] http://www.w3.org/TR/xhtml1/
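
To see the well-formedness failure discussed above, here is a minimal sketch using only the standard library. The helper name `try_parse` and the two sample strings are made up for illustration; the only difference between the samples is an unclosed `<br>` and `<p>`, which is enough for minidom (via expat) to reject the input.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def try_parse(markup):
    """Return a minidom Document, or None if the markup is not well-formed XML.

    Hypothetical helper, not part of any library.
    """
    try:
        return minidom.parseString(markup)
    except ExpatError as exc:
        # expat reports the line and column where parsing broke down
        print("not well-formed:", exc)
        return None


# Well-formed: every element is closed, including the empty <br/>.
well_formed = "<html><body><p>Hello<br/></p></body></html>"

# Typical tag soup: <br> and <p> are never closed.
tag_soup = "<html><body><p>Hello<br></body></html>"

doc = try_parse(well_formed)
broken = try_parse(tag_soup)
```

If the real page fails like the second sample, adding an `<?xml ...?>` declaration will not help; the markup itself has to be repaired or handed to a more forgiving parser.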

Thomas Dybdahl Ahle · Mar 3, 2007

On Fri, 02 Mar 2007 15:32:58 -0800, (e-mail address removed) wrote:

I'm trying to extract some data from an XHTML Transitional web page.
xml.dom.minidom.parseString("text of web page") gives errors about it
not being well formed XML.
Do I just need to add something like <?xml ...?> or what?

Since many HTML Transitional pages are very badly formed, you can't really
build a DOM from them.

I've written several grabbers that pull TV data from HTML pages and
convert it to XML.

Basically, there are three ways to get the info:

# Use find(): If you are only searching for a few pieces of data, you
might be able to find some HTML that always appears just before the data
you need.

# Use regular expressions: This can very quickly get all the data from a
table or similar into a nice list. The only problem is that regular
expressions have a somewhat steep learning curve.

# Use a SAX parser: This will iterate through all HTML items, not
caring whether they validate or not. You define a method to be called
each time it finds a tag, a piece of text, etc.

What is the best way to do this?

In the beginning I mostly went the SAX way, but it really generates a lot
of code, which is not necessarily more readable than the regular
expressions.
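
The three approaches listed above could look roughly like this. It is a sketch under assumptions: the `PAGE` fragment, the class names in it, and `TitleCollector` are invented for illustration, and the event-driven part uses the standard library's `html.parser.HTMLParser` as a stand-in for the "SAX parser" mentioned in the post, since the post does not name a specific library.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment; note the unclosed <td> and <tr> tags.
PAGE = """
<table>
<tr><td class="time">20:00<td class="title">News
<tr><td class="time">20:30<td class="title">Weather
</table>
"""

# 1. str.find(): locate a marker that always precedes the data.
marker = '<td class="title">'
start = PAGE.find(marker) + len(marker)
end = PAGE.find("\n", start)
first_title = PAGE[start:end]

# 2. Regular expression: collect every (time, title) pair in one pass.
ROW = re.compile(r'<td class="time">([^<]+)<td class="title">([^<\n]+)')
rows = ROW.findall(PAGE)


# 3. Event-driven parser: a handler is called for each tag and text run,
#    whether or not the markup is valid.
class TitleCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        self.in_title = tag == "td" and ("class", "title") in attrs

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())


collector = TitleCollector()
collector.feed(PAGE)
collector.close()
```

The find() and regex versions are short but tied to the exact markup; the handler class is longer, which matches the remark about the SAX style generating more code.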

James Graham · Mar 3, 2007

I'm trying to extract some data from an XHTML Transitional web page.

What is the best way to do this?

May I suggest html5lib [1]? It's based on the parsing section of the
WHATWG "HTML5" spec [2] which is in turn based on the behavior of major
web browsers so it should parse more or less* any invalid markup you
throw at it. Despite the name "html5lib" it works with any (X)HTML
document. By default, you have the option of producing a minidom tree,
an ElementTree, or a "simpletree" - a lightweight DOM-like
html5lib-specific tree.

If you are happy to pull from SVN I recommend that version; it has a few
bug fixes over the 0.2 release as well as improved features including
better error reporting and detection of encoding from <meta> elements
(the next release is imminent).

[1] http://code.google.com/p/html5lib/
[2] http://whatwg.org/specs/web-apps/current-work/#parsing

* There might be a problem if, e.g., the document uses a character
encoding that Python does not support; otherwise it should parse anything.

Bruno Desthuilliers · Mar 3, 2007

(e-mail address removed) wrote:

I'm trying to extract some data from an XHTML Transitional web page.

What is the best way to do this?

xml.dom.minidom.

As a side note, cElementTree is probably a better choice. Or even a
simple SAX parser.
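
For a page that really is well-formed XHTML, the ElementTree suggestion might look like the sketch below. It uses the standard library's `xml.etree.ElementTree` (which picks up the C accelerator automatically in current Python versions, so a separate cElementTree import is not needed). The `PAGE` content and the `price` class are hypothetical; the main thing to notice is that XHTML elements live in the XHTML namespace, so paths need a namespace prefix.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical, well-formed XHTML document.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Listings</title></head>
  <body>
    <p class="price">42</p>
    <p class="note">ignore me</p>
    <p class="price">7</p>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# Map a short prefix of our choosing to the XHTML namespace.
ns = {"x": XHTML_NS}

title = root.findtext("x:head/x:title", namespaces=ns)

# ElementTree supports a limited XPath subset, including attribute tests.
prices = [int(p.text) for p in root.iterfind(".//x:p[@class='price']", ns)]
```

Like minidom, this raises a parse error on markup that is not well-formed, so it only helps once the input is valid XML.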

parseString("text of web page") gives errors about it
not being well formed XML.

If it's not well-formed XML, most - if not all - XML parsers will choke
on it.

Do I just need to add something like <?xml ...?> or what?

How could we tell without looking at the XML?

But anyway, even if the XHTML is crappy, BeautifulSoup may do the job...

HTH
