Parsing HTML?

Benjamin · Apr 3, 2008

I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

Daniel Fetchinson · Apr 3, 2008

Benjamin said:
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

Have you tried http://www.google.com/search?q=python+html+parser ?

HTH,
Daniel

benash · Apr 3, 2008

BeautifulSoup does what I need it to. Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM. Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.
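
For readers wondering what "SAX-based" means in practice, here is a minimal sketch of the event-driven style. It assumes Python 3, where the module is html.parser (in 2008 it was the HTMLParser module), and the TextInTag class and the sample markup are made up for illustration. The parser never builds a tree, so you have to track for yourself whether you are inside the tag you care about:

```python
from html.parser import HTMLParser


class TextInTag(HTMLParser):
    """Collect the text found inside every occurrence of one tag.

    Hypothetical helper for illustration, not part of the standard library.
    """

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # > 0 while we are inside the target tag
        self.chunks = []    # text fragments seen inside the target tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


collector = TextInTag("p")
collector.feed("<html><body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>")
collector.close()
text = "".join(collector.chunks)
```

This works for simple cases, but there is no XPath and no element object to query afterwards, which is presumably why a tree-based API is more attractive here.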

Paul Boddie · Apr 3, 2008

Benjamin said:
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

With libxml2dom you'd do the following:

1. Parse the file using libxml2dom.parse with html set to a true
   value.
2. Use the xpath method on the document to select the desired
   element.
3. Use the toString method on the element to get the text of the
   element (including start and end tags), or the textContent
   property to get the text between the tags.

See the Package Index page for more details:

http://www.python.org/pypi/libxml2dom

Paul
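
The same three steps (parse, select, extract) can be sketched with the standard library alone. Note that this is not libxml2dom: it swaps in xml.etree.ElementTree, which only accepts well-formed markup and supports just a limited subset of XPath, so treat it as an illustration of the workflow rather than a replacement. The sample document and the id value are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup. ElementTree is an XML parser
# and will reject typical "tag soup" HTML.
SAMPLE = '<html><body><div id="main">Hello <b>world</b>!</div></body></html>'

# 1. Parse the document.
root = ET.fromstring(SAMPLE)

# 2. Select the element with an XPath-style expression (limited subset).
element = root.find(".//div[@id='main']")

# 3a. The element including its start and end tags.
outer = ET.tostring(element, encoding="unicode")

# 3b. Only the text between the tags, with the markup stripped.
inner_text = "".join(element.itertext())
```

For real-world HTML that is not well-formed, a dedicated HTML parser such as the ones suggested in this thread is the safer choice.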

7stud · Apr 4, 2008

benash said:
BeautifulSoup does what I need it to. Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM. Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.

innerHTML has never been part of the DOM. It is, however, a de facto
browser standard. That's probably why you aren't having any luck
using a Python module that implements the DOM.
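
Even without an innerHTML property, something close to it can be approximated on top of a plain DOM by serializing an element's child nodes. A minimal sketch with xml.dom.minidom follows; the inner_html helper is a made-up name, not a minidom API, and minidom only parses well-formed markup:

```python
from xml.dom.minidom import parseString


def inner_html(node):
    """Serialize the children of a DOM node, roughly like innerHTML.

    Hypothetical helper: minidom itself offers no such method.
    """
    return "".join(child.toxml() for child in node.childNodes)


# Hypothetical, well-formed sample markup.
doc = parseString('<div id="main">Hello <b>world</b>!</div>')
markup = inner_html(doc.documentElement)
```

The result is XML serialization rather than a browser's HTML serialization, so details such as empty elements may come out differently from what a browser's innerHTML would return.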

Stefan Behnel · Apr 7, 2008

Benjamin said:
I'm trying to parse an HTML file. I want to retrieve all of the text
inside a certain tag that I find with XPath. The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.

import lxml.html as h
tree = h.parse("somefile.html")
text = tree.xpath("string( some/element[@condition] )")

http://codespeak.net/lxml

Stefan

Benjamin · Apr 26, 2008

7stud said:
innerHTML has never been part of the DOM. It is, however, a de facto
browser standard. That's probably why you aren't having any luck
using a Python module that implements the DOM.

That makes sense.

Benjamin · Apr 26, 2008

Stefan Behnel said:
import lxml.html as h
tree = h.parse("somefile.html")
text = tree.xpath("string( some/element[@condition] )")

http://codespeak.net/lxml

I actually had trouble getting this to work. I guess only newer versions
of lxml have the html module, and I couldn't get one installed. lxml
does look pretty cool, though.

Stefan Behnel · Apr 26, 2008

Benjamin said:
Stefan Behnel said:
import lxml.html as h
tree = h.parse("somefile.html")
text = tree.xpath("string( some/element[@condition] )")

http://codespeak.net/lxml

I actually had trouble getting this to work. I guess only newer versions
of lxml have the html module, and I couldn't get one installed. lxml
does look pretty cool, though.

Yes, the above code requires lxml 2.x. However, older versions should allow
you to do this:

import lxml.etree as etree
parser = etree.HTMLParser()
tree = etree.parse("somefile.html", parser)
text = tree.xpath("string( some/element[@condition] )")

lxml.html is just a dedicated package that makes HTML handling beautiful. It's
not required for parsing HTML and doing general XML stuff with it.

Stefan
