HTML parsing confusion

A

Alnilam

Sorry for the noob question, but I've gone through the documentation
on python.org, tried some of the diveintopython and Boddie's examples,
and looked through some of the numerous posts in this group on the
subject and I'm still rather confused. I know that there are some
great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
am trying to accomplish a simple task with a minimal (as in nil)
amount of adding in modules that aren't "stock" 2.5, and writing a
huge class of my own (or copying one from diveintopython) seems
overkill for what I want to do.

Here's what I want to accomplish... I want to open a page, identify a
specific point in the page, and turn the information there into
plaintext. For example, on the www.diveintopython.org page, I want to
turn the paragraph that starts "Translations are freely
permitted" (and ends ..."let me know"), into a string variable.

Opening the file seems pretty straightforward.
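Something along these lines, for instance (a sketch using the stock urllib
module; urllib2 would do just as well):

import urllib

page = urllib.urlopen("http://www.diveintopython.org/")
source = page.read()
page.close()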

gets me to a string variable consisting of the un-parsed contents of
the page.
Now things get confusing, though, since there appear to be several
approaches.
One that I read somewhere was:
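(roughly this, as a sketch assuming xml.dom.minidom and markup that is
well-formed enough to parse as XML)

from xml.dom import minidom

# Parse the page source and pull out every <p> element.
dom = minidom.parseString(source)
paragraphs = dom.getElementsByTagName("p")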


gets me all of the paragraph children, and the one I specifically want
can then be referenced with paragraphs[5]. This method seems to be
pretty straightforward, but what do I do with it to get it into a
string cleanly?
from xml.dom.ext import PrettyPrint
PrettyPrint(paragraphs[5])

shows me the text, but still in html, and I can't seem to get it to
turn into a string variable, and I think the PrettyPrint function is
unnecessary for what I want to do. Formatter seems to do what I want,
but I can't figure out how to link the "Element Node" at
paragraphs[5] with the formatter functions to produce the string I
want as output. I tried some of the htmllib.HTMLParser(formatter
stuff) examples, but while I can supposedly get that to work with
formatter a little easier, I can't figure out how to get HTMLParser to
drill down specifically to the 6th paragraph's contents.
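For reference, the way those pieces usually plug together is roughly this
(a sketch only; it renders the whole document as plain text rather than one
specific paragraph):

import htmllib, formatter, StringIO

out = StringIO.StringIO()
writer = formatter.DumbWriter(out)         # writes plain text
fmt = formatter.AbstractFormatter(writer)  # drives the writer
parser = htmllib.HTMLParser(fmt)
parser.feed(source)                        # `source` is the page as a string
parser.close()
text = out.getvalue()                      # whole page, markup stripped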

Thanks in advance.

- A.
 
J

John Machin

Sorry for the noob question, but I've gone through the documentation
on python.org, tried some of the diveintopython and Boddie's examples,
and looked through some of the numerous posts in this group on the
subject and I'm still rather confused. I know that there are some
great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
am trying to accomplish a simple task with a minimal (as in nil)
amount of adding in modules that aren't "stock" 2.5, and writing a
huge class of my own (or copying one from diveintopython) seems
overkill for what I want to do.

Here's what I want to accomplish... I want to open a page, identify a
specific point in the page, and turn the information there into
plaintext. For example, on the www.diveintopython.org page, I want to
turn the paragraph that starts "Translations are freely
permitted" (and ends ..."let me know"), into a string variable.

Opening the file seems pretty straightforward.


gets me to a string variable consisting of the un-parsed contents of
the page.
Now things get confusing, though, since there appear to be several
approaches.
One that I read somewhere was:

Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
-1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
200-modules PyXML package installed. And you don't want the 75Kb
BeautifulSoup?
 
P

Paul Boddie

Sorry for the noob question, but I've gone through the documentation
on python.org, tried some of the diveintopython and Boddie's examples,
and looked through some of the numerous posts in this group on the
subject and I'm still rather confused. I know that there are some
great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
am trying to accomplish a simple task with a minimal (as in nil)
amount of adding in modules that aren't "stock" 2.5, and writing a
huge class of my own (or copying one from diveintopython) seems
overkill for what I want to do.

It's unfortunate that you don't want to install extra modules, but I'd
probably use libxml2dom [1] for what you're about to describe...
Here's what I want to accomplish... I want to open a page, identify a
specific point in the page, and turn the information there into
plaintext. For example, on the www.diveintopython.org page, I want to
turn the paragraph that starts "Translations are freely
permitted" (and ends ..."let me know"), into a string variable.

Opening the file seems pretty straightforward.


gets me to a string variable consisting of the un-parsed contents of
the page.

Yes, there may be shortcuts that let some parsers read directly from
the server, but it's always good to have the page text around, anyway.
Now things get confusing, though, since there appear to be several
approaches.
One that I read somewhere was:


gets me all of the paragraph children, and the one I specifically want
can then be referenced with paragraphs[5]. This method seems to be
pretty straightforward, but what do I do with it to get it into a
string cleanly?

In less sophisticated DOM implementations, what you'd do is to loop
over the "descendant" nodes of the paragraph which are text nodes and
concatenate them.
from xml.dom.ext import PrettyPrint
PrettyPrint(paragraphs[5])

shows me the text, but still in html, and I can't seem to get it to
turn into a string variable, and I think the PrettyPrint function is
unnecessary for what I want to do.

Yes, PrettyPrint is for prettyprinting XML. You just want to visit and
collect the text nodes.
Formatter seems to do what I want,
but I can't figure out how to link the "Element Node" at
paragraphs[5] with the formatter functions to produce the string I
want as output. I tried some of the htmllib.HTMLParser(formatter
stuff) examples, but while I can supposedly get that to work with
formatter a little easier, I can't figure out how to get HTMLParser to
drill down specifically to the 6th paragraph's contents.

Given that you've found the paragraph above, you just need to write a
recursive function which visits child nodes, and if it finds a text
node then it collects the value of the node in a list; otherwise, for
elements, it visits the child nodes of that element; and so on. The
recursive approach is presumably what the formatter uses, but I can't
say that I've really looked at it.
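Such a function is only a few lines; a sketch assuming a minidom-style node
(childNodes, TEXT_NODE):

def collect_text(node):
    # Gather the data of every text node below `node`, in document order.
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.extend(collect_text(child))
    return parts

# e.g. "".join(collect_text(paragraphs[5])) would give the paragraph as plain text.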

Meanwhile, with libxml2dom, you'd do something like this:

import libxml2dom
d = libxml2dom.parseURI("http://www.diveintopython.org/", html=1)
saved = None

# Find the paragraphs.
for p in d.xpath("//p"):

    # Get the text without leading and trailing space.
    text = p.textContent.strip()

    # Save the appropriate paragraph text.
    if text.startswith("Translations are freely permitted") and \
       text.endswith("just let me know."):

        saved = text
        break

The magic part of this code which saves you from needing to write that
recursive function mentioned above is the textContent property on the
paragraph element.

Paul

[1] http://www.python.org/pypi/libxml2dom
 
A

Alnilam

Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
-1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
200-modules PyXML package installed. And you don't want the 75Kb
BeautifulSoup?

I wasn't aware that I had PyXML installed, and can't find a reference
to having it installed in pydocs. And that highlights the problem I
have at the moment with using other modules. I move from computer to
computer regularly, and while all have a recent copy of Python, each
has different (or no) extra modules, and I don't always have the
luxury of downloading extras. That being said, if there's a simple way
of doing it with BeautifulSoup, please show me an example. Maybe I can
figure out a way to carry the extra modules I need around with me.
 
P

Paul McGuire

...I move from computer to
computer regularly, and while all have a recent copy of Python, each
has different (or no) extra modules, and I don't always have the
luxury of downloading extras. That being said, if there's a simple way
of doing it with BeautifulSoup, please show me an example. Maybe I can
figure out a way to carry the extra modules I need around with me.

Pyparsing's footprint is intentionally small - just one pyparsing.py
file that you can drop into a directory next to your own script. And
the code to extract paragraph 5 of the "Dive Into Python" home page?
See annotated code below.

-- Paul

from pyparsing import makeHTMLTags, SkipTo, anyOpenTag, anyCloseTag
import urllib
import textwrap

page = urllib.urlopen("http://diveintopython.org/")
source = page.read()
page.close()

# define a simple paragraph matcher
pStart,pEnd = makeHTMLTags("P")
paragraph = pStart.suppress() + SkipTo(pEnd) + pEnd.suppress()

# get all paragraphs from the input string (or use the
# scanString generator function to stop at the correct
# paragraph instead of reading them all)
paragraphs = paragraph.searchString(source)

# create a transformer that will strip HTML tags
tagStripper = anyOpenTag.suppress() | anyCloseTag.suppress()

# get paragraph[5] and strip the HTML tags
p5TextOnly = tagStripper.transformString(paragraphs[5][0])

# remove extra whitespace
p5TextOnly = " ".join(p5TextOnly.split())

# print out a nicely wrapped string - so few people know
# that textwrap is part of the standard Python distribution,
# but it is very handy
print textwrap.fill(p5TextOnly, 60)
 
A

Alnilam

I wasn't aware that I had PyXML installed, and can't find a reference
to having it installed in pydocs. ...

Ugh. Found it. Sorry about that, but I still don't understand why
there isn't a simple way to do this without using PyXML, BeautifulSoup
or libxml2dom. What's the point in having sgmllib, htmllib,
HTMLParser, and formatter all built in if I have to use someone
else's modules to write a couple of lines of code that achieve the
simple thing I want? I get the feeling that this would be easier if I
just broke down and wrote a couple of regular expressions, but it
hardly seems a 'pythonic' way of going about things.

import re
import urllib

# get the source (assuming you don't have it locally and have an
# internet connection)
source = urllib.urlopen("http://www.diveintopython.org/").read()

# set up some regex to find tags, strip them out, and correct some
# formatting oddities (the patterns here are only illustrative)
p = r"<p\b.*?</p>"           # a whole <p>...</p> block
tag_strip = r"[^<>]+(?=<)"   # runs of text sitting between tags
fix_format = r"\s+"          # collapse runs of whitespace

# achieve clean results.
paragraphs = re.findall(p, source, re.DOTALL)
text_list = re.findall(tag_strip, paragraphs[5])
text = "".join(text_list)
clean_text = re.sub(fix_format, " ", text)

This works, and is small and easily reproduced, but it seems like it
would break easily, and it seems a waste of the other *ML-specific parsers.
 
D

Diez B. Roggisch

Alnilam said:
Ugh. Found it. Sorry about that, but I still don't understand why
there isn't a simple way to do this without using PyXML, BeautifulSoup
or libxml2dom. What's the point in having sgmllib, htmllib,
HTMLParser, and formatter all built in if I have to use someone
else's modules to write a couple of lines of code that achieve the
simple thing I want? I get the feeling that this would be easier if I
just broke down and wrote a couple of regular expressions, but it
hardly seems a 'pythonic' way of going about things.

This is simply a gross misunderstanding of what BeautifulSoup or lxml
accomplish. Dealing with mal-formatted HTML whilst trying to make _some_
sense of it is by no means trivial. And just because you can come up with a
few lines of code using rexes that work for your current use-case doesn't
mean that they can serve as a general HTML-fixing routine. Or do you think
the rather long history and 75Kb of code for BS are because its creator
wasn't aware of rexes?

And it also makes no sense to stuff everything remotely useful into the
standard lib. That would force development and release cycles to be aligned,
resulting in fewer features and less stability than one would wish.

And to be honest: I fail to see where your problem is. BeautifulSoup is a
single Python file. So whatever you carry with you from machine to machine,
if it's capable of holding a file of your own code, you can simply put
BeautifulSoup beside it - even if it was a floppy disk.

Diez
 
A

Alnilam

This is simply a gross misunderstanding of what BeautifulSoup or lxml
accomplish. Dealing with mal-formatted HTML whilst trying to make _some_
sense of it is by no means trivial. And just because you can come up with a
few lines of code using rexes that work for your current use-case doesn't
mean that they can serve as a general HTML-fixing routine. Or do you think
the rather long history and 75Kb of code for BS are because its creator
wasn't aware of rexes?

And it also makes no sense to stuff everything remotely useful into the
standard lib. That would force development and release cycles to be aligned,
resulting in fewer features and less stability than one would wish.

And to be honest: I fail to see where your problem is. BeautifulSoup is a
single Python file. So whatever you carry with you from machine to machine,
if it's capable of holding a file of your own code, you can simply put
BeautifulSoup beside it - even if it was a floppy disk.

Diez


I am, by no means, trying to trivialize the work that goes into
creating the numerous modules out there. However, as a relatively
novice programmer trying to figure something out, the fact that these
modules are pushed on people with such zealous devotion that you take
offense at my desire not to use them gives me a bit of pause. I use
non-included modules for tasks that require them, when something
clearly can't be done easily another way (e.g. MySQLdb). I am sure
that there will be plenty of times when I will
use BeautifulSoup. In this instance, however, I was trying to solve a
specific problem which I attempted to lay out clearly from the
outset.

I was asking this community if there was a simple way to use only the
tools included with Python to parse a bit of html.

If the answer is no, that's fine. Confusing, but fine. If the answer
is yes, great. I look forward to learning from someone's example. If
you don't have an answer, or a positive contribution, then please
don't interject your angst into this thread.
 
G

Gabriel Genellina

I am, by no means, trying to trivialize the work that goes into
creating the numerous modules out there. However, as a relatively
novice programmer trying to figure something out, the fact that these
modules are pushed on people with such zealous devotion that you take
offense at my desire not to use them gives me a bit of pause. I use
non-included modules for tasks that require them, when something
clearly can't be done easily another way (e.g. MySQLdb). I am sure
that there will be plenty of times when I will
use BeautifulSoup. In this instance, however, I was trying to solve a
specific problem which I attempted to lay out clearly from the
outset.

I was asking this community if there was a simple way to use only the
tools included with Python to parse a bit of html.

If you *know* that your document is valid HTML, you can use the HTMLParser
module in the standard Python library. Or even the parser in the htmllib
module. But a lot of HTML pages out there are invalid, some are grossly
invalid, and those parsers are just unable to handle them. This is why
modules like BeautifulSoup exist: they contain a lot of heuristics and
trial-and-error and personal experience from the developers, in order to
guess more or less what the page author intended to write and make some
sense of that "tag soup".
Guesswork like that is not suitable for the std lib ("Errors should
never pass silently" and "In the face of ambiguity, refuse the temptation
to guess.") but makes a perfect 3rd party module.

If you want to use regular expressions, and that works OK for the
documents you are handling now, fine. But don't complain when your RE's
match too much or too little or don't match at all because of unclosed
tags, improperly nested tags, nonsense markup, or just a valid combination
that you didn't take into account.
 
E

elijahu

If you *know* that your document is valid HTML, you can use the HTMLParser
module in the standard Python library. Or even the parser in the htmllib
module. But a lot of HTML pages out there are invalid, some are grossly
invalid, and those parsers are just unable to handle them. This is why
modules like BeautifulSoup exist: they contain a lot of heuristics and
trial-and-error and personal experience from the developers, in order to
guess more or less what the page author intended to write and make some
sense of that "tag soup".
Guesswork like that is not suitable for the std lib ("Errors should
never pass silently" and "In the face of ambiguity, refuse the temptation
to guess.") but makes a perfect 3rd party module.

If you want to use regular expressions, and that works OK for the
documents you are handling now, fine. But don't complain when your RE's
match too much or too little or don't match at all because of unclosed
tags, improperly nested tags, nonsense markup, or just a valid combination
that you didn't take into account.

Thank you. That does make perfect sense, and is a good, clear position
on the upsides and downsides of what I'm trying to do, as well as a good
explanation for why BeautifulSoup will probably remain outside the std
lib. I'm sure that I will get plenty of use out of it.

If, however, I am sure that the HTML in the target documents is
good, and the framework HTML doesn't change, just the data on page
after page of static HTML, would it be better to just go with regex or
with one of the std lib items you mentioned? I thought the latter, but
I'm stuck on how to make them generate results similar to the code I
put above as an example. I'm not trying to code this to go against
html in the wild, but to try to strip specific, consistently located
data from the markup and turn it into something more useful.

I may have confused folks by using the www.diveintopython.org page as
an example, but its HTML seemed to be valid, strict markup.
 
A

Alnilam

If you *know* that your document is valid HTML, you can use the HTMLParser  
module in the standard Python library. Or even the parser in the htmllib  
module. But a lot of HTML pages out there are invalid, some are grossly  
invalid, and those parsers are just unable to handle them. This is why  
modules like BeautifulSoup exist: they contain a lot of heuristics and  
trial-and-error and personal experience from the developers, in order to  
guess more or less what the page author intended to write and make some  
sense of that "tag soup".
Guesswork like that is not suitable for the std lib ("Errors should
never pass silently" and "In the face of ambiguity, refuse the temptation  
to guess.") but makes a perfect 3rd party module.

If you want to use regular expressions, and that works OK for the  
documents you are handling now, fine. But don't complain when your RE's  
match too much or too little or don't match at all because of unclosed  
tags, improperly nested tags, nonsense markup, or just a valid combination  
that you didn't take into account.

Thanks, Gabriel. That does make sense, both what the benefits of
BeautifulSoup are and why it probably won't become std lib anytime
soon.

The pages I'm trying to write this code to run against aren't in the
wild, though. They are static html files on my company's lan, are very
consistent in format, and are (I believe) valid html. They just have
specific paragraphs of useful information, located in the same place
in each file, that I want to 'harvest' and put to better use. I used
diveintopython.org as an example only (and in part because it had good
clean html formatting). I am pretty sure that I could craft some
regular expressions to do the work -- which of course would not be the
case if I was screen scraping web pages in the 'wild' -- but I was
trying to find a way to do that using one of those std libs you
mentioned.

I'm not sure if HTMLParser or htmllib would work better to achieve the
same effect as the regex example I gave above, or how to get them to
do that. I thought I'd come close, but as someone pointed out early
on, I'd accidentally tapped into PyXML, which is installed where I was
testing code, but not necessarily where I need it. It may turn out
that the regex way works faster, but falling back on methods I'm
comfortable with doesn't help expand my Python knowledge.

So if anyone can tell me how to get HTMLParser or htmllib to grab a
specific paragraph, and then provide the text in that paragraph in a
clean, markup-free format, I'd appreciate it.
 
M

M.-A. Lemburg

There are lots of ways of doing HTML parsing in Python. A common
one is e.g. using mxTidy to convert the HTML into valid XHTML
and then using ElementTree to parse the data.

http://www.egenix.com/files/python/mxTidy.html
http://docs.python.org/lib/module-xml.etree.ElementTree.html

For simple tasks you can also use the HTMLParser that's part
of the Python std lib.

http://docs.python.org/lib/module-HTMLParser.html

Which tools to use is really dependent on what you are
trying to solve.

--
Marc-Andre Lemburg
eGenix.com

 
C

cokofreedom

The pages I'm trying to write this code to run against aren't in the
wild, though. They are static html files on my company's lan, are very
consistent in format, and are (I believe) valid html.

Obvious way to check this is to go to http://validator.w3.org/ and see
what it tells you about your html...
 
A

Alnilam

There are lots of ways of doing HTML parsing in Python. A common
one is e.g. using mxTidy to convert the HTML into valid XHTML
and then using ElementTree to parse the data.

http://www.egenix.com/files/python/mxTidy.html
http://docs.python.org/lib/module-xml.etree.ElementTree.html

For simple tasks you can also use the HTMLParser that's part
of the Python std lib.

http://docs.python.org/lib/module-HTMLParser.html

Which tools to use is really dependent on what you are
trying to solve.

--
Marc-Andre Lemburg
eGenix.com


Thanks. So far that makes 3 votes for BeautifulSoup, and one vote each
for libxml2dom, pyparsing, and mxTidy. I'm sure those would all be
great solutions if I were looking to solve my coding question with
external modules.

Several folks have now mentioned that they think that if I have files
that are valid XHTML, I could use htmllib, HTMLParser, or
ElementTree (all of which are part of the standard library in v2.5).

Skipping past html validation, and html to xhtml 'cleaning', and
instead starting with the assumption that I have files that are valid
XHTML, can anyone give me a good example of how I would use _ htmllib,
HTMLParser, or ElementTree _ to parse out the text of one specific
childNode, similar to the examples that I provided above using regex?
 
J

Jerry Hill

Skipping past html validation, and html to xhtml 'cleaning', and
instead starting with the assumption that I have files that are valid
XHTML, can anyone give me a good example of how I would use _ htmllib,
HTMLParser, or ElementTree _ to parse out the text of one specific
childNode, similar to the examples that I provided above using regex?

Have you looked at any of the tutorials or sample code for these
libraries? If you have a specific question, you will probably get more
specific help. I started writing up some sample code, but realized I
was mostly reprising the long tutorial on SGMLLib here:
http://www.boddie.org.uk/python/HTML.html
 
G

Gabriel Genellina

Skipping past html validation, and html to xhtml 'cleaning', and
instead starting with the assumption that I have files that are valid
XHTML, can anyone give me a good example of how I would use _ htmllib,
HTMLParser, or ElementTree _ to parse out the text of one specific
childNode, similar to the examples that I provided above using regex?

The diveintopython page is not valid XHTML (but it's valid HTML). Assuming
it's properly converted:

py> from cStringIO import StringIO
py> import xml.etree.ElementTree as ET
py> tree = ET.parse(StringIO(page))
py> elem = tree.findall('//p')[4]
py>
py> # from the online ElementTree docs
py> # http://www.effbot.org/zone/element-bits-and-pieces.htm
py> def gettext(elem):
...     text = elem.text or ""
...     for e in elem:
...         text += gettext(e)
...         if e.tail:
...             text += e.tail
...     return text
...
py> print gettext(elem)
The complete text is available online. You can read the revision history to
see what's new. Updated 20 May 2004
 
