Regular Expressions

Gabriel Genellina

On Mon, 12 Feb 2007 07:20:11 -0300, (e-mail address removed) wrote:
The sources of HTMLParser and xmllib use regular expressions for
parsing out the data. htmllib calls sgmllib at the beginning of its
code, and sgmllib starts off with a bunch of regular expressions used to
parse data. So the only real difference I see there is that someone
saved me the work of writing them ;0). I haven't looked at the source
for Beautiful Soup, though I have the sneaking suspicion that most
processing of HTML/XML is based on regexes.

You can build a parser for SGML/HTML/XML documents using regexps AND
python code. You can't do that with regexps only.
For example, suppose you work hard to build a correct regexp for matching
an opening <a> tag. You extract this from the document: "<a href='xxx'>".
Is it actually an <a> tag? Maybe. But the text could be inside a comment.
Or in a CDATA section. Or inside javascript code. Or...
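
To make the problem concrete, here is a minimal sketch. The pattern A_TAG and
the sample document are made up for illustration; they are not taken from
HTMLParser, sgmllib, or any other library. A regexp that looks reasonable for
an opening <a> tag reports four hits, but only the first one is a real link:

```python
import re

# Hypothetical pattern for an opening <a> tag (illustration only).
A_TAG = re.compile(r"<a\b[^>]*>", re.IGNORECASE)

doc = """<p>real: <a href='xxx'>link</a></p>
<!-- commented out: <a href='old'>gone</a> -->
<script>var s = "<a href='js'>";</script>
<![CDATA[ <a href='cdata'> ]]>"""

# The regexp has no idea whether it is inside a comment, a script or CDATA.
matches = A_TAG.findall(doc)
for m in matches:
    print(m)
```

The regexp itself is fine as a token recognizer; what is missing is the
context that tells you which of those tokens are actually markup.
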
A regexp is good for recognizing tokens, and this can be used to build a
parser. But regular expressions alone can't parse this kind of document,
because its grammar is not regular.
(Python's re engine is stronger than "mathematical" regular expressions, in
the sense that it can handle things like backreferences (?P=...) and
lookahead (?=...), but even so it can't handle HTML.)
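
The "regexps AND python code" idea can be sketched roughly like this. It is
only an illustration under simplifying assumptions, not how sgmllib or
HTMLParser actually work: the names TOKEN and find_a_tags are hypothetical,
and real HTML has many more cases (attribute values containing ">", <style>
blocks, malformed markup, and so on). The regexp recognizes one token at a
time, and ordinary Python code keeps track of where we are in the document:

```python
import re

# One token per match. Comments and CDATA come first so that anything
# inside them is swallowed as a single opaque token.
TOKEN = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<cdata><!\[CDATA\[.*?\]\]>)"
    r"|(?P<open_a><a\b[^>]*>)"
    r"|(?P<other_tag><[^>]*>)"
    r"|(?P<text>[^<]+)",
    re.DOTALL | re.IGNORECASE,
)

def find_a_tags(doc):
    """Return opening <a> tags that are real markup, skipping comments,
    CDATA sections and the bodies of <script> elements."""
    found = []
    lowered = doc.lower()
    pos = 0
    while pos < len(doc):
        m = TOKEN.match(doc, pos)
        if m is None:
            pos += 1  # stray "<" with no closing ">"; skip it
            continue
        kind = m.lastgroup  # name of the alternative that matched
        if kind == "open_a":
            found.append(m.group())
        pos = m.end()
        # Context handled by Python code, not by the regexp: after a
        # <script> tag, jump straight to the matching </script>.
        if kind == "other_tag" and m.group().lower().startswith("<script"):
            end = lowered.find("</script", pos)
            pos = len(doc) if end == -1 else end
    return found

sample = """<p>real: <a href='xxx'>link</a></p>
<!-- commented out: <a href='old'>gone</a> -->
<script>var s = "<a href='js'>";</script>
<![CDATA[ <a href='cdata'> ]]>"""

print(find_a_tags(sample))  # prints ["<a href='xxx'>"]
```

The regexp never decides on its own whether a tag counts; the loop around it
does, and that loop is exactly the part a regexp alone cannot express.
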
 
