Regular Expressions

Gabriel Genellina · Feb 12, 2007

On Mon, 12 Feb 2007 07:20:11 -0300, (e-mail address removed) wrote:

The sources of HTMLParser and xmllib use regular expressions for
parsing out the data. htmllib calls sgmllib at the beginning of its
code--sgmllib starts off with a bunch of regular expressions used to
parse data. So the only real difference there I see is that someone
saved me the work of writing them ;0). I haven't looked at the source
for Beautiful Soup, though I have the sneaking suspicion that most
processing of HTML/XML is based on regexes.

You can build a parser for SGML/HTML/XML documents using regexps AND
python code. You can't do that with regexps only.
For example, suppose you work hard to build a correct regexp for matching
an opening <a> tag. You extract this from the document: "<a href='xxx'>".
Is it actually an <a> tag? Maybe. But the text could be inside a comment.
Or in a CDATA section. Or inside javascript code. Or...
A regexp is good for recognizing tokens, and this can be used to build a
parser. But regular expressions alone can't parse these kinds of documents,
just because their grammar is not regular.
(Python's re engine is stronger than "mathematical" regular expressions, in
the sense that it can handle things like backreferences (?P=...) and
lookahead (?=...), but it still can't handle HTML.)
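
To make the point about comments concrete, here is a minimal sketch. The
patterns A_TAG and COMMENT and both function names are made up for this
illustration; they are not taken from HTMLParser or sgmllib, and the
scanner only knows about comments (not CDATA or script blocks). It shows a
regexp alone reporting a tag that sits inside a comment, and the same
regexps used as tokenizers under the control of Python code skipping it.

```python
import re

# Hypothetical, deliberately simplified token patterns.
A_TAG = re.compile(r"<a\s+[^>]*>", re.IGNORECASE)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def naive_links(html):
    """Regexp only: reports every match, whatever the context."""
    return A_TAG.findall(html)


def links_outside_comments(html):
    """Regexps recognize tokens; Python code keeps track of context."""
    found = []
    pos = 0
    while pos < len(html):
        comment = COMMENT.match(html, pos)
        if comment:
            # Skip the whole comment, including any tags inside it.
            pos = comment.end()
            continue
        tag = A_TAG.match(html, pos)
        if tag:
            found.append(tag.group(0))
            pos = tag.end()
            continue
        pos += 1
    return found


doc = "<!-- <a href='old'> --><p><a href='xxx'>link</a></p>"
print(naive_links(doc))
print(links_outside_comments(doc))
```

The first call returns both tags, the second only the one outside the
comment. A real parser needs the same kind of bookkeeping for CDATA
sections, script content and so on, which is the part the regexps can't
do by themselves.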
