Regular Expressions

Geoff Hill · Feb 10, 2007

What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

Paul Rubin · Feb 10, 2007

Geoff Hill said:
What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

Read the documentation?

John Machin · Feb 11, 2007

What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

I suggest that you work through the re HOWTO
http://www.amk.ca/python/howto/regex/
and by work through, I don't mean "read". I mean as each new concept
is introduced:
1. try the given example(s) yourself at the interactive prompt
2. try variations on the examples
3. read the relevant part of the Library Reference Manual

Also I'd suggest reading threads in this newsgroup where people are
asking for help with re.

HTH,
John

Paul Rubin · Feb 11, 2007

John Machin said:
I suggest that you work through the re HOWTO
http://www.amk.ca/python/howto/regex/

Also remember Zawinski's law:
http://fishbowl.pastiche.org/2003/08/18/beware_regular_expressions

gregarican · Feb 11, 2007

What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

I highly recommend reading the book "Mastering Regular Expressions,"
which I believe is published by O'Reilly. It's a great reference and
helps peel the onion in terms of working through RE. They are a
language unto themselves. A fun brain exercise.

Shawn Milo · Feb 11, 2007

I highly recommend reading the book "Mastering Regular Expressions,"
which I believe is published by O'Reilly. It's a great reference and
helps peel the onion in terms of working through RE. They are a
language unto themselves. A fun brain exercise.

Absolutely: Get "Mastering Regular Expressions" by Jeffrey Friedl. Not
only is it easy to read, but you'll get a lot of mileage out of
regexes in general. Grep, Perl one-liners, Python, and other tools use
regexes, and you'll find that they are really clever little creatures
once you befriend a few of them.

Shawn

Geoff Hill · Feb 11, 2007

Thanks. O'Reilly is the way I learned Python, and I'm surprised that I didn't
think of a book by them earlier.

Steve Holden · Feb 11, 2007

Geoff said:
What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

In fact that's a pretty smart stance. A quote attributed variously to
Tim Peters and Jamie Zawinski says "Some people, when confronted with a
problem, think 'I know, I'll use regular expressions.' Now they have two
problems."

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007

Steven D'Aprano · Feb 11, 2007

In fact that's a pretty smart stance.

That's a little harsh -- regexes have their place, together with pointer
arithmetic, bit manipulations, reverse polish notation and goto. The
problem is when people use them inappropriately e.g. using a regex when a
simple string.find will do.
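
As a rough illustration of that point, here is a minimal sketch in modern Python 3 (the sample text is made up). It compares a plain substring search with the regex equivalent, and shows one way the regex version can go wrong: an unescaped "." matches any character.

```python
import re

line = "ERROR 42: disk a.b is full"  # hypothetical sample text

# Plain substring search: no pattern syntax to worry about.
pos_find = line.find("a.b")  # index of the first occurrence, or -1

# Regex equivalent: the "." has to be escaped to be taken literally.
m = re.search(re.escape("a.b"), line)
pos_re = m.start() if m else -1

# Without escaping, "." matches any character, so this also matches "aXb".
loose = re.search("a.b", "disk aXb is full")

print(pos_find, pos_re, loose is not None)
```

Both searches find the same position here; the difference is how much you have to keep in your head to get there.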

A quote attributed variously to
Tim Peters and Jamie Zawinski says "Some people, when confronted with a
problem, think 'I know, I'll use regular expressions.' Now they have two
problems."

I believe that is correctly attributed to Jamie Zawinski.

John Machin · Feb 11, 2007

That's a little harsh -- regexes have their place, together with pointer
arithmetic, bit manipulations, reverse polish notation and goto. The
problem is when people use them inappropriately e.g. using a regex when a
simple string.find will do.

Thanks for the tip-off, Steve and Steven. Looks like I'll have to
start hiding my 12C (datecode 2214) with its "GTO" button under the
loose floor-board whenever I hear a knock at the door ;-) Looks like
Agner Fog's gone a million, and there'll be a special place in hell
for people who combine regexes with bit manipulation, like Navarro &
Raffinot. And we won't even mention Heikki Hy,*7g^54d3j+__=

James Stroud · Feb 11, 2007

gregarican said:
I highly recommend reading the book "Mastering Regular Expressions,"
which I believe is published by O'Reilly. It's a great reference and
helps peel the onion in terms of working through RE. They are a
language unto themselves. A fun brain exercise.

There is no real mention of python in this book, but the first edition
is probably the best programming book I've ever read (excepting, perhaps
Text Processing in Python by Mertz.) Well, come to think of it, check
the latter book out. It has a great chapter on Python regexes. And it's
free to download.

James

deviantbunnylord · Feb 11, 2007

That's a little harsh -- regexes have their place, together with pointer
arithmetic, bit manipulations, reverse polish notation and goto. The
problem is when people use them inappropriately e.g. using a regex when a
simple string.find will do.

I believe that is correctly attributed to Jamie Zawinski.

So as a newbie, I have to ask. I've played with the re module now for
a while, I think regular expressions are super fun and useful. As far
as them being a problem I found they can be tricky and sometimes the
regex's I've devised do unexpected things...(which I can think of two
instances where that unexpected thing was something that I had hoped
to get into further down the line, yay for me!). So I guess I don't
really understand why they are a "bad idea" to use. I don't know of
any other way yet to parse specific data out of a text, html, or xml
file without resorting to regular expressions.
What other ways are there?

skip · Feb 11, 2007

jwz> Some people, when confronted with a problem, think 'I know, I'll
jwz> use regular expressions.' Now they have two problems.

dbl> So as a newbie, I have to ask.... So I guess I don't really
dbl> understand why they are a "bad idea" to use.

Regular expressions are fine in their place; however, you can get carried
away. For example:

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

Skip

Gabriel Genellina · Feb 11, 2007

On Sun, 11 Feb 2007 13:35:26 -0300, (e-mail address removed) wrote:

So as a newbie, I have to ask. I've played with the re module now for
a while, I think regular expressions are super fun and useful. As far
as them being a problem I found they can be tricky and sometimes the
regex's I've devised do unexpected things...(which I can think of two
instances where that unexpected thing was something that I had hoped
to get into further down the line, yay for me!). So I guess I don't
really understand why they are a "bad idea" to use. I don't know of
any other way yet to parse specific data out of a text, html, or xml
file without resorting to regular expressions.
What other ways are there?

For very simple things, it's easier/faster to use string methods like find
or split. For example, splitting "2007-02-11" into y,m,d parts:
y,m,d = date.split("-")
is a lot faster than matching "(\d+)-(\d+)-(\d+)"
On the other hand, complex tasks like parsing an HTML/XML document,
*can't* be done with a regexp alone; but people insist anyway, and then
complain when it doesn't work as expected, and ask how to "fix" the
regexp...
Good use of regexps probably lies somewhere in between.
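
The date example can be sketched side by side as follows (Python 3; the trailing "$" in the pattern is an addition, not part of the pattern quoted above). The speed claim is not measured here; something like the timeit module would be the place to check it on your own machine.

```python
import re

date = "2007-02-11"  # sample value from the post

# String-method version: split on the separator.
y, m, d = date.split("-")

# Regex version: same result for this input, with more machinery involved.
match = re.match(r"(\d+)-(\d+)-(\d+)$", date)
ry, rm, rd = match.groups()

print((y, m, d) == (ry, rm, rd))
```

One difference worth knowing: split() does no validation, so "2007-xx-11" would split happily, while the regex would fail to match. If you need that check, the regex (or a dedicated date parser) earns its keep.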

John Machin · Feb 11, 2007

So as a newbie, I have to ask. I've played with the re module now for
a while, I think regular expressions are super fun and useful. As far
as them being a problem I found they can be tricky and sometimes the
regex's I've devised do unexpected things...(which I can think of two
instances where that unexpected thing was something that I had hoped
to get into further down the line, yay for me!). So I guess I don't
really understand why they are a "bad idea" to use.

Regexes are not "bad". However people tend to overuse them, whether
they are overkill (like Gabriel's date-splitting example) or underkill
-- see your next sentence

I don't know of
any other way yet to parse specific data out of a text, html, or xml
file without resorting to regular expressions.
What other ways are there?

Text: Paul McGuire's pyparsing module (Google is your friend); read
David Mertz's book on text processing with Python (free download, I
believe); modules for specific data formats e.g. csv

HTML: htmllib and HTMLParser (both in the Python library),
BeautifulSoup (again GIYF)

XML: xml.* in the Python library. ElementTree (recommended) is
included in Python 2.5; use xml.etree.cElementTree.
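
A minimal sketch of the library route, assuming a current Python 3 where the HTMLParser class is imported from html.parser and ElementTree from xml.etree.ElementTree. The XML snippet and the LinkCollector class are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical XML snippet, made up for this example.
xml_text = (
    "<thread>"
    "<post author='skip'>Lexing</post>"
    "<post author='John'>Modules</post>"
    "</thread>"
)
root = ET.fromstring(xml_text)
posts = [(p.get("author"), p.text) for p in root.findall("post")]


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


collector = LinkCollector()
collector.feed('<p>See <a href="http://www.amk.ca/python/howto/regex/">the HOWTO</a></p>')
collector.close()

print(posts)
print(collector.links)
```

No regex appears in the calling code; nesting, attributes and quoting are handled by the parsers.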

HTH,
John

Steve Holden · Feb 11, 2007

So as a newbie, I have to ask. I've played with the re module now for
a while, I think regular expressions are super fun and useful. As far
as them being a problem I found they can be tricky and sometimes the
regex's I've devised do unexpected things...(which I can think of two
instances where that unexpected thing was something that I had hoped
to get into further down the line, yay for me!). So I guess I don't
really understand why they are a "bad idea" to use. I don't know of
any other way yet to parse specific data out of a text, html, or xml
file without resorting to regular expressions.
What other ways are there?

Re's aren't inherently bad. Just avoid using them as a hammer to the
extent that all your problems look like nails.

They wouldn't exist if there weren't problems it was appropriate to use
them on. Just try to use simpler techniques first.

For example, don't use re's to find out if a string starts with a
specific substring when you could instead use the .startswith() string
method.
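
A short sketch of that comparison in Python 3 (the filename is a made-up input):

```python
import re

filename = "test_regex.py"  # hypothetical input

# Direct and readable.
plain = filename.startswith("test_")

# Regex equivalent; re.match only matches at the start of the string.
via_re = re.match(r"test_", filename) is not None

# startswith also accepts a tuple of alternative prefixes.
either = filename.startswith(("test_", "spec_"))

print(plain, via_re, either)
```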

regards
Steve

deviantbunnylord · Feb 12, 2007

HTML: htmllib and HTMLParser (both in the Python library),
BeautifulSoup (again GIYF)

XML: xml.* in the Python library. ElementTree (recommended) is
included in Python 2.5; use xml.etree.cElementTree.

The source of HTMLParser and xmllib use regular expressions for
parsing out the data. htmllib calls sgmllib at the beginning of its
code--sgmllib starts off with a bunch of regular expressions used to
parse data. So the only real difference there I see is that someone
saved me the work of writing them ;0). I haven't looked at the source
for Beautiful Soup, though I have the sneaking suspicion that most
processing of html/xml is all based on regex's.

John Machin · Feb 12, 2007

HTML: htmllib and HTMLParser (both in the Python library),
BeautifulSoup (again GIYF)

XML: xml.* in the Python library. ElementTree (recommended) is
included in Python 2.5; use xml.etree.cElementTree.

The source of HTMLParser and xmllib use regular expressions for
parsing out the data. htmllib calls sgmllib at the beginning of its
code--sgmllib starts off with a bunch of regular expressions used to
parse data. So the only real difference there I see is that someone
saved me the work of writing them ;0). I haven't looked at the source
for Beautiful Soup, though I have the sneaking suspicion that most
processing of html/xml is all based on regex's.

That's right. Those modules use regexes. You don't. You call functions
& classes in the modules.

Someone has written those modules and tested them and documented them
and they've had a fair old thrashing by quite a few people over the
years -- it may be the only difference in your way of thinking but
it's quite a large difference from you opening up the re docs and
getting stuck in single-handedly.

Neil Cerutti · Feb 12, 2007

What's the way to go about learning Python's regular
expressions? I feel like such an idiot - being so strong in a
programming language but knowing nothing about RE.

A great way to learn regular expressions is to implement them.
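
For anyone tempted to try, a tiny matcher is a reasonable starting point. The sketch below is a Python 3 adaptation of the well-known small matcher by Rob Pike (described by Brian Kernighan); it is not something from this post, and it handles only literals, ".", "*", "^" and "$".

```python
def match(pattern, text):
    """Return True if pattern matches anywhere in text."""
    if pattern.startswith("^"):
        return match_here(pattern[1:], text)
    # Try every starting position, including the empty tail.
    return any(match_here(pattern, text[i:]) for i in range(len(text) + 1))


def match_here(pattern, text):
    """Return True if pattern matches at the beginning of text."""
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if pattern == "$":
        return text == ""
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False


def match_star(char, pattern, text):
    """Match zero or more occurrences of char, then the rest of pattern."""
    i = 0
    while True:
        if match_here(pattern, text[i:]):
            return True
        if i < len(text) and char in (".", text[i]):
            i += 1
        else:
            return False


print(match("colou*r", "what color is it"))
```

Extending it with "+", "?" or character classes is where the learning happens.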

skip · Feb 12, 2007

dbl> The source of HTMLParser and xmllib use regular expressions for
dbl> parsing out the data. htmllib calls sgmllib at the beginning of its
dbl> code--sgmllib starts off with a bunch of regular expressions used
dbl> to parse data.

I am almost certain those modules use regular expressions for lexical
analysis (splitting the input byte stream into "words"), not for parsing
(extracting the structure of the "sentences").

If I have a simple expression:

(7 + 3.14) * CONST

that's just a stream of bytes, "(", "7", " ", "+", ... Lexical analysis
chunks that stream of bytes into the "words" of the language:

LPAREN (NUMBER, 7) PLUS (NUMBER, 3.14) RPAREN TIMES (IDENT, "CONST")

Parsing then constructs a higher level representation of that stream of
"words" (more commonly called tokens or lexemes). That representation is
application-dependent.

Regular expressions are ideal for lexical analysis. They are not-so-hot for
parsing unless the grammar of the language being parsed is *extremely*
simple.
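
The lexical-analysis step described above can be sketched with the re module as follows (Python 3). The token names come from the example; the pattern table, the SKIP and ERROR entries, and the tokenize function are assumptions made for illustration, and no parsing is attempted.

```python
import re

# Token names follow the example above; the patterns are assumptions.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("PLUS", r"\+"),
    ("TIMES", r"\*"),
    ("SKIP", r"\s+"),   # whitespace, discarded
    ("ERROR", r"."),    # anything else
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Chunk text into (kind, value) tokens; build no structure."""
    tokens = []
    for m in MASTER.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError(f"unexpected character {value!r}")
        if kind == "NUMBER":
            value = float(value) if "." in value else int(value)
        tokens.append((kind, value))
    return tokens


tokens = tokenize("(7 + 3.14) * CONST")
print(tokens)
```

The result is a flat list. Working out that the parentheses group the addition before the multiplication is the parser's job, and that is the part a regex alone does not handle well.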

Here are a couple of much better expositions on these topics:

http://en.wikipedia.org/wiki/Lexical_analysis
http://en.wikipedia.org/wiki/Parsing

Skip
