Regular Expressions

Geoff Hill

What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.
 
Paul Rubin

Geoff Hill said:
What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

Read the documentation?
 
John Machin

What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

I suggest that you work through the re HOWTO
http://www.amk.ca/python/howto/regex/
and by work through, I don't mean "read". I mean as each new concept
is introduced:
1. try the given example(s) yourself at the interactive prompt
2. try variations on the examples
3. read the relevant part of the Library Reference Manual
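
A session in the spirit of steps 1 and 2 might look like the sketch below. The pattern and test strings are illustrative assumptions, chosen only to show one example and one variation on it.

```python
import re

# Illustrative pattern (an assumption): one or more lowercase letters.
pattern = re.compile(r"[a-z]+")

# Step 1: try the example. match() only succeeds at the start of the string.
at_start = pattern.match("tempo")
not_at_start = pattern.match("::: message")

# Step 2: try a variation. search() scans forward for the first match.
found = pattern.search("::: message")

print(at_start.group())   # tempo
print(not_at_start)       # None
print(found.span())       # (4, 11)
```

Step 3 is then to look up match() and search() in the Library Reference and check that the documented behaviour agrees with what you just saw.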

Also I'd suggest reading threads in this newsgroup where people are
asking for help with re.

HTH,
John
 
gregarican

What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

I highly recommend reading the book "Mastering Regular Expressions,"
which I believe is published by O'Reilly. It's a great reference and
helps peel the onion in terms of working through RE. They are a
language unto themselves. A fun brain exercise.
 
Shawn Milo

I highly recommend reading the book "Mastering Regular Expressions,"
which I believe is published by O'Reilly. It's a great reference and
helps peel the onion in terms of working through RE. They are a
language unto themselves. A fun brain exercise.

Absolutely: Get "Mastering Regular Expressions" by Jeffrey Friedl. Not
only is it easy to read, but you'll get a lot of mileage out of
regexes in general. Grep, Perl one-liners, Python, and other tools use
regexes, and you'll find that they are really clever little creatures
once you befriend a few of them.

Shawn
 
Geoff Hill

Thanks. O'Reilly is the way I learned Python, and I'm surprised that I didn't
think of a book by them earlier.
 
Steve Holden

Geoff said:
What's the way to go about learning Python's regular expressions? I feel
like such an idiot - being so strong in a programming language but knowing
nothing about RE.

In fact that's a pretty smart stance. A quote attributed variously to
Tim Peters and Jamie Zawinski says "Some people, when confronted with a
problem, think 'I know, I'll use regular expressions.' Now they have two
problems."

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Blog of Note: http://holdenweb.blogspot.com
See you at PyCon? http://us.pycon.org/TX2007
 
Steven D'Aprano

In fact that's a pretty smart stance.

That's a little harsh -- regexes have their place, together with pointer
arithmetic, bit manipulations, reverse polish notation and goto. The
problem is when people use them inappropriately e.g. using a regex when a
simple string.find will do.


A quote attributed variously to
Tim Peters and Jamie Zawinski says "Some people, when confronted with a
problem, think 'I know, I'll use regular expressions.' Now they have two
problems."

I believe that is correctly attributed to Jamie Zawinski.
 
John Machin

That's a little harsh -- regexes have their place, together with pointer
arithmetic, bit manipulations, reverse polish notation and goto. The
problem is when people use them inappropriately e.g. using a regex when a
simple string.find will do.

Thanks for the tip-off, Steve and Steven. Looks like I'll have to
start hiding my 12C (datecode 2214) with its "GTO" button under the
loose floor-board whenever I hear a knock at the door ;-) Looks like
Agner Fog's gone a million, and there'll be a special place in hell
for people who combine regexes with bit manipulation, like Navarro &
Raffinot. And we won't even mention Heikki Hy,*7g^54d3j+__=
 
James Stroud

gregarican said:
I highly recommend reading the book "Mastering Regular Expressions,"
which I believe is published by O'Reilly. It's a great reference and
helps peel the onion in terms of working through RE. They are a
language unto themselves. A fun brain exercise.

There is no real mention of python in this book, but the first edition
is probably the best programming book I've ever read (excepting, perhaps
Text Processing in Python by Mertz.) Well, come to think of it, check
the latter book out. It has a great chapter on Python Regex. And it's
free to download.

James
 
deviantbunnylord

That's a little harsh -- regexes have their place, together with pointer
arithmetic, bit manipulations, reverse polish notation and goto. The
problem is when people use them inappropriately e.g. using a regex when a
simple string.find will do.


I believe that is correctly attributed to Jamie Zawinski.

So as a newbie, I have to ask. I've played with the re module now for
a while, I think regular expressions are super fun and useful. As far
as them being a problem I found they can be tricky and sometimes the
regex's I've devised do unexpected things...(which I can think of two
instances where that unexpected thing was something that I had hoped
to get into further down the line, yay for me!). So I guess I don't
really understand why they are a "bad idea" to use. I don't know of
any other way yet to parse specific data out of a text, html, or xml
file without resorting to regular expressions.
What other ways are there?
 
skip

jwz> Some people, when confronted with a problem, think 'I know, I'll
jwz> use regular expressions.' Now they have two problems.

dbl> So as a newbie, I have to ask.... So I guess I don't really
dbl> understand why they are a "bad idea" to use.

Regular expressions are fine in their place, however, you can get carried
away. For example:

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

Skip
 
Gabriel Genellina

On Sun, 11 Feb 2007 13:35:26 -0300, (e-mail address removed) wrote:
So as a newbie, I have to ask. I've played with the re module now for
a while, I think regular expressions are super fun and useful. As far
as them being a problem I found they can be tricky and sometimes the
regex's I've devised do unexpected things...(which I can think of two
instances where that unexpected thing was something that I had hoped
to get into further down the line, yay for me!). So I guess I don't
really understand why they are a "bad idea" to use. I don't know of
any other way yet to parse specific data out of a text, html, or xml
file without resorting to regular expressions.
What other ways are there?

For very simple things, it's easier/faster to use string methods like find
or split. For example, splitting "2007-02-11" into y,m,d parts:
y,m,d = date.split("-")
is a lot faster than matching "(\d+)-(\d+)-(\d+)"
On the other hand, complex tasks like parsing an HTML/XML document,
*can't* be done with a regexp alone; but people insist anyway, and then
complain when it doesn't work as expected, and ask how to "fix" the
regexp...
Good uses of regexps probably lie somewhere in between.
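
The date example above can be written out both ways, roughly as follows. This is only a sketch: the variable names are made up, and it shows that the two approaches give the same result without measuring the speed difference.

```python
import re

date = "2007-02-11"

# String-method version: split on the literal separator.
y, m, d = date.split("-")

# Regex version: three groups of digits separated by hyphens.
match = re.match(r"(\d+)-(\d+)-(\d+)", date)
ry, rm, rd = match.groups()

print((y, m, d))                       # ('2007', '02', '11')
print((y, m, d) == (ry, rm, rd))       # True
```

If the timing matters, the standard timeit module can be used to compare the two on real input.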
 
John Machin

So as a newbie, I have to ask. I've played with the re module now for
a while, I think regular expressions are super fun and useful. As far
as them being a problem I found they can be tricky and sometimes the
regex's I've devised do unexpected things...(which I can think of two
instances where that unexpected thing was something that I had hoped
to get into further down the line, yay for me!). So I guess I don't
really understand why they are a "bad idea" to use.

Regexes are not "bad". However people tend to overuse them, whether
they are overkill (like Gabriel's date-splitting example) or underkill
-- see your next sentence :)
I don't know of
any other way yet to parse specific data out of a text, html, or xml
file without resorting to regular expressions.
What other ways are there?

Text: Paul McGuire's pyparsing module (Google is your friend); read
David Mertz's book on text processing with Python (free download, I
believe); modules for specific data formats e.g. csv

HTML: htmllib and HTMLParser (both in the Python library),
BeautifulSoup (again GIYF)

XML: xml.* in the Python library. ElementTree (recommended) is
included in Python 2.5; use xml.etree.cElementTree.
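
As a rough illustration of the ElementTree route, here is a minimal sketch. The XML snippet and its tag names are made up for the example. It also assumes a current Python 3, where plain xml.etree.ElementTree is the import to use (the cElementTree name was removed in Python 3.9).

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented purely for illustration.
doc = """<items>
  <entry id="1"><name>alpha</name></entry>
  <entry id="2"><name>beta</name></entry>
</items>"""

# Parse the whole document into a tree; no hand-written regexes involved.
root = ET.fromstring(doc)

# Walk the structure by element name and attribute.
names = [entry.findtext("name") for entry in root.findall("entry")]
ids = [entry.get("id") for entry in root.findall("entry")]

print(names)   # ['alpha', 'beta']
print(ids)     # ['1', '2']
```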

HTH,
John
 
Steve Holden

So as a newbie, I have to ask. I've played with the re module now for
a while, I think regular expressions are super fun and useful. As far
as them being a problem I found they can be tricky and sometimes the
regex's I've devised do unexpected things...(which I can think of two
instances where that unexpected thing was something that I had hoped
to get into further down the line, yay for me!). So I guess I don't
really understand why they are a "bad idea" to use. I don't know of
any other way yet to parse specific data out of a text, html, or xml
file without resorting to regular expressions.
What other ways are there?

Re's aren't inherently bad. Just avoid using them as a hammer to the
extent that all your problems look like nails.

They wouldn't exist if there weren't problems it was appropriate to use
them on. Just try to use simpler techniques first.

For example, don't use re's to find out if a string starts with a
specific substring when you could instead use the .startswith() string
method.
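
A small sketch of that comparison, with a made-up input line and prefix:

```python
import re

# Hypothetical input, chosen only for illustration.
line = "Subject: Regular Expressions"
prefix = "Subject:"

# Regex version: works, but the prefix has to be escaped in case it
# contains characters that are special in a pattern.
via_re = re.match(re.escape(prefix), line) is not None

# String-method version: says exactly what it means.
via_method = line.startswith(prefix)

print(via_re, via_method)   # True True
```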

regards
Steve
 
deviantbunnylord

HTML: htmllib and HTMLParser (both in the Python library),
BeautifulSoup (again GIYF)

XML: xml.* in the Python library. ElementTree (recommended) is
included in Python 2.5; use xml.etree.cElementTree.


The source of HTMLParser and xmllib use regular expressions for
parsing out the data. htmllib calls sgmllib at the beginning of its
code--sgmllib starts off with a bunch of regular expressions used to
parse data. So the only real difference there I see is that someone
saved me the work of writing them ;0). I haven't looked at the source
for Beautiful Soup, though I have the sneaking suspicion that most
processing of html/xml is all based on regex's.
 
John Machin

HTML: htmllib and HTMLParser (both in the Python library),
BeautifulSoup (again GIYF)

XML: xml.* in the Python library. ElementTree (recommended) is
included in Python 2.5; use xml.etree.cElementTree.

The source of HTMLParser and xmllib use regular expressions for
parsing out the data. htmllib calls sgmllib at the beginning of its
code--sgmllib starts off with a bunch of regular expressions used to
parse data. So the only real difference there I see is that someone
saved me the work of writing them ;0). I haven't looked at the source
for Beautiful Soup, though I have the sneaking suspicion that most
processing of html/xml is all based on regex's.

That's right. Those modules use regexes. You don't. You call functions
& classes in the modules.

Someone has written those modules and tested them and documented them
and they've had a fair old thrashing by quite a few people over the
years -- it may be the only difference in your way of thinking but
it's quite a large difference from you opening up the re docs and
getting stuck in single-handedly :)
 
Neil Cerutti

What's the way to go about learning Python's regular
expressions? I feel like such an idiot - being so strong in a
programming language but knowing nothing about RE.

A great way to learn regular expressions is to implement them.
 
skip

dbl> The source of HTMLParser and xmllib use regular expressions for
dbl> parsing out the data. htmllib calls sgmllib at the beginning of its
dbl> code--sgmllib starts off with a bunch of regular expressions used
dbl> to parse data.

I am almost certain those modules use regular expressions for lexical
analysis (splitting the input byte stream into "words"), not for parsing
(extracting the structure of the "sentences").

If I have a simple expression:

(7 + 3.14) * CONST

that's just a stream of bytes, "(", "7", " ", "+", ... Lexical analysis
chunks that stream of bytes into the "words" of the language:

LPAREN (NUMBER, 7) PLUS (NUMBER, 3.14) RPAREN TIMES (IDENT, "CONST")

Parsing then constructs a higher level representation of that stream of
"words" (more commonly called tokens or lexemes). That representation is
application-dependent.

Regular expressions are ideal for lexical analysis. They are not-so-hot for
parsing unless the grammar of the language being parsed is *extremely*
simple.
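
To make the lexical-analysis half concrete, here is a minimal sketch of a regex-based tokenizer for the expression above. The token names follow the example, but the patterns, the TOKEN_SPEC table and the tokenize() function are assumptions made for illustration, not code from any of the modules discussed.

```python
import re

# One named group per token type; order matters (first alternative wins).
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),   # integer or decimal literal
    ("IDENT",  r"[A-Za-z_]\w*"),    # identifier
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SKIP",   r"\s+"),             # whitespace, discarded
    ("ERROR",  r"."),               # anything else
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Chunk text into tokens; this does no parsing at all."""
    tokens = []
    for m in MASTER.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError("unexpected character %r at %d" % (value, m.start()))
        if kind == "NUMBER":
            tokens.append((kind, float(value) if "." in value else int(value)))
        elif kind == "IDENT":
            tokens.append((kind, value))
        else:
            tokens.append(kind)
    return tokens

print(tokenize("(7 + 3.14) * CONST"))
# ['LPAREN', ('NUMBER', 7), 'PLUS', ('NUMBER', 3.14), 'RPAREN', 'TIMES', ('IDENT', 'CONST')]
```

Note what the sketch does not do: it has no idea whether the parentheses balance or which operator binds tighter. That is the parser's job, and it is the part a regular expression alone handles poorly.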

Here are a couple much better expositions on the topics:

http://en.wikipedia.org/wiki/Lexical_analysis
http://en.wikipedia.org/wiki/Parsing

Skip
 
