[2.5] Regex doesn't support MULTILINE?

Gilles Ganault · Jul 22, 2007

Hello

I'm trying to extract information from a web page using the Re module,
but it doesn't seem to support MULTILINE:

=============
import re

# No CRLF: works
response = "Blablabla"
# CRLF: doesn't work
response = "Blablabla\r\n"

pattern = "Bla.+?"

p = re.compile(pattern, re.IGNORECASE | re.MULTILINE)
m = p.search(response)

if m:
    # The pattern as posted has no capturing group, so print the whole match
    print m.group(0)
else:
    print "Not found"
=============

Do I need to add something else to have Re work as intended?

Thank you.

Carsten Haese · Jul 22, 2007

Gilles Ganault said:
I'm trying to extract information from a web page using the Re module,

That's your problem right there. RE is not the right tool for that job.
Use an actual HTML parser such as BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/) and your life will be
much easier.

HTH,
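
To make the "use a parser" suggestion concrete without a third-party install, here is a minimal sketch that uses the standard library's `html.parser` in place of BeautifulSoup (a swapped-in tool, not what the post recommends). The `SpanTextCollector` class, the `<span>` target, and the sample markup are all assumptions made for illustration; the thread never shows the real page. It is written for current Python 3 rather than 2.5.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect text found inside <span> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <span> elements currently open
        self.texts = []   # non-empty text chunks seen inside a <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.texts.append(data.strip())


# Made-up sample page with CRLF line endings, not the poster's data.
sample = ('<p>Start</p>\r\n<span class="time">13:34</span>\r\n'
          '<span class="time">07:05</span>')

collector = SpanTextCollector()
collector.feed(sample)
collector.close()
print(collector.texts)
```

Line endings never come into play here because the parser works on elements rather than on lines.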

Paul Rubin · Jul 22, 2007

Carsten Haese said:
Use an actual HTML parser such as BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/) and your life will be
much easier.

BeautifulSoup is a lot simpler to use than RE's but a heck of a lot
slower. I ended up having to use RE's last time I had to scrape a lot
of pages.

Carsten Haese · Jul 22, 2007

Paul Rubin said:
BeautifulSoup is a lot simpler to use than RE's but a heck of a lot
slower. I ended up having to use RE's last time I had to scrape a lot
of pages.

True, but the OP said "extract information from a web page", not "from a
lot of pages." Until BeautifulSoup is actually too slow for that job,
going straight to RE is premature optimization.

Tony Meyer · Jul 22, 2007

Gilles Ganault said:
I'm trying to extract information from a web page using the Re module,
but it doesn't seem to support MULTILINE: [...]
Do I need to add something else to have Re work as intended?

I believe you are looking for DOTALL, not MULTILINE. From the
documentation:

"""
M
MULTILINE
When specified, the pattern character "^" matches at the beginning of
the string and at the beginning of each line (immediately following
each newline); and the pattern character "$" matches at the end of the
string and at the end of each line (immediately preceding each
newline). By default, "^" matches only at the beginning of the string,
and "$" only at the end of the string and immediately before the
newline (if any) at the end of the string.

S
DOTALL
Make the "." special character match any character at all, including a
newline; without this flag, "." will match anything except a newline.
"""
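
The difference between the two flags is easy to check interactively. The following is a small sketch on a made-up two-line string (the sample text and variable names are assumptions), written for current Python 3:

```python
import re

text = "first line\nsecond line\n"   # made-up sample

# No flags: "^" anchors only at the start, and "." stops at a newline.
starts_default = re.findall(r"^\w+", text)
across_default = re.search(r"first.+second", text)

# MULTILINE: "^" also matches after each newline; "." is unchanged.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
across_multiline = re.search(r"first.+second", text, re.MULTILINE)

# DOTALL: "." may match a newline, so the pattern can span lines.
across_dotall = re.search(r"first.+second", text, re.DOTALL)

print(starts_default)                      # ['first']
print(starts_multiline)                    # ['first', 'second']
print(across_default, across_multiline)    # None None
print(repr(across_dotall.group(0)))        # 'first line\nsecond'
```

Only DOTALL changes what "." can consume; MULTILINE only moves the "^" and "$" anchors.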

If you do a lot of working with regular expressions, then I highly
recommend Kodos (http://kodos.sourceforge.net) as a tool for
interactively figuring out issues.

Cheers,
Tony Meyer

Gilles Ganault · Jul 22, 2007

Carsten Haese said:
That's your problem right there. RE is not the right tool for that job.
Use an actual HTML parser such as BeautifulSoup

Thanks a lot for the tip. I tried it, and it does look interesting,
although I haven't managed to use a regex with BS to find all
occurrences of the pattern.

Incidentally, as far as using Re alone is concerned, it appears that
re.MULTILINE isn't enough to get Re to include newlines: re.DOTALL
must be added.

Problem is, when I add re.DOTALL, the search takes less than a second
for a 500KB file... and about 1 min 30 s for a file that's 1MB, with both
files holding similar contents.

Why such a huge difference in performance?

========= Using Re =============
import re
import time

pattern = "(\d+:\d+).*?"

pages = ["500KB.html", "1MB.html"]

# Veeeeeeeeeeery slow when parsing 1MB file !
p = re.compile(pattern, re.IGNORECASE | re.MULTILINE | re.DOTALL)
#p = re.compile(pattern, re.IGNORECASE | re.MULTILINE)

for page in pages:
    f = open(page, "r")
    response = f.read()
    f.close()

    start = time.strftime("%H:%M:%S", time.localtime(time.time()))
    print "before findall @ " + start
    packed = p.findall(response)
    if packed:
        for item in packed:
            print item
===========================

Thank you.

Jay Loden · Jul 22, 2007

Gilles said:
Problem is, when I add re.DOTALL, the search takes less than a second
for a 500KB file... and about 1 min 30 s for a file that's 1MB, with both
files holding similar contents.

Why such a huge difference in performance?

I don't know if it'll result in a performance difference, but since you're just saving the result of re.findall() to a variable in order to iterate over it, you might as well just use re.finditer() instead:

for item in p.finditer(response):
    # finditer yields match objects, so pull out the captured group
    print item.group(1)

At least then it can start printing as soon as it hits a match instead of needing to find all the matches first.
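
A short sketch of the difference, on a made-up string (the sample text and names are assumptions; current Python 3 syntax): `findall` builds the full list of captured strings before returning, while `finditer` returns an iterator that produces one match object at a time.

```python
import re

p = re.compile(r"(\d+:\d+)")
response = "dep 13:34\narr 14:02\n"   # made-up sample text

# findall: the whole list of group-1 strings is built up front.
all_at_once = p.findall(response)
print(all_at_once)

# finditer: matches are produced on demand as match objects.
lazy = p.finditer(response)
first = next(lazy)                     # only the first match is found so far
print(first.group(1), first.span())
rest = [m.group(1) for m in lazy]      # consume the remaining matches
print(rest)
```

Note that iterating lazily changes when results appear, not how much work each match attempt costs.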

-Jay

irstas · Jul 22, 2007

Gilles Ganault said:
Thanks a lot for the tip. I tried it, and it does look interesting,
although I haven't managed to use a regex with BS to find all
occurrences of the pattern.

Incidentally, as far as using Re alone is concerned, it appears that
re.MULTILINE isn't enough to get Re to include newlines: re.DOTALL
must be added.

Problem is, when I add re.DOTALL, the search takes less than a second
for a 500KB file... and about 1 min 30 s for a file that's 1MB, with both
files holding similar contents.

Why such a huge difference in performance?

pattern = "(\d+:\d+).*?"

That .*? can really slow it down if the part of the pattern that
follows it can't be found. The matcher may end up looking until the end
of the file for a proper continuation of the pattern, fail, and then
start again at the next candidate. Without DOTALL it would only look
until the end of the line, so performance would stay bearable. Your 1MB
file might have, for example, '13:34'*10000 as its contents. If the text
after each time doesn't match the rest of the pattern, the matcher would
end up looking till the end of the file for that continuation and not
finding it. It would then move on to the next occurrence of
'<span class=...' and see if it has better luck finding a match there.
That's an example of a situation where the pattern matcher would become
very slow. I'd have to see the 1MB file's contents to better guess what
goes wrong.
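
The effect described above can be reproduced on synthetic input. This is only a sketch under stated assumptions: the pattern tail `<never-present>`, the repeated line, and the helper `time_findall` are all hypothetical stand-ins, since the real pattern and files are not shown in the thread. It uses current Python 3.

```python
import re
import time


def time_findall(pattern, flags, text):
    """Return (matches, seconds) for a single findall call."""
    compiled = re.compile(pattern, flags)
    start = time.perf_counter()
    matches = compiled.findall(text)
    return matches, time.perf_counter() - start


# Hypothetical pattern: the part after ".*?" never occurs in the text,
# so every candidate start forces a scan for it.
PATTERN = r"(\d+:\d+).*?<never-present>"
line = "<span>13:34</span>\n"

for repeats in (200, 400, 800):
    text = line * repeats
    _, per_line = time_findall(PATTERN, 0, text)            # scan stops at end of line
    _, whole_file = time_findall(PATTERN, re.DOTALL, text)  # scan runs to end of text
    print("%d lines: no DOTALL %.4fs, DOTALL %.4fs" % (repeats, per_line, whole_file))
```

With DOTALL, each failed attempt scans the rest of the text, so the total work should grow roughly with the square of the input size, while the per-line version grows roughly linearly. Exact timings depend on the machine and Python version.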

If the span's contents don't have nested elements,
you could maybe use a negated character class:

"(\d+:\d+)[^<]*"

This pattern should be very fast for all inputs because the [^<]*
can't match stuff indefinitely until the end of the file - only until
the next HTML element comes around. Or if you don't care about anything
but those numbers, you should just match this:

"(\d+:\d+)"
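
As a quick check of both suggestions, here is a sketch on a made-up fragment (the markup and variable names are assumptions, since the real page isn't shown; current Python 3 syntax):

```python
import re

# Made-up fragment, not the poster's data.
html = '<span class="t">13:34 local</span>\n<span class="t">07:05</span>\n'

with_tail = re.compile(r"(\d+:\d+)[^<]*")   # the tail stops at the next "<"
bare = re.compile(r"(\d+:\d+)")             # just the numbers

whole_matches = [m.group(0) for m in with_tail.finditer(html)]
times_with_tail = with_tail.findall(html)
times_bare = bare.findall(html)

print(whole_matches)     # ['13:34 local', '07:05']
print(times_with_tail)   # ['13:34', '07:05']
print(times_bare)        # ['13:34', '07:05']
```

Neither pattern needs DOTALL: a negated class such as [^<] already matches newlines, and it can never run past the next tag.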

Gabriel Genellina · Jul 22, 2007

Gilles Ganault said:
Incidentally, as far as using Re alone is concerned, it appears that
re.MULTILINE isn't enough to get Re to include newlines: re.DOTALL
must be added.

Problem is, when I add re.DOTALL, the search takes less than a second
for a 500KB file... and about 1 min 30 s for a file that's 1MB, with both
files holding similar contents.

Why such a huge difference in performance?

pattern = "(\d+:\d+).*?"

Try to avoid using ".*" and ".+" (even the non greedy forms); in this
case, I think you want the scan to stop when it reaches the closing tag
or any other tag, so use: [^<]* instead.

BTW, better to use a raw string to represent the pattern: pattern =
r"...\d+..."
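
A small illustration of why the raw string matters, using `\b` as the example escape (the sample strings are made up; current Python 3 syntax). In a normal string literal, Python itself interprets `\b` as a backspace character before the re module ever sees the pattern:

```python
import re

plain = "\bcat\b"     # Python turns each \b into a backspace character
raw = r"\bcat\b"      # backslashes reach the re module untouched

print(len(plain), len(raw))   # 5 7

plain_match = re.search(plain, "a cat sat")   # looks for backspace, "cat", backspace
raw_match = re.search(raw, "a cat sat")       # looks for "cat" at word boundaries

print(plain_match)            # None
print(raw_match.group(0))     # cat
```

A pattern like "\d+" happens to survive in a normal string because Python leaves that unrecognized escape alone, but relying on that is fragile, which is the point of the advice above.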

Gilles Ganault · Jul 24, 2007

Gabriel Genellina said:
Try to avoid using ".*" and ".+" (even the non greedy forms); in this
case, I think you want the scan to stop when it reaches the closing tag
or any other tag, so use: [^<]* instead.

BTW, better to use a raw string to represent the pattern: pattern =
r"...\d+..."

Thanks everyone for the help. It did improve things significantly.
