[2.5] Regex doesn't support MULTILINE?

Gilles Ganault

Hello

I'm trying to extract information from a web page using the Re module,
but it doesn't seem to support MULTILINE:

=============
import re

# No CRLF: works
response = "<b>Bla</b>blabla<font color=#123>"
# CRLF: doesn't work (this assignment replaces the one above)
response = "<b>Bla</b>blabla\r\n<font color=#123>"

pattern = "<b>Bla</b>.+?<font color=(.+?)>"

p = re.compile(pattern, re.IGNORECASE | re.MULTILINE)
m = p.search(response)

if m:
    print m.group(1)
else:
    print "Not found"
=============

Do I need to add something else to have Re work as intended?

Thank you.
 
Carsten Haese

BeautifulSoup is a lot simpler to use than RE's but a heck of a lot
slower. I ended up having to use RE's last time I had to scrape a lot
of pages.

True, but the OP said "extract information from a web page", not "from a
lot of pages." Until BeautifulSoup is actually too slow for that job,
going straight to RE is premature optimization.
 
Tony Meyer

I'm trying to extract information from a web page using the Re module,
but it doesn't seem to support MULTILINE: [...]
Do I need to add something else to have Re work as intended?

I believe you are looking for DOTALL, not MULTILINE. From the
documentation:

"""
M
MULTILINE
When specified, the pattern character "^" matches at the beginning of
the string and at the beginning of each line (immediately following
each newline); and the pattern character "$" matches at the end of the
string and at the end of each line (immediately preceding each
newline). By default, "^" matches only at the beginning of the string,
and "$" only at the end of the string and immediately before the
newline (if any) at the end of the string.

S
DOTALL
Make the "." special character match any character at all, including a
newline; without this flag, "." will match anything except a newline.
"""

If you do a lot of working with regular expressions, then I highly
recommend Kodos (http://kodos.sourceforge.net) as a tool for
interactively figuring out issues.

Cheers,
Tony Meyer
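
The difference between the two flags can be checked directly against the two sample strings from the original post. This is a minimal sketch, written for Python 3 (the thread itself uses Python 2.5 print statements); the helper name color() is made up for the illustration.

```python
import re

# The two inputs from the original post: without and with a CRLF before <font>.
one_line = "<b>Bla</b>blabla<font color=#123>"
two_lines = "<b>Bla</b>blabla\r\n<font color=#123>"

pattern = r"<b>Bla</b>.+?<font color=(.+?)>"

# MULTILINE only changes what ^ and $ match; DOTALL changes what "." matches.
multiline_re = re.compile(pattern, re.IGNORECASE | re.MULTILINE)
dotall_re = re.compile(pattern, re.IGNORECASE | re.DOTALL)


def color(regex, text):
    """Return the captured color value, or None if the pattern is not found."""
    m = regex.search(text)
    return m.group(1) if m else None


print(color(multiline_re, one_line))   # -> #123
print(color(multiline_re, two_lines))  # -> None ("." stops at the "\n")
print(color(dotall_re, two_lines))     # -> #123
```

The pattern has no ^ or $ in it, so re.MULTILINE has no effect here at all; only re.DOTALL lets ".+?" cross the line break.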
 
Gilles Ganault

That's your problem right there. RE is not the right tool for that job.
Use an actual HTML parser such as BeautifulSoup

Thanks a lot for the tip. I tried it, and it does look interesting,
although I've been unsuccessful using a regex with BS to find all
occurrences of the pattern.

Incidentally, as far as using re alone is concerned, it appears that
re.MULTILINE isn't enough to get "." to match newlines: re.DOTALL
must be added.

Problem is, when I add re.DOTALL, the search takes less than a second
for a 500KB file... and about 1 min 30 s for a file that's 1MB, with both
files holding similar contents.

Why such a huge difference in performance?

========= Using Re =============
import re
import time

pattern = "<span class=.?defaut.?>(\d+:\d+).*?</span>"

pages = ["500KB.html", "1MB.html"]

# Veeeeeeeeeeery slow when parsing 1MB file!
p = re.compile(pattern, re.IGNORECASE | re.MULTILINE | re.DOTALL)
#p = re.compile(pattern, re.IGNORECASE | re.MULTILINE)

for page in pages:
    f = open(page, "r")
    response = f.read()
    f.close()

    start = time.strftime("%H:%M:%S", time.localtime(time.time()))
    print "before findall @ " + start
    packed = p.findall(response)
    if packed:
        for item in packed:
            print item
===========================

Thank you.
 
Jay Loden

Gilles said:
Problem is, when I add re.DOTALL, the search takes less than a second
for a 500KB file... and about 1 min 30 s for a file that's 1MB, with both
files holding similar contents.

Why such a huge difference in performance?

========= Using Re =============
import re
import time

pattern = "<span class=.?defaut.?>(\d+:\d+).*?</span>"

pages = ["500KB.html", "1MB.html"]

# Veeeeeeeeeeery slow when parsing 1MB file!
p = re.compile(pattern, re.IGNORECASE | re.MULTILINE | re.DOTALL)
#p = re.compile(pattern, re.IGNORECASE | re.MULTILINE)

for page in pages:
    f = open(page, "r")
    response = f.read()
    f.close()

    start = time.strftime("%H:%M:%S", time.localtime(time.time()))
    print "before findall @ " + start
    packed = p.findall(response)
    if packed:
        for item in packed:
            print item
===========================

I don't know if it'll result in a performance difference, but since you're only saving the result of p.findall() to a variable in order to iterate over it, you might as well use p.finditer() instead:

for m in p.finditer(response):
    print m.group(1)

At least then it can start printing as soon as it hits a match instead of needing to find all the matches first. Note that finditer() yields match objects rather than strings, hence the m.group(1).
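
To make the findall()/finditer() difference concrete, here is a small sketch in Python 3. The sample HTML is invented for the illustration; only the pattern comes from the thread.

```python
import re

# Made-up sample standing in for the downloaded page.
sample = (
    '<span class="defaut">13:34 foo</span>\n'
    '<span class="defaut">08:05 bar</span>\n'
)

p = re.compile(r"<span class=.?defaut.?>(\d+:\d+).*?</span>",
               re.IGNORECASE | re.DOTALL)

# findall() scans the whole string first and returns a list of the
# captured group (plain strings, because the pattern has one group).
all_at_once = p.findall(sample)

# finditer() returns an iterator of match objects, produced one at a time
# as the scan advances; the captured text is read with group(1).
lazily = [m.group(1) for m in p.finditer(sample)]

print(all_at_once)  # -> ['13:34', '08:05']
print(lazily)       # -> ['13:34', '08:05']
```

Both calls do the same matching work overall, so finditer() is unlikely to fix a slow pattern; it only lets results appear before the scan has finished.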

-Jay
 
irstas

Thanks a lot for the tip. I tried it, and it does look interesting,
although I've been unsuccessful using a regex with BS to find all
occurrences of the pattern.

Incidentally, as far as using re alone is concerned, it appears that
re.MULTILINE isn't enough to get "." to match newlines: re.DOTALL
must be added.

Problem is, when I add re.DOTALL, the search takes less than a second
for a 500KB file... and about 1 min 30 s for a file that's 1MB, with both
files holding similar contents.

Why such a huge difference in performance?

pattern = "<span class=.?defaut.?>(\d+:\d+).*?</span>"

That .*? can really slow it down if the following pattern
can't be found. It may end up looking until the end of the file for
a proper continuation of the pattern, fail, and then start again.
Without DOTALL it would only look until the end of the line, so
performance would stay bearable. Your 1MB file might have, for
example,
'<span class=defaut>13:34< /span>'*10000 as its contents. Because
the < /span> doesn't match </span>, it would end up looking till
the end of the file for </span> and not finding it. And then move
on to the next occurrence of '<span class=...' and see if it has better
luck finding a pattern there. That's an example of a situation where
the pattern matcher would become very slow. I'd have to see the 1MB
file's contents to better guess what goes wrong.
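
A rough way to see the effect described above, assuming Python 3 and a synthetic input modeled on the '< /span>' example. The repetition counts are arbitrary and much smaller than the real page, and the names broken_unit and timed_findall are made up for this sketch. Timings will vary by machine, so none are shown.

```python
import re
import time

lazy_dotall = re.compile(r"<span class=.?defaut.?>(\d+:\d+).*?</span>",
                         re.DOTALL)

# Every span is closed with '< /span>', so '</span>' never appears and each
# match attempt has to scan to the end of the text before giving up.
broken_unit = "<span class=defaut>13:34< /span>\n"


def timed_findall(regex, text):
    """Run regex.findall(text) and return (matches, elapsed seconds)."""
    start = time.perf_counter()
    found = regex.findall(text)
    return found, time.perf_counter() - start


# Doubling the input doubles both the number of starting points and the
# distance scanned from each one, so the time grows much faster than the size.
for repeats in (250, 500, 1000):
    found, elapsed = timed_findall(lazy_dotall, broken_unit * repeats)
    print("%5d spans: %d matches in %.4f s" % (repeats, len(found), elapsed))
```

If the real 1MB page contains even a stretch of spans with no reachable closing tag, the same kind of growth would be a plausible explanation for the jump from under a second to a minute and a half.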

If the span's contents don't have nested elements (like <i></i>),
you could maybe use a negated character class:

"<span class=.?defaut.?>(\d+:\d+)[^<]*</span>"

This pattern should be very fast for all inputs because the [^<]*
can't match stuff indefinitely until the end of the file, only until
the next HTML element comes around. Or if you don't care about anything
but those numbers, you should just match this:

"<span class=.?defaut.?>(\d+:\d+)"
 
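
The three variants discussed in this post behave differently on nested or malformed spans. The sketch below (Python 3) uses the class name defaut from the original post's pattern and an invented three-line sample; the variable names are made up for the illustration.

```python
import re

lazy_dotall = re.compile(r"<span class=.?defaut.?>(\d+:\d+).*?</span>",
                         re.DOTALL)
bounded = re.compile(r"<span class=.?defaut.?>(\d+:\d+)[^<]*</span>")
prefix_only = re.compile(r"<span class=.?defaut.?>(\d+:\d+)")

sample = (
    "<span class=defaut>13:34 plain text</span>\n"     # simple span
    "<span class=defaut>08:05 <i>nested</i></span>\n"  # nested element
    "<span class=defaut>21:10< /span>\n"               # malformed close tag
)

# .*? crosses the nested <i>, but finds no </span> for the last span.
print(lazy_dotall.findall(sample))  # -> ['13:34', '08:05']

# [^<]* stops at the first "<", so the nested span is skipped as well.
print(bounded.findall(sample))      # -> ['13:34']

# With no closing tag required, every time value is captured.
print(prefix_only.findall(sample))  # -> ['13:34', '08:05', '21:10']
```

Which variant is right depends on whether the text after the time matters; if only the numbers are needed, the shortest pattern is also the one with nothing left to scan for.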

Gabriel Genellina

Incidentally, as far as using re alone is concerned, it appears that
re.MULTILINE isn't enough to get "." to match newlines: re.DOTALL
must be added.

Problem is, when I add re.DOTALL, the search takes less than a second
for a 500KB file... and about 1 min 30 s for a file that's 1MB, with both
files holding similar contents.

Why such a huge difference in performance?

pattern = "<span class=.?defaut.?>(\d+:\d+).*?</span>"

Try to avoid using ".*" and ".+" (even the non greedy forms); in this
case, I think you want the scan to stop when it reaches the ending </span>
or any other tag, so use: [^<]* instead.

BTW, better to use a raw string to represent the pattern: pattern =
r"...\d+..."
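
The raw-string advice matters most for escapes that Python itself understands. "\d" happens to survive in an ordinary string because Python leaves unknown escapes alone (recent Python 3 releases warn about it), but "\b" does not. A short Python 3 sketch, with an invented test sentence:

```python
import re

plain = "\bcat\b"   # Python turns each "\b" into a backspace character (0x08)
raw = r"\bcat\b"    # backslash + "b" reaches re intact and means "word boundary"

print(len(plain), len(raw))                  # -> 5 7
print(re.search(plain, "a cat sat"))         # -> None
print(re.search(raw, "a cat sat").group(0))  # -> cat
```

Writing every pattern as r"..." avoids having to remember which escapes fall into which group.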
 
Gilles Ganault

Try to avoid using ".*" and ".+" (even the non greedy forms); in this
case, I think you want the scan to stop when it reaches the ending </span>
or any other tag, so use: [^<]* instead.

BTW, better to use a raw string to represent the pattern: pattern =
r"...\d+..."

Thanks everyone for the help. It did improve things significantly :)
 
