Does Python mess with CRLFs?

Gilles Ganault

Hello

I'm stuck at understanding why Python can't extract some bit from an
HTML file using regexes, although I can find it just fine with
UltraEdit.

I wonder if Python rewrites CRLFs when reading a text file with
open/read?

Here's the code:
==========
import re

# Read the whole file as text.
f = open("content.html", "r")
content = f.read()
f.close()

#BAD (no match)
friends = re.compile('</td></tr></table>\r\n</div>\r\n',
                     re.IGNORECASE | re.MULTILINE | re.DOTALL)

#GOOD (matches)
friends = re.compile('</td></tr></table>',
                     re.IGNORECASE | re.MULTILINE | re.DOTALL)

m = friends.search(content)
if m:
    print("Found")
else:
    print("List not found")
==========

Thank you for any tip.
 
Gilles Ganault

Gilles said:
I wonder if Python rewrites CRLFs when reading a text file with
open/read?

For those seeing the same thing, the answer is yes: On Windows, the
code above turns CRLF into LF. I tried "rb" instead of "r", with no
difference.
 
John Machin

Gilles said:
Hello

I'm stuck at understanding why Python can't extract some bit from an
HTML file using regexes, although I can find it just fine with
UltraEdit.

I wonder if Python rewrites CRLFs when reading a text file with
open/read?

Don't wonder; do some very elementary debugging and find out for
yourself.

Gilles said:
Here's the code:
==========
f = open("content.html", "r")
content = f.read()
f.close()
==========

Consider inserting
print(repr(content))
here.
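
To illustrate that suggestion, here is a minimal sketch of how repr() makes line endings visible. It assumes Python 3 (print as a function), and the sample strings and the describe_newlines helper are made up for illustration; they are not the contents of content.html.

```python
# Hypothetical sample data; the real content.html is not shown in this thread.
with_crlf = "</td></tr></table>\r\n</div>\r\n"
with_lf = "</td></tr></table>\n</div>\n"


def describe_newlines(text):
    """Return (CRLF pairs, bare LFs) so the line-ending style is explicit."""
    crlf = text.count("\r\n")
    bare_lf = text.count("\n") - crlf
    return crlf, bare_lf


# repr() prints escape sequences such as \r\n instead of rendering them
# as invisible line breaks, so the two strings can be told apart.
print(repr(with_crlf))
print(repr(with_lf))
print(describe_newlines(with_crlf))
print(describe_newlines(with_lf))
```

If the output for the real file shows only \n where the pattern expects \r\n, the pattern cannot match.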
 
Irmen de Jong

Gilles said:
For those seeing the same thing, the answer is yes: On Windows, the
code above turns CRLF into LF. I tried "rb" instead of "r", with no
difference.

Sorry but that is not what's happening. Your problem is not in reading the
file, it's in the regular expression you're using.

Using open with the "rb" flag leaves the file content intact and does not munge newlines
in any way. A read() will return the exact bytes that are in the file.
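
A small sketch of the difference between the two modes, for anyone who wants to check it. It assumes Python 3, where text mode translates CRLF to LF by default on every platform (the code in this thread is Python 2, where that translation happens on Windows). The temporary file and its contents are hypothetical stand-ins for content.html.

```python
import os
import tempfile

# Hypothetical sample bytes with Windows line endings.
raw = b"</td></tr></table>\r\n</div>\r\n"

fd, path = tempfile.mkstemp(suffix=".html")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(raw)

    # Binary mode: the exact bytes in the file, newlines untouched.
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode with default newline handling: "\r\n" becomes "\n".
    with open(path, "r", encoding="ascii") as f:
        as_text = f.read()

    # Text mode with newline="": no translation is applied.
    with open(path, "r", encoding="ascii", newline="") as f:
        as_text_untranslated = f.read()
finally:
    os.remove(path)

print(repr(as_bytes))
print(repr(as_text))
print(repr(as_text_untranslated))
```

A pattern containing a literal \r\n can only match data read the first or third way.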

--irmen
 
Irmen de Jong

Gilles said:
Hello

I'm stuck at understanding why Python can't extract some bit from an
HTML file using regexes, although I can find it just fine with
UltraEdit.

#BAD
friends = re.compile('</td></tr></table>\r\n</div>\r\n',
                     re.IGNORECASE | re.MULTILINE | re.DOTALL)

If you keep running into trouble and you're sure it's related to the newlines,
it may help to use the whitespace class \s instead of \r\n in your expression:
re.compile('</td></tr></table>\\s*</div>\\s*', .... )

Other than that, hard to say what's not working as expected without knowing
the exact contents of the "content.html" file you're searching in....
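
As a rough illustration of the \s* idea, the sketch below runs both patterns against made-up sample strings (not the real content.html) with CRLF, LF, and no line endings. It assumes Python 3. The names samples, tolerant and strict are hypothetical.

```python
import re

# Hypothetical inputs covering the three line-ending cases.
samples = {
    "crlf": "<td>x</td></tr></table>\r\n</div>\r\n",
    "lf": "<td>x</td></tr></table>\n</div>\n",
    "none": "<td>x</td></tr></table></div>",
}

# \s* matches any run of whitespace, including "\r\n", "\n", or nothing.
tolerant = re.compile(r"</td></tr></table>\s*</div>\s*", re.IGNORECASE)

# The pattern from the original post requires a literal CR LF pair.
strict = re.compile("</td></tr></table>\r\n</div>\r\n", re.IGNORECASE)

tolerant_results = {name: bool(tolerant.search(text))
                    for name, text in samples.items()}
strict_results = {name: bool(strict.search(text))
                  for name, text in samples.items()}

print(tolerant_results)
print(strict_results)
```

re.MULTILINE and re.DOTALL are left out here because these patterns contain no ^, $ or ., so those flags would not change the result.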

--irmen
 
