Does Python mess with CRLFs?

Gilles Ganault · Nov 12, 2008

Hello

I'm stuck trying to understand why Python can't extract a fragment of an
HTML file using regexes, although I can find it just fine with
UltraEdit.

I wonder if Python rewrites CRLFs when reading a text file with
open/read?

Here's the code:
==========
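import re

f = open("content.html", "r")
content = f.read()
f.close()

# BAD: the pattern with explicit CRLF line endings is not found
friends = re.compile(r'</td></tr></table>\r\n</div>\r\n',
                     re.IGNORECASE | re.MULTILINE | re.DOTALL)

# GOOD: the same pattern without the line endings is found
# (this assignment replaces the one above)
friends = re.compile(r'</td></tr></table>',
                     re.IGNORECASE | re.MULTILINE | re.DOTALL)

m = friends.search(content)
if m:
    print("Found")
else:
    print("List not found")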
==========

Thank you for any tip.

Gilles Ganault · Nov 12, 2008

Gilles said:
I wonder if Python rewrites CRLFs when reading a text file with
open/read?

For those seeing the same thing, the answer is yes: On Windows, the
code above turns CRLF into LF. I tried "rb" instead of "r", with no
difference.

John Machin · Nov 12, 2008

Gilles said:
Hello

I'm stuck trying to understand why Python can't extract a fragment of an
HTML file using regexes, although I can find it just fine with
UltraEdit.

I wonder if Python rewrites CRLFs when reading a text file with
open/read?

Don't wonder; do some very elementary debugging and find out for
yourself.

Gilles said:
Here's the code:
==========
f = open("content.html", "r")
content = f.read()
f.close()

Consider inserting
print repr(content)
here.
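
A minimal sketch of why repr() helps here, assuming Python 3 syntax (the thread itself uses the Python 2 print statement). The string below is a hypothetical stand-in for what f.read() might return, not the real file, and count_line_endings is an illustrative helper rather than anything from the thread.

```python
# Hypothetical stand-in for what f.read() might return; not the real file.
content = "</td></tr></table>\n</div>\n"

# print() renders line breaks, so "\r\n" and "\n" look identical on screen.
print(content)

# repr() escapes control characters, so every "\r" and "\n" is spelled out.
print(repr(content))


def count_line_endings(text):
    """Return (crlf_count, bare_lf_count) for the given string."""
    crlf = text.count("\r\n")
    bare_lf = text.count("\n") - crlf
    return crlf, bare_lf


print(count_line_endings(content))
```

If the repr() output shows only \n where the editor showed CRLF, the line endings changed somewhere between the file on disk and the string being searched.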

Irmen de Jong · Nov 12, 2008

Gilles said:
For those seeing the same thing, the answer is yes: On Windows, the
code above turns CRLF into LF. I tried "rb" instead of "r", with no
difference.

Sorry but that is not what's happening. Your problem is not in reading the
file, it's in the regular expression you're using.

Using open with the "rb" flag leaves the file content intact and does not munge newlines
in any way. A read() will return the exact bytes that are in the file.
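
The difference between the two modes can be checked with a small experiment. This is a sketch assuming Python 3, where text mode with default settings translates CRLF to LF on every platform; in the Python 2 of this thread, plain "r" did that translation on Windows only. The helper name read_both_ways and the sample bytes are made up for illustration.

```python
import os
import tempfile


def read_both_ways(raw_bytes):
    """Write raw_bytes to a temporary file, then read it back in text mode and in binary mode."""
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw_bytes)
        # Text mode: default newline handling translates "\r\n" to "\n".
        with open(path, "r", encoding="ascii") as f:
            as_text = f.read()
        # Binary mode: the bytes come back exactly as stored.
        with open(path, "rb") as f:
            as_bytes = f.read()
    finally:
        os.remove(path)
    return as_text, as_bytes


as_text, as_bytes = read_both_ways(b"</table>\r\n</div>\r\n")
print(repr(as_text))
print(repr(as_bytes))
```

If "rb" appears to make no difference, it is worth confirming with repr() that the file on disk really contains CRLF pairs.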

--irmen

Irmen de Jong · Nov 12, 2008

Gilles said:
Hello

I'm stuck trying to understand why Python can't extract a fragment of an
HTML file using regexes, although I can find it just fine with
UltraEdit.

#BAD
friends = re.compile('</td></tr></table>\r\n</div>\r\n',
                     re.IGNORECASE | re.MULTILINE | re.DOTALL)

If you keep running into trouble and you're sure it's related to the newlines,
it may help to use the whitespace class \s instead of \r\n in your expression:
re.compile('</td></tr></table>\\s*</div>\\s*', .... )
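
A short sketch comparing the two patterns, assuming Python 3. The sample fragments are hypothetical, since the real content.html is not shown in the thread, and only re.IGNORECASE is kept because neither pattern uses ^, $ or a dot.

```python
import re

# The pattern from the original post, with literal CRLF line endings.
strict = re.compile(r"</td></tr></table>\r\n</div>\r\n", re.IGNORECASE)

# The suggested variant: \s* matches any run of whitespace, including none.
tolerant = re.compile(r"</td></tr></table>\s*</div>\s*", re.IGNORECASE)

# Hypothetical fragments; the real content.html is not shown in the thread.
crlf_text = "<tr><td>x</td></tr></table>\r\n</div>\r\n"
lf_text = "<tr><td>x</td></tr></table>\n</div>\n"

for label, text in (("CRLF", crlf_text), ("LF", lf_text)):
    print(label, bool(strict.search(text)), bool(tolerant.search(text)))
```

The \s* version matches both fragments, so it keeps working whether or not the line endings were translated on the way in.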

Other than that, hard to say what's not working as expected without knowing
the exact contents of the "content.html" file you're searching in....

--irmen
