Does Python mess with CRLFs?

Discussion in 'Python' started by Gilles Ganault, Nov 12, 2008.

  1. Hello

    I'm stuck at understanding why Python can't extract some bit from an
    HTML file using regexes, although I can find it just fine with
    UltraEdit.

    I wonder if Python rewrites CRLFs when reading a text file with
    open/read?

    Here's the code:
    ==========
    import re

    f = open("content.html", "r")
    content = f.read()
    f.close()

    # BAD: never matches
    friends = re.compile('</td></tr></table>\r\n</div>\r\n',
                         re.IGNORECASE | re.MULTILINE | re.DOTALL)

    # GOOD: matches
    friends = re.compile('</td></tr></table>',
                         re.IGNORECASE | re.MULTILINE | re.DOTALL)

    m = friends.search(content)
    if m:
        print("Found")
    else:
        print("List not found")
    ==========

    Thank you for any tip.
     
    Gilles Ganault, Nov 12, 2008
    #1

  2. On Wed, 12 Nov 2008 12:04:07 +0100, Gilles Ganault <>
    wrote:
    >I wonder if Python rewrites CRLFs when reading a text file with
    >open/read?


    For those seeing the same thing, the answer is yes: On Windows, the
    code above turns CRLF into LF. I tried "rb" instead of "r", with no
    difference.
     
    Gilles Ganault, Nov 12, 2008
    #2

  3. On Nov 12, 10:04 pm, Gilles Ganault <> wrote:
    > Hello
    >
    > I'm stuck at understanding why Python can't extract some bit from an
    > HTML file using regexes, although I can find it just fine with
    > UltraEdit.
    >
    > I wonder if Python rewrites CRLFs when reading a text file with
    > open/read?


    Don't wonder; do some very elementary debugging and find out for
    yourself.

    > Here's the code:
    > ==========
    > f = open("content.html", "r")
    > content = f.read()
    > f.close()


    Consider inserting
    print repr(content)
    here.
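
    As a rough illustration of why repr() helps here: it prints control
    characters as escapes, so CRLF and LF line endings become visible.
    The sample string is made up for the example and is not the content
    of the poster's file; the print call uses the Python 3 form.

```python
# Hypothetical sample with one CRLF line ending and one bare LF.
content = "</td></tr></table>\r\n</div>\n"

# repr() renders "\r" and "\n" as visible two-character escapes.
shown = repr(content)
print(shown)
```

    If the output shows only \n where \r\n was expected, the pattern has
    to be adjusted to match what is actually in the string.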
     
    John Machin, Nov 12, 2008
    #3
  4. Gilles Ganault wrote:
    > On Wed, 12 Nov 2008 12:04:07 +0100, Gilles Ganault <>
    > wrote:
    >> I wonder if Python rewrites CRLFs when reading a text file with
    >> open/read?

    >
    > For those seeing the same thing, the answer is yes: On Windows, the
    > code above turns CRLF into LF. I tried "rb" instead of "r", with no
    > difference.


    Sorry, but that is not what's happening. Your problem is not in reading the
    file; it's in the regular expression you're using.

    Using open with the "rb" flag leaves the file content intact and does not munge newlines
    in any way. A read() will return the exact bytes that are in the file.
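
    A minimal sketch of the difference between the two modes, assuming
    Python 3, where text mode translates \r\n to \n on every platform by
    default (in the Python 2 of this thread, plain "r" mode did this on
    Windows only). The temporary file and its contents are stand-ins for
    content.html, not the real file.

```python
import os
import tempfile

# Hypothetical bytes standing in for the tail of content.html.
raw = b"</td></tr></table>\r\n</div>\r\n"

# Write the bytes to a temporary file exactly as given.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# Binary mode: read() returns the exact bytes stored in the file.
with open(path, "rb") as f:
    binary_content = f.read()

# Text mode with default newline handling: "\r\n" is translated to "\n".
with open(path, "r", encoding="ascii") as f:
    text_content = f.read()

# Text mode with newline="": line endings are passed through untranslated.
with open(path, "r", encoding="ascii", newline="") as f:
    untranslated = f.read()

os.remove(path)

print(repr(binary_content))
print(repr(text_content))
print(repr(untranslated))
```

    A pattern containing a literal \r\n can only match the first and third
    results; the second no longer contains any \r.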

    --irmen
     
    Irmen de Jong, Nov 12, 2008
    #4
  5. Gilles Ganault wrote:
    > Hello
    >
    > I'm stuck at understanding why Python can't extract some bit from an
    > HTML file using regexes, although I can find it just fine with
    > UltraEdit.
    >
    > #BAD
    > friends = re.compile('</td></tr></table>\r\n</div>\r\n',re.IGNORECASE
    > | re.MULTILINE | re.DOTALL)


    If you keep running into trouble and you're sure it's related to the newlines,
    it may help to use the whitespace class \s instead of \r\n in your expression:
    re.compile('</td></tr></table>\\s*</div>\\s*', .... )

    Other than that, hard to say what's not working as expected without knowing
    the exact contents of the "content.html" file you're searching in....
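
    To make the suggestion concrete, here is a small sketch comparing the
    original pattern with the \s* variant. Both sample strings are invented
    for the example, since the real content.html was never posted.

```python
import re

# Original pattern: requires a literal CR LF after each closing tag.
strict = re.compile('</td></tr></table>\r\n</div>\r\n', re.IGNORECASE)

# Suggested pattern: \s* accepts any run of whitespace, including none.
tolerant = re.compile(r'</td></tr></table>\s*</div>\s*', re.IGNORECASE)

# Hypothetical inputs: the same markup with CRLF and with LF line endings.
crlf_content = "<table><tr><td>x</td></tr></table>\r\n</div>\r\n"
lf_content = "<table><tr><td>x</td></tr></table>\n</div>\n"

for name, text in (("CRLF", crlf_content), ("LF", lf_content)):
    print(name, "strict:", bool(strict.search(text)),
          "tolerant:", bool(tolerant.search(text)))
```

    The tolerant pattern matches regardless of how the file was opened,
    which sidesteps the question of newline translation entirely.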

    --irmen
     
    Irmen de Jong, Nov 12, 2008
    #5
