file read, binary or text mode

Discussion in 'Python' started by Guyon Morée, Sep 24, 2004.

  1. Guyon Morée

    Guyon Morée Guest

    what is the difference?

    if I open a text file in binary (rb) mode, it doesn't matter... the read()
    output is the same.
     
    Guyon Morée, Sep 24, 2004
    #1

  2. Guyon Morée

    Askari Guest

    "Guyon Morée" <gumuz@NO_looze_SPAM.net> wrote in
    news:41540121$0$3891$:

    > what is the difference?
    >
    > if I open a text file in binary (rb) mode, it doesn't matter... the
    > read() output is the same.
    >
    >
    >
    >


    "rb" and "r" behave the same on a text file if it contains only ASCII
    characters (8-bit), but not if it contains Unicode characters (16-bit).
    In short, if you are sure that your file is ONLY text, use "r"; otherwise,
    always use "rb". Also, "r" does not read control characters other than
    "\n", "\t", etc.
     
    Askari, Sep 24, 2004
    #2

  3. Guyon Morée

    Peter Hansen Guest

    Guyon Morée wrote:
    > what is the difference?
    >
    > if I open a text file in binary (rb) mode, it doesn't matter... the read()
    > output is the same.


    If you are on Linux that's the case... or under other
    conditions. Maybe describing your platform and showing
    an example of what you're trying to do would be helpful.

    -Peter
     
    Peter Hansen, Sep 24, 2004
    #3
  4. On 2004-09-24, Guyon Morée <gumuz@NO_looze_SPAM.net> wrote:

    > what is the difference?


    42?

    > if I open a text file in binary (rb) mode, it doesn't matter... the read()
    > output is the same.


    OK...

    --
    Grant Edwards (grante at visi.com)
    Yow! They collapsed... like nuns in the street... they had no teenappeal!
     
    Grant Edwards, Sep 24, 2004
    #4
  5. Guyon Morée

    Guyon Morée Guest

    OK, I have Huffman encoding code.

    This is actually built for text, but because Python can also read a binary
    file as a string, this applies equally well :)

    But I was just wondering if this gives any problems if I use text-mode read
    for the binary files and vice versa.

    If I understand correctly now, using binary mode is _always_ safe, right?


    "Peter Hansen" <> wrote in message
    news:...
    > Guyon Morée wrote:
    > > what is the difference?
    > >
    > > if I open a text file in binary (rb) mode, it doesn't matter... the
    > > read() output is the same.
    >
    > If you are on Linux that's the case... or under other
    > conditions. Maybe describing your platform and showing
    > an example of what you're trying to do would be helpful.
    >
    > -Peter
     
    Guyon Morée, Sep 24, 2004
    #5
  6. Guyon Morée wrote:
    > what is the difference?


    On Unix/Linux, none.

    On Windows, binary mode is just that while text mode translates "\r\n"
    (or "\n\r", I always forget) to "\n" on input and vice-versa on output.

    I don't know about other platforms.

    > if I open a text file in binary (rb) mode, it doesn't matter... the read()
    > output is the same.


    Depends on your platform, and the format of the text file (Unix, Windows
    or other platform style line endings).

    --
    "Codito ergo sum"
    Roel Schroeven
     
    Roel Schroeven, Sep 24, 2004
    #6
  7. On 2004-09-24, Guyon Morée <gumuz@NO_looze_SPAM.net> wrote:

    > ok, i have huffman encoding code.


    You should open the file in binary.

    > this is actually built for text,


    All of the Huffman encoding implementations I've seen output
    binary, but I'll take your word for it.

    > but because python can also
    > read a binary file as a string, this applies equally well :)


    If the file contains printable text with cr/nl, nl, or cr line
    endings, then open it in text mode. Otherwise open it in
    binary mode.

    > but, i was just wondering if this gives any problems if I use
    > text-mode read for the binary files and vice versa.


    Yes, it will give you problems.

    > If I understand correctly now, using binary mode is _always_ safe, right?


    No.

    If it's text, open it in text mode. That way the line endings
    are handled properly.

    --
    Grant Edwards (grante at visi.com)
    Yow! I think I'll do BOTH if I can get RESIDUALS!!
     
    Grant Edwards, Sep 24, 2004
    #7
  8. Guyon Morée

    Peter Hansen Guest

    Guyon Morée wrote:
    > ok, i have huffman encoding code.
    >
    > this is actually built for text, but because Python can also read a binary
    > file as a string, this applies equally well :)
    >
    > but, i was just wondering if this gives any problems if I use text-mode read
    > for the binary files and vice versa.
    >
    > If I understand correctly now, using binary mode is _always_ safe, right?


    You're not helping a whole lot here. What platform are you using?
    I'll assume from the headers in your message that it's Windows.
    If that's true, then forget about text and binary and ASCII for
    a moment, and just consider this.

    If you open a file on Windows using "r" or "rt" or the default (which
    is "r"), then when you read the file any occurrences of the byte
    sequence 13 followed by 10 (that is, CR LF or \r\n or whatever you want
    to call it) will be replaced as the file is read by just the 10, or the
    LF, or the \n, or whatever you want to call it.

    If you use "rb" instead of just "r" or the default, then this
    translation will not occur and you will retrieve all bytes in
    the file just as they are stored there.
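
The two cases above can be tried out with a minimal sketch in present-day Python 3 (an assumption, since the thread predates it). Python 3's default text mode translates line endings on every platform, so the effect described for Windows can be reproduced anywhere. The file name `crlf_demo.txt` is made up for the example.

```python
import os
import tempfile

# Hypothetical scratch file; the name is only for illustration.
path = os.path.join(tempfile.mkdtemp(), "crlf_demo.txt")

# Store CR LF line endings exactly as given, using binary mode.
with open(path, "wb") as f:
    f.write(b"first\r\nsecond\r\n")

# Binary mode returns the stored bytes untouched.
with open(path, "rb") as f:
    raw = f.read()

# Default text mode translates "\r\n" to "\n" while reading.
with open(path, "r", encoding="ascii") as f:
    text = f.read()

print(repr(raw))
print(repr(text))
```

Comparing the two printed values shows the 13/10 pairs surviving in the "rb" result and collapsing to a single 10 in the "r" result.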

    It's up to you to pick the behaviour you need. Saying it's
    "huffman encoding code" doesn't really help, since that doesn't
    refer to any universal standard data representation. It
    seems likely that it's binary (i.e. the translation provided by
    not using "rb" is undesirable), but nobody here knows where you
    got that file or what it contains.

    And in case that doesn't answer the questions above: (1) yes,
    it can definitely give problems reading text files as binary
    and vice versa, and (2) binary mode applies whenever "b" is
    used on Windows, and not otherwise, so if you save a file without
    using "wb" you will get the same translation as above but in
    the reverse direction (LF or \n gets turned into CR LF or \r\n
    on output).

    -Peter
     
    Peter Hansen, Sep 24, 2004
    #8
  9. Guyon Morée wrote:

    > ok, i have huffman encoding code.
    >
    > this is actually built for text, but because Python can also read a binary
    > file as a string, this applies equally well :)
    >
    > but, i was just wondering if this gives any problems if I use text-mode read
    > for the binary files and vice versa.
    >
    > If I understand correctly now, using binary mode is _always_ safe, right?


    It's safe in the sense that everything goes out exactly as it came in.
    For example, gzip uses binary mode even when compressing text files. The
    files may be text, but gzip doesn't care about that. It doesn't care
    about words, sentences and line endings, but it does care about
    representing exactly the bytes that are in the file.

    Editors, diff, wc, ... use text mode.
    cp, tar, gzip, ... use binary mode.
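
A small sketch of that byte-for-byte guarantee, using Python 3's standard `gzip` module on an in-memory byte string. The sample data is invented; the point is only that a compressor treats its input as bytes, never as lines.

```python
import gzip

# Text with Windows-style line endings, held as raw bytes.
original = b"line one\r\nline two\r\n"

# gzip works on bytes and knows nothing about lines or characters.
compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

# Every byte, including each "\r", survives the round trip.
print(restored == original)
```

The same reasoning would apply to a Huffman coder: if it reads and writes in binary mode, what comes out after decoding is exactly what went in.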

    --
    "Codito ergo sum"
    Roel Schroeven
     
    Roel Schroeven, Sep 24, 2004
    #9
  10. Guyon Morée

    Terry Reedy Guest

    "Askari" <> wrote in message
    news:Xns956E4CDA892D7askariaddressNonVali@207.35.177.135...
    > "Guyon Morée" <gumuz@NO_looze_SPAM.net> wrote in
    > news:41540121$0$3891$:
    >
    > "rb" and "r" behave the same on a text file if it contains only ASCII
    > characters (8-bit), but not if it contains Unicode characters (16-bit).
    > In short, if you are sure that your file is ONLY text, use "r"; otherwise,
    > always use "rb". Also, "r" does not read control characters other than
    > "\n", "\t", etc.


    Newbies, ignore this confusion.

    On Windows, text mode autoconverts \r\n to \n on input and vice versa on
    output. I believe that that is all the difference. Period.

    Terry J. Reedy
     
    Terry Reedy, Sep 24, 2004
    #10
  11. Guyon Morée

    Ralf Schmitt Guest

    "Terry Reedy" <> writes:

    >
    > Newbies, ignore this confusion.
    >
    > On Windows, text mode autoconverts \r\n to \n on input and vice versa on
    > output. I believe that that is all the difference. Period.
    >


    That's not quite the case. As always windows sucks big time:

    $ cat bla.py
    open("b.txt", "w").write("bla\x1a")
    print len(open("b.txt", "rb").read())
    open("b.txt", "a+")
    print len(open("b.txt", "rb").read())

    ralf@CRACK ~
    $ python bla.py
    4
    3


    The last character gets stripped if it's 0x1a when opening a file for
    appending in text mode. I remember this from a posting on the metakit
    mailing list. The poor guy corrupted his databases while he wanted to
    check for write access:
    http://www.equi4.com/pipermail/metakit/2003-October/001497.html

    - Ralf

    --
    brainbot technologies ag
    boppstrasse 64 . 55118 mainz . germany
    fon +49 6131 211639-1 . fax +49 6131 211639-2
    http://brainbot.com/ mailto:
     
    Ralf Schmitt, Sep 24, 2004
    #11
  12. Guyon Morée

    Peter Hansen Guest

    Ralf Schmitt wrote:
    > "Terry Reedy" <> writes:
    >>On Windows, text mode autoconverts \r\n to \n on input and vice versa on
    >>output. I believe that that is all the difference. Period.

    >
    > That's not quite the case. As always windows sucks big time:

    [snip example with ^Z]
    > The last character gets stripped if it's 0x1a when opening a file for
    > appending in text mode.


    Good point. Note for the picky: it doesn't just get stripped... it
    *is* the last character, even if there's data following. Or to
    be blunt, ^Z (byte value 26) is treated as EOF on Windows when not
    using binary mode to read files.
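
To make the consequence concrete, here is a rough emulation in Python 3 of a text-mode read as described above. It works on an in-memory byte string instead of a real file so that it behaves the same on every platform; `legacy_text_read` is a made-up name, not a function from Python or the C runtime.

```python
def legacy_text_read(raw: bytes) -> bytes:
    """Roughly mimic a text-mode read as described in this thread.

    Hypothetical helper, not part of Python or the C runtime.
    """
    # Everything from the first Ctrl-Z (0x1A) onward is ignored.
    end = raw.find(b"\x1a")
    if end != -1:
        raw = raw[:end]
    # CR LF pairs collapse to a single LF.
    return raw.replace(b"\r\n", b"\n")


stored = b"bla\r\n\x1ahidden data"
visible = legacy_text_read(stored)
print(visible)
```

The bytes after the ^Z are still in `stored`, but a reader that honours the marker never sees them, which is why binary data containing byte 26 must not be read this way.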

    I suspect Terry and others (including me) overlooked this because
    ^Z is pretty much obsolete, and since few applications *write*
    ^Z as the last character of text files any more, almost nobody
    bothers to remember that text mode is slightly more complicated
    than just the CR LF to LF conversion and back.

    -Peter
     
    Peter Hansen, Sep 24, 2004
    #12
  13. On 2004-09-24, Peter Hansen <> wrote:

    > Good point. Note for the picky: it doesn't just get stripped... it
    > *is* the last character, even if there's data following. Or to
    > be blunt, ^Z (byte value 26) is treated as EOF on Windows when not
    > using binary mode to read files.


    <history>

    That's because CP/M allocated file space in blocks and only
    kept track of the length of the file in blocks. It was common
    practice to mark the end of the "real" data in a text file with
    a ^Z (IIRC, this was done by the application writing to the
    file). Otherwise, you had no way of knowing _where_ in that
    last block the data actually ended.

    The original MS/PC-DOS was basically a CP/M clone.

    I presume CP/M copied that behavior from RSX-11 or RT-11, but
    that's just an educated guess.

    </history>

    --
    Grant Edwards (grante at visi.com)
    Yow! My mind is making ashtrays in Dayton...
     
    Grant Edwards, Sep 24, 2004
    #13
  14. Terry Reedy wrote:

    > "Askari" <> wrote in message
    > news:Xns956E4CDA892D7askariaddressNonVali@207.35.177.135...
    >
    >>"Guyon Morée" <gumuz@NO_looze_SPAM.net> wrote in
    >>news:41540121$0$3891$:
    >>
    >>"rb" and "r" behave the same on a text file if it contains only ASCII
    >>characters (8-bit), but not if it contains Unicode characters (16-bit).
    >>In short, if you are sure that your file is ONLY text, use "r"; otherwise,
    >>always use "rb". Also, "r" does not read control characters other than
    >>"\n", "\t", etc.

    >
    >
    > Newbies, ignore this confusion.
    >
    > On Windows, text mode autoconverts \r\n to \n on input and vice versa on
    > output. I believe that that is all the difference. Period.


    It's the main difference, but not the only thing. From the MSDN
    documentation on fopen:

    "t

    Open in text (translated) mode. In this mode, CTRL+Z is interpreted as
    an end-of-file character on input. In files opened for reading/writing
    with "a+", fopen checks for a CTRL+Z at the end of the file and removes
    it, if possible. This is done because using fseek and ftell to move
    within a file that ends with a CTRL+Z, may cause fseek to behave
    improperly near the end of the file.

    Also, in text mode, carriage return–linefeed combinations are translated
    into single linefeeds on input, and linefeed characters are translated
    to carriage return–linefeed combinations on output. When a Unicode
    stream-I/O function operates in text mode (the default), the source or
    destination stream is assumed to be a sequence of multibyte characters.
    Therefore, the Unicode stream-input functions convert multibyte
    characters to wide characters (as if by a call to the mbtowc function).
    For the same reason, the Unicode stream-output functions convert wide
    characters to multibyte characters (as if by a call to the wctomb
    function)."

    So there's
    - the line endings translation
    - the issue of CTRL-Z as end of file that gets stripped (CTRL-Z is
    decimal 26 or hex 1a, consistent with Ralf's mail)
    - the Unicode issue, which I frankly don't understand

    --
    "Codito ergo sum"
    Roel Schroeven
     
    Roel Schroeven, Sep 24, 2004
    #14
  15. Guyon Morée

    Alan G Isaac Guest

    "Roel Schroeven" <> wrote in message
    news:OjW4d.255917$-ops.be...
    > It's safe in the sense that everything goes out exactly as it came in.
    > For example, gzip uses binary mode even when compressing text files. The
    > files may be text, but gzip doesn't care about that. It doesn't care
    > about words, sentences and line endings, but it does care about
    > representing exactly the bytes that are in the file.


    I think the following is the same question from another angle.
    I have a .zip archive of compressed files that
    I want to decompress. Using the zipfile module,
    I tried

        import zipfile

        z = zipfile.ZipFile('local.zip')
        for zname in z.namelist():
            localtxtfile = 'c:/puthere/' + zname
            f = open(localtxtfile, 'w')  # text mode, as originally posted
            f.write(z.read(zname))
            f.close()

    The original files were all plain text,
    created on an unspecified platform.
    The files I decompressed this way contained
    *two successive* carriage returns
    (ASCII 13) at the end of each line.
    If I change 'w' to 'wb' I get only one
    carriage return at the end of each line.

    Why is this extra carriage return added?
    My original guess was that using 'w' instead
    of 'wb' would be the right action, since the
    platform for the original files is unspecified
    and the original files are known to be plain text.

    Thanks,
    Alan Isaac
     
    Alan G Isaac, Sep 26, 2004
    #15
  16. Alan G Isaac wrote:

    > "Roel Schroeven" <> wrote in message
    > news:OjW4d.255917$-ops.be...
    >
    >>It's safe in the sense that everything goes out exactly as it came in.
    >>For example, gzip uses binary mode even when compressing text files. The
    >>files may be text, but gzip doesn't care about that. It doesn't care
    >>about words, sentences and line endings, but it does care about
    >>representing exactly the bytes that are in the file.

    >
    > I think the following is the same question from another angle.


    I think you should consider the same answer from this angle. ;)

    > I have a .zip archive of compressed files that
    > I want to decompress. Using the zipfile module,
    > I tried
    >     z = zipfile.ZipFile('local.zip')
    >     for zname in z.namelist():
    >         localtxtfile = 'c:/puthere/' + zname
    >         f = open(localtxtfile, 'w')
    >         f.write(z.read(zname))
    >         f.close()
    >
    > The original files were all plain text,
    > created on an unspecified platform.


    Are you sure the platform is unspecified? You can find out the platform
    by doing z.getinfo(zname).create_system and then *yuck* looking up
    the ID number you get against the list in
    <http://www.pkware.com/company/standards/appnote/>.
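
A minimal sketch of that lookup in Python 3, using a throwaway archive built in memory so that it runs without any files on disk. The member name `sample.txt` is invented. In the PKWARE application note, 0 stands for MS-DOS/FAT-style systems and 3 for Unix; which value appears here depends on the platform that wrote the archive.

```python
import io
import zipfile

# Build a tiny archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.txt", b"hello\r\n")

# Reopen it and look up the creating system for each member.
with zipfile.ZipFile(buffer) as archive:
    systems = {name: archive.getinfo(name).create_system
               for name in archive.namelist()}

print(systems)
```

Note that the ID only says where the archive entry was written, not which line endings the text inside it uses, so it is a hint and no more.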

    > The files I decompressed this way contained
    > *two successive* carriage returns
    > (ASCII 13) at the end of each line.
    > If I change 'w' to 'wb' I get only one
    > carriage return at the end of each line.
    >
    > Why is this extra carriage return added?


    I imagine the file in the archive was created on a DOS-type system,
    where the line ending is \r\n. That's what you read in. When you write
    it out in "w" mode the \n is expanded to \r\n without checking to see if
    there is already a \r beforehand. So you get \r\r\n.

    Essentially you should consider the archive file to be read in "rb"
    mode. Writing in "w" mode instead of "wb" mode will give you extra
    carriage returns.

    If you want to be able to get "universal newline" input from your
    zipfile, consider piping input through this generator and using "w" mode:

    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/286165

    Then you should get the correct line ending for a text file without
    regard to the current platform or the one where the archive was created.
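
The idea behind such a filter can be sketched in a few lines of Python 3. This is not the linked recipe, only an illustration of the same goal under the assumption that the data fits in memory; `normalize_newlines` is a made-up helper name.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR LF and lone CR line endings to LF (hypothetical helper)."""
    # Handle CR LF first so that it does not turn into two line feeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


dos_text = b"one\r\ntwo\r\n"
old_mac_text = b"one\rtwo\r"
unix_text = b"one\ntwo\n"

for sample in (dos_text, old_mac_text, unix_text):
    print(normalize_newlines(sample))
```

Once every line ends in a bare \n, writing in text mode adds the platform's own line ending exactly once.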
    --
    Michael Hoffman
     
    Michael Hoffman, Sep 26, 2004
    #16
  17. Guyon Morée

    Tim Roberts Guest

    "Alan G Isaac" <> wrote:
    >
    >I think the following is the same question from another angle.
    >I have a .zip archive of compressed files that
    >I want to decompress. Using the zipfile module,
    >I tried
    >    z = zipfile.ZipFile('local.zip')
    >    for zname in z.namelist():
    >        localtxtfile = 'c:/puthere/' + zname
    >        f = open(localtxtfile, 'w')
    >        f.write(z.read(zname))
    >        f.close()
    >
    >The original files were all plain text,
    >created on an unspecified platform.


    Not true. They were in plain text, created on a DOS/Windows platform.

    >The files I decompressed this way contained
    >*two successive* carriage returns
    >(ASCII 13) at the end of each line.
    >If I change 'w' to 'wb' I get only one
    >carriage return at the end of each line.
    >
    >Why is this extra carriage return added?


    Because the original file inside the zip file contained \r\n. z.read
    returns you those exact bytes. When you write "\r\n" to a text file in
    Windows, the \r is written as \r, and the \n is written as \r\n. Thus, you
    end up with \r\r\n.
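
The doubling can be reproduced on any platform with a short Python 3 sketch. Passing `newline="\r\n"` to `open()` asks Python to write every "\n" as "\r\n", which stands in here for Windows text mode; the file name and sample text are invented.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "doubled_cr.txt")  # made-up name

# What z.read() would return for a DOS-style member: text ending in CR LF.
payload = "one\r\ntwo\r\n"

# newline="\r\n" asks Python 3 to write every "\n" as "\r\n",
# imitating Windows text mode on any platform.
with open(path, "w", encoding="ascii", newline="\r\n") as f:
    f.write(payload)

# Read the result back in binary mode to see exactly what was stored.
with open(path, "rb") as f:
    on_disk = f.read()

print(repr(on_disk))
```

Each line in `on_disk` ends in two carriage returns followed by a line feed, matching the symptom reported above.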

    >My original guess was that using 'w' instead
    >of 'wb' would be the right action, since the
    >platform for the original files is unspecified
    >and the original files are known to be plain text.


    No. If you do not know what your buffer contains, you should always use
    'wb' so that those contents are not altered.

    That's the real lesson: when you write using 'w' or 'wt', the buffer is
    changed on the way out. You only want that if you know exactly what you
    are writing.
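
Applied to the extraction loop, the advice amounts to something like the following Python 3 sketch. The archive is a stand-in built in memory and the output directory is a temporary one, both assumptions made so the example is self-contained; a real script would open the archive by file name.

```python
import io
import os
import tempfile
import zipfile

# Stand-in archive built in memory; a real one would be opened by file name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("notes.txt", b"one\r\ntwo\r\n")

target_dir = tempfile.mkdtemp()  # hypothetical output directory

with zipfile.ZipFile(buffer) as archive:
    for name in archive.namelist():
        target = os.path.join(target_dir, name)
        # "wb" writes the member's bytes exactly as stored in the archive.
        with open(target, "wb") as out:
            out.write(archive.read(name))

# Check the result in binary mode so nothing is translated on the way in.
with open(os.path.join(target_dir, "notes.txt"), "rb") as f:
    extracted = f.read()

print(repr(extracted))
```

The `with` blocks also take care of closing each file, which avoids the easy-to-miss difference between `f.close` and `f.close()`.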
    --
    - Tim Roberts,
    Providenza & Boekelheide, Inc.
     
    Tim Roberts, Sep 26, 2004
    #17
  18. Guyon Morée

    Alan G Isaac Guest

    "Michael Hoffman" <> wrote in
    message news:cj57cj$1d3$...
    > I imagine the file in the archive was created on a DOS-type system,
    > where the line ending is \r\n. That's what you read in. When you write
    > it out in "w" mode the \n is expanded to \r\n without checking to see if
    > there is already a \r beforehand. So you get \r\r\n.


    Thanks; that addresses my basic misconception about writing in textmode.
    I had thought that writing in textmode produced a platform specific
    conversion of the text written, but I now understand that this only affects
    how \n is written.

    > If you want to be able to get "universal newline" input from your
    > zipfile, consider piping input through this generator and using "w" mode:
    > http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/286165


    Very helpful.

    Thanks,
    Alan Isaac
     
    Alan G Isaac, Sep 28, 2004
    #18