text and binary files confusion

Discussion in 'C Programming' started by joelagnel@gmail.com, Mar 13, 2006.

  1. Guest

    hi friends,

    i've been having this confusion for about a year, i want to know the
    exact difference between text and binary files.

    using the fwrite function in c, i wrote a 2-byte integer in binary
    mode.
    as i understand it, notepad opens a file and, for each byte it reads,
    converts that byte from its ascii code to the matching character and
    displays it on screen.
    so that's what i did: i wrote 2 bytes (an integer) using fwrite, and
    since an ascii character is 1 byte, i expected 2 characters to be
    displayed in notepad.
    the first character displayed correctly but not the second.

    to add to my confusion about text and binary, some FTP servers running
    on Linux require html files to be uploaded in 'ascii mode' and binary
    files in 'binary mode'.
    both are ordinary files consisting of a sequential series of bytes
    after all, so why a separate mode?


    Any insight into this confusion would be greatly appreciated.

    thanks a lot for your time.

    - joel
     
    , Mar 13, 2006
    #1

  2. Marc Boyer Guest

    On 13-03-2006, <> wrote:
    > i've been having this confusion for about a year, i want to know the
    > exact difference between text and binary files.


    To my knowledge, the main difference is the interpretation
    of the '\n' character.
    In text mode, '\n' is an 'end of line' indication, mapped
    to '\n', '\r\n' or '\r' depending on the platform's
    end-of-line convention.
    In binary mode, '\n' is '\n'.
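
    A minimal sketch of how one might observe this mapping. The helper name
    stored_size and the file names are made up for illustration; the function
    writes the two characters 'a' and '\n' with a given fopen mode, then counts
    the bytes that actually ended up in the file.

```c
#include <stdio.h>

/* Write the two characters 'a' and '\n' to `path` using the given fopen
 * mode ("w" for text, "wb" for binary), then reopen the file in binary
 * mode and count the bytes actually stored. Returns -1 on any I/O error. */
long stored_size(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    if (fp == NULL)
        return -1;
    if (fputs("a\n", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;

    /* Binary mode shows the bytes as stored, with no newline mapping. */
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    long count = 0;
    while (fgetc(fp) != EOF)
        count++;
    fclose(fp);
    return count;
}
```

    Calling stored_size("demo.bin", "wb") should give 2 on any platform. With
    mode "w" the result depends on the platform: 3 would be expected on Windows
    ('\n' stored as '\r\n') and 2 on Unix-like systems.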

    > using the fwrite function in c, i wrote 2 bytes of integers in binary
    > mode.
    > according to me, notepad opens files and each byte of the file
    > read, it converts that byte from ascii to its correct character and
    > displays
    > it on screen..


    Assuming that your encoding is ASCII, yes.

    > so that's what i did, i wrote 2 bytes (an integer) using fwrite and
    > since ascii
    > is 1 byte, i expected 2 characters to be displayed in notepad..
    > the first character displayed correctly but not the second.


    Perhaps it was not a 'displayable' value.

    Marc Boyer
     
    Marc Boyer, Mar 13, 2006
    #2

  3. On 2006-03-13, <> wrote:
    > hi friends,
    >
    > i've been having this confusion for about a year, i want to know the
    > exact difference between text and binary files.
    >
    > using the fwrite function in c, i wrote 2 bytes of integers in binary
    > mode.
    > according to me, notepad opens files and each byte of the file
    > read, it converts that byte from ascii to its correct character and
    > displays
    > it on screen..
    > so that's what i did, i wrote 2 bytes (an integer) using fwrite and
    > since ascii
    > is 1 byte, i expected 2 characters to be displayed in notepad..
    > the first character displayed correctly but not the second.
    >
    > to add to my confusion of text and binary, some FTP servers running
    > on Linux require html files to be uploaded in 'ascii mode' and binary
    > files in 'binary mode'.
    > Both are ordinary files consisting of a sequential series of bytes
    > after all, then
    > why a separate mode?
    >
    >
    > Any insight into this confusion would be greatly appreciated.
    >
    > thanks a lot for your time.
    >
    > - joel
    >


    Firstly, don't worry: this is something that trips a lot of people up.

    I was about to pen a few lines and then decided not to because (a) I
    could not think of an eloquent way of doing it and (b) like most things,
    someone else has done it first. The secret of being a great engineer
    is not knowing how to do something, but knowing that there may be a
    better way and knowing how to locate that better way :-;

    Here:

    http://en.wikipedia.org/wiki/Binary_and_text_files

    There is one key part which might confuse you (not knowing your
    familiarity with ascii text) and that is:

    "Text files are files where most bytes (or short sequences of bytes)
    represent ordinary readable characters such as letters,"

    The "short sequences of bytes" part is important: look up Unicode and
    DBCS (double-byte character sets).

    --
    Debuggers : you know it makes sense.
    http://heather.cs.ucdavis.edu/~matloff/UnixAndC/CLanguage/Debug.html#tth_sEc
     
    Richard G. Riley, Mar 13, 2006
    #3
  4. Guest

    > Perhaps it was not a 'displayable' value.

    the value was displayable im sure, because the ascii codes
    i wrote in binary to the file were:
    40H
    and 41H
    2 bytes.
    and i expected it to display AB
    but it displayed A@

    That's what confuses me..

    - joel
     
    , Mar 17, 2006
    #4
  5. Guest

    also how does notepad detect the encoding, after all its just
    a sequence of bytes, there's nothing in the file that says im
    encoded in ascii or ebcdic...

    joel
     
    , Mar 17, 2006
    #5
  6. Guest

    wrote:
    > > Perhaps it was not a 'displayable' value.

    >
    > the value was displayable im sure, because the ascii codes
    > i wrote in binary to the file were:
    > 40H
    > and 41H
    > 2 bytes.
    > and i expected it to display AB


    Wrong.

    > but it displayed A@


    That's what it's supposed to display. ASCII 'A' is 41H.
    ASCII '@' is 40H.

    >
    > That's what confuses me..


    But the computer is not confused. It does what you tell it.
    When what you get is not what you expect, always consider
    that your expectations may be wrong.
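
    The original code was not posted, so this is only a guess at why the bytes
    came out in the order 41H, 40H: if the two bytes were written as one 16-bit
    integer with the value 0x4041 on a little-endian machine (such as a typical
    x86 PC), the low byte goes to the file first. The function names below are
    made up for this sketch.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copy the in-memory bytes of a 16-bit integer into out[0..1], in the same
 * order that fwrite(&value, sizeof value, 1, fp) would write them. */
void bytes_as_stored(uint16_t value, unsigned char out[2])
{
    memcpy(out, &value, sizeof value);
}

/* Print the two bytes as hex and as characters, the way a
 * one-byte-per-character viewer would show them. */
void show_as_text(uint16_t value)
{
    unsigned char b[2];
    bytes_as_stored(value, b);
    printf("0x%04X is stored as %02X %02X, shown as \"%c%c\"\n",
           (unsigned)value, (unsigned)b[0], (unsigned)b[1], b[0], b[1]);
}
```

    On a little-endian machine, show_as_text(0x4041) reports the bytes 41 40,
    which a viewer displays as A@. On a big-endian machine the order is
    reversed. Writing an array of two unsigned char values instead of one
    integer avoids the dependence on byte order.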

    >
    > - joel
     
    , Mar 17, 2006
    #6
  7. Guest

    wrote:
    > also how does notepad detect the encoding, after all its just
    > a sequence of bytes, there's nothing in the file that says im
    > encoded in ascii or ebcdic...


    Notepad assumes it's in ASCII. If you actually used EBCDIC
    and opened it in notepad, you'd _really_ be confused.

    >
    > joel
     
    , Mar 17, 2006
    #7
  8. Me Guest

    wrote:
    > hi friends,
    >
    > i've been having this confusion for about a year, i want to know the
    > exact difference between text and binary files.


    As far as the C standard is concerned, there are several differences:
    a binary stream need not report the exact file size (the implementation
    may append padding null characters), a file position in a text stream
    need not be a plain byte offset, an implementation may impose a maximum
    line length on text files, and whether the last line of a text file
    needs a terminating '\n' is implementation-defined. This
    is a summary; check the standard for the real list. So basically,
    writing a file in text mode and then opening it in binary mode isn't
    guaranteed to give you anything meaningful or to work at all (imagine
    an implementation that marks each file with a text or binary
    attribute, where a file is identified by both the filename and this
    attribute).

    On many implementations, the above doesn't apply and all you have to
    worry about is how the implementation stores the newline character. Since
    you're on Windows, here is the convention for text files (treating the
    text file as binary here):

    BOM(optional)
    line1 newline
    ....
    lineN newline(optional)
    EOF(optional)

    the BOM is to handle unicode files, it can be one of:

    0xEF 0xBB 0xBF (UTF-8 BOM)
    0xFF 0xFE (UTF-16LE BOM)
    0xFE 0xFF (UTF-16BE BOM)

    If there is no BOM, then it's up to the software opening it to figure
    out the encoding of the file somehow.
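
    The BOM check above can be sketched roughly as follows. The function name
    detect_bom and the returned labels are made up for illustration, and only
    the three BOMs listed above are recognized.

```c
#include <stddef.h>
#include <string.h>

/* Return a label for the byte order mark at the start of buf, or "none"
 * if the first bytes match none of the three BOMs listed above. */
const char *detect_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char utf8[]    = { 0xEF, 0xBB, 0xBF };
    static const unsigned char utf16le[] = { 0xFF, 0xFE };
    static const unsigned char utf16be[] = { 0xFE, 0xFF };

    if (len >= sizeof utf8 && memcmp(buf, utf8, sizeof utf8) == 0)
        return "UTF-8";
    if (len >= sizeof utf16le && memcmp(buf, utf16le, sizeof utf16le) == 0)
        return "UTF-16LE";
    if (len >= sizeof utf16be && memcmp(buf, utf16be, sizeof utf16be) == 0)
        return "UTF-16BE";
    return "none";
}
```

    Read the first few bytes of the file in binary mode and pass them in. A
    result of "none" says nothing about the encoding; it only means the file
    does not start with one of these marks.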

    Newline is the '\r' '\n' sequence of characters.

    Lines are composed of characters. In UTF-16, a character takes
    either 2 or 4 bytes, depending on whether it needs a surrogate pair. In
    UTF-8, a character takes 1, 2, 3, or 4 bytes (and on top of all this, you
    have to deal with an arbitrary number of combining characters). You
    should read up on Unicode, UTF-8, and UTF-16, because this whole issue
    of characters and glyphs is confusing when somebody like me uses loose
    language like this. If it's not a Unicode file, it most likely uses
    the encoding (code page) set on the system. Generally, Western European
    locales use 1 byte per character, while East Asian locales such as
    Chinese, Japanese, and Korean use multiple bytes to encode characters.
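
    As a rough illustration of those sizes, here is a sketch that computes how
    many bytes a single Unicode code point takes in each encoding. The function
    names are made up, and "character" here means one code point, ignoring
    combining characters.

```c
#include <stdint.h>

/* Number of bytes needed to encode one code point in UTF-8,
 * or 0 if the value is not an encodable code point. */
int utf8_length(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogate values are not characters */
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;                   /* beyond the Unicode range */
}

/* Number of bytes needed to encode one code point in UTF-16:
 * 2 inside the Basic Multilingual Plane, 4 (a surrogate pair) above it. */
int utf16_length(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp <= 0xFFFF)   return 2;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}
```

    For example, 'A' (U+0041) takes 1 byte in UTF-8 and 2 in UTF-16, the euro
    sign (U+20AC) takes 3 and 2, and U+1F600 takes 4 in both.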

    EOF is the ASCII ctrl+Z code (0x1A). You won't find this except when
    opening an ancient DOS file off a floppy or something.


    When opening a file in text-mode, most of this should be transparent to
    you if your program and the C runtime were carefully designed. i.e. the
    above should pretty much be a concern for the C runtime implementors or
    programmers that want to handle all of this themselves.


    Here's some homework for you:

    On the C side, read 7.19, 7.24, and 7.25 in the C standard. Make sure
    you know what the following do and how they fit together:

    mbstate_t
    fwide
    fwrite
    fputs
    fputws
    mbtowc
    mbstowcs
    setlocale
    wcstombs
    wctomb
    mblen
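
    To give a feel for how a couple of these fit together, here is a small
    sketch that counts the characters (not the bytes) in a multibyte string
    with mblen. The function name count_characters is made up. It relies on
    the current locale, so a real program would normally call
    setlocale(LC_ALL, "") once at startup; without that call, the default "C"
    locale is in effect.

```c
#include <stddef.h>
#include <stdlib.h>

/* Count the characters in a multibyte string according to the current
 * locale. Returns (size_t)-1 if an invalid sequence is encountered. */
size_t count_characters(const char *s)
{
    size_t count = 0;

    (void)mblen(NULL, 0);               /* reset any shift state */
    while (*s != '\0') {
        int len = mblen(s, MB_CUR_MAX); /* bytes in the next character */
        if (len < 0)
            return (size_t)-1;          /* not valid in this locale */
        s += len;
        count++;
    }
    return count;
}
```

    For plain ASCII text the result equals strlen. In a UTF-8 locale, a string
    containing accented letters would be expected to give a count smaller than
    its byte length.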

    On the windows side, read:

    GetACP
    MultiByteToWideChar
    WideCharToMultiByte
    http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx
    http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx
    http://blogs.msdn.com/oldnewthing/archive/2005/08/29/457483.aspx
    http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx
    http://blogs.msdn.com/michkap/archive/category/8717.aspx

    On the Unicode side, read:

    http://www.unicode.org/faq/utf_bom.html
    http://www.cl.cam.ac.uk/~mgk25/unicode.html
    http://catch22.net/tuts/ (the articles about his text editor)
    http://en.wikipedia.org/wiki/ISO/IEC_8859
    http://en.wikipedia.org/wiki/Unicode

    After all this, you should be way more advanced about files than most C
    programmers.


    > using the fwrite function in c, i wrote 2 bytes of integers in binary
    > mode.
    > according to me, notepad opens files and each byte of the file
    > read, it converts that byte from ascii to its correct character and
    > displays it on screen..
    > so that's what i did, i wrote 2 bytes (an integer) using fwrite and
    > since ascii
    > is 1 byte, i expected 2 characters to be displayed in notepad..
    > the first character displayed correctly but not the second.


    Notepad likely uses the winapi function IsTextUnicode to determine the
    encoding of the file. Windows supports the ASCII codepage but it is
    very rarely used. Yours is most likely set to this:

    http://en.wikipedia.org/wiki/Windows-1252

    > to add to my confusion of text and binary, some FTP servers running
    > on Linux require html files to be uploaded in 'ascii mode' and binary
    > files in 'binary mode'.
    > Both are ordinary files consisting of a sequential series of bytes
    > after all, then
    > why a separate mode?


    I don't know much about FTP, but I think it also allows EBCDIC to be
    transferred. I doubt you can send an arbitrary text file from one
    machine to another unchanged, because the result depends on the source
    and destination character sets. So FTP can only transfer text losslessly
    between computers that have some mapping between their character sets,
    and the FTP server has to be aware of this mapping.
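
    For what it's worth, the FTP specification (RFC 959) defines the ASCII
    type so that text travels over the wire with CR LF line endings, with each
    side converting to or from its own convention, while the binary (image)
    type sends the bytes unchanged. A minimal sketch of the sending side on a
    system whose local newline is a bare LF; the function name lf_to_crlf is
    made up and this is not taken from any FTP implementation.

```c
#include <stddef.h>

/* Copy in[0..n) to out, replacing each bare '\n' with "\r\n". A '\n' that
 * already follows a '\r' is copied as is. Returns the number of bytes
 * written, or (size_t)-1 if out (capacity cap) is too small. */
size_t lf_to_crlf(const char *in, size_t n, char *out, size_t cap)
{
    size_t w = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i] == '\n' && (i == 0 || in[i - 1] != '\r')) {
            if (w + 2 > cap)
                return (size_t)-1;
            out[w++] = '\r';
            out[w++] = '\n';
        } else {
            if (w + 1 > cap)
                return (size_t)-1;
            out[w++] = in[i];
        }
    }
    return w;
}
```

    This also shows why a binary file must not go through ASCII mode: every
    byte in an image or archive that happens to equal 0x0A would get a 0x0D
    inserted before it, which corrupts the file.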
     
    Me, Mar 17, 2006
    #8
  9. Marc Boyer Guest

    On 17-03-2006, <> wrote:
    > also how does notepad detect the encoding,


    Does it ?

    > after all its just
    > a sequence of bytes, there's nothing in the file that says im
    > encoded in ascii or ebcdic...


    No, but, in general, a 'family of platforms'
    (like Win*, AIX*, AS*) uses the same encoding. That
    is to say, I do not know of any Win* running EBCDIC.
    So, notepad can assume ASCII is used.

    Nevertheless, nowadays, in non-English-speaking countries,
    people are using iso-* encodings, UTF-8, perhaps
    UTF-16 and others...
    Being French, I often have problems opening UTF-8
    files with iso-latin* editors, and so on.

    There are some heuristics used to guess the
    encoding. Some editors (like [X]emacs) use some,
    and it often works.
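
    One simple heuristic of this kind, as a sketch: check whether the bytes
    have the structure of UTF-8. The function name looks_like_utf8 is made up,
    this is not claimed to be what any particular editor does, and the check is
    deliberately loose (it does not reject overlong forms or surrogates).

```c
#include <stddef.h>

/* Return 1 if buf[0..len) is structurally UTF-8: every lead byte is
 * followed by the right number of 10xxxxxx continuation bytes.
 * Return 0 otherwise. Pure ASCII input also returns 1. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;

        if (c < 0x80)                extra = 0; /* ASCII */
        else if ((c & 0xE0) == 0xC0) extra = 1; /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0) extra = 2; /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0) extra = 3; /* 11110xxx */
        else
            return 0;           /* stray continuation or invalid lead byte */

        if (extra > len - i - 1)
            return 0;           /* sequence is cut off at the end */
        for (size_t k = 1; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;
        i += extra + 1;
    }
    return 1;
}
```

    Text in an iso-latin encoding that contains accented letters usually fails
    this test, because a byte such as 0xE9 is rarely followed by valid
    continuation bytes. That is why the guess often works, though not always.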

    Marc Boyer
     
    Marc Boyer, Mar 17, 2006
    #9
  10. Richard Bos Guest

    Marc Boyer <> wrote:

    > On 17-03-2006, <> wrote:
    > > also how does notepad detect the encoding,

    >
    > Does it ?
    >
    > > after all its just
    > > a sequence of bytes, there's nothing in the file that says im
    > > encoded in ascii or ebcdic...

    >
    > No, but, in general, a 'family of platforms'
    > (like Win*, AIX*, AS*) uses the same encoding. That
    > is to say, I do not know of any Win* running EBCDIC.
    > So, notepad can assume ASCII is used.


    Not on newer versions of Windows, it can't. More would be off-topic,
    except to say that
    a. the detection used is easily writable, correctly, in ISO C and
    b. _if_ the implementation uses UTF-16 for wchar_t, so is the rest of
    the editor.

    Richard
     
    Richard Bos, Mar 17, 2006
    #10
  11. Marc Boyer said:

    > On 17-03-2006, <> wrote:
    >> also how does notepad detect the encoding,

    >
    > Does it ?
    >
    >> after all its just
    >> a sequence of bytes, there's nothing in the file that says im
    >> encoded in ascii or ebcdic...

    >
    > No, but, in general, a 'family of platforms'
    > (like Win*, AIX*, AS*) uses the same encoding. That
    > is to say, I do not know of any Win* running EBCDIC.
    > So, notepad can assume ASCII is used.


    Windows can still emulate MS-DOS, quite probably to the extent that it can
    run IBM's DisplayWrite software, which uses [1] EBCDIC encoding (for, would
    you believe, mainframe compatibility).


    [1] Or, at least, used. I freely admit that my information is over a decade
    old.

    --
    Richard Heathfield
    "Usenet is a strange place" - dmr 29/7/1999
    http://www.cpax.org.uk
    email: rjh at above domain (but drop the www, obviously)
     
    Richard Heathfield, Mar 17, 2006
    #11
