hi friends,
i've been confused about this for about a year now. i want to know the
exact difference between text and binary files.
As far as the C standard is concerned, there are a few differences: you
can't portably get the exact size of a binary file (an implementation
may pad it with trailing null characters), file positions on text
streams are opaque values you can only hand back to fseek, text streams
may impose a maximum line length, and data written to a text stream is
only guaranteed to read back equal if it consists of printing
characters and each line ends with '\n'. This is a summary; check the
standard for the real list. So writing a file in text mode and then
opening it in binary mode isn't guaranteed to give you anything
meaningful, or to work at all (imagine an implementation that marks
each file with a text-or-binary attribute, so that a file is identified
by both its name and that attribute).
On many implementations none of the above applies, and all you have to
worry about is how the implementation stores the newline character.
Since you're on Windows, here is the convention for text files
(treating the text file as raw bytes here):
BOM(optional)
line1 newline
....
lineN newline(optional)
EOF(optional)
The BOM is there to handle Unicode files; it can be one of:
0xEF 0xBB 0xBF (UTF-8 BOM)
0xFF 0xFE (UTF-16LE BOM)
0xFE 0xFF (UTF-16BE BOM)
If there is no BOM, then it's up to the software opening it to figure
out the encoding of the file somehow.
Newline is the '\r' '\n' sequence of characters.
Lines are composed of characters. In UTF-16, a character is either 2 or
4 bytes, depending on whether it's encoded as a surrogate pair. In
UTF-8, a character is 1, 2, 3, or 4 bytes. (And on top of all this, you
have to deal with an arbitrary number of combining characters.) You
should read up on Unicode, UTF-8, and UTF-16, because this whole issue
of characters versus glyphs gets confusing when somebody like me uses
loose language like this. If it's not a Unicode file, it most likely
uses some encoding set on the system. Generally, locales for Western
European languages use 1 byte per character, while East Asian locales
use multibyte encodings.
EOF is the ASCII Ctrl+Z code (0x1A). You won't find it except when
opening an ancient DOS file off a floppy or something.
When opening a file in text mode, most of this should be transparent to
you, assuming your program and the C runtime were carefully designed.
In other words, the above should mostly be a concern for C runtime
implementors, or for programmers who want to handle all of it
themselves.
Here's some homework for you:
On the C side, read 7.19, 7.24, and 7.25 in the C standard. Make sure
you know what the following do and how they fit together:
mbstate_t
fwide
fwrite
fputs
fputws
mbtowc
mbstowcs
setlocale
wcstombs
wctomb
mblen
On the windows side, read:
GetACP
MultiByteToWideChar
WideCharToMultiByte
http://blogs.msdn.com/oldnewthing/archive/2005/03/08/389527.aspx
http://blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx
http://blogs.msdn.com/oldnewthing/archive/2005/08/29/457483.aspx
http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx
http://blogs.msdn.com/michkap/archive/category/8717.aspx
On the Unicode side, read:
http://www.unicode.org/faq/utf_bom.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://catch22.net/tuts/ (the articles about his text editor)
http://en.wikipedia.org/wiki/ISO/IEC_8859
http://en.wikipedia.org/wiki/Unicode
After all this, you should be way more advanced about files than most C
programmers.
using the fwrite function in c, i wrote 2 bytes of integers in binary
mode.
my understanding is that notepad reads each byte of the file, converts
it from ascii to the corresponding character, and displays it on
screen.
so that's what i did: i wrote 2 bytes (an integer) using fwrite, and
since ascii is 1 byte per character, i expected 2 characters to be
displayed in notepad.
the first character displayed correctly but not the second.
Notepad likely uses the winapi function IsTextUnicode to determine the
encoding of the file. Windows supports the ASCII codepage but it is
very rarely used. Yours is most likely set to this:
http://en.wikipedia.org/wiki/Windows-1252
to add to my confusion about text and binary: some FTP servers running
on Linux require html files to be uploaded in 'ascii mode' and binary
files in 'binary mode'.
both are ordinary files consisting of a sequence of bytes after all, so
why a separate mode?
I don't know the FTP protocol in detail, but I think it also allows
EBCDIC to be transferred. You can't just send an arbitrary text file
from one computer to another, because the result depends heavily on the
source and destination character sets. FTP can only transfer text
losslessly between computers that have some mapping between their
character sets, and the server has to be aware of that mapping.