Binary or ASCII Text?

Claude Yih

Hi, everyone. I have a question. How can I identify whether a file is a
binary file or an ASCII text file? For instance, I wrote a piece of
code and saved it as "Test.c". I know it is an ASCII text file. Then,
after compilation, I got a "Test" file, which is a binary executable.
The problem is that I know the types of those two files only because I
ran the compilation myself. How can I make the computer determine the
type of a given file by writing code in C? Files are all saved as 0's
and 1's. What's the difference?

Please help me, thanks.
 
Keith Thompson

Claude Yih said:
Hi, everyone. I got a question. How can I identify whether a file is a
binary file or an ascii text file? For instance, I wrote a piece of
code and saved as "Test.c". I knew it was an ascii text file. Then
after compilation, I got a "Test" file and it was a binary executable
file. The problem is, I know the type of those two files in my mind
because I executed the process of compilation, but how can I make the
computer know the type of a given file by writing code in C? Files are
all save as 0's and 1's. What's the difference?

There is no general solution to this. Many systems don't actually
distinguish between text and binary files; a text file is just a file
that happens to consist of printable characters -- and what's
considered a printable character can vary. You can also look at line
terminators (ASCII LF on Unix-like systems, an ASCII CR-LF sequence on
Windows-like systems, possibly something completely different
elsewhere).

<OT>Unix-like systems have a command called "file" that attempts to
classify a file based on its contents.</OT>
 
osmium

Claude Yih said:
Hi, everyone. I got a question. How can I identify whether a file is a
binary file or an ascii text file? For instance, I wrote a piece of
code and saved as "Test.c". I knew it was an ascii text file. Then
after compilation, I got a "Test" file and it was a binary executable
file. The problem is, I know the type of those two files in my mind
because I executed the process of compilation, but how can I make the
computer know the type of a given file by writing code in C? Files are
all save as 0's and 1's. What's the difference?

The best you can do is make a guess. The first 32 characters of ASCII are
control codes, and only a few of them (CR, LF, FF, HT (tab), ...) are present
in text files. So if you have quite a few of the other 25 or so codes, it is
probably not a text file - but it's only an educated guess, no real proof.
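
A minimal sketch of that guess in C, assuming the sample is already in memory and the character set is ASCII. The function names, the decision to count DEL (0x7F) as suspicious, and the 1% threshold are all choices made for this illustration; none of them come from a standard.

```c
#include <stddef.h>

/* Count bytes in buf[0..n) that are ASCII control codes other than the
   ones commonly found in text: HT (0x09), LF (0x0A), FF (0x0C), CR (0x0D).
   DEL (0x7F) is counted as well; that is an assumption of this sketch. */
size_t count_suspicious_controls(const unsigned char *buf, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D)
            continue;
        if (c < 0x20 || c == 0x7F)
            count++;
    }
    return count;
}

/* Return 1 ("probably text") when fewer than 1% of the sampled bytes are
   suspicious, else 0. The 1% threshold is arbitrary. */
int probably_text(const unsigned char *buf, size_t n)
{
    if (n == 0)
        return 1;   /* an empty sample gives no evidence of binary data */
    return count_suspicious_controls(buf, n) * 100 < n;
}
```

The result is still only a guess: a binary file can happen to avoid these codes, and a text file can contain a few of them.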

But note that this is not what is meant when C programmers discuss text vs.
binary, for example in some of the file functions. What is referred to
there is the distinction between two ways of handling end of lines. Is an
end of line marked by a single character (LF) or by two characters, <CR><LF>?
Unix uses only the LF to mark end of line, so the distinction is
meaningless there. Systems that use <CR><LF> or <LF><CR> have to examine the
stream and convert the two characters into one, called '\n'. So '\n' is
really <LF>.

When you open a file in binary mode, you are telling the world: Hey, you
there, keep your cotton-picking hands off this file.
 
SM Ryan

# Hi, everyone. I got a question. How can I identify whether a file is a
# binary file or an ascii text file? For instance, I wrote a piece of
# code and saved as "Test.c". I knew it was an ascii text file. Then
# after compilation, I got a "Test" file and it was a binary executable
# file. The problem is, I know the type of those two files in my mind
# because I executed the process of compilation, but how can I make the
# computer know the type of a given file by writing code in C? Files are

As far as stdio is concerned, a binary file is what you get if you
include a "b" in the open mode; otherwise it's text mode. Binary and
text streams may differ in how end-of-line indicators are handled and
in how fseek offsets are interpreted. (In unix, stdio treats binary and
text files identically.)
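
One way to see the difference between the two modes is to write a line through a text-mode stream and then count the bytes actually stored by reading the file back with "b" in the mode. The sketch below does that; the function name and the idea of passing a scratch file path are made up for this example.

```c
#include <stdio.h>

/* Write "hello\n" through a text-mode stream, then count the bytes
   actually stored by reading the file back in binary mode.
   Returns the byte count, or -1 on error. */
long stored_size_of_text_line(const char *path)
{
    FILE *fp = fopen(path, "w");          /* no "b": text mode */
    if (fp == NULL)
        return -1;
    if (fputs("hello\n", fp) == EOF) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) != 0)
        return -1;

    fp = fopen(path, "rb");               /* "b": bytes are not translated */
    if (fp == NULL)
        return -1;
    long count = 0;
    while (getc(fp) != EOF)
        count++;
    fclose(fp);
    return count;
}
```

On a Unix-like system this typically returns 6. On a system whose text files end lines with CR-LF it typically returns 7, because the single '\n' was expanded on output.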
 
Claude Yih

osmium writes:
The best you can do is make a guess. The first 32 characters of ASCII are
control codes and only a few of them (CR, LF, FF, HT (tab), .... are present
in text files. So if you have quite a few of the other 25 or so codes, it is
probably not a text file - but it's only an educated guess, no real proof.

Well, as a matter of fact, I just got an idea for handling that problem,
but I don't know if it is feasible.

We know that ASCII text uses only 7 bits of each byte and that the
highest bit is always 0. So I wonder if I could write a program that
reads a fixed-length prefix of a given file (for example, the first
1024 bytes), stores it in an unsigned char array, and checks whether
any element is greater than 0x7F. If so, the file can be judged to be
a binary file.

However, the disadvantage of the above method is that it cannot handle
multi-byte characters. Take Japanese text in UTF-8 for example: a
Japanese character may be encoded as three bytes, and some of them may
be greater than 0x7F. In that case, my method will make no sense.
 
osmium

Claude Yih said:
The best you can do is make a guess. The first 32 characters of ASCII are
control codes and only a few of them (CR, LF, FF, HT (tab), .... are present
in text files. So if you have quite a few of the other 25 or so codes, it is
probably not a text file - but it's only an educated guess, no real proof.

Well, as matter of fact, I just got an idea to handle that problem. But
I don't know if it is feasible.

Now that we know ascii text only use 7 bits of a byte and the first bit
is always set as 0. So I wonder if I could write a program to get a
fixed length of a given file(for example, the first 1024 bytes) , to
store them in a unsigned char array and to check if there is any
elements greater than 0x7F. If any, the file can be judged as a binary
file.

However, the disadvantage of the above method is that it cannot handle
the multi-byte character. Take the UTF-8's japanese character for
example, a japanese character may be encoded as three bytes and some of
them may be greater than 0x7F. In that case, my method will make no
sense.

It doesn't work, but it has nothing to do with UTF-8. It is the problem of
proving a negative. How many white crows are there? AFAIK no one has ever
*seen* a white crow. What does that prove? Your guess is not as good as
the guess I implicitly proposed.
 
void * clvrmnky()

Claude said:
Hi, everyone. I got a question. How can I identify whether a file is a
binary file or an ascii text file? For instance, I wrote a piece of
code and saved as "Test.c". I knew it was an ascii text file. Then
after compilation, I got a "Test" file and it was a binary executable
file. The problem is, I know the type of those two files in my mind
because I executed the process of compilation, but how can I make the
computer know the type of a given file by writing code in C? Files are
all save as 0's and 1's. What's the difference?

Please help me, thanks.

As others have said, this is an essentially arbitrary decision on your
part. Here we've standardized on a definition of "binary" that means:

The file has lines longer than X bytes (where "X" is some arbitrarily
high number, like 16 or 23k), or any character within the file is \0
(NUL).

A line is defined as the data between newlines (normalized to '\n').

Everything else fits into a reasonable notion of OEM or ANSI charset,
with some caveats.

Again, this is specific to application requirements. Your requirements
may vary.
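
The definition above could be sketched roughly as follows, assuming the data is already in memory and newlines have already been normalized to '\n'. The function name and the max_line parameter (standing in for "X") are hypothetical; this is not the actual code used in that application.

```c
#include <stddef.h>

/* Return 1 ("binary") if buf[0..n) contains a NUL byte or a line longer
   than max_line bytes; otherwise return 0. A line is the run of bytes
   between '\n' characters; the newline itself is not counted. */
int is_binary_by_line_rule(const unsigned char *buf, size_t n, size_t max_line)
{
    size_t line_len = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == 0)
            return 1;                 /* NUL anywhere means binary */
        if (buf[i] == '\n') {
            line_len = 0;             /* start counting the next line */
        } else if (++line_len > max_line) {
            return 1;                 /* line exceeds the limit */
        }
    }
    return 0;
}
```

For example, with max_line set to 4, the data "abcd\nefgh" is accepted and "abcdef\n" is classed as binary.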
 
Keith Thompson

osmium said:
Well, as matter of fact, I just got an idea to handle that problem. But
I don't know if it is feasible.

Now that we know ascii text only use 7 bits of a byte and the first bit
is always set as 0. So I wonder if I could write a program to get a
fixed length of a given file(for example, the first 1024 bytes) , to
store them in a unsigned char array and to check if there is any
elements greater than 0x7F. If any, the file can be judged as a binary
file.

However, the disadvantage of the above method is that it cannot handle
the multi-byte character. Take the UTF-8's japanese character for
example, a japanese character may be encoded as three bytes and some of
them may be greater than 0x7F. In that case, my method will make no
sense.

It doesn't work, but it has nothing to do with UTF-8. It is the problem of
proving a negative. How many white crows are there? AFAIK no one has ever
*seen* a white crow. What does that prove? Your guess is not as good as
the guess I implicitly proposed.

The quoting is completely messed up. The paragraph starting with "The
best you can do" was written by osmium, the next three paragraphs
were written by Claude Yih, and the last paragraph, starting with "It
doesn't work", was written by osmium (who usually gets this stuff
right).
 
Keith Thompson

Claude Yih said:
osmium writes:

Well, as matter of fact, I just got an idea to handle that problem. But
I don't know if it is feasible.

Now that we know ascii text only use 7 bits of a byte and the first bit
is always set as 0. So I wonder if I could write a program to get a
fixed length of a given file(for example, the first 1024 bytes) , to
store them in a unsigned char array and to check if there is any
elements greater than 0x7F. If any, the file can be judged as a binary
file.

I think that's fairly close to what the Unix "file" command does.
(Versions of the command are available as open source; see
<ftp://ftp.astron.com/pub/file/>.)

As mentioned above, you should also check for control characters.

Claude Yih said:
However, the disadvantage of the above method is that it cannot handle
the multi-byte character. Take the UTF-8's japanese character for
example, a japanese character may be encoded as three bytes and some of
them may be greater than 0x7F. In that case, my method will make no
sense.

Multi-byte characters aren't the only problem. ISO-8859-1 is an
extension of ASCII that uses codes from 161 to 255 for printable
characters (there are several ISO-8859-N standards).

And none of this is portable to all possible C implementations. Some
systems distinguish between text and binary files at the filesystem
level.

Whatever it is you're trying to do, your first line of defense should
be to arrange to know what type a file is before you open it. If that
fails, as it inevitably will in some cases, you can check the contents
as a fallback, but there's no 100% reliable way to do so.

If you're writing a program that's intended to work only on text
files, it might be best to decide what's acceptable *for that
program*. If you're displaying the contents of the file, for example,
you can establish a convention for displaying non-printable characters
in some readable form. If an input line is very long, you can wrap it
or truncate it. And so on.
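
As one possible convention for showing non-printable characters, the sketch below copies bytes into a string and replaces anything outside printable ASCII with a \xHH escape, passing newline through unchanged. The function name and the escape format are assumptions made for this example. A literal backslash in the input is not escaped, so the output is meant for reading, not for converting back.

```c
#include <stdio.h>
#include <stddef.h>

/* Copy buf[0..n) into out as a NUL-terminated string, replacing each byte
   outside printable ASCII (0x20..0x7E) with a \xHH escape. Newline (0x0A)
   is copied through unchanged. Returns the length written, or 0 with an
   empty string if out_size is too small; 4 * n + 1 bytes always suffice. */
size_t escape_bytes(const unsigned char *buf, size_t n, char *out, size_t out_size)
{
    size_t pos = 0;
    if (out_size == 0)
        return 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (out_size - pos < 5) {      /* need room for "\xHH" plus NUL */
            out[0] = '\0';
            return 0;
        }
        if (c == 0x0A || (c >= 0x20 && c <= 0x7E)) {
            out[pos++] = (char)c;
        } else {
            pos += (size_t)snprintf(out + pos, out_size - pos,
                                    "\\x%02X", (unsigned)c);
        }
    }
    out[pos] = '\0';
    return pos;
}
```

With this convention the bytes 'a', 0x00, 'b', 0xFF come out as the text a\x00b\xFF.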
 
Me

Claude said:
Hi, everyone. I got a question. How can I identify whether a file is a
binary file or an ascii text file? For instance, I wrote a piece of
code and saved as "Test.c". I knew it was an ascii text file. Then
after compilation, I got a "Test" file and it was a binary executable
file. The problem is, I know the type of those two files in my mind
because I executed the process of compilation, but how can I make the
computer know the type of a given file by writing code in C? Files are
all save as 0's and 1's. What's the difference?

Modern computers deal with much more than just ASCII, so trying to
determine the encoding is doomed to failure and to heuristics that
behave mysteriously. All you should be doing is getting a filename from
the user and opening it in either text mode or binary mode. If you want
to distinguish between the two, then just have the user also input
which mode they want to use.
 
Joe Wright

Me said:
Modern computers deal with much more than just ASCII so trying to
determine the encoding is doomed to failure and mysterious acting
heuristics. All you should be doing is getting a filename from the user
and opening it in either text mode or binary mode. If you want to
distinguish between the two then just have the user also input which
mode they want to use.

Too true. Reading a file to determine its format is like walking outside
and predicting the weather. You might get it right but maybe not.

Text mode implemented in C is a concession to Microsoft. It removes the
CR from the CRLF pair and ignores any trailing ^Z character. Conversely
on writing a file, when told to write LF the pair CRLF is written.

If you expect anything else you'll be disappointed. If you must
investigate the contents of a file, "rb" is your friend.
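
As an illustration of inspecting raw contents, the sketch below tallies the line terminators found in a buffer of bytes. It assumes the bytes were read from a stream opened with "rb", so no translation has been applied. The struct and function names are invented for this example.

```c
#include <stddef.h>

struct eol_counts {
    size_t crlf;   /* CR immediately followed by LF */
    size_t lf;     /* LF not preceded by CR */
    size_t cr;     /* CR not followed by LF */
};

/* Tally line terminators in raw, untranslated bytes. */
struct eol_counts count_line_endings(const unsigned char *buf, size_t n)
{
    struct eol_counts c = {0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == 0x0D) {
            if (i + 1 < n && buf[i + 1] == 0x0A) {
                c.crlf++;
                i++;            /* skip the LF of this pair */
            } else {
                c.cr++;
            }
        } else if (buf[i] == 0x0A) {
            c.lf++;
        }
    }
    return c;
}
```

A file with only CR-LF pairs was probably written with the CR-LF convention, and one with only bare LFs probably was not. A mixture is a hint that the file is not plain text, or that it has been edited on more than one system.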
 
CBFalconer

Joe said:
.... snip ...
Too true. Reading a file to determine its format is like walking outside
and predicting the weather. You might get it right but maybe not.

Text mode implemented in C is a concession to Microsoft. It removes the
CR from the CRLF pair and ignores any trailing ^Z character. Conversely
on writing a file, when told to write LF the pair CRLF is written.

In this case you are being unfair to Microsoft (yes, I know it's
hard to do). C is the offbeat animal here. Text lines were
terminated with cr/lf for many moons before C decided to ignore the
cr, and the protocols were largely inherited from teletype
machines. The C technique makes it awkward to overprint lines, or
to advance a line without returning to the left margin, while the
much older protocol makes those things easy.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>
 
P.J. Plauger

Joe Wright said:
Text mode implemented in C is a concession to Microsoft.

Nope. It was a concession to every OS *except* UNIX. By the time
the C standardization effort began in 1983, my company Whitesmiths,
Ltd. had ported C to dozens of different platforms. We added the
text/binary dichotomy to deal uniformly with the numerous
conventions for terminating lines in text files. One of those
platforms happened to be 86-DOS, which was the precursor to MS-DOS.
It was by no means the most important at that time.

Joe Wright said:
It removes the CR
from the CRLF pair and ignores any trailing ^Z character.

And the bytes thereafter.

Joe Wright said:
Conversely on
writing a file, when told to write LF the pair CRLF is written.

If you expect anything else you'll be disappointed. If you must
investigate the contents of a file, "rb" is your friend.

Right. Except for the possibility of trailing NUL padding, that is.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
P.J. Plauger

CBFalconer said:
In this case you are being unfair to Microsoft (yes, I know it's
hard to do). C is the offbeat animal here. Text lines were
terminated with cr/lf for many moons before C decided to ignore the
cr,

You mean, before Unix developed a uniform notation for text streams,
both inside and outside the program, and C built it into its runtime
library.

CBFalconer said:
and the protocols were largely inherited from teletype
machines.

They were also terminated by lf/cr, cr, by blank padding to fixed-length
records, by line count, etc. etc.

CBFalconer said:
The C technique makes it awkward to overprint lines,

No, most systems will put out a ^M so you can do that.

CBFalconer said:
or
to advance a line without returning to the left margin,

Yes, that is hard to do in text mode.

CBFalconer said:
while the
much older protocol makes those things easy.

True. Thus the choice of binary mode as well.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
osmium

Joe Wright said:
Text mode implemented in C is a concession to Microsoft.

I hate Microsoft too. But that is not the case.

The ASCII code was designed to allow a second pass at printing to produce
some of the accents used with the latin alphabet. Early copies of ASCII
show both the circumflex and tilde in the superior position to make this
work. And also, to make it work, a line feed had to have no side effects,
such as advancing the medium. I believe the ASCII code has been jiggered
with to redefine CR and LF since the original specification, but I have no
actual proof.

So it was a concession to ASCII.
 
P.J. Plauger

osmium said:
I hate Microsoft too. But that is not the case.

The ASCII code was designed to allow a second pass at printing to produce
some of the accents used with the latin alphabet. Early copies of ASCII
show both the circumflex and tilde in the superior position to make this
work. And also, to make it work, a line feed had to have no side
effects, such as advancing the medium. I believe the ASCII code has been
jiggered with to redefine CR and LF since the original specification, but
I have no actual proof.

No, you just need backspace to work so you can overstrike a letter.
That still works with the portable C model of a text stream (on a
display that shows both characters of an overstrike, at least).

osmium said:
So it was a concession to ASCII.

Not really.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
mensanator

osmium said:
I hate Microsoft too. But that is not the case.

The ASCII code was designed to allow a second pass at printing to produce
some of the accents used with the latin alphabet. Early copies of ASCII
show both the circumflex and tilde in the superior position to make this
work. And also, to make it work, a line feed had to have no side effects,
such as advancing the medium.

What, then, is its effect? Or are you thinking about carriage return
not advancing the media?
 
osmium

P.J. Plauger said:
Not really.

I realized that back space might enter into that too, but thought there
might have been problems with that considering the physical nature of actual
drum printers, chain printers and so on.

So are you saying that the initial release of the ASCII standard said that
LF was to do line feed AND carriage return? What was the point then, of
having them as separate codes? Unfortunately I don't have the text that
goes with my pre-historic ASCII chart, only a single page showing the
glyphs.
 
osmium

mensanator said:
What, then, is its effect? Or are you thinking about carriage return
not advancing the media?

I think I said it backwards. CR meant return the carriage and LF meant to
advance to next line. Neither had any side effects. The sequence <CR><LF>
was like a typewriter.
 
P.J. Plauger

osmium said:
I realized that back space might enter into that too, but thought there
might have been problems with that considering the physical nature of
actual drum printers, chain printers and so on.

Dunno how CR would be any better off than BS if that was the case.
Either way, as long as the device driver can do overstrikes you
can express them either by CR and spacing down or by BS and an
immediate overstrike.

osmium said:
So are you saying that the initial release of the ASCII standard said that
LF was to do line feed AND carriage return?

I don't recall saying anything about the ASCII standard. It describes
the effect of presenting a stream of ASCII characters to a conforming
display device. That's what goes on *outside* a C program. What I
discussed was the use of a single NL (as opposed to an assortment of
earlier conventions) for signaling the end of a line of text *within*
a C program. Unix also chose this representation for text files, so
there was no need to distinguish binary and text files. Moreover, Unix
device drivers generated whatever sequence of codes was necessary to
get the device to replicate the intent of the internal text stream.
That isolated the device peculiarities where they belong, not spread
throughout each program. (If you ever saw code written during the
1960s, you'd appreciate what a breakthrough this uniformity caused.)

I agree that you lose a bit of expressiveness over maintaining the
code internally as ASCII, but the payoff is a significantly better
unified model, IMO, for representing all text streams. Witness the
success of Unix-style software tools, and the C I/O model well
beyond Unix.

osmium said:
What was the point then, of
having them as separate codes?

ASCII serves one purpose, the C Standard another.

osmium said:
Unfortunately I don't have the text that
goes with my pre-historic ASCII chart, only a single page showing the
glyphs.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
