How to determine whether a file is ASCII or binary?

Sunner Sun

Hi, all

Since the OS treats both ASCII and binary files as sequences of bytes, is
there any way to determine the file type other than judging by the extension?

Thank you!
 
James McIninch

<posted & mailed>

There's no "ASCII" in C. There is a somewhat artificial distinction between
"text" and "binary", "text" being a special case of a binary file whereby
the operating system might do something to the data as it is written to the
disk to make it compatible with applications that operate on text.

Since there's no definition of what that magic might be, there's likewise no
way to distinguish a "text" file from a "binary" file. All text files are
binary files. The only way to recognize a text file would be to check if
the file matches the local environment's criteria for a "text" file (and
most environments don't have the concept of a "text" file at all).

The canonical example is CP/M (and Microsoft's products, which harken back
to it). There, if you open a file for writing as a "text" file, every "\n"
that is written becomes "\r\n" on disk, and when you close the file, "\032"
is appended to the end of the file. When you read from the text file, the
reverse operations occur. Windows still performs the newline translation and
treats "\032" as end-of-file when reading in text mode. The only way you
could differentiate between a text file and a binary file would be to be
armed with this information, then open the target file in binary mode and
check that every byte in the file returns true for isprint() or isspace()
except the last byte in the file, which must equal '\032'. If so, you know
the file is a text file. You don't need to test if the file is a binary
file, since all files are.
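
A minimal sketch of the check described above, assuming the file contents have already been read into memory from a stream opened in binary mode ("rb"). The function name looks_like_cpm_text is made up for illustration, and the result depends on the current locale because isprint() and isspace() do.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: applies the CP/M-style rule to a buffer that was
 * read from a file opened in binary mode. Every byte must be printable
 * or whitespace, except the final byte, which must be the Ctrl-Z end
 * marker '\032'. */
bool looks_like_cpm_text(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len == 0 || buf[len - 1] != '\032')
        return false;               /* no trailing Ctrl-Z marker */
    for (size_t i = 0; i + 1 < len; i++) {
        if (!isprint(buf[i]) && !isspace(buf[i]))
            return false;           /* byte outside the "text" set */
    }
    return true;
}
```

Note that a file without the trailing '\032' is rejected by this rule, so it only fits environments that really do append that marker.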

It gets more complicated in modern days where multiple character sets and
various encodings are used for text... In that case, the encoding needs to
be indicated within the file somehow and that frequently presumes multibyte
character sets, etc., which already preclude them from being treated as
simple text files in the first place.
 
Leor Zolman

Hi, all

Since the OS treats both ASCII and binary files as sequences of bytes, is
there any way to determine the file type other than judging by the extension?

Portably, in C? Nah, because a "binary" file can simply mimic an ASCII file
and no one, or no /thing/, could possibly tell you whether the data was
written in binary mode or text mode.

The best you can do is take the approach of the Unix "file" command. Here's
a sample output I just got from running it under Cygwin on a text file:

[/home/leor] $ file s2
s2: ASCII English text, with CRLF line terminators

It looks at the first few bytes (along with perhaps platform-specific inode
info in this case) and "takes its best shot".
-leor
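
To illustrate the "look at the first few bytes" idea, here is a rough sketch that compares the start of a buffer against a handful of well-known signatures. The table and the name sniff_leading_bytes are assumptions made for this example; the real file command consults a far larger magic database and applies many more rules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, tiny signature table (an assumption for illustration);
 * file(1) itself knows vastly more formats than these four. */
static const struct {
    const char *magic;
    size_t      len;
    const char *description;
} signatures[] = {
    { "\x7f" "ELF",        4, "ELF object or executable" },
    { "\x89PNG\r\n\x1a\n", 8, "PNG image" },
    { "MZ",                2, "DOS/Windows executable" },
    { "#!",                2, "script with interpreter line" },
};

/* Returns a description of the first matching signature, or "unknown". */
const char *sniff_leading_bytes(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < sizeof signatures / sizeof signatures[0]; i++) {
        if (len >= signatures[i].len &&
            memcmp(buf, signatures[i].magic, signatures[i].len) == 0)
            return signatures[i].description;
    }
    return "unknown";
}
```

An "unknown" result says nothing about text versus binary; it only means none of the listed signatures matched.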
 
osmium

Sunner said:
Since the OS treats both ASCII and binary files as sequences of bytes, is
there any way to determine the file type other than judging by the extension?

There is no way to be certain. But note that most of the control characters
would not appear in an ASCII file.

You can make a pretty good guess by choosing a subset containing most of the
ASCII control characters, then counting the number of characters in the file
that are in that subset; the count should be zero for an ASCII file.

But in the final analysis you must prove a negative, which is a troublesome
thing to do.
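
The counting approach could be sketched as follows. Which control characters go into the subset is an assumption made here (everything below 0x20 except tab, line feed, vertical tab, form feed and carriage return, plus DEL), and the code assumes an ASCII-based execution character set. The name count_suspicious_controls is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Counts bytes in buf that fall into a "suspicious" subset of the ASCII
 * control characters. The subset is an assumption: all codes below 0x20
 * except 0x09..0x0D (tab, LF, VT, FF, CR), plus DEL (0x7F). */
size_t count_suspicious_controls(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        int allowed_control = (c >= 0x09 && c <= 0x0D);
        if ((c < 0x20 && !allowed_control) || c == 0x7F)
            count++;
    }
    return count;
}
```

A count of zero is only a hint that the data is ASCII text; bytes above 0x7F are not counted by this sketch and would need a separate decision.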
 
Thomas Matthews

Sunner said:
Hi, all

Since the OS treats both ASCII and binary files as sequences of bytes, is
there any way to determine the file type other than judging by the extension?

Thank you!

In addition to what the others have replied, there is no
guarantee in some operating systems that the extension
reflects the type of file.

For example, in MS-DOS land, one could create a file
containing "The big Ogre" and give it an extension of
".exe". On the other hand, one could rename an executable,
such as "command.com", to "command.txt".

Whether a file is binary or ASCII is an attribute of the
file. Maintaining file attributes is the responsibility
of the operating system (and perhaps the application
creating the file).

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book
 
Dan Pop

James McIninch said:

There's no "ASCII" in C. There is a somewhat artificial distinction between
"text" and "binary", "text" being a special case of a binary file whereby
the operating system might do something to the data as it is written to the
disk to make it compatible with applications that operate on text.

The difference is as natural as you can get on those systems where binary
and text files are completely different beasts. Unix and Windows do not
define the whole world of hosted computing...

Dan
 
Dan Pop

Sunner said:
Since the OS treats both ASCII and binary files as sequences of bytes, is
there any way to determine the file type other than judging by the extension?

If *your* implementation allows opening a file in the wrong (text vs
binary) mode (that's technically undefined behaviour), you can try
opening it in text mode and using some heuristics to decide whether
it contains text or binary data. I wouldn't recommend opening it in
binary mode, as it could expose some of the internals of the text files
representation and allow drawing the wrong conclusion from that.

First, if you find any null character inside, it is reasonable to decide
that you have a binary file (text files seldom contain null characters,
as they upset any input function that returns a string, while binary files
almost always contain at least one null byte).

In the absence of a null character, try finding characters for which
iscntrl() is true but isspace() isn't. Any such beast is also a good
hint that the file is a binary file.

If the file is too large, you may want to restrict your search to the
first N bytes. There are also files containing essentially text, but
having embedded terminal/printer control sequences. It is hard to
say whether they qualify as text files or as binary files and even
harder to identify them.
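
A rough sketch of these heuristics, assuming the caller has already opened the stream (in text mode, as suggested above) and picks the limit N. The name stream_looks_binary and the max_bytes parameter are made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: reads at most max_bytes characters from an
 * already opened stream and applies the two hints in order. Returns
 * true as soon as a character suggests binary data, false otherwise. */
bool stream_looks_binary(FILE *fp, size_t max_bytes)
{
    assert(fp != NULL);
    int c;
    for (size_t n = 0; n < max_bytes && (c = getc(fp)) != EOF; n++) {
        /* Hint 1: a null character. It would also be caught by hint 2;
         * it is kept separate to mirror the two steps. */
        if (c == '\0')
            return true;
        /* Hint 2: a control character that is not whitespace. */
        if (iscntrl(c) && !isspace(c))
            return true;
    }
    return false;
}
```

A caller might do something like fopen(path, "r"), pass the stream with a limit such as 512, and treat a false result only as "probably text". Files with embedded terminal or printer control sequences would be reported as binary by this sketch.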

Dan
 
Dik T. Winter

> But in the final analysis you must prove a negative, which is a troublesome
> thing to do.

Indeed. There has been an example of an executable which consisted only
of bytes in the printable ASCII range.
 
Dan Pop

Dik T. Winter said:
Indeed. There has been an example of an executable which consisted only
of bytes in the printable ASCII range.

ALL my Perl and /bin/sh executables consist only of bytes in the
printable ASCII range ;-)

Dan
 
Michael Wojcik

Indeed. There has been an example of an executable which consisted only
of bytes in the printable ASCII range.

Executable machine code for Pentiums and the like that's entirely
printable ASCII or UTF-16 is quite the rage these days, since it's
useful for exploiting some kinds of buffer overflows and other
security vulnerabilities. It shows up all the time on the security
mailing lists (Bugtraq, Vuln-Dev, etc).

Back to the original problem: If a file consists of nothing but
printable ASCII characters, then it is by definition an ASCII file.
It may not be human-readable text, but it's ASCII. Problem solved.

If the OP wants to determine the *intent* of a file, of course, that's
a bit harder, inasmuch as it's not even well-defined.
 
Peter Pichler

Dan Pop said:
ALL my Perl and /bin/sh executables consist only of bytes in the
printable ASCII range ;-)

It is arguable whether they qualify as "executables". Dik was probably
talking about EICAR or something similar. Look up EICAR in Google.

Peter
 
Dik T. Winter

>
> It is arguable whether they qualify as "executables". Dik was probably
> talking about EICAR or something similar. Look up EICAR in Google.

Nope, much older. It is an old story about a workstation where an rm -rf /
in the works was aborted, but not in time to avoid losing almost all object
files (/bin, /usr/bin). Luckily there was a window open with a shell and a
window with a text editor (vi). On a similar machine an executable was
produced that consisted of ASCII characters only. This was entered in
the text editor, a file was written, a chmod (built-in) was performed in
the shell, and so a start of recovery was found.
 
Dan Pop

Peter Pichler said:
It is arguable whether they qualify as "executables".

Nothing to argue about it on a Unix system. If execve() can execute it,
it is an executable.

Dan
 
Villy Kruse

Nothing to argue about it on a Unix system. If execve() can execute it,
it is an executable.


With a proper interpreter you could make a C source program executable in
the same way. The interpreter will then run the compiler and then run the
object produced by the compiler. If this sounds too strange, then consider
that this is basically what the Perl interpreter, for example, does.


Villy
 
Dan Pop

Villy Kruse said:
With a proper interpreter you could make a C source program executable in
the same way.

Nope: a correct C program cannot start with the #! characters.

Apart from that, you're right. That's why one should be careful with the
terminology: "executable" vs "binary executable". Even if it consists
exclusively of printable ASCII characters, a binary executable is still
a binary executable. However, binary executables contain more than
machine instructions and the data they're supposed to process, and
I doubt that the "metadata" can consist exclusively of printable ASCII
characters.

Dan
 
Ben Pfaff

Nope: a correct C program cannot start with the #! characters.

A shell script need not start with #! to be executable on at
least some Unix-like systems. The default interpreter is
/bin/sh:

blp@sighup:~$ cat > foo
echo bar
blp@sighup:~$ chmod a+x foo
blp@sighup:~$ ./foo
bar
blp@sighup:~$ rm foo
blp@sighup:~$
 
Villy Kruse

Ben Pfaff said:

A shell script need not start with #! to be executable on at
least some Unix-like systems. The default interpreter is
/bin/sh:

blp@sighup:~$ cat > foo
echo bar
blp@sighup:~$ chmod a+x foo
blp@sighup:~$ ./foo
bar
blp@sighup:~$ rm foo
blp@sighup:~$

That doesn't prove anything, as it is the shell which tries to run the
script as a shell script if the kernel doesn't recognize the file as
some script program. Unix shells have always done that, since long before
the #! magic was invented.

However, on Linux, using binfmt_misc, you can teach the kernel to,
for example, treat all files ending in a .c suffix as executable
programs, using a specified program as the interpreter.

See /usr/src/linux/Documentation/binfmt_misc.txt for details.


Villy
 
Dan Pop

Ben Pfaff said:
A shell script need not start with #! to be executable on at
least some Unix-like systems. The default interpreter is
/bin/sh:

blp@sighup:~$ cat > foo
echo bar
blp@sighup:~$ chmod a+x foo
blp@sighup:~$ ./foo
bar
blp@sighup:~$ rm foo
blp@sighup:~$

You're actually making my point: the *default* interpreter is not an
interpreter for C source code.

Dan
 
Sam Dennis

Dan Pop said:
Ben Pfaff said:
[C source files as executables on *nix]
A shell script need not start with #! to be executable on at
least some Unix-like systems. The default interpreter is /bin/sh:

You're actually making my point: the *default* interpreter is not an
interpreter for C source code.

It's a rather silly idea, but I can imagine a shell that interprets a
subset of C in interactive mode (executing statements in main as they
are completed) and the real thing otherwise. It would also be rather
ugly, I suspect, but sessions like this are feasible:

/* The Real C Shell Interpreter */

/* user:/home/user % */ /*
A sample session; all other comments are output, mainly prompts.
*/
/* user:/home/user % */ #include <stdio.h>
/* user:/home/user % */ #include <math.h>
/* user:/home/user % */ int main( void ) {
/* Execution commences */
/* user:/home/user % */ puts( "\"Hello, world!\"" );
/*
"Hello, world!"
*/
/* user:/home/user % */ printf( "%f\n", pow( 4., 2.5 ) )
/* ... */ ;
/*
31.9999
*/
/* user:/home/user % */ if (0) {
/* ... */ puts( "Goodbye, cruel world.\n" );
/* ... */ return 42;
/* ... */ }
/* user:/home/user % */ char *cc = "/bin/sh";
/* user:/home/user % */ __rcsh_chdir( "some_project" );
/* user:/home/user/some_project % */ __rcsh_exec( cc, "--standard=c99",
/* ... */ "--annoying-style-advice", "--target=ds9000", "-o", "runme",
/* ... */ "main.c", NULL );
/*
/* The Real C Shell Compiler * /

char *errors[] = { "main.c: No such file or directory", 0 };
/* Compilation failed; 1 error, 0 warnings. * /
*/
/* user:/home/user % */ }
/* Session closed */

Obviously, it can't be interactive while allowing functions that have
not been defined yet to be called from main; other restrictions might
be necessary.

I'm not sure if this disqualifies an operating system from being said
to be `Unix-like', though.
 
Dan Pop

Sam Dennis said:
Dan Pop said:
Ben Pfaff said:
[C source files as executables on *nix]
A shell script need not start with #! to be executable on at
least some Unix-like systems. The default interpreter is /bin/sh:

You're actually making my point: the *default* interpreter is not an
interpreter for C source code.

It's a rather silly idea, but I can imagine a shell that interprets a
subset of C in interactive mode (executing statements in main as they
are completed) and the real thing otherwise. It would also be rather
ugly, I suspect, but sessions like this are feasible:

No one said that such a shell cannot be imagined, merely that /bin/sh,
the *only* shell that doesn't need the #! "magic number" in the executable
file is NOT such a shell. And to get the executable file interpreted by
any other program, you need the #! "magic number", but a valid C source
file cannot start with this character pair.

So, the *only* solution would be to radically change the specification of
/bin/sh, but it's far too late for that: the resulting system won't
qualify as Unix-like. And building additional heuristics into execve()
based on the file suffix is NOT an option, as there is nothing preventing
/bin/sh scripts from using the .c suffix:

fangorn:/tmp 568> echo uptime >test.c
fangorn:/tmp 569> chmod +x test.c
fangorn:/tmp 570> ./test.c
6:43pm up 41 days 23:07, 13 users, load average: 0.00, 0.00, 0.00

Dan
 
