Determine file type (binary or text)

Sami Viitanen

Hello,

How can I check if a file is binary or text?

There was some easy way but I forgot it..


Thanks in adv.
 
bromden

How can I check if a file is binary or text?
f = os.popen('file -bi test.py', 'r')

(btw, f.read() returns 'text/x-java; charset=us-ascii\n')
 
bromden

f = os.popen('file -bi test.py', 'r')
Sorry, it's not general, since "file -i" returns
"application/x-shellscript" for shell scripts;
it's better to do it like this:
 
Sami Viitanen

Works well in Unix but I'm making a script that works on both
Unix and Windows.

Win doesn't have that 'file -bi' command.
 
Michael Peuser

Hi,
yes there is more than just Unix in the world ;-)
Windows directories have no means to specify their contents type in any way.
The approved method is using three-letter extensions, though this rule is
not strictly followed (lots of files without extensions nowadays!)

When I had a similar problem I read 1000 characters, counted the number of
<32 and >255 characters and classified it "binary" when this quota exceeded
20%. I have no idea whether it will work well with Chinese Unicode files or
some funny depositories or project files that store uncompressed texts....

Kindly
Michael P

Sami Viitanen said:
Works well in Unix but I'm making a script that works on both
Unix and Windows.

Win doesn't have that 'file -bi' command.
 
Karl Scalet

Michael said:
Hi,
yes there is more than just Unix in the world ;-)
Windows directories have no means to specify their contents type in any way.

That's even more true with Linux/Unix, as there is no need to do
any stuff like line-terminator conversion.

The approved method is using three-letter extensions, though this rule is
not strictly followed (lots of files without extensions nowadays!)

When I had a similar problem I read 1000 characters, counted the number of
<32 and >255 characters and classified it "binary" when this quota exceeded
20%. I have no idea whether it will work well with Chinese Unicode files or
some funny depositories or project files that store uncompressed texts....

Based on the idea from Mr. "bromden", why not use mimetypes.MimeTypes()
and guess_type('file://...') and analyze the returned string?
This should work on Windows / Linux / Unix / whatever.
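A minimal sketch of the mimetypes idea, assuming the module-level mimetypes.guess_type() is acceptable in place of a MimeTypes() instance. The helper name guess_is_text is made up for illustration. Note that the standard library guesses from the file name only and never reads the contents, so a file without an extension comes back as unknown.

```python
import mimetypes


def guess_is_text(filename):
    """Guess from the file NAME alone.

    Returns True for a text/* type, False for any other known type,
    and None when the extension is unknown or missing.
    """
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return None
    return mime_type.startswith("text/")


for name in ("notes.txt", "photo.png", "README"):
    print(name, guess_is_text(name))
```

Types such as "application/x-shellscript" mentioned earlier in the thread would be reported as non-text by this check, so the returned string still needs some application-specific interpretation.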


Karl

Kindly
Michael P
 
Peter Hansen

Sami said:
How can I check if a file is binary or text?

There was some easy way but I forgot it..

First you need to define what you mean by binary and text.
Is a file "text" simply because it contains only the
printable (in ASCII) bytes between 31 and 127, plus
CR and/or LF, or do you have a more complex definition
in mind?

Better yet, what do you need the information for? Maybe
the answer to that will show us the proper path to take.
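The strict definition floated above (printable ASCII bytes, i.e. 32 through 126, plus CR and LF) can be sketched as follows. This is only an illustration in Python 3, where iterating over bytes yields integers; the function name is_ascii_text is hypothetical.

```python
def is_ascii_text(data):
    """True if every byte is printable ASCII (32..126) or CR/LF."""
    allowed = set(range(32, 127)) | {0x0A, 0x0D}
    return all(byte in allowed for byte in data)


print(is_ascii_text(b"hello\r\nworld\n"))
print(is_ascii_text(b"col1\tcol2\n"))
```

Under this definition a tab-separated file is not "text", which shows how much the answer depends on the definition chosen.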
 
Trent Mick

[Sami Viitanen wrote]
Hello,

How can I check if a file is binary or text?

There was some easy way but I forgot it..

Generally I define a text file as "it has no null bytes". I think this
is a pretty safe definition (I would be interested to hear practical
experience to the contrary). Assuming that, then:

def is_binary(filename):
    """Return true iff the given filename is binary.

    Raises an EnvironmentError if the file does not exist or cannot be
    accessed.
    """
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while 1:
            chunk = fin.read(CHUNKSIZE)
            if '\0' in chunk: # found null byte
                return 1
            if len(chunk) < CHUNKSIZE:
                break # done
    finally:
        fin.close()

    return 0

Cheers,
Trent
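For readers on Python 3, where a file opened in 'rb' mode returns bytes and the search pattern has to be a bytes literal, the same null-byte test might look like the sketch below. The name has_null_byte and the chunk_size parameter are illustrative, not part of the post above.

```python
def has_null_byte(filename, chunk_size=1024):
    """Return True if the file contains at least one NUL byte.

    Raises OSError if the file does not exist or cannot be read.
    """
    with open(filename, "rb") as fin:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:          # end of file, no NUL seen
                return False
            if b"\0" in chunk:     # found null byte
                return True
```

One practical case to keep in mind: text saved as UTF-16 contains NUL bytes for every ASCII character, so this check would classify it as binary.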
 
Grant Edwards

How can I check if a file is binary or text?

In order to provide an answer, you'll have to define "binary"
and "text".
There was some easy way but I forgot it..

To _me_ a file isn't "binary" or "text". Those are two modes
you can use to read a file. The file itself is neutral on the
matter. At least under Windows and Unix. VMS and FILES-11
contained a _lot_ more meta-data and actually did have several
different fundamental file types (fixed length records,
variable length records, byte-stream, etc.).
 
Peter Hansen

Trent said:
[Sami Viitanen wrote]
Hello,

How can I check if a file is binary or text?

There was some easy way but I forgot it..

Generally I define a text file as "it has no null bytes". I think this
is a pretty safe definition (I would be interested to hear practical
experience to the contrary).

"Contains only printable characters" is probably a more useful definition
of text in many cases. I can't say off the top of my head exactly when
either definition might be a problem.... wait, how about this one: in
CVS, if you don't have a file that is effectively line-oriented, human
readable information, you probably don't want to let it be treated as
"text" and stored as diffs. In that situation, "contains primarily
printable characters organized in lines" is probably a more thorough,
though less deterministic, definition.

-Peter
 
Graham Fawcett

Trent said:
[Sami Viitanen wrote]

Hello,

How can I check if a file is binary or text?

There was some easy way but I forgot it..

Generally I define a text file as "it has no null bytes". I think this
is a pretty safe definition (I would be interested to hear practical
experience to the contrary).

Dangerous assumption. Even if many or most binary files contain NULs, it
doesn't mean that they all do.

It is trivial to create a non-text file that has no NULs.

f = open('no_zeroes.bin', 'rb')
for x in range(1, 256):
    f.write(chr(x))
f.close()

Sami, I would suggest that you need to stop thinking in terms of tools,
and instead think in terms of the problem you're trying to solve. Why do
you need to (or think you need to) determine whether a file is "binary"
or "text"? Why would your application fail if it received a
(binary/text) file when it expected a (text/binary) one?

My guess is that the trait you are trying to identify will prove not to
be "binary or text", but something more application-specific.

-- Graham

P.S. Sami, it's very bad form to "make up" an e-mail address, such as
<[email protected]>. I'm sure the owners of the none.net domain would agree.
Can't you provide a real address?
 
Peter Hansen

Grant said:
The definition of "printable" is dependent on the character
set, that will have to be specified.

That's why I said "printable (in ASCII)" in another message, so I
definitely agree. The problem was rather under-specified. :)
 
John Machin

Michael Peuser said:
When I had a similar problem I read 1000 characters, counted the number of
<32 and >255 characters and classified it "binary" when this quota exceeded

How many characters > 255 did you get? Did you mean 127? If so, what
about accented characters ... like umlauts?

On a slightly more serious note, CR, LF, HT and FF would have to be
considered "text" but their ordinal values are < 32.

What was the problem that you thought you were solving?
 
John Machin

Trent Mick said:
Generally I define a text file as "it has no null bytes". I think this
is a pretty safe definition (I would be interested to hear practical
experience to the contrary).

Data file written by C program which has an off-by-one error and is
including a trailing '\0' byte ...
 
John Machin

Graham Fawcett said:
It is trivial to create a non-text file that has no NULs.

f = open('no_zeroes.bin', 'rb')
for x in range(1, 256):
    f.write(chr(x))
f.close()

I tried this but it didn't work. It said:

IOError: [Errno 2] No such file or directory: 'no_zeroes.bin'.

So I thought I had to be persistent but after doing it a few more times it said:

SerialIdiotError: What I tell you three times is true.
NotLispingError: You need 'wb' as in 'wascally wabbit'

This is very strange behaviour -- does my computer have worms?
 
Graham Fawcett

John said:
It is trivial to create a non-text file that has no NULs.

f = open('no_zeroes.bin', 'rb')
for x in range(1, 256):
    f.write(chr(x))
f.close()

I tried this but it didn't work. It said:

IOError: [Errno 2] No such file or directory: 'no_zeroes.bin'.

So I thought I had to be persistent but after doing it a few more times it said:

SerialIdiotError: What I tell you three times is true.
NotLispingError: You need 'wb' as in 'wascally wabbit'

This is very strange behaviour -- does my computer have worms?

No, but my brain does. Glad you caught my typo.

However, it looks like your computer definitely has an AttitudeError!

-- Graham
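With the 'wb' fix discussed above applied, the counterexample might be written like this in Python 3. It is a sketch only: the temporary directory and the helper names write_no_zeroes and contains_null are assumptions made for the illustration.

```python
import os
import tempfile


def write_no_zeroes(path):
    """Write the byte values 1..255 once each, so the file has no NUL."""
    with open(path, "wb") as f:      # 'wb', not 'rb'
        f.write(bytes(range(1, 256)))


def contains_null(path):
    """Return True if the file contains a NUL byte."""
    with open(path, "rb") as f:
        return b"\0" in f.read()


path = os.path.join(tempfile.mkdtemp(), "no_zeroes.bin")
write_no_zeroes(path)
print(os.path.getsize(path), contains_null(path))
```

The file is full of control characters and high bytes, yet a "no null bytes" test would still call it text.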
 
Peter Hansen

John said:
Data file written by C program which has an off-by-one error and is
including a trailing '\0' byte ...

To be fair, I'd call that a "binary" file in any case, or at least
a defective text file...
 
Brian Lenihan

Peter Hansen said:
"Contains only printable characters" is probably a more useful definition
of text in many cases. I can't say off the top of my head exactly when
either definition might be a problem.... wait, how about this one: in
CVS, if you don't have a file that is effectively line-oriented, human
readable information, you probably don't want to let it be treated as
"text" and stored as diffs. In that situation, "contains primarily
printable characters organized in lines" is probably a more thorough,
though less deterministic, definition.

We check for binary files in our CVS commitprep script like this:

look for -kb arg
open the file in binary mode, read 4k from the file and...

non_text = 0
for i in range(len(buff)):
    a = ord(buff[i])
    # bytes 8..13 (BS, HT, LF, VT, FF, CR) and 32..126 count as text
    if (a < 8) or (a > 13 and a < 32) or (a > 126):
        non_text = non_text + 1

If 10 percent of the characters are found to be non-text, we reject
the file if it was not committed with the -kb flag, or print a warning
if the file appears to be text but is being checked in as a binary.

We don't bother checking for charsets other than ascii, because
localized files have to be checked in as binaries or bad things
(tm) happen.
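The check described above could be wrapped up roughly as follows in Python 3. This is a sketch under assumptions: the function names, the "reject" / "warn" / "ok" return values, and treating exactly 10 percent as over the limit are all choices made for illustration, not details of the actual commit script.

```python
def non_text_ratio(buff):
    """Fraction of bytes outside 8..13 and 32..126 in a bytes buffer."""
    if not buff:
        return 0.0
    non_text = sum(1 for a in buff if a < 8 or 13 < a < 32 or a > 126)
    return non_text / len(buff)


def check_commit(buff, committed_as_binary, threshold=0.10):
    """Compare how the file looks with how it is being committed."""
    looks_binary = non_text_ratio(buff) >= threshold
    if looks_binary and not committed_as_binary:
        return "reject"   # binary-looking file without -kb
    if not looks_binary and committed_as_binary:
        return "warn"     # text-looking file checked in with -kb
    return "ok"


print(check_commit(b"plain ASCII text\n", committed_as_binary=False))
```

In real use buff would be the first 4k read from the file in binary mode, as in the description above.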
 
Sami Viitanen

Thanks for the answers.

To be more specific I'm making a script that should
identify binary files as binary and text files as text.

The script is for automating CVS commands, and
with CVS you have to add the -kb flag to
add (or import) binary files (because it can't itself
determine what type the file is). If a binary file is not
added with -kb the results are awful.

Script example usage:
-import.py <directory_name>

The script makes a list of all files under that directory
and then determines each file's type. After that
all files are added with the Add command and binary
files get that additional -kb automatically.
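The workflow described above might be sketched like this. Everything here is an assumption for illustration: the function names, the use of a simple null-byte test on the first 4k as the classifier, and the choice to only build the command lists rather than run cvs. Any of the heuristics from the thread could be swapped in for is_binary.

```python
import os


def is_binary(path, chunk_size=4096):
    """Placeholder classifier: a NUL byte in the first chunk means binary."""
    with open(path, "rb") as f:
        return b"\0" in f.read(chunk_size)


def cvs_add_commands(directory):
    """Build one 'cvs add' argument list per file under directory.

    Binary files get the extra -kb flag. Nothing is executed here.
    """
    commands = []
    for root, dirs, files in os.walk(directory):
        if "CVS" in dirs:
            dirs.remove("CVS")        # skip CVS bookkeeping directories
        for name in sorted(files):
            path = os.path.join(root, name)
            cmd = ["cvs", "add"]
            if is_binary(path):
                cmd.append("-kb")
            cmd.append(path)
            commands.append(cmd)
    return commands
```

Each returned list could then be printed for review or handed to subprocess.run() once the classification has been checked by hand.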
 
