Test if file is binary ?

Rebhan, Gilbert

Hi,

How can I test whether a file is binary or not?

There isn't anything like File.binary?:
NoMethodError: undefined method `binary?' for File:Class

Any ideas or libraries available?

Regards, Gilbert
 
dima

Hi ,

how to test if a file is binary or not ?

There ain't something like File.binary =
NoMethodError: undefined method `binary?' for File:Class

Any ideas or libraries available ?

Regards, Gilbert

What do you need to achieve with this is_binary? method?
All files are just collections of bytes, so from one perspective they
are all binary. We interpret them as suits our needs.
 
Rebhan, Gilbert

Hi,

-----Original Message-----
From: dima [mailto:[email protected]]
Sent: Tuesday, August 21, 2007 8:50 AM
To: ruby-talk ML
Subject: Re: Test if file is binary ?

What do you need to achieve with this is_binary? method?
All files are just collections of bytes, so from one perspective they
are all binary. We interpret them as suits our needs.

For example, this information is needed to decide whether
CVS should handle that file / that file extension as binary or ASCII.

Regards, Gilbert
 
Robert Klemme

2007/8/21 said:
Hi ,

how to test if a file is binary or not ?

There ain't something like File.binary =
NoMethodError: undefined method `binary?' for File:Class

Any ideas or libraries available ?

If I really needed it, I'd probably use a heuristic based on the
distribution of byte values across an initial portion of the file.
Something like this:

class File
  def self.binary?(name)
    ascii = control = binary = 0

    File.open(name, "rb") {|io| io.read(1024)}.each_byte do |bt|
      case bt
      when 0...32
        control += 1
      when 32...128
        ascii += 1
      else
        binary += 1
      end
    end

    control.to_f / ascii > 0.1 || binary.to_f / ascii > 0.05
  end
end

Kind regards

robert
 
Rebhan, Gilbert

Hi,

-----Original Message-----
From: Robert Klemme [mailto:[email protected]]
Sent: Tuesday, August 21, 2007 9:05 AM
To: ruby-talk ML
Subject: Re: Test if file is binary ?

2007/8/21 said:
Hi ,

how to test if a file is binary or not ?

There ain't something like File.binary =
NoMethodError: undefined method `binary?' for File:Class

Any ideas or libraries available ?

/*

If I'd really need it I'd probably do a heuristic based on
distribution of byte values across an initial portion of the file.
Something like this:

class File
  def self.binary?(name)
    ascii = control = binary = 0

    File.open(name, "rb") {|io| io.read(1024)}.each_byte do |bt|
      case bt
      when 0...32
        control += 1
      when 32...128
        ascii += 1
      else
        binary += 1
      end
    end

    control.to_f / ascii > 0.1 || binary.to_f / ascii > 0.05
  end
end

*/


Nice :) Thanks !!

Regards, Gilbert
 
Alex Gutteridge

Hi,

-----Original Message-----
From: dima [mailto:[email protected]]
Sent: Tuesday, August 21, 2007 8:50 AM
To: ruby-talk ML
Subject: Re: Test if file is binary ?

What do you need to achieve with this is_binary? method?
All files are just collections of bytes, so from one perspective they
are all binary. We interpret them as suits our needs.

For example, this information is needed to decide whether
CVS should handle that file / that file extension as binary or ASCII.

Regards, Gilbert

One simple approach is this:

class File
  def is_binary?
    ascii = 0
    total = 0
    self.read(1024).each_byte{|c| total += 1; ascii +=1 if c >= 128 or c == 0}
    ascii.to_f / total.to_f > 0.33 ? true : false
  end
end

You can tweak the 0.33 value if you like. Probably better (i.e. more
robust) ways out there though.

Alex Gutteridge

Bioinformatics Center
Kyoto University
 
Alex Gutteridge

Sorry for the duplicate! Robert is too fast for me.

Alex Gutteridge

Bioinformatics Center
Kyoto University
 
Robert Klemme

2007/8/21 said:
Sorry for the duplicate! Robert is too fast for me.

It's always good to see more solutions. I like the conciseness of
your solution. But I think this should rather be a class method
because you would not do the test on an open stream. Dunno which of
the solutions is more realistic. Might be fun to let both approaches
test a large number of files and compare their results (probably also
with output from "file"). :)

Btw, you should get rid of the ternary operator - it's totally
superfluous because there is no point in converting a boolean value
into a boolean value. :)

Kind regards

robert
 
Rebhan, Gilbert

-----Original Message-----
From: Robert Klemme [mailto:[email protected]]
Sent: Tuesday, August 21, 2007 9:41 AM
To: ruby-talk ML
Subject: Re: Test if file is binary ?

2007/8/21 said:
Sorry for the duplicate! Robert is too fast for me.

/*
It's always good to see more solutions. I like the conciseness of
your solution. But I think this should rather be a class method
because you would not do the test on an open stream. Dunno which of
the solutions is more realistic.
*/

You mean it should be something like this?

class File
  def self.is_binary?(name)
    ascii = total = 0
    File.open(name, "rb") { |io| io.read(1024) }.each_byte do |c|
      total += 1
      ascii += 1 if c >= 128 or c == 0
    end
    ascii.to_f / total.to_f > 0.33
  end
end


/*
Might be fun to let both approaches
test a large number of files and compare their results (probably also
with output from "file"). :)
*/

Is there an existing standard for what is considered a binary file,
meaning a rule like: check the first block of a file and

- if control characters (ASCII 0-31) and "high ASCII" (> 127) are found,
it's considered a binary file, otherwise a text file

- if the number of control characters (ASCII 0-31 and > 127) found == 0,
it's always considered a text file

??


Regards, Gilbert
 
Xavier Noria

Is there an existing standard for what is considered a binary file,
meaning a rule like: check the first block of a file and

- if control characters (ASCII 0-31) and "high ASCII" (> 127) are found,
it's considered a binary file, otherwise a text file

- if the number of control characters (ASCII 0-31 and > 127) found == 0,
it's always considered a text file

??

What's the heuristic in Subversion?

-- fxn
 
Rebhan, Gilbert

-----Original Message-----
From: Xavier Noria [mailto:[email protected]]
Sent: Tuesday, August 21, 2007 10:25 AM
To: ruby-talk ML
Subject: Re: Test if file is binary ?

Is there an existing standard for what is considered a binary file,
meaning a rule like: check the first block of a file and

- if control characters (ASCII 0-31) and "high ASCII" (> 127) are found,
it's considered a binary file, otherwise a text file

- if the number of control characters (ASCII 0-31 and > 127) found == 0,
it's always considered a text file

??

/*
What's the heuristic in Subversion?
*/

The Subversion FAQ
(http://subversion.tigris.org/faq.html#binary-files) says:
" ...
if any of the bytes are zero, or if more than 15% are not ASCII printing
characters, then Subversion calls the file binary. This heuristic might
be improved in the future, however."

Regards, Gilbert
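
The rule quoted from the Subversion FAQ can be sketched in Ruby roughly as follows. This is only an illustration of the quoted sentence, not Subversion's actual code: the method name svn_style_binary?, the 1024-byte sample size (borrowed from the other snippets in this thread) and the choice to count tab, LF and CR as printing characters are all assumptions.

```ruby
# Sketch of the quoted rule: any zero byte, or more than 15% non-printing
# bytes, means "binary".
# Assumptions (not in the FAQ excerpt): only the first 1024 bytes are
# sampled, and tab, LF and CR count as printing characters.
SAMPLE_SIZE = 1024
PRINTING = ([9, 10, 13] + (32..126).to_a).freeze

def svn_style_binary?(path, sample_size = SAMPLE_SIZE)
  sample = File.open(path, "rb") { |io| io.read(sample_size) }
  # io.read(n) returns nil at EOF, so an empty file is treated as text.
  return false if sample.nil? || sample.empty?

  bytes = sample.bytes
  return true if bytes.include?(0)

  non_printing = bytes.count { |b| !PRINTING.include?(b) }
  non_printing.to_f / bytes.size > 0.15
end
```

Calling svn_style_binary?("some_file") returns true or false; the 0.15 threshold can be adjusted in the same way as the 0.33 value discussed earlier.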
 
Peña, Botp

From: Rebhan, Gilbert [mailto:[email protected]]
# Is there an existing standard what is considered as a binary file,

If you're on a *nix (non-Windows) box, you should use the OS file
command and just wrap it in Ruby:

irb(main):022:0> def is_bin(f)
irb(main):023:1> %x(file #{f}) !~ /text/
irb(main):024:1> end
=> nil
irb(main):025:0> is_bin "test.rb"
=> false
irb(main):026:0> is_bin "test.txt"
=> false
irb(main):027:0> is_bin "/usr/local/bin/dnscache"
=> true
irb(main):028:0> is_bin "/bin/ps"
=> true
irb(main):029:0> def is_text(f)
irb(main):030:1> %x(file #{f}) =~ /text/
irb(main):031:1> end
=> nil
irb(main):032:0> is_text "test.rb"
=> 27
irb(main):033:0> is_text "test.txt"
=> 16
irb(main):034:0> is_text "/usr/local/bin/dnscache"
=> nil
irb(main):035:0> is_text "/bin/ps"
=> nil

kind regards -botp
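
A slightly more defensive variant of the same idea, as a sketch only: it assumes a file(1) command on the PATH that supports the -b (brief) option, and the method name text_according_to_file? is made up for illustration. Passing the arguments separately avoids the shell, and -b keeps the file name out of the output so that a name containing "text" cannot cause a false match.

```ruby
require "open3"

# Ask file(1) whether it considers the path to be text.
# Hypothetical helper; assumes `file -b` is available on this system.
def text_according_to_file?(path)
  # Separate arguments bypass the shell, so spaces or quotes in the path
  # need no escaping. Expanding the path keeps a leading "-" from being
  # read as an option.
  output, status = Open3.capture2("file", "-b", File.expand_path(path))
  raise "file(1) failed for #{path}" unless status.success?

  output.include?("text")
end
```

Unlike is_text above, this returns true or false rather than a match position or nil.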
 
Robert Klemme

2007/8/21 said:
-----Original Message-----
From: Robert Klemme [mailto:[email protected]]
Sent: Tuesday, August 21, 2007 9:41 AM
To: ruby-talk ML
Subject: Re: Test if file is binary ?

2007/8/21 said:
Sorry for the duplicate! Robert is too fast for me.

/*
It's always good to see more solutions. I like the conciseness of
your solution. But I think this should rather be a class method
because you would not do the test on an open stream. Dunno which of
the solutions is more realistic.
*/

You mean it should be something like this?

class File
  def self.is_binary?(name)
    ascii = total = 0
    File.open(name, "rb") { |io| io.read(1024) }.each_byte do |c|
      total += 1
      ascii += 1 if c >= 128 or c == 0
    end
    ascii.to_f / total.to_f > 0.33
  end
end

Yep. But I'd leave the "is_" out - that's handled by the "?" already.

Cheers

robert
 
Bertram Scharpf

Hi,

Am Dienstag, 21. Aug 2007, 15:57:13 +0900 schrieb Rebhan, Gilbert:
From: dima [mailto:[email protected]]
Sent: Tuesday, August 21, 2007 8:50 AM
Subject: Re: Test if file is binary ?

What do you need to achieve with this is_binary? method?

For example, this information is needed to decide whether
CVS should handle that file / that file extension as binary or ASCII.

I'm impressed by the solutions of Alex and Robert. Anyway I
suppose in most cases a test on one single null character
will suffice. Something like this:

class File
  def binary?
    while (b=f.read(256)) do
      return true if b[ "\0"]
    end
  end
end

Yet I recommend first to review whether you want to read the
file later. In this case you may abort reading when the file
fails a more sophisticated filetype check.

Dividing files into "text" and "binary" is the archetypal
misdesign in the operating system you use. (Is there
anything designed well (besides Outlook, of course)?) The
distinction doesn't refer to the file's _contents_ but to how
the file is _treated_ when it is being read or written. In
"rb"/"wb" modes files are left as they are; in "r"/"w"
modes Windows programmers get line ends "\r\n" translated
into "\n", which disturbs file positions and string lengths.
I think the only purpose of this is to deter programmers
from doing anything a non-Microsoft way.

Bertram
 
Daniel Berger

Hi ,

how to test if a file is binary or not ?

There ain't something like File.binary =
NoMethodError: undefined method `binary?' for File:Class

gem install ptools
require 'ptools'
File.binary?(file)

Regards,

Dan
 
Bertram Scharpf

Hi,

Am Dienstag, 21. Aug 2007, 18:06:26 +0900 schrieb Bertram Scharpf:
class File
  def binary?
    while (b=f.read(256)) do
      return true if b[ "\0"]
    end
  end
end

This is a blunder, of course. Some better ones:

def File.binary? name
  open name do |f|
    while (b=f.read(256)) do
      return true if b[ "\0"]
    end
  end
  false
end

def File.binary? name
  open name do |f|
    f.each_byte { |x|
      x.nonzero? or return true
    }
  end
  false
end

Just to be correct.

Bertram
 
Simon Krahnke

* Robert Klemme said:
If I'd really need it I'd probably do a heuristic based on
distribution of byte values across an initial portion of the file.

That only shows how many non-ASCII characters are used. It won't
recognise Russian script in UTF-8 as text, or uuencode as binary.

What diff (and thus rcs, cvs, svn ...) cares about is lines. Something
is text if it's logically organized in short lines, and EOL characters
are used only for ending lines.

class File
  def self.binary?(name)
    cr, len, mlen = false, 0, 0
    File.open(name, "rb") {|io| io.read(1024)}.each_byte do |bt|
      return false if cr and bt != 10
      case bt
      when 13
        cr = true
      when 10
        mlen = len if len > mlen
        len = 0
      else
        len += 1
      end
    end
    mlen > 1000
  end
end

I chose 1000 as the maximum line length, to fit whole paragraphs in one
line. But of course the maximum of the processing tool is what matters
here. That is the right place to do the check anyway.

mfg, simon .... l
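
The point about Russian text in UTF-8 can be illustrated with a sketch that checks encoding validity instead of counting high bytes. It relies on String#force_encoding and String#valid_encoding?, which exist in Ruby 1.9 and later (newer than the Ruby current when this thread was written). The method name utf8_text?, the 1024-byte sample and the rule "no NUL bytes and valid UTF-8 means text" are assumptions made for illustration.

```ruby
# Treat a file as text when its first bytes contain no NUL and form valid
# UTF-8 (plain ASCII is a subset of UTF-8, so it passes too).
def utf8_text?(path, sample_size = 1024)
  sample = File.open(path, "rb") { |io| io.read(sample_size) }
  return true if sample.nil? || sample.empty?   # empty file: treat as text
  return false if sample.include?("\0")

  # A full-sized sample may end in the middle of a multibyte character, so
  # allow up to three trailing bytes to be dropped in that case only.
  truncated = sample.bytesize == sample_size
  max_drop = truncated ? [3, sample.bytesize - 1].min : 0

  (0..max_drop).any? do |drop|
    sample.byteslice(0, sample.bytesize - drop)
          .force_encoding(Encoding::UTF_8)
          .valid_encoding?
  end
end
```

This does nothing about uuencoded data, which is pure ASCII and would still be reported as text.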
 
Stefan Mahlitz

Simon said:
* Robert Klemme said:
If I'd really need it I'd probably do a heuristic based on
distribution of byte values across an initial portion of the file.

That only shows how many non-ASCII characters are used. It won't
recognise Russian script in UTF-8 as text, or uuencode as binary.

What diff (and thus rcs, cvs, svn ...) cares about is lines. Something
is text if it's logically organized in short lines, and EOL characters
are used only for ending lines.
[snip]

I chose 1000 as the maximum line length, to fit whole paragraphs in one
line. But of course the maximum of the processing tool is what matters
here. That is the right place to do the check anyway.

That's why clearcase (on windows) claimed my pure-ascii xml-file was
non-text (and did refuse to check it in). One line exceeded 8000 characters.

This is on my personal list of 'bad practices', but it may be
appropriate to others.

My 0.02EUR

Stefan
 
Simon Krahnke

* Stefan Mahlitz said:
That's why clearcase (on windows) claimed my pure-ascii xml-file was
non-text (and did refuse to check it in). One line exceeded 8000 characters.

You can't seriously treat a file with lines longer than 8000 characters
as line oriented. It's far from being readable by a human. You declare
that file as application/xml.

One small change in that line will produce a patch of more than 8000
characters. And if that change is at the end of the line the diff tool
will have to use 4 pages of memory for the compare.
This is on my personal list of 'bad practices', but it may be
appropriate to others.

I think it's bad practice to declare something with huge lines as text.

mfg, simon .... l
 
Stefan Mahlitz

Simon said:
You can't seriously treat a file with lines longer than 8000 characters
as line oriented. It's far from being readable by a human. You declare
that file as application/xml.

Maybe this was a bad example. You are right, the xml-file would be best
treated by clearcase as application/xml or text/xml. This did not work
(and I was bitten by this recently - so this strange behaviour was fresh
when I read your email).

But I cannot see the problem with text-files containing long lines. If I
write a single paragraph with more than 1000 or 8000 characters - why
shouldn't this be text?

Why do you think it is not readable?
One small change in that line will produce a patch of more than 8000
characters. And if that change is at the end of the line the diff tool
will have to use 4 pages of memory for the compare.

Sorry, I fail to see your point. Are we really judging whether a file is
text by how many memory pages a diff will take or how many characters a
patch has?

I couldn't find a definition of text except that text means absence of
binary data. This is weak - so I would follow your definition - A text
file is a file which can be read by a human.
I think it's bad practice to declare something with huge lines as text.

Well, I disagree.

But to get (slightly at least) on topic again: if I had to detect
whether a file is text, I would go with a combination of Robert Klemme's
and Bertram Scharpf's solutions.

Stefan
 
