Test if file is binary ?

Discussion in 'Ruby' started by Rebhan, Gilbert, Aug 21, 2007.

  1. Hi,

    How can I test whether a file is binary or not?

    There is nothing like File.binary?:
    NoMethodError: undefined method `binary?' for File:Class

    Any ideas or libraries available?

    Regards, Gilbert
    Rebhan, Gilbert, Aug 21, 2007
    #1

  2. dima

    On Aug 21, 8:04 am, "Rebhan, Gilbert" <>
    wrote:
    > Hi ,
    >
    > how to test if a file is binary or not ?
    >
    > There ain't something like File.binary?
    > NoMethodError: undefined method `binary?' for File:Class
    >
    > Any ideas or libraries available ?
    >
    > Regards, Gilbert


    What do you need to achieve with this is_binary? method?
    All files are just collections of bytes, so from one perspective they
    are all binary. We interpret them as suits our needs.
    dima, Aug 21, 2007
    #2

  3. Hi,

    -----Original Message-----
    From: dima [mailto:]
    Sent: Tuesday, August 21, 2007 8:50 AM
    To: ruby-talk ML
    Subject: Re: Test if file is binary ?

    On Aug 21, 8:04 am, "Rebhan, Gilbert" <>
    wrote:
    > Hi ,
    >
    > how to test if a file is binary or not ?
    >
    > There ain't something like File.binary?
    > NoMethodError: undefined method `binary?' for File:Class
    >
    > Any ideas or libraries available ?


    >What to you need to achieve with this is_binary? method?
    >All files are just collection of bytes, so in a perspective they all
    >are binary. We interpret them as suites our needs.


    For example, this information is needed to decide whether
    CVS should handle a file (or a file extension) as binary or ASCII.

    Regards, Gilbert
    Rebhan, Gilbert, Aug 21, 2007
    #3
  4. 2007/8/21, Rebhan, Gilbert <>:
    >
    > Hi ,
    >
    > how to test if a file is binary or not ?
    >
    > There ain't something like File.binary?
    > NoMethodError: undefined method `binary?' for File:Class
    >
    > Any ideas or libraries available ?


    If I really needed it, I'd probably use a heuristic based on the
    distribution of byte values across an initial portion of the file.
    Something like this:

    class File
      def self.binary?(name)
        ascii = control = binary = 0

        # read returns nil for an empty file, hence to_s
        File.open(name, "rb") { |io| io.read(1024) }.to_s.each_byte do |bt|
          case bt
          when 0...32          # control characters, including tab and newline
            control += 1
          when 32...128        # 7-bit ASCII
            ascii += 1
          else                 # bytes with the high bit set
            binary += 1
          end
        end

        control.to_f / ascii > 0.1 || binary.to_f / ascii > 0.05
      end
    end
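
    One caveat with counting every byte below 32 as a control character:
    tab, line feed and carriage return fall in that range, so a text file
    made of very short lines can cross the 10% threshold. A minimal sketch
    of a variant that counts common whitespace as text. The method name
    binary_by_distribution? and the whitespace set are assumptions made for
    illustration, and DEL (127) is counted as a control character here:

```ruby
# Same thresholds as the heuristic above, but whitespace counts as text.
def binary_by_distribution?(data)
  whitespace = [9, 10, 12, 13]        # tab, LF, FF, CR: an assumed set
  text = control = high = 0

  data.each_byte do |b|
    if whitespace.include?(b) || b.between?(32, 126)
      text += 1
    elsif b < 32 || b == 127
      control += 1
    else
      high += 1
    end
  end

  return false if data.empty?         # nothing to judge: call it text
  return true if text.zero?           # no printable bytes at all
  control.to_f / text > 0.1 || high.to_f / text > 0.05
end

# Usage: feed it the first block of a file opened in binary mode.
# sample = File.open(path, "rb") { |io| io.read(1024) }.to_s
# binary_by_distribution?(sample)
```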

    Kind regards

    robert
    Robert Klemme, Aug 21, 2007
    #4
  5. Hi,

    -----Original Message-----
    From: Robert Klemme [mailto:]
    Sent: Tuesday, August 21, 2007 9:05 AM
    To: ruby-talk ML
    Subject: Re: Test if file is binary ?

    /*

    If I'd really need it I'd probably do a heuristic based on
    distribution of byte values across an initial portion of the file.
    Something like this:

    class File
      def self.binary?(name)
        ascii = control = binary = 0

        File.open(name, "rb") {|io| io.read(1024)}.each_byte do |bt|
          case bt
          when 0...32
            control += 1
          when 32...128
            ascii += 1
          else
            binary += 1
          end
        end

        control.to_f / ascii > 0.1 || binary.to_f / ascii > 0.05
      end
    end

    */


    Nice :) Thanks !!

    Regards, Gilbert
    Rebhan, Gilbert, Aug 21, 2007
    #5
  6. On 21 Aug 2007, at 15:57, Rebhan, Gilbert wrote:

    >
    > Hi,
    >
    > -----Original Message-----
    > From: dima [mailto:]
    > Sent: Tuesday, August 21, 2007 8:50 AM
    > To: ruby-talk ML
    > Subject: Re: Test if file is binary ?
    >
    > On Aug 21, 8:04 am, "Rebhan, Gilbert" <>
    > wrote:
    >> Hi ,
    >>>
    >>> how to test if a file is binary or not ?
    >>>
    >>> There ain't something like File.binary?
    >>> NoMethodError: undefined method `binary?' for File:Class
    >>>
    >>> Any ideas or libraries available ?

    >
    >> What to you need to achieve with this is_binary? method?
    >> All files are just collection of bytes, so in a perspective they all
    >> are binary. We interpret them as suites our needs.

    >
    > For example this information is needed to decide whether
    > cvs should handle that file / that fileextension as binary or ascii
    >
    > Regards, Gilbert


    One simple approach is this:

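    class File
      def is_binary?
        ascii = 0   # despite the name, counts NUL and high (>= 128) bytes
        total = 0
        # read returns nil at end of file, hence to_s
        self.read(1024).to_s.each_byte { |c| total += 1; ascii += 1 if c >= 128 or c == 0 }
        ascii.to_f / total.to_f > 0.33 ? true : false
      end
    end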

    You can tweak the 0.33 value if you like. Probably better (i.e. more
    robust) ways out there though.

    Alex Gutteridge

    Bioinformatics Center
    Kyoto University
    Alex Gutteridge, Aug 21, 2007
    #6
  7. Sorry for the duplicate! Robert is too fast for me.

    Alex Gutteridge

    Bioinformatics Center
    Kyoto University
    Alex Gutteridge, Aug 21, 2007
    #7
  8. 2007/8/21, Alex Gutteridge <-u.ac.jp>:
    > Sorry for the duplicate! Robert is too fast for me.


    It's always good to see more solutions. I like the conciseness of
    your solution. But I think this should rather be a class method
    because you would not do the test on an open stream. Dunno which of
    the solutions is more realistic. Might be fun to let both approaches
    test a large number of files and compare their results (probably also
    with output from "file"). :)
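
    Such a comparison could be sketched roughly as follows. The helper
    names disagreements and sample_checks are hypothetical, and the two
    predicates are deliberately simple stand-ins rather than the exact
    methods posted in this thread:

```ruby
# Two stand-in predicates that each judge a byte sample (hypothetical).
def sample_checks
  {
    nul_byte:   ->(s) { s.bytes.include?(0) },
    high_bytes: ->(s) { s.bytes.count { |b| b >= 128 }.fdiv([s.bytesize, 1].max) > 0.33 }
  }
end

# Run every check on the first 1024 bytes of each file and return only
# the files on which the checks disagree, with the individual verdicts.
def disagreements(paths, checks)
  paths.each_with_object({}) do |path, out|
    sample = File.open(path, "rb") { |io| io.read(1024) }.to_s
    verdicts = checks.transform_values { |check| check.call(sample) }
    out[path] = verdicts if verdicts.values.uniq.size > 1
  end
end

# Usage:
# files = Dir.glob("**/*").select { |p| File.file?(p) }
# disagreements(files, sample_checks).each { |path, v| puts "#{path}: #{v}" }
```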

    Btw, you should get rid of the ternary operator - it's totally
    superfluous because there is no point in converting a boolean value
    into a boolean value. :)

    Kind regards

    robert
    Robert Klemme, Aug 21, 2007
    #8
  9.

    -----Original Message-----
    From: Robert Klemme [mailto:]
    Sent: Tuesday, August 21, 2007 9:41 AM
    To: ruby-talk ML
    Subject: Re: Test if file is binary ?

    2007/8/21, Alex Gutteridge <-u.ac.jp>:
    > Sorry for the duplicate! Robert is too fast for me.


    /*
    It's always good to see more solutions. I like the conciseness of
    your solution. But I think this should rather be a class method
    because you would not do the test on an open stream. Dunno which of
    the solutions is more realistic.
    */

    you mean it should be something like this?

    class File
      def self.is_binary?(name)
        ascii = total = 0
        # read returns nil for an empty file, hence to_s
        File.open(name, "rb") { |io| io.read(1024) }.to_s.each_byte do |c|
          total += 1
          ascii += 1 if c >= 128 or c == 0
        end
        ascii.to_f / total.to_f > 0.33
      end
    end


    /*
    Might be fun to let both approaches
    test a large number of files and compare their results (probably also
    with output from "file"). :)
    */

    Is there an existing standard for what is considered a binary file,
    meaning a rule like: check the first block of a file and

    - if control characters (ASCII 0-32) and "high ASCII" (> 128) make up
      more than 30 % of it, it is considered a binary file, otherwise a
      text file

    - if no control characters (ASCII 0-32) or "high ASCII" (> 128) are
      found at all, it is always considered a text file

    ??


    Regards, Gilbert
    Rebhan, Gilbert, Aug 21, 2007
    #9
  10. Xavier Noria

    On Aug 21, 2007, at 10:21 AM, Rebhan, Gilbert wrote:

    > Is there an exisiting standard what is considered as a binary file,
    > means a rule like check the first block from a file and
    >
    > - if control characters (ASCII 0-32) and "high ASCII" (> 128) are
    > found > 30 % it's considered as binary file otherwise textfile
    >
    > - if control characters (ASCII 0-32 and > 128) are found == 0 it's
    > always considered as textfile
    >
    > ??


    What's the heuristic in Subversion?

    -- fxn
    Xavier Noria, Aug 21, 2007
    #10
  11.

    -----Original Message-----
    From: Xavier Noria [mailto:]
    Sent: Tuesday, August 21, 2007 10:25 AM
    To: ruby-talk ML
    Subject: Re: Test if file is binary ?


    /*
    What's the heuristic in Subversion?
    */

    The Subversion FAQ
    http://subversion.tigris.org/faq.html#binary-files says:
    " ...
    if any of the bytes are zero, or if more than 15% are not ASCII printing
    characters,
    then Subversion calls the file binary. This heuristic might be improved
    in the future, however."
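
    The quoted rule can be sketched in Ruby. This only illustrates the FAQ
    sentence and is not Subversion's actual code. The method name
    svn_like_binary? is made up, and treating tab, LF, FF and CR as
    "printing" characters is an assumption, since the FAQ excerpt does not
    say how whitespace is counted:

```ruby
# Returns true if the sample looks binary under the quoted rule:
# any zero byte, or more than 15% non-printing bytes.
def svn_like_binary?(sample)
  bytes = sample.bytes
  return false if bytes.empty?          # nothing to judge: call it text
  return true if bytes.include?(0)      # any zero byte => binary

  whitespace = [9, 10, 12, 13]          # tab, LF, FF, CR (assumed printing)
  non_printing = bytes.count do |b|
    !(b.between?(32, 126) || whitespace.include?(b))
  end
  non_printing.fdiv(bytes.size) > 0.15  # more than 15% => binary
end

# Usage:
# sample = File.open(path, "rb") { |io| io.read(1024) }.to_s
# svn_like_binary?(sample)
```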

    Regards, Gilbert
    Rebhan, Gilbert, Aug 21, 2007
    #11
  12. Peña, Botp

    From: Rebhan, Gilbert [mailto:]
    # Is there an exisiting standard what is considered as a binary file,

    if you're on a *nix (non-Windows) box, you should use the OS file
    command and then just wrap it in Ruby,

    irb(main):022:0> def is_bin(f)
    irb(main):023:1> %x(file #{f}) !~ /text/
    irb(main):024:1> end
    => nil
    irb(main):025:0> is_bin "test.rb"
    => false
    irb(main):026:0> is_bin "test.txt"
    => false
    irb(main):027:0> is_bin "/usr/local/bin/dnscache"
    => true
    irb(main):028:0> is_bin "/bin/ps"
    => true
    irb(main):029:0> def is_text(f)
    irb(main):030:1> %x(file #{f}) =~ /text/
    irb(main):031:1> end
    => nil
    irb(main):032:0> is_text "test.rb"
    => 27
    irb(main):033:0> is_text "test.txt"
    => 16
    irb(main):034:0> is_text "/usr/local/bin/dnscache"
    => nil
    irb(main):035:0> is_text "/bin/ps"
    => nil
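
    Two things to watch with the %x form: the path goes through the shell,
    so names with spaces or quotes break it, and the output of file starts
    with the file name, so a binary called "text.bin" would match /text/.
    A hedged sketch that avoids both, assuming a file(1) that supports the
    --brief and --mime-encoding options (current Linux versions do). The
    method names are made up for illustration:

```ruby
require "open3"

# Interpret the output of `file --brief --mime-encoding`, which is a
# single word such as "us-ascii", "utf-8" or "binary".
def binary_encoding?(file_output)
  file_output.strip == "binary"
end

# Ask file(1) about a path. Arguments are passed as a list, so no shell
# is involved; "--" keeps a name starting with "-" from being read as
# an option.
def binary_according_to_file?(path)
  out, status = Open3.capture2("file", "--brief", "--mime-encoding", "--", path)
  raise "file(1) failed for #{path}" unless status.success?
  binary_encoding?(out)
end

# Usage:
# binary_according_to_file?("/bin/ps")
```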

    kind regards -botp
    Peña, Botp, Aug 21, 2007
    #12
  13. 2007/8/21, Rebhan, Gilbert <>:
    >
    >
    > -----Original Message-----
    > From: Robert Klemme [mailto:]
    > Sent: Tuesday, August 21, 2007 9:41 AM
    > To: ruby-talk ML
    > Subject: Re: Test if file is binary ?
    >
    > 2007/8/21, Alex Gutteridge <-u.ac.jp>:
    > > Sorry for the duplicate! Robert is too fast for me.

    >
    > /*
    > It's always good to see more solutions. I like the conciseness of
    > your solution. But I think this should rather be a class method
    > because you would not do the test on an open stream. Dunno which of
    > the solutions is more realistic.
    > */
    >
    > you mean it should be something like ?
    >
    > class File
    > def self.is_binary?(name)
    > ascii = total = 0
    > File.open(name, "rb") { |io| io.read(1024) }.each_byte do |c|
    > total += 1;
    > ascii +=1 if c >= 128 or c == 0
    > end
    > ascii.to_f / total.to_f > 0.33
    > end
    > end


    Yep. But I'd leave the "is_" out - that's handled by the "?" already.

    Cheers

    robert
    Robert Klemme, Aug 21, 2007
    #13
  14. Hi,

    Am Dienstag, 21. Aug 2007, 15:57:13 +0900 schrieb Rebhan, Gilbert:
    > From: dima [mailto:]
    > Sent: Tuesday, August 21, 2007 8:50 AM
    > Subject: Re: Test if file is binary ?
    >
    > On Aug 21, 8:04 am, "Rebhan, Gilbert" <>
    > wrote:
    > >> how to test if a file is binary or not ?
    > >>
    > >> There ain't something like File.binary?
    > >> NoMethodError: undefined method `binary?' for File:Class

    >
    > >What to you need to achieve with this is_binary? method?

    >
    > For example this information is needed to decide whether
    > cvs should handle that file / that fileextension as binary or ascii


    I'm impressed by the solutions of Alex and Robert. Anyway I
    suppose in most cases a test on one single null character
    will suffice. Something like this:

    class File
      def binary?
        while (b=f.read(256)) do
          return true if b[ "\0"]
        end
      end
    end

    Still, I recommend first considering whether you are going to read
    the file later anyway. In that case you can abort reading when the
    file fails a more sophisticated file type check.

    Dividing files into "text" and "binary" is the archetypal
    misdesign in the operating system you use. (Is there
    anything designed well (besides Outlook, of course)?) The
    distinction doesn't refer to the file's _contents_ but to how
    the file is _treated_ when it is being read or written. In
    "rb"/"wb" modes files are left as they are; in "r"/"w"
    modes Windows programmers get line ends "\r\n" translated
    into "\n", which disturbs file positions and string lengths.
    I think the only purpose of this is to keep programmers
    from doing anything in a non-Microsoft way.
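
    A small sketch of the "rb" versus "r" difference described above. The
    helper name read_both_ways is made up for illustration. On Unix-like
    systems both reads return the same bytes, since the newline translation
    only happens on Windows. On Ruby 1.9 and later, "rb" additionally tags
    the string as binary (ASCII-8BIT) rather than with the default external
    encoding:

```ruby
require "tempfile"

# Read the same file once in binary mode and once in text mode.
def read_both_ways(path)
  raw  = File.open(path, "rb") { |io| io.read }  # bytes exactly as stored
  text = File.open(path, "r")  { |io| io.read }  # platform text mode
  [raw, text]
end

# Usage: write a file with Windows line ends, then compare the two reads.
Tempfile.create("mode-demo") do |f|
  f.binmode
  f.write("line one\r\nline two\r\n")
  f.flush

  raw, text = read_both_ways(f.path)
  puts "binary mode: #{raw.bytesize} bytes, encoding #{raw.encoding}"
  puts "text mode:   #{text.bytesize} bytes, encoding #{text.encoding}"
end
```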

    Bertram


    --
    Bertram Scharpf
    Stuttgart, Deutschland/Germany
    http://www.bertram-scharpf.de
    Bertram Scharpf, Aug 21, 2007
    #14
  15. On Aug 21, 12:04 am, "Rebhan, Gilbert" <>
    wrote:
    > Hi ,
    >
    > how to test if a file is binary or not ?
    >
    > There ain't something like File.binary?
    > NoMethodError: undefined method `binary?' for File:Class


    gem install ptools
    require 'ptools'
    File.binary?(file)

    Regards,

    Dan
    Daniel Berger, Aug 21, 2007
    #15
  16. Hi,

    Am Dienstag, 21. Aug 2007, 18:06:26 +0900 schrieb Bertram Scharpf:
    > class File
    > def binary?
    > while (b=f.read(256)) do
    > return true if b[ "\0"]
    > end
    > end
    > end


    That was a blunder, of course. Some better ones:

    def File.binary? name
      # binary mode, so Windows does not translate anything while reading
      open name, "rb" do |f|
        while (b = f.read(256)) do
          return true if b["\0"]
        end
      end
      false
    end

    def File.binary? name
      open name, "rb" do |f|
        f.each_byte { |x|
          x.nonzero? or return true
        }
      end
      false
    end

    Just to be correct.

    Bertram


    --
    Bertram Scharpf
    Stuttgart, Deutschland/Germany
    http://www.bertram-scharpf.de
    Bertram Scharpf, Aug 21, 2007
    #16
  17. * Robert Klemme <> (09:04) schrieb:

    > If I'd really need it I'd probably do a heuristic based on
    > distribution of byte values across an initial portion of the file.


    That only shows how many non-ASCII characters are used. It won't
    recognise Russian script in UTF-8 as text, or uuencoded data as binary.
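
    To illustrate the UTF-8 half of that point: on Ruby 1.9 or later, a
    sample can be checked for being well-formed UTF-8 before any byte
    counting. A minimal sketch; the method name utf8_text? is an assumption
    for illustration, and a sample cut off at a fixed size may end in the
    middle of a multi-byte character and be rejected for that reason alone:

```ruby
# True if the bytes form valid UTF-8 and contain no NUL byte.
def utf8_text?(bytes)
  s = bytes.dup.force_encoding(Encoding::UTF_8)
  s.valid_encoding? && !s.include?("\0")
end

# Usage:
# utf8_text?(File.binread(path))
```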

    What diff (and thus rcs, cvs, svn ...) cares about is lines. Something
    is text if it's logically organized in short lines, and end-of-line
    characters are used only for ending lines.

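    class File
      def self.binary?(name)
        cr, len, mlen = false, 0, 0
        File.open(name, "rb") {|io| io.read(1024)}.to_s.each_byte do |bt|
          # a CR that is not followed by LF does not end a line: not text
          return true if cr and bt != 10
          cr = false
          case bt
          when 13
            cr = true
          when 10
            mlen = len if len > mlen
            len = 0
          else
            len += 1
          end
        end
        mlen = len if len > mlen   # count an unterminated last line too
        mlen > 1000
      end
    end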

    I chose 1000 as the maximum line length so that whole paragraphs fit
    on one line. But of course the limit of the tool doing the processing
    is what matters here. That tool is the right place to do the check
    anyway.

    mfg, simon .... l
    Simon Krahnke, Aug 21, 2007
    #17
  18. Simon Krahnke wrote:
    > * Robert Klemme <> (09:04) schrieb:
    >
    >> If I'd really need it I'd probably do a heuristic based on
    >> distribution of byte values across an initial portion of the file.

    >
    > That only shows how many non-ascii-characters are used. It won't
    > recognise russian script in utf-8 as text, or uuencode as binary.
    >
    > What diff (and thus rcs, cvs, svn ...) cares about is lines. Something
    > is text if it's logically organized in short lines, and eohl cahracters
    > are used only for ending lines.


    [snip]

    > I chose 1000 as the maximum line length, to fit whole paragraphs in one
    > line. But of course the maximum of the proceeding tool is relevant here.
    > There is the right place to do the check anyway.


    That's why clearcase (on windows) claimed my pure-ascii xml-file was
    non-text (and did refuse to check it in). One line exceeded 8000 characters.

    This is on my personal list of 'bad practices', but it may be
    appropriate to others.

    My 0.02EUR

    Stefan
    Stefan Mahlitz, Aug 21, 2007
    #18
  19. * Stefan Mahlitz <> (22:40) schrieb:

    > That's why clearcase (on windows) claimed my pure-ascii xml-file was
    > non-text (and did refuse to check it in). One line exceeded 8000 characters.


    You can't seriously treat a file with lines longer than 8000 characters
    as line oriented. It's far from being readable by a human. You declare
    that file as application/xml.

    One small change in that line will produce a patch of more than 8000
    characters. And if that change is at the end of the line the diff tool
    will have to use 4 pages of memory for the compare.

    > This is on my personal list of 'bad practices', but it may be
    > appropriate to others.


    I think it's bad practice to declare something with huge lines as text.

    mfg, simon .... l
    Simon Krahnke, Aug 22, 2007
    #19
  20. Simon Krahnke wrote:
    > * Stefan Mahlitz <> (22:40) schrieb:
    >
    >> That's why clearcase (on windows) claimed my pure-ascii xml-file was
    >> non-text (and did refuse to check it in). One line exceeded 8000 characters.

    >
    > You can't seriously treat a file with lines longer than 8000 characters
    > as line oriented. It's far from being readable by a human. You declare
    > that file as application/xml.


    Maybe this was a bad example. You are right, the xml-file would be best
    treated by clearcase as application/xml or text/xml. This did not work
    (and I was bitten by this recently - so this strange behaviour was fresh
    when I read your email).

    But I cannot see the problem with text-files containing long lines. If I
    write a single paragraph with more than 1000 or 8000 characters - why
    shouldn't this be text?

    Why do you think it is not readable?

    > One small change in that line will produce a patch of more than 8000
    > characters. And if that change is at the end of the line the diff tool
    > will have to use 4 pages of memory for the compare.


    Sorry, I fail to see your point. Are we really judging whether a file is
    text by how many memory pages a diff will take or how many characters a
    patch has?

    I couldn't find a definition of text except that text means absence of
    binary data. This is weak - so I would follow your definition - A text
    file is a file which can be read by a human.

    >> This is on my personal list of 'bad practices', but it may be
    >> appropriate to others.

    >
    > I think it's bad practice to declare something with huge lines as text.


    Well, I disagree.

    But to get (at least slightly) back on topic: if I had to detect
    whether a file is text, I would go with a combination of Robert Klemme's
    and Bertram Scharpf's solutions.

    Stefan
    Stefan Mahlitz, Aug 22, 2007
    #20
