Determining if a file is binary or text

Discussion in 'Ruby' started by James Masters, Sep 19, 2009.

  1. Hi all,

    I need to search text files for a given expression and flag a warning/
    error if that expression does not exist. I'm going to search a large
    number of files using the Linux "find" command, so I won't know if
    they are binary or text.

    I realize that this can be OS-dependent and can be tricky to
    determine. I was going to use the Linux "file" command which works
    well in providing human-readable information about the file; however,
    due to a variety of possible file types, I cannot easily determine the
    file type without specifying every single possible text file format to
    consider. For example, the "file" command can produce the following
    (all of which are ASCII):

    ASCII text
    XML document text
    Lisp/Scheme program text
    ....

    Is there an easy way to do this in Ruby? After looking around quite a
    bit, I thought about looking at the first few lines of the file and
    matching against this regular expression:

    # Character class:
    # [:print:] Any printable character, including space
    line.match(/^[[:print:]]+$/)

    Which I believe could work. Any comments?

    Thanks,
    -James
    James Masters, Sep 19, 2009
    #1

  2. Seebs Guest

    On 2009-09-18, James Masters <> wrote:
    > I need to search text files for a given expression and flag a warning/
    > error if that expression does not exist. I'm going to search a large
    > number of files using the Linux "find" command, so I won't know if
    > they are binary or text.


    This question is not well defined.

    Think about UTF8 and ISO-8859-1...

    Basically, stop and think what you *mean* by "binary or text". Once you've
    articulated that more clearly, you may well have a much better notion of what
    you mean.

    Would you be expecting to not see this message in a "binary" file? If so,
    why are they different? What about binary files makes them not need the
    message (or what about text files makes them need it...)? If you mean
    "executables", you might approximate decently by checking the execute
    permission bit...

    -s
    --
    Copyright 2009, all wrongs reversed. Peter Seebach /
    http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
    http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
    Seebs, Sep 19, 2009
    #2

  3. On Sep 18, 5:12 pm, Seebs <> wrote:
    > Basically, stop and think what you *mean* by "binary or text".  Once you've
    > articulated that more clearly, you may well have a much better notion of what
    > you mean.


    How about a file that contains any single byte character (0-255) that
    you cannot find a key for on a standard US keyboard (English)? The
    [:print:] regular expression character set comprises the range of
    characters 32-126, which is what I believe that I need, but I wanted
    to see if there are better ways to accomplish this.

    Basically I'm trying to search for the presence of a header in source
    code files (which may have various extensions or no extensions at
    all). The source code files are mixed with executable and non-
    executable "binary" files (data files; not something that you can
    read). I don't want to flag the non-source code files as not having a
    header. The scope of this problem is small so I don't need to worry
    about any character sets, etc.

    I realize that this can be a complicated problem to solve, but there
    are solutions to it. For example, the Linux "file" command is a
    robust solution but does not meet my needs for the previously stated
    reason. I also know that SVN can automatically detect binary files as
    well.

    Hopefully this helps clear things up...

    Thanks,
    -James
    James Masters, Sep 19, 2009
    #3
  4. Seebs Guest

    On 2009-09-19, James Masters <> wrote:
    > How about a file that contains any single byte character (0-255) that
    > you cannot find a key for on a standard US keyboard (English)? The
    > [:print:] regular expression character set comprises the range of
    > characters 32-126, which is what I believe that I need, but I wanted
    > to see if there are better ways to accomplish this.


    Well, you probably also want tabs and newlines. :)

    I would think that [:print:] might also, in some locales, get you things
    like accented letters. Whether or not you want this is harder to say.

    > Basically I'm trying to search for the presence of a header in source
    > code files (which may have various extensions or no extensions at
    > all). The source code files are mixed with executable and non-
    > executable "binary" files (data files; not something that you can
    > read). I don't want to flag the non-source code files as not having a
    > header. The scope of this problem is small so I don't need to worry
    > about any character sets, etc.


    I thought that until I found a dozen Makefiles with copyright symbols
    embedded in them. :p

    I'd say as a first approximation, just check for NUL bytes. I'm pretty
    sure that the vast majority of binary files will contain at least one,
    and the vast majority of text files will contain none.
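
As a rough illustration of the NUL-byte check suggested above, here is a minimal Ruby sketch. The method name contains_nul? and the 8192-byte sample size are assumptions made for this example; nothing in the thread specifies them.

```ruby
# Hypothetical helper (name and default sample size are assumptions):
# report whether a NUL byte appears in the first `sample_size` bytes.
def contains_nul?(path, sample_size = 8192)
  # File.binread returns raw bytes with no encoding conversion.
  # It can return nil for an empty file, hence the to_s.
  sample = File.binread(path, sample_size).to_s
  sample.include?("\0")
end
```

Only a prefix of the file is sampled, so a NUL that first appears later in a large file would be missed; that trade-off is part of the sketch, not a recommendation.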

    -s
    Seebs, Sep 19, 2009
    #4
  5. Seebs Guest

    On 2009-09-19, James Masters <> wrote:
    > Fortunately, I'm working with a small team of individuals who will be
    > authoring the files so I do have some control on the type of text that
    > I'm looking for. So I might try [:print:], \n, \t, and maybe \r (just
    > in case) and then fall back on the NUL idea as a Plan B.


    How many files are you dealing with?

    Hmm. Some source files (scripts, say) will be executable, so you can't
    assume executables are binaries. But... You might want to experiment with
    testing a few likely heuristics and maybe making a chart. Say, make a list
    of:

    TEST: .jpg x-bit NUL 128-255

    FILE:
    foo.jpg X - X X
    foo.sh - X - -
    ....

    and then look to see whether you can make some simple rules, like
    "everything with .jpg or .gif is definitely a binary." If you can
    get a couple of simple rules that deal with 90% or so of the files,
    then you can look at the remainder as a separate case and work from
    there.

    Don't feel compelled to make a single perfect test when three easy tests
    that handle 70% of the cases might give you a remaining pool for which
    it's much easier to write a good test.
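
One way the chart above might be filled in automatically is sketched below in Ruby. The method name heuristic_row, the hash keys, and the 4096-byte sample size are assumptions for illustration; the extension list only covers the two extensions named in the post.

```ruby
# Extensions treated as "definitely binary"; just the two from the post.
BINARY_EXTENSIONS = %w[.jpg .gif].freeze

# Hypothetical helper: compute one row of the chart for a single file.
def heuristic_row(path, sample_size = 4096)
  # Raw bytes of the file prefix; to_s guards against nil on empty files.
  bytes = File.binread(path, sample_size).to_s.bytes
  {
    known_binary_ext: BINARY_EXTENSIONS.include?(File.extname(path).downcase),
    # True if any execute permission bit (user, group, other) is set.
    x_bit: (File.stat(path).mode & 0o111) != 0,
    nul: bytes.include?(0),
    high_bytes: bytes.any? { |b| b >= 128 }
  }
end
```

Collecting these rows for a sample of files and eyeballing them is probably enough to see which columns separate the two groups before writing any rules.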

    -s
    Seebs, Sep 19, 2009
    #5
  6. On Sep 18, 7:54 pm, Seebs <> wrote:
    > Well, you probably also want tabs and newlines.  :)


    Ah, good point... :)

    > I thought that until I found a dozen Makefiles with copyright symbols
    > embedded in them.  :p
    >
    > I'd say as a first approximation, just check for NUL bytes.  I'm pretty
    > sure that the vast majority of binary files will contain at least one,
    > and the vast majority of text files will contain none.


    Yeah, this is another idea that I had also considered... I'm just not
    sure if all of the binary files that I'm dealing with have NUL bytes
    though. But that might just be good enough.

    Fortunately, I'm working with a small team of individuals who will be
    authoring the files so I do have some control on the type of text that
    I'm looking for. So I might try [:print:], \n, \t, and maybe \r (just
    in case) and then fall back on the NUL idea as a Plan B.

    Thanks again,
    -James
    James Masters, Sep 19, 2009
    #6
  7. [Note: parts of this message were removed to make it a legal post.]

    Evening James.

    On Fri, Sep 18, 2009 at 4:15 PM, James Masters <> wrote:

    > ...
    > For example, the "file" command can produce the following
    > (all of which are ASCII):
    >
    > ASCII text
    > XML document text
    > Lisp/Scheme program text


    What about "file -i", which returns the MIME type instead of the "human
    readable" format? That should limit the choices it will return, or at least
    give you something you can work with.
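
A hedged Ruby sketch of how the "file -i" suggestion could be used from a script follows. It assumes the usual Linux output shape of "name: type/subtype; charset=...", and assumes that binary data is reported with charset=binary; both are worth checking against the local "file" version. All method names here are made up for the example.

```ruby
# Matches the tail of a `file -i` line, e.g.
#   "notes.txt: text/plain; charset=us-ascii"
MIME_LINE = %r{:\s*(?<type>[\w.+-]+/[\w.+-]+)(?:;\s*charset=(?<charset>[\w-]+))?\s*\z}

# Hypothetical helper: run `file -i` on a path and return its output line.
def file_mime_line(path)
  IO.popen(["file", "-i", path], &:read)
end

# Hypothetical helper: split one output line into MIME type and charset.
# Returns nil if the line does not look like `file -i` output.
def parse_file_mime(line)
  m = MIME_LINE.match(line)
  return nil unless m
  { type: m[:type], charset: m[:charset] }
end

# Treat a file as text if the type is text/* or the charset is
# anything other than "binary" (an assumption about `file` output).
def text_by_mime?(line)
  info = parse_file_mime(line)
  return false unless info
  info[:type].start_with?("text/") ||
    (!info[:charset].nil? && info[:charset] != "binary")
end
```

Usage would be something like text_by_mime?(file_mime_line(path)). Checking the charset rather than listing MIME types avoids enumerating every text format, which was the original complaint about the human-readable output.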

    John
    John W Higgins, Sep 19, 2009
    #7
  8. On Sep 18, 8:47 pm, John W Higgins <> wrote:
    > What about file -i which returns the MIME type instead of "human readable"
    > format. That should limit the choices it will return or at least give you
    > something you can work with.


    Hi John - that's a good idea - I looked over the "file" command
    options over and over again today and somehow I missed this.
    James Masters, Sep 19, 2009
    #8
  9. On 19.09.2009 01:14, James Masters wrote:
    > Hi all,
    >
    > I need to search text files for a given expression and flag a warning/
    > error if that expression does not exist. I'm going to search a large
    > number of files using the Linux "find" command, so I won't know if
    > they are binary or text.
    >
    > I realize that this can be OS-dependent and can be tricky to
    > determine. I was going to use the Linux "file" command which works
    > well in providing human-readable information about the file; however,
    > due to a variety of possible file types, I cannot easily determine the
    > file type without specifying every single possible text file format to
    > consider. For example, the "file" command can produce the following
    > (all of which are ASCII):
    >
    > ASCII text
    > XML document text
    > Lisp/Scheme program text
    > ...
    >
    > Is there an easy way to do this in Ruby? After looking around quite a
    > bit, I thought about looking at a few first lines of the file and
    > matching against this regular expression:
    >
    > # Character class:
    > # [:print:] Any printable character, including space
    > line.match(/^[[:print:]]+$/)
    >
    > Which I believe could work. Any comments?


    Just using a single "+" seems too unsafe to me: you need only three
    matching bytes which does not seem too unlikely even for binary files.

    Some more random thoughts: if you use Ruby to determine file types, you
    can just as well use Find.find to find all files, removing the dependency
    on an external program.

    A completely different approach would be to define classes of bytes and do
    statistics on the first n bytes from the file, e.g.

    32-127, \r, \n, \t printable
    0-31 without \n, \t, \r, 128-255 non printable

    Then determine based on ratio of occurrences. Of course, that approach
    can also be tricky...
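
The byte-class statistics described above might be sketched in Ruby as follows. The byte classes are taken from the post as written (so 127 counts as printable here); the method names, the 1024-byte sample size, and the 0.85 threshold are assumptions for illustration only.

```ruby
# Hypothetical helper: fraction of "printable" bytes in the file prefix,
# using the classes from the post: 32-127 plus \t (9), \n (10), \r (13).
def printable_ratio(path, sample_size = 1024)
  bytes = File.binread(path, sample_size).to_s.bytes
  return 1.0 if bytes.empty? # treat an empty file as text
  printable = bytes.count do |b|
    (32..127).cover?(b) || b == 9 || b == 10 || b == 13
  end
  printable.to_f / bytes.size
end

# The threshold is an assumed value; tune it against real files.
def probably_text?(path, threshold = 0.85)
  printable_ratio(path) >= threshold
end
```

A ratio tolerates the odd stray byte in an otherwise plain text file, which a strict all-or-nothing match would reject.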

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Sep 19, 2009
    #9
  10. g_f Guest

    By convention, source and object files use standardized file-type
    extensions, which should help you weed out files to ignore.

    As a starting point ask the developers what file-type extensions
    they're using. As a second check, run something like the following
    commands at the top of the path you'll be checking:

    find . | xargs -n1 basename | egrep '\.\w+$' | awk -F. '{print $NF}' | sort -u

    to give you a list of possible extensions, then check those too.
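
For anyone who would rather stay in Ruby, a rough equivalent of the pipeline above is sketched here. The method name extensions_under is an assumption, and unlike the shell version it only looks at regular files.

```ruby
require "find"

# Hypothetical helper: list the distinct file extensions found
# under `root`, sorted, with the leading dot kept (e.g. ".rb").
def extensions_under(root)
  exts = []
  Find.find(root) do |path|
    next unless File.file?(path)
    ext = File.extname(path) # "" for names like "Makefile"
    exts << ext unless ext.empty?
  end
  exts.uniq.sort
end
```

Calling extensions_under(".") at the top of the tree gives the list to review with the developers.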

    Use "file" and "file -i" to do a best-guess once you've narrowed your
    possibilities. Both use "magic" files which define where file should
    look inside a target file to determine what type it is. They are
    fallible though and you can get false positives. Do a "man magic" from
    the command-line on your Linux box for more info.

    Also, be careful assuming only binary files have \x00 bytes or high-order
    (128-255) bytes. Old text files that have migrated from other systems
    could have them. So could files where someone ALT+fat-fingered on the
    keypad, or a source file from a non-English-speaking country where the
    developer used variable names in his native language.
    You just never know what you'll find in those pesky source files.
    g_f, Sep 19, 2009
    #10
  11. Robert Klemme wrote:
    > On 19.09.2009 01:14, James Masters wrote:
    >> Hi all,
    >>
    >> I need to search text files for a given expression and flag a warning/
    >> error if that expression does not exist. I'm going to search a large
    >> number of files using the Linux "find" command, so I won't know if
    >> they are binary or text.
    >>
    >> I realize that this can be OS-dependent and can be tricky to
    >> determine. I was going to use the Linux "file" command which works
    >> well in providing human-readable information about the file; however,
    >> due to a variety of possible file types, I cannot easily determine the
    >> file type without specifying every single possible text file format to
    >> consider. For example, the "file" command can produce the following
    >> (all of which are ASCII):
    >>
    >> ASCII text
    >> XML document text
    >> Lisp/Scheme program text
    >> ...
    >>
    >> Is there an easy way to do this in Ruby? After looking around quite a
    >> bit, I thought about looking at a few first lines of the file and
    >> matching against this regular expression:
    >>
    >> # Character class:
    >> # [:print:] Any printable character, including space
    >> line.match(/^[[:print:]]+$/)
    >>
    >> Which I believe could work. Any comments?

    >
    > Just using a single "+" seems too unsafe to me: you need only three
    > matching bytes which does not seem too unlikely even for binary files.
    >
    > Some more random thoughts: if you use Ruby to determine file types you
    > can as well use Find.find to find all files removing the dependency to
    > an external program.
    >
    > A completely different approach would be to define classes of bytes and do
    > statistics on the first n bytes from the file, e.g.
    >
    > 32-127, \r, \n, \t printable
    > 0-31 without \n, \t, \r, 128-255 non printable
    >


    I have a problem with considering 128-255 to be non-printable. A lot of
    these characters are printable and can be part of text; I use the
    Alt-0xxx keys in Pagemaker a lot, for example. The other problem with
    saying a file is not a text file is determining what is meant by a text
    file. Is it strictly a file with only ASCII text, like a log file, or
    does it include formatted text, like a word processor file? Word
    processing and spreadsheet files contain many characters that are
    considered non-printable but display as text with the correct program.


    > Then determine based on ratio of occurrences. Of course, that approach
    > can also be tricky...
    >
    > Kind regards
    >
    > robert
    >
    Michael W. Ryder, Sep 19, 2009
    #11
  12. Xavier Noria Guest

    FWIW Subversion flags binaries automatically.

    If svn does that, I guess there's gonna be some heuristics that work
    reasonably well in practice.
    Xavier Noria, Sep 19, 2009
    #12
  13. James Masters wrote:
    > Hi all,
    >
    > I need to search text files for a given expression and flag a warning/
    > error if that expression does not exist. I'm going to search a large
    > number of files using the Linux "find" command, so I won't know if
    > they are binary or text.


    require 'ptools'

    File.binary?(your_file)

    Regards,

    Dan
    Daniel Berger, Sep 20, 2009
    #13
  14. On Sep 20, 2:17 am, Robert Klemme <> wrote:
    > I fully agree: the difficult part is in deciding: what is a text file?
    > If that has been clarified enough the algorithm for checking should
    > become much more obvious.


    I agree - this is what it comes down to. BTW, I tried the following
    on my project (using Find.find to get the tree) on the first 40
    "lines" (which I know can theoretically be very short or long in a
    "binary" file) and it seems to work for what I'm doing. This works
    well for me also because I'm checking for the presence of a header and
    I can do this check along with checking for a header while the file is
    still open:

    line.match(/^[[:print:]\t\n\r]+$/)
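
To make the approach above concrete, here is one way the combined check might look in Ruby. This is a sketch under assumptions, not the code actually used: the method name check_file, the return symbols, and the header argument are all hypothetical. The file is opened in binary mode so that bytes above 127 do not match [[:print:]] and cannot raise encoding errors.

```ruby
# Same character set as in the post, anchored to the whole line.
TEXT_LINE = /\A[[:print:]\t\n\r]+\z/

# Hypothetical helper: scan the first `max_lines` lines once, doing
# both the text check and the header search while the file is open.
# Returns :binary, :ok, or :missing_header.
def check_file(path, header, max_lines = 40)
  found = false
  File.open(path, "rb") do |f|
    f.each_line.first(max_lines).each do |line|
      # Any line with a byte outside the allowed set marks the file binary.
      return :binary unless line.match?(TEXT_LINE)
      found ||= line.include?(header)
    end
  end
  found ? :ok : :missing_header
end
```

Only files that come back as :missing_header would then be flagged, so binary data files are skipped rather than reported.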

    But probably a better approach would be to use a ratio of characters
    that are printable against those that may traditionally be
    non-printable, in the event that some "non-printable" characters are
    present in a text file. This is what SVN does (found the link from a post
    from Xavier on a "ptools" website when I Googled it):

    http://subversion.tigris.org/faq.html#binary-files

    And it also appears to be what File.binary? is doing in ptools (I
    checked the source code; thanks Dan for the pointer).
    James Masters, Sep 21, 2009
    #14