Determine file type (binary or text)

Discussion in 'Python' started by Sami Viitanen, Aug 13, 2003.

  1. Hello,

    How can I check if a file is binary or text?

    There was some easy way but I forgot it..


    Thanks in adv.
     
    Sami Viitanen, Aug 13, 2003
    #1

  2. Sami Viitanen

    bromden Guest

    > How can I check if a file is binary or text?

    >>> import os
    >>> f = os.popen('file -bi test.py', 'r')
    >>> f.read().startswith('text')

    1

    (btw, f.read() returns 'text/x-java; charset=us-ascii\n')

    --
    bromden[at]gazeta.pl
     
    bromden, Aug 13, 2003
    #2

  3. Sami Viitanen

    bromden Guest

    > >>> f = os.popen('file -bi test.py', 'r')
    > >>> f.read().startswith('text')


    Sorry, that's not general: "file -i" returns
    "application/x-shellscript" for shell scripts,
    so it's better to do it like this:
    >>> import os
    >>> f = os.popen('file test.py', 'r')
    >>> f.read().find('text') != -1


    --
    bromden[at]gazeta.pl
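The second variant above (search the plain `file` description for the word "text") could be sketched in current Python roughly as follows. This is only an illustration: the function name `file_says_text` is made up here, and it assumes a `file` utility on the PATH, which, as noted later in the thread, a stock Windows machine does not have.

```python
import shutil
import subprocess


def file_says_text(path):
    """Ask the Unix `file` utility whether `path` looks like text.

    Returns True or False, or None when `file` is not installed
    (for example on a stock Windows machine).
    """
    if shutil.which("file") is None:
        return None
    # -b: brief output, i.e. do not prefix the description with the file name.
    result = subprocess.run(
        ["file", "-b", path],
        capture_output=True,
        text=True,
        check=False,
    )
    # Descriptions such as "ASCII text" contain the word "text".
    return "text" in result.stdout
```

Passing the arguments as a list avoids shell quoting problems with unusual file names, which the `os.popen` one-liner would not handle.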
     
    bromden, Aug 13, 2003
    #3
  4. Works well in Unix but I'm making a script that works on both
    Unix and Windows.

    Win doesn't have that 'file -bi' command.

    "bromden" <> wrote in message
    news:bhd559$ku9$...
    > > How can I check if a file is binary or text?

    >
    > >>> import os
    > >>> f = os.popen('file -bi test.py', 'r')
    > >>> f.read().startswith('text')

    > 1
    >
    > (btw, f.read() returns 'text/x-java; charset=us-ascii\n')
    >
    > --
    > bromden[at]gazeta.pl
    >
     
    Sami Viitanen, Aug 13, 2003
    #4
  5. Hi,
    yes there is more than just Unix in the world ;-)
    Windows directories have no way to record the type of their contents.
    The approved method is three-letter extensions, though this rule is
    not strictly followed (lots of files without an extension nowadays!)

    When I had a similar problem I read 1000 characters, counted the number of
    <32 and >255 characters, and classified the file as "binary" when this quota
    exceeded 20%. I have no idea whether it will work well with Chinese Unicode
    files or some funny depositories or project files that store uncompressed texts....

    Kindly
    Michael P

    "Sami Viitanen" <> wrote in message
    news:v7p_a.1558$...
    > Works well in Unix but I'm making a script that works on both
    > Unix and Windows.
    >
    > Win doesn't have that 'file -bi' command.
    >
    > "bromden" <> wrote in message
    > news:bhd559$ku9$...
    > > > How can I check if a file is binary or text?

    > >
    > > >>> import os
    > > >>> f = os.popen('file -bi test.py', 'r')
    > > >>> f.read().startswith('text')

    > > 1
    > >
    > > (btw, f.read() returns 'text/x-java; charset=us-ascii\n')
    > >
    > > --
    > > bromden[at]gazeta.pl
    > >

    >
    >
     
    Michael Peuser, Aug 13, 2003
    #5
  6. Sami Viitanen

    Karl Scalet Guest

    Michael Peuser wrote:
    > Hi,
    > yes there is more than just Unix in the world ;-)
    > Windows directories have no means to specify their contents type in any way.


    That's even more true with linux/unix, as there is no need to do
    any stuff like line-terminator conversion.

    > The approved method is using three-letter extensions, though this rule is
    > not strictly followed (lot of files without extension nowadays!)
    >
    > When I had a similar problem I read 1000 characters, counted the amount of
    > <32 and >255 characters and classified it "binary when this qota exceeded
    > 20%. I have no idea whether it will work good with chinese unicode files or
    > some funny depositories or project files that store uncompressed texts....


    Based on the idea from Mr. "bromden", why not use mimetypes.MimeTypes()
    and guess_type('file://...') and analyze the returned string?
    This should work on Windows / Linux / Unix / whatever.
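A minimal sketch of that idea, using the module-level `mimetypes.guess_type` from the standard library. The helper name `guess_is_text` is hypothetical. Note that `guess_type` only looks at the file name (its extension); it never opens the file, so this is a name-based guess and not a content check.

```python
import mimetypes


def guess_is_text(filename):
    """Guess from the file name alone whether a file is text.

    Returns True for a text/* MIME type, False for any other known type,
    and None when the extension is unknown or missing.
    """
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return None
    return mime_type.startswith("text/")
```

A caller would still need a fallback for the `None` case, since files without an extension cannot be classified this way.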


    Karl


    >
    > Kindly
    > Michael P
    >
    > "Sami Viitanen" <> wrote in message
    > news:v7p_a.1558$...
    >
    >>Works well in Unix but I'm making a script that works on both
    >>Unix and Windows.
    >>
    >>Win doesn't have that 'file -bi' command.
    >>
    >>"bromden" <> wrote in message
    >>news:bhd559$ku9$...
    >>
    >>>>How can I check if a file is binary or text?
    >>>
    >>> >>> import os
    >>> >>> f = os.popen('file -bi test.py', 'r')
    >>> >>> f.read().startswith('text')
    >>>1
    >>>
    >>>(btw, f.read() returns 'text/x-java; charset=us-ascii\n')
    >>>
    >>>--
    >>>bromden[at]gazeta.pl
    >>>

    >>
    >>

    >
    >
     
    Karl Scalet, Aug 13, 2003
    #6
  7. Sami Viitanen

    Peter Hansen Guest

    Sami Viitanen wrote:
    >
    > How can I check if a file is binary or text?
    >
    > There was some easy way but I forgot it..


    First you need to define what you mean by binary and text.
    Is a file "text" simply because it contains only the
    printable (in ASCII) bytes between 31 and 127, plus
    CR and/or LF, or do you have a more complex definition
    in mind?

    Better yet, what do you need the information for? Maybe
    the answer to that will show us the proper path to take.
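The strict ASCII definition mentioned above (printable bytes plus CR and LF) can be written down in a few lines. This is just one literal reading of that definition, with a made-up function name; it takes the data as a bytes object.

```python
def is_ascii_text(data):
    """Return True if every byte is printable ASCII (32-126), CR or LF."""
    allowed = set(range(32, 127)) | {0x0A, 0x0D}
    # Iterating over a bytes object yields integers in Python 3.
    return all(byte in allowed for byte in data)
```

Read this literally and a tab character already makes a file "binary", which shows how much the answer depends on the definition chosen.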
     
    Peter Hansen, Aug 13, 2003
    #7
  8. Sami Viitanen

    Trent Mick Guest

    [Sami Viitanen wrote]
    > Hello,
    >
    > How can I check if a file is binary or text?
    >
    > There was some easy way but I forgot it..


    Generally I define a text file as "it has no null bytes". I think this
    is a pretty safe definition (I would be interested to hear practical
    experience to the contrary). Assuming that, then:

    def is_binary(filename):
        """Return true iff the given filename is binary.

        Raises an EnvironmentError if the file does not exist or cannot be
        accessed.
        """
        fin = open(filename, 'rb')
        try:
            CHUNKSIZE = 1024
            while 1:
                chunk = fin.read(CHUNKSIZE)
                # bytes literal, so the test also works on Python 3
                if b'\0' in chunk: # found null byte
                    return 1
                if len(chunk) < CHUNKSIZE:
                    break # done
        finally:
            fin.close()

        return 0

    Cheers,
    Trent


    --
    Trent Mick
     
    Trent Mick, Aug 13, 2003
    #8
  9. In article <AFm_a.9725$>, Sami Viitanen wrote:

    > How can I check if a file is binary or text?


    In order to provide an answer, you'll have to define "binary"
    and "text".

    > There was some easy way but I forgot it..


    To _me_ a file isn't "binary" or "text". Those are two modes
    you can use to read a file. The file itself is neutral on the
    matter. At least under Windows and Unix. VMS and FILES-11
    contained a _lot_ more meta-data and actually did have several
    different fundamental file types (fixed length records,
    variable length records, byte-stream, etc.).

    --
    Grant Edwards (grante at visi.com)
    Yow! Will it improve my CASH FLOW?
     
    Grant Edwards, Aug 13, 2003
    #9
  10. Sami Viitanen

    Peter Hansen Guest

    Trent Mick wrote:
    >
    > [Sami Viitanen wrote]
    > > Hello,
    > >
    > > How can I check if a file is binary or text?
    > >
    > > There was some easy way but I forgot it..

    >
    > Generally I define a text file as "it has no null bytes". I think this
    > is a pretty safe definition (I would be interested to hear practical
    > experience to the contrary).


    "Contains only printable characters" is probably a more useful definition
    of text in many cases. I can't say off the top of my head exactly when
    either definition might be a problem.... wait, how about this one: in
    CVS, if you don't have a file that is effectively line-oriented, human
    readable information, you probably don't want to let it be treated as
    "text" and stored as diffs. In that situation, "contains primarily
    printable characters organized in lines" is probably a more thorough,
    though less deterministic, definition.

    -Peter
     
    Peter Hansen, Aug 13, 2003
    #10
  11. Trent Mick wrote:

    >[Sami Viitanen wrote]
    >
    >
    >>Hello,
    >>
    >>How can I check if a file is binary or text?
    >>
    >>There was some easy way but I forgot it..
    >>
    >>

    >
    >Generally I define a text file as "it has no null bytes". I think this
    >is a pretty safe definition (I would be interested to hear practical
    >experience to the contrary).
    >


    Dangerous assumption. Even if many or most binary files contain NULs, it
    doesn't mean that they all do.

    It is trivial to create a non-text file that has no NULs.

    f = open('no_zeroes.bin', 'rb')
    for x in range(1, 256):
        f.write(chr(x))
    f.close()

    Sami, I would suggest that you need to stop thinking in terms of tools,
    and instead think in terms of the problem you're trying to solve. Why do
    you need to (or think you need to) determine whether a file is "binary"
    or "text"? Why would your application fail if it received a
    (binary/text) file when it expected a (text/binary) one?

    My guess is that the trait you are trying to identify will prove not to
    be "binary or text", but something more application-specific.

    -- Graham

    P.S. Sami, it's very bad form to "make up" an e-mail address, such as
    <>. I'm sure the owners of the none.net domain would agree.
    Can't you provide a real address?
     
    Graham Fawcett, Aug 13, 2003
    #11
  12. Sami Viitanen

    Peter Hansen Guest

    Grant Edwards wrote:
    >
    > In article <>, Peter Hansen wrote:
    >
    > > "Contains only printable characters" is probably a more useful definition
    > > of text in many cases.

    >
    > The definition of "printable" is dependent on the character
    > set, that will have to be specified.


    That's why I said "printable (in ASCII)" in another message, so I
    definitely agree. The problem was rather under-specified. :)
     
    Peter Hansen, Aug 13, 2003
    #12
  13. Sami Viitanen

    John Machin Guest

    "Michael Peuser" <> wrote in message news:<bhdaks$f92$07$-online.com>...
    >
    > When I had a similar problem I read 1000 characters, counted the amount of
    > <32 and >255 characters and classified it "binary when this qota exceeded


    How many characters > 255 did you get? Did you mean 127? If so, what
    about accented characters ... like umlauts?

    On a slightly more serious note, CR, LF, HT and FF would have to be
    considered "text" but their ordinal values are < 32.

    What was the problem that you thought you were solving?
     
    John Machin, Aug 13, 2003
    #13
  14. Sami Viitanen

    John Machin Guest

    Trent Mick <> wrote in message news:<>...

    > Generally I define a text file as "it has no null bytes". I think this
    > is a pretty safe definition (I would be interested to hear practical
    > experience to the contrary).


    Data file written by C program which has an off-by-one error and is
    including a trailing '\0' byte ...
     
    John Machin, Aug 13, 2003
    #14
  15. Sami Viitanen

    John Machin Guest

    Graham Fawcett <> wrote in message news:<>...
    >
    > It is trivial to create a non-text file that has no NULs.
    >
    > f = open('no_zeroes.bin', 'rb')
    > for x in range(1, 256):
    > f.write(chr(x))
    > f.close()


    I tried this but it didn't work. It said:

    IOError: [Errno 2] No such file or directory: 'no_zeroes.bin'.

    So I thought I had to be persistent but after doing it a few more times it said:

    SerialIdiotError: What I tell you three times is true.
    NotLispingError: You need 'wb' as in 'wascally wabbit'

    This is very strange behaviour -- does my computer have worms?
     
    John Machin, Aug 14, 2003
    #15
  16. John Machin wrote:

    >Graham Fawcett <> wrote in message news:<>...
    >
    >
    >>It is trivial to create a non-text file that has no NULs.
    >>
    >> f = open('no_zeroes.bin', 'rb')
    >> for x in range(1, 256):
    >> f.write(chr(x))
    >> f.close()
    >>
    >>

    >
    >I tried this but it didn't work. It said:
    >
    >IOError: [Errno 2] No such file or directory: 'no_zeroes.bin'.
    >
    >So I thought I had to be persistent but after doing it a few more times it said:
    >
    >SerialIdiotError: What I tell you three times is true.
    >NotLispingError: You need 'wb' as in 'wascally wabbit'
    >
    >This is very strange behaviour -- does my computer have worms?
    >
    >


    No, but my brain does. Glad you caught my typo.

    However, it looks like your computer definitely has an AttitudeError!
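With the mode corrected to 'wb', the counterexample can be sketched in Python 3 as below. The helper `has_null_byte` is a hypothetical stand-in for a null-byte test, and the file is written to a temporary directory so the example does not clutter the working directory.

```python
import os
import tempfile


def has_null_byte(path, chunk_size=1024):
    """Return True if the file at `path` contains at least one NUL byte."""
    with open(path, "rb") as fin:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                return False
            if b"\0" in chunk:
                return True


# Write every byte value except 0, using 'wb' rather than 'rb'.
path = os.path.join(tempfile.mkdtemp(), "no_zeroes.bin")
with open(path, "wb") as f:
    f.write(bytes(range(1, 256)))

print(has_null_byte(path))  # False, although the content is clearly not text
```

So a null-byte test reports this file as "text", which is the point of the objection.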

    -- Graham
     
    Graham Fawcett, Aug 14, 2003
    #16
  17. Sami Viitanen

    Peter Hansen Guest

    John Machin wrote:
    >
    > Trent Mick <> wrote in message news:<>...
    >
    > > Generally I define a text file as "it has no null bytes". I think this
    > > is a pretty safe definition (I would be interested to hear practical
    > > experience to the contrary).

    >
    > Data file written by C program which has an off-by-one error and is
    > including a trailing '\0' byte ...


    To be fair, I'd call that a "binary" file in any case, or at least
    a defective text file...
     
    Peter Hansen, Aug 14, 2003
    #17
  18. Peter Hansen <> wrote in message news:<>...

    > "Contains only printable characters" is probably a more useful definition
    > of text in many cases. I can't say off the top of my head exactly when
    > either definition might be a problem.... wait, how about this one: in
    > CVS, if you don't have a file that is effectively line-oriented, human
    > readable information, you probably don't want to let it be treated as
    > "text" and stored as diffs. In that situation, "contains primarily
    > printable characters organized in lines" is probably a more thorough,
    > though less deterministic, definition.


    We check for binary files in our CVS commitprep script like this:

    look for -kb arg
    open the file in binary mode, read 4k from the file and...

    for i in range(len(buff)):
        a = ord(buff[i])
        if (a < 8) or (a > 13 and a < 32) or (a > 126):
            non_text = non_text + 1

    If 10 percent of the characters are found to be non-text, we reject
    the file if it was not committed with the -kb flag, or print a warning
    if the file appears to be text but is being checked in as a binary.

    We don't bother checking for charsets other than ascii, because
    localized files have to be checked in as binaries or bad things
    (tm) happen.
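As a self-contained illustration, the check described above might look like this in Python 3. The byte ranges, the 4k sample and the 10 percent threshold come from the post. The function name and the handling of empty files are assumptions, and this is not the actual commitprep script.

```python
def looks_like_text(path, sample_size=4096, threshold=0.10):
    """Return True if under `threshold` of the sampled bytes are non-text."""
    with open(path, "rb") as f:
        buff = f.read(sample_size)
    if not buff:
        return True  # assumption: treat an empty file as text
    non_text = 0
    for a in buff:  # iterating over bytes yields integers in Python 3
        if a < 8 or 13 < a < 32 or a > 126:
            non_text += 1
    return non_text / len(buff) < threshold
```

Because everything above 126 counts as non-text, files in most non-ASCII encodings would be classified as binary, which matches the policy described for localized files.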
     
    Brian Lenihan, Aug 14, 2003
    #18
  19. Thanks for the answers.

    To be more specific I'm making a script that should
    identify binary files as binary and text files as text.

    The script is for automating CVS commands, and
    with CVS you have to pass the -kb flag when you
    add (or import) binary files, because CVS can't
    determine the file type itself. If a binary file is not
    added with -kb, the results are awful.

    Script example usage:
    -import.py <directory_name>

    The script makes a list of all files under that directory
    and then determines each file's type. After that
    all files are added with the add command, and binary
    files get the additional -kb automatically.
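A rough sketch of that workflow follows. It only builds the `cvs add` argument lists and does not run them. All names are hypothetical, and the null-byte classifier is just a placeholder for whichever definition of "binary" ends up being used. Adding the directories themselves, which CVS also requires, is not handled here.

```python
import os


def contains_nul(path, sample_size=4096):
    """Placeholder classifier: binary if the first bytes contain a NUL."""
    with open(path, "rb") as f:
        return b"\0" in f.read(sample_size)


def cvs_add_commands(directory, is_binary=contains_nul):
    """Return one `cvs add` argument list per file under `directory`."""
    commands = []
    for root, dirs, files in os.walk(directory):
        # Skip CVS's own administrative directories.
        dirs[:] = sorted(d for d in dirs if d != "CVS")
        for name in sorted(files):
            path = os.path.join(root, name)
            command = ["cvs", "add"]
            if is_binary(path):
                command.append("-kb")  # mark the file as binary
            command.append(path)
            commands.append(command)
    return commands
```

Each returned list could then be handed to `subprocess.run` once the classification has been reviewed.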


    "Sami Viitanen" <> wrote in message
    news:AFm_a.9725$...
    > Hello,
    >
    > How can I check if a file is binary or text?
    >
    > There was some easy way but I forgot it..
    >
    >
    > Thanks in adv.
    >
    >
     
    Sami Viitanen, Aug 14, 2003
    #19
  20. In article <gwH_a.1649$>, Sami Viitanen wrote:

    > To be more specific I'm making a script that should
    > identify binary files as binary and text files as text.


    That's "more specific"? ;)

    --
    Grant Edwards (grante at visi.com)
    Yow! I hope I bought the right relish... zzzzzzzzz...
     
    Grant Edwards, Aug 14, 2003
    #20