is_ascii() or is_binary() for files?

Discussion in 'C++' started by Brad, Jul 5, 2008.

  1. Brad

    Brad Guest

    Is there a way to determine whether a file is plain ascii text or not
    using standard C++?
    Brad, Jul 5, 2008
    #1

  2. Brad

    osmium Guest

    "Brad" wrote:

    > Is there a way to determine whether a file is plain ascii text or not
    > using standard C++?


    No. It's in the eye of the beholder. You can make a very good guess by
    counting control characters that wouldn't likely be in text. But the
    possibility exists that a binary file might not have any of them either.
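
The counting idea can be sketched roughly as follows. This is only an illustration: the function names, the choice of which control characters count as suspicious, and the 1% threshold are all assumptions made here, not anything the standard library provides.

```cpp
#include <cstddef>
#include <string>

// Fraction of bytes in `data` that are control characters unlikely to
// appear in text: anything below 0x20 except 0x09-0x0D (tab, LF, VT, FF,
// CR), plus DEL (0x7F).
double suspicious_control_ratio(const std::string& data)
{
    if (data.empty()) return 0.0;
    std::size_t suspicious = 0;
    for (unsigned char c : data) {
        bool is_whitespace_control = (c >= 0x09 && c <= 0x0D);
        if ((c < 0x20 && !is_whitespace_control) || c == 0x7F)
            ++suspicious;
    }
    return static_cast<double>(suspicious) / data.size();
}

// Hypothetical decision rule: call the data "probably text" when the
// suspicious fraction is below `threshold`. The default value is a guess.
bool probably_text(const std::string& data, double threshold = 0.01)
{
    return suspicious_control_ratio(data) < threshold;
}
```

As the post says, this is a guess: a binary file that happens to contain no such control characters will still be reported as text.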
    osmium, Jul 5, 2008
    #2

  3. Brad

    Medvedev Guest

    On Jul 5, 11:22 am, Sherman Pendley <> wrote:
    > Brad <> writes:
    > > Is there a way to determine whether a file is plain ascii text or not
    > > using standard C++?

    >
    > Sure, just read its contents and look for any byte that's > 127. If
    > you find one, the file's contents are not plain ASCII.


    If he tests a text file that contains non-English text, the check
    will fail, because non-English characters are > 127.
    Medvedev, Jul 5, 2008
    #3
  4. Brad

    Stefan Ram Guest

    Brad <> writes:
    >Is there a way to determine whether a file is plain ascii text
    >or not using standard C++?


    If someone can define in words when a file is deemed to be
    »plain ascii text« without ambiguity and for each possible
    file, I am sure that then this newsgroup will be able to
    help to implement a test for it in C++.

    The general problem is that the code of a piece of information
    can not be stored with this piece of information itself, so
    usually one only can guess the code - not retrieve it if all
    one has is this piece of information.

    For example, the following line might be deemed to be a text line:

    hD1X-s0P_kUHP0UxGWX4ax1y1ieimnfeinklddmemkjanmndnadmndnpbbnhhpbb

    But it is also an executable MS-DOS .com-File (aka »binary«)
    to solve sudokus written by Herbert Kleebauer, see

    http://groups.google.com/group/de.sci.mathematik/msg/db8088fafbdf5131?output=gplain&dmode=source

    (I have abbreviated the program a little bit here.)

    So is this line a »plain ascii text« or am I guilty of posting
    binary data to a non-binary-newsgroup now?
    Stefan Ram, Jul 5, 2008
    #4
  5. Brad

    red floyd Guest

    Medvedev wrote:
    > On Jul 5, 11:22 am, Sherman Pendley <> wrote:
    >> Brad <> writes:
    >>> Is there a way to determine whether a file is plain ascii text or not
    >>> using standard C++?

    >> Sure, just read its contents and look for any byte that's > 127. If
    >> you find one, the file's contents are not plain ASCII.

    >
    > if he try to test in a text file which contain non-English text , he
    > will fail!!
    > because non-English char are > 127


    OP specified ASCII, not non-English text.
    red floyd, Jul 5, 2008
    #5
  6. Brad

    Medvedev Guest

    On Jul 5, 11:45 am, Medvedev <> wrote:
    > On Jul 5, 11:22 am, Sherman Pendley <> wrote:
    >
    > > Brad <> writes:
    > > > Is there a way to determine whether a file is plain ascii text or not
    > > > using standard C++?

    >
    > > Sure, just read its contents and look for any byte that's > 127. If
    > > you find one, the file's contents are not plain ASCII.

    >
    > if he try to test in a text file which contain non-English text , he
    > will fail!!
    > because non-English char are > 127


    Sorry, you are right.
    I found that non-English characters are represented by negative values,
    and a binary file is one whose bytes MAY BE > 127,
    since a byte can hold any of 256 bit patterns.

    source:
    http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/asciiBin.html
    Medvedev, Jul 5, 2008
    #6
  7. Brad

    James Kanze Guest

    On Jul 5, 9:45 pm, Medvedev <> wrote:
    > On Jul 5, 11:22 am, Sherman Pendley <> wrote:


    > > Brad <> writes:
    > > > Is there a way to determine whether a file is plain ascii text or not
    > > > using standard C++?


    > > Sure, just read its contents and look for any byte that's > 127. If
    > > you find one, the file's contents are not plain ASCII.


    > if he try to test in a text file which contain non-English
    > text , he will fail!! because non-English char are > 127


    ASCII is a seven bit code, so no characters are greater than
    127 in it.

    Of course, just because you don't find any characters greater
    than 127 doesn't mean that it is ASCII. It could still be ISO
    8859-1, or UTF-8, in which, by chance, none of the characters
    happen to be greater than 127. (Or it could be that plain char
    is signed on your machine, in which case, it can't contain a
    value greater than 127, regardless of the encoding :).)
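
A minimal sketch of the "> 127" check that sidesteps the signed-char pitfall mentioned above. The function name is made up for this example. Each byte is converted to unsigned char before the comparison, so the result is the same whether plain char is signed or unsigned.

```cpp
#include <algorithm>
#include <string>

// True when every byte of `data` has a value in 0..127.
// Without the cast, a byte such as 0xEF would compare as a negative
// number on platforms where plain char is signed, and the test
// "c > 127" could never be true.
bool all_bytes_seven_bit(const std::string& data)
{
    return std::all_of(data.begin(), data.end(), [](char c) {
        return static_cast<unsigned char>(c) <= 127;
    });
}
```

A true result only says that the data can be read as ASCII. It does not say the file was written as ASCII rather than as ISO 8859-1 or UTF-8 text that happens to use only the first 128 codes.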

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Jul 5, 2008
    #7
  8. Brad

    Brad Guest

    Stefan Ram wrote:
    > Brad <> writes:
    >> Is there a way to determine whether a file is plain ascii text
    >> or not using standard C++?

    >
    > If someone can define in words when a file is deemed to be a
    > »a plain ascii text« without ambiguity and for each possible
    > file, I am sure that then this newsgroup will be able to
    > help to implement a test for it in C++.
    > ...


    Thanks for all the responses. The program recurses through a directory
    processing files. I do not know beforehand what type of files the
    program may encounter. The processing is simply reading the file and
    passing its content to a regular expression to search for certain strings.

    Binary files cause problems, so I thought if I could just skip them and
    only read ASCII and perhaps UTF-8 encoded files, things would be better.
    That led to my initial question. Later I could learn how to deal with
    binary files that I may want to search like PDF and MS Office documents.
    Just curious if standard C++ had some built-in function that made this easy.

    Thanks again,

    Brad
    Brad, Jul 6, 2008
    #8
  9. On 2008-07-06 02:48, Brad wrote:
    > Stefan Ram wrote:
    >> Brad <> writes:
    >>> Is there a way to determine whether a file is plain ascii text
    >>> or not using standard C++?

    >>
    >> If someone can define in words when a file is deemed to be a
    >> »a plain ascii text« without ambiguity and for each possible
    >> file, I am sure that then this newsgroup will be able to
    >> help to implement a test for it in C++.
    > > ...

    >
    > Thanks for all the responses. The program recurses through a directory
    > processing files. I do not know beforehand what type of files the
    > program may encounter. The processing is simply reading the file and
    > passing its content to a regular expression to search for certain strings.
    >
    > Binary files cause problems, so I thought if I could just skip them and
    > only read ASCII and perhaps UTF-8 encoded files, things would be better.
    > That lead to my initial question. Later I could learn how to deal with
    > binary files that I may want to search like PDF and MS Office documents.
    > Just curious if standard C++ had some built-in function that made this easy.


    The simplest way to solve your problem is probably to impose some
    additional constraints, such as requiring that text files have a name
    ending with ".txt", or guaranteeing correct operation only if there are
    no non-ASCII files in the directory.

    If you are running on a POSIX system you can also use the 'file' program
    which tries to figure out what kind of contents a file has.
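
The file-name constraint could look something like the sketch below. It assumes a C++17 compiler, since std::filesystem did not exist in standard C++ when this thread was written. The function names are invented for the example, and the extension comparison is case-sensitive.

```cpp
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// True when the last extension of `p` equals `ext` (for example ".txt").
bool has_extension(const fs::path& p, const std::string& ext)
{
    return p.extension().string() == ext;
}

// Collects regular files under `root`, recursively, whose extension
// equals `ext`. Directories are skipped even if their name matches.
// The iterator may throw std::filesystem::filesystem_error, for
// instance on a directory that cannot be read.
std::vector<fs::path> files_with_extension(const fs::path& root,
                                           const std::string& ext)
{
    std::vector<fs::path> result;
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (entry.is_regular_file() && has_extension(entry.path(), ext))
            result.push_back(entry.path());
    }
    return result;
}
```

This only trusts the naming convention. Nothing stops a binary file from being called something.txt.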

    --
    Erik Wikström
    Erik Wikström, Jul 6, 2008
    #9
  10. Brad

    James Kanze Guest

    On Jul 6, 3:52 am, Sam <> wrote:
    > Brad writes:
    > > That lead to my initial question. Later I could learn how to
    > > deal with binary files that I may want to search like PDF
    > > and MS Office documents. Just curious if standard C++ had
    > > some built-in function that made this easy.


    > No. The only 'built-in' function of any kind is one to test if
    > a single character belongs in a given character class:
    > isascii() and its equivalents. It's up to you to scan the
    > entire contents of the file, to classify it.


    There is no isascii function, and the other isxxx functions are
    locale dependent (and don't really work for narrow characters
    anyway). There are heuristics for "guessing" the type of
    contents of a file, but they're just that, heuristics, and none
    are 100% certain.

    Most systems have various conventions which may reveal the type,
    but those are also just conventions, and individual files may
    actually violate them: you can give a text file a name ending
    with .exe under Windows, and there's nothing to prevent a binary
    file from starting with something that looks like
    "<!DOCTYPE..." on any system.

    > In POSIX, you might be able to get away with opening a file,
    > stat()ing its contents, to get the file's size, mmap-ing the
    > file into memory, then using std::find_if() to search for
    > non-ascii bytes. Of course, if you hit a 4gb file, that might
    > cause ...problems.


    Under most Unix systems, you'd probably read the first N bytes
    (maybe 512, although that's a lot more than would typically be
    necessary), and then exploit magic numbers. For that matter,
    *generally*, reading the first 512 bytes, then looking for
    characters outside the set 0x07-0x0D and 0x20-0x7E, is probably
    a pretty good heuristic; the probability of your guessing wrong
    is pretty slim (but of course, it will treat non-ascii text
    files as binary).

    James Kanze, Jul 6, 2008
    #10
  11. Brad

    James Kanze Guest

    On Jul 6, 11:18 am, Erik Wikström <> wrote:
    > On 2008-07-06 02:48, Brad wrote:
    > If you are running on a POSIX system you can also use the
    > 'file' program which tries to figure out what kind of contents
    > a file has.


    Note that the information output by file is not guaranteed to be
    correct (except in specific cases: the file doesn't exist, isn't
    a regular file, or is empty). (On the other hand, it also works
    under Windows, if you've installed it correctly.)

    James Kanze, Jul 6, 2008
    #11
  12. Sherman Pendley wrote:
    > Sure, just read its contents and look for any byte that's > 127. If
    > you find one, the file's contents are not plain ASCII.


    Actually there are certain characters with values < 32 which can be a
    sign of a non-ASCII file if present, 0 being the most prominent one.
    Juha Nieminen, Jul 6, 2008
    #12
  13. Brad

    James Kanze Guest

    On Jul 6, 4:58 pm, Juha Nieminen <> wrote:
    > Sherman Pendley wrote:
    > > Sure, just read its contents and look for any byte that's >
    > > 127. If you find one, the file's contents are not plain
    > > ASCII.


    > Actually there are certain characters with values < 32 which
    > can be a sign of non-ascii file if present, 0 being the most
    > prominent one.


    Technically, 0 is the encoding of the character nul in ASCII.
    ASCII defines "characters" for all encodings in the range 0-127.

    Practically, I don't think he really means ASCII per se, but
    rather text encoded using ASCII. Or rather files that can be
    interpreted as such---it's been years since I've seen a file
    encoded as "ASCII" (but a lot of files created as ISO 8859-1 or
    UTF-8 can probably be read as ASCII, if the file only contains
    characters from the basic character set).

    James Kanze, Jul 6, 2008
    #13
  14. James Kanze wrote:
    > (but a lot of files created as ISO 8859-1 or
    > UTF-8 can probably be read as ASCII, if the file only contains
    > characters from the basic character set).


    UTF-8 has been specifically designed so that if the highest bit of any
    byte is set, you know you can't interpret that character as a simple
    ASCII one, so in this case the check is rather easy.
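
To make the high-bit rule concrete, here is a rough sketch that classifies bytes by their leading bits and checks that multi-byte sequences are structurally complete. The function names are invented here. It is deliberately incomplete: overlong encodings, surrogates and code points above U+10FFFF are not rejected.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence introduced by lead byte `c`,
// or 0 when `c` cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char c)
{
    if (c < 0x80) return 1;             // 0xxxxxxx: identical to ASCII
    if ((c & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((c & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;                           // 10xxxxxx or 11111xxx
}

// Structural check only: every lead byte must be followed by the right
// number of 10xxxxxx continuation bytes, and the data must not end in
// the middle of a sequence.
bool looks_like_utf8(const std::string& data)
{
    std::size_t i = 0;
    while (i < data.size()) {
        std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(data[i]));
        if (len == 0 || len > data.size() - i) return false;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char next = static_cast<unsigned char>(data[i + k]);
            if ((next & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}
```

Pure ASCII data passes this check trivially, while the single byte 0xEF that ISO 8859-1 uses for "ï" fails it unless valid continuation bytes happen to follow.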
    Juha Nieminen, Jul 7, 2008
    #14
  15. Brad

    James Kanze Guest

    On Jul 7, 3:04 pm, Juha Nieminen <> wrote:
    > James Kanze wrote:
    > > (but a lot of files created as ISO 8859-1 or
    > > UTF-8 can probably be read as ASCII, if the file only contains
    > > characters from the basic character set).


    > UTF-8 has been specifically designed so that if the highest
    > bit of any byte is set, you know you can't interpret that
    > character as a simple ASCII one, so in this case the check is
    > rather easy.


    The same is true of the ISO 8859 encodings. I don't know of any
    machines still using ASCII, but most do use either one of the
    ISO 8859 encodings, or UTF-8. And most of those that don't also
    follow this rule. So as long as all of the characters in the
    file are in the basic execution character set, as defined by the
    standard, you can read it as if it were ASCII. There are a few
    additional characters which don't cause problems either: $, or @
    for example.

    The problem with doing so, of course, is that whatever tool
    generated the file might have inserted the word "naïve" (or
    anything else with a special character: a true less than or
    equals sign, or the section sign §, or the name of someone)
    somewhere near the end, so even reading the first 512 bytes
    won't reveal it.
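
One way to see the problem is to scan the whole stream and report where the first byte above 127 occurs. The sketch below does that in fixed-size chunks, so a large file need not be held in memory. The function name, the 4096-byte chunk size and the -1 "not found" convention are choices made for this example.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Returns the zero-based offset of the first byte with a value above
// 127, or -1 when the stream contains none (or cannot be read).
long long first_non_ascii_offset(std::istream& in)
{
    char buffer[4096];
    long long offset = 0;
    while (in) {
        in.read(buffer, sizeof buffer);
        std::streamsize got = in.gcount();   // may be short at end of file
        for (std::streamsize i = 0; i < got; ++i) {
            if (static_cast<unsigned char>(buffer[i]) > 127)
                return offset + i;
        }
        offset += got;
    }
    return -1;
}

// In-memory overload, convenient for experiments.
long long first_non_ascii_offset(const std::string& data)
{
    std::istringstream in(data);
    return first_non_ascii_offset(in);
}
```

For data made of 5000 ASCII letters followed by "naïve" in UTF-8, the reported offset is 5002, well beyond anything a 512-byte prefix check would examine. For files, the stream should be opened with std::ios::binary.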

    James Kanze, Jul 7, 2008
    #15
