Brad
Is there a way to determine whether a file is plain ascii text or not
using standard C++?
Brad said:Is there a way to determine whether a file is plain ascii text or not
using standard C++?
Sure, just read its contents and look for any byte that's > 127. If
you find one, the file's contents are not plain ASCII.
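A minimal sketch of the scan described above, assuming the file is opened in binary mode and read in fixed-size chunks. The names is_ascii_char, bytes_are_ascii and file_is_plain_ascii are made up for illustration, and treating an unreadable file as "not ASCII" is just one possible choice.

```cpp
#include <algorithm>
#include <fstream>
#include <string>

// True if the byte is in the 7-bit ASCII range (0..127).
// The cast matters: plain char may be signed, so bytes > 127 can look negative.
bool is_ascii_char(char c) {
    return static_cast<unsigned char>(c) <= 127;
}

// Checks an in-memory buffer.
bool bytes_are_ascii(const std::string& bytes) {
    return std::all_of(bytes.begin(), bytes.end(), is_ascii_char);
}

// Checks a file chunk by chunk, stopping at the first byte > 127.
// Assumption: a file that cannot be opened is reported as "not ASCII".
bool file_is_plain_ascii(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    char buf[4096];
    while (in) {
        in.read(buf, sizeof buf);
        const std::streamsize n = in.gcount();  // bytes actually read
        if (!std::all_of(buf, buf + n, is_ascii_char)) return false;
    }
    return true;
}
```

Note that an empty file passes this test, since it contains no byte > 127.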
Brad said:Is there a way to determine whether a file is plain ascii text
or not using standard C++?
Medvedev said:If he tries that test on a text file which contains non-English
text, it will fail,
because non-English characters are > 127.
Stefan said:If someone can define in words when a file is deemed to be a
»a plain ascii text« without ambiguity and for each possible
file, I am sure that then this newsgroup will be able to
help to implement a test for it in C++.
> ...
Thanks for all the responses. The program recurses through a directory
processing files. I do not know beforehand what type of files the
program may encounter. The processing is simply reading the file and
passing its content to a regular expression to search for certain strings.
Binary files cause problems, so I thought if I could just skip them and
only read ASCII and perhaps UTF-8 encoded files, things would be better.
That led to my initial question. Later I could learn how to deal with
binary files that I may want to search like PDF and MS Office documents.
Just curious if standard C++ had some built-in function that made this easy.
No. The closest thing to a 'built-in' function is one that tests
whether a single character belongs to a given character class:
the <cctype> functions such as std::isprint(), or isascii() where
the platform provides it (it is not part of standard C++). It's up
to you to scan the entire contents of the file to classify it.
In POSIX, you might be able to get away with opening a file,
stat()ing it to get the file's size, mmap-ing the
file into memory, then using std::find_if() to search for
non-ASCII bytes. Of course, if you hit a 4 GB file, that might
cause ...problems, at least in a 32-bit address space.
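A rough sketch of that POSIX approach, for what it's worth. The function name mapped_file_is_ascii and its 1 / 0 / -1 return convention are made up for illustration, and it uses fstat() on the open descriptor rather than stat() on the path. It makes no attempt to handle files too large to map.

```cpp
#include <algorithm>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Returns 1 if every byte is <= 127, 0 if some byte is > 127, -1 on error.
int mapped_file_is_ascii(const char* path) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }

    // mmap() rejects a zero-length mapping; an empty file has no non-ASCII bytes.
    if (st.st_size == 0) { close(fd); return 1; }

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* map = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) return -1;

    const unsigned char* first = static_cast<const unsigned char*>(map);
    const unsigned char* last = first + size;
    const bool ascii =
        std::find_if(first, last, [](unsigned char b) { return b > 127; }) == last;

    munmap(map, size);
    return ascii ? 1 : 0;
}
```

Anything that cannot be opened or mapped (a directory, for instance) comes back as -1, so a recursive directory walk can simply skip those entries.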
On 2008-07-06 02:48, Brad wrote:
If you are running on a POSIX system you can also use the
'file' program which tries to figure out what kind of contents
a file has.
Sherman said:Sure, just read its contents and look for any byte that's > 127. If
you find one, the file's contents are not plain ASCII.
Actually, there are also certain characters with values < 32 which,
if present, can be a sign that a file is not plain text even though
they are technically ASCII, 0 being the most prominent one.
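One way to fold that into the test is a stricter heuristic that accepts only printable ASCII plus a few whitespace control characters. This is only a sketch: the name looks_like_ascii_text is made up, and the set of allowed control characters (tab, LF, CR) is an assumption that may need widening, e.g. for form feeds.

```cpp
#include <string>

// Heuristic: "text" means printable ASCII (32..126) plus tab, LF and CR.
// Rejects NUL and other control bytes, DEL (127), and anything > 127.
bool looks_like_ascii_text(const std::string& bytes) {
    for (char c : bytes) {
        const unsigned char b = static_cast<unsigned char>(c);
        const bool printable = b >= 32 && b <= 126;
        const bool whitespace = b == '\t' || b == '\n' || b == '\r';
        if (!printable && !whitespace) return false;
    }
    return true;
}
```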
James said:(but a lot of files created as ISO 8859-1 or
UTF-8 can probably be read as ASCII, if the file only contains
characters from the basic character set).
UTF-8 has been specifically designed so that if the highest
bit of any byte is set, you know you can't interpret that
character as a simple ASCII one, so in this case the check is
rather easy.
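Since the original poster also wanted to accept UTF-8 files, here is a sketch of a well-formedness check built on the same property: bytes below 0x80 are plain ASCII, and any byte with the high bit set must be part of a correctly formed multi-byte sequence. The name is_valid_utf8 is made up for illustration. Keep in mind that a file passing this check is only consistent with UTF-8, which is not proof that it was written as UTF-8.

```cpp
#include <cstddef>
#include <string>

// True if the buffer is well-formed UTF-8 (pure ASCII input also passes).
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; continue; }  // high bit clear: plain ASCII

        std::size_t len;
        unsigned long cp;  // decoded code point
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence truncated at end of buffer

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // must look like 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong encodings, surrogates, and values past U+10FFFF.
        if (len == 2 && cp < 0x80) return false;
        if (len == 3 && cp < 0x800) return false;
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;

        i += len;
    }
    return true;
}
```

A lone ISO 8859-1 byte such as 0xE9 fails here, because it looks like the start of a three-byte sequence with nothing valid after it.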