Binary or text file


list

Hi folks,

I am new to Google Groups; until now I have asked my questions on
other forums.

I have an important question: I have to check whether files are
binary (.bmp, .avi, .jpg) or text (.txt, .cpp, .h, .php, .html). How
can I check a file and find out whether it is binary or text?

Thanks for your help.
 

osmium

> I am new to Google Groups; until now I have asked my questions on
> other forums.
>
> I have an important question: I have to check whether files are
> binary (.bmp, .avi, .jpg) or text (.txt, .cpp, .h, .php, .html). How
> can I check a file and find out whether it is binary or text?

You can't, with certainty. You can only determine with high
probability what the file is. Assuming ASCII, only a very few of the
control characters ever appear in a text file. It's like the problem
of the white crow: no matter how many black crows you have seen, the
next one might be white. But if you have the file extension (as
above), you can look it up at Wotsit and get an answer.

http://www.wotsit.org/
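
A minimal sketch of that heuristic in C++ might look like this (the
512-byte sample size and the 10% threshold are arbitrary choices for
illustration, not anything standard):

#include <fstream>
#include <string>

// Heuristic only: sample the start of the file and count control
// characters that almost never occur in text (anything below 0x20
// other than tab, LF, CR and form feed).
bool looks_like_text(const std::string& filename)
{
    std::ifstream in(filename.c_str(), std::ios::binary);
    char buf[512];
    in.read(buf, sizeof buf);
    std::streamsize n = in.gcount();
    if (n == 0)
        return true;            // empty (or unreadable): call it text
    std::streamsize suspicious = 0;
    for (std::streamsize i = 0; i < n; ++i) {
        unsigned char c = buf[i];
        if (c == 0)
            return false;       // a NUL byte is a very strong binary hint
        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r' && c != '\f')
            ++suspicious;
    }
    return suspicious * 10 < n; // under ~10% odd control characters
}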
 

Keith Halligan

> You can't, with certainty. You can only determine with high
> probability what the file is. Assuming ASCII, only a very few of the
> control characters ever appear in a text file.

That's pretty much the way to do it. The Unix `file' command does it
pretty much like this: it generally takes the first 512 bytes of the
file and determines the type of file from that. Binary files tend to
have a lot of padding with zeroed-out bytes, while in an ASCII text
file nearly every byte has a value above 31 (plus a little whitespace
such as tab, newline and carriage return).
 

Diego Martins

> That's pretty much the way to do it. The Unix `file' command does it
> pretty much like this: it generally takes the first 512 bytes of the
> file and determines the type of file from that. Binary files tend to
> have a lot of padding with zeroed-out bytes, while in an ASCII text
> file nearly every byte has a value above 31 (plus a little
> whitespace such as tab, newline and carriage return).

There is a `file' command utility in Unix that does the job.
Borrow the source code from it :)
 

James Kanze

> That's pretty much the way to do it. The Unix `file' command does it
> pretty much like this: it generally takes the first 512 bytes of the
> file and determines the type of file from that. Binary files tend to
> have a lot of padding with zeroed-out bytes, while in an ASCII text
> file nearly every byte has a value above 31 (plus a little
> whitespace such as tab, newline and carriage return).

Note, however, that the file utility has a very high error rate.
And it knows a fair amount about the formats of different types
of binary files, and can recognize those because of various
embedded magic numbers: if the file matches a known format,
then it isn't plain text.

In practice, today, ASCII is pretty much inexistant; most text
is in some other encoding. A file in UTF-32LE, for example,
with English text, will have close to 3/4 of the bytes 0. You
can still try some heuristics: if you have a file with 1 byte
non-0, then three 0's, and that pattern repeats, with few
exceptions, there's a very good chance that it is UTF-32LE. But
it's more complicated (and globally, less reliable) than back in
the days when everything was ASCII.
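
A rough sketch of that pattern test (the 90% threshold is an
arbitrary stand-in for "with few exceptions", and it only fires for
text whose code points fit in one byte, as with English):

#include <cstddef>

// Count four-byte groups of the form "non-zero byte, then three zero
// bytes", which is what ASCII-range text looks like in UTF-32LE.
bool looks_like_utf32le(const unsigned char* buf, std::size_t n)
{
    if (n < 8)
        return false;
    std::size_t groups = n / 4;
    std::size_t matches = 0;
    for (std::size_t i = 0; i + 3 < n; i += 4) {
        if (buf[i] != 0 && buf[i+1] == 0 && buf[i+2] == 0 && buf[i+3] == 0)
            ++matches;
    }
    return matches * 10 >= groups * 9;  // "with few exceptions"
}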
 

Gianni Mariani

> In practice, today, ASCII is pretty much inexistant; most text
> is in some other encoding.

Really? Most text files I see don't have any characters beyond the
ASCII set, which would make them ASCII.
> [...] A file in UTF-32LE, for example,
> with English text, will have close to 3/4 of the bytes 0. You
> can still try some heuristics: if you have a file with 1 byte
> non-0, then three 0's, and that pattern repeats, with few
> exceptions, there's a very good chance that it is UTF-32LE. But
> it's more complicated (and globally, less reliable) than back in
> the days when everything was ASCII.

I have yet to see a UTF-32LE file in the wild. Even the UTF-16 files
I've seen are few and far between. I'd like to believe that UTF-8
will become the default text format, and there are a few tests to
determine the likelihood of a file being UTF-8 (and no, it's probably
not a BOM at the beginning of the file).
 

James Kanze

> Really? Most text files I see don't have any characters beyond the
> ASCII set, which would make them ASCII.

Really. You must live a very parochial life. I find accented
characters pretty regularly in my files (including in C++ source
files). And ASCII doesn't have any accented characters.

You're reading this thread; there are non-ASCII characters in
the messages in it. (Check out my signature, for example.)
Practically, if you're connected to the network, you can forget
about ASCII; you have to be able to handle a large number of
different character encodings.
> I have yet to see a UTF-32LE file in the wild.

I haven't either, but I know that they exist. I've also created
a few for test purposes.
> Even the UTF-16 files I've seen are few and far between.

Curious. From what I understand, UTF-16 is the standard
encoding under Windows. And machines running Windows aren't
exactly "few and far between".
> I'd like to believe that UTF-8
> will become the default text format

I would too, but given the legacy that has to be taken into
account, I don't realistically expect it to happen any time
soon.
> and there are a few tests to
> determine the likelihood of a file being UTF-8 (and no, it's
> probably not a BOM at the beginning of the file).

Actually, UTF-8 isn't that difficult. If the first 500-some
bytes don't contain an illegal UTF-8 sequence, there's only a
very small probability that the file isn't UTF-8.
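
A sketch of such a check: it verifies the lead/continuation byte
structure only; a production validator would also reject overlong
forms and surrogate code points:

#include <cstddef>

// Return false on the first malformed UTF-8 sequence in the buffer.
bool looks_like_utf8(const unsigned char* buf, std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        unsigned char c = buf[i];
        std::size_t len;
        if (c < 0x80)                len = 1;  // plain ASCII
        else if ((c & 0xE0) == 0xC0) len = 2;
        else if ((c & 0xF0) == 0xE0) len = 3;
        else if ((c & 0xF8) == 0xF0) len = 4;
        else return false;           // stray continuation or invalid lead
        if (i + len > n)
            break;                   // sequence truncated by buffer end
        for (std::size_t j = 1; j < len; ++j)
            if ((buf[i + j] & 0xC0) != 0x80)
                return false;        // expected a continuation byte
        i += len;
    }
    return true;
}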
 

Gianni Mariani

> Really. You must live a very parochial life.

What is with you French? Nuking the Pacific is not enough?
> [...] I find accented
> characters pretty regularly in my files (including in C++ source
> files). And ASCII doesn't have any accented characters.

I think my claim is valid: most, i.e. 50% or more, of the text files
I use are ASCII. If it weren't for your .sig having a few 8859-1
characters in it, your posts would be ASCII as well.

[...]
> Curious. From what I understand, UTF-16 is the standard
> encoding under Windows. And machines running Windows aren't
> exactly "few and far between".

Still, even on Windows, most text files are created as 8-bit. The
only tool I use regularly that produces UTF-16 files is regedit,
although it will read UTF-8 files correctly.

I suspect very few applications will read UTF-16 in a conforming
way. I don't know if ISO 10646 has been updated, but a while back,
UTF-16 was a stateful encoding (it still is, for all intents and
purposes). Any time you read a reversed BOM you need to swap
endianness. I have met very few programmers that know what a
surrogate pair is.
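
For the record, a surrogate pair is UTF-16's way of representing a
code point above U+FFFF as two 16-bit units; combining one is only a
few lines:

// Per the UTF-16 definition: hi must be in [0xD800, 0xDBFF] and
// lo in [0xDC00, 0xDFFF]; the result is in [0x10000, 0x10FFFF].
unsigned long combine_surrogates(unsigned hi, unsigned lo)
{
    return 0x10000UL + ((unsigned long)(hi - 0xD800) << 10)
                     + (lo - 0xDC00);
}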
> I would too, but given the legacy that has to be taken into
> account, I don't realistically expect it to happen any time
> soon.

Well, there are a lot of websites that claim to push UTF-8, and most
browsers support UTF-8 well; even bidi selection works like it
should, which is quite cool.
> Actually, UTF-8 isn't that difficult. If the first 500-some
> bytes don't contain an illegal UTF-8 sequence, there's only a
> very small probability that the file isn't UTF-8.

Yes. That's right. You need to have a lib that is robust enough to
tell you.
 

ajk

> Hi folks,
>
> I am new to Google Groups; until now I have asked my questions on
> other forums.
>
> I have an important question: I have to check whether files are
> binary (.bmp, .avi, .jpg) or text (.txt, .cpp, .h, .php, .html). How
> can I check a file and find out whether it is binary or text?
>
> Thanks for your help.

Depends a bit on what you mean by "binary".

If you are under Windows you can determine whether a file is an .exe
file by reading the first few bytes of the file. Strictly speaking,
all files are stored in binary format, and it is a matter of
interpreting the contents.
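
For example, a sketch of that check (DOS/Windows executables start
with the two signature bytes 'M' 'Z'; this looks at that signature
and nothing more):

#include <fstream>
#include <string>

// True if the file starts with the DOS/Windows executable signature.
bool looks_like_exe(const std::string& filename)
{
    std::ifstream in(filename.c_str(), std::ios::binary);
    char magic[2] = { 0, 0 };
    in.read(magic, 2);
    return in.gcount() == 2 && magic[0] == 'M' && magic[1] == 'Z';
}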
/ajk
 

osmium

ajk said:
> Depends a bit on what you mean by "binary".
>
> If you are under Windows you can determine whether a file is an .exe
> file by reading the first few bytes of the file. Strictly speaking,
> all files are stored in binary format, and it is a matter of
> interpreting the contents.

Since he posted the question to a.l.c++, we assume he wants an answer
that is appropriate within the context of that language. I think you
should think more deeply about the difference between "highly likely"
and *is*.
 

James Kanze

> What is with you French? Nuking the Pacific is not enough?

Racist, on top of it. I've worked in both France and Germany,
and it is a fact of life that both languages have characters
which aren't present in ASCII, but which are more or less
necessary if the text is to be understood, or at least appear
normal. From what I've seen of other languages, this seems to
be the usual case. Long before Unicode, different regions
developed different encodings to handle non-US ASCII characters,
because a definite need for it was felt.
> I think my claim is valid: most, i.e. 50% or more, of the text files
> I use are ASCII. If it weren't for your .sig having a few 8859-1
> characters in it, your posts would be ASCII as well.

Not all my posts. I frequently post to fr.comp.lang.c++ and
de.comp.lang.iso-c++ as well, and my posts there contain
characters which are not ASCII.

Formally, of course, the issue is far from simple. If you're
dealing with text data over the network, you have to be ready to
handle different code sets. In practice, most protocols will
insist on either one of the Unicode encodings or an encoding
which shares the first 128 characters with ASCII for the start
of the headers, until you've transmitted the information as to
which encoding you are actually using. And if you know that it
is text, and that it starts with a header, picking between
UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE and a byte encoding is
trivial, and that allows you to get through until you've read
the real encoding.
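
When a BOM is present, the choice really is trivial; a sketch of the
check (absent a BOM, the position of the zero bytes in an ASCII-range
header gives the same answer):

#include <cstddef>
#include <string>

// Inspect the first bytes for a Unicode byte order mark. Note that
// the UTF-32LE test must come before the UTF-16LE test, since
// FF FE 00 00 begins with FF FE.
std::string sniff_encoding(const unsigned char* b, std::size_t n)
{
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return "UTF-32BE";
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return "UTF-32LE";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16BE";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16LE";
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";
    return "byte encoding";  // read the header to learn which one
}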

And of course, most of the newer protocols just say: it has to
be UTF-8.
> Still, even on Windows, most text files are created as 8-bit. The
> only tool I use regularly that produces UTF-16 files is regedit,
> although it will read UTF-8 files correctly.
>
> I suspect very few applications will read UTF-16 in a conforming
> way. I don't know if ISO 10646 has been updated, but a while back,
> UTF-16 was a stateful encoding (it still is, for all intents and
> purposes). Any time you read a reversed BOM you need to swap
> endianness. I have met very few programmers that know what a
> surrogate pair is.

I have met very few programmers who even know that there exist
character sets which aren't encoded using single, 8-bit
characters. I'm not saying that ignorance isn't widespread,
but I will try to fight it whenever I can.
> Well, there are a lot of websites that claim to push UTF-8, and
> most browsers support UTF-8 well; even bidi selection works like it
> should, which is quite cool.

It's making headway. But a lot of code and text is old code and
text. And it's not going to go away anytime soon.
> Yes. That's right. You need to have a lib that is robust enough to
> tell you.

Or write one yourself :-).
 

Gianni Mariani

> Racist, on top of it. I've worked in both France and Germany,
> and it is a fact of life that both languages have characters
> which aren't present in ASCII, but which are more or less
> necessary if the text is to be understood, or at least appear
> normal. [...]

OK, the French didn't nuke the Pacific now ... and by claiming they
did one is now racist?

Because someone does not use accented characters one is now
"parochial".

And because someone does not agree with you one is "inexperienced".

Yup. Sounds French to me. If you can't use facts, use personal
attacks.
> [...] From what I've seen of other languages, this seems to
> be the usual case. Long before Unicode, different regions
> developed different encodings to handle non-US ASCII characters,
> because a definite need for it was felt.

ISO-8859-1, -2 ... -15, JIS, ShiftJIS, EUC-*, ISO-2022, Big5 and
KOI-8 are ones I have personally worked with. It was a mess. That's
why I pushed for Unicode (UTF-8) adoption as much as I could. Many
file formats became UTF-8 because I suggested it and explained to
developers what they would need to do otherwise, and believe me, it
was not easy to convince people to use UTF-8.

One of the nicest but underused features of Unicode text is language
tagging. A Unicode text string is able to tell you what language it
is in (meaning that all Unicode text is stateful), but very few
people implement it.
> Not all my posts. I frequently post to fr.comp.lang.c++ and
> de.comp.lang.iso-c++ as well, and my posts there contain
> characters which are not ASCII.

That's nice. You have such a colorful world, with that accented É
and that eszett character 'ß'; it's the pivot of the spice of life.
[...]
> And of course, most of the newer protocols just say: it has to
> be UTF-8.

That's the conclusion I came to very early. I remember when I posted
that suggestion and I was told I was being bigoted.
> I have met very few programmers who even know that there exist
> character sets which aren't encoded using single, 8-bit
> characters. I'm not saying that ignorance isn't widespread,
> but I will try to fight it whenever I can.

Life with Unicode is much easier. Surprisingly little code really
needs to care that it is parsing UTF-8. Some code will break because
it splits characters or it compares un-normalized strings, but these
problems are far easier to deal with than the mish-mash of encodings
in the past.
> It's making headway. But a lot of code and text is old code and
> text. And it's not going to go away anytime soon.

Do you normalize your Unicode strings? Do you apply state from
Unicode language tags across all strings you extract from a stream of
Unicode characters?
> Or write one yourself :-).

You have probably used one I wrote. Do you know where the "-l" in
iconv came from?
 

Ian Collins

Gianni said:
> OK, the French didn't nuke the Pacific now ... and by claiming they
> did one is now racist?
>
> Because someone does not use accented characters one is now
> "parochial".
Why all the crap? Just because you and I don't see many text files
with extended character sets doesn't mean they aren't in widespread
use.

If you want to pick a fight, find a rough bar.
 

Gianni Mariani

> Why all the crap?

Is that a technical term?
> [...] Just because you and I don't see many text files with
> extended character sets doesn't mean they aren't in widespread use.

The claim by James was that today "ASCII is pretty much
inexistant (sic)", which is blatantly wrong. Having pointed that out
to him, he shoots back with "parochial" or "inexperienced" to justify
himself.

James, being of German and French background, I could hope for a more
Swiss-neutral attitude, but it appears that we have a classic
Parisian arrogance with a German bureaucratic mind-set. I haven't met
too many of these guys around.
> If you want to pick a fight, find a rough bar.

You're right, I should have known better.

So, we should all proclaim that all ASCII files are now officially
UTF-8, and all other text formats are deprecated and should be
deleted.
 

Gianni Mariani

On 10 May 2007 09:58:41 -0700, (e-mail address removed) wrote:
> [...] Strictly speaking, all files are stored in binary format,
> and it is a matter of interpreting the contents.

Strictly speaking, that is not true; it depends on whom you consider
to be doing the interpretation. Some systems (VMS) didn't allow you
to read the binary stream of all files and would have "record
management services" (RMS) get in the way. Those days are more or
less gone (thank Unix).
 

Markus Schoder

> OK, the French didn't nuke the Pacific now ... and by claiming they
> did one is now racist?
>
> Because someone does not use accented characters one is now
> "parochial".
>
> And because someone does not agree with you one is "inexperienced".
>
> Yup. Sounds French to me. If you can't use facts, use personal
> attacks.

Funny how nationalism rears its ugly head in the most unlikely places.

Welcome to my kill file.
 

James Kanze

> OK, the French didn't nuke the Pacific now ... and by claiming they
> did one is now racist?

What does nuking the Pacific have to do with anything? It's
racist to condemn all French because some idiotic government
officials do something stupid. If you're going to judge
everyone by their government, what would one say about the
Americans today?
> Because someone does not use accented characters one is now
> "parochial".

Because one doesn't take into account that they exist, one is
very parochial.

[...]
> That's the conclusion I came to very early. I remember when I
> posted that suggestion and I was told I was being bigoted.

By whom? I think that there is a consensus that UTF-8 is the way
to go. The problem is that reality isn't following that
consensus very quickly, and that as soon as a computer is
connected to the network, it has to deal with all sorts of weird
encodings. It's a lot of extra work for everyone involved, but
that's life.

[...]
> Life with Unicode is much easier. Surprisingly little code really
> needs to care that it is parsing UTF-8.

Are you kidding? What about code which uses, e.g., isalpha()?
> Some code will break because it
> splits characters or it compares un-normalized strings, but these
> problems are far easier to deal with than the mish-mash of encodings
> in the past.

Easier, yes, but not all of the tools are necessarily in place.
Things like "isalpha()" are an obvious problem.
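
A small illustration of the isalpha() trap (the byte value is just an
example):

#include <cctype>
#include <cstdio>

int main()
{
    char c = '\xC3';    // first byte of UTF-8 'é' (0xC3 0xA9)
    // std::isalpha(c) would be undefined behaviour where plain char
    // is signed: the argument must be representable as unsigned char
    // or be EOF.
    int r = std::isalpha(static_cast<unsigned char>(c));
    // Even this "safe" call is wrong in spirit: it classifies one
    // byte, not the two-byte character that byte belongs to.
    std::printf("%d\n", r);
    return 0;
}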
> You have probably used one I wrote. Do you know where the "-l" in
> iconv came from?

What's iconv?
 

James Kanze

More significantly, the software which generated what you are
processing as "pure ASCII" probably was actually using some
extended code set. There is no support for "pure ASCII" under
Linux, as far as I can see, for example. The reality is that if
your software doesn't correctly handle characters with bit 7
set, it is broken, because even in America, most of the tools
can easily generate such files.

I know that I have a couple of files which contain a 'ÿ' (y with
a diaeresis) in ISO 8859-1, for test purposes. It's amazing how
many programs treat it as an end of file. Would you (or Gianni,
for that matter) consider this "correct", even if the program
didn't have to deal with accented characters per se? Would you
(or Gianni) consider it OK not to test this (limit) case,
knowing that it is a frequent error?
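
For what it's worth, the usual version of that bug ("test.txt" is
just a stand-in name):

#include <cstdio>

int main()
{
    std::FILE* f = std::fopen("test.txt", "rb");
    if (!f)
        return 1;
    char c;                            // BUG: should be int
    while ((c = std::getc(f)) != EOF)  // 'ÿ' (0xFF) ends the loop
        std::putchar(c);               // early where char is signed
    std::fclose(f);
    return 0;
}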
> The claim by James was that today "ASCII is pretty much
> inexistant (sic)", which is blatantly wrong.

Statistics? ASCII isn't used by Windows. It's not available in
the standard Linux distributions I use. All of the Internet
protocols I know *now* require more. (The now is important.
When I first implemented code around SMTP and NNTP, ASCII was
the standard encoding, and in fact, the only one supported.)
> Having pointed that out to him, he shoots back with "parochial" or
> "inexperienced" to justify himself.
> James, being of German and French background,

James, being born and raised in the United States, and still
holding an American passport...
> I could hope for a more
> Swiss-neutral attitude, but it appears that we have a classic
> Parisian arrogance with a German bureaucratic mind-set.

More racism. I've not encountered any arrogance in Paris, and
I've not found Germany to be any more bureaucratic than anywhere
else.

People with that sort of attitude are parochial. They've not
gone out and actually considered other people for what they are.

[...]
> So, we should all proclaim that all ASCII files are now officially
> UTF-8, and all other text formats are deprecated and should be
> deleted.

Of course, if you had actually read what you're responding to, you
would have seen that I said we have to deal with a lot of different
code sets, and that that is a real problem.
 
