Test if file is binary?

Simon Krahnke · Aug 22, 2007

* Stefan Mahlitz said:
Maybe this was a bad example. You are right, the xml-file would be best
treated by clearcase as application/xml or text/xml. This did not work
(and I was bitten by this recently - so this strange behaviour was fresh
when I read your email).

Note that Subversion would just treat the file as binary and process it
with its binary diff.

But I cannot see the problem with text-files containing long lines. If I
write a single paragraph with more than 1000 or 8000 characters - why
shouldn't this be text?

If that's really a paragraph there is no problem, except maybe of style.
(When wrapped in lines of 80 characters, it's 100 lines!)

Why do you think it is not readable?

I think that an XML file that has huge lines is unreadable since a
human wouldn't recognize any structure, when all the elements are on a
single line.

Sorry, I fail to see your point.

That's another point. Or actually two. The hard one being the limitation
of the software used: It may have a maximum line length.

Are we really judging whether a file is text by how many memory pages
a diff will take or how many characters a patch has?

No, this has nothing to do with being text, just with being well suited
as input to a diff algorithm.

Text is usually well suited, as is anything else that is line oriented
and where typical changes affect only one or a few neighboring lines.

mfg, simon .... l

Xavier Noria · Aug 22, 2007

Note that Subversion would just treat the file as binary and process it
with its binary diff.

It also disables newline normalization (which may or may not be an
issue in that case).

-- fxn

Simon Krahnke · Aug 22, 2007

* Xavier Noria said:
It also disables newline normalization (which may or may not be an
issue in that case).

Which is configurable for text files, too.

mfg, simon .... end of off topic

Stefan Mahlitz · Aug 22, 2007

Simon said:
Note that Subversion would just treat the file as binary and process it
with its binary diff.

I didn't know this. Thanks for the info.

If that's really a paragraph there is no problem, except maybe of style.
(When wrapped in lines of 80 characters, it's 100 lines!)

Agreed. But it is still text - which was the point I tried to make.

I think that an XML file that has huge lines is unreadable since a
human wouldn't recognize any structure, when all the elements are on a
single line.

My question was directed to the 8000 char-paragraph. I even find small
xml-files unreadable - so I completely agree with you that 8000 chars of
xml-data in a single line is far from being readable by a human. Anyway
- xml is meant to be processed by machines.

But even this case I would classify as text (I'm changing my earlier
definition slightly) if it does not contain binary data. The xml in a
file is semantics. And I assume the question text or binary refers to
syntax.

That's another point. Or actually two. The hard one being the limitation
of the software used: It may have a maximum line length.

If I understand the original poster correctly he wants to
programmatically detect whether a file is "binary or text". My point was
that he shouldn't restrict his program artificially - but this depends on
context.

No, this has nothing to do with being text, just with being well suited
as input to a diff algorithm.

Text is usually well suited, as is anything else that is line oriented
and where typical changes affect only one or a few neighboring lines.

These are things I'm normally not concerned about, that's why I couldn't
follow that subject change.

Do I summarize correctly that depending on the purpose of the check one
could use a maximum line length - or any other of the posted approaches?

Aka 'use the right tool for the job' + 'There is no single answer to
this question'?

Stefan

Simon Krahnke · Aug 23, 2007

* Stefan Mahlitz said:
My question was directed to the 8000 char-paragraph. I even find small
xml-files unreadable

Well, there are a lot of XML files that I find readable, including many
that I or my software wrote.

Of course there are perversions like XMI and Microsoft's new formats.

- so I completely agree with you that 8000 chars of xml-data in a
single line is far from being readable by a human.

And thus it's binary and not text.

Anyway - xml is meant to be processed by machines.

It's meant to be read by an XML parser, which a regular diff isn't. So
only special cases are well suited for diff, and other special cases are
human readable.

But even this case I would classify as text (I'm changing my earlier
definition slightly) if it does not contain binary data.

I would say it's text if it is human readable when interpreted as
text/plain. Otherwise it's binary. That is, binary = for machines only.

If I understand the original poster correctly he wants to
programmatically detect whether a file is "binary or text". My point was
that he shouldn't restrict his program artificially - but this depends on
context.

Yes, in the original post he didn't say for what purpose. If it's for
diffing, the line structure is what matters.
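
For the diffing case, a check on line structure could be sketched roughly
as below. This is only an illustration: the function names and the
1000-byte threshold are hypothetical (the figure is borrowed from the
"1000 or 8000 characters" example earlier in the thread), and a real tool
would use whatever limit its diff implementation actually has.

```python
def longest_line(data: bytes) -> int:
    """Return the length in bytes of the longest line, without its terminator."""
    # bytes.splitlines() splits on \n, \r and \r\n and drops the terminators.
    return max((len(line) for line in data.splitlines()), default=0)


def suits_line_diff(data: bytes, max_line_length: int = 1000) -> bool:
    """Guess whether the content is line oriented enough for a line-based diff.

    max_line_length is an arbitrary, hypothetical threshold.
    """
    return longest_line(data) <= max_line_length


# An XML document written on a single line fails the check,
# the same kind of content spread over many short lines passes.
one_line_xml = b"<item/>" * 500
many_lines_xml = b"<item/>\n" * 500
print(suits_line_diff(one_line_xml), suits_line_diff(many_lines_xml))
```

Note that this says nothing about whether the content is text, only
whether a line-based diff is likely to give useful output.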

Do I summarize correctly that depending on the purpose of the check one
could use a maximum line length - or any other of the posted approaches?

The other approaches are good for deciding whether a file contains text in
Latin-based scripts. That's only a small subset of text, and they will
happily classify base64 as text.
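
To illustrate the base64 point, here is a minimal sketch of a
printable-byte heuristic. It is an assumption that the approaches posted
earlier look roughly like this; the name looks_like_text, the set of
accepted bytes and the 0.9 threshold are all made up for the example.

```python
import base64

# Printable ASCII plus tab, line feed and carriage return.
TEXT_BYTES = bytes(range(0x20, 0x7F)) + b"\t\n\r"


def looks_like_text(data: bytes, threshold: float = 0.9) -> bool:
    """Heuristic: no NUL bytes and mostly printable ASCII (hypothetical threshold)."""
    if not data:
        return True
    if b"\0" in data:
        return False
    printable = sum(byte in TEXT_BYTES for byte in data)
    return printable / len(data) >= threshold


payload = bytes(range(256)) * 4        # clearly not text
encoded = base64.b64encode(payload)    # the same data, base64 encoded

print(looks_like_text(payload))   # False: contains NUL bytes
print(looks_like_text(encoded))   # True: base64 uses only printable ASCII
```

The encoded payload is no more human readable than the raw bytes, yet the
heuristic accepts it. The heuristic would also reject perfectly good text
in a non-Latin script if it is stored in a multi-byte encoding.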

Aka 'use the right tool for the job' + 'There is no single answer to
this question'?

Yes. Probably the best approach was using file(1).
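
Calling file(1) from a program could look roughly like this. The sketch
assumes a reasonably recent version of the common file implementation,
which accepts --brief and --mime-encoding and reports "binary" for data
it does not recognize as text. The helper name mime_encoding is
hypothetical, and the function returns None when no file command is
installed.

```python
import shutil
import subprocess
from typing import Optional


def mime_encoding(path: str) -> Optional[str]:
    """Ask file(1) for the encoding of a file, e.g. "us-ascii", "utf-8", "binary".

    Returns None if no `file` executable is found on PATH.
    """
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-encoding", "--", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if file(1) itself fails
    )
    return result.stdout.strip()
```

A caller would then treat an answer of "binary" as "not text" and
anything else as the name of a text encoding, keeping in mind that the
exact strings reported depend on the installed version.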

mfg, simon .... l

Wolfgang Nádasi-Donner · Sep 26, 2007

Don't forget the possibility that a file is encoded in UTF-16 or
UTF-32. To recognize this textual data you need an extra recognition
step in front of the rest.
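
One possible form of such a step is to look for a byte order mark before
running any byte-level heuristic. The following is a sketch under that
assumption: the function name sniff_bom is hypothetical, and UTF-16 or
UTF-32 data written without a BOM will not be detected this way.

```python
import codecs
from typing import Optional

# UTF-32 must be tested first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name if the data starts with a UTF-16/UTF-32 BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


sample = codecs.BOM_UTF16_LE + "Test".encode("utf-16-le")
print(sniff_bom(sample))            # utf-16-le
print(sniff_bom(b"plain ASCII"))    # None
```

If a BOM is found, the data can be decoded with the reported codec and
the remaining checks run on the decoded characters rather than on raw
bytes, which would otherwise be full of NUL bytes for mostly-ASCII text.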

Wolfgang WoNáDo
