Test if file is binary?

Simon Krahnke

* Stefan Mahlitz said:
> Maybe this was a bad example. You are right, the xml-file would be best
> treated by clearcase as application/xml or text/xml. This did not work
> (and I was bitten by this recently - so this strange behaviour was fresh
> when I read your email).

Note that Subversion would just treat the file as binary and process it
with its binary diff.

> But I cannot see the problem with text-files containing long lines. If I
> write a single paragraph with more than 1000 or 8000 characters - why
> shouldn't this be text?

If that's really a paragraph there is no problem, except maybe of style.
(When wrapped in lines of 80 characters, it's 100 lines!)

> Why do you think it is not readable?

I think that an XML file that has huge lines is unreadable since a
human wouldn't recognize any structure, when all the elements are on a
single line.

> Sorry, I fail to see your point.

That's another point. Or actually two. The hard one being the limitation
of the software used: It may have a maximum line length.

> Are we really judging whether a file is text by how much memory pages
> a diff will take or how many characters a patch has?

No, this has nothing to do with being text, just with being well suited
as input to a diff algorithm.

Text usually is suited as well as everything else that is line oriented
and typical changes affect only one or a few neighboring lines.

mfg, simon .... l
 
Xavier Noria

> Note that Subversion would just treat the file as binary and process it
> with its binary diff.

It also disables newline normalization (which may or may not be an
issue in that case).

-- fxn
 
Simon Krahnke

* Xavier Noria said:
> It also disables newline normalization (which may or may not be an
> issue in that case).

Which is configurable for text files, too.

mfg, simon .... end of off topic
 
Stefan Mahlitz

Simon said:
> Note that Subversion would just treat the file as binary and process it
> with its binary diff.

I didn't know this. Thanks for the info.

> If that's really a paragraph there is no problem, except maybe of style.
> (When wrapped in lines of 80 characters, it's 100 lines!)

Agreed. But it is still text - which was the point I tried to make.

> I think that an XML file that has huge lines is unreadable since a
> human wouldn't recognize any structure, when all the elements are on a
> single line.

My question was directed to the 8000 char-paragraph. I even find small
xml-files unreadable - so I completely agree with you that 8000 chars of
xml-data in a single line is far from being readable by a human. Anyway
- xml is meant to be processed by machines.

But even this case I would classify as text (I'm changing my earlier
definition slightly) if it does not contain binary data. The xml in a
file is semantics. And I assume the question text or binary refers to
syntax.

> That's another point. Or actually two. The hard one being the limitation
> of the software used: It may have a maximum line length.

If I understand the original poster correctly he wants to
programmatically detect whether a file is "binary or text". My point was
that he shouldn't restrict his program artificially - but this depends on
context.

> No, this has nothing to do with being text, just with being well suited
> as input to a diff algorithm.
>
> Text usually is suited as well as everything else that is line oriented
> and typical changes affect only one or a few neighboring lines.

These are things I'm normally not concerned about, that's why I couldn't
follow that subject change.

Do I summarize correctly that depending on the purpose of the check one
could use a maximum line length - or any other of the posted approaches?

Aka 'use the right tool for the job' + 'There is no single answer to
this question'?

Stefan
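
For the diff-oriented reading of the question, the maximum-line-length check mentioned above could be sketched roughly like this. It is only an illustration in Python: the function names and the default limit of 1000 bytes are assumptions picked to match the numbers discussed in this thread, not a limit taken from ClearCase, Subversion, or any particular diff tool.

```python
def longest_line_length(path, chunk_size=64 * 1024):
    """Return the length in bytes of the longest line in the file.

    The file is read in fixed-size chunks, so a file that consists of one
    huge line never has to fit into memory as a single string.
    Only b"\n" is treated as a line terminator; a preceding b"\r" counts
    towards the line length.
    """
    longest = 0
    current = 0  # length of the line currently being scanned
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            pieces = chunk.split(b"\n")
            # The first piece continues the line carried over from the
            # previous chunk.
            current += len(pieces[0])
            for piece in pieces[1:]:
                # A newline was seen, so the current line is complete.
                longest = max(longest, current)
                current = len(piece)
    return max(longest, current)


def is_diff_friendly(path, max_line_length=1000):
    """Hypothetical check: True if no line exceeds max_line_length bytes."""
    return longest_line_length(path) <= max_line_length
```

Note that this says nothing about whether the content is "text"; it only answers the narrower question of whether a line-oriented tool is likely to cope with the file.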
 
Simon Krahnke

* Stefan Mahlitz said:
> My question was directed to the 8000 char-paragraph. I even find small
> xml-files unreadable

Well, there are a lot of XML files that I find readable. Including many I
or my software wrote.

Of course there are perversions like XMI and Microsoft's new formats.

> - so I completely agree with you that 8000 chars of xml-data in a
> single line is far from being readable by a human.

And thus it's binary and not text.

> Anyway - xml is meant to be processed by machines.

It's meant to be read by an XML parser, which a regular diff isn't. So
only special cases are well suited for diff, and other special cases are
human readable.

> But even this case I would classify as text (I'm changing my earlier
> definition slightly) if it does not contain binary data.

I would say it's text if, when interpreted as text/plain, it's human
readable. Otherwise it's binary. That is, binary = for machines only.

> If I understand the original poster correctly he wants to
> programmatically detect whether a file is "binary or text". My point was
> that he shouldn't restrict his program artificially - but this depends on
> context.

Yes, in the original post he didn't say for what purpose. If it's for
diffing, the line structure is what matters.

> Do I summarize correctly that depending on the purpose of the check one
> could use a maximum line length - or any other of the posted approaches?

The other approaches are good for deciding if the file contains text in
Latin-based scripts. That's only a small subset of text, and they will
happily classify base64 as text.

> Aka 'use the right tool for the job' + 'There is no single answer to
> this question'?

Yes. Probably the best approach was using file(1).

mfg, simon .... l
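
To illustrate the base64 remark, here is a minimal sketch of the general kind of byte-counting heuristic being discussed. It is an assumption about what such a check typically looks like, not a copy of any approach posted earlier in the thread; the name `looks_like_text`, the set of "text" bytes, and the 30% threshold are all hypothetical choices.

```python
import base64

# Bytes counted as "text": tab, LF, CR and printable ASCII (hypothetical choice).
TEXT_BYTES = frozenset({9, 10, 13}) | frozenset(range(32, 127))


def looks_like_text(data, threshold=0.30):
    """Guess whether a bytes sample is text by counting non-ASCII bytes."""
    if not data:
        return True
    if b"\x00" in data:
        # A NUL byte is taken as a strong hint for binary data.
        return False
    non_text = sum(1 for byte in data if byte not in TEXT_BYTES)
    return non_text / len(data) <= threshold


png_header = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"
encoded = base64.b64encode(png_header)
cyrillic = "Привет мир".encode("utf-8")

print(looks_like_text(png_header))  # False: contains NUL bytes
print(looks_like_text(encoded))     # True: base64 is plain ASCII
print(looks_like_text(cyrillic))    # False: nearly every byte is above 127
```

The second and third results show both weaknesses at once: machine-only data passes as text once it is base64-encoded, while perfectly readable text in a non-Latin script is rejected.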
 
Wolfgang Nádasi-Donner

Don't forget the possibility that a file is encoded in UTF-16 or
UTF-32. To recognize such textual data you need an extra detection
step in front of the rest.

Wolfgang WoNáDo
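
Such an extra step could be sketched as a byte order mark (BOM) check that runs before any other heuristic. This is only an illustration: the function name `detect_bom_encoding` is made up for this example, and the check cannot recognize UTF-16 or UTF-32 files that were written without a BOM.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) starts with the
# UTF-16 LE BOM (FF FE), so the UTF-32 marks must be tested first.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_bom_encoding(data):
    """Return a codec name if data starts with a UTF-16/UTF-32 BOM, else None."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None


sample = "hello".encode("utf-16")  # Python prepends a BOM for this codec
encoding = detect_bom_encoding(sample)
print(encoding)                 # utf-16
print(sample.decode(encoding))  # hello
```

If a BOM is found, the data can be decoded first and the text/binary decision made on the decoded characters. Without this step, a NUL-byte test would reject most UTF-16 text, since ASCII characters are stored with a zero byte in that encoding.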
 
