[Q] Text vs Binary Files


Eric

Assume that disk space is not an issue
(the files will be small < 5k in general for the purpose of storing
preferences)

Assume that transportation to another OS may never occur.


Are there any solid reasons to prefer text files over binary files?

Some of the reasons I can think of are:

-- should transportation to another OS become useful or needed,
the text files would be far easier to work with

-- tolerant of basic data type size changes (enumerated types have been
known to change in size from one version of a compiler to the next)

-- if a file becomes corrupted, it would be easier to find and repair
the problem potentially avoiding the annoying case of just
throwing it out

I would like to begin using XML for the storage of application
preferences, but I need to convince others who are convinced that binary
files are the superior method that text files really are the way to go.

Thoughts? Comments?
 

Arthur J. O'Dwyer

Assume that disk space is not an issue [...]
Assume that transportation to another OS may never occur.
Are there any solid reasons to prefer text files over binary files?

Some of the reasons I can think of are:

-- should transportation to another OS become useful or needed,
the text files would be far easier to work with

I would guess this is wrong, in general. Think of the difference
between a DOS/Win32 text file, a MacOS text file, and a *nix text
file (hint: linefeeds and carriage returns). Now think of the
difference between the same systems' binary files (hint: nothing).
There do exist many free tools to deal with line-ending troubles,
though, so this isn't really a disadvantage; just a counter to your
claim.
-- tolerant of basic data type size changes (enumerated types have been
known to change in size from one version of a compiler to the next)

It's about five minutes' work to write portable binary I/O functions
in most languages, if you're worried about the size of 'int' on your
next computer or something. Check out any file-format standard for
ideas, and Google "network byte order." If you're coming from a C
background, then you'll understand when I tell you that 'fwrite' should
never, ever be applied to anything but buffers of 'unsigned char'! :)
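  To make the "five minutes' work" concrete, here is a minimal sketch of
portable 32-bit binary I/O in C; the function names are illustrative,
not from any standard library:

```c
#include <stdio.h>

/* Write a 32-bit value in network (big-endian) byte order, one
   unsigned char at a time -- never fwrite() a raw int. */
static int fwrite_u32be(unsigned long x, FILE *fp)
{
    unsigned char buf[4];
    buf[0] = (unsigned char)((x >> 24) & 0xFF);
    buf[1] = (unsigned char)((x >> 16) & 0xFF);
    buf[2] = (unsigned char)((x >>  8) & 0xFF);
    buf[3] = (unsigned char)( x        & 0xFF);
    return fwrite(buf, 1, 4, fp) == 4 ? 0 : -1;
}

/* Read it back; the result is identical on any host, any endianness. */
static int fread_u32be(unsigned long *x, FILE *fp)
{
    unsigned char buf[4];
    if (fread(buf, 1, 4, fp) != 4)
        return -1;
    *x = ((unsigned long)buf[0] << 24) | ((unsigned long)buf[1] << 16)
       | ((unsigned long)buf[2] <<  8) |  (unsigned long)buf[3];
    return 0;
}
```

Because the bytes are packed and unpacked explicitly, the file looks
the same no matter which machine wrote it.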
-- if a file becomes corrupted, it would be easier to find and repair
the problem potentially avoiding the annoying case of just
throwing it out

Yes, definitely. Also, it's much easier to tell if text has been
corrupted in transmission --- it won't look like text anymore!
Binary always looks like binary; you need explicit checksums and
guards against corruption there. (Again, see file-format standards,
especially my favorite, the PNG image standard.)
I would like to begin using XML for the storage of application
preferences, but I need to convince others who are convinced that binary
files are the superior method that text files really are the way to go.

One major advantage of plain text is that it can be sent over HTTP
and other Web protocols without "armoring." You can put plain text
in the body of a POST request, for example, where I doubt arbitrary
bytes would be accepted. (I dunno, though.)
Along the same lines, you can email your data files back and forth
in the body of an email message, rather than mucking about with
attachments.

The disadvantage is size; but you don't seem worried about that.
Another possible disadvantage would be that text is easily read and
reverse-engineered, if you're worried about that (e.g., proprietary
config files or savefiles for a game) --- but then you can always
encrypt whatever you don't want read immediately. [Whatever you
don't want read *ever*, you simply don't give to your users, because
they'll crack anything given enough time.]

HTH,
-Arthur
 

Eric

Arthur J. O'Dwyer said:
I would guess this is wrong, in general. Think of the difference
between a DOS/Win32 text file, a MacOS text file, and a *nix text
file (hint: linefeeds and carriage returns).

Which is why I mentioned at the end using a solid XML parser to deal
with such issues transparently. I likely wouldn't consider using a text
file if something like XML and solid parsers weren't available and free.
Now think of the
difference between the same systems' binary files (hint: nothing).

Well, you say 'same systems'...so, yes, in general, reading & writing a
binary file that will never be moved to another OS shouldn't present any
serious issues. (or am I wrong here?)

However, the point was that it could be moved, in which case dealing
with big/little endian issues would become important.
It's about five minutes' work to write portable binary I/O functions
in most languages

Ah, but it's five minutes I don't want to spend, especially since the
time would need to be spent every time something changed. I believe in
fixing a problem once.

Plus, the potential for spending time attempting to figure out why the
@#$%@$ isn't being read properly isn't accounted for here.
Another possible disadvantage would be that text is easily read and
reverse-engineered

In my case, this is a benefit.
 

gswork

Assume that disk space is not an issue
(the files will be small < 5k in general for the purpose of storing
preferences)

Assume that transportation to another OS may never occur.


Are there any solid reasons to prefer text files over binary files?

Some of the reasons I can think of are:

-- should transportation to another OS become useful or needed,
the text files would be far easier to work with

-- tolerant of basic data type size changes (enumerated types have been
known to change in size from one version of a compiler to the next)

-- if a file becomes corrupted, it would be easier to find and repair
the problem potentially avoiding the annoying case of just
throwing it out

All good reasons...
I would like to begin using XML for the storage of application
preferences, but I need to convince others who are convinced that binary
files are the superior method that text files really are the way to go.

Thoughts? Comments?

For your application I think you have it right. Preferences in an XML
text file are more flexible for the user/admin (they can be edited by
hand as a last resort) and also for you as the developer: a text file
can have entries listed 'out of order', and with the right tags and
parsing it won't really matter. For the same reasons they can also be
easier to change and add to over time.
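A hypothetical preferences file along those lines (element and
attribute names invented purely for illustration):

```xml
<!-- Entries are named, so their order doesn't matter, and new
     settings can be added without breaking older readers. -->
<preferences>
  <window width="800" height="600"/>
  <recent-files max="10"/>
  <editor tab-size="4" wrap="true"/>
</preferences>
```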

The main reasons for using binary files to store preferences are:

- security (but they're crackable, and text files can be encrypted
anyway)
- programming ease: it can be easier to just have a preference
structure than to attempt a robust parsing of a given set of text
items; the text could be messed with, after all
- size: relevant if they need to be shuttled around a network a lot or
will take up lots of disk space

It sounds like they don't apply in your case.
 

Arthur J. O'Dwyer

Which is why I mentioned at the end using a solid XML parser to deal
with such issues transparently. I likely wouldn't consider using a text
file if something like XML and solid parsers weren't available and free.

Ah, but what do you do when the XML standard changes? :) Seriously,
this is something you really need to consider IMHO. (Of course, this
is cross-posted to an XML group, and I don't know much about XML, so
don't take my word about anything...) There are XML Version Foo parsers
available now, but when XML Version Bar comes out, there'll be lag time.
Think of the messes with HTML 4.0 [about which I know little] and C'99
[about which I know much].
Free parsers *are* nice, though, no dispute there. :)
Well, you say 'same systems'...so, yes, in general, reading & writing a
binary file that will never be moved to another OS shouldn't present any
serious issues. (or am I wrong here?)

Misunderstood. By "the same systems," I meant the systems I just
mentioned: DOS/Win32, Unix, and MacOS. Their binary data formats are
identical.
Ah, but it's five minutes I don't want to spend,

Versus five minutes trying to make your free XML parser compile?
I'd take five minutes with binary files any day. ;-)
especially since the
time would need to be spent every time something changed. I believe in
fixing a problem once.

So do I. That's why you spend the five minutes writing your portable
binary I/O functions. Then you never need to write them again. For
a not-so-hot-but-portable-across-aforementioned-systems example, see
http://www.contrib.andrew.cmu.edu/~ajo/free-software/ImageFmtc.c,
functions 'fread_endian' and 'bwrite_endian'. Write once, use many
times.
The number of bits in a 32-bit integer is *never* going to change.
The number of bits in a machine word is *definitely* going to change.
This is why all existing file-format standards explicitly state that
they are dealing with 32-bit integers, not machine words: so the
file-format code never has to change, no matter where it runs.
Plus, the potental for spending time attempting to figure out why the
@#$%@$ isn't being read properly isn't accounted for here.

Of course not. I/O is trivial. It's your *algorithms* that are
going to be broken; and they'd be broken no matter what output format
you used.
In my case, this is a benefit.

Good. :)

-Arthur
 

Eric

Arthur J. O'Dwyer said:
Ah, but what do you do when the XML standard changes? :)

Please correct me if I am wrong, but the design of XML already takes
this into account. In other words, the idea that it can and will change
is a part of the design - this is one reason why XML is such a nifty
technology.
Misunderstood. By "the same systems," I meant the systems I just
mentioned: DOS/Win32, Unix, and MacOS. Their binary data formats are
identical.

What do you mean by 'their binary data formats are identical'?...this
would seem to imply that big/little endian issues are a thing of the
past...?
Versus five minutes trying to make your free XML parser compile?

Binaries of the better parsers are available, so this is a non-issue.
:)
Of course not. I/O is trivial.

Once you track down the problem...however, it would not be uncommon to
think the problem lies elsewhere first and spend hours before finding
the trivial fix.
It's your *algorithms* that are
going to be broken; and they'd be broken no matter what output format
you used.

With XML, the risk of this is far less (if it really exists at all), as
long as you're not changing the tag names or what they mean.
 

Arthur J. O'Dwyer

Please correct me if I am wrong, but the design of XML already takes
this into account. In other words, the idea that it can and will change
is a part of the design - this is one reason why XML is such a nifty
technology.

Probably true. I don't know much about XML's namespacing rules
(by which I mean the rules that say that <foo> is an okay tag for
a user to create, but <bar> could be given special meaning by
future standards). [If anyone wants to give me a lecture, that's
fine; otherwise, I'll just look it up when I need to know. ;) ]
What do you mean by 'their binary data formats are identical'?...this
would seem to imply that big/little endian issues are a thing of the
past...?

Yup. The vast majority of computers these days use eight-bit
byte-oriented transmission and storage protocols. Whatever bit-ordering
problems there are have moved "downstream" to those people involved in
the construction of hardware that has to choose whether to transmit
bit 0 or bit 7 first (and I'm sure they have their own relevant
standards in those fields, too).
Again, I refer you to standards like RFCs 1950, 1951, and 1952
(Google "RFC 1950"). Note the utter lack of concern with the vagaries
of the machine. We have indeed moved past big/little-endian wars;
now, whoever's[1] writing the relevant standard simply says, "All eggs
distributed according to the Fred protocol must be broken at the
big end," and that's the end of *that!*

Once you track down the problem...however, it would not be uncommon to
think the problem lies elsewhere first and spend hours before finding
the trivial fix.

You misunderstand me. I/O is trivial; thus, after the first five
minutes spent making sure the trivial code is correct (which is trivial
to prove), you never need to touch it or look at it again. If you
never touch it, you can't possibly introduce bugs into it. And if it
starts out bugfree (trivially proven), and never has any bugs introduced
into it (because it's never modified), then it will remain bugfree
forever. (And thus you never need to fix it, trivially or not.)

I'm completely serious and not using hyperbole at all when I say
I/O is trivial. It really is.

-Arthur

[1] - In speech I'd say "who'sever writing...," but that looks
awful no matter how I spell it. Whosever? Whos'ever? Who's-ever?
Yuck. :(
 

Darrell Grainger

Assume that disk space is not an issue
(the files will be small < 5k in general for the purpose of storing
preferences)

Assume that transportation to another OS may never occur.


Are there any solid reasons to prefer text files over binary files?

Some of the reasons I can think of are:

-- should transportation to another OS become useful or needed,
the text files would be far easier to work with

-- tolerant of basic data type size changes (enumerated types have been
known to change in size from one version of a compiler to the next)

-- if a file becomes corrupted, it would be easier to find and repair
the problem potentially avoiding the annoying case of just
throwing it out

I would like to begin using XML for the storage of application
preferences, but I need to convince others who are convinced that binary
files are the superior method that text files really are the way to go.

Thoughts? Comments?

In favour of binary, if a customer has access to it, they will be more
likely to muck with a text file than a binary file.

In favour of text, will you ever need to diff the files (old version
against new version)? Will you need to put the files under source
control and/or merge them? Easier to do as text.
 

Ben Measures

Arthur said:
Which is why I mentioned at the end using a solid XML parser to deal
with such issues transparently. I likely wouldn't consider using a text
file if something like XML and solid parsers weren't available and free.

Ah, but what do you do when the XML standard changes? :) Seriously,
this is something you really need to consider IMHO. (Of course, this
is cross-posted to an XML group, and I don't know much about XML, so
don't take my word about anything...) There are XML Version Foo parsers
available now, but when XML Version Bar comes out, there'll be lag time.
Think of the messes with HTML 4.0 [about which I know little] and C'99
[about which I know much].
Free parsers *are* nice, though, no dispute there. :)

XML was created to solve the problem of the HTML version mess. The
specification itself is very flexible (yet precise) with the result that
the language can be extended without needing a change to the
specification (or parsers based on the specification).

It's so good it's almost magical.
The number of bits in a 32-bit integer is *never* going to change.
The number of bits in a machine word is *definitely* going to change.
This is why all existing file-format standards explicitly state that
they are dealing with 32-bit integers, not machine words: so the
file-format code never has to change, no matter where it runs.

IIRC in C++ (and I'm sure C) there is no such guarantee of a "32-bit
integer" - the int type can be more than 32 bits.
Of course not. I/O is trivial. It's your *algorithms* that are
going to be broken; and they'd be broken no matter what output format
you used.

Unless you're using somebody else's parser, which may not be broken.
Such as libxml2 which is *very* unlikely to be broken.
 

Arthur J. O'Dwyer

XML was created to solve the problem of the HTML version mess. The
specification itself is very flexible (yet precise) with the result that
the language can be extended without needing a change to the
specification (or parsers based on the specification).

It's so good it's almost magical.

Okay, I'm convinced, then. :)

IIRC in C++ (and I'm sure C) there is no such guarantee of a "32-bit
integer" - the int type can be more than 32 bits.

More is better. A 33-bit integer can hold all the values that a
32-bit integer can, and then some. If the particular algorithms in
question are defined not to use the "and then some" part of the integer,
that's fine. (The at-least-32-bit type in C and C++ is 'long int'.
When I use the word 'integer', I'm using it in the same sense as the
C standard: to mean "any integral type," not to mean "'int' type."
Just in case that was confusing you.)

*Again* I urge the consultation of the RFCs defining any standard
binary file format, and the notice of the complete lack of regard
for big-endian/little-endian/19-bit-int/37-bit-int issues. At the
byte level, these things simply never come up.

Unless you're using somebody else's parser, which may not be broken.
Such as libxml2 which is *very* unlikely to be broken.

I don't see the connection between my statement and your reply.
What is the antecedent of your "Unless"? (Literally, you're saying
that if you use libxml2 for I/O, then your non-I/O-related algorithms
will have no bugs. This is what used to be called "spooky action at a
distance," and I don't think it applies to code. :)

-Arthur
 

Programmer Dude

Eric writes:
Arthur J. O'Dwyer writes:

E> ...the files will be [...] for the purpose of storing preferences)
E>
E> Assume that transportation to another OS may never occur.
E> [...]
E> -- should transportation to another OS become useful or needed,
E> the text files would be far easier to work with

A> I would guess this is wrong, in general. Think of the difference
A> between a DOS/Win32 text file, a MacOS text file, and a *nix text
A> file (hint: linefeeds and carriage returns). Now think of the
A> difference between the same systems' binary files (hint: nothing)

Sizes are different. Endian-ness is different. Formats may be
different (think: floating point and other more exotic formats).

Consider finding the file in five years and not having any of the
previous tools that used it. Which is likely to be easier to get
the data out of: text or binary?

How often have we had people come here to ask for help in deciphering
a binary file?

A> The vast majority of computers these days use eight-bit
A> byte-oriented transmission and storage protocols. Whatever
A> bit-ordering problems there are have moved "downstream" to
A> those people involved in the construction of hardware that
A> has to choose whether to transmit bit 0 or bit 7 first...

So what happens when I transmit a binary floating point number to
a machine with a different format?

I agree these issues are quite solvable, but I think they are
more *easily* solvable with text as an intermediate format.

A> It's about five minutes' work to write portable binary I/O
A> functions in most languages, if you're worried about the
A> size of 'int' on your next computer or something.

Might be a little more than five minutes, but I agree it's not hard.

But what IS five minutes work is a CR/CRLF/LF converter! (-:

I know this 'cause I've done it several times over the years.
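Such a converter really is only a few lines. A sketch in C (the
function name is invented for illustration):

```c
#include <stdio.h>

/* Normalize CR, CRLF, or LF line endings to LF on the output
   stream -- the "five minutes' work" converter. */
static void normalize_newlines(FILE *in, FILE *out)
{
    int c;
    while ((c = getc(in)) != EOF) {
        if (c == '\r') {                  /* lone CR, or CRLF pair */
            int next = getc(in);
            if (next != '\n' && next != EOF)
                ungetc(next, in);         /* lone CR: push back */
            putc('\n', out);
        } else {
            putc(c, out);
        }
    }
}
```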



FOOD FOR THOUGHT:
=================
Consider: The Rosetta Stone.

Now consider the bestest, most *useful* binary format you can name.
Think it stands any chance AT ALL of surviving that long?

If you want the broadest, most robust, most portable format
possible, there is only one answer: TEXT!

Accept no substitutes! (-:
 

Arthur J. O'Dwyer

Sizes are different. Endian-ness is different. Formats may be
different (think: floating point and other more exotic formats).

[For --hopefully-- the last time: I wasn't talking about sizes,
or endianness, or floating-point formats. I was talking about the
format in which a binary file is stored. Binary means bytes. On
the vast majority of modern computers, that's eight bits per byte.
I refer you to the file format standard for ANYTHING EVER, but
especially PNG, because it's very cool and quite possibly *more*
modular than XML. :) ]
Consider finding the file in five years and not having any of the
previous tools that used it. Which is likely to be easier to get
the data out of: text or binary?

Without any of the computers that used it? Pretty close to zero,
even with the help of an electron microscope. Assuming you have
no hex editor, but you do have a computer and a text editor, then
obviously text will be easier to display. Contrariwise, if you
have no text editor but do have a hex editor, binary will be easier
to display. Neither will necessarily be easier to interpret unless
you have a copy of the relevant file format standard, and then the
point is pretty much moot anyway.
How often have we had people come here to ask for help in deciphering
a binary file?

How often have people come here to ask help in writing "Hello
world!" programs? How often have people come to sci.crypt to
ask help in "deciphering" cryptograms? If you're saying that a
lot of people are stupid, I'm inclined to agree with you.
A> The vast majority of computers these days use eight-bit
A> byte-oriented transmission and storage protocols. Whatever
A> bit-ordering problems there are have moved "downstream" to
A> those people involved in the construction of hardware that
A> has to choose whether to transmit bit 0 or bit 7 first...

So what happens when I transmit a binary floating point number to
a machine with a different format?

Ick, floating point! ;) Seriously, I don't have much experience
with floating point, but I would expect you'd either use a fixed-point
representation (common in the domains in which I work), or you'd
convert to some IEEE format (about which I know little, and your
point about relevant standards' becoming extinct may well apply).
I agree these issues are quite solvable, but I think they are
more *easily* solvable with text as an intermediate format.

How do you save a floating-point number to a text file?
Losslessly? How many lines of <your PLOC here> code is that? :)
Once I've seen a compelling answer to that, I may start thinking
in earnest about how to save floating-point numbers losslessly in
binary. And we'll see who comes out on top. ;)

FOOD FOR THOUGHT:
=================
Consider: The Rosetta Stone.

Now consider the bestest, most *useful* binary format you can name.
Think it stands any chance AT ALL of surviving that long?

If you want the broadest, most robust, most portable format
possible, there is only one answer: TEXT!

Written on STONE TABLETS! And then BURIED IN THE DESERT!
Accept no substitutes! (-:

Absolutely 100% agreed! (-:

-Arthur
 

Piet Blok

Without taking a stand pro or con binary or text in this discussion, I'd
like to point out that XML files ARE stored in binary format, conformant
to the encoding attribute in the XML declaration. Now, not all encodings
are ASCII-like; think of the various EBCDIC character sets. If you must
view an EBCDIC-encoded XML file on your PC at home, you need code
conversion (implemented in XML parsers). A simple text editor like
NotePad will not be very helpful.

When XML data is transmitted over networks, it should be sent in binary
mode, not text mode, because in text mode the data may be translated to
some other encoding scheme. But the encoding attribute, being part of
the data, will not be adjusted. The result is no longer XML.

Piet
 

Ben Measures

Piet said:
Without taking a stand pro or con binary or text in this discussion, I'd
like to point out that XML files ARE stored in binary format, conformant
to the encoding attribute in the XML declaration. Now, not all encodings
are ASCII-like; think of the various EBCDIC character sets. If you must
view an EBCDIC-encoded XML file on your PC at home, you need code
conversion (implemented in XML parsers). A simple text editor like
NotePad will not be very helpful.

When XML data is transmitted over networks, it should be sent in binary
mode, not text mode, because in text mode the data may be translated to
some other encoding scheme. But the encoding attribute, being part of
the data, will not be adjusted. The result is no longer XML.

Piet

Good point not yet considered.
 

Sammy Tough

hi,

I agree, you can make the same errors in coding information using either
plain text or regulated plain text like XML. But you have more tools at
hand if you don't invent your own format. XML provides more abstraction
layers. If you think it over, you often have to invent such abstraction
layers in your proprietary format too - with the difference that they
have to be reinvented by every new programmer, every time he has to code
a new type of information. If you write down a rule for how such coding
should work (e.g. every new tuple is terminated by a <cr> character),
you are following a path similar to the one the XML developers took.

greetings

Sammy
 

Programmer Dude

Arthur said:
Sizes are different. Endian-ness is different. Formats may be
different (think: floating point and other more exotic formats).

[For --hopefully-- the last time: I wasn't talking about sizes,
or endianness, or floating-point formats. I was talking about the
format in which a binary file is stored. Binary means bytes.

Usually. Only usually. (-:
I refer you to the file format standard for ANYTHING EVER,..

And folks who write code that deals with these formats need to be
fully up to speed on the format, don't they. And in the case of
evolving formats, need to consider upgrading so they can continue
to read newer formats. This thread has touched on many of the
*tools* (e.g. network transport layers) available to deal with these
binary formats, AND THAT'S THE POINT: you need all this *stuff* and
knowledge.

Text is simple. You stop even *thinking* about a lot of stuff.
And it has the advantage of easy human readability, a "nice to have"
for debugging and maintenance purposes.

Binary, in comparison, is a headache. (-:
Without any of the computers that used it? Pretty close to zero,
even with the help of an electron microscope.

No, it would be silly of me to mean that.
Assuming you have no hex editor,...

Hey, I'll even grant you the hex editor!
...but you do have a computer and a text editor, then
obviously text will be easier to display.

Even if you can examine the hex, do you see the hassle required to
analyse what all those bits *mean*? Compare that to a text file
that very likely *tags* (labels) the data! I mean, come on, how
can you beat named, trivial-to-view data?

And ya see that? Even given equal ability to examine the raw file
(that is, sans intelligent interpreter), text is a monster winner.
Contrariwise, if you have no text editor but do have a hex editor,
binary will be easier to display.

Ummmm, you're winging it here. (-: First, really, a hex viewer, but
no text viewer? I think that'd be a first in computing history, but
stranger things have happened. :)

Second, doncha think viewing the text in the hex viewer would still
be a lot more obvious (given those labels) than the raw bin bits?

Even when you tilt the playing field insanely, text still wins! (-:
Neither will necessarily be easier to interpret unless you have a
copy of the relevant file format standard, and then the point is
pretty much moot anyway.

Well, right, we're assuming the file format is lost or unavailable.
And even if we somehow lost the "format" to text/plain, the pattern
of text lines with repeating delimiters is a red flag. Consider too
that at this extreme--where we've forgotten ASCII--how much harder
would it be to figure out binary storage formats (remember there's
likely no clue where object boundaries are)?
How often have people come here to ask help in writing "Hello
world!" programs? How often have people come to sci.crypt to
ask help in "deciphering" cryptograms? If you're saying that a
lot of people are stupid, I'm inclined to agree with you.

No (well, actually, yes that's true, but not my point right now :).

I'm pointing out--comparing like with like--no one stumbling on a
text file containing important data comes begging interpretation.
Cryptograms are play, and I doubt the urgent, often work-related
situation happens in s.crypt.
Ick, floating point! ;)

[bwg] Exactly my point! Which would you rather deal with:

"99.1206" 0x42c63dbf

Seriously, I don't have much experience with floating point, but I
would expect you'd either use a fixed-point representation (common
in the domains in which I work),...

Let me guess. CAD/CAM or NC or something involving physical coords?
Fixed point isn't uncommon in environments where you know the range
of values expected. When you don't, and need the largest range possible
(or when you DO, and need a huge range), you need floating point.
How do you save a floating-point number to a text file?

As you'd expect: printf("%g", d) ... strtod()
Losslessly?

Within certain parameters, close enough. Once you're dealing with FP,
you sorta have to give up the concept of lossless. Experts in FP know
how to deal with it to make the pain as low as possible, but FP is all
about approximation.

If you need absolute precision, you could always save the bytes as a
hex string. Fast and easy in and out.
How many lines of <your PLOC here> code is that? :)

Only a few surrounding strtod() if you don't mind a little edge loss.
(IIRC, within precision limits, text<=>FP *is* fully deterministic?)
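The hex-string idea is in fact built into C99: the "%a" conversion
prints a double as an exact hexadecimal-float string, and strtod()
reads it back with no loss at all. A sketch, assuming a C99 library
(the helper name is invented):

```c
#include <stdio.h>
#include <stdlib.h>

/* C99's "%a" emits an exact hexadecimal-float representation,
   e.g. 0.1 prints as 0x1.999999999999ap-4, so the text round
   trip is lossless by construction. */
static double hex_roundtrip(double d)
{
    char buf[64];
    sprintf(buf, "%a", d);
    return strtod(buf, NULL);   /* strtod accepts hex floats in C99 */
}
```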
 

Rolf Magnus

Arthur said:
Assume that disk space is not an issue [...]
Assume that transportation to another OS may never occur.
Are there any solid reasons to prefer text files over binary files?

Some of the reasons I can think of are:

-- should transportation to another OS become useful or needed,
the text files would be far easier to work with

I would guess this is wrong, in general. Think of the difference
between a DOS/Win32 text file, a MacOS text file, and a *nix text
file (hint: linefeeds and carriage returns).

Linefeeds and carriage returns don't matter in XML. The other
differences are ruled out by specifying the encoding. Any XML parser
should understand utf-8.
Now think of the difference between the same systems' binary files
(hint: nothing).

That's wrong. Under most (but not all) DOS compilers, int is 16-bit;
under Windows, it's 32-bit. Under Linux on x86, long double is 80-bit;
under Windows, it's 64-bit. And the OS is not the only thing that
matters. On the Motorola CPUs, data is stored in big-endian order, on
x86 in little-endian. A 64-bit CPU might use a 64-bit type for long (or
it might not), while on most 32-bit CPUs, long is 32-bit. Some systems
have special alignment requirements, others don't. And there are a lot
of other potential problems with binary data. Those problems can all be
worked around, but it's a lot easier with text, especially XML.
 

Jeff Brooks

Rolf said:
Arthur said:
Assume that disk space is not an issue [...]
Assume that transportation to another OS may never occur.
Are there any solid reasons to prefer text files over binary files?

Some of the reasons I can think of are:

-- should transportation to another OS become useful or needed,
the text files would be far easier to work with

I would guess this is wrong, in general. Think of the difference
between a DOS/Win32 text file, a MacOS text file, and a *nix text
file (hint: linefeeds and carriage returns).

Linefeeds and carriage returns don't matter in XML. The other
differences are ruled out by specifying the encoding. Any XML parser
should understand utf-8.

Actually, to be an XML parser it must support UTF-8 and UTF-16. UTF-16
has byte-ordering issues. Writing a UTF-16 file on different CPUs can
result in text files that are different. This can be resolved because of
the encoding the UTF standards use, but it means that any true XML
parser must deal with big-endian/little-endian issues.

Most people consider having to write code that translates the format to
your specific CPU as the measure of data not being portable. XML does
have this issue, so if that's your definition of portable, then XML
isn't portable.

"All XML processors MUST accept the UTF-8 and UTF-16 encodings of
Unicode 3.1"
- http://www.w3.org/TR/REC-xml/#charsets

"The primary feature of Unicode 3.1 is the addition of 44,946 new
encoded characters. These characters cover several historic scripts,
several sets of symbols, and a very large collection of additional CJK
ideographs.

For the first time, characters are encoded beyond the original 16-bit
codespace or Basic Multilingual Plane (BMP or Plane 0). These new
characters, encoded at code positions of U+10000 or higher, are
synchronized with the forthcoming standard ISO/IEC 10646-2."
- http://www.unicode.org/reports/tr27/

The majority of XML parsers only use 16-bit characters. This means that
the majority of XML parsers can't actually read XML.

Jeff Brooks
 

Corey Murtagh

Jeff said:
Rolf Magnus wrote:

Actually, to be an XML parser it must support UTF-8 and UTF-16. UTF-16
has byte-ordering issues. Writing a UTF-16 file on different CPUs can
result in text files that are different. This can be resolved because of
the encoding the UTF standards use, but it means that any true XML
parser must deal with big-endian/little-endian issues.

Don't want to be seen to be supporting XML here, but doesn't the UTF-16
standard define byte ordering? I was under the impression (without
having done any work with it) that a UTF-16 multi-byte sequence could be
parsed as a byte stream.
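The resolution Jeff alludes to is the byte-order mark: a UTF-16 stream
may begin with U+FEFF, and a reader infers the byte order from which
byte arrives first. A minimal sketch in C (names are illustrative):

```c
#include <stdio.h>

enum utf16_order { UTF16_BE, UTF16_LE, UTF16_UNKNOWN };

/* Inspect the first two bytes of a buffer for the byte-order mark
   U+FEFF: FE FF means big-endian, FF FE means little-endian. */
static enum utf16_order detect_bom(const unsigned char *p, size_t n)
{
    if (n >= 2) {
        if (p[0] == 0xFE && p[1] == 0xFF)
            return UTF16_BE;
        if (p[0] == 0xFF && p[1] == 0xFE)
            return UTF16_LE;
    }
    return UTF16_UNKNOWN;   /* no BOM: Unicode's default is big-endian */
}
```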
 

Malcolm Dew-Jones

Jeff Brooks ([email protected]) wrote:
: Rolf Magnus wrote:
: > Arthur J. O'Dwyer wrote:
: >
: >>On Thu, 27 May 2004, Eric wrote:
: >>
: >>>Assume that disk space is not an issue [...]
: >>>Assume that transportation to another OS may never occur.
: >>>Are there any solid reasons to prefer text files over binary files?
: >>>
: >>>Some of the reasons I can think of are:
: >>>
: >>>-- should transportation to another OS become useful or needed,
: >>> the text files would be far easier to work with
: >>
: >> I would guess this is wrong, in general. Think of the difference
: >>between a DOS/Win32 text file, a MacOS text file, and a *nix text
: >>file (hint: linefeeds and carriage returns).
: >
: > Linefeeds and carriage returns don't matter in XML. The other
: > differences are ruled out by specifying the encoding. Any XML parser
: > should understand utf-8.

: Actually, to be an XML parser it must support UTF-8, and UTF-16. UTF-16
: has byte ordering issues.

You can only have byte order issues when you store the UTF-16 as 8 bit
bytes. But a stream of 8 bit bytes is _not_ UTF-16, which by definition
is a stream of 16 bit entities, so it is not the UTF-16 that has byte
order issues.

However, even the storage issue should have been trivial to solve - it
would simply have consisted of requiring 8-bit streams encoding 16-bit
Unicode values to use network byte order, as is required in similar
situations within internet protocols (which are used with no
interoperability issues between all sorts of endians). The lack of
specifying and requiring this, and instead using zero-width non-breaking
spaces to help the reader "guess" what byte ordering was used in the
translation from 16-bit information units into 8-bit storage units, is
by far one of the biggest kludges ever.
 
