[Q] Text vs Binary Files


Richard Tobin

Don't want to be seen to be supporting XML here, but doesn't the
UTF-16 standard define byte ordering?

No. There are names for the encodings corresponding to
big-endian-UTF-16 and little-endian-UTF-16, but UTF-16 itself can be
stored in either order.

XML processors can distinguish between them easily because any XML
document not in UTF-8 must begin with a less-than or a byte-order mark
(unless some external indication of encoding is given).
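
As a rough sketch of what that check can look like (a hypothetical
helper; it assumes the first two bytes of the entity are in hand and no
external encoding information was given):

/* Guess the encoding family of an XML entity from its first two bytes.
   A UTF-16 entity starts with a byte order mark (U+FEFF) or, failing
   that, with '<' (U+003C), so the byte order shows either way. */
const char *guess_xml_encoding(unsigned char b0, unsigned char b1)
{
    if (b0 == 0xFE && b1 == 0xFF) return "UTF-16, big-endian (BOM)";
    if (b0 == 0xFF && b1 == 0xFE) return "UTF-16, little-endian (BOM)";
    if (b0 == 0x00 && b1 == 0x3C) return "UTF-16, big-endian ('<')";
    if (b0 == 0x3C && b1 == 0x00) return "UTF-16, little-endian ('<')";
    return "UTF-8 or another 8-bit encoding";
}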

-- Richard
 

Richard Tobin

Malcolm Dew-Jones said:
You can only have byte order issues when you store the UTF-16 as 8 bit
bytes.

Which is to say, always in practice.
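
A quick sketch of why (illustrative only): as soon as the 16-bit code
units are written out as bytes, some byte order gets chosen, by default
whatever the host happens to use:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t units[2] = { 0x003C, 0x0041 };  /* "<A" as UTF-16 code units */

    /* Dumping the in-memory representation picks up the host's byte
       order: a little-endian machine emits 3C 00 41 00, a big-endian
       machine emits 00 3C 00 41, so the same code units yield two
       different files. */
    fwrite(units, sizeof units[0], 2, stdout);
    return 0;
}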

-- Richard
 

Jeff Brooks

Malcolm said:
Jeff Brooks ([email protected]) wrote:
: Rolf Magnus wrote:
: > Arthur J. O'Dwyer wrote:
: >
: >>On Thu, 27 May 2004, Eric wrote:
: >>
: >>>Assume that disk space is not an issue [...]
: >>>Assume that transportation to another OS may never occur.
: >>>Are there any solid reasons to prefer text files over binary files?
: >>>
: >>>Some of the reasons I can think of are:
: >>>
: >>>-- should transportation to another OS become useful or needed,
: >>> the text files would be far easier to work with
: >>
: >> I would guess this is wrong, in general. Think of the difference
: >>between a DOS/Win32 text file, a MacOS text file, and a *nix text
: >>file (hint: linefeeds and carriage returns).
: >
: > Linefeeds and carriage returns don't matter in XML. The other
: > differences are ruled out by specifying the encoding. Any XML parser
: > should understand utf-8.

: Actually, to be an XML parser it must support UTF-8 and UTF-16. UTF-16
: has byte ordering issues.

You can only have byte order issues when you store the UTF-16 as 8 bit
bytes. But a stream of 8 bit bytes is _not_ UTF-16, which by definition
is a stream of 16 bit entities, so it is not the UTF-16 that has byte
order issues.

http://www.unicode.org/unicode/faq/utf_bom.html#37

Jeff Brooks
 

Ben Measures

Jeff said:
Actually, to be an XML parser it must support UTF-8 and UTF-16. UTF-16
has byte ordering issues. Writing a UTF-16 file on different CPUs can
result in text files that are different. This can be resolved because of
the encoding that the UTF standards use, but it means that any true XML
parser must deal with big-endian/little-endian issues.

"All XML processors MUST accept the UTF-8 and UTF-16 encodings of
Unicode 3.1"
- http://www.w3.org/TR/REC-xml/#charsets

"Entities encoded in UTF-16 MUST [snip] begin with the Byte Order Mark
described by section 2.7 of [Unicode3]"
- http://www.w3.org/TR/REC-xml/#charencoding

This makes it trivial to overcome any endianness issues, and since byte
order is such a fundamental issue anyway, I don't see it as making XML
any less portable.
 

Michael Wojcik

[Followups restricted to comp.programming.]

"All XML processors MUST accept the UTF-8 and UTF-16 encodings of
Unicode 3.1"
- http://www.w3.org/TR/REC-xml/#charsets

"The primary feature of Unicode 3.1 is the addition of 44,946 new
encoded characters. ...

For the first time, characters are encoded beyond the original 16-bit
codespace or Basic Multilingual Plane (BMP or Plane 0). These new
characters, encoded at code positions of U+10000 or higher, are
synchronized with the forthcoming standard ISO/IEC 10646-2."
- http://www.unicode.org/reports/tr27/

The majority of XML parsers only use 16-bit characters. This means that
the majority of XML parsers can't actually read XML.

I don't believe this is correct. UTF-16 encodes characters in U+10000
- U+10FFFF as surrogate pairs. None of the surrogate code points match
any of the scalar code points, so there's no ambiguity - all surrogate
pairs are composed of 16-bit values that can't be mistaken for scalar
UTF-16 characters.
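
A small sketch of that encoding (a hypothetical helper, using the
arithmetic defined by the Unicode standard):

#include <stdint.h>

/* Encode a supplementary code point (U+10000..U+10FFFF) as a UTF-16
   surrogate pair. The high unit lands in 0xD800..0xDBFF and the low
   unit in 0xDC00..0xDFFF, ranges never used for BMP characters, so
   the pair can't be mistaken for two scalar UTF-16 characters. */
void utf16_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000;          /* 20 bits left to split */
    *high = (uint16_t)(0xD800 | (v >> 10));
    *low  = (uint16_t)(0xDC00 | (v & 0x3FF));
}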

As long as the parser processes the surrogate pair without altering
it and recognizes it unambiguously, the parser would seem to be
complying with the XML specification. None of those characters (in
their surrogate-pair UTF-16 representation or any other) has any
special meaning in XML, so a parser that treated the surrogate pair
as a pair of 16-bit characters should do just fine.

In other words, the parser doesn't have to recognize that characters
from U+10000 and up (in their surrogate-pair encoding) are special,
because to it they aren't special.

The only case that immediately comes to mind where the distinction
would matter is if the parser had an API that returned data character
by character, which would need special provisions for surrogate
pairs (or be documented as returning them in halves). However, I've
not seen such a parser, and I don't know why one would provide
such an API.

Or, I suppose, if the parser offered to transform the document data
among various supported encodings. In that case, not handling UTF-16
surrogate pairs would indeed be a bug. On the other hand, I'm not
sure such transformations are necessarily the job of an XML parser;
that could be considered a bug in a set of additional utilities
provided alongside the parser.
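
If such a transformation were offered, the supplementary characters are
where the care is needed. A sketch of one direction, decoding a
surrogate pair back to a code point and re-encoding it as UTF-8
(illustrative only, no input validation):

#include <stdint.h>

/* Decode a UTF-16 surrogate pair to a code point and write it out as a
   four-byte UTF-8 sequence. Sketch only; assumes valid inputs. */
int surrogate_pair_to_utf8(uint16_t high, uint16_t low, uint8_t out[4])
{
    uint32_t cp = 0x10000 + (((uint32_t)(high - 0xD800) << 10)
                             | (uint32_t)(low - 0xDC00));
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;   /* supplementary characters always take four UTF-8 bytes */
}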
 

Donald Roby

Arthur J. O'Dwyer said:
*Again* I urge the consultation of the RFCs defining any standard
binary file format, and the notice of the complete lack of regard
for big-endian/little-endian/19-bit-int/37-bit-int issues. At the
byte level, these things simply never come up.

Try (for example) RFC 1314.

These things certainly do come up, and they're handled by encoding the
rules in a header of the format.
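
In the TIFF format that RFC 1314 builds on, for instance, the first two
bytes of the file declare the byte order used everywhere else. A minimal
sketch of reading that flag (illustrative, not a full reader):

#include <stdio.h>

/* A TIFF file starts with "II" (little-endian) or "MM" (big-endian),
   followed by the number 42 encoded in that byte order. Returns 0 for
   little-endian, 1 for big-endian, -1 if it doesn't look like TIFF. */
int tiff_byte_order(FILE *fp)
{
    unsigned char hdr[2];
    if (fread(hdr, 1, 2, fp) != 2) return -1;
    if (hdr[0] == 'I' && hdr[1] == 'I') return 0;
    if (hdr[0] == 'M' && hdr[1] == 'M') return 1;
    return -1;
}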
 

Arthur J. O'Dwyer

Donald Roby said:
Try (for example) RFC 1314.

[RFC defining among other things a subset(?) of the TIFF image
file format]
These things certainly do come up, and they're handled by
encoding the rules in a header of the format.

Not really. TIFF /is/ weird in that it explicitly provides
both a "big-endian" format and a "little-endian" format, and TIFF
readers have to provide routines to read both formats. But the
endianness/word size of the machine never comes up. If it did,
we wouldn't be able to write TIFF writers or readers that worked
on platforms with different endiannesses. IIRC, this whole thread
was started way back in the mists of time with the idea that

fputs("42000\n", fp);

produces different results on different machines (because of the
embedded newline, which produces different bytes on different
systems; not to mention the possibility of EBCDIC!), while

unsigned int result = 42000;
unsigned char buffer[8];
buffer[0] = (result>>24)&0xFF;
buffer[1] = (result>>16)&0xFF;
buffer[2] = (result>>8)&0xFF;
buffer[3] = (result>>0)&0xFF;
fwrite(buffer, 1, 4, fp);

produces the exact same bytes on every platform. Thus "binary
is better than text" if you care about portability more than
human-readability.

But since we already had that discussion (several months ago,
IIRC), I'm not going to get back into it.

-Arthur,
signing off
 
