UTF-8 & Unicode

Henri Sivonen

Philippe Poulard said:
this is theory

is there anybody who knows a parser that doesn't handle iso-8859-1
correctly?

I don't. I do, however, know a parser that does not support (by default
without extra work) ISO-8859-15, Windows-1252 or Shift-JIS: expat.

AFAIK, in *practice* the set of safe encodings is US-ASCII, ISO-8859-1
and UTF-8. In theory, it is UTF-8 and UTF-16. The intersection of
reality and theory is UTF-8.
 
Richard Tobin

Henri Sivonen said:
That is not a safe conclusion. XML processors are only required to
support UTF-8 and UTF-16. Support for any other encoding is an XML
processor-specific extra feature. It follows that using any encoding
other than UTF-8 or UTF-16 is unsafe.

This is an exaggeration. You might as well say: XML processors are
not required to support any particular URI scheme, so referring to
a DTD at an HTTP URI is unsafe.

Henri Sivonen said:
If communication fails because
someone sent an XML document in an encoding other than UTF-8 or UTF-16,
the sender is to blame.

So phone them up and ask them to change it. Not every XML document has
to be instantly useful to everyone.

-- Richard
 
Henri Sivonen

Richard Tobin said:
You might as well say: XML processors are
not required to support any particular URI scheme, so referring to
a DTD at an HTTP URI is unsafe.

I consider external subsets on the Web harmful. Not because of HTTP URIs
but because non-validating processors are not required to process the
DTD and the usefulness of DTDs relative to their usual size is low.
Mozilla, for one, never dereferences an HTTP URI to retrieve an external
entity.
 
Andreas Prilop


You should set a F'up-To. I've done this and remark only
what's relevant to c.i.w.a.html.

Henri Sivonen said:
AFAIK, in *practice* the set of safe encodings is US-ASCII, ISO-8859-1
and UTF-8. In theory, it is UTF-8 and UTF-16. The intersection of
reality and theory is UTF-8.

Google still doesn't support UTF-16 as can be seen from
http://www.google.com/search?q=U.T.F.1-6
Hence the recommendation in
http://ppewww.ph.gla.ac.uk/~flavell/charset/checklist
to use only UTF-8 as Unicode encoding on the WWW.
 
Alan J. Flavell

Henri Sivonen said:
Such conversion leads to bugs like this one:
https://bugzilla.mozilla.org/show_bug.cgi?id=174351

Does it? I'll have to ask you to explain that in more detail, please.
As far as I can see, the bug relates to a byte stream which is not
valid utf-8 - which by definition is therefore not utf-8 at all.

What I'm talking about is taking a properly-labelled and
properly-formed character stream in some known encoding, and
transcoding that into properly-formed utf-8 (with appropriate
re-labelling, of course).
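
For illustration only (this is not from the original posts), here is a minimal Python sketch of the kind of transcoding described above, assuming the source encoding really is known from a trustworthy label. The helper name `transcode_to_utf8` and the sample document are hypothetical, and the sketch only rewrites the encoding in a leading XML declaration.

```python
import re

# Matches the encoding pseudo-attribute of a leading XML declaration,
# e.g. <?xml version="1.0" encoding="ISO-8859-15"?>
_DECL_ENCODING = re.compile(
    r'^(<\?xml\b[^?]*?\bencoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)


def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Transcode a correctly labelled XML document to UTF-8 (hypothetical helper)."""
    # errors="strict" makes malformed input raise instead of being patched up.
    text = data.decode(source_encoding, errors="strict")
    # Re-label: make the declaration name the new encoding.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")


# Hypothetical sample: a Euro sign, which ISO-8859-15 encodes as byte 0xA4.
latin9 = '<?xml version="1.0" encoding="ISO-8859-15"?><p>\u20ac</p>'.encode("iso-8859-15")
utf8 = transcode_to_utf8(latin9, "iso-8859-15")
```

Any external label, such as a charset parameter in an HTTP Content-Type header, would have to be updated separately; that part is outside this sketch.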
 
Henri Sivonen

Alan J. Flavell said:
Does it? I'll have to ask you to explain that in more detail, please.
As far as I can see, the bug relates to a byte stream which is not
valid utf-8 - which by definition is therefore not utf-8 at all.

What I'm talking about is taking a properly-labelled and
properly-formed character stream in some known encoding, and
transcoding that into properly-formed utf-8 (with appropriate
re-labelling, of course).

The problem is that the XML spec is not only concerned with proper UTF-8
streams but also says what to do in improper cases. If the character
encoding conversion is decoupled from the XML processor, but this is
viewed as an implementation detail so that the combination of the
converter and actual XML processor is subjected to the conformance
requirements placed on XML processors, non-conformance ensues if the
converter is lenient, which they usually are.
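
As a rough illustration of the point (a sketch, not a description of any particular product), the following Python compares handing malformed bytes straight to an XML parser with running them through a lenient decoder first. The sample bytes and the two function names are hypothetical; `xml.etree.ElementTree` stands in for "the XML processor" and `bytes.decode(..., errors="replace")` for "the lenient converter".

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the byte 0xFF can never occur in well-formed UTF-8.
MALFORMED = b"<doc>caf\xff</doc>"


def parse_strict(data: bytes) -> str:
    """Give the raw bytes to the parser, which has to reject bad UTF-8."""
    return ET.fromstring(data).text


def parse_via_lenient_converter(data: bytes) -> str:
    """Decode leniently first: bad bytes silently become U+FFFD."""
    text = data.decode("utf-8", errors="replace")
    # The parser now sees only well-formed characters and cannot object.
    return ET.fromstring(text).text


try:
    parse_strict(MALFORMED)
    strict_rejected = False
except ET.ParseError:
    strict_rejected = True

lenient_result = parse_via_lenient_converter(MALFORMED)
```

In this sketch the strict path fails with a parse error while the lenient path returns text containing a replacement character, so the converter-plus-parser combination accepts a document that the parser alone would refuse. Decoding with `errors="strict"` would keep the two paths consistent.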
 
Alan J. Flavell

[...]

Henri Sivonen said:
The problem is that the XML spec is not only concerned with proper
UTF-8 streams but also says what to do in improper cases. If the
character encoding conversion is decoupled from the XML processor,
but this is viewed as an implementation detail so that the
combination of the converter and actual XML processor is subjected
to the conformance requirements placed on XML processors,
non-conformance ensues if the converter is lenient, which they
usually are.

Thanks. I understand your point now.

I have this feeling that there's a lot of scope for practical utility
without running the risk of falling foul of this particular problem;
but I won't drag the argument out.

all the best
 
