UTF-8 & Unicode

Discussion in 'XML' started by EU citizen, Jan 28, 2005.

  1. EU citizen

    EU citizen Guest

    Do web pages have to be created in unicode in order to use UTF-8 encoding?
    If so, can anyone name a free application which I can use under Windows 98
    to create web pages?
    EU citizen, Jan 28, 2005

  2. Yes, but that doesn't mean you need a special text editor: any plain
    US-ASCII (but not ISO 8859-1) file is automatically correct in UTF-8.
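    This containment is easy to check; a minimal Python sketch (not from the thread) comparing the byte sequences:

    ```python
    # Pure ASCII text produces identical bytes under US-ASCII and UTF-8,
    # but ISO 8859-1 text with characters above 0x7F does not.
    ascii_text = "plain ASCII web page"
    assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

    latin1_text = "caf\u00e9"  # 'café', contains U+00E9
    assert latin1_text.encode("iso-8859-1") == b"caf\xe9"   # one byte
    assert latin1_text.encode("utf-8") == b"caf\xc3\xa9"    # two bytes
    ```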
    Leif K-Brooks, Jan 28, 2005

  3. Pierre Goiffon, Jan 31, 2005
  4. Lachlan Hunt

    Lachlan Hunt Guest

    Lachlan Hunt, Feb 2, 2005
  5. EU citizen

    EU citizen Guest

    I wish people would give simple answers to simple questions.
    This is not a silly question; see
    http://www.w3schools.com/xml/xml_encoding.asp on XML Encoding. Slightly
    edited, this says:

    XML documents can contain foreign characters like Norwegian æøå or French êèé.
    To let your XML parser understand these characters, you should save your XML
    documents as Unicode.
    Windows 95/98 Notepad cannot save files in Unicode format.
    You can use Notepad to edit and save XML documents that contain foreign
    characters (like Norwegian æøå and French êèé), but if you save the file
    and open it with IE 5.0, you will get an ERROR

    Windows 95/98 Notepad files must be saved with an encoding attribute.
    To avoid this error you can add an encoding attribute to your XML
    declaration, but you cannot use Unicode.
    The encoding below (open it with IE 5.0) will NOT give an error message:
    <?xml version="1.0" encoding="UTF-8"?>
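    As a quick check of how a declaration is honoured, here is a small Python sketch (the filename and content are illustrative) that saves a non-ASCII file in ISO-8859-1 with a matching declaration and parses it:

    ```python
    import os
    import tempfile
    import xml.etree.ElementTree as ET

    # Write a non-ASCII document in ISO-8859-1 with a matching declaration.
    doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
    path = os.path.join(tempfile.gettempdir(), "demo.xml")
    with open(path, "wb") as f:
        f.write(doc.encode("iso-8859-1"))

    # The parser reads the declaration and decodes the bytes correctly.
    root = ET.parse(path).getroot()
    assert root.text == "caf\u00e9"
    ```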
    EU citizen, Feb 2, 2005
  6.
    It may be a simple question for you, because you know what you mean,
    but for the rest of us it's a hard-to-understand question, because
    if you use UTF-8, you are inevitably using Unicode, since it's a
    way of writing Unicode.

    But from what you say now, it looks as if your question is really
    about some Windows software.
    Presumably this means that Notepad saves documents containing those
    characters in some non-Unicode encoding, in which case you must put
    an appropriate encoding declaration at the top of the document. But
    you will need to know the name of the encoding that Notepad uses.

    This is mysterious. What does it mean? That Notepad won't save
    them without one? Or that you have to add one to make it work
    in the web browser?
    It only makes sense to say that you're using UTF-8 if you are. If Notepad
    really doesn't know about Unicode, this will only be true if you
    restrict yourself to ASCII characters, because they're the same
    in UTF-8 as they are in ASCII and most other common encodings.

    -- Richard
    Richard Tobin, Feb 2, 2005
  7. I don't think you've understood the problem. If the questioner was in
    a position to understand the "simple answer" which you say you want, I
    can't imagine how they would have asked the question in that form in
    the first place.
    The original questioner should not feel offended or dispirited by what
    I'm going to say: but, in the form in which is was asked, the question
    is incoherent.

    This is not unusual: many people are confused both by the theory and
    by the terminology of character representation, especially if they
    gained an initial understanding in a simpler situation (typically,
    character repertoires of 256 characters or less, represented by an
    8-bit character encoding such as iso-8859-anything; and fonts that
    were laid out accordingly).
    How very strange. This claims to be XHTML, but, as far as I can see,
    it has no character encoding specified on its HTTP Content-type header
    *nor* on its <?xml...> thingy (indeed it doesn't have a <?xml...>
    declaration at all).

    In the absence of a BOM, XML is entitled to deduce that it's utf-8:
    but since it's invalid utf-8, it *ought* to refuse to process it.
    Unless someone can show me what I'm missing.

    By looking at it, it is evidently encoded in iso-8859-1.
    It purports to declare that via a "meta http-equiv", but for XML this
    is meaningless - and anyway comes far too late.

    I don't know why the W3C validator doesn't reject it out of hand?

    (Of course the popular browsers will be slurping it as slightly
    xhtml-flavoured tag soup, so we can't expect to deduce very much from
    the fact that they calmly display what the author intended.)
    And those characters are presented encoded in iso-8859-1 ...
    Two things wrong here. What do they suppose they mean by "save ... as
    Unicode"? The XML Document Character Set is *by definition* Unicode,
    there's nothing that an author can do to change that (unlike SGML).

    Characters can be represented in at least two different ways in XML:
    by /numerical character references/, or as /encoded
    characters/ using some /character encoding scheme/. (In some contexts
    there may also be named character entities, but they introduce no new
    principles for the present purpose, so we won't need to discuss them here.)

    The only coherent interpretation I can put on their "should save as
    Unicode" statement is "should save in one of the character encoding
    schemes of Unicode". But /should/ we? Do they? No, they don't: they
    are using iso-8859-1 (they *could* even do it correctly); and they
    also discuss the use of windows-1252, although without giving much
    detail about the implications of deploying a proprietary character
    encoding on the WWW.

    The /conclusions/ are fine, in their way:

    * Use an editor that supports encoding.
    * Make sure you know what encoding it uses.
    * Use the same encoding attribute in your XML documents.

    But the reader still hasn't really learned anything about the
    underlying principles yet. And the page hasn't told them anything
    useful about *which* encoding to choose for deploying their documents
    on the WWW.
    Then it's unfit for composing the kind of document that we are
    discussing here. No matter - there are plenty of competent editors
    which can work on that platform.

    My own tutorial pages weren't really aimed at XML, so I won't suggest
    them as an appropriate answer here. Actually, the relevant chapter of
    the Unicode specification is not unreasonable as an introduction to
    the principles of character representation and encoding, even if they
    might be a bit indigestible at a first reading.
    Alan J. Flavell, Feb 2, 2005
  8. /EU citizen/:
    Hm, I don't see any Norwegian or French characters but some Cyrillic
    instead... could it be you forgot to label the encoding of your
    message? ;-)
    Stanimir Stamenkov, Feb 2, 2005
  9. EU citizen

    EU citizen Guest

    It may be a simple question for you, because you know what you mean,
    but for the rest of us it's a hard-to-understand question, because
    if you use UTF-8, you are inevitably using Unicode, since it's a
    way of writing Unicode.

    But from what you say now, it looks as if your question is really
    about some Windows software.

    No. I am using a version of Windows (like most computer users on this
    planet). However, my question isn't specific to Windows. For all I knew,
    declaring utf-8 encoding might've caused the file to be transformed into
    utf-8 regardless of the original file format.

    Based on what I know now, I agree. I always assumed that Notepad, being a
    simple text editor, saved files in ASCII format. Nothing in Notepad's Help,
    Windows' Help or Microsoft's website says anything about the format used by
    Notepad. Through experimentation with the W3C HTML validator, I've worked
    out that iso-8859-1 will work for Notepad files with standard English text
    plus acute accented vowels.
    I can't make head or tail of it.
    The need for the XML encoding declaration to match the original file format
    was not mentioned in any of the (many) articles I've read on XML/XHTML over
    the last *four* years.
    EU citizen, Feb 2, 2005
  10. EU citizen

    EU citizen Guest

    I think there's a lot of miscommunication going on; I don't entirely
    understand what you're posting.

    You make a number of valid criticisms of the w3schools article, but it
    turned up near the top of my Google search for information on this subject.
    It just shows how difficult it is to get reliable information.
    My original question asked for suggestions about suitable applications, and
    yet no one has named one.
    EU citizen, Feb 2, 2005
  11. Beware that Microsoft uses some proprietary encodings that are ISO-8859-1
    for characters A0-FF, but use the C1 controls (81-9F) for other purposes.
    If you don't use any of those (and the Euro symbol is quite likely one
    of them) you should be OK.
    In most circumstances UTF-8 is the default encoding for XML if there
    is no encoding declaration. In theory, for text/* served by HTTP,
    8859-1 is (or was - they may have changed it) the default. But if you
    stick to ASCII, it won't matter. And remember that you *can* stick to
    ASCII and use character references (such as &#163;) or entity
    references (if you declare them in your DTD) for all non-ASCII
    characters.
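    Staying in pure ASCII while using numeric character references can be sketched like this in Python (the element name is just an example):

    ```python
    import xml.etree.ElementTree as ET

    # The document bytes are pure ASCII; the parser expands the
    # numeric character reference &#163; to U+00A3 (pound sign).
    data = b'<?xml version="1.0"?><price>&#163;5</price>'
    root = ET.fromstring(data)
    assert root.text == "\u00a35"  # pound sign followed by '5'
    ```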

    -- Richard
    Richard Tobin, Feb 2, 2005
  12. You need to set up your newsreader^W Outlook Express correctly
    in order to transmit special, non-ASCII characters:

    Tools > Options > Send
    Mail Sending Format > Plain Text Settings > Message format MIME
    News Sending Format > Plain Text Settings > Message format MIME
    Encode text using: None
    Andreas Prilop, Feb 2, 2005
  13. It's a bit more complicated than that.


    The /default/ is to look for a BOM - failing which, utf-8 is assumed.

    HTTP hasn't changed. RFC2616 section 3.7.1, last paragraph. Thanks!

    So I suppose /that/ was the explanation for the W3C validator not
    failing the cited page from w3schools. Thanks.
    True - although that's hardly a very efficient way to write, say,
    Cyrillic, or Arabic, or Japanese.
    Alan J. Flavell, Feb 2, 2005
  14. Lachlan Hunt

    Lachlan Hunt Guest

    If you cared to take the time to read the guide to unicode I linked to
    earlier, you would have found editors mentioned in part 2. Within it, I
    mentioned two Windows editors that support Unicode: SuperEdi [1] and
    Macromedia Dreamweaver. A simple search for "Unicode Editor" also
    reveals many other editors that may be capable of doing the job.

    [1] http://www.wolosoft.com/en/superedi/
    Lachlan Hunt, Feb 3, 2005
  15. Lachlan Hunt

    Lachlan Hunt Guest

    By default, Notepad saves files as Windows-1252. The characters from 0
    to 127 (0x7F) are identical to US-ASCII, ISO-8859-1, UTF-8 and many
    other character sets that make use of the same subset. Thus, any file
    saved using Windows-1252 that only makes use of those characters is
    compatible with all those other encodings.

    The characters from 160 (0xA0) to 255 (0xFF) match those contained in
    ISO-8859-1. Thus, any file saved using Windows-1252 that only makes use
    of the aforementioned US-ASCII subset and that range of characters is
    compatible with ISO-8859-1.
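    A small Python check of the ranges described above (using the "cp1252" codec for Windows-1252): the bytes where the two encodings disagree are exactly 0x80-0x9F.

    ```python
    # Compare Windows-1252 and ISO-8859-1 byte by byte. A few bytes are
    # undefined in Windows-1252, so decode those with errors="replace".
    diff = [b for b in range(256)
            if bytes([b]).decode("cp1252", errors="replace")
            != bytes([b]).decode("iso-8859-1")]
    assert diff == list(range(0x80, 0xA0))
    ```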

    The characters from 128 (0x80) to 159 (0x9F), however, do not match
    those in any other encoding, making any Windows-1252 file using these
    characters incompatible with any other encoding. For XML, this must be
    declared appropriately in the XML declaration. The characters in this
    range contain the infamous "smart quotes" (left and right, single and
    double quotation marks: ‘ ’ “ ”) that cause so many problems for the
    uneducated. Use of this range while declaring ISO-8859-1, UTF-8 or any
    other encoding, will cause errors because they are control characters in
    the character repertoires used by those encodings.
    It is actually mentioned in a few places on the web, though it's not
    easy to find. Microsoft tend to incorrectly refer to it as ANSI, even
    though it is not.

    That's because Windows-1252 is compatible with ISO-8859-1 when that
    subset is used.
    It actually means that version of Notepad will only save as
    Windows-1252, so it needs to be declared in the XML declaration. That
    is because an XML parser will assume UTF-8 without it and that
    assumption is acceptable only when the US-ASCII subset is used.
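    A sketch of that point in Python: the same Windows-1252 bytes fail when the parser falls back to UTF-8, but parse once the encoding is declared. (Whether a given XML parser supports windows-1252 at all is implementation-specific; CPython's expat binding handles it via Python's codec machinery.)

    ```python
    import xml.etree.ElementTree as ET

    body = "<p>\u201csmart quotes\u201d</p>".encode("cp1252")  # 0x93 ... 0x94

    # Without a declaration the parser assumes UTF-8, and 0x93 is invalid.
    try:
        ET.fromstring(b'<?xml version="1.0"?>' + body)
        raise AssertionError("expected a parse error")
    except ET.ParseError:
        pass

    # Declaring the real encoding lets the parser decode the bytes.
    decl = b'<?xml version="1.0" encoding="windows-1252"?>'
    root = ET.fromstring(decl + body)
    assert root.text == "\u201csmart quotes\u201d"
    ```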
    Lachlan Hunt, Feb 3, 2005
  16. The XML coding has to comply with the relevant bit of the XML
    specification. Whether you read it "over the last four years" or not.

    Talking about the "original file format" could be misleading, bearing
    in mind that some HTTP servers are set up to transcode the
    internally-stored file format into one that's more appropriate for use
    on the web. For XML-based markups, that may call for appropriate
    rewriting of the document's XML encoding specification. And if you're
    using XHTML/1.0 Appendix C then the transcoded document would need to
    conform to its constraints too.
    Alan J. Flavell, Feb 3, 2005
    RFC 3023 talks about XML media types.

    I maintain that text/xml (and other text/* types related to XML) should
    be avoided in favour of application/xml (and the application/*+xml types).

    Here we get utf-8:
    Content-type: text/xml; charset="utf-8"
    <?xml version="1.0" encoding="utf-8"?>

    !?!?! Here we get US-ASCII, despite the encoding specified:
    Content-type: text/xml
    <?xml version="1.0" encoding="utf-8"?>

    Here we get utf-16:
    Content-type: application/xml; charset="utf-16"
    {BOM}<?xml version="1.0" encoding="utf-16"?>

    Here we get the right encoding-known-by-your-parser:
    Content-type: application/xml
    <?xml version="1.0" encoding="encoding-known-by-your-parser"?>


    Philippe Poulard, Feb 3, 2005
  18. That is not a safe conclusion. XML processors are only required to
    support UTF-8 and UTF-16. Support for any other encoding is an XML
    processor-specific extra feature. It follows that using any encoding
    other than UTF-8 or UTF-16 is unsafe. If communication fails, because
    someone sent an XML document in an encoding other than UTF-8 or UTF-16,
    the sender is to blame.

    This simplifies to a rule of thumb:
    When producing XML, always use UTF-8 (and Unicode Normalization Form C).
    Those who absolutely insist on using UTF-16 can use UTF-16 instead.
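    The rule of thumb above can be sketched in Python: unicodedata.normalize puts text into Normalization Form C before UTF-8 output.

    ```python
    import unicodedata

    # 'e' followed by a combining acute accent (decomposed / NFD form).
    decomposed = "cafe\u0301"
    nfc = unicodedata.normalize("NFC", decomposed)
    assert nfc == "caf\u00e9"          # one precomposed code point
    utf8_bytes = nfc.encode("utf-8")   # the safe default for XML output
    assert utf8_bytes == b"caf\xc3\xa9"
    ```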
    Henri Sivonen, Feb 4, 2005
    This is theory.

    Is there anybody who knows a parser that doesn't handle iso-8859-1
    correctly? I don't think so; otherwise, you should change parsers, and
    communication becomes safe :)


    Philippe Poulard, Feb 4, 2005
    I guess that was one of the penalties of responding to a cross-posted
    thread.
    But that's OK, since any plausible encoding produced by the editor can
    be transformed by rote into utf-8 prior to subsequent XML processing
    (that's the XML relevance). And pretty much any plausible encoding
    produced by an editor that's meant for WWW use is going to be
    supported by the available web browsers (that's the c.i.w.a.h
    relevance).
    I take your point, but again: as long as the document is correctly
    labelled, it can be transformed by rote into utf-8; it needs no
    special heuristics, nor does it run risks of being damaged in the
    process.

    all the best
    all the best
    Alan J. Flavell, Feb 4, 2005