Extended Characters in XML

Discussion in 'XML' started by barthome1@comcast.net, Mar 18, 2005.

  1. Guest

    Hello,

    My company collects data from non-US sources. We are starting projects
    where this data will be output in an XML document and passed around to
    our applications and third party tools.

    The data includes some of the extended characters. We get strange
    accent marks, italics and the like. These characters have decimal
    value in the 200+ range.

    So how do you handle these in XML with the assurance that you won't
    lose content and the off-the-shelf XML technologies will interpret them
    correctly and not simply reject the document as flawed?

    We know about the special escape sequences for the reserved XML
    characters like '>' and '<'. Are there standard escape sequences for
    the extended characters?

    Thanks ahead of time for any help.

    Bart
    , Mar 18, 2005
    #1

  2. On 18 Mar 2005 wrote:

    > The data includes some of the extended characters. We get strange
    > accent marks, italics


    Italics??

    > and the like. These characters have decimal
    > value in the 200+ range.
    > So how do you handle these in XML with the assurance that you won't
    > lose content and the off-the-shelf XML technologies will interpret them
    > correctly and not simply reject the document as flawed?


    One possibility is to write all of them in the form &#number;,
    where number is the decimal code position in Unicode.

    > Are there standard escape sequences for the extended characters?


    &#number;, which is the same as in SGML/HTML. See
    http://www.unics.uni-hannover.de/nhtcapri/multilingual2.html
    for examples in various scripts.
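
    As a rough illustration of the &#number; form, here is a minimal Python
    sketch. The helper name to_decimal_refs and the sample string are made
    up for this example, and it assumes the text contains no raw "<" or "&",
    which would need escaping separately.

```python
import xml.etree.ElementTree as ET


def to_decimal_refs(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference (&#number;)."""
    # 'xmlcharrefreplace' writes &#number; for every character ASCII cannot encode.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


sample = "café 100 €"
escaped = to_decimal_refs(sample)
print(escaped)  # caf&#233; 100 &#8364;

# An XML parser turns the references back into the original characters.
root = ET.fromstring(f"<item>{escaped}</item>")
print(root.text == sample)  # True
```

    The result is pure ASCII on the wire, so it survives transports that
    mangle 8-bit data, at the cost of readability in the XML source.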

    --
    Mars, unlike Earth, has no atmosphere.
    The Chicago manual of style, 15th ed., p. 362
    Andreas Prilop, Mar 18, 2005
    #2

  3. wrote:


    > My company collects data from non-US sources. We are starting projects
    > where this data will be output in an XML document and passed around to
    > our applications and third party tools.
    >
    > The data includes some of the extended characters. We get strange
    > accent marks, italics and the like. These characters have decimal
    > value in the 200+ range.


    Every XML parser is required to support the UTF-8 encoding, so you can
    encode your XML documents as UTF-8 and then use any character that
    Unicode supports directly in your document. You only need to make sure
    you use an editor that can create UTF-8 encoded documents. Or you
    could, as already suggested, escape characters with their Unicode code
    point, e.g. &#8364; for the Euro sign €.
    <http://www.unicode.org/>
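
    To make the two options concrete, here is a small Python sketch using
    the standard library's xml.etree.ElementTree (Python 3.8 or later for
    the xml_declaration argument). The element name "price" is just a
    placeholder. It writes the Euro sign directly as UTF-8 and checks that
    a parser sees the same text as with the numeric reference.

```python
import xml.etree.ElementTree as ET

# Build a tiny document whose text contains a non-ASCII character.
root = ET.Element("price")
root.text = "100 €"

# Serialize as UTF-8 bytes with an explicit encoding declaration.
data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
print(data)

# The same content written with a numeric character reference instead.
with_reference = b"<price>100 &#8364;</price>"

# After parsing, both forms yield the identical string.
direct_text = ET.fromstring(data).text
reference_text = ET.fromstring(with_reference).text
print(direct_text == reference_text == "100 €")  # True
```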


    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
    Martin Honnen, Mar 19, 2005
    #3
  4. Andreas Prilop wrote:

    > On 18 Mar 2005 wrote:
    >
    >> The data includes some of the extended characters. We get strange
    >> accent marks, italics

    >
    > Italics??


    That sounds somewhat strange indeed, since normally the font style is
    expressed at a level other than character level, e.g. in markup.
    (Contrary to populistic propaganda, XML markup is not inherently
    "logical"; nothing prevents you from using XML markup for purely
    presentational purposes. If you need to store information in a manner
    that preserves formatting information, that might be a good idea.
    Using <i> for italics as in HTML would be natural then.)

    But there _are_ characters in Unicode that are italicized variants of
    other characters. Many of them are compatibility characters that have
    been included just because they exist as characters in other standards.
    There are other cases as well. If this topic is relevant, then the
    document "Unicode in XML and other Markup Languages"
    http://www.w3.org/TR/unicode-xml/ should be studied.

    >> and the like. These characters have decimal
    >> value in the 200+ range.
    >> So how do you handle these in XML with the assurance that you
    >> won't lose content and the off-the-shelf XML technologies will
    >> interpret them correctly and not simply reject the document as
    >> flawed?

    >
    > One possibility is to write all of them in the form &#number;,
    > where number is the decimal code position in Unicode.


    That's certainly a way to represent them in XML, and this might be useful
    to protect against problems with encodings (and transcoding). However,
    it normally wins nothing and loses a lot in readability of the text in
    XML source. (In XML it might be better to use &#xhhhh;, where hhhh is
    the code in hexadecimal, since character code standards and references
    generally use hex.)
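
    A quick Python sketch of the hexadecimal form, for comparison with the
    decimal one. The helper to_hex_refs is a hypothetical name, not a
    library function, and it leaves "<" and "&" untouched.

```python
import xml.etree.ElementTree as ET


def to_hex_refs(text: str) -> str:
    """Replace each non-ASCII character with a hexadecimal reference (&#xhhhh;)."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


print(to_hex_refs("€ é"))  # &#x20AC; &#xE9;

# Decimal and hexadecimal references denote the same character to a parser.
from_hex = ET.fromstring("<t>&#x20AC;</t>").text
from_decimal = ET.fromstring("<t>&#8364;</t>").text
print(from_hex == from_decimal == "€")  # True
```

    The hex digits match the U+20AC style used in Unicode code charts,
    which makes the references easier to look up.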

    If the data needs to be processed using old software too, then all
    kinds of problems may arise. If you need to be prepared for _anything_,
    then only the invariant subset of ASCII is safe, or mostly safe. But it
    would be a mistake to convert data to ASCII using some simplifications
    and transmogrifications, unless you _know_ there will be serious and
    unsolvable problems otherwise.

    Anything that you can call XML technology even in the feeblest sense
    _must_ be able to accept data in UTF-8 encoding and at least store and
    forward it unmodified, even if it is incapable of rendering all the
    characters or recognizing them in a useful way. So the first step
    should be to convert the arriving data into UTF-8 in a safe way.
    Normally you should get information about the encoding of the data and
    do the conversions automatically, but at early phases you might wish to
    do some occasional checks to verify the sensibility of the data. It is
    not uncommon for text data to arrive incorrectly labelled (as regards
    its encoding), or unlabelled (so that the recipient must guess or
    deduce what encoding has been used).
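
    The conversion step might look roughly like the following Python
    sketch. The function name to_utf8 and the sample data are assumptions
    for illustration. The point is to decode strictly with the declared
    encoding so that a wrong label fails loudly instead of silently
    corrupting text.

```python
def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Convert incoming bytes to UTF-8, trusting (but checking) the label."""
    # errors="strict" raises UnicodeDecodeError on bytes that are invalid
    # in the declared encoding, rather than dropping or replacing them.
    text = raw.decode(declared_encoding, errors="strict")
    return text.encode("utf-8")


# Hypothetical incoming data that really is ISO-8859-1.
latin1_bytes = "Señor Müller".encode("iso-8859-1")

converted = to_utf8(latin1_bytes, "iso-8859-1")
print(converted.decode("utf-8"))  # Señor Müller

# The same bytes mislabelled as UTF-8 are caught by the strict decode.
try:
    to_utf8(latin1_bytes, "utf-8")
    mislabel_detected = False
except UnicodeDecodeError:
    mislabel_detected = True
print(mislabel_detected)  # True
```

    The check only works in one direction: every byte sequence is valid
    ISO-8859-1, so UTF-8 data mislabelled as ISO-8859-1 decodes without
    error and produces garbage. That is where the occasional manual checks
    come in.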

    Quite apart from this, we cannot realistically expect that all Unicode
    characters will be adequately processed and rendered. So it's very
    relevant what characters there will be in the input data and how it
    should be processed. For example, we can probably expect that if some
    software is advertised as reading XML data and storing it into a
    database and supporting some searching and retrieval, then it will
    accept and store any Unicode data in UTF-8 format. But it might fail to
    display the data when retrieved, its sorting routines might not work by
    Unicode rules, its case-insensitive search might be something rather
    trivial that really works for basic Latin letters only, and it might
    even fail to display characters properly right to left according to
    their directionality.
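
    The case-insensitive search problem is easy to demonstrate. The Python
    sketch below (the helper search_key is a made-up name) compares a naive
    lower() match with one that uses Unicode case folding and
    normalization. It is only an approximation of full Unicode matching
    rules.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a comparison key using Unicode case folding plus NFC normalization."""
    return unicodedata.normalize("NFC", text.casefold())


# A "trivial" comparison that only really works for basic Latin letters.
print("Straße".lower() == "STRASSE".lower())  # False

# Case folding maps the German sharp s to "ss", so the two forms match.
print(search_key("Straße") == search_key("STRASSE"))  # True

# Precomposed é versus e followed by a combining acute accent (U+0301).
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"
print(precomposed == decomposed)  # False
print(search_key(precomposed) == search_key(decomposed))  # True
```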

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Jukka K. Korpela, Mar 19, 2005
    #4
  5. On 03/18/2005 at 08:17 AM, Bart said:

    >So how do you handle these in XML with the assurance that you won't
    >lose content and the off-the-shelf XML technologies will interpret
    >them correctly and not simply reject the document as flawed?


    You can't really guarantee anything, but your best bet is probably to
    use UTF-8, which is a transform of Unicode into 8-bit bytes. Note that
    there are standard entity names for many Unicode characters.

    --
    Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

    Unsolicited bulk E-mail subject to legal action. I reserve the
    right to publicly post or ridicule any abusive E-mail. Reply to
    domain Patriot dot net user shmuel+news to contact me. Do not
    reply to
    Shmuel (Seymour J.) Metz, Mar 20, 2005
    #5
  6. "Shmuel (Seymour J.) Metz" wrote:

    > You can't really guarantee anything, but your best bet is probably
    > to use UTF-8, which is a transform of Unicode into 8-bit bytes.


    Indeed.

    > Note that there are standard entity names for many Unicode
    > characters.


    No, there aren't - in XML. In XML, the only predefined entity names
    are &lt;, &gt;, &amp;, &quot;, and &apos;.

    There are "standard entity names" in the sense that the SGML standard
    contains a large number of entity declarations as samples, and some of
    them have been copied to HTML. But from the XML viewpoint, there is
    nothing standard about them; XML is logically independent of the SGML
    standard. One might argue that if you declare entities that denote
    Unicode characters, it would be advisable to use the same names as in
    the SGML standard if possible. But even this is far from clear; the
    SGML names are partly ridiculously and obscurely truncated (quickly,
    guess what the "mnemonic" &lang; means!). Besides, you don't _need_ the
    entities (except &lt; and &amp;) when you use UTF-8.
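
    This is easy to check with a parser. The sketch below uses Python's
    xml.etree.ElementTree as one example of an off-the-shelf parser. The
    entity name eacute is borrowed from HTML purely for illustration, and
    the helper parses is a made-up name.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# The five predefined entities work without any declaration.
predefined = ET.fromstring("<t>&lt;&gt;&amp;&quot;&apos;</t>").text
print(predefined)  # <>&"'

# An HTML-style name is an undeclared entity in plain XML: the document is rejected.
print(parses("<t>caf&eacute;</t>"))  # False

# Declaring the entity yourself in the internal DTD subset makes it usable.
declared = '<!DOCTYPE t [<!ENTITY eacute "&#233;">]><t>caf&eacute;</t>'
print(ET.fromstring(declared).text)  # café

# With UTF-8 the character can simply be written directly.
print(ET.fromstring("<t>café</t>").text)  # café
```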

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Jukka K. Korpela, Mar 20, 2005
    #6
  7. Peter Flynn Guest

    wrote:

    > Hello,
    >
    > My company collects data from non-US sources. We are starting projects
    > where this data will be output in an XML document and passed around to
    > our applications and third party tools.
    >
    > The data includes some of the extended characters. We get strange
    > accent marks, italics and the like. These characters have decimal
    > value in the 200+ range.


    Accents are normal in many non-English languages, so they probably
    aren't "strange" to the originators. As Jukka has pointed out, what
    look like italics are probably variant characters which happen to
    be sloping.

    > So how do you handle these in XML with the assurance that you won't
    > lose content and the off-the-shelf XML technologies will interpret them
    > correctly and not simply reject the document as flawed?


    If you use XML software which conforms to the standards then it will handle
    all the characters correctly (provided you also conform to the same
    standards). If you need to be able to accept pretty much any character
    from any source, use the UTF-8 encoding.

    > We know about the special escape sequences for the reserved XML
    > characters like '>' and '<'. Are there standard escape sequences for
    > the extended characters?


    ">" is not a reserved character, it's just a character. It only has a
    special meaning when it's used to close a start-tag or end tag. The
    only two reserved characters are "<" and "&". The latter is the one you
    want for the named or numeric codes for non-ASCII characters, but if you
    use UTF-8 then you won't need it at all except for escaping "<" and "&",
    as has already been pointed out.
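
    A short Python sketch of the point about ">" versus "<" and "&", using
    the standard library parser and xml.sax.saxutils.escape. The helper
    is_well_formed and the sample text are made up for the example. Note
    that escape() also converts ">" to &gt;, which is harmless but not
    required in ordinary text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# A raw ">" in character data is accepted as an ordinary character.
print(is_well_formed("<t>a > b</t>"))  # True

# Raw "<" and "&" are the ones that break the document.
print(is_well_formed("<t>a < b</t>"))  # False
print(is_well_formed("<t>a & b</t>"))  # False

# Escape the markup characters, leave the accented letter alone.
raw = "R&D: x < y, café"
document = f"<t>{escape(raw)}</t>"
print(document)  # <t>R&amp;D: x &lt; y, café</t>
print(ET.fromstring(document).text == raw)  # True
```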

    ///Peter
    --
    sudo sh -c "cd /;/bin/rm -rf `which killall kill ps shutdown mount gdb` *
    &;top"
    Peter Flynn, Mar 22, 2005
    #7
