How to know the encoding of XML file?

Discussion in 'XML' started by davisjoseph@postmark.net, Sep 13, 2005.

  1. Guest

    Hi All,

    I'm a newbie to the XML world. My problem is to identify the encoding
    type of XML at runtime. What I'm currently doing is checking whether
    a BOM is present in the XML; based on the BOM I identify the
    encoding type. Here is the problem: some UTF-8 encoded files
    don't have a BOM at the start. So I'm identifying the file as
    ISO-8859-1 encoded when it is actually encoded in UTF-8.

    I don't know much about encoding technology either.

    Is there any way to identify the encoding type of an XML file
    programmatically? I can use the Xerces C++ library or any other free
    library to identify the correct encoding. Any other workaround is
    also welcome.

    Thanks & Regards
    , Sep 13, 2005
    #1

  2. In <>, on
    09/13/2005
    at 04:01 AM, said:

    >Here is the problem: some UTF-8 encoded files
    >don't have a BOM at the start.


    Why would any UTF-8 file have a BOM? That's for encodings with 16-bit
    bytes, such as UTF-16. UTF-8 uses 8-bit bytes.

    --
    Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

    Unsolicited bulk E-mail subject to legal action. I reserve the
    right to publicly post or ridicule any abusive E-mail. Reply to
    domain Patriot dot net user shmuel+news to contact me. Do not
    reply to
    Shmuel (Seymour J.) Metz, Sep 13, 2005
    #2

  3. wrote:

    > I'm a newbie to the XML world. My problem is to identify the encoding
    > type of XML at runtime. What I'm currently doing is checking whether
    > a BOM is present in the XML; based on the BOM I identify the
    > encoding type. Here is the problem: some UTF-8 encoded files
    > don't have a BOM at the start. So I'm identifying the file as
    > ISO-8859-1 encoded when it is actually encoded in UTF-8.


    Well, for XML there are clear rules: if there is no XML declaration
    specifying the encoding, then the document can only be UTF-8 or UTF-16
    encoded, and you can tell those two apart by the presence or absence
    of a BOM (UTF-16 always needs one; for UTF-8 the BOM is optional).
    So look at the BOM and the XML declaration (that is, <?xml
    version="version.number" encoding="encoding-is-here"?>) to find the
    encoding for XML:
    <http://www.w3.org/TR/REC-xml/#charencoding>
    Of course, what you really detect this way is the encoding the
    XML document is supposed to be in; an XML parser then has to check
    that the whole document complies with that encoding. For example, if
    the XML declaration says encoding="ISO-8859-1", the XML is
    supposed to be in that encoding, and a parser then checks whether any
    byte sequences are encountered which can't be decoded properly using
    that encoding.
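
    The BOM-then-declaration check described above can be sketched roughly
    as follows. This is only an illustration in Python, loosely following
    the (non-normative) autodetection notes in the XML specification; the
    function name sniff_xml_encoding and the 200-byte lookahead are
    assumptions made for this example, not part of Xerces or any other
    library.

```python
import codecs
import re

# BOM byte sequences, longest first so a UTF-32LE BOM (FF FE 00 00)
# is not mistaken for a UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

# Matches the encoding pseudo-attribute of an XML declaration written
# in an ASCII-compatible encoding.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the declared encoding of an XML document (hypothetical helper)."""
    # 1. A BOM, if present, settles the question.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. No BOM: '<?' stored as 16-bit code units hints at UTF-16.
    if data.startswith(b"\x00<\x00?"):
        return "UTF-16BE"
    if data.startswith(b"<\x00?\x00"):
        return "UTF-16LE"
    # 3. Otherwise read the encoding named in the XML declaration.
    match = _DECL.match(data[:200])
    if match:
        return match.group(1).decode("ascii")
    # 4. No BOM and no declared encoding: XML defaults to UTF-8.
    return "UTF-8"
```

    The result is only the encoding the document claims to be in. Decoding
    the bytes with that encoding (and handling the resulting decode errors)
    is still needed to confirm that the content actually complies.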

    In general there needs to be a declaration of the encoding associated
    with a document (e.g. in XML in the XML declaration, in HTML in a <meta>
    element, or for resources accessed via HTTP in the response header), as
    there is no general algorithm that can detect every encoding that
    exists. For instance, you cannot detect whether a document is meant to
    be ISO-8859-1 encoded or ISO-8859-15 encoded: the same bytes are just
    interpreted as different characters, so the document author has to
    declare the encoding.
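
    As a small illustration of that last point (a sketch in Python, using
    its standard codecs), the single byte 0xA4 is valid in both encodings
    but stands for a different character in each:

```python
raw = b"\xa4"

# The same byte decodes without error under both encodings...
as_latin1 = raw.decode("iso-8859-1")   # U+00A4 CURRENCY SIGN
as_latin9 = raw.decode("iso-8859-15")  # U+20AC EURO SIGN

# ...so nothing in the byte stream itself says which one was meant.
print(f"U+{ord(as_latin1):04X} U+{ord(as_latin9):04X}")  # prints "U+00A4 U+20AC"
```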


    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
    Martin Honnen, Sep 13, 2005
    #3
  4. Shmuel (Seymour J.) Metz wrote:
    > In <>, on
    > 09/13/2005
    > at 04:01 AM, said:
    >
    >>Here is the problem: some UTF-8 encoded files
    >>don't have a BOM at the start.

    >
    > Why would any UTF-8 file have a BOM? That's for encodings with 16-bit
    > bytes, such as UTF-16. UTF-8 uses 8-bit bytes.


    In mixed Unicode/non-unicode environments the BOM helps to discriminate
    between Unicode/UTF-8 files and simpler ASCII/ISO-8859-x/... text files.

    --
    To reply by e-mail, please remove the extra dot
    in the given address: m.collado -> mcollado
    Manuel Collado, Sep 13, 2005
    #4
  5. On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:

    > Why would any UTF-8 file have a BOM?


    FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29

    > That's for encodings with 16-bit bytes, such as UTF-16.


    Except that the encoding schemes utf-16BE and utf-16LE use 16-bit code
    units (I'd avoid using the term "bytes"), but don't need a BOM,
    because their endian-ness is specified by the name of the encoding
    scheme.
    Alan J. Flavell, Sep 13, 2005
    #5
  6. In <43270386$>, on 09/13/2005
    at 09:51 AM, (Malcolm Dew-Jones) said:

    >: > Why would any UTF-8 file have a BOM?
    >: FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29


    Note that the file doesn't contain a BOM, but rather the UTF-8
    encoding of a BOM. An actual BOM would not be valid UTF-8.

    >(I'm still waiting for hardware that increases character sizes.


    For most hardware, character size is irrelevant. Some devices deal
    with large blocks of data. Some deal with graphical data rather than
    text. Some deal with individual bits. Keyboards deal with scan codes
    rather than conventional character representations. The only common PC
    peripherals that I can think of that actually deal with characters as
    characters are a display adapter or printer in text mode, and those
    are essentially obsolete.

    --
    Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

    Shmuel (Seymour J.) Metz, Sep 13, 2005
    #6
  7. Alan J. Flavell () wrote:
    : On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:

    : > Why would any UTF-8 file have a BOM?

    : FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29

    : > That's for encodings with 16-bit bytes, such as UTF-16.

    : Except that the encoding schemes utf-16BE and utf-16LE use 16-bit code
    : units (I'd avoid using the term "bytes"), but don't need a BOM,
    : because their endian-ness is specified by the name of the encoding
    : scheme.

    utf-16BE and utf-16LE must be using 8 bit bytes, because if they were
    using true 16-bit code units then there would be no endian-ness to
    consider.

    (I'm still waiting for hardware that increases character sizes. They've
    done it for all other elementary units on the computer, integers, memory
    pointers, etc, but for some reason not this one.)


    --

    This programmer available for rent.
    Malcolm Dew-Jones, Sep 13, 2005
    #7
  8. On Tue, 13 Sep 2005, Malcolm Dew-Jones wrote:

    > utf-16BE and utf-16LE must be using 8 bit bytes,


    That's the distinction (as set out in recent Unicode terminologies)
    between the Character Encoding Form (which in all these three cases is
    designated utf-16, consisting of 16-bit code units), and its Character
    Encoding Schemes (of which there are the three: utf-16 with BOM,
    utf-16LE, and utf-16BE) for representing the 16-bit code units as an
    octet stream.

    See chapter 2, sections 2.5 and 2.6, e.g.
    http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf
    as well as the previously cited FAQs.
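
    The difference between the one encoding form and its three encoding
    schemes can be seen by serialising the same 16-bit code unit three
    ways. The following is only a sketch using Python's built-in codecs,
    whose names ("utf-16", "utf-16-be", "utf-16-le") are Python's own
    spellings:

```python
import codecs

text = "A"  # one 16-bit code unit, 0x0041, in the UTF-16 encoding form

# Encoding schemes with the byte order fixed by name: no BOM is written.
be = text.encode("utf-16-be")  # octets 00 41
le = text.encode("utf-16-le")  # octets 41 00

# The plain "utf-16" scheme: Python writes a BOM first, then the code
# units in the machine's native byte order.
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

print(be.hex(), le.hex(), with_bom.hex(), has_bom)
```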

    > because if they were using true 16-bit code units then there would
    > be no endian-ness to consider.


    It's unfortunate that when one reads "utf-16", without context, it is
    unclear whether it's meant to refer to the C.E.F (and thus to comprise
    all three C.E.Ses), or only to the one C.E.S. Perhaps it's a pity
    they didn't devise different designations for the CEF and for the CES
    (maybe "utf-16BOM" for the CES).

    (This isn't a problem for utf-8, since there is only one CES for
    that particular CEF, with the BOM being optional.)

    > (I'm still waiting for hardware that increases character sizes.


    Historically, there has been at least one machine with 36-bit words
    that could be used as four 9-bit units; but that's past rather than
    future!

    > They've done it for all other elementary units on the computer,
    > integers, memory pointers, etc, but for some reason not this one.)


    I suspect you're more interested in raising it to 16 bits (or 32) than
    to some non-multiple of 8, though.

    best
    Alan J. Flavell, Sep 13, 2005
    #8
  9. On Tue, 13 Sep 2005, Shmuel (Seymour J.) Metz wrote:

    > >: FAQ: http://www.unicode.org/faq/utf_bom.html#28 and #29

    >
    > Note that the file doesn't contain a BOM, but rather the UTF-8
    > encoding of a BOM.


    *No* data stream ever literally "contains" a BOM, any more than it
    "contains" a copyright sign, or the letter "A" (the BOM, just like any
    Unicode character, is an abstract concept): what a data stream
    contains is the BOM encoded according to the appropriate "Character
    Encoding Scheme". That's the whole point of the BOM, so that the
    character encoding scheme can be recognised by inspecting the
    encoding. So there were no surprises there.

    > An actual BOM would not be valid UTF-8.


    An "actual BOM" is an abstract concept!

    The idea of dumping the hexadecimal number x'FEFF' into a utf-8 data
    stream - if that was what you had in mind - would make no sense, any
    more than dumping x'00A9' into it would make any sense to represent
    the copyright sign. Isn't that obvious?

    Let's cut them some slack: when they say that it "contains a BOM",
    they are taking it for granted that it means "appropriately encoded".
    You can't put an abstract concept into a data stream *without* an
    appropriate encoding, after all.
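
    To make that concrete, here is a short sketch (Python, standard codecs
    only) of how the abstract character U+FEFF comes out under different
    encoding schemes, and of what happens if the raw octets FE FF are
    treated as UTF-8:

```python
bom = "\ufeff"  # the abstract character, not yet any particular bytes

utf8 = bom.encode("utf-8")         # EF BB BF
utf16be = bom.encode("utf-16-be")  # FE FF
utf16le = bom.encode("utf-16-le")  # FF FE

# The copyright sign works the same way: U+00A9 becomes C2 A9 in UTF-8.
copyright_utf8 = "\u00a9".encode("utf-8")

# The octets FE FF on their own are not valid UTF-8 at all.
try:
    b"\xfe\xff".decode("utf-8")
    raw_feff_is_valid_utf8 = True
except UnicodeDecodeError:
    raw_feff_is_valid_utf8 = False

print(utf8.hex(), utf16be.hex(), utf16le.hex(), copyright_utf8.hex())
```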
    Alan J. Flavell, Sep 13, 2005
    #9