Question about Character Set

Discussion in 'XML' started by ssk@chol.net, Feb 3, 2005.

  1. Guest

    Hello!

    This might be a dumb question.

    An XML file starts with a line like the following line.
    <?xml version="1.0" encoding="ISO-8859-1"?>
    So an application knows what encoding the file is.
    However, how does an application read the first line without knowing
    what encoding it is?

    That is...
    To know what encoding it is, it should read the first line.
    To read the first line, it should know what encoding it is.

    Isn't this a chicken and egg issue?
    Am I missing an important point?

    TIA.
    Sam
    , Feb 3, 2005
    #1

  2. wrote:

    > This might be a dumb question.


    No, this is not a dumb question.

    > To know what encoding it is, it should read the first line.
    > To read the first line, it should know what encoding it is.
    >
    > Isn't this a chicken and egg issue?


    Yes, this is a chicken and egg problem.
    The problem goes even deeper when you
    consider files which are encoded in UTF-16.
    This is a very readable explanation:

    http://safari.oreilly.com/?x=1&mode...&t=1&c=1&u=1&r=&o=1&n=1&d=1&p=1&a=0&srchText=
    Jürgen Kahrs, Feb 3, 2005
    #2

  3. <> wrote in message
    news:...
    > Hello!
    >
    > This might be a dumb question.
    >
    > An XML file starts with a line like the following line.
    > <?xml version="1.0" encoding="ISO-8859-1"?>
    > So an application knows what encoding the file is.
    > However, how does an application read the first line without knowing
    > what encoding it is?
    >
    > That is...
    > To know what encoding it is, it should read the first line.
    > To read the first line, it should know what encoding it is.


    quote from:
    http://www.w3c.org/TR/2004/REC-xml-20040204/#sec-guessing
    "Because the contents of the encoding declaration are restricted to
    characters from the ASCII repertoire (however encoded), a processor can
    reliably read the entire encoding declaration as soon as it has detected
    which family of encodings is in use."

    "(however encoded)" means that each character can still be stored as a
    16-bit or 32-bit unit (see also the "Without a Byte Order Mark" table),
    but all the characters in the declaration are in the ASCII range.
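
    To make the "however encoded" point concrete, here is a small sketch in
    Python (my choice of language; the helper name encoded_prefixes is made
    up for illustration). It serializes the same ASCII-only prefix "<?xml"
    in a few encoding families, using codec names from Python's standard
    library, with cp037 standing in for one flavor of EBCDIC:

```python
# The same ASCII-only prefix, serialized in several encoding families.
PREFIX = "<?xml"


def encoded_prefixes():
    """Return the byte patterns of '<?xml' in a few illustrative encodings."""
    names = ["ascii", "iso-8859-1", "utf-16-be", "utf-16-le", "utf-32-be", "cp037"]
    return {name: PREFIX.encode(name) for name in names}


# Print one line per encoding: codec name, then the bytes in hex.
for name, raw in encoded_prefixes().items():
    print(f"{name:12} {raw.hex(' ')}")
```

    The characters are the same in every case; only the byte layout differs,
    and each layout is distinctive enough to identify the family.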

    > Isn't this a chicken and egg issue?
    > Am I missing an important point?


    Not that complicated. Clever stuff from W3C XML Working Group however.

    > TIA.
    > Sam
    >

    with respect,
    Toni Uusitalo
    Toni Uusitalo, Feb 3, 2005
    #3
  4. Guest

    Thank you for the answer.
    I have a question.
    See in-line.

    Toni Uusitalo wrote:
    > <> wrote in message
    > news:...
    > > Hello!
    > >
    > > This might be a dumb question.
    > >
    > > An XML file starts with a line like the following line.
    > > <?xml version="1.0" encoding="ISO-8859-1"?>
    > > So an application knows what encoding the file is.
    > > However, how does an application read the first line without knowing
    > > what encoding it is?
    > >
    > > That is...
    > > To know what encoding it is, it should read the first line.
    > > To read the first line, it should know what encoding it is.

    >
    > quote from:
    > http://www.w3c.org/TR/2004/REC-xml-20040204/#sec-guessing
    > "Because the contents of the encoding declaration are restricted to
    > characters from the ASCII repertoire (however encoded), a processor can
    > reliably read the entire encoding declaration as soon as it has detected
    > which family of encodings is in use."
    >
    > (however encoded) means that it still can be 16-bit or 32-bit character (see
    > also "Without a Byte Order Mark" table) but all the characters in the
    > declaration are in the ascii range of course.


    I understand it.
    But what about UCS-2?
    It uses 2 bytes for all characters including ASCII characters.
    Well, I don't think I've seen UCS-2 used for encoding yet.
    But that's one of the encodings, right?

    >
    > > Isn't this a chicken and egg issue?
    > > Am I missing an important point?

    >
    > Not that complicated. Clever stuff from W3C XML Working Group however.
    >
    > > TIA.
    > > Sam
    > >

    > with respect,
    > Toni Uusitalo


    Thanks again.
    Sam
    , Feb 4, 2005
    #4
  5. In article <>,
    <> wrote:

    >However, how does an application read the first line without knowing
    >what encoding it is?


    Very, very carefully...

    Since there must be an encoding declaration unless it's UTF-8 (or
    UTF-16, which must start with a byte-order mark), the first bytes
    must either be a byte-order mark or the characters "<?xml ", or else
    you can assume UTF-8. So you can look at those first few bytes and
    narrow down the possibilities. Since the encoding declaration is
    limited to characters in the ASCII set, you don't have to know whether
    it's Latin-1 or Latin-5 or some proprietary Microsoft encoding to read
    it. Likewise, it won't matter which version of EBCDIC it is if you
    have to deal with that.
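
    A rough sketch of that first-bytes check, in Python. The function name
    guess_encoding_family and the family labels it returns are made up for
    illustration; the byte patterns are the common ones from the
    autodetection appendix of the XML 1.0 spec (the unusual UTF-32 byte
    orders are left out). It only narrows things down to a family; the
    declaration still has to be read to get the exact encoding.

```python
import codecs


def guess_encoding_family(head: bytes) -> str:
    """Guess the encoding family from the first bytes of an XML document."""
    # Byte-order marks. UTF-32 is tested before UTF-16 because the
    # UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    boms = [
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ]
    for bom, family in boms:
        if head.startswith(bom):
            return family

    # No BOM: see how the ASCII characters "<?" or "<?xm" are laid out.
    patterns = [
        (b"\x00\x00\x00<", "utf-32-be"),
        (b"<\x00\x00\x00", "utf-32-le"),
        (b"\x00<\x00?", "utf-16-be"),
        (b"<\x00?\x00", "utf-16-le"),
        (b"<?xm", "ascii-compatible"),  # UTF-8, ISO-8859-x, Shift_JIS, ...
        (bytes.fromhex("4c6fa794"), "ebcdic"),  # "<?xm" in EBCDIC
    ]
    for pattern, family in patterns:
        if head.startswith(pattern):
            return family

    # Neither a BOM nor an XML declaration: assume UTF-8.
    return "utf-8"
```

    For the declaration from the original question, which starts with the
    plain bytes "<?xm", this returns "ascii-compatible", and the rest of the
    declaration can then be read as ASCII to find "ISO-8859-1".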

    -- Richard
    Richard Tobin, Feb 4, 2005
    #5
  6. <> wrote in message
    news:...

    > But what about UCS-2?
    > It uses 2 bytes for all characters including ASCII characters.
    > Well, I don't think I've seen UCS-2 used for encoding yet.
    > But that's one of encodings, right?


    It's mentioned in the spec, in the "Without a Byte Order Mark" table, as
    ISO-10646-UCS-2. The encoding detection process is essentially the same
    as for UTF-16 without a BOM. Quote from that table:
    "UTF-16BE or big-endian ISO-10646-UCS-2 or other encoding with a 16-bit code
    unit in big-endian order and ASCII characters encoded as ASCII values (the
    encoding declaration must be read to determine which)"
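
    As a sketch of that last step ("the encoding declaration must be read to
    determine which"), assuming Python: once the family is known, decode the
    start of the file with any codec from that family and pull the name out
    of the declaration. The function name declared_encoding and the 256-byte
    limit are my own choices for illustration, not anything from the spec.

```python
import re

# Matches the encoding pseudo-attribute inside an already-decoded declaration.
ENCODING_RE = re.compile(r"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")


def declared_encoding(raw: bytes, family_codec: str):
    """Return the encoding named in the XML declaration, or None if absent.

    family_codec is a provisional codec for the detected family, for
    example "utf-16-be" for any big-endian encoding with 16-bit units.
    """
    # The declaration is ASCII-only, so a provisional decode is enough;
    # any later non-ASCII bytes are simply replaced.
    head = raw[:256].decode(family_codec, errors="replace")
    if not head.startswith("<?xml"):
        return None
    end = head.find("?>")
    if end == -1:
        return None
    match = ENCODING_RE.search(head[:end])
    return match.group(1) if match else None


# A UCS-2 style document: every ASCII character takes two bytes.
doc = '<?xml version="1.0" encoding="ISO-10646-UCS-2"?><a/>'.encode("utf-16-be")
print(declared_encoding(doc, "utf-16-be"))
```

    So UCS-2 is not a special case: the two-byte layout identifies the
    family, and the declaration itself says whether it is UTF-16BE or
    ISO-10646-UCS-2.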

    with respect,
    Toni Uusitalo
    Toni Uusitalo, Feb 4, 2005
    #6