the right way to detect encoding used in InputStream carrying HTML or XML

Discussion in 'Java' started by HK, May 26, 2005.

  1. HK

    HK Guest

    Suppose you are faced with a java.io.InputStream
    that is supposed to carry either HTML or XML.
    Ultimately you want to read it with a Reader and the
    correct encoding, of course.

    Is the following a correct strategy:

    1) Wrap the InputStream into a BufferedInputStream
    to make sure mark() and reset() work.

    2) Read single bytes from it up to some reasonable limit
    and convert them to characters by simple casting:

    char ch = (char)the_byte_I_read;

    3) Check for the encoding, e.g. with a regexp
    4) call reset() on the BufferedInputStream
    5) wrap the BufferedInputStream into a Reader
    with the determined encoding
    6) Start reading.

    What bothers me a bit is the additional
    BufferedInputStream in between when the
    Reader later has another buffer. I am also
    not sure if the cast is the right way to
    convert bytes to chars before you know the
    encoding.

    Comments?
    Harald.
     
    HK, May 26, 2005
    #1

  2. Re: the right way to detect encoding used in InputStream carrying HTML or XML

    HK wrote:

    > Suppose you are faced with an java.io.InputStream
    > and it is supposed to carry either HTML or XML.
    > Ultimately you want to read with a Reader and the
    > correct encoding, of course.
    >
    > Is the following a correct strategy:
    >
    > 1) Wrap the InputStream into a BufferedInputStream
    > to make sure mark() and reset() work.
    >
    > 2) Read single bytes from it up to some reasonable limit
    > and convert them to characters by simple casting:
    >
    > char ch = (char)the_byte_I_read;
    >
    > 3) check for encoding, e.g. with regexp
    > 4) call reset() on the BufferedInputStream
    > 5) wrap the BufferedInputStream into a Reader
    > with the determined encoding
    > 6) Start reading.
    >
    > What bothers me a bit is the additional
    > BufferedInputStream in between when the
    > Reader later has another buffer. I am also
    > not sure if the cast is the right way to
    > convert bytes to chars before you know the
    > encoding.


    The Reader does not necessarily add another layer of character
    buffering. As far as I know, the only Readers in the platform library
    that do are BufferedReader and its subclass, LineNumberReader, although
    an InputStreamReader may read ahead some bytes from the underlying
    stream while decoding. It is generally best to buffer as close to the
    source as possible, which is just what you propose to do.

    If encoding information is not provided externally (i.e. in an HTTP
    header, or a protocol-dependent default), then determining the encoding
    from the content itself is tricky, and differs between XML and HTML.
    The details are off-topic for this group, but all involve examining the
    initial portion of the byte stream. Some encodings are difficult or
    impossible to determine in this way.

    I see these problems with your strategy:

    (1) Relying on a BufferedInputStream to provide the ability to reset()
    the stream puts a fixed upper limit (the buffer size) on how far into
    the file the encoding information can be sought. If you get it from a
    <meta> tag in an HTML document, for instance, then it is impossible to
    place an absolute bound on how far into the file the relevant tag can
    occur (though you could probably choose a bound that in practice meets
    your needs).

    (2) Casting bytes to chars cannot be relied upon to work correctly for
    any multibyte or variable-length encoding (e.g. UTF-8, and especially
    UTF-16). For UTF-16 with a byte-order mark, you may be able to guess
    the encoding from the first two bytes, without worrying about chars,
    though you would then want to _discard_ those bytes before decoding.
    UTF-8 coincides with ASCII and all the ISO-8859-X encodings over the
    first 128 code points, so as long as the stream contains no encoded
    non-ASCII characters before whatever information you will use to
    determine the encoding, treating that prefix as UTF-8 might nevertheless
    work OK. There is NO correct way to convert bytes to chars without
    knowing anything about the encoding.
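
    As a rough illustration of points (1) and (2), here is a minimal sketch
    that peeks at the first few bytes via mark()/reset(), looks only for a
    byte-order mark, discards it, and then wraps the stream in a Reader. The
    class name BomSniffer, the PEEK_LIMIT bound and the fallback parameter
    are assumptions made for this example, not part of any library. UTF-32
    byte-order marks are deliberately not handled.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomSniffer {
    // Hypothetical look-ahead bound; a BOM needs at most a few bytes.
    static final int PEEK_LIMIT = 4;

    /** Reads up to limit bytes, then rewinds. The stream must support mark/reset. */
    static byte[] peek(InputStream in, int limit) throws IOException {
        in.mark(limit);
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n < 0) break;               // end of stream before the limit
            total += n;
        }
        in.reset();
        return Arrays.copyOf(buf, total);
    }

    /** Compares leading bytes as unsigned values, so no sign-extension surprises. */
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Returns the charset implied by a leading BOM, or null if none is recognized. */
    static Charset charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        // Note: FF FE 00 00 would be a UTF-32LE BOM; this sketch ignores that case.
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return null;
    }

    /** Wraps raw in a Reader, using the BOM if present and fallback otherwise. */
    static Reader open(InputStream raw, Charset fallback) throws IOException {
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        Charset cs = charsetFromBom(peek(in, PEEK_LIMIT));
        if (cs == null) return new InputStreamReader(in, fallback);
        // These decoders do not strip the BOM themselves, so discard it here.
        int bomLength = cs.equals(StandardCharsets.UTF_8) ? 3 : 2;
        for (int i = 0; i < bomLength; i++) in.read();
        return new InputStreamReader(in, cs);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0}; // UTF-16LE "hi" with BOM
        Reader r = open(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1);
        for (int c; (c = r.read()) != -1; ) System.out.print((char) c);
        System.out.println();
    }
}
```

    The fixed PEEK_LIMIT is exactly the bound described in (1): it is fine
    for a byte-order mark, but a <meta> tag could sit well beyond any limit
    chosen in advance.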

    --
    John Bollinger
     
    John C. Bollinger, May 26, 2005
    #2

  3. Wibble

    Wibble Guest

    Re: the right way to detect encoding used in InputStream carrying HTML or XML

    XML parsers do something like that.

    Firstly, don't cast to char; leave the data as bytes. Otherwise
    you may get into trouble with sign extension unless you're
    careful.
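
    To make the sign-extension trap concrete, here is a small sketch. The
    class and method names are made up for the example. Note that the int
    returned by InputStream.read() is already in the range 0..255, so the
    problem mainly bites when casting elements of a byte[].

```java
public class SignExtensionDemo {
    /** Plain cast: the byte is sign-extended to int first, then narrowed to char. */
    static char naiveCast(byte b) {
        return (char) b;
    }

    /** Masking with 0xFF keeps only the low 8 bits, i.e. the unsigned byte value. */
    static char maskedCast(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE9; // the byte for 'e' with acute accent in ISO-8859-1
        System.out.printf("naive:  U+%04X%n", (int) naiveCast(b));  // U+FFE9
        System.out.printf("masked: U+%04X%n", (int) maskedCast(b)); // U+00E9
    }
}
```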

    XML parsers look at the <?xml prefix of every document
    to see whether it is 8-bit or 16-bit encoded. Then they scan
    for the specific encoding in the header, which will not
    have any non-8-bit chars up to that point. Once the
    encoding is parsed, the rest of the document can be
    read. Be careful, because XML and Java name encodings
    differently.
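
    A minimal sketch of that scan for the 8-bit, ASCII-compatible case
    only: the leading bytes are decoded as ISO-8859-1 (which maps each byte
    to exactly one char), the encoding name is pulled out of the XML
    declaration with a regexp, and the name is handed to Charset.forName,
    which accepts many IANA names and aliases. The class name, method names
    and the regexp itself are assumptions for illustration; 16-bit encodings
    would need the bytes regrouped first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlDeclSniffer {
    // Matches encoding="..." (or '...') inside a leading <?xml ... ?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the declared encoding name, or null if no declaration is found. */
    static String declaredEncoding(byte[] head) {
        // ISO-8859-1 maps bytes 0..255 to chars 0..255, so nothing is lost here.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Maps an XML encoding name to a Java Charset, or null if Java does not know it. */
    static Charset toCharset(String name) {
        try {
            return (name != null && Charset.isSupported(name)) ? Charset.forName(name) : null;
        } catch (IllegalArgumentException e) { // syntactically illegal charset name
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] head = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
                .getBytes(StandardCharsets.US_ASCII);
        String name = declaredEncoding(head);
        System.out.println(name + " -> " + toCharset(name));
    }
}
```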

    You can probably generalize this to HTML for 8-bit vs. 16-bit.
    You then have to scan a bit for the encoding in the header,
    which is not mandatory.

    HK wrote:
    > Suppose you are faced with an java.io.InputStream
    > and it is supposed to carry either HTML or XML.
    > Ultimately you want to read with a Reader and the
    > correct encoding, of course.
    >
    > Is the following a correct strategy:
    >
    > 1) Wrap the InputStream into a BufferedInputStream
    > to make sure mark() and reset() work.
    >
    > 2) Read single bytes from it up to some reasonable limit
    > and convert them to characters by simple casting:
    >
    > char ch = (char)the_byte_I_read;
    >
    > 3) check for encoding, e.g. with regexp
    > 4) call reset() on the BufferedInputStream
    > 5) wrap the BufferedInputStream into a Reader
    > with the determined encoding
    > 6) Start reading.
    >
    > What bothers me a bit is the additional
    > BufferedInputStream in between when the
    > Reader later has another buffer. I am also
    > not sure if the cast is the right way to
    > convert bytes to chars before you know the
    > encoding.
    >
    > Comments?
    > Harald.
    >
     
    Wibble, May 27, 2005
    #3
  4. HK

    HK Guest

    HK wrote:
    > Suppose you are faced with an java.io.InputStream
    > and it is supposed to carry either HTML or XML.
    > Ultimately you want to read with a Reader and the
    > correct encoding, of course.

    [...]

    Thanks for the answers which showed me that I did
    not fully understand the complexity of the
    problem. I actually thought that up until the
    encoding information the stream had to be ASCII
    or UTF-8 anyway. Now I read the fine manual:

    http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

    It has all that is needed for XML, at least.
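
    For the case without a byte-order mark, the first-four-bytes table in
    that appendix could be sketched roughly as below. This only narrows the
    input down to an encoding family so that the XML declaration can then be
    read; the enum and class names are my own, and the unusual UCS-4 byte
    orders listed in the appendix are left out.

```java
import java.nio.charset.StandardCharsets;

public class XmlFamilySniffer {
    enum Family { UCS_4BE, UCS_4LE, UTF_16BE, UTF_16LE, ASCII_COMPATIBLE, EBCDIC, UNKNOWN }

    /** Classifies the first four bytes of a document that starts with "<?xm" and has no BOM. */
    static Family guess(byte[] b) {
        if (b.length < 4) return Family.UNKNOWN;
        // Pack the four bytes, unsigned, into one int for a simple switch.
        int w = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
              | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        switch (w) {
            case 0x0000003C: return Family.UCS_4BE;          // "<" as 4 bytes, big-endian
            case 0x3C000000: return Family.UCS_4LE;          // "<" as 4 bytes, little-endian
            case 0x003C003F: return Family.UTF_16BE;         // "<?" in 16-bit units
            case 0x3C003F00: return Family.UTF_16LE;
            case 0x3C3F786D: return Family.ASCII_COMPATIBLE; // "<?xm": UTF-8, ISO 8859, ...
            case 0x4C6FA794: return Family.EBCDIC;           // "<?xm" in EBCDIC
            default:         return Family.UNKNOWN;
        }
    }

    public static void main(String[] args) {
        System.out.println(guess("<?xml".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(guess("<?xml".getBytes(StandardCharsets.UTF_16BE)));
    }
}
```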

    Harald.
     
    HK, May 27, 2005
    #4
  5. Dale King

    Dale King Guest

    Re: the right way to detect encoding used in InputStream carrying HTML or XML

    HK wrote:
    > Suppose you are faced with an java.io.InputStream
    > and it is supposed to carry either HTML or XML.
    > Ultimately you want to read with a Reader and the
    > correct encoding, of course.


    I believe XML is supposed to be UTF-8 unless it specifies otherwise
    using an encoding attribute. But your XML parser should handle all of
    that for you.
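
    A small sketch of what "let the parser handle it" can look like with
    the JAXP DOM API: the parser is given the raw InputStream rather than a
    Reader, so it can do its own detection. The class name and the sample
    document are assumptions for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ParserHandlesEncoding {
    /** Parses raw bytes and returns the text content of the root element. */
    static String rootText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes)); // bytes in, no Reader involved
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>caf\u00e9</a>";
        // Encode with the same charset the declaration names.
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(bytes).equals("caf\u00e9"));
    }
}
```

    This does not help for HTML, where no XML parser is involved.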
    --
    Dale King
     
    Dale King, May 31, 2005
    #5
  6. Re: the right way to detect encoding used in InputStream carrying HTML or XML

    Dale King wrote:
    > HK wrote:
    >
    >> Suppose you are faced with an java.io.InputStream
    >> and it is supposed to carry either HTML or XML.
    >> Ultimately you want to read with a Reader and the
    >> correct encoding, of course.

    >
    >
    > I believe XML is supposed to be UTF-8 unless it specifies otherwise
    > using an encoding attribute. But your XML parser should handle all of
    > that for you.


    If an XML document is not encoded in UTF-8 or UTF-16, then its encoding
    must be specified in the XML declaration, true. If you don't know from
    some external source what the encoding is, however, then you may not be
    able to decode the XML declaration to find out. Many common cases can be
    handled without too much trouble, but I don't know any universal solution.

    --
    John Bollinger
     
    John C. Bollinger, Jun 6, 2005
    #6
  7. Dale King

    Dale King Guest

    Re: the right way to detect encoding used in InputStream carrying HTML or XML

    John C. Bollinger wrote:
    > Dale King wrote:
    >
    >> HK wrote:
    >>
    >>> Suppose you are faced with an java.io.InputStream
    >>> and it is supposed to carry either HTML or XML.
    >>> Ultimately you want to read with a Reader and the
    >>> correct encoding, of course.

    >>
    >>
    >>
    >> I believe XML is supposed to be UTF-8 unless it specifies otherwise
    >> using an encoding attribute. But your XML parser should handle all of
    >> that for you.

    >
    >
    > If an XML document is not encoded in UTF-8, then its encoding must be
    > specified in the XML declaration, true. If you don't know from some
    > external source what the encoding is, however, then you may not be able
    > to decode the XML declaration to find out. Many common cases can be
    > handled without too much trouble, but I don't know any universal solution.


    See appendix F of the XML spec.:
    http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing
    --
    Dale King
     
    Dale King, Jun 7, 2005
    #7
  8. Re: the right way to detect encoding used in InputStream carrying HTML or XML

    Dale King wrote:

    > John C. Bollinger wrote:


    >> If an XML document is not encoded in UTF-8, then its encoding must be
    >> specified in the XML declaration, true. If you don't know from some
    >> external source what the encoding is, however, then you may not be
    >> able to decode the XML declaration to find out. Many common cases can
    >> be handled without too much trouble, but I don't know any universal
    >> solution.

    >
    >
    > See appendix F of the XML spec.:
    > http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing


    I am well aware of that; I was alluding to it when I wrote that many
    common cases can be handled. It doesn't even come close to covering
    _all_ the infinitely many possibilities, however. For the sake of
    argument only, I point out that no matter what autodetection algorithm
    you devise, I can produce an encoding that breaks it. In practice, such
    intentionally perverse encodings are less of an issue than possible real
    encodings that accidentally happen to confound existing algorithms. It
    may be that the procedure described in appendix F suffices for any
    particular purpose, but no one should be fooled into thinking that it is
    universal.

    --
    John Bollinger
     
    John C. Bollinger, Jun 7, 2005
    #8
