Character encodings and invalid characters

Discussion in 'Java' started by Safalra, Jun 14, 2004.

  1. Safalra

    Safalra Guest

    [Crossposted as the questions to each group might sound a little
    strange without context; trim groups if necessary]

    The idea here is relatively simple: a Java program (I'm using JDK 1.4,
    if that makes a difference) that loads an HTML file, removes invalid
    characters (or replaces them in the case of common ones like
    Microsoft's 'smartquotes'), and outputs the file.

    The problem is these files will be on disk, so the program won't have
    the character encoding information from the server.

    Questions:

    1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
    the byte order marks. How does it identify other encodings? Will it
    just assume the system default encoding until it finds bytes that
    imply UTF-8? The program will mainly deal with UTF-16, UTF-8,
    ISO-8859-1 and US-ASCII, but others may occur. (There's a sketch of
    the kind of loading code I mean after these questions.)

    2) I'm slightly confused by the HTML specification - are the valid
    characters precisely those that are defined in Unicode? (Java
    internally works with 16-bit characters.) (I'm ignoring at this point
    characters that in HTML need escaping.)

    3) If it fails on esoteric character encodings, how badly is it likely
    to fail? Will it totally trash the HTML?
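
    For reference, here is a minimal sketch of the kind of loading code I
    mean (JDK 1.4's java.nio.charset; the charset named here is just a
    placeholder - the whole question is how to choose it):

        import java.io.*;
        import java.nio.charset.Charset;

        public class LoadHtml {
            public static void main(String[] args) throws IOException {
                // Java decodes with exactly the charset named here (or the
                // platform default if none is given) - nothing is sniffed.
                Charset cs = Charset.forName("ISO-8859-1"); // placeholder
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(args[0]), cs));
                for (String line; (line = in.readLine()) != null; ) {
                    System.out.println(line); // filtering would go here
                }
                in.close();
            }
        }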

    --
    Safalra (Stephen Morley)
    http://www.safalra.com/
    Safalra, Jun 14, 2004
    #1

  2. On Mon, 14 Jun 2004, Safalra wrote:

    > Questions:
    >
    > 1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
    > the byte order marks. How does it identify other encodings?


    [I can't answer that, but the use of a BOM is permissible in utf-8
    although it's not required. Actually, if I may be pedantic for a
    moment, utf-16BE and utf-16LE don't use a BOM - the endianness is
    specified by the name of the encoding; utf-16 uses a BOM, and by
    looking at the BOM you work out for yourself whether it's LE or BE.]
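
    To make that concrete, a sniffing sketch (mine - nothing Java does for
    you automatically) that checks the first bytes for a BOM:

        public class BomSniffer {
            // Look at the first bytes for a BOM. Per the pedantry above, a
            // 16-bit BOM means "UTF-16" proper (Java's "UTF-16" decoder
            // reads the endianness from the BOM itself), not UTF-16BE/LE.
            static String sniffBom(byte[] b) {
                if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                        && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
                    return "UTF-8";  // note: Java won't strip this BOM
                if (b.length >= 2
                        && ((b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF
                         || (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE))
                    return "UTF-16"; // decoder works out the byte order
                return null;         // no BOM - you need other evidence
            }
        }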

    Coming back to utf-8: unless it's entirely us-ascii in which case you
    can't tell the difference, there are validity criteria, and the more
    of it you get which meet the criteria, the more confident you can be
    that it really is utf-8. Just one single violation of the criteria is
    enough to rule that possibility out, and the Unicode rules *mandate*
    refusing to process the document further, for security reasons.
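
    In Java terms that test is easy to sketch (JDK 1.4's java.nio; a
    strict decoder throws on the very first violation):

        import java.nio.ByteBuffer;
        import java.nio.charset.*;

        public class Utf8Check {
            // True only if every byte sequence is well-formed UTF-8.
            static boolean looksLikeUtf8(byte[] data) {
                CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
                try {
                    dec.decode(ByteBuffer.wrap(data)); // throws on violation
                    return true;
                } catch (CharacterCodingException e) {
                    return false;
                }
            }
        }

    Remember the caveat above: pure us-ascii input also passes, so "true"
    here means "consistent with utf-8", not "certainly utf-8".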

    > Will it just assume the system default encoding until it finds bytes
    > that imply UTF-8? The program will mainly deal with UTF-16, UTF-8,
    > ISO-8859-1 and US-ASCII, but others may occur.


    Right, but define "others". Are you going to deal with any character
    encodings which define characters that don't exist in Unicode - e.g.
    Klingon?

    You certainly aren't going to be able to guess 8-bit character
    encodings just by looking at them - you absolutely do, in general,
    need some external source of wisdom on what character coding you are
    dealing with. *Some* character encodings can be guessed, at least on
    plausibility grounds.

    > 2) I'm slightly confused by the HTML specification - are the valid
    > characters precisely those that are defined in Unicode?


    With the greatest of respect, you seem to be putting the cart before
    the horse. First you say you intend to remove invalid characters, and
    then it becomes clear that you're not sure how to define what they
    are. :-}

    I'm assuming that there's some substantive issue behind your problem,
    but I'm afraid you're not expressing it in terms from which I can be
    confident I understand what you're trying to achieve. Recall
    that there are in general three ways of representing characters in
    HTML:

    1. coded characters in the appropriate character encoding
    2. numerical character references, &#nnn; (decimal) or &#xhhh; (hex), or
    3. character entity references &name; for those characters which have
    them.

    Can you address what you propose to do with each of these when you
    find them?
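
    (For what it's worth, forms 2 and 3 are easy to spot mechanically - a
    rough sketch with patterns of my own devising, not taken from any
    spec:)

        import java.util.regex.*;

        public class RefForms {
            // Rough patterns for forms 2 and 3 above (not spec-exact):
            static final Pattern NUMERIC =
                Pattern.compile("&#(?:[0-9]+|x[0-9a-fA-F]+);");
            static final Pattern NAMED =
                Pattern.compile("&[a-zA-Z][a-zA-Z0-9]*;");

            public static void main(String[] args) {
                String html = "caf\u00e9 &#233; &eacute;"; // forms 1, 2, 3
                System.out.println(NUMERIC.matcher(html).find()); // true
                System.out.println(NAMED.matcher(html).find());   // true
            }
        }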

    > (I'm ignoring at this point characters that in HTML need escaping.)


    Hmmm? Are you referring to the use of &-notations here, or something
    else?

    > 3) If it fails on esoteric character encodings, how badly is it likely
    > to fail? Will it totally trash the HTML?


    Best answer I can give to that is that the HTML markup itself uses
    nothing more than plain us-ascii repertoire. If you can't recognise
    at least that repertoire in the original encoding, then you're going
    to do worse than trash only the HTML, no?

    good luck
    Alan J. Flavell, Jun 14, 2004
    #2

  3. Roedy Green

    Roedy Green Guest

    On 14 Jun 2004 09:48:55 -0700, (Safalra) wrote or
    quoted :

    >1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
    >the byte order marks. How does it identify other encodings?


    You have to ask the user. You can find out the default encoding on his
    machine, but that's as good as it gets. People never thought to mark
    documents with the encoding or record it in a resource fork.

    You can take the same document and interpret it many ways. It would
    require almost AI to figure out which was the most likely encoding.

    You could do it by comparing letter frequencies to averages from
    samples.
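
    If anyone wants to experiment, here's a crude sketch of that idea (my
    own toy scoring, nothing rigorous): decode the bytes with each
    candidate encoding and prefer whichever yields the most "ordinary"
    text.

        import java.nio.ByteBuffer;
        import java.nio.charset.*;

        public class GuessCharset {
            // Toy heuristic: score each candidate by the share of letters,
            // digits and whitespace in the decoded text; strict decoders
            // reject candidates outright on malformed input.
            static String guess(byte[] data, String[] candidates) {
                String best = null;
                double bestScore = -1.0;
                for (int i = 0; i < candidates.length; i++) {
                    try {
                        CharsetDecoder dec =
                            Charset.forName(candidates[i]).newDecoder()
                                .onMalformedInput(CodingErrorAction.REPORT);
                        String text =
                            dec.decode(ByteBuffer.wrap(data)).toString();
                        int ok = 0;
                        for (int j = 0; j < text.length(); j++) {
                            char c = text.charAt(j);
                            if (Character.isLetterOrDigit(c)
                                    || Character.isWhitespace(c)) ok++;
                        }
                        double score = text.length() == 0
                            ? 0.0 : (double) ok / text.length();
                        if (score > bestScore) {
                            bestScore = score;
                            best = candidates[i];
                        }
                    } catch (Exception e) {
                        // malformed for this encoding - candidate rejected
                    }
                }
                return best;
            }
        }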


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Jun 14, 2004
    #3
  4. Safalra wrote:
    > [Crossposted as the questions to each group might sound a little
    > strange without context; trim groups if necessary]
    >
    > The idea here is relatively simple: a Java program (I'm using JDK 1.4,
    > if that makes a difference) that loads an HTML file, removes invalid
    > characters (or replaces them in the case of common ones like
    > Microsoft's 'smartquotes'), and outputs the file.


    Sounds like you want to re-invent JTidy or HTML Tidy (google is your
    friend).

    > The problem is these files will be on disk, so the program won't have
    > the character encoding information from the server.


    If you are lucky, there is a charset entry at the beginning of the HTML.
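
    A rough sketch of fishing that entry out (the pattern is my own;
    real-world HTML is messier than this):

        import java.util.regex.*;

        public class MetaCharset {
            // Scan the start of the file, decoded with an ASCII-compatible
            // charset, for an HTML 4 style meta charset declaration.
            static final Pattern META = Pattern.compile(
                "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9._:-]*)",
                Pattern.CASE_INSENSITIVE);

            static String find(String head) {
                Matcher m = META.matcher(head);
                return m.find() ? m.group(1) : null;
            }

            public static void main(String[] args) {
                String head = "<meta http-equiv=\"Content-Type\" "
                    + "content=\"text/html; charset=ISO-8859-1\">";
                System.out.println(find(head)); // ISO-8859-1
            }
        }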

    > 1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
    > the byte order markers.


    No, it doesn't look at them. I have posted pseudo-code in this group
    to do at least that (and someone later posted an implementation;
    search an archive of this group), but this won't help you: you'd be
    assuming the file had been correctly saved by the browser or some
    other tool, which in turn assumes the browser had that information.

    > How does it identify other encodings?


    It doesn't. It can't.

    > 2) I'm slightly confused by the HTML specification - are the valid
    > characters precisely those that are defined in Unicode? (Java
    > internally works with 16-bit characters.) (I'm ignoring at this point
    > characters that in HTML need escaping.)


    There are the specs, and there is what people really put into web pages.
    And they put everything in it, really everything.

    > 3) If it fails on esoteric character encodings, how badly is it likely
    > to fail? Will it totally trash the HTML?


    It can. It depends on the contents of the page. E.g. UTF-8 is
    indistinguishable from US-ASCII if only ASCII characters are used in
    the HTML (the UTF-8 encoding of these characters happens to be the
    same). So if you pick the wrong encoding in this case, you won't see a
    difference at all. But if there are non-ASCII characters in the UTF-8
    data, and you decode it as US-ASCII, you get strange additional
    characters. The number of such characters depends entirely on the
    contents. And if you, e.g., misinterpret Shift-JIS as US-ASCII, you
    will most likely see only strange things.
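
    A tiny demonstration of that effect - the same bytes decoded two ways
    (misread here as ISO-8859-1 rather than US-ASCII, so each stray byte
    shows up as one wrong character instead of a replacement):

        public class MojibakeDemo {
            public static void main(String[] args) throws Exception {
                String original = "na\u00efve";           // 'ï' is non-ASCII
                byte[] utf8 = original.getBytes("UTF-8"); // 6 bytes
                // Misread the UTF-8 bytes as ISO-8859-1: each byte becomes
                // one character, so the two-byte 'ï' comes out as "Ã¯".
                System.out.println(new String(utf8, "ISO-8859-1")); // naÃ¯ve
            }
        }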

    /Thomas
    Thomas Weidenfeller, Jun 15, 2004
    #4
  5. Safalra

    Safalra Guest

    "Alan J. Flavell" <> wrote in message news:<>...
    > On Mon, 14 Jun 2004, Safalra wrote:
    > > 2) I'm slightly confused by the HTML specification - are the valid
    > > characters precisely those that are defined in Unicode?

    >
    > With the greatest of respect, you seem to be putting the cart before
    > the horse. First you say you intend to remove invalid characters, and
    > then it becomes clear that you're not sure how to define what they
    > are. :-}
    >
    > I'm assuming that there's some substantive issue behind your problem,
    > but I'm afraid you're not expressing it in terms from which I can be
    > confident I understand what you're trying to achieve.


    Okay, I guess I should have given more detail:

    I wrote my dissertation on the subject of automated neatening of HTML.
    As part of this I wrote a Java program to demonstrate what could be
    done. It removed or replaced invalid characters, attributes and
    elements, turned presentational elements and attributes into CSS, and
    replaced many tables used for layout purposes (and some framesets)
    with divs and CSS. It worked surprisingly well, but I only had to test
    it on ISO-8859-1 documents. I worked out the invalid characters just
    by feeding them into the W3C Validator, and for the ones that were
    invalid but rendered under Windows (like smartquotes) I replaced them
    with valid equivalents.

    Once I've worked the program into a more presentable state, I'd like
    to release it (GPL'd, of course). The problem is, I've got no idea
    what would happen if, say, a Japanese person ran it on some Japanese
    HTML source on their hard disk - I've never used a foreign character
    encoding, so I don't even know how their text editors figure out the
    encoding. I was wondering if Java assumes the system default (unless
    it encounters Unicode), and hence whether the program would still
    work. (I assume that people would usually use the same character
    encoding for their system and their HTML?)

    > Recall
    > that there are in general three ways of representing characters in
    > HTML:
    >
    > 1. coded characters in the appropriate character encoding
    > 2. numerical character references, &#nnn; (decimal) or &#xhhh; (hex), or
    > 3. character entity references &name; for those characters which have
    > them.
    >
    > Can you address what you propose to do with each of these when you
    > find them?


    1. That's the one I'm asking about. :)

    Assuming I can get around the character encoding problems:

    2. If I understand the specification correctly, these refer to UCS
    code positions, so I just need to check whether the position is
    defined in Unicode.
    3. I just need to check whether these are defined in the
    specification.

    If occurrences of (2) and (3) are valid, the program will just output
    them in the same form.

    > > (I'm ignoring at this point characters that in HTML need escaping.)

    >
    > Hmmm? Are you referring to the use of &-notations here,


    Yes, but now we've discussed them above...

    --
    Safalra (Stephen Morley)
    http://www.safalra.com/
    Safalra, Jun 15, 2004
    #5
  6. Safalra wrote:
    > to release it (GPL'd, of course). The problem is, I've got no idea
    > what would happen if, say, a Japanese person ran it on some Japanese
    > HTML source on their hard disk - I've never used a foreign character
    > encoding, so I don't even know how their text editors figure out the
    > encoding.


    They assume it by convention, usually. This can (and does) go wrong.

    > I was wondering if Java assumes the system default
    > (unless it encounters Unicode)


    Java *always* assumes text is in the system default encoding unless
    given an explicit encoding. Unicode does not play into it.
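
    You can see which default that is; note too that FileReader (as of
    JDK 1.4) offers no way at all to override it, so for anything else you
    must wrap an InputStreamReader yourself:

        import java.io.*;

        public class DefaultCharsetDemo {
            public static void main(String[] args) throws IOException {
                // The platform default that Java silently applies:
                System.out.println(System.getProperty("file.encoding"));
                // FileReader always uses that default; to control the
                // encoding, name it explicitly like this instead:
                Reader r = new InputStreamReader(
                    new FileInputStream(args[0]), "UTF-8");
                r.close();
            }
        }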

    Also, do remember that in theory, all HTML documents should declare
    their encoding explicitly, or have it supplied by the server in
    the header. In XHTML, the explicit declaration is in fact mandatory.

    But overall, text encoding is a horribly complex, muddled mess of
    legacy conventions, incompatibilities, hacks and workarounds. Most
    of the time, it breaks down horribly as soon as you cross a language
    barrier.
    Michael Borgwardt, Jun 15, 2004
    #6
  7. On Tue, 15 Jun 2004, Safalra wrote:

    > I wrote my dissertation on the subject of automated neatening of HTML.

    [...]
    > with divs and CSS. It worked surprisingly well, but I only had to test
    > it on ISO-8859-1 documents. I worked out the invalid characters just
    > by feeding them into the W3C Validator,


    I think I'm going to have to stand firm, and say that you really need
    to make the effort and cross the threshold of understanding the HTML
    character model in order to grasp what's behind this, otherwise you'd
    risk blundering on in a heuristic fashion without a robust mental
    picture of what's involved.

    This note makes no attempt to be a full tutorial on that, but just
    races through some key headings to see whether you can be persuaded to
    read the background and get up to speed.

    All of the characters from 0 to 31 decimal, and all of the characters
    from 127(sic) to 159 decimal, in the Document Character Set, are
    defined to be control characters, and almost all of them are excluded
    from use in HTML. These are the characters which are declared to be
    "invalid" by the specification (and by the validator).

    What's the "Document Character Set"? Well, in HTML2 it was
    iso-8859-1, and in HTML4 it was defined to be iso-10646 as amended.
    Loosely, you can read "iso-10646 as amended" as being the character
    model of Unicode. As far as the values from 0 to 255 are concerned,
    iso-8859-1 and iso-10646 are identical.

    How is this related to the external character encoding? Well, the
    character model that was introduced in RFC2070 and embodied in HTML4
    is based on the concept that the external encoding is converted into
    iso-10646/unicode prior to any other processing being done. It
    doesn't require implementations to work in that way internally, but it
    _does_ mandate that they give that impression externally (black box
    model).

    So from HTML's point of view, if you have a document which is coded in
    say Windows-1252, including those pretty quotes, then (as long as the
    recipient consents - see the HTTP Accept-charset) it's perfectly
    legal. All you need to do is apply the appropriate code mapping that
    you find at the Unicode site, and get the resulting Unicode character.

    Resources at http://www.unicode.org/Public/MAPPINGS/ , in this case
    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
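
    Java in fact ships that very mapping as a charset, so applying it is a
    one-liner (a tiny demonstration; the two byte values are Windows
    "smart quotes"):

        public class Cp1252Demo {
            public static void main(String[] args) throws Exception {
                // 0x93/0x94 are left/right double quotation marks in
                // windows-1252; Java's decoder applies the mapping table
                // cited above.
                byte[] smart = { (byte) 0x93, 'h', 'i', (byte) 0x94 };
                String s = new String(smart, "windows-1252");
                System.out.println((int) s.charAt(0)); // 8220 = U+201C
                System.out.println((int) s.charAt(3)); // 8221 = U+201D
            }
        }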


    > and for the ones that were invalid but rendered under Windows (like
    > smartquotes) I replaced those with valid equivalents.


    What you're talking about here is probably a document which in reality
    is coded in Windows-1252 but erroneously claims to be - or is
    mistakenly presumed to be - iso-8859-1 (or its equivalent in other
    locales).

    There's nothing inherently wrong with these particular octet values
    (128-159 decimal) *in those codings which assign them to printable
    characters* (that's not only all of the Windows-125x codings, but also
    koi8-r and some other less-usual codings).

    What's wrong is when those octet values occur in codings which define
    them to be control characters which are not used in HTML.

    > Once I've worked the program into a more presentable state, I'd like
    > to release it (GPL'd, of course). The problem is, I've got no idea
    > what would happen if, say, a Japanese person ran it on some Japanese
    > HTML source on their hard disk - I've never used a foreign character
    > encoding, so I don't even know how their text editors figure out the
    > encoding.


    Sadly, quite a number of language locales simply *assume* that their
    local coding applies. Try looking at such a file on a system that's
    set for a different locale, and you'll get rubbish. Although it's
    sometimes possible to guess (look at the automatic charset selection
    in, say, Mozilla for examples of what can be done heuristically).

    OK, I've done the HTML part of this. I'm not a regular Java user so
    I'm leaving that to others.

    > > Recall
    > > that there are in general three ways of representing characters in
    > > HTML:
    > >
    > > 1. coded characters in the appropriate character encoding
    > > 2. numerical character references, &#nnn; (decimal) or &#xhhh; (hex), or
    > > 3. character entity references &name; for those characters which have
    > > them.
    > >
    > > Can you address what you propose to do with each of these when you
    > > find them?

    >
    > 1. That's the one I'm asking about. :)


    Thanks - I did want to be sure about that first.

    [Don't make the mistake of confusing an 8-bit character of value 151
    decimal (in some specified 8-bit encoding), on the one hand, with the
    undefined(HTML)/illegal(XML) notation &#151; on the other hand.]

    > 2. If I understand the specification correctly, these refer to UCS
    > code positions,


    basically yes, modulo some possible nit picking about high/low
    surrogates and stuff, that I don't want to go into here.

    > so I just need to check whether the position is defined
    > in Unicode.


    Er, not quite. Those control characters are certainly *defined*, but
    they are excluded from use in HTML by the "SGML declaration for HTML",
    and from XHTML by the rules of XML.

    And on the other hand I don't think an as-yet-unassigned Unicode code
    point is actually invalid for use in (X)HTML. Try it and see what the
    validator says?

    hope this helps a bit. The writeup of the HTML character model in the
    relevant part of the HTML4 spec and/or RFC2070 is not bad, I'd suggest
    giving it a try. There's also some material at
    http://ppewww.ph.gla.ac.uk/~flavell/charset/ which some folks have
    found helpful.
    Alan J. Flavell, Jun 15, 2004
    #7
  8. Roedy Green

    Roedy Green Guest

    On Mon, 14 Jun 2004 20:38:09 GMT, Roedy Green
    <> wrote or quoted :

    >You have to ask the user. You can find out the default encoding on his
    >machine, but that's as good as it gets. People never thought to mark
    >documents with the encoding or record it in a resource fork.


    for more info see
    http://mindprod.com/jgloss/encoding.html#IDENTIFICATION

    I am working up a student project to solve this problem.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Jun 15, 2004
    #8
  9. Roedy Green

    Roedy Green Guest

    On Tue, 15 Jun 2004 21:59:54 GMT, Roedy Green
    <> wrote or quoted :

    >>You have to ask the user. You can find out the default encoding on his
    >>machine, but that's as good as it gets. People never thought to mark
    >>documents with the encoding or record it in a resource fork.

    >
    >for more info see
    >http://mindprod.com/jgloss/encoding.html#IDENTIFICATION
    >
    >I am working up a student project to solve this problem.


    see http://mindprod.com/projects/encodingidentification.html

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
    Roedy Green, Jun 15, 2004
    #9
