How to tell character encoding?

Discussion in 'Java' started by Aaron Fude, Aug 30, 2009.

  1. Aaron Fude

    Aaron Fude Guest

    Hi,

    This is not a java question, but I have a java application in mind and
    this is the only place where I get my computer questions answered.

    Suppose I see text on a webpage in a foreign language, e.g. French. This
    page, for example:

    http://www.gabay.com/sources/Liste_Fiche.asp?CV=117

    How can I determine the encoding used for the foreign text? What's the
    easiest way? If it's nonstandard, how do I convert it to something more
    standard? (Like a Unicode.)

    Many thanks in advance,

    Aaron
     
    Aaron Fude, Aug 30, 2009
    #1

  2. Aaron Fude wrote:
    > How can I determine the encoding used for the foreign text? What's the
    > easiest way? If it's nonstandard, how do I convert it to something more
    > standard? (Like a Unicode.)


    Determining encodings with 100% accuracy is impossible. The easiest way
    to figure out the encoding of a page is to find the associated metadata
    that declares it. For example, HTTP provides a header (Content-Type)
    that carries the charset, and so do most MIME messages (your
    email was sent as ISO-8859-1, I can confirm). That, of course, assumes
    that the server is sending its data correctly, which is not necessarily
    a safe assumption.

    In cases such as HTTP or email, the library you use is probably smart
    enough to find the charset metadata and handle that information for you.
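
    If you do end up reading the header yourself, the lookup might be
    sketched roughly as below. The class and method names are made up for
    illustration, and the parsing is deliberately simple: it assumes a
    value of the form "type/subtype; charset=name" (as returned, for
    example, by URLConnection.getContentType()) and does not handle every
    legal header syntax.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper: extracts the charset parameter from a Content-Type value. */
final class ContentTypeCharset {

    static Optional<Charset> parse(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Parameters follow the media type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: treat it as not declared.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("text/html; charset=ISO-8859-1"));
        System.out.println(parse("text/html"));
    }
}
```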

    Now suppose no one gives you any metadata, for example because you're
    reading a local file. In that case, there is typically a platform-default
    encoding which is used unless you say otherwise (if you pay careful
    attention in Java, it will by default treat text on English-version
    Windows as windows-1252, a close relative of ISO 8859-1, and text on
    most Linux systems as UTF-8).
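
    A small sketch of what relying on the default versus naming the charset
    looks like in Java. The class name is invented for the example; the
    printed default depends on the machine and JDK (from JDK 18 onward it
    is UTF-8 unless configured otherwise).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative only: shows the platform default and an explicit decode. */
final class DefaultCharsetDemo {

    /** Decodes bytes with an explicitly named charset instead of the default. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Whatever the JVM picked up from the platform or its configuration.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // The single byte 0xE9 is 'é' in ISO 8859-1 but an incomplete sequence in UTF-8.
        byte[] raw = {(byte) 0xE9};
        System.out.println("As ISO 8859-1: " + decode(raw, StandardCharsets.ISO_8859_1));
    }
}
```

    Passing the charset explicitly keeps the result the same on every
    platform, which is usually what you want once you know the encoding.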

    You can always try to do statistical analysis to guess which encoding is
    correct. This is mainly useful for deciding between two encodings, such
    as ISO 8859-1 and UTF-8. If you have text which is not valid UTF-8, then
    obviously it cannot be UTF-8; if you always have multiple high-bit
    sequences in a row, it's more likely UTF-8 than ISO 8859-1 (you can
    generally tell when UTF-8 is being misinterpreted as the latter, as you
    will see stuff like é); if you have no high bits set, it doesn't
    matter. Unless you have EBCDIC, but I'm going to discount that possibility.
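
    The UTF-8 versus ISO 8859-1 decision described above could be sketched
    like this. It is only the crude two-way check, not real statistical
    analysis, and the class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical two-way guesser: US-ASCII, UTF-8 or ISO-8859-1. */
final class Utf8Check {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns true if any byte has its high bit set (value 0x80 or above). */
    static boolean hasHighBit(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {
                return true;
            }
        }
        return false;
    }

    static String guess(byte[] bytes) {
        if (!hasHighBit(bytes)) {
            return "US-ASCII"; // no high bits: the choice does not matter
        }
        // Invalid UTF-8 cannot be UTF-8; valid multi-byte runs are unlikely to be Latin-1.
        return isValidUtf8(bytes) ? "UTF-8" : "ISO-8859-1";
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(utf8));
        // Misreading UTF-8 as ISO 8859-1 produces the two-character garbage mentioned above.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1));
    }
}
```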

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
     
    Joshua Cranmer, Aug 30, 2009
    #2

  3. Aaron Fude <> writes:

    > Suppose I see text on a webpage in a foreign language, e.g. French. This
    > page, for example:
    > http://www.gabay.com/sources/Liste_Fiche.asp?CV=117
    > How can I determine the encoding used for the foreign text? What's the


    Look at the <head> element; there should be a meta element with
    the attribute http-equiv="Content-Type" and a content attribute giving
    the charset. If there isn't, I'd expect the contents to be iso-8859-1.
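
    A rough way to pull that declaration out of the raw bytes in Java is
    sketched below. It assumes the page is in an ASCII-compatible encoding
    (so it will not work for UTF-16), uses a regular expression rather
    than a real HTML parser, and the class name is made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sniffer for a charset declared in a <meta> element. */
final class MetaCharsetSniffer {

    // Finds charset=NAME inside a <meta ...> tag, with or without quotes around NAME.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    static Optional<String> sniff(byte[] pageStart) {
        // ISO-8859-1 maps every byte to a char, so this decode never fails
        // and leaves the ASCII markup readable.
        String text = new String(pageStart, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=iso-8859-1\"></head></html>";
        System.out.println(sniff(html.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```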

    --
    Jukka Lahtinen
     
    Jukka Lahtinen, Aug 30, 2009
    #3
  4. Arne Vajhøj

    Arne Vajhøj Guest

    Aaron Fude wrote:
    > This is not a java question, but I have a java application in mind and
    > this is the only place where I get my computer questions answered.
    >
    > Suppose I see text on a webpage in a foreign language, e.g. French. This
    > page, for example:
    >
    > http://www.gabay.com/sources/Liste_Fiche.asp?CV=117
    >
    > How can I determine the encoding used for the foreign text? What's the
    > easiest way? If it's nonstandard, how do I convert it to something more
    > standard? (Like a Unicode.)


    For web pages use the following logic:

    if encoding specified in META tag then
        use that
    else if encoding specified in HTTP header then
        use that
    else
        use ISO-8859-1
    end

    Arne
     
    Arne Vajhøj, Aug 30, 2009
    #4
  5. Arne Vajhøj

    Arne Vajhøj Guest

    Arne Vajhøj wrote:
    > Aaron Fude wrote:
    >> This is not a java question, but I have a java application in mind and
    >> this is the only place where I get my computer questions answered.
    >>
    >> Suppose I see text on a webpage in a foreign language, e.g. French.
    >> This page, for example:
    >>
    >> http://www.gabay.com/sources/Liste_Fiche.asp?CV=117
    >>
    >> How can I determine the encoding used for the foreign text? What's the
    >> easiest way? If it's nonstandard, how do I convert it to something
    >> more standard? (Like a Unicode.)

    >
    > For web pages use the following logic:
    >
    > if encoding specified in META tag then
    > use that
    > else if encoding specified in HTTP header then
    > use that
    > else
    > use ISO-8859-1
    > end


    I think I have some Java code to do it if you are
    interested.

    Arne
     
    Arne Vajhøj, Aug 30, 2009
    #5
  6. Roedy Green

    Roedy Green Guest

    On Sun, 30 Aug 2009 15:03:18 -0400, Aaron Fude <>
    wrote, quoted or indirectly quoted someone who said :

    >How can I determine the encoding used for the foreign text? What's the
    >easiest way? If it's nonstandard, how do I convert it to something more
    >standard? (Like a Unicode.)


    I wrote a utility to assist. It still requires guessing. See
    http://mindprod.com/applet/encodingrecogniser.html

    See http://mindprod.com/jgloss/encoding.html
    for info on how to convert.

    There is native2ascii, which, used twice (once forward and once with
    -reverse, each with its own -encoding), converts.

    You can write a little utility to read/write. See
    http://mindprod.com/applet/fileio.html
    for the code.

    You can use HunkIO.readEntireFile to read the file in one fell swoop
    with one encoding and write it with another. See
    http://mindprod.com/products1.html#HUNKIO
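
    For comparison, the same read-with-one-encoding, write-with-another
    idea can be sketched with the plain JDK. This is not the HunkIO code;
    the class and method names are invented, and malformed input is
    silently replaced with U+FFFD rather than reported.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical transcoder: reads a whole file in one charset, writes it in another. */
final class Transcode {

    static void convert(Path source, Charset from, Path target, Charset to) throws IOException {
        // Decode all bytes using the source encoding...
        String text = new String(Files.readAllBytes(source), from);
        // ...then encode the resulting characters in the target encoding.
        Files.write(target, text.getBytes(to));
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("latin1", ".txt");
        Path out = Files.createTempFile("utf8", ".txt");
        Files.write(in, "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1));
        convert(in, StandardCharsets.ISO_8859_1, out, StandardCharsets.UTF_8);
        System.out.println(Files.size(in) + " bytes in, " + Files.size(out) + " bytes out");
    }
}
```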
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "Simplicity is prerequisite for reliability,"
    ~ Edsger Wybe Dijkstra (born: 1930-05-11 died: 2002-08-06 at age: 72)
     
    Roedy Green, Aug 30, 2009
    #6
  7. rossum wrote:
    > On Sun, 30 Aug 2009 17:41:45 -0400, Arne Vajhøj <>
    > wrote:
    >
    >> Arne Vajhøj wrote:
    >>> Aaron Fude wrote:
    >>>> This is not a java question, but I have a java application in mind and
    >>>> this is the only place where I get my computer questions answered.
    >>>>
    >>>> Suppose I see text on a webpage in a foreign language, e.g. French.
    >>>> This page, for example:
    >>>>
    >>>> http://www.gabay.com/sources/Liste_Fiche.asp?CV=117
    >>>>
    >>>> How can I determine the encoding used for the foreign text? What's the
    >>>> easiest way? If it's nonstandard, how do I convert it to something
    >>>> more standard? (Like a Unicode.)
    >>> For web pages use the following logic:
    >>>
    >>> if encoding specified in META tag then
    >>> use that
    >>> else if encoding specified in HTTP header then
    >>> use that

    > else if Byte Order Mark (BOM) present then
    > use that


    You could do that.

    But note that the two BOM bytes are valid bytes in ISO-8859-1.

    The chances of these two coming as the first two bytes in
    a file are extremely small, but it is possible.
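
    A BOM check along these lines might look like the sketch below. The
    names are hypothetical, only the UTF-8 and UTF-16 marks are
    recognised (UTF-32 is ignored), and the main method shows the
    ambiguity just mentioned: FE FF is also a legal two-character
    ISO-8859-1 text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical BOM sniffer for UTF-8 and UTF-16 only. */
final class BomSniffer {

    static Optional<Charset> detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ambiguous = {(byte) 0xFE, (byte) 0xFF};
        System.out.println(detect(ambiguous));
        // The same two bytes read as ISO-8859-1 are the letters thorn and y-diaeresis.
        System.out.println(new String(ambiguous, StandardCharsets.ISO_8859_1));
    }
}
```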

    >>> else
    >>> use ISO-8859-1
    >>> end

    >> I think I have some Java code to do it if you are
    >> interested.


    Arne
     
    Arne Vajhøj, Aug 31, 2009
    #7
  8. Arne Vajhøj wrote:

    >> else if Byte Order Mark (BOM) present then
    >> use that

    >
    > You could do that.
    >
    > But note that the two BOM bytes are valid bytes in ISO-8859-1.
    >
    > The chances of these two coming as the first two bytes in
    > a file are extremely small, but it is possible.


    Especially the chances of having a BOM but no Content-Type with
    charset-attribute. OTOH, Microsoft IIS can't cope with folded
    request-headers correctly so you can't assume anything in
    this world.


    Regards, Lothar
    --
    Lothar Kimmeringer E-Mail:
    PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

    Always remember: The answer is forty-two, there can only be wrong
    questions!
     
    Lothar Kimmeringer, Aug 31, 2009
    #8
  9. markspace

    markspace Guest

    rossum wrote:

    > On Sun, 30 Aug 2009 17:41:45 -0400, Arne Vajhøj <>
    > wrote:
    >
    >> Arne Vajhøj wrote:
    >>> For web pages use the following logic:
    >>>
    >>> if encoding specified in META tag then
    >>> use that
    >>> else if encoding specified in HTTP header then
    >>> use that


    > else if Byte Order Mark (BOM) present then
    > use that



    Are you sure about that? I thought the HTTP spec said that if there
    were no meta or other content tags, then the default was ISO-8859-1.
    The BOM thing might actually make certain types of files accidentally
    incorrect, I think.


    >>> else
    >>> use ISO-8859-1
    >>> end
     
    markspace, Aug 31, 2009
    #9
  10. Steven Simpson wrote:
    > Arne Vajhøj wrote:
    >> For web pages use the following logic:
    >>
    >> if encoding specified in META tag then
    >> use that
    >> else if encoding specified in HTTP header then
    >> use that
    >> else
    >> use ISO-8859-1
    >> end

    >
    > I think you're supposed to check HTTP before <meta>, at least for HTML:
    >
    >> To sum up, conforming user agents must observe the following
    >> priorities when determining a document's character encoding (from
    >> highest priority to lowest):
    >>
    >> 1. An HTTP "charset" parameter in a "Content-Type" field.
    >> 2. A META declaration with "http-equiv" set to "Content-Type" and a
    >> value set for "charset".
    >> 3. The charset attribute set on an element that designates an
    >> external resource.
    >>

    >
    > <http://www.w3.org/TR/html4/charset.html#h-5.2.2>


    Ooops.

    You are correct.

    I guess I only tested META tag with no HTTP header.
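
    With the order corrected to match the HTML 4 priorities quoted above,
    the selection logic could be sketched as follows. The names are
    invented for the example, the two inputs are assumed to have been
    extracted already (null meaning "not declared"), and the third HTML 4
    item (a charset attribute on a linking element) is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical chooser: HTTP header first, then META, then ISO-8859-1. */
final class CharsetPrecedence {

    static Charset choose(String httpCharset, String metaCharset) {
        Charset fromHttp = lookup(httpCharset);
        if (fromHttp != null) {
            return fromHttp; // highest priority: Content-Type header
        }
        Charset fromMeta = lookup(metaCharset);
        if (fromMeta != null) {
            return fromMeta; // next: META http-equiv declaration
        }
        return StandardCharsets.ISO_8859_1; // fallback used throughout this thread
    }

    /** Returns null when the name is absent or not a charset this JVM supports. */
    private static Charset lookup(String name) {
        if (name == null || name.isEmpty()) {
            return null;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(choose("UTF-8", "ISO-8859-1"));
        System.out.println(choose(null, "UTF-8"));
        System.out.println(choose(null, null));
    }
}
```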

    Arne
     
    Arne Vajhøj, Sep 1, 2009
    #10
  11. Tom Anderson

    Tom Anderson Guest

    On Mon, 31 Aug 2009, Arne Vajhøj wrote:

    > Steven Simpson wrote:
    >> Arne Vajhøj wrote:
    >>> For web pages use the following logic:
    >>>
    >>> if encoding specified in META tag then
    >>> use that
    >>> else if encoding specified in HTTP header then
    >>> use that
    >>> else
    >>> use ISO-8859-1
    >>> end

    >>
    >> I think you're supposed to check HTTP before <meta>, at least for HTML:
    >>
    >>> To sum up, conforming user agents must observe the following
    >>> priorities when determining a document's character encoding (from
    >>> highest priority to lowest):
    >>>
    >>> 1. An HTTP "charset" parameter in a "Content-Type" field.
    >>> 2. A META declaration with "http-equiv" set to "Content-Type" and a
    >>> value set for "charset".
    >>> 3. The charset attribute set on an element that designates an
    >>> external resource.
    >>>

    >>
    >> <http://www.w3.org/TR/html4/charset.html#h-5.2.2>

    >
    > Ooops.
    >
    > You are correct.
    >
    > I guess I only tested META tag with no HTTP header.


    I tentatively consider that a bug in the spec - i'd prefer a meta tag to
    be able to override the protocol header. The reason being that the server
    serving up some static content doesn't always know the charset it's in,
    but the person writing that content does.

    tom

    --
    In the long run, we are all dead. -- John Maynard Keynes
     
    Tom Anderson, Sep 1, 2009
    #11
  12. Lew

    Lew Guest

    rossum wrote:
    >> else if Byte Order Mark (BOM) present then use that

    >


    markspace wrote:
    > Are you sure about that?  I thought the HTTP spec said that if there
    > were no meta or other content tags, then the default was ISO-8859-1.
    > The BOM thing might actually make certain types of files accidentally
    > incorrect, I think.
    >


    Somebody set us up the BOM.
    <http://allyour.basearebelongto.us/AYB3.swf>

    --
    Lew
     
    Lew, Sep 1, 2009
    #12
  13. Roedy Green

    Roedy Green Guest

    On Sun, 30 Aug 2009 15:03:18 -0400, Aaron Fude <>
    wrote, quoted or indirectly quoted someone who said :

    >This is not a java question, but I have a java application in mind and
    >this is the only place where I get my computer questions answered.


    In the old days the notion you would possess a file without knowing
    what was on it would have been ludicrous. You needed a program or at
    least a detailed record layout, and all kinds of other trivia. Without
    it the file might as well be blank.

    It did not dawn on the ancient ones that every file needed a bundle of
    metadata permanently glued to it. Steve Jobs was one of the first to
    be enlightened. Even the very early Macs had data and resource forks.

    The ancient ones did not share files, except with great ceremonies
    involving lawyers. There was only one encoding within any one
    institution, so the question of what encoding was used never came up.
    Nearly all programs were written from scratch for that institution. I
    recall my bafflement on learning about VisiCalc, early word processors
    and accounting programs for the Apple][. How could the same program
    be sold uncustomised to more than one customer and still be useful?

    With global sharing of data, suddenly it became clear that we should
    have been attaching meta-information to files.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "People think of security as a noun, something you go buy. In reality, it’s an abstract concept like happiness. Openness is unbelievably helpful to security."
    ~ James Gosling (born: 1955-05-18 age: 54), inventor of Java.
     
    Roedy Green, Sep 2, 2009
    #13
  14. Steven Simpson wrote:
    > Tom Anderson wrote:
    >>> Steven Simpson wrote:
    >>>> I think you're supposed to check HTTP before <meta>, at least for
    >>>> HTML: [...]
    >>>> <http://www.w3.org/TR/html4/charset.html#h-5.2.2>

    >> I tentatively consider that a bug in the spec - i'd prefer a meta tag
    >> to be able to override the protocol header. The reason being that the
    >> server serving up some static content doesn't always know the charset
    >> it's in, but the person writing that content does.

    >
    > I know what you mean, but I think I get what the spec is trying to do,
    > i.e. allow the embedded setting to be overridden without having to alter
    > the document, perhaps following a more general principle that a
    > container should be able to override its contents.


    That may be the intention.

    But given that:
    * access to server config usually implies access to HTML files
    * access to HTML files does not imply access to server config
    then I agree with Tom that the opposite of current behavior
    would be more useful.

    Arne
     
    Arne Vajhøj, Sep 2, 2009
    #14
  15. Roedy Green

    Roedy Green Guest

    On Tue, 1 Sep 2009 20:47:14 +0100, Tom Anderson <>
    wrote, quoted or indirectly quoted someone who said :

    >I tentatively consider that a bug in the spec - i'd prefer a meta tag to
    >be able to override the protocol header. The reason being that the server
    >serving up some static content doesn't always know the charset it's in,
    >but the person writing that content does.


    Imagine something like JSP that prepares the document in 16-bit, then
    it is converted to some encoding that the user likes based on the
    request header. In this case it is possible the womb knows more than
    the program building the content about what encoding finally goes out
    the wire. The womb knows about any compression. The programmer does
    not.

    It would be best to get it right in both places. This allows the
    client to pick it up from either place safely.

    It seems to me that embedded encodings come too late. You have to know
    at least the approximate encoding before you can parse the internal
    encoding. I think it is more documentation intended for the user who
    does a view source.

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "People think of security as a noun, something you go buy. In reality, it’s an abstract concept like happiness. Openness is unbelievably helpful to security."
    ~ James Gosling (born: 1955-05-18 age: 54), inventor of Java.
     
    Roedy Green, Sep 2, 2009
    #15
  16. Arne Vajhøj

    Arne Vajhøj Guest

    Roedy Green wrote:
    > On Tue, 1 Sep 2009 20:47:14 +0100, Tom Anderson <>
    > wrote, quoted or indirectly quoted someone who said :
    >> I tentatively consider that a bug in the spec - i'd prefer a meta tag to
    >> be able to override the protocol header. The reason being that the server
    >> serving up some static content doesn't always know the charset it's in,
    >> but the person writing that content does.

    >
    > Imagine something like JSP that prepares the document in 16-bit, then
    > it is converted to some encoding that the user likes based on the
    > request header. In this case it is possible the womb knows more than
    > the program building the content about what encoding finally goes out
    > the wire.


    In JSP neither is the right way.

    In JSP it should be specified in the page directive.

    > The womb knows about any compression. The programmer does
    > not.


    Compression and charset are orthogonal.

    The charset is applied after decompression.
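
    In Java stream terms, that ordering could be sketched as below: the
    gzip layer is unwrapped first, and only the decompressed bytes are
    handed to the charset decoder. The class and method names are made up
    for the example, and the gzip helper exists only to produce test data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical demo: decompress first, then decode characters. */
final class GzipThenDecode {

    /** Compresses raw bytes; the compressor neither knows nor cares about the charset. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    /** Unwraps the gzip layer, then interprets the resulting bytes in the given charset. */
    static String decode(byte[] gzipped, Charset charset) throws IOException {
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)), charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] body = gzip("r\u00e9sum\u00e9".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(decode(body, StandardCharsets.ISO_8859_1));
    }
}
```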

    Arne
     
    Arne Vajhøj, Sep 2, 2009
    #16
  17. Arne Vajhøj wrote:
    > Steven Simpson wrote:
    >> Tom Anderson wrote:
    >>>> Steven Simpson wrote:
    >>>>> I think you're supposed to check HTTP before <meta>, at least
    >>>>> for
    >>>>> HTML: [...]
    >>>>> <http://www.w3.org/TR/html4/charset.html#h-5.2.2>
    >>> I tentatively consider that a bug in the spec - i'd prefer a meta
    >>> tag to be able to override the protocol header. The reason being
    >>> that the server serving up some static content doesn't always know
    >>> the charset it's in, but the person writing that content does.

    >>
    >> I know what you mean, but I think I get what the spec is trying to
    >> do, i.e. allow the embedded setting to be overridden without having
    >> to alter the document, perhaps following a more general principle
    >> that a container should be able to override its contents.

    >
    > That may be the intention.
    >
    > But given that:
    > * access to server config usually implies access to HTML files
    > * access to HTML files does not imply access to server config
    > then I agree with Tom that the opposite of current behavior
    > would be more useful.


    If the sender has no idea of the encoding, it shouldn't put one into
    the content type; this allows the data to identify itself. If, on the
    other hand, the sender is a program that knows damned well that it
    just converted chars to UTF-8, it needs a way to say so, overriding
    any text in the data which says that it began life as ISO-8859-1.
     
    Mike Schilling, Sep 2, 2009
    #17
  18. Arne Vajhøj

    Arne Vajhøj Guest

    Mike Schilling wrote:
    > Arne Vajhøj wrote:
    >> Steven Simpson wrote:
    >>> Tom Anderson wrote:
    >>>>> Steven Simpson wrote:
    >>>>>> I think you're supposed to check HTTP before <meta>, at least
    >>>>>> for
    >>>>>> HTML: [...]
    >>>>>> <http://www.w3.org/TR/html4/charset.html#h-5.2.2>
    >>>> I tentatively consider that a bug in the spec - i'd prefer a meta
    >>>> tag to be able to override the protocol header. The reason being
    >>>> that the server serving up some static content doesn't always know
    >>>> the charset it's in, but the person writing that content does.
    >>> I know what you mean, but I think I get what the spec is trying to
    >>> do, i.e. allow the embedded setting to be overridden without having
    >>> to alter the document, perhaps following a more general principle
    >>> that a container should be able to override its contents.

    >> That may be the intention.
    >>
    >> But given that:
    >> * access to server config usually implies access to HTML files
    >> * access to HTML files does not imply access to server config
    >> then I agree with Tom that the opposite of current behavior
    >> would be more useful.

    >
    > If the sender has no idea of the encoding, it shouldn't put one into
    > the content type; this allows the data to identify itself. If, on the
    > other hand, the sender is a program that knows damned well that it
    > just converted chars to UTF-8, it needs a way to say so, overriding
    > any text in the data which says that it began life as ISO-8859-1.


    Simple web servers usually serve files as BLOBs. They do not
    convert any charset.

    And often they set a charset for text/html.

    Arne
     
    Arne Vajhøj, Sep 3, 2009
    #18
  19. Arne Vajhøj wrote:
    > Mike Schilling wrote:
    >> Arne Vajhøj wrote:
    >>> Steven Simpson wrote:
    >>>> Tom Anderson wrote:
    >>>>>> Steven Simpson wrote:
    >>>>>>> I think you're supposed to check HTTP before <meta>, at least
    >>>>>>> for
    >>>>>>> HTML: [...]
    >>>>>>> <http://www.w3.org/TR/html4/charset.html#h-5.2.2>
    >>>>> I tentatively consider that a bug in the spec - i'd prefer a
    >>>>> meta
    >>>>> tag to be able to override the protocol header. The reason being
    >>>>> that the server serving up some static content doesn't always
    >>>>> know
    >>>>> the charset it's in, but the person writing that content does.
    >>>> I know what you mean, but I think I get what the spec is trying
    >>>> to
    >>>> do, i.e. allow the embedded setting to be overridden without
    >>>> having
    >>>> to alter the document, perhaps following a more general principle
    >>>> that a container should be able to override its contents.
    >>> That may be the intention.
    >>>
    >>> But given that:
    >>> * access to server config usually implies access to HTML files
    >>> * access to HTML files does not imply access to server config
    >>> then I agree with Tom that the opposite of current behavior
    >>> would be more useful.

    >>
    >> If the sender has no idea of the encoding, it shouldn't put one
    >> into
    >> the content type; this allows the data to identify itself. If, on
    >> the other hand, the sender is a program that knows damned well that
    >> it just converted chars to UTF-8, it needs a way to say so,
    >> overriding any text in the data which says that it began life as
    >> ISO-8859-1.

    >
    > Simple web servers usually serve files as BLOBs. They do not
    > convert any charset.


    Sure, but they're not the only HTTP clients (or servers). Say I've
    written a servlet that wants to return some XML, which I've got in
    memory as a DOM or a character string. In either case, it's
    inconvenient to figure out whether it has an XML header or, if so,
    what encoding that specifies. It's much simpler for me to serialize
    it (or convert it) to UTF-8 and put that in the content-type.

    On the other hand, I could (in theory) write a web server that accepts
    lots of odd charsets for PUTs but saves everything as UTF-8, to be
    nice to clients. It should report a content-type charset of UTF-8, and
    that should override the <meta> tag.
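
    The servlet case could be sketched roughly as follows, using only the
    JDK's XML APIs so that it runs without a servlet container. The class
    name, the sample document and the CONTENT_TYPE constant are all
    invented for illustration; in a real servlet the constant's value
    would be what gets set as the response content type.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: always serialize the DOM as UTF-8 and say so in the header. */
final class XmlAsUtf8 {

    /** Header value to send alongside the bytes produced by serialize(). */
    static final String CONTENT_TYPE = "application/xml; charset=UTF-8";

    static byte[] serialize(Document doc) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Force the output encoding instead of inspecting what the document claims.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    /** Builds a tiny in-memory document to serialize. */
    static Document sample() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("greeting");
        root.setTextContent("caf\u00e9");
        doc.appendChild(root);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        byte[] body = serialize(sample());
        System.out.println(CONTENT_TYPE + ", " + body.length + " bytes");
    }
}
```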

    >
    > And often they set a charset for text/html.


    That's wrong. But the problem is the web server's claiming knowledge
    it doesn't possess, not the spec.
     
    Mike Schilling, Sep 3, 2009
    #19