[java programming] How to detect the file encoding?

Discussion in 'Java' started by Simon, May 25, 2009.

  1. Simon

    Simon Guest

    Hi all,

    May I know if there are any possible solutions to detect the encoding
    or character set (charset) of a file automatically? Second, how do I
    convert a particular encoding to Unicode once the file encoding is
    detected?

    Thanks in advance.

    --

    regards,
    Simon
    Simon, May 25, 2009
    #1

  2. Stefan Ram

    Stefan Ram Guest

    "Peter Duniho" <> writes:
    >AFAIK, Unicode is the only commonly used encoding with a "signature" (the
    >byte-order marker, "BOM"). Detecting other encodings can be done
    >heuristically, but I'm not aware of any specific support within Java to do
    >so, and it wouldn't be 100% reliable anyway.


    The program could return a /set/ of possible encodings.
    Or a map from each encoding to its probability.
    Or just the top encoding with its probability (a reliability estimate).

    One could compile byte-value frequency statistics from many files
    in some common encodings and compare them to the byte-value
    frequencies of the given source. (Advanced: frequencies of
    byte pairs and so on.)

    It would help for this purpose if one could assume a certain
    natural language for the content.
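
    Something along these lines could serve as a starting point for the
    frequency comparison (a rough sketch, untested; the reference files
    and the squared-difference metric are just one possible choice):

        // FrequencyGuess.java - compare a file's byte-value frequencies
        // against reference files of known encoding.
        // Usage: java FrequencyGuess unknown.txt ISO-8859-1=ref1.txt UTF-8=ref2.txt
        import java.io.DataInputStream;
        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;

        public class FrequencyGuess {

            /** Reads a whole file into memory (fine for a sketch). */
            static byte[] readAll(String name) throws IOException {
                File f = new File(name);
                byte[] data = new byte[(int) f.length()];
                DataInputStream in = new DataInputStream(new FileInputStream(f));
                try { in.readFully(data); } finally { in.close(); }
                return data;
            }

            /** Normalized byte-value frequencies (256 bins). */
            static double[] histogram(byte[] data) {
                double[] freq = new double[256];
                for (byte b : data) freq[b & 0xFF]++;
                for (int i = 0; i < 256; i++) freq[i] /= data.length;
                return freq;
            }

            /** Sum of squared differences; smaller means more similar. */
            static double distance(double[] a, double[] b) {
                double d = 0;
                for (int i = 0; i < 256; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
                return d;
            }

            public static void main(String[] args) throws IOException {
                double[] sample = histogram(readAll(args[0]));
                String best = null;
                double bestDist = Double.MAX_VALUE;
                // Each further argument names an encoding and a reference
                // file known to use it, e.g. ISO-8859-1=german-sample.txt
                for (int i = 1; i < args.length; i++) {
                    String[] pair = args[i].split("=", 2);
                    double d = distance(sample, histogram(readAll(pair[1])));
                    if (d < bestDist) { bestDist = d; best = pair[0]; }
                }
                System.out.println("Closest match: " + best);
            }
        }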

    Or one might study how other software does this. Such software
    can be found using Google, for example:

    »enca -- detect and convert encoding of text files«

    http://www.digipedia.pl/man/enca.1.html

    (Or, install and call this software from Java.)
    Stefan Ram, May 25, 2009
    #2

  3. Stefan Ram

    Stefan Ram Guest

    -berlin.de (Stefan Ram) writes:
    >One could make byte-value frequency statistics of many files
    >in some common encodings and compare them to the byte-value
    >frequency of the source given. (Advanced: Frequencies of
    >byte-pairs and so.)


    Of course, one can take advantage of the fact that certain
    octet values and octet sequences are absolutely forbidden
    in certain encodings, and exclude those encodings.

    The program might then sometimes even do better than a declared
    encoding. For example, some authors declare »ISO-8859-1«
    but actually use »Windows-1252«.

    Another idea would be to assume a /common/ encoding, such as
    UTF-8 (including US-ASCII), ISO-8859-1, or Windows-1252, first,
    and detect a rare encoding only when there is strong evidence
    for it.

    It is easy to tell UTF-8 from ISO-8859-1 by the encoding of
    character values above 127, and to tell ISO-8859-1 from
    Windows-1252 by the presence of the Windows-1252 extension
    octet values.
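
    A small sketch of that three-way test (untested; it assumes the
    whole file fits in memory and relies on the rule of thumb that
    the 0x80-0x9F range is rare in real ISO-8859-1 text, since those
    are C1 control codes there):

        import java.io.DataInputStream;
        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.charset.CharacterCodingException;
        import java.nio.charset.Charset;

        public class SimpleSniffer {

            /** The decode() convenience method throws on malformed input. */
            static boolean isValidUtf8(byte[] data) {
                try {
                    Charset.forName("UTF-8").newDecoder()
                           .decode(ByteBuffer.wrap(data));
                    return true;
                } catch (CharacterCodingException e) {
                    return false;
                }
            }

            /** 0x80-0x9F are C1 controls in ISO-8859-1 (rare in text)
                but printable characters in Windows-1252. */
            static boolean hasWindows1252Extensions(byte[] data) {
                for (byte b : data) {
                    int v = b & 0xFF;
                    if (v >= 0x80 && v <= 0x9F) return true;
                }
                return false;
            }

            public static void main(String[] args) throws IOException {
                File f = new File(args[0]);
                byte[] data = new byte[(int) f.length()];
                DataInputStream in = new DataInputStream(new FileInputStream(f));
                try { in.readFully(data); } finally { in.close(); }

                if (isValidUtf8(data)) {
                    System.out.println("UTF-8 (or plain US-ASCII)");
                } else if (hasWindows1252Extensions(data)) {
                    System.out.println("probably Windows-1252");
                } else {
                    System.out.println("probably ISO-8859-1");
                }
            }
        }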

    So it will help if the user can give an estimate of the
    encodings most common in his realm.
    Stefan Ram, May 25, 2009
    #3
  4. Stefan Ram

    Stefan Ram Guest

    Stefan Ram, May 25, 2009
    #4
  5. Joshua Cranmer

    Joshua Cranmer Guest

    Simon wrote:
    > May I know if there are any possible solutions to detect the encoding
    > or character set (charset) of a file automatically? Second, how do I
    > convert a particular encoding to Unicode once the file encoding is
    > detected?


    The short answer: there's no easy way to detect the charset automatically.

    The long answer:
    Typically, no filesystem stores metadata that one can associate with
    a file encoding. All the ISO 8859-* encodings differ only in what the
    code points in the 0x80-0xFF range look like, be it standard accented
    characters (like à), Greek characters (α), or some other script.
    Pragmatically, differentiating between these single-byte encodings
    forces you to resort either to heuristics or to getting help from the
    user (notice that all major browsers let you select a web page's
    encoding for this very reason).

    There is another class of encodings: variable-length encodings like
    UTF-8 or Shift-JIS. One can sometimes rule these out when invalid
    sequences turn up. For example, 0xA4 0xF4 is invalid UTF-8, so the
    file is probably in an ISO 8859-* encoding instead.
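
    In Java, a CharsetDecoder can do exactly this filtering: the
    convenience decode() method throws on any sequence that is malformed
    for the charset. A sketch (the candidate list is arbitrary; note that
    ISO 8859-1 can never be ruled out this way, since every byte sequence
    is valid in it):

        import java.io.DataInputStream;
        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.charset.CharacterCodingException;
        import java.nio.charset.Charset;
        import java.util.ArrayList;
        import java.util.List;

        public class RuleOut {
            public static void main(String[] args) throws IOException {
                File f = new File(args[0]);
                byte[] data = new byte[(int) f.length()];
                DataInputStream in = new DataInputStream(new FileInputStream(f));
                try { in.readFully(data); } finally { in.close(); }

                String[] candidates = { "UTF-8", "UTF-16", "Shift_JIS", "ISO-8859-1" };
                List<String> possible = new ArrayList<String>();
                for (String name : candidates) {
                    try {
                        Charset.forName(name).newDecoder()
                               .decode(ByteBuffer.wrap(data));
                        possible.add(name);  // decoded cleanly: still a candidate
                    } catch (CharacterCodingException e) {
                        // malformed for this charset: excluded
                    }
                }
                System.out.println("Possible encodings: " + possible);
            }
        }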

    Context is also helpful. You may recall coming across documents that
    have unusual character pairings, like "Ã©" or something (if your
    newsreader sucks at i18n, you'll probably be seeing those in this
    message as well). That is pretty much a dead giveaway that the
    message is UTF-8 but someone is treating it as ISO 8859-1 (or its
    very close sibling, Windows-1252). If you're seeing multiple
    high-byte characters in a row, it's more likely UTF-8 than ISO
    8859-1, although some languages (like Greek) may have such runs
    routinely.

    The final way to guess at the encoding is to look at the platform's
    default. Western European-localized products will tend to be in
    either Cp1252 (which is pretty much ISO 8859-1) or UTF-8; Japanese
    ones are probably either Shift-JIS or UTF-8. I believe Java's
    conversion methods default to the platform encoding for you anyway,
    so that may be a safer bet for you. The other alternative is to just
    assume everyone uses the same charset and not think about it.
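
    For the second half of the original question, the conversion itself
    is the easy part once an encoding has been picked: Java strings are
    Unicode internally, so decoding with the right charset is the whole
    job. A sketch (the ".utf8" output name and the ISO-8859-1 fallback
    are just placeholders):

        import java.io.*;
        import java.nio.charset.Charset;

        public class ToUtf8 {
            public static void main(String[] args) throws IOException {
                System.out.println("Platform default: " + Charset.defaultCharset());

                // Encoding detected earlier (or assumed), e.g. "windows-1252".
                String detected = args.length > 1 ? args[1] : "ISO-8859-1";

                Reader in = new BufferedReader(new InputStreamReader(
                        new FileInputStream(args[0]), detected));
                Writer out = new BufferedWriter(new OutputStreamWriter(
                        new FileOutputStream(args[0] + ".utf8"), "UTF-8"));
                try {
                    // Chars flowing through here are Unicode; only the two
                    // stream wrappers know about byte encodings.
                    for (int c; (c = in.read()) != -1; ) out.write(c);
                } finally {
                    in.close();
                    out.close();
                }
            }
        }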

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, May 25, 2009
    #5
  6. Roedy Green

    Roedy Green Guest

    On Sun, 24 May 2009 20:35:02 -0700 (PDT), Simon
    <> wrote, quoted or indirectly quoted someone
    who said :

    >May I know if there are any possible solutions to detect the encoding
    >or character set (charset) of a file automatically? Second, how do I
    >convert a particular encoding to Unicode once the file encoding is
    >detected?


    I wrote a utility to assist the process manually. You could do it
    automatically if you know the vocabulary of the file: search for the
    byte patterns of encoded words, as in the sketch below.

    see http://mindprod.com/jgloss/encoding.html
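
    A sketch of that vocabulary trick (untested; the word and the
    candidate encodings are placeholders - pick words containing
    non-ASCII characters, or the encodings won't differ):

        import java.io.DataInputStream;
        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;

        public class VocabularySniff {

            /** Naive byte-array substring search. */
            static boolean contains(byte[] data, byte[] pattern) {
                outer:
                for (int i = 0; i <= data.length - pattern.length; i++) {
                    for (int j = 0; j < pattern.length; j++) {
                        if (data[i + j] != pattern[j]) continue outer;
                    }
                    return true;
                }
                return false;
            }

            public static void main(String[] args) throws IOException {
                File f = new File(args[0]);
                byte[] data = new byte[(int) f.length()];
                DataInputStream in = new DataInputStream(new FileInputStream(f));
                try { in.readFully(data); } finally { in.close(); }

                // A word we expect to occur in the file; "Müller" is just an
                // example. Note ISO-8859-1 and windows-1252 encode it the
                // same way, so both may report a hit.
                String word = "Müller";
                String[] encodings = { "UTF-8", "ISO-8859-1", "windows-1252", "UTF-16BE" };
                for (String enc : encodings) {
                    if (contains(data, word.getBytes(enc))) {
                        System.out.println(enc + ": found \"" + word + "\"");
                    }
                }
            }
        }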

    The fact that you can't tell is like dirty coffee cups and pizza
    boxes on the floor. I can't imagine that happening if someone like
    Martha Stewart were in charge.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "Everybody’s worried about stopping terrorism. Well, there’s a really easy way: stop participating in it."
    ~ Noam Chomsky
    Roedy Green, May 25, 2009
    #6
  7. Roedy Green

    Roedy Green Guest

    On 25 May 2009 12:56:47 GMT, -berlin.de (Stefan Ram)
    wrote, quoted or indirectly quoted someone who said :

    >
    > Of course, one can take advantage of the fact, that certain
    > octet values and octet sequence values are absolutely forbidden
    > in certain encodings so as to exclude those encodings.


    The biggest clue is the country source of the file. Check the
    national encodings first.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "Everybody’s worried about stopping terrorism. Well, there’s a really easy way: stop participating in it."
    ~ Noam Chomsky
    Roedy Green, May 26, 2009
    #7
  8. Roedy Green

    Roedy Green Guest

    On Sat, 30 May 2009 20:24:58 -0400, Wayne <nospam@all.4me.invalid>
    wrote, quoted or indirectly quoted someone who said :

    >I've often thought an elegant solution would be to define more than
    >one BOM (byte order mark) in Unicode. They could allocate enough
    >BOMs to have a different one for each encoding.


    There are hundreds of encodings. You could add it now with:

    BOM BOM name-of-encoding BOM.

    That way you don't have to reserve any new characters.

    While we are at it, we should encode the MIME type and create an
    extensible scheme to add other meta-information.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "Everybody’s worried about stopping terrorism. Well, there’s a really easy way: stop participating in it."
    ~ Noam Chomsky
    Roedy Green, May 31, 2009
    #8
  9. Stefan Ram

    Stefan Ram Guest

    Roedy Green <> writes:
    >There are hundreds of encodings. You could add it now with:
    >BOM BOM name-of-encoding BOM.


    It is called »XML«:

    <?xml version="1.0" encoding="name-of-encoding"?><text><![CDATA[...]]></text>
                        ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
    Stefan Ram, May 31, 2009
    #9
  10. Roedy Green

    Roedy Green Guest

    On Sat, 06 Jun 2009 23:23:04 -0400, Wayne <nospam@all.4me.invalid>
    wrote, quoted or indirectly quoted someone who said :

    >FE FF UTF-16BE BOM
    >FF FE UTF-16LE BOM
    >EF BB BF UTF-8 BOM
    >
    >So there are already multiple BOMs defined, including one
    >for UTF-8. (I knew it was a good idea! :)


    I suppose we could try to get rid of all the old 8-bit encodings and
    use Unicode/UTF rather than try to patch all those text files out
    there with some scheme to mark the encoding.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Never discourage anyone... who continually makes progress, no matter how slow.
    ~ Plato 428 BC died: 348 BC at age: 80
    Roedy Green, Jun 7, 2009
    #10
  11. Mayeul

    Mayeul Guest

    Wayne wrote:
    > Stefan Ram wrote:
    >> Roedy Green <> writes:
    >>> There are hundreds of encodings. You could add it now with:
    >>> BOM BOM name-of-encoding BOM.

    >> It is called »XML«:
    >>
    >> <?xml version="1.0" encoding="name-of-encoding"?><text><![CDATA[...]]></text>
    >>                     ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
    >>

    >
    > Right, but there should be a simple way to deal with plain text
    > files too.
    >
    > Turns out there is one! I've been reading the HTML5 draft spec
    > and came across this:
    >
    > 2.7.3 Content-Type sniffing: text or binary
    >
    > 1. The user agent may wait for 512 or more bytes of the resource
    > to be available.
    > 2. Let n be the smaller of either 512 or the number of bytes
    > already available.
    > 3. If n is 4 or more, and the first bytes of the resource match
    > one of the following byte sets:
    >
    > Bytes in Hexadecimal   Description
    > FE FF                  UTF-16BE BOM
    > FF FE                  UTF-16LE BOM
    > EF BB BF               UTF-8 BOM
    >
    > So there are already multiple BOMs defined, including one
    > for UTF-8. (I knew it was a good idea! :)


    I wouldn't say that "multiple" BOMs are already defined. The idea of
    the BOM is to insert a zero-width no-break space character, whose
    code point is U+FEFF, at the start of the file.

    Since this character is encoded differently by different encodings,
    it makes it possible to distinguish between UTF-16BE, UTF-16LE, UTF-8
    and other Unicode encodings.
    It is also a somewhat acceptable way to indicate that a file is UTF-8
    rather than Latin-1 or something, since it seems unlikely that a
    plain-text file would start with the characters that the BOM's bytes
    represent in non-Unicode encodings.

    Bottom line: the BOM is a zero-width no-break space. It is unique;
    there are no multiple BOMs.

    Or if there are any that I don't know of, that would be another norm
    the given table doesn't conform to.
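
    For what it's worth, sniffing those byte patterns in Java is only a
    few lines (a sketch; note it cannot tell a UTF-32LE BOM from a
    UTF-16LE one, since FF FE is a prefix of both):

        import java.io.FileInputStream;
        import java.io.IOException;
        import java.io.InputStream;

        public class BomSniffer {

            /** Returns the encoding implied by a leading BOM, or null. */
            static String sniffBom(byte[] b, int n) {
                if (n >= 3 && (b[0] & 0xFF) == 0xEF
                           && (b[1] & 0xFF) == 0xBB
                           && (b[2] & 0xFF) == 0xBF) return "UTF-8";
                if (n >= 2 && (b[0] & 0xFF) == 0xFE
                           && (b[1] & 0xFF) == 0xFF) return "UTF-16BE";
                if (n >= 2 && (b[0] & 0xFF) == 0xFF
                           && (b[1] & 0xFF) == 0xFE) return "UTF-16LE";
                return null;
            }

            public static void main(String[] args) throws IOException {
                InputStream in = new FileInputStream(args[0]);
                byte[] head = new byte[3];
                int n;
                try { n = in.read(head); } finally { in.close(); }
                String enc = sniffBom(head, n);
                System.out.println(enc != null ? "BOM says: " + enc
                                               : "no BOM found");
            }
        }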

    --
    Mayeul
    Mayeul, Jun 9, 2009
    #11
