determining character encoding format of a file

Discussion in 'Java' started by Alan, Oct 6, 2007.

  1. Alan wrote:

    Is there any easy way to determine what character encoding format
    (e.g., UTF-8) a text file uses?

    Thanks, Alan
    Alan, Oct 6, 2007
    #1

  2. Alan wrote:
    > Is there any easy way to determine what character encoding format
    > (e.g., UTF-8) a text file uses?
    >


    Easy? Not in general.

    <http://developers.sun.com/global/technology/standards/reference/faqs/determining-file-encoding.html>
    <http://codesnipers.com/?q=node/68>
    RedGrittyBrick, Oct 7, 2007
    #2

  3. Alan wrote:
    > Is there any easy way to determine what character encoding format
    > (e.g., UTF-8) a text file uses?


    Not in general.

    If the choice is only between ISO-8859-1 and UTF-8 for a Western
    language, you can make a qualified guess.

    See the attached code as a starting point (note that the
    code is designed to identify text in Danish).

    Arne

    =============================

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class CharSetGuesser {
        public static String guess(String filename) throws IOException {
            // Count how often each byte value (0-255) occurs in the file.
            int[] freq = new int[256];
            // try-with-resources closes the stream even if read() throws.
            try (InputStream is = new FileInputStream(filename)) {
                int c;
                while ((c = is.read()) >= 0) {
                    freq[c]++;
                }
            }
            // First sum: ISO-8859-1 byte values of Å Æ È É Ë Ø and å æ è é ë ø.
            // Second sum: the UTF-8 trail bytes of the same letters, plus
            // their shared lead byte 195 (0xC3).
            if ((freq[197] + freq[198] + freq[200] +
                 freq[201] + freq[203] + freq[216] +
                 freq[229] + freq[230] + freq[232] +
                 freq[233] + freq[235] + freq[248]) >
                (freq[133] + freq[134] + freq[136] +
                 freq[137] + freq[139] + freq[152] +
                 freq[165] + freq[166] + freq[168] +
                 freq[169] + freq[171] + freq[184] +
                 freq[195])) {
                return "ISO-8859-1";
            } else {
                return "UTF-8";
            }
        }

        public static void main(String[] args) throws Exception {
            System.out.println(guess("C:\\iso-8859-1.txt"));
            System.out.println(guess("C:\\utf-8.txt"));
        }
    }
    Arne Vajhøj, Oct 7, 2007
    #3
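
A complementary check, not part of Arne's code: every byte sequence decodes as ISO-8859-1, but non-ASCII ISO-8859-1 text is rarely well-formed UTF-8. The sketch below asks the JDK's UTF-8 decoder to report errors instead of silently replacing them. The class and method names (Utf8Check, isValidUtf8) are hypothetical, and a "true" result only means the bytes could be UTF-8, not that they are.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 well-formedness test.
final class Utf8Check {
    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder hit a byte sequence that is not legal UTF-8.
            return false;
        }
    }
}
```

Pure ASCII passes this test under both encodings, so it is probably best used as a first filter before a frequency-based guess like the one above.
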
  4. Alan wrote:

    Thank you. Actually, my interest is in Arabic.
    Alan, Oct 7, 2007
    #4
  5. Alan wrote:
    > Actually, my interest is in Arabic.


    :)

    Try taking a relevant text, storing it in each of the candidate
    encodings, and then gathering some statistics on the bytes to see
    whether some simple rules can identify the encoding.

    Arne
    Arne Vajhøj, Oct 7, 2007
    #5
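
The experiment Arne describes could be started along these lines. This is only a sketch: the class name ByteStats, the short sample phrase, and the choice of UTF-8, windows-1256 and ISO-8859-6 as candidate encodings are all assumptions, and a real test would need a much larger, representative text.

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: byte-value histogram of one text in several encodings.
final class ByteStats {
    // Placeholder sample ("hello world" in Arabic), written with Unicode
    // escapes so the source file's own encoding does not matter.
    static final String SAMPLE =
            "\u0645\u0631\u062D\u0628\u0627 \u0628\u0627\u0644\u0639\u0627\u0644\u0645";

    /** Counts how often each byte value (0-255) occurs when text is encoded. */
    static Map<Integer, Integer> histogram(String text, String charsetName) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            counts.merge(b & 0xFF, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Candidate encodings are an assumption; some runtimes lack the
        // extended charsets, so skip the ones that are not available.
        for (String name : new String[] {"UTF-8", "windows-1256", "ISO-8859-6"}) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not supported on this JVM");
                continue;
            }
            System.out.println(name + ": " + histogram(SAMPLE, name));
        }
    }
}
```

In UTF-8 each letter of this sample takes two bytes whose first byte is 0xD8 or 0xD9, while the single-byte encodings use one byte per letter, which is the kind of simple rule such statistics may reveal.
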
  6. Alan wrote:
    > Is there any easy way to determine what character encoding format
    > (e.g., UTF-8) a text file uses?


    Some UTF-8 files (especially on Microsoft OSs) start with the Byte Order
    Mark (BOM), which is the Unicode character U+FEFF encoded in UTF-8 (the
    bytes EF BB BF). Other than that, no.
    Mike Schilling, Oct 7, 2007
    #6
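
The BOM check Mike mentions can be sketched as follows. The class and method names (BomSniffer, detect, detectFile) are hypothetical, and the UTF-16 cases are an addition not mentioned in the post. A missing BOM says nothing about the encoding, so a null result simply means "unknown".

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: looks only at a leading byte order mark.
final class BomSniffer {
    /** Returns the encoding implied by a leading BOM, or null if there is none. */
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare bytes as unsigned values.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Reads up to the first three bytes of a file and checks them for a BOM. */
    static String detectFile(String filename) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        try (InputStream in = Files.newInputStream(Paths.get(filename))) {
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) break;  // file is shorter than three bytes
                n += r;
            }
        }
        return detect(Arrays.copyOf(head, n));
    }
}
```
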