determining character encoding format of a file


Alan

Is there any easy way to determine what character encoding format
(e.g., UTF-8) a text file uses?

Thanks, Alan
 

Arne Vajhøj

Alan said:
Is there any easy way to determine what character encoding format
(e.g., UTF-8) a text file uses?

Not in general.

For ISO-8859-1 versus UTF-8 in a Western language, you can make
a qualified guess.

See the attached code as a starting point (note that the
code is designed to identify text in Danish).

Arne

=============================

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CharSetGuesser {
    public static String guess(String filename) throws IOException {
        // Count how often each byte value occurs in the file.
        int[] freq = new int[256];
        try (InputStream is = new FileInputStream(filename)) {
            int c;
            while ((c = is.read()) >= 0) {
                freq[c]++;
            }
        }
        // ISO-8859-1 codes for Å Æ È É Ë Ø å æ è é ë ø.
        int iso = freq[197] + freq[198] + freq[200] +
                  freq[201] + freq[203] + freq[216] +
                  freq[229] + freq[230] + freq[232] +
                  freq[233] + freq[235] + freq[248];
        // Second bytes of the same letters in UTF-8,
        // plus their shared lead byte 0xC3 (195).
        int utf = freq[133] + freq[134] + freq[136] +
                  freq[137] + freq[139] + freq[152] +
                  freq[165] + freq[166] + freq[168] +
                  freq[169] + freq[171] + freq[184] +
                  freq[195];
        return iso > utf ? "ISO-8859-1" : "UTF-8";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(guess("C:\\iso-8859-1.txt"));
        System.out.println(guess("C:\\utf-8.txt"));
    }
}
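
A different heuristic from the frequency count above, offered only as a rough sketch: decode the bytes strictly as UTF-8 and see whether that fails. Text in ISO-8859-1 that contains accented letters is rarely well-formed UTF-8, so a decoding failure is fairly strong evidence against UTF-8. The class name Utf8Validator and the method isValidUtf8 are made up for this illustration and are not part of the code posted in the thread. Note that pure ASCII input passes the check, so a "true" result is only a guess.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the original post.
public class Utf8Validator {
    // Returns true if the bytes are a well-formed UTF-8 sequence.
    // This does not prove the file is UTF-8; plain ASCII also passes.
    public static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently
        // substituting U+FFFD for malformed input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The bytes could come from java.nio.file.Files.readAllBytes for a file small enough to fit in memory.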
 

Arne Vajhøj

Alan said:
Actually, my interest is in Arabic.

:)

Try taking a relevant text, storing it in the relevant encodings,
and then doing some statistics on the bytes to see whether
some simple rules can identify the encoding.

Arne
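
The experiment suggested above could be started with something like the following sketch. It encodes one sample string in several candidate charsets and prints how often each byte value of 0x80 or above occurs. The class ByteStats and its methods are hypothetical names, and the charsets windows-1256 and ISO-8859-6 mentioned below are only assumed candidates for Arabic text; the thread does not name them. Charsets other than the standard ones may be missing from a given Java runtime, so the code checks before using them.

```java
import java.nio.charset.Charset;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for collecting byte statistics per encoding.
public class ByteStats {
    // Maps each byte value >= 0x80 to the number of times it occurs
    // when the text is encoded with the given charset.
    public static Map<Integer, Integer> highByteCounts(String text, Charset cs) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (byte b : text.getBytes(cs)) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v >= 0x80) {
                counts.merge(v, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Prints the statistics for every charset name the runtime supports.
    public static void report(String text, String... charsetNames) {
        for (String name : charsetNames) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not supported by this runtime");
                continue;
            }
            System.out.println(name + ": " + highByteCounts(text, Charset.forName(name)));
        }
    }
}
```

A call such as ByteStats.report(sample, "UTF-8", "windows-1256", "ISO-8859-6") on a representative sample should show which byte values are characteristic of each encoding, which can then be turned into a rule like the Danish one earlier in the thread.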
 

Mike Schilling

Alan said:
Is there any easy way to determine what character encoding format
(e.g., UTF-8) a text file uses?

Some UTF-8 files (especially on Microsoft OSs) start with the Byte Order Mark (BOM),
which is the Unicode character U+FEFF, encoded in UTF-8 as the bytes EF BB BF.
Other than that, no.
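
A minimal sketch of that BOM check, assuming the file can simply be opened and its first three bytes read. The class BomCheck and its method names are invented for this example. A missing BOM says nothing either way, since many UTF-8 files are written without one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not from the thread.
public class BomCheck {
    // True if the bytes begin with EF BB BF, i.e. U+FEFF encoded in UTF-8.
    public static boolean hasUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    // Reads at most the first three bytes of the file and checks them.
    public static boolean fileHasUtf8Bom(String filename) throws IOException {
        try (InputStream is = Files.newInputStream(Paths.get(filename))) {
            byte[] head = new byte[3];
            int n = 0;
            // read() may return fewer bytes than requested, so loop.
            while (n < head.length) {
                int r = is.read(head, n, head.length - n);
                if (r < 0) {
                    break; // file is shorter than three bytes
                }
                n += r;
            }
            return hasUtf8Bom(Arrays.copyOf(head, n));
        }
    }
}
```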
 
