How can I get a text file's encoding?

jtl.zheng

I use the following code to check a file's encoding:

-------
FileReader in = new FileReader("E:/aa.txt");
System.out.println(in.getEncoding());
-------

But it always prints "GBK", even when I point it at different files that I
know are ANSI, Unicode, and UTF-8.
My system is Windows XP, and I created these files in different encodings
with Windows Notepad (notepad.exe).

How can I get a file's encoding in Java?
And can I write a file in a specific encoding of my choosing?

Thank you very much in advance.
 
Matt Humphrey

jtl.zheng said:
I use the following code to check a file's encoding:

-------
FileReader in = new FileReader("E:/aa.txt");
System.out.println(in.getEncoding());
-------

But it always prints "GBK", even when I point it at different files that I
know are ANSI, Unicode, and UTF-8.
My system is Windows XP, and I created these files in different encodings
with Windows Notepad (notepad.exe).

getEncoding gives you the encoding the reader is using, not what the file is
using. The problem is that there isn't anything in the file that says what
encoding it is using. You can look at some byte patterns to try to
determine whether it's UTF-8, UTF-16BE, or whatever but there's no perfect
rule for it.

jtl.zheng said:
How can I get a file's encoding in Java?
And can I write a file in a specific encoding of my choosing?

To output a file in a particular encoding, just create the PrintWriter (or
whatever kind of writer you're using) with the encoding you want to use,
like

PrintWriter pw = new PrintWriter(file, "UTF-8"); // file is a java.io.File; the second argument names the charset

All the reader / writer classes make reference to some kind of encoder for
translating bytes to characters. If you don't specify an encoding, it uses
one that's default for your platform which may or may not actually be what
you want.
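
To make the writing side concrete, here is a minimal sketch that writes the same two Chinese characters with several explicitly named charsets and reports how many bytes end up in the file. The class name WriteWithEncoding, the helper write, and the temp file are made up for illustration; only PrintWriter(File, String) comes from the discussion above.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;

// Hypothetical demo class: writes text with an explicitly named charset.
class WriteWithEncoding {

    // Writes the text using the named charset and returns the raw bytes now in the file.
    static byte[] write(File file, String text, String charsetName) throws IOException {
        try (PrintWriter pw = new PrintWriter(file, charsetName)) {
            pw.print(text);
        }
        return Files.readAllBytes(file.toPath());
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("encoding-demo", ".txt");
        file.deleteOnExit();
        String text = "\u4e2d\u6587"; // two Chinese characters
        for (String cs : new String[] {"UTF-8", "UTF-16BE", "UTF-16LE"}) {
            // Same characters, different bytes on disk depending on the charset.
            System.out.println(cs + ": " + write(file, text, cs).length + " bytes");
        }
    }
}
```

The same characters take 6 bytes in UTF-8 and 4 bytes in UTF-16BE, which is why a reader has to be told (or has to guess) which encoding produced the bytes.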

I'm glossing over that character encodings are not the same as character
sets--this prior discussion may help
http://groups.google.com/group/comp...2098d45e3b1/c30ab8388402d3a3#c30ab8388402d3a3

Matt Humphrey (e-mail address removed) http://www.iviz.com/
 
Soren Kuula

jtl.zheng said:
I use the following code to check a file's encoding:

-------
FileReader in = new FileReader("E:/aa.txt");
System.out.println(in.getEncoding());
-------

But it always prints "GBK", even when I point it at different files that I
know are ANSI, Unicode, and UTF-8.
My system is Windows XP, and I created these files in different encodings
with Windows Notepad (notepad.exe).

FileReader (like any other Reader) has no way of detecting the
encoding. There is no standard way it could know about it, either:
there is no general, standard way of associating encoding metadata with
files!

- XML has a way
- HTML has several
- Some other specific file formats have a specified way to do it.

jtl.zheng said:
How can I get a file's encoding in Java?

You have to make use of what you know about the format of the file.
Otherwise, you have to use a detection algorithm of some kind.
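
One very simple detection step is to look for a byte-order mark (BOM) at the start of the file. The sketch below is only that: it assumes the file was saved by a tool that writes a BOM (Windows Notepad does for its "Unicode" and "UTF-8" options, but not for "ANSI"), and it returns null when no BOM is found, in which case you are back to guessing. The class name BomSniffer and its methods are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: recognizes only the UTF-8 and UTF-16 byte-order marks.
class BomSniffer {

    // Returns a charset name if the bytes start with a known BOM, otherwise null.
    // Caveat: a UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE here.
    static String detect(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no BOM: the encoding is still unknown
    }

    // Reads up to max bytes from the start of the stream.
    static byte[] readHead(InputStream in, int max) throws IOException {
        byte[] buf = new byte[max];
        int filled = 0;
        while (filled < max) {
            int n = in.read(buf, filled, max - filled);
            if (n < 0) break; // end of stream before max bytes
            filled += n;
        }
        return Arrays.copyOf(buf, filled);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java BomSniffer <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String guess = detect(readHead(in, 3));
            System.out.println(guess == null ? "no BOM found" : guess);
        }
    }
}
```

A missing BOM proves nothing: plenty of UTF-8 files have none, so this can only confirm an encoding, never rule one out.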

I think there is a nice web site, chinesecomputing.com, with an algorithm.

jtl.zheng said:
And can I write a file in a specific encoding of my choosing?

Reading:

FileInputStream fis = new FileInputStream("myFile.txt");
String encodingName = "utf-8"; // or a guess from a detector of your own,
// e.g. EncodingDetectorThingy.guess("myFile.txt")

Reader r = new InputStreamReader(fis, encodingName);

Writing: Same pattern, just output streams and writers instead.
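
As a rough sketch of both directions together, the following writes a string through an OutputStreamWriter and reads it back through an InputStreamReader, naming the charset each time. The class name RoundTrip and the sample text are made up for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: explicit charsets on both the writing and the reading side.
class RoundTrip {

    // Encodes the text to bytes with the given charset and writes them to the file.
    static void write(File file, String text, Charset cs) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(file), cs)) {
            w.write(text);
        }
    }

    // Reads the whole file, decoding its bytes with the given charset.
    static String read(File file, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new FileInputStream(file), cs)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("round-trip", ".txt");
        file.deleteOnExit();
        String text = "caf\u00e9";
        write(file, text, StandardCharsets.UTF_8);
        // Same charset on both sides: the text survives.
        System.out.println(text.equals(read(file, StandardCharsets.UTF_8)));
        // Different charset on the reading side: the text comes back changed.
        System.out.println(text.equals(read(file, StandardCharsets.ISO_8859_1)));
    }
}
```

Passing a Charset object rather than a name string avoids the UnsupportedEncodingException that the string-based constructors can throw.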

With new FileReader(filename), the system default encoding is used, and
that's what you have seen (so you're in mainland China, I see...)

Defaults are evil. They cause more confusion than good.

If you design your own file formats, ALWAYS remember to leave a field
with the name of the encoding that is used for string data in the file.
Of course, that name must be encoded too, but all encoding names can be
encoded in ASCII.

Xiwang that helped

Søren
 
jtl.zheng

Thanks very much for all your help.

As both of you say, there is no way to determine which encoding a file
uses.
But if the JVM doesn't know exactly which encoding is used,
how can it read the file accurately and turn it into Unicode?
is it by guest?
 
Matt Humphrey

jtl.zheng said:
Thanks very much for all your help.

As both of you say, there is no way to determine which encoding a file
uses.
But if the JVM doesn't know exactly which encoding is used,
how can it read the file accurately and turn it into Unicode?

It doesn't. Fundamentally, all data sources are byte streams and that's
what gets read. When you (in your program) ask for a stream to be read as
(converted to) characters or strings, the Reader methods read the bytes and
use the encoding to convert the bytes to Unicode characters. If the byte
stream is not valid for the encoding you gave it, you'll get garbage out and
possibly an exception or some other failure. If you scan prior messages in
this newsgroup, you'll see the dozens of different errors you might get by
using the wrong encoding, including some very difficult ones.

jtl.zheng said:
is it by guest?

By guessing? No, the JVM just does what you tell it. You'll be the one who
guesses when you run a program where you don't know what kind of encoding to
expect. It's the same problem as trying to tell if a file contains text or
binary (answer--they're all binary; those that are also interpretable as
text are not necessarily the same as those that are intended to be processed
as text)
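
The "garbage out and possibly an exception" point can be sketched as follows. By default the String constructor and the Reader classes quietly substitute a replacement character for bytes that do not fit the encoding; a CharsetDecoder configured with CodingErrorAction.REPORT throws instead. The class name StrictDecode and the sample bytes are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: lenient versus strict decoding of the same bytes.
class StrictDecode {

    // Decodes the bytes with the given charset, throwing on invalid input
    // instead of silently substituting U+FFFD.
    static String decode(byte[] bytes, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // "cafe" with an accented e, stored as ISO-8859-1: the last byte is 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Lenient decoding as UTF-8: no error, but the last character is replaced.
        String lenient = new String(latin1, StandardCharsets.UTF_8);
        System.out.println("lenient ends with U+FFFD: " + lenient.endsWith("\uFFFD"));

        // Strict decoding as UTF-8: the lone 0xE9 byte is reported as malformed.
        try {
            decode(latin1, StandardCharsets.UTF_8);
            System.out.println("strict decode succeeded");
        } catch (CharacterCodingException e) {
            System.out.println("strict decode failed: " + e);
        }
    }
}
```

Either way the JVM is not guessing; it applies exactly the charset it was given and the result is only as good as that choice.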

Matt Humphrey (e-mail address removed) http://www.iviz.com/
 
RC

jtl.zheng said:
how can it read the file accurately and turn it into Unicode?
is it by guest?

Guess, not guest. They have different meanings.

If you know the file contains Chinese characters, you can narrow your guess
to Big5, GBK, HZ, GB2312, EUC-TW, etc. You won't guess Arabic or Hebrew.
If you don't know whether the file is Russian, Greek, Spanish, etc.,
then use brute force and try them one by one (good luck!)

You can get all the charsets from

java.nio.charset.Charset.availableCharsets()
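
A brute-force pass along those lines might look like the sketch below: it tries each candidate charset with a strict decoder and keeps the ones that decode the bytes without an error. The class name CandidateCharsets and the method plausible are made up for illustration, and passing this check only means the bytes are valid in that charset, not that the charset is the right one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical demo class: filters candidate charsets by strict decoding.
class CandidateCharsets {

    // Returns the names of the candidates that can decode the bytes without error.
    static List<String> plausible(byte[] bytes, Collection<Charset> candidates) {
        List<String> ok = new ArrayList<>();
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes));
                ok.add(cs.name());
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset: rule it out.
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        byte[] bytes = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // Single-byte charsets such as ISO-8859-1 accept any byte sequence,
        // so the surviving list is usually long.
        System.out.println(plausible(bytes, Charset.availableCharsets().values()));
    }
}
```

This is why the "good luck" is deserved: many charsets accept almost any input, so you still need knowledge of the expected language or a statistical check to pick among the survivors.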
 
jtl.zheng

But take, for example:
--------
FileReader in = new FileReader("tt.txt");
System.out.println((char) in.read());
--------
I didn't tell the JVM what the encoding is,
so it must detect it itself.
But what does it detect it from?

Physically, a file contains only its name and its binary content;
there is nothing else to tag which encoding it uses.
Is that right?

Thank you
: )
 
Thomas Weidenfeller

jtl.zheng said:
I didn't tell the JVM what the encoding is,
so it must detect it itself.
But what does it detect it from?

Once again: it does not detect it. You either tell it what to use, or it
just uses some default.

The default has been chosen by the people who implement the VM. They
might have hard-coded it, or they might snoop a little bit around,
checking out the host operating system and locale and use the system's
default. But still, no one detects the encoding of a text file.
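
If you want to see which default your own VM picked, a small sketch like this prints it. It uses Charset.defaultCharset() and the file.encoding system property, and shows that a reader created without a charset simply reports that default. The class name DefaultCharsetDemo is made up for illustration; note that getEncoding() may return a historical alias (for example "UTF8" rather than "UTF-8").

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical demo class: shows the default charset the VM falls back to.
class DefaultCharsetDemo {

    // Canonical name of the VM's default charset.
    static String defaultName() {
        return Charset.defaultCharset().name();
    }

    // Encoding reported by a reader that was not given a charset.
    // The input is an empty byte array, so nothing is detected from any data.
    static String readerDefault() throws IOException {
        try (InputStreamReader r = new InputStreamReader(new ByteArrayInputStream(new byte[0]))) {
            return r.getEncoding();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Charset.defaultCharset(): " + defaultName());
        System.out.println("file.encoding property:   " + System.getProperty("file.encoding"));
        System.out.println("reader without a charset: " + readerDefault());
    }
}
```

On a Chinese-locale Windows XP machine of that era this default would typically be GBK, which matches what the original poster saw for every file.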

/Thomas
 
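Use this code; it works fine:

import java.io.File;
import com.sun.syndication.io.XmlReader; // from the ROME library

File fl = new File("D:\\Test.txt");
XmlReader xr = new XmlReader(fl); // picks an encoding from the BOM and XML prolog, if present
System.out.println("File encoding test ===========" + xr.getEncoding());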
 
