Query: different coding systems

Jack Dowson

Hello everybody,
As we all know, FileReader and FileWriter are both character stream
classes. I use FileReader to read a text file that mixes letters
and Chinese characters encoded in ANSI's ASCII. I know that each letter
takes one byte on disk, while each Chinese character
takes two. When the file is read, what is printed on the screen
matches its content exactly!
Now, here is my question: how does the JVM tell a one-byte letter from a
two-byte Chinese character?
Here is my demo program:
import java.io.*;

class FileReaderDemo {
    public static void main(String[] args) throws Exception {
        FileReader fr = new FileReader("text.txt");
        int ch;
        int words = 0; // despite the name, this counts chars, not words
        // read() returns one char at a time, or -1 at the end of the file
        while ((ch = fr.read()) != -1) {
            System.out.print((char) ch);
            words++;
        }
        fr.close();
        System.out.println("\nThere are totally " + words + " characters in this file!");
    }
}

And the text.txt is:
This is a test file!
这是一个测试文件！

The outcome is:
This is a test file!
这是一个测试文件！
There are totally 31 characters in this file!


Thanks!
Dowson.
 
Thomas Fritsch

Jack said:
Hello everybody,
As we all know, FileReader and FileWriter are both character stream
classes.

Yes!

I use FileReader to read a text file that mixes letters
and Chinese characters encoded in ANSI's ASCII.

No, you don't. Chinese simply cannot be encoded in ASCII. Maybe your text
file is in a double-byte Chinese encoding such as GBK (two bytes per
Chinese character, as you describe), or in UTF-8 (see below).

I know that each letter
takes one byte on disk, while each Chinese character
takes two. When the file is read, what is printed on the screen
matches its content exactly!

There is already a misconception on your side:
(1) It is correct that ASCII requires one byte per character, because
ASCII can only encode the characters from 0x0000 to 0x007F (into
bytes 0x00 .. 0x7F), nothing more.
(2) ASCII simply cannot encode the Chinese chars (0x4E00 .. 0x9FFF).

The key is to understand that there is a difference between *byte*
streams (InputStream, OutputStream) and *char* streams (Reader, Writer).
A byte is in the range 0x00..0xFF, a char is in the range 0x0000..0xFFFF.
Files are always sequences of bytes, but in your Java code you want to
deal with chars. Therefore Java has to translate between byte
streams and char streams: turning chars into bytes is called "encoding",
turning bytes back into chars is called "decoding".
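
To make the byte/char distinction concrete, here is a minimal sketch that decodes a hand-written byte array the way a Reader would. It is only an illustration: the class name DecodeDemo, the helper decodeUtf8 and the sample bytes are made up, and it assumes the bytes are UTF-8.

```java
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Decode raw bytes into chars, assuming the bytes are UTF-8.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The six bytes a UTF-8 file would contain for the two
        // Chinese characters U+6D4B and U+8BD5.
        byte[] fileBytes = {
            (byte) 0xE6, (byte) 0xB5, (byte) 0x8B,
            (byte) 0xE8, (byte) 0xAF, (byte) 0x95
        };
        String text = decodeUtf8(fileBytes);
        // Six bytes in the "file" become two chars in the program.
        System.out.println(fileBytes.length + " bytes -> " + text.length() + " chars");
    }
}
```

The file side only ever sees the six bytes; the two chars exist only after decoding inside the Java program.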

Unfortunately there are many different encodings. "ASCII" is
just one of them; others are "ISO-8859-1", "UTF-16", "UTF-8" and many more.
Some encodings ("UTF-8", "UTF-16") are able to encode all 65536 possible
chars into bytes. Others can encode only a subset of chars
(ASCII: only chars from 0x0000 to 0x007F, ISO-8859-1: only chars
from 0x0000 to 0x00FF). "UTF-16" always encodes 1 char into 2 bytes.
"UTF-8" encodes 1 char into 1, 2 or 3 bytes (depending on the char).

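The byte counts mentioned above can be checked with a small sketch. EncodingSizes and encodedLength are hypothetical names chosen for this example. It uses UTF-16BE rather than "UTF-16" because Java's "UTF-16" encoder prepends a 2-byte byte order mark, which would distort the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes the given charset needs to encode the string.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // One char each: a Latin letter, an accented letter, a Chinese character.
        String[] samples = {"A", "\u00E9", "\u6D4B"};
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.charAt(0)).toUpperCase()
                    + ": UTF-8 = " + encodedLength(s, StandardCharsets.UTF_8)
                    + " byte(s), UTF-16BE = "
                    + encodedLength(s, StandardCharsets.UTF_16BE) + " byte(s)");
        }
    }
}
```

For these three samples UTF-8 needs 1, 2 and 3 bytes respectively, while UTF-16BE needs 2 bytes every time.
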
You find more info and more links at
Now, here is my question: how does the JVM tell a one-byte letter from a
two-byte Chinese character?

*You* tell it which encoding will be used. For example you can
write:
    FileReader fr = new FileReader("text.txt", "UTF-8");
When you write:
    FileReader fr = new FileReader("text.txt");
that actually means
    FileReader fr = new FileReader("text.txt",
            System.getProperty("file.encoding"));
If you choose the wrong encoding (for example, if you choose "UTF-16"
but your input file is actually encoded in "UTF-8"), then your program
will simply produce the wrong characters.
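
What happens with a wrong encoding can be sketched without any file at all: encode a string with one charset and decode the bytes with another. WrongEncodingDemo and roundTrip are hypothetical names for this illustration; ISO-8859-1 stands in for "some wrong encoding".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongEncodingDemo {
    // Encode text with one charset, then decode the resulting bytes with another.
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        return new String(text.getBytes(writeWith), readWith);
    }

    public static void main(String[] args) {
        String original = "\u6D4B\u8BD5"; // two Chinese characters
        String right = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String wrong = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(right.equals(original)); // true: same charset both ways
        System.out.println(wrong.length());         // 6: every byte became its own char
    }
}
```

No exception is thrown in the mismatched case; the program just ends up with six meaningless chars instead of two Chinese ones.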
[...]
There are totally 31 characters in this file!
No, files always contain *bytes*, not *chars*.
Chars only occur within your Java program.
 
Thomas Fritsch

Thomas said:
Jack Dowson wrote:
[...]
Now, here is my question: how does the JVM tell a one-byte letter from a
two-byte Chinese character?

*You* tell it which encoding will be used. For example you can
write:
    FileReader fr = new FileReader("text.txt", "UTF-8");
When you write:
    FileReader fr = new FileReader("text.txt");
that actually means
    FileReader fr = new FileReader("text.txt",
            System.getProperty("file.encoding"));

Sorry, the above was wrong.

There is no constructor FileReader(String fileName, String encoding).
Hence, before Java 11 (which added FileReader constructors that take a
Charset), there is no way to explicitly specify an encoding with FileReader.
When you write:
    new FileReader("text.txt");
that essentially means
    new InputStreamReader(new FileInputStream("text.txt"))
which in turn means
    new InputStreamReader(new FileInputStream("text.txt"),
            System.getProperty("file.encoding"))

Therefore I would strongly recommend *not* to use FileReader at all.
Instead use for example:
    new InputStreamReader(new FileInputStream("text.txt"),
            "UTF-8")
so that the encoding you get is really the encoding you want.
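
Put together, a version of the character-counting demo with an explicit encoding might look like the sketch below. The class ExplicitEncodingDemo and the helpers countChars and writeUtf8AndCount are hypothetical names. To stay self-contained it writes its own temporary UTF-8 file with the same two lines as the text.txt in the question (assuming a Windows line break between them) instead of reading an existing file.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitEncodingDemo {
    // The two lines of the sample file, Chinese written as Unicode escapes.
    static final String SAMPLE = "This is a test file!\r\n"
            + "\u8FD9\u662F\u4E00\u4E2A\u6D4B\u8BD5\u6587\u4EF6\uFF01";

    // Count the chars in a file, decoding its bytes with the given charset.
    static int countChars(String fileName, Charset charset) throws IOException {
        int count = 0;
        try (Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), charset))) {
            while (in.read() != -1) {
                count++;
            }
        }
        return count;
    }

    // Write SAMPLE to a temporary file as UTF-8, then count the chars
    // obtained when reading it back with the charset readWith.
    static int writeUtf8AndCount(Charset readWith) {
        try {
            Path file = Files.createTempFile("text", ".txt");
            try {
                Files.write(file, SAMPLE.getBytes(StandardCharsets.UTF_8));
                return countChars(file.toString(), readWith);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Decoded with the right charset: 31 chars, as in the original post.
        System.out.println(writeUtf8AndCount(StandardCharsets.UTF_8));
        // Decoded with the wrong charset: 49, one char per byte in the file.
        System.out.println(writeUtf8AndCount(StandardCharsets.ISO_8859_1));
    }
}
```

The same file yields 31 or 49 "characters" depending only on the charset passed to InputStreamReader, which is why it is worth naming the charset explicitly rather than relying on the platform default.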
 
Greg R. Broderick

Jack Dowson said:
Hello everybody,
As we all know, FileReader and FileWriter are both character stream
classes. I use FileReader to read a text file that mixes letters
and Chinese characters encoded in ANSI's ASCII.

Chinese characters cannot be encoded in ASCII.

Some links to get you started in the wonderful world of international
character sets:

http://czyborra.com/
http://www.i18nguy.com/unicode/codepages.html
http://www.unicode.org/
http://www.faqs.org/rfcs/rfc2044.html

Cheers
GRB


 
