Read utf-8 char one by one

moonhkt

Hi all,

How can I read UTF-8 characters one by one?

The code below does not work.

import java.nio.charset.Charset ;
import java.io.*;
import java.lang.String;
public class read_utf_char {
    public static void main(String[] args) {
        File aFile = new File("utf8_test.text");
        try {
            String str = "";
            char[] ch = new char[];
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(aFile), "UTF8"));
            while ( in.read(ch) != -1 )
            {
                System.out.print(ch);
            }
        } catch (UnsupportedEncodingException e) {
        } catch (IOException e) {
        }
    }
}
 
Mayeul

moonhkt said:
Hi All

how to read utf-8 char one by one ?

Below not work.

As far as I know, it works if your utf-8 stream contains only BMP
characters (characters with code point 0xFFFF or below.)

But it is indeed incorrect in the general case where you can't assume
characters are all in the BMP. This is a known Java limitation.

In the general case, you just don't read Unicode characters one by one
from a stream. Either you convert the stream to a String first (and then
use a combination of String.codePointAt() and Character.charCount(); read
the Javadoc), or you read while looking for your delimiters, storing
whatever is *not* a delimiter in a char buffer, untouched. You do not
write it out directly. For instance, BufferedReader implements reading
line by line. I suppose other implementations allow reading with a
different delimiter.
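
The String-based approach mentioned above could look roughly like the following. This is only a sketch: the class name CodePointWalk and the helper splitByCodePoint are made up for illustration, and the sample string is an assumption rather than the poster's file contents.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointWalk {
    // Splits a String into one String per Unicode code point,
    // keeping the two halves of a surrogate pair together.
    static List<String> splitByCodePoint(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int width = Character.charCount(cp); // 1 for BMP, 2 for supplementary
            out.add(s.substring(i, i + width));
            i += width;
        }
        return out;
    }

    public static void main(String[] args) {
        // U+00FC, U+34D7, then U+1D11E written as a surrogate pair.
        String sample = "\u00fc\u34d7\uD834\uDD1E";
        for (String piece : splitByCodePoint(sample)) {
            System.out.printf("U+%04X%n", piece.codePointAt(0));
        }
    }
}
```

The loop advances by Character.charCount() rather than by 1, which is what keeps a character outside the BMP from being split in two.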
 
Lothar Kimmeringer

moonhkt said:
Below not work.
[...]

char[] ch = new char[];

Because it doesn't compile.

What exactly doesn't work? Do you get wrong output, or do you
get an exception (which you ignore in the source you provided)? A
bit more information would really help to be able to answer
more than "something will be wrong in your code".
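
For comparison, here is one way the posted loop might be made to compile and read one character at a time. It is a sketch under assumptions: the class name ReadCodePoints and the method readAllCodePoints are hypothetical, an in-memory byte array stands in for utf8_test.text so the example runs on its own, and an unpaired high surrogate is simply treated as an error. The array in the original needs a size (for example new char[1]); the version below sidesteps that by calling Reader.read() with no arguments, which returns a single UTF-16 code unit.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ReadCodePoints {
    // Reads the whole reader and returns its contents as a list of code points.
    static List<Integer> readAllCodePoints(Reader in) {
        List<Integer> out = new ArrayList<>();
        try {
            int unit;
            while ((unit = in.read()) != -1) {      // one UTF-16 code unit per call
                char c = (char) unit;
                if (Character.isHighSurrogate(c)) {
                    int next = in.read();           // the low surrogate should follow
                    if (next == -1 || !Character.isLowSurrogate((char) next)) {
                        throw new IOException("unpaired surrogate in input");
                    }
                    out.add(Character.toCodePoint(c, (char) next));
                } else {
                    out.add(unit);                  // a BMP character fits in one char
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the file: the UTF-8 bytes of U+00FC, U+34D7 and U+1D11E.
        byte[] utf8 = "\u00fc\u34d7\uD834\uDD1E".getBytes(StandardCharsets.UTF_8);
        try (Reader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8))) {
            for (int cp : readAllCodePoints(in)) {
                System.out.printf("U+%04X%n", cp);
            }
        }
    }
}
```

To read the real file, the ByteArrayInputStream would be replaced by a FileInputStream on utf8_test.text; the decoding logic stays the same.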


Regards, Lothar
--
Lothar Kimmeringer E-Mail: (e-mail address removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
 
moonhkt

moonhkt said:
Below not work.
[...]

    char[] ch = new char[];

Because it doesn't compile.

What exactly doesn't work. Do you get a wrong output, do you
get an exception (you ignore in the source you provided). A
bit more information would really help to be able to answer
more than "something will be wrong in your code".

Thanks. I found the example below, but I cannot get the UTF-8 char code from it.

class CodePointAtstring
{
  public static void main(String[] args)
  {
    // Declaration of String
    String a = "\u00fc" + "\u34d7" + "Welcome to Rose india";
    // Displays the actual String declared above
    System.out.println("GIVEN STRING IS=" + a);
    // Returns the character (Unicode code point) at the specified index.
    System.out.println("Unicode code point at position 0 IN THE STRING IS=" + a.codePointAt(0));
    System.out.println("Unicode code point at position 1 IN THE STRING IS=" + a.codePointAt(1));
    System.out.println("Unicode code point at position 2 IN THE STRING IS=" + a.codePointAt(2));
    System.out.println("Unicode code point at position 3 IN THE STRING IS=" + a.codePointAt(3));
    System.out.println("Unicode code point at position 6 IN THE STRING IS=" + a.codePointAt(6));
  }
}

Output
java CodePointAtstring
GIVEN STRING IS=³?Welcome to Rose india
Unicode code point at position 0 IN THE STRING IS=252
Unicode code point at position 1 IN THE STRING IS=13527
Unicode code point at position 2 IN THE STRING IS=87
Unicode code point at position 3 IN THE STRING IS=101
Unicode code point at position 6 IN THE STRING IS=111
 
RedGrittyBrick

moonhkt said:
Thank. I get below Example. But I can not get the UTF-8 char code.

What do you mean by "UTF-8 char code"? Strictly speaking there is no
such thing. You might mean "Unicode code-point" or "sequence of octets
in UTF8-encoding"

Output
java CodePointAtstring
GIVEN STRING IS=³?Welcome to Rose india
Unicode code point at position 0 IN THE STRING IS=252
Unicode code point at position 1 IN THE STRING IS=13527
Unicode code point at position 2 IN THE STRING IS=87
Unicode code point at position 3 IN THE STRING IS=101
Unicode code point at position 6 IN THE STRING IS=111


That seems completely reasonable to me because 252 = 0x00fc and 13527 =
0x34d7.

Nothing in your program has anything to do with UTF-8 encoding.
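
To make the distinction between a code point and its UTF-8 octets concrete, here is a small sketch. The class name Utf8Octets and the helper utf8Hex are invented for this illustration; the calls themselves (Character.toChars, String.getBytes with StandardCharsets.UTF_8) are standard API.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Octets {
    // Returns the UTF-8 encoding of one code point as lowercase hex, e.g. "c3 bc".
    static String utf8Hex(int codePoint) {
        byte[] octets = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : octets) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "\u00fc\u34d7W";
        // Step through the string one code point at a time.
        for (int i = 0; i < a.length(); i += Character.charCount(a.codePointAt(i))) {
            int cp = a.codePointAt(i);
            System.out.printf("code point %d (U+%04X) -> UTF-8 octets %s%n",
                    cp, cp, utf8Hex(cp));
        }
    }
}
```

The code point is a property of the character; the octets only exist once the string is encoded, which the codePointAt() example never does.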
 
moonhkt

Hi all,
I want to output the characters in the string one by one.
Right now, codePointAt just prints the code point values.


 
Lew

Please, do not top-post.
I want output the Character in the string one by one.
Now,codePointAt just print the Code points value.

'codePointAt()' doesn't print anything. How are you actually printing it?

'codePointAt()' returns an int, not a character.
<http://java.sun.com/javase/6/docs/api/java/lang/String.html#codePointAt(int)>

Most methods that output an int show the int value, not the equivalent
character. If you want to display an int as a character, you have to use a
method that will do that. I don't know offhand of a method in the standard
API that does that, but perusal of the Javadocs might reveal one, otherwise
you'll have to code one yourself or find a third-party library that already
has such.
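
For what it is worth, the standard API does seem to cover this case: Character.toChars(int) turns a code point into its char sequence, and StringBuilder.appendCodePoint(int) appends one directly. A minimal sketch, with the class name CodePointToText and the helper asText being illustrative names only:

```java
public class CodePointToText {
    // Converts a single code point into a String of one or two chars.
    static String asText(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String a = "\u00fc\u34d7Welcome";
        int cp = a.codePointAt(0);

        System.out.println(cp);          // prints the int value of the code point
        System.out.println(asText(cp));  // prints the character itself

        // appendCodePoint handles supplementary characters as well.
        StringBuilder sb = new StringBuilder();
        sb.appendCodePoint(a.codePointAt(1)).appendCodePoint(0x1D11E);
        System.out.println(sb.length()); // one char for U+34D7 plus two for U+1D11E
    }
}
```

Whether the characters display correctly still depends on the console's encoding, which is a separate matter from the conversion.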
 
Roedy Green

What do you mean by "UTF-8 char code"? Strictly speaking there is no
such thing. You might mean "Unicode code-point" or "sequence of octets
in UTF8-encoding"

The point of an encoding is that it hides the details of how 16-bit chars
are inserted into an 8-bit stream. All you are interested in is the 16-bit
Java char value, or perhaps the Java code point value if you have 32-bit
chars embedded as well.
 
RedGrittyBrick

moonhkt said:
Hi All I want output the Character in the string one by one.
Now,codePointAt just print the Code points value.

Why not use String's length() and charAt() methods?

I assume you can disregard characters outside Unicode's Base
Multilingual Plane (BMP) - if not, I think you'll have to check for
surrogate pairs. Characters outside the BMP are too big for a char.

-------------------------------------8<-----------------------------------
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class UnicodeChars {
    public static void main(String[] args)
            throws UnsupportedEncodingException {

        // I want console output in UTF-8
        PrintStream sysout = new PrintStream(System.out, true, "UTF-8");

        // \u00fc is LATIN SMALL LETTER U WITH DIAERESIS;
        // \u34d7 is a character in CJK Unified Ideographs Extension A.
        // \uD834\uDD1E is the surrogate pair for character U+1D11E.
        // U+1D11E is MUSICAL SYMBOL G CLEF;
        String a = "\u00fc\u34d7Welcome to Rose India \uD834\uDD1E.";

        int n = a.length();
        sysout.println("GIVEN STRING IS=" + a);
        sysout.printf("Length of string is %d%n", n);
        sysout.printf("CodePoints in string is %d%n",
                a.codePointCount(0, n));
        // charAt() returns UTF-16 code units, so the G clef appears as two chars
        for (int i = 0; i < n; i++) {
            sysout.printf("Character[%d] is %s%n", i, a.charAt(i));
        }
    }
}
-------------------------------------8<-----------------------------------
GIVEN STRING IS=ü㓗Welcome to Rose India 𝄞.
Length of string is 27
CodePoints in string is 26
Character[0] is ü
Character[1] is 㓗
Character[2] is W
Character[3] is e
Character[4] is l
Character[5] is c
Character[6] is o
Character[7] is m
Character[8] is e
Character[9] is
Character[10] is t
Character[11] is o
Character[12] is
Character[13] is R
Character[14] is o
Character[15] is s
Character[16] is e
Character[17] is
Character[18] is I
Character[19] is n
Character[20] is d
Character[21] is i
Character[22] is a
Character[23] is
Character[24] is ?
Character[25] is ?
Character[26] is .
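
The two "?" lines at positions 24 and 25 are the halves of the surrogate pair, printed separately. A sketch of the surrogate check mentioned above follows; the class name SurrogateCheck and the helper describe are hypothetical, and the shortened sample string is an assumption made to keep the output small.

```java
public class SurrogateCheck {
    // Classifies the char at index i as a BMP character or one half of a surrogate pair.
    static String describe(String s, int i) {
        char c = s.charAt(i);
        if (Character.isHighSurrogate(c)) {
            return "high surrogate";
        }
        if (Character.isLowSurrogate(c)) {
            return "low surrogate";
        }
        return "BMP character";
    }

    public static void main(String[] args) {
        // U+00FC, the surrogate pair for U+1D11E, then a full stop.
        String a = "\u00fc\uD834\uDD1E.";
        for (int i = 0; i < a.length(); i++) {
            System.out.printf("char[%d] = U+%04X (%s)%n",
                    i, (int) a.charAt(i), describe(a, i));
        }
    }
}
```

A loop that wants whole characters would combine a high surrogate with the char that follows it (for example via Character.toCodePoint) instead of printing each half.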
 
moonhkt

Yes, this is what I want.
But my output is not the same as yours. Yours is correct.

Run in Jcreator 4.5 version
--------------------Configuration: <Default>--------------------
GIVEN STRING IS=羹?elcome to Rose India ??.
Length of string is 27
CodePoints in string is 26
Character[0] is 羹
Character[1] is ??
Character[2] is W
Character[3] is e
Character[4] is l
Character[5] is c
Character[6] is o
Character[7] is m
Character[8] is e
Character[9] is
Character[10] is t
Character[11] is o
Character[12] is
Character[13] is R
Character[14] is o
Character[15] is s
Character[16] is e
Character[17] is
Character[18] is I
Character[19] is n
Character[20] is d
Character[21] is i
Character[22] is a
Character[23] is
Character[24] is ?
Character[25] is ?
Character[26] is .

Process completed.



 
RedGrittyBrick

PLEASE DON'T TOP-POST, PLEASE PUT YOUR REPLY AT THE BOTTOM, BELOW ANY
QUOTED TEXT. THANKS!
RedGrittyBrick said:
moonhkt said:
Yes. This is my want.
But my output is not same with you. You are correct.

Run in Jcreator 4.5 version

I am using Eclipse. To display UTF-8 encoded Unicode characters written
to the console, I had to configure Eclipse. Perhaps you need to
configure JCreator so that you can display Unicode characters?
GIVEN STRING IS=羹?elcome to Rose India ??.
Length of string is 27
CodePoints in string is 26
Character[0] is 羹
Character[1] is ??
Character[2] is W
Character[3] is e [...]
Character[23] is
Character[24] is ?
Character[25] is ?
Character[26] is .

You used Google Groups to post. It seems Google Groups uses
quoted-printable to encode non-ASCII characters.
E.g. =3D=E7=BE=B9?=EE=A2=ADelcome ...
I find it hard to fathom how that sequence of octets was derived.
AFAIK \u00fc\u34d7 should encode to octets c3 bc e3 93 97.
Perhaps Google Groups is hampering communications. As you seem to be a
user of Mozilla Firebird, have you tried using Mozilla Thunderbird to
read this newsgroup directly from your ISP's NNTP service?

I suspect your remaining problems are due to the configuration of
JCreator or your operating system.
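
One way to see how a misconfigured console garbles output is to decode the same octets with two different charsets. The sketch below uses ISO-8859-1 purely as a stand-in for "some other encoding", since it is guaranteed to be available; it is not a claim about what JCreator actually uses. The class name DecodeMismatch and the helper decode are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeMismatch {
    // Interprets the given octets as text in the given charset.
    static String decode(byte[] octets, Charset cs) {
        return new String(octets, cs);
    }

    public static void main(String[] args) {
        // The UTF-8 octets for U+00FC followed by U+34D7.
        byte[] octets = {(byte) 0xc3, (byte) 0xbc, (byte) 0xe3, (byte) 0x93, (byte) 0x97};

        String right = decode(octets, StandardCharsets.UTF_8);
        String wrong = decode(octets, StandardCharsets.ISO_8859_1);

        System.out.printf("Decoded as UTF-8: %d chars%n", right.length());
        System.out.printf("Decoded as ISO-8859-1: %d chars%n", wrong.length());

        // The charset the JVM picked up from its environment.
        System.out.println("Default charset: " + Charset.defaultCharset());
    }
}
```

The program writes correct UTF-8 either way; the difference lies entirely in how the receiving side interprets the bytes.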
 
Lew

PLEASE DON'T TOP-POST, PLEASE PUT YOUR REPLY AT THE BOTTOM, BELOW ANY
QUOTED TEXT. THANKS!

Actually, it's better to post inline, with comments interspersed with
quoted material.
 
Lothar Kimmeringer

moonhkt said:
I want output the Character in the string one by one.

If by "output" you mean printing it on the console,
you have to make sure that the console is actually capable
of printing Unicode characters.

The ? in the second position indicates that it isn't, so
there is no way to print it that way. Judging by how the first
character comes out, the console most likely runs with
CP850, commonly used with DOS boxes in Europe.
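
Where the "?" comes from can be shown without a console at all: when a charset cannot represent a character, Java's encoder substitutes a replacement byte. The sketch below uses ISO-8859-1 as a stand-in for the console's code page (CP850 may not be installed in every JVM, so that substitution is an assumption); the class name ConsoleCharsetCheck and the helper canShow are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleCharsetCheck {
    // Reports whether every character of the text can be encoded in the charset.
    static boolean canShow(Charset cs, String text) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;

        System.out.println(canShow(latin1, "\u00fc")); // true: U+00FC exists in Latin-1
        System.out.println(canShow(latin1, "\u34d7")); // false: U+34D7 does not

        // getBytes() replaces the unmappable character with '?' (0x3f).
        byte[] out = "\u00fc\u34d7".getBytes(latin1);
        System.out.printf("%02x %02x%n", out[0] & 0xff, out[1] & 0xff); // fc 3f
    }
}
```

A similar check against the console's real charset would tell in advance which characters can be displayed there.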


Regards, Lothar
 
