Reading a UTF-8 file returns UTF-16 hex strings?


moonhkt

Hi All
Why, when the file is UTF-8, do the hex values come back as 51cc and 6668?

od -cx utf8_file01.text

22e5 878c e699 a822 (with a " before and after)

http://www.fileformat.info/info/unicode/char/51cc/index.htm
http://www.fileformat.info/info/unicode/char/6668/index.htm

Output
......
101 ? 20940 HEX=51cc BIN=101000111001100
102 ? 26216 HEX=6668 BIN=110011001101000

Java program

import java.io.*;

public class read_utf_line {
    public static void main(String[] args) {
        File aFile = new File("utf8_file01.text");
        try {
            System.out.println(aFile);
            // Decode the file as UTF-8; the resulting String is UTF-16 internally.
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(aFile), "UTF8"));

            String str;
            while ((str = in.readLine()) != null) {
                int stlen = str.length();
                System.out.println(stlen);
                for (int i = 0; i < stlen; ++i) {
                    int val = str.codePointAt(i);   // Unicode code point, not UTF-8 bytes
                    String hexstr = Integer.toHexString(val);
                    String bystr = Integer.toBinaryString(val);

                    System.out.println(i + " " + str.substring(i, i + 1)
                        + " " + val
                        + " HEX=" + hexstr
                        + " BIN=" + bystr);
                }
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
 

moonhkt

I don't understand the above.  Are you trying to suggest that the text
'with " befor and after' is part of the output of the "od" program?  If
so, why does it not appear to match up with the binary values written
out?  And if the characters you're concerned with are at index 101 and
102, why only eight bytes in the file?  And if the file is UTF-8, why
are you dumping its contents as shorts?  Why not just bytes?

Frankly, the whole question doesn't make much sense to me.  That said,
the basic answer to your question is, I believe: UTF-8 and UTF-16 are
different, so of course the bytes used to represent a character in a
UTF-8 file are going to look different from the bytes used to represent
the same character in a UTF-16 data structure.
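For illustration, a minimal sketch (not code from this thread; the class name and structure are my own) that encodes the two characters from the original post both ways and prints the bytes, just to show the two sequences really are different:

import java.io.UnsupportedEncodingException;

public class EncodingCompare {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "\u51cc\u6668";               // the two characters in question

        // UTF-8 encoding: three bytes per character here -> e5 87 8c e6 99 a8
        for (byte b : s.getBytes("UTF-8"))
            System.out.printf("%02x ", b);
        System.out.println();

        // UTF-16BE encoding: two bytes per character -> 51 cc 66 68
        for (byte b : s.getBytes("UTF-16BE"))
            System.out.printf("%02x ", b);
        System.out.println();
    }
}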

Pete

System : AIX 5.3

The text file contains just two UTF-8 Chinese characters.
cat out_utf.text
凌晨

od -cx out_utf.text
0000000 207 214 231 \n
e587 8ce6 99a8 0a00
0000007

I used Java to build the UTF-8 data, entering the characters by their UTF-16
values; I do not know how to enter UTF-8 hex values directly.
My question is: if I enter UTF-16 hex values and write to a file with the UTF8
codepage, will the data be encoded as UTF-8?
Do you know how to enter the hex values of UTF-8? I tried \0xe5 but it does not work.


import java.io.*;

public class build_utf01 {
    public static void main(String[] args)
            throws UnsupportedEncodingException {

        // I want console output in UTF-8
        PrintStream sysout = new PrintStream(System.out, true, "UTF-8");
        try {
            File oFile = new File("out_utf.text");
            BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(oFile), "UTF8"));

            /* http://www.fileformat.info/info/unicode/char/51cc/index.htm
               UTF-8  (hex) 0xe5 0x87 0x8c (e5878c)
               UTF-16 (hex) 0x51CC (51cc)
               http://www.fileformat.info/info/unicode/char/6668/index.htm
               UTF-16 (hex) U+6668
               UTF-8  (hex) 0xe6 0x99 0xa8 (e699a8)
            */
            String a = "\u51cc\u6668";

            int n = a.length();
            sysout.println("GIVEN STRING IS=" + a);
            sysout.printf("Length of string is %d%n", n);
            sysout.printf("CodePoints in string is %d%n", a.codePointCount(0, n));
            for (int i = 0; i < n; i++) {
                sysout.printf("Character[%d] is %s%n", i, a.charAt(i));
                out.write(a.charAt(i));
            }
            out.newLine();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}


Output on a UTF-8 enabled terminal:
java build_utf01
GIVEN STRING IS=凌晨
Length of string is 2
CodePoints in string is 2
Character[0] is 凌
Character[1] is 晨
 

John B. Matthews

[...]
My question is: if I enter UTF-16 hex values and write to a file with the UTF8
codepage, will the data be encoded as UTF-8?

When I run your program, I get this file content:

$ hd out_utf.text
000000: e5 87 8c e6 99 a8 0a ?..?.?.
Do you know how to enter the hex values of UTF-8?

Do you mean like this?

String a = "\u51cc\u6668";
// Decode the bytes explicitly as UTF-8; relying on the platform default
// charset only works where that default happens to be UTF-8.
String b = new String(new byte[] {
    (byte) 0xe5, (byte) 0x87, (byte) 0x8c,
    (byte) 0xe6, (byte) 0x99, (byte) 0xa8
}, "UTF-8");
System.out.println("a.equals(b) is " + a.equals(b));

This prints "a.equals(b) is true".

For reference: $ cat ~/bin/hd
#!/usr/bin/hexdump -f
"%06.6_ax: " 16/1 "%02x " " "
16/1 "%_p" "\n"
 

RedGrittyBrick

moonhkt said:
Hi All
Why, when the file is UTF-8, do the hex values come back as 51cc and 6668?

Because those are the Unicode codepoints of the characters in the file.
od -cx utf8_file01.text

These are the byte values of the UTF8 encoding of the characters.
22e5 878c e699 a822 (with a " before and after)
  ^^ ^^^^
  e5 87 8c = U+51CC

          ^^^^ ^^
          e6 99 a8 = U+6668


As shown here:
http://www.fileformat.info/info/unicode/char/51cc/index.htm
http://www.fileformat.info/info/unicode/char/6668/index.htm



Output
.....
101 ? 20940 HEX=51cc BIN=101000111001100
102 ? 26216 HEX=6668 BIN=110011001101000
      ^^^^^ Unicode *CodePoint*

System.out.println(i + " " + str.substring(i,i+1)
    + " " + str.codePointAt(i)
            ^^^^^^^^^^^^^^^^^^ you retrieve a *CodePoint*
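To make that concrete, here is a small illustrative sketch (my own, not RedGrittyBrick's code) that hand-builds the three-byte UTF-8 form 1110xxxx 10xxxxxx 10xxxxxx for code points in the U+0800..U+FFFF range, which is exactly how U+51CC becomes e5 87 8c:

public class Utf8ByHand {
    // Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes:
    // 1110xxxx 10xxxxxx 10xxxxxx
    static byte[] encode3(int cp) {
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),          // top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // middle 6 bits
            (byte) (0x80 | (cp & 0x3F))          // low 6 bits
        };
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x51CC, 0x6668 }) {
            System.out.printf("U+%04X ->", cp);
            for (byte b : encode3(cp))
                System.out.printf(" %02x", b);
            System.out.println();              // prints e5 87 8c and e6 99 a8
        }
    }
}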
 

moonhkt


But I want to print out the UTF-8 hex values. How do I print them, e.g. U+51CC as e5 87 8c?
What code can do this?
 

markspace

moonhkt said:
But I want to print out the UTF-8 hex values. How do I print them, e.g. U+51CC as e5 87 8c?
What code can do this?


Oh, I see.

Try this:


package test;

import java.io.UnsupportedEncodingException;

public class UtfOut {
    public static void main(String[] args)
            throws UnsupportedEncodingException {
        String a = "\u51cc\u6668";

        byte[] buf = a.getBytes("UTF-8");

        for (byte b : buf) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}


You could also use a ByteArrayOutputStream.
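A hedged sketch of that alternative (class name and details are my own, not markspace's code): write the string through a UTF-8 OutputStreamWriter into a ByteArrayOutputStream and dump the captured bytes:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class UtfOutStream {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(baos, "UTF-8");
        w.write("\u51cc\u6668");     // characters are encoded as they are written
        w.close();

        for (byte b : baos.toByteArray()) {
            System.out.printf("%02X ", b);   // prints: E5 87 8C E6 99 A8
        }
        System.out.println();
    }
}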
 

moonhkt

UTF-8 is a mixture of 8 bit chars, and magic 8-bit sequences that turn
into 16 bit and 32 bit code sequences.

To see how the algorithm works see http://mindprod.com/jgloss/utf.html and http://mindprod.com/jgloss/codepoint.html

Hi All
Thanks for the documents on UTF-8. Actually, my company wants to use an
ISO8859-1 database to store UTF-8 data. Currently, our EDI only handles the
ISO8859-1 codepage. We want to test importing UTF-8 data. One type of EDI
with UTF-8 data can be imported, processed, and loaded into our database.
Then, exporting the data to the default codepage, IBM850, we found e5 87 8c
e6 99 a8 in the file. The export file is a mix of ISO8859-1 characters and
UTF-8 characters.

The next test is loading all possible UTF-8 characters into our database,
then exporting the loaded data into a file and comparing the two files. If
the two files do not differ, that may prove that loading UTF-8 into the
ISO8859-1 database has no bad effects.

Our database is the Progress database in character mode, running on an AIX
5.3 machine.

The next task is to build every possible UTF-8 byte sequence into a file for
the loading test.
Any suggestion?
 

RedGrittyBrick

moonhkt said:
Actually, My company want using
ISO8859-1 database to store UTF-8 data.

Your company should use a Unicode database to store Unicode data. The
Progress DBMS supports Unicode.
Currently, our EDI just handle
ISO8859-1 codepage. We want to test import UTF-8 data. One type EDI
with UTF-8 Data can be import and processed loading to our database.
Then export the data to default codepage, IBM850, we found e5 87 8c
e6 99 a8 in the file.

This seems crazy to me. The DBMS functions for working with CHAR
datatypes will do bad things if you have misled the DBMS into treating
UTF-8 encoded data as if it were ISO 8859-1. You will no longer be able
to fit 10 chars in a CHAR(10) field, for example.
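As a rough illustration of that point (a sketch of my own, using the two characters from this thread): two Java chars become six UTF-8 bytes, and six byte-sized "characters" once the database misreads them as ISO 8859-1:

import java.io.UnsupportedEncodingException;

public class CharVsByteCount {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "\u51cc\u6668";   // 凌晨

        System.out.println("String length (chars): " + s.length());                  // 2
        System.out.println("UTF-8 bytes          : " + s.getBytes("UTF-8").length);  // 6
        System.out.println("Length if the UTF-8 bytes are read back as ISO-8859-1: "
                + new String(s.getBytes("UTF-8"), "ISO-8859-1").length());           // 6
    }
}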
The Export file are mix ISO8859-1 chars and UTF-8 character.

Sorry to be so negative, but this seems a recipe for disaster.

The next test is loading all possible UTF-8 character to our database
then export the loaded data into a file, for compare two file. If two
different, we may be proof that loading UTF-8 into ISO8859-1 database
without any of bad effect.

I think you'll have a false sense of optimism and discover bad effects
later.

Our Database is Progress Database for Character mode run on AIX 5.3
Machine.

A 1998-vintage document suggests the Progress DBMS can support Unicode:
http://unicode.org/iuc/iuc13/c12/slides.ppt. Though there are a few items
in that presentation that I find troubling.

Next Task, try to build all possible UTF-8 Bit into file,for Loading
test.

Unicode contains combining characters; not all sequences of Unicode
characters are valid.
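A small sketch of the combining-character point (my own illustration, not RedGrittyBrick's code): the same visible "é" can be one code point or a base letter plus a combining mark, and java.text.Normalizer converts between the two forms:

import java.text.Normalizer;

public class CombiningDemo {
    public static void main(String[] args) {
        String precomposed = "\u00e9";        // é as a single code point
        String combining   = "e\u0301";       // e + COMBINING ACUTE ACCENT

        System.out.println("equal as stored : " + precomposed.equals(combining));  // false
        System.out.println("equal after NFC : " + precomposed.equals(
                Normalizer.normalize(combining, Normalizer.Form.NFC)));            // true
    }
}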

Any suggestion ?

Reconsider :)
 

Lew

moonhkt wrote:
Thank for documents for UTF-8. Actually, My company want using
ISO8859-1 database to store UTF-8 data. Currently, our EDI just handle

That statement doesn't make sense. What makes sense would be, "My company
wants to store characters with an ISO8859-1 encoding". There is not any such
thing, really, as "UTF-8 data". What there is is character data. Others
upthread have explained this; you might wish to review what people told you
about how data in a Java 'String' is always UTF-16. You read it into the
'String' using an encoding argument to the 'Reader' to understand the encoding
of the source, and you write it to the destination using whatever encoding in
the 'Writer' that you need.
ISO8859-1 codepage. We want to test import UTF-8 data. One type EDI

The term "UTF-8 data" has no meaning.
with UTF-8 Data can be import and processed loading to our database.
Then export the data to default codepage, IBM850, we found e5 87 8c
e6 99 a8 in the file. The Export file are mix ISO8859-1 chars and
UTF-8 character.

You simply map the 'String' data to the database column using JDBC. The
connection and JDBC driver handle the encoding, AIUI.
The next test is loading all possible UTF-8 character to our database
then export the loaded data into a file, for compare two file. If two
different, we may be proof that loading UTF-8 into ISO8859-1 database
without any of bad effect.

There are an *awful* lot of UTF-encoded characters, over 107,000. Most are
not encodable with ISO-8859-1, which only handles 256 characters.
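One way to check that, sketched as my own illustration (not Lew's code) on top of the standard java.nio.charset API: ask an ISO-8859-1 CharsetEncoder whether it can encode a given character:

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class CanEncodeCheck {
    public static void main(String[] args) {
        CharsetEncoder latin1 = Charset.forName("ISO-8859-1").newEncoder();

        System.out.println("é  -> " + latin1.canEncode('\u00e9'));  // true: in ISO-8859-1
        System.out.println("凌 -> " + latin1.canEncode('\u51cc'));  // false: CJK is outside Latin-1
    }
}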
Our Database is Progress Database for Character mode run on AIX 5.3
Machine.

Next Task, try to build all possible UTF-8 Bit into file,for Loading
test.
Any suggestion ?

That'll be a rather large file.

Why don't you Google for character encoding and what different encodings can
handle?

Also:
<http://en.wikipedia.org/wiki/Unicode>
<http://en.wikipedia.org/wiki/ISO-8859-1>
 

moonhkt


Thanks for your reminder. But our database already has Chinese/Japanese/
Korean coded data in it.
Those data are updated by a lookup program; e.g. when "PEN" is input, the
Chinese GB2312 or BIG5 code is returned.
We have already asked Progress technical support about this case; they also
suggest using a UTF-8 database.

But we cannot move to a UTF-8 database. Only some fields have this case, and
those fields will not be updated using substring, upcase or other string
operations. Up to now, those CJK values have not caused any problem in over
10 years.

The point about Unicode combining characters is one of our considerations.
 

moonhkt


Why am I testing with Java? I want to check the byte values of my output in
Progress.
We want to check what values Progress writes when it exports the data.
For the Chinese word "凌晨", the UTF-16 code points are 51CC and 6668, and
the UTF-8 byte values are e5 87 8c e6 99 a8.

In Progress, the inputted data viewed on a UTF-8 terminal shows as 凌晨. So
we felt it is not harmful to the ISO8859-1 database. Actually, the database
seems to handle characters 0x00 to 0xFF. The number of bytes for 凌晨 is six.
 

Arved Sandstrom

Lew said:
moonhkt wrote:


That statement doesn't make sense. What makes sense would be, "My
company wants to store characters with an ISO8859-1 encoding". There is
not any such thing, really, as "UTF-8 data". What there is is character
data. Others upthread have explained this; you might wish to review
what people told you about how data in a Java 'String' is always
UTF-16. You read it into the 'String' using an encoding argument to the
'Reader' to understand the encoding of the source, and you write it to
the destination using whatever encoding in the 'Writer' that you need.


The term "UTF-8 data" has no meaning.
[ SNIP ]

That's a bit nitpicky for me. If you're going to get that precise then
there's no such thing as character data either, since characters are
also an interpretation of binary bytes and words. In this view there's
no difference between a Unicode file and a PNG file and a PDF file and
an ASCII file.

Since we do routinely describe files by the only useful interpretation
of them, why not UTF-8 data files?

AHS
 

markspace

moonhkt said:
In Progress, viewed the inputted data by UTF-8 terminal as a 凌晨. So,
we felt it is not awful to ISO8859-1 database. Actually, Database seem
to be handle 0x00 to 0xFF characters. The number of byte for 凌晨 to be
six byte.

Correct. You can't fit six bytes into one. You can't store all UTF-8
characters into an ISO8859-1 file. Some (most) will get truncated.

For a 10 year old database, it's time to upgrade. Go with UTF-8 (or
UTF-16).
 

Lew

Arved said:
That's a bit nitpicky for me. If you're going to get that precise then
there's no such thing as character data either, since characters are
also an interpretation of binary bytes and words. In this view there's
no difference between a Unicode file and a PNG file and a PDF file and
an ASCII file.

Since we do routinely describe files by the only useful interpretation
of them, why not UTF-8 data files?

You are right, generally, but the OP evinced an understanding of the term that
was interfering with his ability to accomplish his goal. I suggest that
thinking of the data as just "characters" and segregating the concept of the
encoding will help him.

Once he's got the hang of it, then, yeah, go ahead and call it "UTF-8 data".
 

moonhkt

But you can store 6 bytes as 6 Latin-1 chars (as long as
the DB doesn't suppress the "invalid" values; most don't)

It just won't have the right semantics.

   BugBear

What is your problem ?

The six bytes: 3 for the first character and the next 3 bytes for the second
character.
Actually, we tried import and export, and compared the two files; they are
the same.

The next task is extended ASCII codes, 80 to FF; those values are not part of
UTF-8. Does that mean the output file cannot include byte values 80 to FF?
And how to handle converting 0xBC (fraction one quarter) and 0xBD (fraction
one half) to UTF-8, or converting some other extended ASCII values to UTF-8.

Below are extended ASCII codes found in our database, ISO8859-1:
0x85
0xA9
0xAE
0xAE
 

Lew

What is your problem ?

How do you mean that question? I don't see any problem from him.
The six bytes , 3 for first character and next 3 bytes for seconding
character.
Actually, We tried import and export , and compare two file are same.

The next task, is Extended ascii code, 80 to FF, value is not part of
UTF-8. It is means that the Output file can not include 80 to FF bytes
value ?

No. Those bytes can appear, and sometimes will, in a UTF-8-encoded file.
And handle 0xBC, Fraction one quarter, 0xBD,Fraction one half
conversion to UTF-8. or some value in Extended ASCII code to UTF-8
conversion.

Once again, as many have mentioned, you can read into a 'String' from, say, an
ISO8859-1 source and write to a UTF-8 sink using the appropriate encoding
arguments to the constructors of your 'Reader' and your 'Writer'.
Below Extended ASCII code found in our Database, ISO8859-1.
0x85
0xA9
0xAE

So? Read them in using one encoding and write them out using another. Done.
Easy. End of story.
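A hedged sketch of that recipe (the file names and the line-oriented loop are my own assumptions, not from the thread): read with an ISO8859-1 Reader and write with a UTF-8 Writer, and the re-encoding happens automatically:

import java.io.*;

public class Transcode {
    public static void main(String[] args) throws IOException {
        // Assumed file names, for illustration only.
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("latin1_input.txt"), "ISO-8859-1"));
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("utf8_output.txt"), "UTF-8"));

        String line;
        while ((line = in.readLine()) != null) {
            out.write(line);      // e.g. byte 0xBC (¼) is re-encoded as C2 BC in UTF-8
            out.newLine();
        }
        in.close();
        out.close();
    }
}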

Why do you keep asking the same question over and over again after so many
have answered it? There must be some detail in the answers that isn't clear
to you. What exactly is that?
 

RedGrittyBrick

By which I believe bugbear means that if your database thinks the octets
are ISO-8859-1 whereas they are in reality UTF-8, then the database's
understanding of the meaning (semantics) of those octets is wrong.
That's all. The implication is that sorting (i.e. collation) and string
operations like case shifting and substring operations will often act
incorrectly.
The six bytes , 3 for first character and next 3 bytes for seconding
character.

The number of bytes per character is anywhere between one and four. Some
characters will be represented by one byte, others by two bytes ...
Actually, We tried import and export , and compare two file are same.

Which is what your objective was. Job done?
The next task, is Extended ascii code, 80 to FF,

There are many different 8-bit character sets that are sometimes
labelled "extended ASCII". ISO-8859-1 is one. Windows Latin 1 is
another, Code page 850 another.
value is not part of UTF-8.

Yes it is! As Lew said, those byte values will appear in UTF-8 encoded
character data.


It is means that the Output file can not include 80 to FF bytes

Yes it can.
And handle 0xBC, Fraction one quarter, 0xBD,Fraction one half
conversion to UTF-8. or some value in Extended ASCII code to UTF-8
conversion.

0xBC is not "Fraction one quarter" in some "extended ASCII" character
sets. For example in Code Page 850 it is a "box drawing double up and
left" character. I guess when you say "extended ASCII" you are only
considering "ISO 8859-1"?
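To see the difference, a small sketch of my own (it assumes the Cp850 charset is installed in the JRE, which it usually is): encode U+00BC in three character sets and compare the bytes:

import java.io.UnsupportedEncodingException;

public class QuarterSign {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String quarter = "\u00BC";   // VULGAR FRACTION ONE QUARTER

        // Expected bytes: ISO-8859-1 -> bc, UTF-8 -> c2 bc, Cp850 -> ac
        for (String cs : new String[] { "ISO-8859-1", "UTF-8", "Cp850" }) {
            System.out.printf("%-10s:", cs);
            for (byte b : quarter.getBytes(cs)) {
                System.out.printf(" %02x", b);
            }
            System.out.println();
        }
    }
}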
Below Extended ASCII code found in our Database, ISO8859-1.
0x85
0xA9
0xAE

Since you are using your ISO 8859-1 database as a generic byte-bucket,
you have to know what encoding was used to insert those byte sequences.

They don't look like a valid sequence in UTF-8 encoding.
AFAIK ellipsis, copyright, registered in UTF-8 would be C2 85 C2 A9 C2 AE.

Maybe some of the columns in your ISO 8859-1 database do contain ISO
8859-1 encoded data, whilst other columns (or rows - eeek!) actually
contain UTF-8 encoded data.

If you don't know which columns/rows contain which encodings then you
have a problem.

In an earlier response I said that I view this as a recipe for disaster.
 

Roedy Green

Thanks for the documents on UTF-8. Actually, my company wants to use an
ISO8859-1 database to store UTF-8 data. [...]
The next task is to build every possible UTF-8 byte sequence into a file for
the loading test.
Any suggestion?

You lied to your database and partly got away with it.

Here's the problem.

If you just look at a stream of bytes, you can't tell for sure if it
is UTF-8 or ISO-8859-1. There is no special marker. A human can make
a pretty good guess, but it is still a guess. The database just
treats the string as a slew of bits. It stores them and regurgitates
them identically. It does not really matter what encoding they are.

UNLESS you start doing some ad hoc queries not using your Java code
that is aware of the deception.

Now when you say search for c^aro (the Esperanto word for cart), the
search engine is going to look for a UTF-8-like set of bits with an
accented c. It won't find them unless the database truly is UTF-8 or
it is one of those lucky situations where UTF-8 and ISO are the same.
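As a rough illustration of that mismatch (my own sketch, spelling ĉaro with U+0109): the UTF-8 bytes for the word, re-read as ISO-8859-1, come back as a different string, so a Latin-1 search will never match them:

import java.io.UnsupportedEncodingException;

public class Mojibake {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String word = "\u0109aro";   // "ĉaro", c with circumflex

        // The bytes actually sitting in the database column (UTF-8 encoded).
        byte[] stored = word.getBytes("UTF-8");

        // What the database "sees" if it believes those bytes are ISO-8859-1.
        String asLatin1 = new String(stored, "ISO-8859-1");

        System.out.println("original      : " + word);
        System.out.println("misinterpreted: " + asLatin1);   // mojibake: "Ä" + a control char + "aro"
        System.out.println("equal? " + word.equals(asLatin1));
    }
}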

Telling your database engine the truth has another advantage. It can
use a more optimal compression algorithm.

Usually you store your database in UTF-8. Some legacy apps may
request some other encoding, and the database will translate for it in
and out. However, if you have lied about any of the encodings, this
translation process will go nuts.

One of the functions of a database is to hide the actual
representation of the data. It serves it up any way you like it. This
makes it possible to change the internal representation of the
database without changing all the apps at the same time.

--
Roedy Green Canadian Mind Products
http://mindprod.com

You can’t have great software without a great team, and most software teams behave like dysfunctional families.
~ Jim McCarthy
 
