Reading a UTF-8 file returns UTF-16 hex values?

Discussion in 'Java' started by moonhkt, Jan 29, 2010.

  1. moonhkt

    moonhkt Guest

    Hi All
    Why, when reading a UTF-8 file, do the hex values come back as 51cc and 6668?

    od -cx utf8_file01.text

    22e5 878c e699 a822 (with " before and after)

    http://www.fileformat.info/info/unicode/char/51cc/index.htm
    http://www.fileformat.info/info/unicode/char/6668/index.htm

    Output
    ......
    101 ? 20940 HEX=51cc BIN=101000111001100
    102 ? 26216 HEX=6668 BIN=110011001101000

    Java program

    import java.io.*;

    public class read_utf_line {
        public static void main(String[] args) {
            File aFile = new File("utf8_file01.text");
            try {
                System.out.println(aFile);
                // Decode the UTF-8 file; the chars held in the String are UTF-16.
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(aFile), "UTF8"));

                String str;
                while ((str = in.readLine()) != null) {
                    int stlen = str.length();
                    System.out.println(stlen);
                    for (int i = 0; i < stlen; ++i) {
                        int val = str.codePointAt(i);   // Unicode code point, not UTF-8 bytes
                        String hexstr = Integer.toHexString(val);
                        String bystr = Integer.toBinaryString(val);

                        System.out.println(i + " " + str.substring(i, i + 1)
                            + " " + str.codePointAt(i)
                            + " HEX=" + hexstr
                            + " BIN=" + bystr
                            );
                    }
                }
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
     
    moonhkt, Jan 29, 2010
    #1

  2. moonhkt

    moonhkt Guest

    On Jan 29, 3:59 pm, Peter Duniho <> wrote:
    > moonhkt wrote:
    > > Hi All
    > > Why, when reading a UTF-8 file, do the hex values come back as 51cc and 6668?
    > >
    > > od -cx utf8_file01.text
    > >
    > > 22e5 878c e699 a822 (with " before and after)
    >
    > I don't understand the above.  Are you trying to suggest that the text
    > 'with " before and after' is part of the output of the "od" program?  If
    > so, why does it not appear to match up with the binary values written
    > out?  And if the characters you're concerned with are at index 101 and
    > 102, why only eight bytes in the file?  And if the file is UTF-8, why
    > are you dumping its contents as shorts?  Why not just bytes?
    >
    > Frankly, the whole question doesn't make much sense to me.  That said,
    > the basic answer to your question is, I believe: UTF-8 and UTF-16 are
    > different, so of course the bytes used to represent a character in a
    > UTF-8 file are going to look different from the bytes used to represent
    > the same character in a UTF-16 data structure.
    >
    > Pete


    System : AIX 5.3

    The text file just has two UTF-8 Chinese characters.
    cat out_utf.text
    凌晨

    od -cx out_utf.text
    0000000 207 214 231 \n
    e587 8ce6 99a8 0a00
    0000007

    I used Java to build the UTF-8 data, entering the characters by their UTF-16
    values. I do not know how to enter UTF-8 hex values directly.
    My question: if I enter the UTF-16 hex values and write to a file with the UTF-8
    codepage, will the data be encoded as UTF-8?
    Do you know how to enter the hex values of UTF-8? I tried \0xe5 but it does not work.


    import java.io.*;

    public class build_utf01 {
        public static void main(String[] args)
                throws UnsupportedEncodingException {

            // I want console output in UTF-8
            PrintStream sysout = new PrintStream(System.out, true, "UTF-8");
            try {
                File oFile = new File("out_utf.text");
                BufferedWriter out = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(oFile), "UTF8"));

                /* http://www.fileformat.info/info/unicode/char/51cc/index.htm
                   U+51CC  UTF-16 (hex) 0x51cc   UTF-8 (hex) 0xe5 0x87 0x8c
                   http://www.fileformat.info/info/unicode/char/6668/index.htm
                   U+6668  UTF-16 (hex) 0x6668   UTF-8 (hex) 0xe6 0x99 0xa8
                */
                String a = "\u51cc\u6668";

                int n = a.length();
                sysout.println("GIVEN STRING IS=" + a);
                sysout.printf("Length of string is %d%n", n);
                sysout.printf("CodePoints in string is %d%n", a.codePointCount(0, n));
                for (int i = 0; i < n; i++) {
                    sysout.printf("Character[%d] is %s%n", i, a.charAt(i));
                    out.write(a.charAt(i));   // the Writer encodes each char as UTF-8
                }
                out.newLine();
                out.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }


    Output on a UTF-8 enabled terminal:
    java build_utf01
    GIVEN STRING IS=凌晨
    Length of string is 2
    CodePoints in string is 2
    Character[0] is 凌
    Character[1] is 晨
     
    moonhkt, Jan 29, 2010
    #2

  3. In article
    <>,
    moonhkt <> wrote:

    [...]
    > My question: if I enter the UTF-16 hex values and write to a file with the UTF-8
    > codepage, will the data be encoded as UTF-8?


    When I run your program, I get this file content:

    $ hd out_utf.text
    000000: e5 87 8c e6 99 a8 0a ?..?.?.

    > Do you know how to enter the hex values of UTF-8?


    Do you mean like this?

    String a = "\u51cc\u6668";
    String b = new String(new byte[] {
    (byte) 0xe5, (byte) 0x87, (byte) 0x8c,
    (byte) 0xe6, (byte) 0x99, (byte) 0xa8
    }, "UTF-8"); // decode the bytes as UTF-8 explicitly, rather than relying on the default charset
    System.out.println("a.equals(b) is " + a.equals(b));

    This prints "a.equals(b) is true".

    For reference: $ cat ~/bin/hd
    #!/usr/bin/hexdump -f
    "%06.6_ax: " 16/1 "%02x " " "
    16/1 "%_p" "\n"

    --
    John B. Matthews
    trashgod at gmail dot com
    <http://sites.google.com/site/drjohnbmatthews>
     
    John B. Matthews, Jan 29, 2010
    #3
  4. moonhkt wrote:
    > Hi All
    > Why, when reading a UTF-8 file, do the hex values come back as 51cc and 6668?

    Because those are the Unicode codepoints of the characters in the file.

    > od -cx utf8_file01.text

    These are the byte values of the UTF-8 encoding of the characters.

    > 22e5 878c e699 a822 (with " before and after)
        ^^ ^^^^
        e5 87 8c = U+51CC
                ^^^^ ^^
                e6 99 a8 = U+6668


    As shown here:

    > http://www.fileformat.info/info/unicode/char/51cc/index.htm
    > http://www.fileformat.info/info/unicode/char/6668/index.htm




    > Output
    > .....
    > 101 ? 20940 HEX=51cc BIN=101000111001100
    > 102 ? 26216 HEX=6668 BIN=110011001101000
                      ^^^^ Unicode *CodePoint*

    > System.out.println(i + " " + str.substring(i,i+1)
    > + " " + str.codePointAt(i)
              ^^^^^^^^^^^ you retrieve a *CodePoint*
    > + " HEX=" + hexstr
    > + " BIN=" + bystr
    > );



    --
    RGB
     
    RedGrittyBrick, Jan 29, 2010
    #4
  5. moonhkt

    moonhkt Guest

    On Jan 29, 8:09 pm, RedGrittyBrick <> wrote:
    > moonhkt wrote:
    > > Hi All
    > > Why, when reading a UTF-8 file, do the hex values come back as 51cc and 6668?
    >
    > Because those are the Unicode codepoints of the characters in the file.
    > [...]

    But I want to print out the UTF-8 hex values. How do I print them? E.g.
    U+51CC as e5 87 8c.
    What code can handle this?
     
    moonhkt, Jan 29, 2010
    #5
  6. moonhkt

    markspace Guest

    moonhkt wrote:

    > But I want to print out the UTF-8 hex values. How do I print them? E.g.
    > U+51CC as e5 87 8c.
    > What code can handle this?



    Oh, I see.

    Try this:


    package test;

    import java.io.UnsupportedEncodingException;

    public class UtfOut {
        public static void main(String[] args)
                throws UnsupportedEncodingException {
            String a = "\u51cc\u6668";

            byte[] buf = a.getBytes("UTF-8");

            for (byte b : buf) {
                System.out.printf("%02X ", b);
            }
            System.out.println();
        }
    }


    You could also use a ByteArrayOutputStream.
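
    A rough sketch of that ByteArrayOutputStream variant (added here for
    illustration, not markspace's code) could look like this: the string is
    pushed through an OutputStreamWriter so the UTF-8 encoder produces the bytes.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    public class UtfOutStream {
        public static void main(String[] args) throws IOException {
            String a = "\u51cc\u6668";

            // Write the string through a UTF-8 encoder into an in-memory byte buffer.
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            Writer w = new OutputStreamWriter(baos, "UTF-8");
            w.write(a);
            w.close();   // flushes the encoder

            for (byte b : baos.toByteArray()) {
                System.out.printf("%02X ", b);   // prints E5 87 8C E6 99 A8
            }
            System.out.println();
        }
    }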
     
    markspace, Jan 29, 2010
    #6
  7. moonhkt

    Roedy Green Guest

    On Thu, 28 Jan 2010 23:40:07 -0800 (PST), moonhkt <>
    wrote, quoted or indirectly quoted someone who said :

    >Hi All
    >Why, when reading a UTF-8 file, do the hex values come back as 51cc and 6668?


    UTF-8 is a mixture of plain 8-bit chars and multi-byte sequences that
    encode the larger 16-bit and 32-bit code values.

    To see how the algorithm works see
    http://mindprod.com/jgloss/utf.html
    http://mindprod.com/jgloss/codepoint.html
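
    To make the multi-byte scheme concrete, here is a rough sketch (added for
    illustration, not from the pages above) that encodes a single BMP code point
    by hand, following the standard UTF-8 bit layout. In real code,
    String.getBytes("UTF-8") does this for you.

    public class Utf8ByHand {
        // Encode one code point (limited here to the BMP, i.e. up to U+FFFF) as UTF-8.
        static byte[] encode(int cp) {
            if (cp < 0x80) {                        // 1 byte:  0xxxxxxx
                return new byte[] { (byte) cp };
            } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
                return new byte[] {
                    (byte) (0xC0 | (cp >> 6)),
                    (byte) (0x80 | (cp & 0x3F)) };
            } else {                                // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                return new byte[] {
                    (byte) (0xE0 | (cp >> 12)),
                    (byte) (0x80 | ((cp >> 6) & 0x3F)),
                    (byte) (0x80 | (cp & 0x3F)) };
            }
        }

        public static void main(String[] args) {
            for (byte b : encode(0x51CC)) {
                System.out.printf("%02x ", b);      // prints e5 87 8c
            }
            System.out.println();
        }
    }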
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    Computers are useless. They can only give you answers.
    ~ Pablo Picasso (born: 1881-10-25 died: 1973-04-08 at age: 91)
     
    Roedy Green, Jan 30, 2010
    #7
  8. moonhkt

    moonhkt Guest

    On Jan 30, 5:51 pm, Roedy Green <> wrote:
    >
    > UTF-8 is a mixture of plain 8-bit chars and multi-byte sequences that
    > encode the larger 16-bit and 32-bit code values.
    >
    > To see how the algorithm works see
    > http://mindprod.com/jgloss/utf.html
    > http://mindprod.com/jgloss/codepoint.html
    > [...]


    Hi All
    Thanks for the documents on UTF-8. Actually, my company wants to use an
    ISO8859-1 database to store UTF-8 data. Currently, our EDI only handles
    the ISO8859-1 codepage. We want to test importing UTF-8 data. One type of EDI
    with UTF-8 data can be imported, processed and loaded into our database.
    Then, when we export the data to the default codepage, IBM850, we find e5 87 8c
    e6 99 a8 in the file. The export file is a mix of ISO8859-1 characters and
    UTF-8 characters.

    The next test is loading all possible UTF-8 characters into our database,
    then exporting the loaded data into a file and comparing the two files. If the
    two files are identical, that may prove that loading UTF-8 into an ISO8859-1
    database has no bad effect.

    Our database is the Progress database in character mode, running on an AIX 5.3
    machine.

    The next task is to build all possible UTF-8 byte sequences into a file for the
    loading test.
    Any suggestion?
     
    moonhkt, Jan 30, 2010
    #8
  9. moonhkt wrote:

    > Actually, my company wants to use an
    > ISO8859-1 database to store UTF-8 data.


    Your company should use a Unicode database to store Unicode data. The
    Progress DBMS supports Unicode.

    > Currently, our EDI only handles
    > the ISO8859-1 codepage. We want to test importing UTF-8 data. One type of EDI
    > with UTF-8 data can be imported, processed and loaded into our database.
    > Then, when we export the data to the default codepage, IBM850, we find e5 87 8c
    > e6 99 a8 in the file.


    This seems crazy to me. The DBMS functions for working with CHAR
    datatypes will do bad things if you have misled the DBMS into treating
    UTF-8 encoded data as if it were ISO 8859-1. You will no longer be able
    to fit 10 chars in a CHAR(10) field, for example.

    > The export file is a mix of ISO8859-1 characters and UTF-8 characters.


    Sorry to be so negative, but this seems a recipe for disaster.


    > The next test is loading all possible UTF-8 characters into our database,
    > then exporting the loaded data into a file and comparing the two files. If the
    > two files are identical, that may prove that loading UTF-8 into an ISO8859-1
    > database has no bad effect.


    I think you'll have a false sense of optimism and discover bad effects
    later.


    > Our database is the Progress database in character mode, running on an AIX 5.3
    > machine.


    A 1998-vintage document suggests the Progress DBMS can support Unicode:
    http://unicode.org/iuc/iuc13/c12/slides.ppt. Though there are a few items
    in that presentation that I find troubling.


    > The next task is to build all possible UTF-8 byte sequences into a file for the
    > loading test.


    Unicode contains combining characters; not all sequences of Unicode
    characters are valid.


    > Any suggestion ?


    Reconsider :)

    --
    RGB
     
    RedGrittyBrick, Jan 30, 2010
    #9
  10. moonhkt

    Lew Guest

    moonhkt wrote:

    > Thanks for the documents on UTF-8. Actually, my company wants to use an
    > ISO8859-1 database to store UTF-8 data. Currently, our EDI only handles


    That statement doesn't make sense. What makes sense would be, "My company
    wants to store characters with an ISO8859-1 encoding". There is not any such
    thing, really, as "UTF-8 data". What there is is character data. Others
    upthread have explained this; you might wish to review what people told you
    about how data in a Java 'String' is always UTF-16. You read it into the
    'String' using an encoding argument to the 'Reader' to understand the encoding
    of the source, and you write it to the destination using whatever encoding in
    the 'Writer' that you need.

    > the ISO8859-1 codepage. We want to test importing UTF-8 data. One type of EDI


    The term "UTF-8 data" has no meaning.

    > with UTF-8 data can be imported, processed and loaded into our database.
    > Then, when we export the data to the default codepage, IBM850, we find e5 87 8c
    > e6 99 a8 in the file. The export file is a mix of ISO8859-1 characters and
    > UTF-8 characters.


    You simply map the 'String' data to the database column using JDBC. The
    connection and JDBC driver handle the encoding, AIUI.
    <http://java.sun.com/javase/6/docs/api/java/sql/PreparedStatement.html#setString(int,%20java.lang.String)>
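
    As a rough illustration of what that looks like (the JDBC URL, table and
    column names below are made up, not from this thread), the application only
    ever handles the String; the driver worries about the bytes:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class InsertName {
        public static void main(String[] args) throws SQLException {
            // Hypothetical connection details; the String itself carries no encoding.
            Connection con = DriverManager.getConnection(
                "jdbc:progress://host/db", "user", "pw");
            PreparedStatement ps = con.prepareStatement(
                "INSERT INTO customer (name) VALUES (?)");
            ps.setString(1, "\u51cc\u6668");   // the driver encodes the characters for the DB
            ps.executeUpdate();
            ps.close();
            con.close();
        }
    }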

    > The next test is loading all possible UTF-8 characters into our database,
    > then exporting the loaded data into a file and comparing the two files. If the
    > two files are identical, that may prove that loading UTF-8 into an ISO8859-1
    > database has no bad effect.


    There are an *awful* lot of Unicode characters, over 107,000. Most are
    not encodable with ISO-8859-1, which only handles 256 characters.

    > Our database is the Progress database in character mode, running on an AIX 5.3
    > machine.
    >
    > The next task is to build all possible UTF-8 byte sequences into a file for the
    > loading test.
    > Any suggestion?


    That'll be a rather large file.

    Why don't you Google for character encoding and what different encodings can
    handle?

    Also:
    <http://en.wikipedia.org/wiki/Unicode>
    <http://en.wikipedia.org/wiki/ISO-8859-1>

    --
    Lew
     
    Lew, Jan 30, 2010
    #10
  11. moonhkt

    moonhkt Guest

    On Jan 31, 12:16 am, RedGrittyBrick <> wrote:
    > moonhkt wrote:
    > > Actually, my company wants to use an
    > > ISO8859-1 database to store UTF-8 data.
    >
    > Your company should use a Unicode database to store Unicode data. The
    > Progress DBMS supports Unicode.
    > [...]
    > Unicode contains combining characters; not all sequences of Unicode
    > characters are valid.
    >
    > > Any suggestion?
    >
    > Reconsider :)

    Thanks for your reminder. But our database already has Chinese/Japanese/
    Korean code data in it.
    That data is updated by a lookup program; e.g. when "PEN" is input it gets the
    Chinese GB2312 or BIG5 code.
    We have already asked Progress TS about this case; they also suggest using
    a UTF-8 database.

    But we cannot move to a UTF-8 database. Only some fields have this
    case, and those fields will not use substring, upcase or other string
    operations to update them. Up to now, those CJK values have had no
    problem for over 10+ years.

    The point that Unicode contains combining characters is one consideration.
     
    moonhkt, Jan 30, 2010
    #11
  12. moonhkt

    moonhkt Guest

    On Jan 31, 12:48 am, moonhkt <> wrote:
    > On Jan 31, 12:16 am, RedGrittyBrick <> wrote:
    > > [...]
    >
    > Thanks for your reminder. But our database already has Chinese/Japanese/
    > Korean code data in it.
    > [...]
    > The point that Unicode contains combining characters is one consideration.


    Why is my testing using Java? I want to check the byte values of my
    output in Progress.
    We want to check what values appear when the data is exported by Progress.
    For the Chinese word "凌晨", the UTF-16 code points are 51CC and 6668, and the
    byte values are e5 87 8c e6 99 a8.

    In Progress, the inputted data is viewed on a UTF-8 terminal as 凌晨. So
    we felt it is not awful for an ISO8859-1 database. Actually, the database seems
    to handle 0x00 to 0xFF characters. The number of bytes for 凌晨 is
    six.
     
    moonhkt, Jan 30, 2010
    #12
  13. Lew wrote:
    > moonhkt wrote:
    >> Thanks for the documents on UTF-8. Actually, my company wants to use an
    >> ISO8859-1 database to store UTF-8 data. [...]
    >
    > That statement doesn't make sense. What makes sense would be, "My
    > company wants to store characters with an ISO8859-1 encoding". There is
    > not any such thing, really, as "UTF-8 data". What there is is character
    > data. Others upthread have explained this; you might wish to review
    > what people told you about how data in a Java 'String' is always
    > UTF-16. [...]
    >
    > The term "UTF-8 data" has no meaning.

    [ SNIP ]

    That's a bit nitpicky for me. If you're going to get that precise then
    there's no such thing as character data either, since characters are
    also an interpretation of binary bytes and words. In this view there's
    no difference between a Unicode file and a PNG file and a PDF file and
    an ASCII file.

    Since we do routinely describe files by the only useful interpretation
    of them, why not UTF-8 data files?

    AHS
     
    Arved Sandstrom, Jan 30, 2010
    #13
  14. moonhkt

    markspace Guest

    moonhkt wrote:

    > In Progress, the inputted data is viewed on a UTF-8 terminal as 凌晨. So
    > we felt it is not awful for an ISO8859-1 database. Actually, the database seems
    > to handle 0x00 to 0xFF characters. The number of bytes for 凌晨 is
    > six.


    Correct. You can't fit six bytes into one. You can't store all UTF-8
    characters into an ISO8859-1 file. Some (most) will get truncated.

    For a 10-year-old database, it's time to upgrade. Go with UTF-8 (or
    UTF-16).
     
    markspace, Jan 30, 2010
    #14
  15. moonhkt

    Lew Guest

    Lew wrote:
    >> The term "UTF-8 data" has no meaning.


    Arved Sandstrom wrote:
    > That's a bit nitpicky for me. If you're going to get that precise then
    > there's no such thing as character data either, since characters are
    > also an interpretation of binary bytes and words. In this view there's
    > no difference between a Unicode file and a PNG file and a PDF file and
    > an ASCII file.
    >
    > Since we do routinely describe files by the only useful interpretation
    > of them, why not UTF-8 data files?


    You are right, generally, but the OP evinced an understanding of the term that
    was interfering with his ability to accomplish his goal. I suggest that
    thinking of the data as just "characters" and segregating the concept of the
    encoding will help him.

    Once he's got the hang of it, then, yeah, go ahead and call it "UTF-8 data".

    --
    Lew
     
    Lew, Jan 30, 2010
    #15
  16. moonhkt

    moonhkt Guest

    On Feb 1, 5:47 pm, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > markspace wrote:
    > > moonhkt wrote:
    > >> [...]
    >
    > > Correct.  You can't fit six bytes into one.  You can't store all UTF-8
    > > characters into an ISO8859-1 file.  Some (most) will get truncated.
    >
    > But you can store 6 bytes as 6 Latin-1 chars (as long as
    > the DB doesn't suppress the "invalid" values; most don't)
    >
    > It just won't have the right semantics.
    >
    >    BugBear


    What is your problem ?

    Of the six bytes, 3 are for the first character and the next 3 bytes are for
    the second character.
    Actually, we tried import and export, and compared the two files; they are the same.

    The next task is extended ASCII codes, 80 to FF;
    that value range is not part of UTF-8.
    Does that mean the output file cannot include byte values 80 to FF?
    And how to handle the conversion of 0xBC (fraction one quarter) and 0xBD
    (fraction one half) to UTF-8, or the conversion of other extended ASCII values
    to UTF-8?

    The extended ASCII codes below were found in our database, ISO8859-1:
    0x85
    0xA9
    0xAE
     
    moonhkt, Feb 3, 2010
    #16
  17. moonhkt

    Lew Guest

    bugbear wrote:
    >> But you can store 6 bytes as 6 Latin-1 chars (as long as
    >> the DB doesn't suppress the "invalid" values; most don't)
    >>
    >> It just won't have the right semantics.


    moonhkt wrote:
    > What is your problem ?


    How do you mean that question? I don't see any problem from him.

    > Of the six bytes, 3 are for the first character and the next 3 bytes are for
    > the second character.
    > Actually, we tried import and export, and compared the two files; they are the same.
    >
    > The next task is extended ASCII codes, 80 to FF;
    > that value range is not part of UTF-8.
    > Does that mean the output file cannot include byte values 80 to FF?


    No. Those bytes can appear, and sometimes will, in a UTF-8-encoded file.

    > And how to handle the conversion of 0xBC (fraction one quarter) and 0xBD
    > (fraction one half) to UTF-8, or the conversion of other extended ASCII values
    > to UTF-8?


    Once again, as many have mentioned, you can read into a 'String' from, say, an
    ISO8859-1 source and write to a UTF-8 sink using the appropriate encoding
    arguments to the constructors of your 'Reader' and your 'Writer'.
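
    A minimal sketch of that round trip (added for illustration, with made-up
    file names, and assuming the input really is ISO8859-1):

    import java.io.*;

    public class Latin1ToUtf8 {
        public static void main(String[] args) throws IOException {
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("export_latin1.txt"), "ISO-8859-1"));
            BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("export_utf8.txt"), "UTF-8"));

            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);    // the chars are re-encoded as UTF-8 on the way out
                out.newLine();
            }
            in.close();
            out.close();
        }
    }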

    > The extended ASCII codes below were found in our database, ISO8859-1:
    > 0x85
    > 0xA9
    > 0xAE


    So? Read them in using one encoding and write them out using another. Done.
    Easy. End of story.

    Why do you keep asking the same question over and over again after so many
    have answered it? There must be some detail in the answers that isn't clear
    to you. What exactly is that?

    --
    Lew
     
    Lew, Feb 3, 2010
    #17
  18. moonhkt wrote:
    > bugbear wrote:
    >> markspace wrote:
    >>> moonhkt wrote:
    >>>> In Progress, the inputted data is viewed on a UTF-8 terminal as 凌晨. So
    >>>> we felt it is not awful for an ISO8859-1 database. Actually, the database
    >>>> seems to handle 0x00 to 0xFF characters. The number of bytes for 凌晨 is six.
    >>> Correct. You can't fit six bytes into one. You can't store all UTF-8
    >>> characters into an ISO8859-1 file. Some (most) will get truncated.
    >> But you can store 6 bytes as 6 Latin-1 chars (as long as
    >> the DB doesn't suppress the "invalid" values; most don't)
    >>
    >> It just won't have the right semantics.


    By which I believe bugbear means that if your database thinks the octets
    are ISO-8859-1 whereas they are in reality UTF-8, then the database's
    understanding of the meaning (semantics) of those octets is wrong.
    That's all. The implication is that sorting (i.e. collation) and string
    operations like case shifting and substring operations will often act
    incorrectly.

    > Of the six bytes, 3 are for the first character and the next 3 bytes are for
    > the second character.


    The number of bytes per character is anywhere between one and four. Some
    characters will be represented by one byte, others by two bytes ...

    > Actually, we tried import and export, and compared the two files; they are the same.


    Which is what your objective was. Job done?

    >
    > The next task is extended ASCII codes, 80 to FF;


    There are many different 8-bit character sets that are sometimes
    labelled "extended ASCII". ISO-8859-1 is one. Windows Latin 1 is
    another, Code page 850 another.

    > that value range is not part of UTF-8.


    Yes it is! As Lew said, those byte values will appear in UTF-8 encoded
    character data.


    > Does that mean the output file cannot include byte values 80 to FF?


    Yes it can.

    > And how to handle the conversion of 0xBC (fraction one quarter) and 0xBD
    > (fraction one half) to UTF-8, or the conversion of other extended ASCII values
    > to UTF-8?


    0xBC is not "Fraction one quarter" in some "extended ASCII" character
    sets. For example in Code Page 850 it is a "box drawing double up and
    left" character. I guess when you say "extended ASCII" you are only
    considering "ISO 8859-1"?

    >
    > The extended ASCII codes below were found in our database, ISO8859-1:
    > 0x85
    > 0xA9
    > 0xAE
    >


    Since you are using your ISO 8859-1 database as a generic byte-bucket,
    you have to know what encoding was used to insert those byte sequences.

    They don't look like a valid sequence in UTF-8 encoding.
    AFAIK those three characters (0x85, copyright, registered), re-encoded from
    ISO 8859-1 into UTF-8, would be C2 85 C2 A9 C2 AE.
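
    A quick way to check that re-encoding (a sketch added here, not part of the
    original post):

    import java.io.UnsupportedEncodingException;

    public class ReencodeCheck {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // The three byte values seen in the database, treated as ISO 8859-1 characters.
            byte[] latin1 = { (byte) 0x85, (byte) 0xA9, (byte) 0xAE };
            String s = new String(latin1, "ISO-8859-1");   // three characters

            for (byte b : s.getBytes("UTF-8")) {
                System.out.printf("%02X ", b);             // prints C2 85 C2 A9 C2 AE
            }
            System.out.println();
        }
    }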

    Maybe some of the columns in your ISO 8859-1 database do contain ISO
    8859-1 encoded data, whilst other columns (or rows - eeek!) actually
    contain UTF-8 encoded data.

    If you don't know which columns/rows contain which encodings then you
    have a problem.

    In an earlier response I said that I view this as a recipe for disaster.
     
    RedGrittyBrick, Feb 3, 2010
    #18
  19. moonhkt

    Roedy Green Guest

    On Sat, 30 Jan 2010 07:23:55 -0800 (PST), moonhkt <>
    wrote, quoted or indirectly quoted someone who said :

    >Thanks for the documents on UTF-8. Actually, my company wants to use an
    >ISO8859-1 database to store UTF-8 data. Currently, our EDI only handles
    >the ISO8859-1 codepage. We want to test importing UTF-8 data. [...]
    >
    >The next test is loading all possible UTF-8 characters into our database,
    >then exporting the loaded data into a file and comparing the two files. If the
    >two files are identical, that may prove that loading UTF-8 into an ISO8859-1
    >database has no bad effect.
    >[...]
    >Any suggestion?


    You lied to your database and partly got away with it.

    Here's the problem.

    If you just look at a stream of bytes, you can't tell for sure if it
    is UTF-8 or ISO-8859-1. There is no special marker. A human can make
    a pretty good guess, but it is still a guess. The database just
    treats the string as a slew of bits. It stores them and regurgitates
    them identically. It does not really matter what encoding they are.

    UNLESS you start doing some ad hoc queries not using your Java code
    that is aware of the deception.

    Now when you say search for c^aro (the Esperanto word for cart), the
    search engine is going to look for a UTF-8-like set of bits with an
    accented c. It won't find them unless the database truly is UTF-8 or
    it is one of those lucky situation where UTF-8 and ISO are the same.

    Telling your database engine the truth has another advantage. It can
    use a more optimal compression algorithm.

    Usually you store your database in UTF-8. Some legacy apps may
    request some other encoding, and the database will translate for it in
    and out. However, if you have lied about any of the encodings, this
    translation process will go nuts.

    One of the functions of a database is to hide the actual
    representation of the data. It serves it up any way you like it. This
    makes it possible to change the internal representation of the
    database without changing all the apps at the same time.

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    You can’t have great software without a great team, and most software teams behave like dysfunctional families.
    ~ Jim McCarthy
     
    Roedy Green, Feb 5, 2010
    #19