strings - reading utf8 characters such as japanese. how?

Discussion in 'Java' started by stefoid, Jul 3, 2006.

  1. stefoid

    stefoid Guest

    Hi. I've got a problem. I have some code that takes a text file and
    breaks it into an array of substrings for displaying the text truncated
    to fit the screen width on word boundaries.

    It just looks for the spaces.

    Trouble is, it crashes out with Japanese text. There is a part of the
    code that looks at the next character to see if it is a space:

    ch = str.substring(offset, offset + 1);
    isSpace = false;

    // return when a new line is reached
    if (ch.equals("\n"))
        return offset + 1;

    currentWidth += font.stringWidth(ch);

    if (ch.equals(" "))
        isSpace = true;

    and if it isn't a space, it adds the width of the character (in pixels),
    and keeps going until it does find a space.

    The problem with this is that it assumes each byte is a character. In
    UTF-8, up to three bytes can make up one character, so this code is trying
    to find the widths of characters representing each byte of a UTF-8
    sequence, rather than the width of the UTF-8 character as a whole.

    My additional problem is that this is iAppli code, so I am limited to a 30K
    codebase, and I have hit the limit, so I can't write any more lines of
    code - I just have to change the existing code such that it doesn't
    generate any more bytecode.

    What can I do to the above code so that I can count widths of UTF-8
    characters instead of ASCII characters, without writing too much extra
    code? I need existing Java library functions to do it for me, but I
    don't know what that functionality is.
    stefoid, Jul 3, 2006
    #1

  2. stefoid wrote:

    > Hi. I've got a problem. I have some code that takes a text file and
    > breaks it into an array of substrings for displaying the text truncated
    > to fit the screen width on word boundaries.
    >
    > It just looks for the spaces.
    >
    > Trouble is, it crashes out with Japanese text. There is a part of the
    > code that looks at the next character to see if it is a space:
    >
    > ch = str.substring(offset, offset + 1);
    > isSpace = false;
    >
    > // return when a new line is reached
    > if (ch.equals("\n"))
    > return offset+1;
    >
    > currentWidth += font.stringWidth(ch);
    >
    > if (ch.equals(" "))
    > isSpace = true;
    >
    > and if it isn't a space, it adds the width of the character (in pixels),
    > and keeps going until it does find a space.
    >
    > The problem with this is that it assumes each byte is a character. In
    > UTF-8, up to three bytes can make up one character, so this code is trying
    > to find the widths of characters representing each byte of a UTF-8
    > sequence, rather than the width of the UTF-8 character as a whole.
    >
    > My additional problem is that this is iAppli code, so I am limited to a 30K
    > codebase, and I have hit the limit, so I can't write any more lines of
    > code - I just have to change the existing code such that it doesn't
    > generate any more bytecode.
    >
    > What can I do to the above code so that I can count widths of UTF-8
    > characters instead of ASCII characters, without writing too much extra
    > code? I need existing Java library functions to do it for me, but I
    > don't know what that functionality is.


    have a look at:
    http://javaalmanac.com/egs/java.nio.charset/ConvertChar.html
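
    (That page appears to use the java.nio.charset classes, which are J2SE-only
    and almost certainly absent from a cut-down VM -- but for reference, the core
    of it is roughly the sketch below. utf8Bytes and someString just stand in
    for your own data.)

    <code>
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;

    Charset utf8 = Charset.forName("UTF-8");
    CharBuffer chars = utf8.decode(ByteBuffer.wrap(utf8Bytes));    // UTF-8 bytes -> decoded characters
    ByteBuffer encoded = utf8.encode(CharBuffer.wrap(someString)); // characters -> UTF-8 bytes
    </code>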
    Damian Driscoll, Jul 3, 2006
    #2

  3. stefoid

    Chris Uppal Guest

    stefoid wrote:

    > What can I do to the above code so that I can count widths of UTF-8
    > characters instead of ASCII characters, without writing too much extra
    > code? I need existing Java library functions to do it for me, but I
    > don't know what that functionality is.


    Why are you working in UTF8 using Java Strings ? Indeed /how/ are you doing
    it -- I would put it somewhere between impossible and dangerously difficult and
    confusing.

    If you want to load your information into /text/, let Java decode the external
    UTF-8 into Strings (of characters, already decoded as they are read in). If,
    possibly for space reasons, you have to work in UTF-8 internally, then you'd be
    far better off keeping the data in byte[] arrays.
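
    (For what it's worth, in standard Java that decode step is just a matter of
    naming the encoding when you construct the String or the Reader; whether a
    cut-down VM offers the same constructors is another matter. A rough sketch,
    where utf8Bytes / someInputStream stand in for wherever the data comes from:)

    <code>
    // Decode up front: raw UTF-8 bytes in, already-decoded characters out.
    String text = new String(utf8Bytes, "UTF-8");

    // Or decode on the fly while reading:
    Reader r = new InputStreamReader(someInputStream, "UTF-8");
    // (both of these can throw UnsupportedEncodingException)
    </code>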

    -- chris
    Chris Uppal, Jul 3, 2006
    #3
  4. stefoid

    Oliver Wong Guest

    "stefoid" <> wrote in message
    news:...
    > Hi. I've got a problem. I have some code that takes a text file and
    > breaks it into an array of substrings for displaying the text truncated
    > to fit the screen width on word boundaries.
    >
    > It just looks for the spaces.
    >
    > Trouble is, it crashes out with Japanese text. There is a part of the
    > code that looks at the next character to see if it is a space:
    >
    > ch = str.substring(offset, offset + 1);
    > isSpace = false;
    >
    > // return when a new line is reached
    > if (ch.equals("\n"))
    > return offset+1;
    >
    > currentWidth += font.stringWidth(ch);
    >
    > if (ch.equals(" "))
    > isSpace = true;
    >
    > and if it isn't a space, it adds the width of the character (in pixels),
    > and keeps going until it does find a space.


    How about something like:

    <pseudoCode>
    StringTokenizer st = new StringTokenizer(str, " \n", true);
    int offset = 0;
    while (st.hasMoreTokens()) {
        String token = st.nextToken();
        if (token.equals(" ")) {
            /* do whatever you gotta do with spaces here */
            offset++;
        } else if (token.equals("\n")) {
            return offset;
        } else {
            currentWidth += font.stringWidth(token);
            offset += token.length();
        }
    }
    </pseudoCode>

    You'll avoid breaking up the string into its individual codepoints,
    potentially splitting a character in two.

    >
    > The problem with this is that it assumes each byte is a character. In
    > UTF-8, up to three bytes can make up one character, so this code is trying
    > to find the widths of characters representing each byte of a UTF-8
    > sequence, rather than the width of the UTF-8 character as a whole.


    Actually, it assumes each (Java) char is a (semantic) character. A Java char
    is 16 bits long, and Java Strings are internally stored in UTF-16, so a
    semantic character might be spread over two Java chars (32 bits).
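
    (A concrete illustration, in standard Java -- most Japanese text lives in the
    Basic Multilingual Plane and so takes one char per character, but characters
    outside the BMP get split across a surrogate pair:)

    <code>
    String clef = "\uD834\uDD1E";       // U+1D11E, MUSICAL SYMBOL G CLEF: one character...
    System.out.println(clef.length());  // ...but this prints 2, because length() counts Java chars
    </code>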

    >
    > My additional problem is that this is iAppli code, so I am limited to a 30K
    > codebase, and I have hit the limit, so I can't write any more lines of
    > code - I just have to change the existing code such that it doesn't
    > generate any more bytecode.


    Sounds rough. Can't really help you with this.

    >
    > What can I do to the above code so that I can count widths of UTF-8
    > characters instead of ASCII characters, without writing too much extra
    > code? I need existing Java library functions to do it for me, but I
    > don't know what that functionality is.


    See above. Since you're working with Unicode, you might want to use the
    Character.isWhitespace() method instead of the String.equals(" ") method. I
    believe the Japanese whitespace (the ideographic space, U+3000) has a
    different Unicode value than the ASCII space.
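
    (In J2SE that check looks like the sketch below -- U+3000 is the full-width
    "ideographic space" used in Japanese text. I don't know whether the CLDC
    Character class includes isWhitespace() at all, so treat this as a
    standard-Java sketch only:)

    <code>
    char c = str.charAt(offset);
    if (Character.isWhitespace(c)) {   // true for ' ', '\n', and also '\u3000' (ideographic space)
        isSpace = true;
    }
    </code>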

    - Oliver
    Oliver Wong, Jul 3, 2006
    #4
  5. stefoid

    stefoid Guest

    Re: strings - reading utf8 characters such as japanese. how?

    Good question. An iAppli is something like an applet, designed to go
    into a cut-down Java virtual machine to fit inside mobile devices. The
    available Java libraries are greatly restricted - I have java.lang.String
    and java.lang.Character to choose from (that relate to this problem), in
    addition to the 30K codebase limit, which I have reached - seriously, I
    am like 2 bytes off the maximum.

    This is the only part of the code where I have to recognize individual
    characters. Everything else just reads a string and outputs it to the
    screen, which works fine for UTF-8, because it's null terminated.




    Chris Uppal wrote:
    > stefoid wrote:
    >
    > > What can I do to the above code so that I can count widths of UTF-8
    > > characters instead of ASCII characters, without writing too much extra
    > > code? I need existing Java library functions to do it for me, but I
    > > don't know what that functionality is.

    >
    > Why are you working in UTF8 using Java Strings ? Indeed /how/ are you doing
    > it -- I would put it somewhere between impossible and dangerously difficult and
    > confusing.
    >
    > If you want to load your information into /text/, let Java decode the external
    > UTF-8 into Strings (of characters, already decoded as they are read in). If,
    > possibly for space reasons, you have to work in UTF-8 internally, then you'd be
    > far better off keeping the data in byte[] arrays.
    >
    > -- chris
    stefoid, Jul 4, 2006
    #5
  6. stefoid

    Chris Uppal Guest

    Re: strings - reading utf8 characters such as japanese. how?

    [reordered to remove top-posting]

    stefoid wrote:

    [me:]
    > > Why are you working in UTF8 using Java Strings ? Indeed /how/ are you
    > > doing it -- I would put it somewhere between impossible and dangerously
    > > difficult and confusing.
    > >

    > Good question. An iAppli is something like an applet, designed to go
    > into a cut-down Java virtual machine to fit inside mobile devices. The
    > available Java libraries are greatly restricted - I have java.lang.String
    > and java.lang.Character to choose from (that relate to this problem), in
    > addition to the 30K codebase limit, which I have reached - seriously, I
    > am like 2 bytes off the maximum.
    >
    > This is the only part of the code where I have to recognize individual
    > characters. Everything else just reads a string and outputs it to the
    > screen, which works fine for UTF-8, because it's null terminated.


    But you haven't really answered my question. I'll try again:

    Are you saying that your iAppli doesn't support byte[] arrays ? I find that
    impossible to believe.

    Are you handling your UTF-8 data as binary (in byte[] arrays) or are you
    somehow stuffing UTF-8 encoded data into Java Strings ? If the latter then
    (a) why ? and (b) how ?

    When you read your data in, why don't you use the Java-provided stuff to decode
    the UTF-8 into native (decoded) Java Strings ? I could understand that you
    might want to stick with UTF-8 encoded data for space reasons, but then it
    doesn't make sense that you'd put that data into Strings (16 bits per
    character), which would double the space requirement over byte[] arrays for the
    same data. (Unless you stuffed two bytes into each Java char -- which would be
    downright perverse ;-)

    Maybe this implementation lacks the character encoding stuff found everywhere in
    real Java ? If not then why are you not using it ? If it does, then I suspect
    you are hosed.

    -- chris
    Chris Uppal, Jul 4, 2006
    #6
  7. stefoid

    Oliver Wong Guest

    Re: strings - reading utf8 characters such as japanese. how?

    "stefoid" <> wrote in message
    news:...
    > Good question. An iAppli is something like an applet, designed to go
    > into a cut-down Java virtual machine to fit inside mobile devices. The
    > available Java libraries are greatly restricted - I have java.lang.String
    > and java.lang.Character to choose from (that relate to this problem).


    Maybe you should have mentioned this when you wrote

    <quote>
    I need existing Java library functions to do it for me, but I
    don't know what that functionality is.
    </quote>

    else you're wasting people's time coming up with solutions that won't
    solve your problem.

    > In
    > addition to the 30K codebase limit, which i have reached - seriously, I
    > am like 2 bytes off the maximum.
    >
    > this is the only part of the code where I have to recognize individual
    > characters. everything else is just read a string and output it to the
    > screen, which works fine for utf8, cos its null terminated.


    My concern right now is that you might not know what you're talking
    about. Where are you getting the string data from? What is the type of the
    parameter of that string data? Is it String? Byte[]? byte[]? Something else?

    What makes you believe it is UTF-8 encoded? What makes you think it's
    null terminated?

    I don't want to start explaining how to convert UTF-8 binary data
    stuffed into Java Strings into "normal" Java Strings, unless I'm sure that's
    what is necessary to solve your problem.

    - Oliver
    Oliver Wong, Jul 4, 2006
    #7
  8. stefoid

    stefoid Guest

    Re: strings - reading utf8 characters such as japanese. how?

    Yeah, you're right, sorry I didn't mention that. I think you're also
    right in that I don't have a firm grasp of Java strings, internal
    encoding, etc...

    This is the code that is used to read the UTF-8 text resources into
    strings:

    " dis = Connector.openDataInputStream(resourcePath);
    text = new byte[bytes];
    dis.readFully(text, 0, bytes);
    dis.close();
    return new String(text); "

    I didn't write it, but I wrote the code that uses the strings, and since
    the strings passed to my stuff seemed to print OK, I was happy to
    ignore where they came from. Now that guy has gone, the strings
    are in Japanese, and the problems begin.

    Actually, I have rewritten the code that truncates the strings and
    solved my original problem. It's very inefficient, but it uses fewer
    lines of code than the original and still works, so I save bytes of
    code, which is a godsend.

    However, I have noticed another problem - the start of every UTF-8
    encoded string resource starts with an unwanted 'dot' character which
    does not appear in the original text files (whether it has passed
    through my truncating code or not - it still happens). I have tracked
    this down to (I think) the fact that Java uses a modified UTF-8 encoding
    scheme, and the text files I am inputting are generated with Word, which
    will be writing them in normal UTF-8. I assume that's the problem,
    anyway. I have yet to work out how to fix it. I am looking for a
    conversion program that will convert the UTF-8 text files to modified
    UTF-8 format... seems easiest, and it preserves precious bytes of code.

    any help appreciated.

    Oliver Wong wrote:
    > "stefoid" <> wrote in message
    > news:...
    > > Good question. An iAppli is something like an applet, designed to go
    > > into a cut-down Java virtual machine to fit inside mobile devices. The
    > > available Java libraries are greatly restricted - I have java.lang.String
    > > and java.lang.Character to choose from (that relate to this problem).

    >
    > Maybe you should have mentioned this when you wrote
    >
    > <quote>
    > I need existing Java library functions to do it for me, but I
    > don't know what that functionality is.
    > </quote>
    >
    > else you're wasting people's time coming up with solutions that won't
    > solve your problem.
    >
    > > In
    > > addition to the 30K codebase limit, which i have reached - seriously, I
    > > am like 2 bytes off the maximum.
    > >
    > > this is the only part of the code where I have to recognize individual
    > > characters. everything else is just read a string and output it to the
    > > screen, which works fine for utf8, cos its null terminated.

    >
    > My concern right now is that you might not know what you're talking
    > about. Where are you getting the string data from? What is the type of the
    > parameter of that string data? Is it String? Byte[]? byte[]? Something else?
    >
    > What makes you believe it is UTF-8 encoded? What makes you think it's
    > null terminated?
    >
    > I don't want to start explaining how to convert UTF-8 binary data
    > stuffed into Java Strings into "normal" Java Strings, unless I'm sure that's
    > what is necessary to solve your problem.
    >
    > - Oliver
    stefoid, Jul 5, 2006
    #8
  9. stefoid

    stefoid Guest

    Re: strings - reading utf8 characters such as japanese. how?

    I should add, here is what the CLDC has available (the cut-down Java for
    wireless devices and PDAs):

    java.io:
    Interfaces
    --------
    DataInput
    DataOutput

    Classes
    -------
    ByteArrayInputStream
    ByteArrayOutputStream
    DataInputStream
    DataOutputStream
    InputStream
    InputStreamReader
    OutputStream
    OutputStreamWriter
    PrintStream
    Reader
    Writer


    java.lang:
    Classes
    ---------
    Boolean
    Byte
    Character
    Class
    Double
    Float
    Integer
    Long
    Math
    Object
    Runtime
    Short
    String
    StringBuffer
    System
    Thread
    Throwable

    and something called the microedition connector API (javax.microedition.io):

    Interfaces
    ---------
    Connection
    ContentConnection
    Datagram
    DatagramConnection
    InputConnection
    OutputConnection
    StreamConnection
    StreamConnectionNotifier
    Classes
    ----------
    Connector
    stefoid, Jul 5, 2006
    #9
  10. stefoid

    Oliver Wong Guest

    Re: strings - reading utf8 characters such as japanese. how?

    "stefoid" <> wrote in message
    news:...
    > This is the code that is used to read the utf8 text resources into
    > strings:
    >
    > " dis = Connector.openDataInputStream(resourcePath);
    > text = new byte[bytes];
    > dis.readFully(text, 0, bytes);
    > dis.close();
    > return new String(text); "
    >
    > I didnt write it, but I wrote the code that uses the strings, and since
    > the strings passed to my stuff seemed to print OK, I was happy to
    > ignore where they came from. Now that guy has gone, and the strings
    > are in japanese and problems begin.


    The problem is that you're using the default encoding instead of
    specifying the encoding to be UTF-8.

    >
    > Actually I have re-written the code that truncates the strings and
    > solved my original problem. Its very inefficient, but it uses less
    > lines of code than the original and still works, so I save bytes of
    > code which is a godsend.


    I don't know if it's relevant, but I haven't seen "the code that
    truncates the string".

    >
    > However, I have noticed another problem - the start of every utf8
    > encoded string resource starts with an unwanted 'dot' character which
    > does not appear in the original text files. (whether it has passed
    > through my truncating code or not - it still happens) I have tracked
    > this down to (I think) the fact that java uses a modified utf8 encoding
    > scheme, and the text files I am inputting are generated with Word which
    > will be writing them in normal utf8. I assume thats the problem,
    > anyway. I have yet to work out how to fix it. I am looking for a
    > convert program that will convert the the utf8 text files to modified
    > utf8 format .. seems easiest and preserves precious bytes of code.


    UTF-8 encoded files sometimes have a byte-order mark (BOM) at the
    beginning. Incidentally, Java doesn't use UTF-8 internally; it uses UTF-16
    (the "modified UTF-8" you've read about only comes into play for things
    like DataOutput.writeUTF() and class files). The two formats are
    significantly different. I think if you use a reader and specify the
    encoding as UTF-8, it may take care of handling the BOM for you; if it
    doesn't, the leading character is easy to strip off yourself.
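
    (If the decoder ends up handing the BOM through to you as a leading U+FEFF
    character -- some UTF-8 decoders do exactly that -- it's cheap to drop it
    yourself after decoding. A rough sketch, reusing your text byte array:)

    <code>
    String s = new String(text, "UTF-8");
    if (s.length() > 0 && s.charAt(0) == '\uFEFF') {
        s = s.substring(1);   // strip a leading byte-order mark
    }
    </code>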

    >
    > any help appreciated.
    >
    > Oliver Wong wrote:
    >> "stefoid" <> wrote in message
    >> news:...
    >> > Good question. An iAppli is something like an applet, designed to go
    >> > into a cut-down Java virtual machine to fit inside mobile devices. The
    >> > available Java libraries are greatly restricted - I have java.lang.String
    >> > and java.lang.Character to choose from (that relate to this problem).

    >>
    >> Maybe you should have mentioned this when you wrote
    >>
    >> <quote>
    >> I need existing Java library functions to do it for me, but I
    >> don't know what that functionality is.
    >> </quote>
    >>
    >> else you're wasting people's time coming up with solutions that
    >> won't
    >> solve your problem.
    >>
    >> > In
    >> > addition to the 30K codebase limit, which i have reached - seriously, I
    >> > am like 2 bytes off the maximum.
    >> >
    >> > this is the only part of the code where I have to recognize individual
    >> > characters. everything else is just read a string and output it to the
    >> > screen, which works fine for utf8, cos its null terminated.

    >>
    >> My concern right now is that you might not know what you're talking
    >> about. Where are you getting the string data from? What is the type of
    >> the
    >> parameter of that string data? Is it String? Byte[]? byte[]? Something
    >> else?
    >>
    >> What makes you believe it is UTF-8 encoded? What makes you think it's
    >> null terminated?
    >>
    >> I don't want to start explaining how to convert UTF-8 binary data
    >> stuffed into Java Strings into "normal" Java Strings, unless I'm sure
    >> that's
    >> what is necessary to solve your problem.
    >>
    >> - Oliver

    >


    "stefoid" <> wrote in message
    news:...
    > I should add, here is what the CLDC has available (the cut-down Java for
    > wireless devices and PDAs):

    [most of it snipped]
    >
    > InputStreamReader


    Right, so after you get your DataInputStream, you should wrap it in an
    InputStreamReader. I don't know if the constructors on CLDC are the same as
    in JavaSE, but in JavaSE, it'd look like this:

    <code>
    InputStream is = Connector.openDataInputStream(resourcePath); // or however you get your stream
    InputStreamReader isr = new InputStreamReader(is, "UTF-8");
    </code>

    From there, you use the isr.read() method to read one character at a time
    (note that a character is a 16-bit value, and not an 8-bit value). If
    read() returns -1, that means it has reached the end of the stream.

    Normally, in JavaSE, you'd also wrap your InputStreamReader into a
    BufferedReader. In addition to improving performance via buffering,
    BufferedReader also provides a convenience method readLine() which will
    return a whole line of text to you, instead of only 1 character at a time.
    Unfortunately, BufferedReader wasn't in the list of classes you provided, so
    you might have to construct the string manually from the individual
    characters.
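
    Putting those pieces together, a rough sketch of reading the whole resource
    into a String -- assuming the CLDC InputStreamReader constructor accepts an
    encoding name the way the J2SE one does:

    <code>
    Reader reader = new InputStreamReader(
            Connector.openInputStream(resourcePath), "UTF-8");
    StringBuffer sb = new StringBuffer();
    int c;
    while ((c = reader.read()) != -1) {   // read() returns one decoded char, or -1 at end of stream
        sb.append((char) c);
    }
    reader.close();
    return sb.toString();                 // decoded text, ready for width measurement
    </code>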

    - Oliver
    Oliver Wong, Jul 5, 2006
    #10
  11. stefoid

    Oliver Wong Guest

    Re: strings - reading utf8 characters such as japanese. how?

    "Oliver Wong" <> wrote in message
    news:q3Qqg.130953$771.26288@edtnps89...
    >
    > I don't know if it's relevant, but I haven't seen "the code that
    > truncates the string".


    Cancel that. I just realized that you're referring to the code in your
    first post, where you play around with fonts and string widths.

    - Oliver
    Oliver Wong, Jul 5, 2006
    #11
  12. stefoid

    Oliver Wong Guest

    Re: strings - reading utf8 characters such as japanese. how?

    "Oliver Wong" <> wrote in message
    news:q3Qqg.130953$771.26288@edtnps89...
    > "stefoid" <> wrote in message
    > news:...
    >> This is the code that is used to read the utf8 text resources into
    >> strings:
    >>
    >> " dis = Connector.openDataInputStream(resourcePath);
    >> text = new byte[bytes];
    >> dis.readFully(text, 0, bytes);
    >> dis.close();
    >> return new String(text); "
    >>


    Actually, I took another look at the String API. Again, this is from
    J2SE, so I don't know if it'll work for you, but apparently you can specify
    the charset to use in the String constructor as well. So you might be able
    to replace the last line with:

    return new String(text, "UTF-8");

    - Oliver
    Oliver Wong, Jul 5, 2006
    #12
  13. stefoid

    stefoid Guest

    Re: strings - reading utf8 characters such as japanese. how?

    Thanks Oliver.

    I did find some example code somewhere that suggested using a reader
    and specifying "UTF-8". I tried that, and it didn't make any difference
    - I still get the weird character at the start of every string.

    I think it makes sense that there could be something weird at the start
    of the text file. I may have to get a hex editor onto it. I printed
    out the hex bytes I obtained from the string in the code and it looks
    like UTF-8 to me (roughly).



    Oliver Wong wrote:
    > "Oliver Wong" <> wrote in message
    > news:q3Qqg.130953$771.26288@edtnps89...
    > > "stefoid" <> wrote in message
    > > news:...
    > >> This is the code that is used to read the utf8 text resources into
    > >> strings:
    > >>
    > >> " dis = Connector.openDataInputStream(resourcePath);
    > >> text = new byte[bytes];
    > >> dis.readFully(text, 0, bytes);
    > >> dis.close();
    > >> return new String(text); "
    > >>

    >
    > Actually, I took another look at the String API. Again, this is from
    > J2SE, so I don't know if it'll work for you, but apparently you can specify
    > the charset to use in the String constructor as well. So you might be able
    > to replace the last line with:
    >
    > return new String(text, "UTF-8");
    >
    > - Oliver
    stefoid, Jul 6, 2006
    #13
  14. stefoid

    Chris Uppal Guest

    Re: strings - reading utf8 characters such as japanese. how?

    stefoid wrote:

    > I think it makes sense that there could be something weird at the start
    > of the text file. I may have to get a hex editor onto it. I printed
    > out the hex bytes I obtained from the string in the code and it looks
    > like UTF-8 to me (roughly).


    Can you post the byte values ?

    It could be a BOM (Byte Order Mark) -- they are not recommended for use with
    8-bit encodings like UTF-8, but some software adds one to the beginning of each
    file anyway.

    If it is a BOM, U+FEFF, then the first three bytes of the UTF-8 file will be

    0xEF 0xBB 0xBF

    That's the bytes of the /file/, not whatever ends up in Java after it's been
    decoded.

    If it is a BOM, then the easiest thing to do is just ignore it.
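
    (Concretely, "ignore it" can be done at the byte level, before decoding -- a
    rough sketch, assuming your platform has the String(byte[], int, int, String)
    constructor:)

    <code>
    int start = 0;
    if (text.length >= 3
            && (text[0] & 0xFF) == 0xEF
            && (text[1] & 0xFF) == 0xBB
            && (text[2] & 0xFF) == 0xBF) {
        start = 3;   // skip the UTF-8 BOM
    }
    return new String(text, start, text.length - start, "UTF-8");
    </code>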

    -- chris
    Chris Uppal, Jul 6, 2006
    #14
  15. stefoid

    ddimitrov Guest

    Re: strings - reading utf8 characters such as japanese. how?

    I haven't done mobile Java for a long time, but as far as I remember
    the encoding for iApplis is Shift_JIS. Internally Java still uses a
    Unicode representation, but you have to make sure that all your
    resources are encoded in Shift_JIS, and you might have to specify the
    proper encoding when you read and write the strings to the scratchpad.
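
    (If that does turn out to be the case, it's the same decode call with a
    different encoding name -- the exact name the platform accepts varies, so
    "SJIS" / "Shift_JIS" here is an assumption to check against your handset's
    documentation:)

    <code>
    return new String(text, "SJIS");   // or "Shift_JIS", whichever name the platform recognizes
    </code>
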
    ddimitrov, Jul 6, 2006
    #15
