strings - reading UTF-8 characters such as Japanese. how?


stefoid

Hi. I've got a problem. I have some code that takes a text file and
breaks it into an array of substrings for displaying the text truncated
to fit the screen width on word boundaries.

It just looks for the spaces.

Trouble is, it crashes out with Japanese text. There is a part of the
code that looks at the next character to see if it is a space:

ch = str.substring(offset, offset + 1);
isSpace = false;

// return when a new line is reached
if (ch.equals("\n"))
return offset+1;

currentWidth += font.stringWidth(ch);

if (ch.equals(" "))
isSpace = true;

and if it isn't a space, it adds the width of the character (in pixels)
and keeps going until it does find a space.

The problem with this is it assumes that each byte is a character. In
UTF-8, up to 3 bytes could be one character, so this code is trying to
find the widths of characters representing each byte in a UTF-8
sequence, rather than the width of the UTF-8 character as a whole.

My additional problem is this is iAppli code, so I am limited to a 30K
codebase, and I have hit the limit, so I can't write any more lines of
code - I just have to change the existing code such that it doesn't
generate any more bytecode.

What can I do to the above code so that I can count widths of UTF-8
characters instead of ASCII characters, without writing too much extra
code? I need existing Java library functions to do it for me, but I
don't know what that functionality is.
 

Damian Driscoll

stefoid said:
Hi. I've got a problem. I have some code that takes a text file and
breaks it into an array of substrings for displaying the text truncated
to fit the screen width on word boundaries.

It just looks for the spaces.

Trouble is, it crashes out with Japanese text. There is a part of the
code that looks at the next character to see if it is a space:

ch = str.substring(offset, offset + 1);
isSpace = false;

// return when a new line is reached
if (ch.equals("\n"))
return offset+1;

currentWidth += font.stringWidth(ch);

if (ch.equals(" "))
isSpace = true;

and if it isn't a space, it adds the width of the character (in pixels)
and keeps going until it does find a space.

The problem with this is it assumes that each byte is a character. In
UTF-8, up to 3 bytes could be one character, so this code is trying to
find the widths of characters representing each byte in a UTF-8
sequence, rather than the width of the UTF-8 character as a whole.

My additional problem is this is iAppli code, so I am limited to a 30K
codebase, and I have hit the limit, so I can't write any more lines of
code - I just have to change the existing code such that it doesn't
generate any more bytecode.

What can I do to the above code so that I can count widths of UTF-8
characters instead of ASCII characters, without writing too much extra
code? I need existing Java library functions to do it for me, but I
don't know what that functionality is.

have a look at:
http://javaalmanac.com/egs/java.nio.charset/ConvertChar.html
 

Chris Uppal

stefoid said:
What can I do to the above code so that I can count widths of UTF-8
characters instead of ASCII characters, without writing too much extra
code? I need existing Java library functions to do it for me, but I
don't know what that functionality is.

Why are you working in UTF-8 using Java Strings? Indeed /how/ are you doing
it -- I would put it somewhere between impossible and dangerously difficult
and confusing.

If you want to load your information into /text/, let Java decode the external
UTF-8 into Strings (of characters, already decoded as they are read in). If,
possibly for space reasons, you have to work in UTF-8 internally, then you'd be
far better off keeping the data in byte[] arrays.
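
For instance, in full Java the decode step is a one-liner (a minimal untested
sketch; whether your cut-down VM has the two-argument String constructor is
something you'd have to check):

<code>
public class Utf8Decode {
    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        // Raw UTF-8 bytes as they would arrive from a file:
        // "日本" is three bytes per character in UTF-8.
        byte[] raw = { (byte) 0xE6, (byte) 0x97, (byte) 0xA5,
                       (byte) 0xE6, (byte) 0x9C, (byte) 0xAC };

        // Let Java decode the bytes into a String of real characters...
        String decoded = new String(raw, "UTF-8");

        // ...after which substring()/charAt() work on characters, not bytes.
        System.out.println(decoded.length()); // prints 2
    }
}
</code>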

-- chris
 

Oliver Wong

stefoid said:
Hi. I've got a problem. I have some code that takes a text file and
breaks it into an array of substrings for displaying the text truncated
to fit the screen width on word boundaries.

It just looks for the spaces.

Trouble is, it crashes out with Japanese text. There is a part of the
code that looks at the next character to see if it is a space:

ch = str.substring(offset, offset + 1);
isSpace = false;

// return when a new line is reached
if (ch.equals("\n"))
return offset+1;

currentWidth += font.stringWidth(ch);

if (ch.equals(" "))
isSpace = true;

and if it isn't a space, it adds the width of the character (in pixels)
and keeps going until it does find a space.

How about something like:

<pseudoCode>
StringTokenizer st = new StringTokenizer(str, " \n", true);
int offset = 0;
while (st.hasMoreTokens()) {
    String token = st.nextToken();
    if (token.equals(" ")) {
        /* do whatever you gotta do with spaces here. */
        offset++;
    } else if (token.equals("\n")) {
        return offset + 1; // as in the original: the position just past the newline
    } else {
        currentWidth += font.stringWidth(token); // accumulate, don't overwrite
        offset += token.length();
    }
}
</pseudoCode>

You'll avoid breaking up the string into its individual codepoints,
potentially splitting a character in two.

stefoid said:
The problem with this is it assumes that each byte is a character. In
UTF-8, up to 3 bytes could be one character, so this code is trying to
find the widths of characters representing each byte in a UTF-8
sequence, rather than the width of the UTF-8 character as a whole.

Actually, it assumes each (Java) char is a (semantic) character. A Java char
is 16 bits long, and Java Strings are internally stored in UTF-16, so a
semantic character might be spread over two Java chars (32 bits).
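
If you ever do need to walk a String without splitting such surrogate pairs,
J2SE 5.0 added code-point methods for it (a sketch; these almost certainly
don't exist on a cut-down mobile VM):

<code>
public class CodePointWalk {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) occupies two Java chars.
        String s = "A\uD834\uDD1EB";

        // Advance one *code point* at a time, never splitting a pair.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
            i += Character.charCount(cp); // steps by 1 or 2 chars
        }
    }
}
</code>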

stefoid said:
My additional problem is this is iAppli code, so I am limited to a 30K
codebase, and I have hit the limit, so I can't write any more lines of
code - I just have to change the existing code such that it doesn't
generate any more bytecode.

Sounds rough. Can't really help you with this.

stefoid said:
What can I do to the above code so that I can count widths of UTF-8
characters instead of ASCII characters, without writing too much extra
code? I need existing Java library functions to do it for me, but I
don't know what that functionality is.

See above. Since you're working with Unicode, you might want to use the
Character.isWhitespace() method instead of the String.equals(" ") method. I
believe the Japanese (ideographic) space has a different Unicode value,
U+3000, than the ASCII space.
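
Something like this (a sketch; I don't know whether the CLDC Character class
actually includes isWhitespace, so check before relying on it):

<code>
public class SpaceCheck {
    public static void main(String[] args) {
        char ascii = ' ';            // U+0020, the ASCII space
        char ideographic = '\u3000'; // IDEOGRAPHIC SPACE, common in Japanese text

        // equals(" ") only matches U+0020; isWhitespace catches both.
        System.out.println(Character.isWhitespace(ascii));       // true
        System.out.println(Character.isWhitespace(ideographic)); // true
    }
}
</code>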

- Oliver
 

stefoid

Good question. An iAppli is something like an applet, designed to go
into a cut-down Java virtual machine to fit inside mobile devices. The
available Java libraries are greatly restricted - I have java.lang.String
and java.lang.Character to choose from (that relate to this problem). In
addition to the 30K codebase limit, which I have reached - seriously, I
am like 2 bytes off the maximum.

This is the only part of the code where I have to recognize individual
characters. Everything else is just read a string and output it to the
screen, which works fine for UTF-8, 'cos it's null terminated.




Chris said:
stefoid said:
What can I do to the above code so that I can count widths of UTF-8
characters instead of ASCII characters, without writing too much extra
code? I need existing Java library functions to do it for me, but I
don't know what that functionality is.

Why are you working in UTF-8 using Java Strings? Indeed /how/ are you doing
it -- I would put it somewhere between impossible and dangerously difficult
and confusing.

If you want to load your information into /text/, let Java decode the external
UTF-8 into Strings (of characters, already decoded as they are read in). If,
possibly for space reasons, you have to work in UTF-8 internally, then you'd be
far better off keeping the data in byte[] arrays.

-- chris
 

Chris Uppal

[reordered to remove top-posting]

stefoid wrote:

[me:]
Good question. An iAppli is something like an applet, designed to go
into a cut-down Java virtual machine to fit inside mobile devices. The
available Java libraries are greatly restricted - I have java.lang.String
and java.lang.Character to choose from (that relate to this problem). In
addition to the 30K codebase limit, which I have reached - seriously, I
am like 2 bytes off the maximum.

This is the only part of the code where I have to recognize individual
characters. Everything else is just read a string and output it to the
screen, which works fine for UTF-8, 'cos it's null terminated.

But you haven't really answered my question. I'll try again:

Are you saying that your iAppli doesn't support byte[] arrays? I find that
impossible to believe.

Are you handling your UTF-8 data as binary (in byte[] arrays) or are you
somehow stuffing UTF-8 encoded data into Java Strings? If the latter then
(a) why? and (b) how?

When you read your data in, why don't you use the Java-provided stuff to decode
the UTF-8 into native (decoded) Java Strings? I could understand that you
might want to stick with UTF-8 encoded data for space reasons, but then it
doesn't make sense that you'd put that data into Strings (16 bits per
character), which would double the space requirement over byte[] arrays for the
same data. (Unless you stuffed two bytes into each Java char -- which would be
downright perverse ;-)

Maybe this implementation lacks the character encoding stuff found everywhere
in real Java? If not then why are you not using it? If it does, then I suspect
you are hosed.

-- chris
 

Oliver Wong

stefoid said:
Good question. An iAppli is something like an applet, designed to go
into a cut-down Java virtual machine to fit inside mobile devices. The
available Java libraries are greatly restricted - I have java.lang.String
and java.lang.Character to choose from (that relate to this problem).

Maybe you should have mentioned this when you wrote

<quote>
I need existing Java library functions to do it for me, but I
don't know what that functionality is.
</quote>

else you're wasting people's time coming up with solutions that won't
solve your problem.

stefoid said:
In addition to the 30K codebase limit, which I have reached - seriously, I
am like 2 bytes off the maximum.

This is the only part of the code where I have to recognize individual
characters. Everything else is just read a string and output it to the
screen, which works fine for UTF-8, 'cos it's null terminated.

My concern right now is that you might not know what you're talking
about. Where are you getting the string data from? What is the type of the
parameter of that string data? Is it String? Byte[]? byte[]? Something else?

What makes you believe it is UTF-8 encoded? What makes you think it's
null terminated?

I don't want to start explaining how to convert UTF-8 binary data
stuffed into Java Strings into "normal" Java Strings, unless I'm sure that's
what is necessary to solve your problem.

- Oliver
 

stefoid

Yeah, you're right, sorry I didn't mention that. I think you're also
right in that I don't have a firm grasp of Java strings, internal
coding, etc...

This is the code that is used to read the UTF-8 text resources into
strings:

" dis = Connector.openDataInputStream(resourcePath);
text = new byte[bytes];
dis.readFully(text, 0, bytes);
dis.close();
return new String(text); "

I didn't write it, but I wrote the code that uses the strings, and since
the strings passed to my stuff seemed to print OK, I was happy to
ignore where they came from. Now that guy has gone, the strings
are in Japanese, and the problems begin.

Actually, I have re-written the code that truncates the strings and
solved my original problem. It's very inefficient, but it uses fewer
lines of code than the original and still works, so I save bytes of
code, which is a godsend.

However, I have noticed another problem - the start of every UTF-8
encoded string resource starts with an unwanted 'dot' character which
does not appear in the original text files (whether it has passed
through my truncating code or not - it still happens). I have tracked
this down to (I think) the fact that Java uses a modified UTF-8 encoding
scheme, and the text files I am inputting are generated with Word, which
will be writing them in normal UTF-8. I assume that's the problem,
anyway. I have yet to work out how to fix it. I am looking for a
conversion program that will convert the UTF-8 text files to modified
UTF-8 format... seems easiest and preserves precious bytes of code.

Any help appreciated.

Oliver said:
stefoid said:
Good question. An iAppli is something like an applet, designed to go
into a cut-down Java virtual machine to fit inside mobile devices. The
available Java libraries are greatly restricted - I have java.lang.String
and java.lang.Character to choose from (that relate to this problem).

Maybe you should have mentioned this when you wrote

<quote>
I need existing Java library functions to do it for me, but I
don't know what that functionality is.
</quote>

else you're wasting people's time coming up with solutions that won't
solve your problem.

stefoid said:
In addition to the 30K codebase limit, which I have reached - seriously, I
am like 2 bytes off the maximum.

This is the only part of the code where I have to recognize individual
characters. Everything else is just read a string and output it to the
screen, which works fine for UTF-8, 'cos it's null terminated.

My concern right now is that you might not know what you're talking
about. Where are you getting the string data from? What is the type of the
parameter of that string data? Is it String? Byte[]? byte[]? Something else?

What makes you believe it is UTF-8 encoded? What makes you think it's
null terminated?

I don't want to start explaining how to convert UTF-8 binary data
stuffed into Java Strings into "normal" Java Strings, unless I'm sure that's
what is necessary to solve your problem.

- Oliver
 

stefoid

I should add, here is what the CLDC has available (cut-down Java for
wireless devices and PDAs):

java.io:
Interfaces
--------
DataInput
DataOutput

Classes
-------
ByteArrayInputStream
ByteArrayOutputStream
DataInputStream
DataOutputStream
InputStream
InputStreamReader
OutputStream
OutputStreamWriter
PrintStream
Reader
Writer


java.lang:
Classes
---------
Boolean
Byte
Character
Class
Double
Float
Integer
Long
Math
Object
Runtime
Short
String
StringBuffer
System
Thread
Throwable

and something called microedition connectors API:

Interfaces
---------
Connection
ContentConnection
Datagram
DatagramConnection
InputConnection
OutputConnection
StreamConnection
StreamConnectionNotifier
Classes
 

Oliver Wong

stefoid said:
This is the code that is used to read the UTF-8 text resources into
strings:

" dis = Connector.openDataInputStream(resourcePath);
text = new byte[bytes];
dis.readFully(text, 0, bytes);
dis.close();
return new String(text); "

I didn't write it, but I wrote the code that uses the strings, and since
the strings passed to my stuff seemed to print OK, I was happy to
ignore where they came from. Now that guy has gone, the strings
are in Japanese, and the problems begin.

The problem is that you're using the default encoding instead of
specifying the encoding to be UTF-8.

stefoid said:
Actually, I have re-written the code that truncates the strings and
solved my original problem. It's very inefficient, but it uses fewer
lines of code than the original and still works, so I save bytes of
code, which is a godsend.

I don't know if it's relevant, but I haven't seen "the code that
truncates the string".

stefoid said:
However, I have noticed another problem - the start of every UTF-8
encoded string resource starts with an unwanted 'dot' character which
does not appear in the original text files (whether it has passed
through my truncating code or not - it still happens). I have tracked
this down to (I think) the fact that Java uses a modified UTF-8 encoding
scheme, and the text files I am inputting are generated with Word, which
will be writing them in normal UTF-8. I assume that's the problem,
anyway. I have yet to work out how to fix it. I am looking for a
conversion program that will convert the UTF-8 text files to modified
UTF-8 format... seems easiest and preserves precious bytes of code.

UTF-8 encoded files sometimes have a byte-order mark (BOM) at the
beginning. Incidentally, Java doesn't use UTF-8 internally; it uses (a
modified) UTF-16. The two formats are significantly different. I think if
you use a reader, and specify the encoding as UTF-8, it'll take care of
handling the BOM for you.

stefoid said:
Any help appreciated.

Oliver said:
stefoid said:
Good question. An iAppli is something like an applet, designed to go
into a cut-down Java virtual machine to fit inside mobile devices. The
available Java libraries are greatly restricted - I have java.lang.String
and java.lang.Character to choose from (that relate to this problem).

Maybe you should have mentioned this when you wrote

<quote>
I need existing Java library functions to do it for me, but I
don't know what that functionality is.
</quote>

else you're wasting people's time coming up with solutions that won't
solve your problem.

stefoid said:
In addition to the 30K codebase limit, which I have reached - seriously, I
am like 2 bytes off the maximum.

This is the only part of the code where I have to recognize individual
characters. Everything else is just read a string and output it to the
screen, which works fine for UTF-8, 'cos it's null terminated.

My concern right now is that you might not know what you're talking
about. Where are you getting the string data from? What is the type of the
parameter of that string data? Is it String? Byte[]? byte[]? Something else?

What makes you believe it is UTF-8 encoded? What makes you think it's
null terminated?

I don't want to start explaining how to convert UTF-8 binary data
stuffed into Java Strings into "normal" Java Strings, unless I'm sure that's
what is necessary to solve your problem.

- Oliver

stefoid said:
I should add, here is what the CLDC has available (cut-down Java for
wireless devices and PDAs) [most of it snipped]

InputStreamReader

Right, so after you get your DataInputStream, you should wrap it in an
InputStreamReader. I don't know if the constructors on CLDC are the same as
JavaSE, but in JavaSE, it'd look like this:

<code>
// Get your input stream somehow. In your case, it looks like:
InputStream is = Connector.openDataInputStream(resourcePath);
InputStreamReader isr = new InputStreamReader(is, "UTF-8");
</code>

From there, you use the isr.read() method to read 1 character at a time
(note that a character is a 16-bit value, and not an 8-bit value). If
read() returns -1, that means it reached the end of the stream.

Normally, in JavaSE, you'd also wrap your InputStreamReader into a
BufferedReader. In addition to improving performance via buffering,
BufferedReader also provides a convenience method readLine() which will
return a whole line of text to you, instead of only 1 character at a time.
Unfortunately, BufferedReader wasn't in the list of classes you provided, so
you might have to construct the string manually from the individual
characters.
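
Building the string by hand might look something like this (an untested
sketch assuming the CLDC constructors match J2SE):

<code>
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ReadAll {
    // Read an entire UTF-8 stream into a String, one char at a time.
    static String readAll(InputStream is) throws IOException {
        InputStreamReader isr = new InputStreamReader(is, "UTF-8");
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = isr.read()) != -1) { // -1 signals end of stream
            sb.append((char) c);
        }
        isr.close();
        return sb.toString();
    }
}
</code>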

- Oliver
 

Oliver Wong

Oliver Wong said:
I don't know if it's relevant, but I haven't seen "the code that
truncates the string".

Cancel that. I just realized that you're referring to the code in your
first post, where you play around with fonts and string widths.

- Oliver
 

Oliver Wong

Oliver Wong said:
stefoid said:
This is the code that is used to read the UTF-8 text resources into
strings:

" dis = Connector.openDataInputStream(resourcePath);
text = new byte[bytes];
dis.readFully(text, 0, bytes);
dis.close();
return new String(text); "

Actually, I took another look at the String API. Again, this is from
J2SE, so I don't know if it'll work for you, but apparently you can specify
the charset to use in the String constructor as well. So you might be able
to replace the last line with:

return new String(text, "UTF-8");

- Oliver
 

stefoid

Thanks Oliver.

I did find some example code somewhere that suggested using a reader
and specifying "UTF-8". I tried that, and it didn't make any difference
- I still get the weird character at the start of every string.

I think it makes sense that there could be something weird at the start
of the text file. I may have to get a hex editor onto it. I printed
out the hex bytes I obtained from the string in the code and it looks
like UTF-8 to me (roughly).
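
(Something along these lines, for anyone curious - a sketch of the idea
rather than my exact code:)

<code>
public class HexDump {
    // Print each byte of a buffer as two hex digits.
    static void dump(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            String hex = Integer.toHexString(data[i] & 0xFF);
            if (hex.length() == 1) hex = "0" + hex;
            System.out.print(hex + " ");
        }
        System.out.println();
    }

    public static void main(String[] args) {
        // A UTF-8 BOM followed by "Hi" dumps as: ef bb bf 48 69
        dump(new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x48, 0x69 });
    }
}
</code>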



Oliver said:
Oliver Wong said:
stefoid said:
This is the code that is used to read the UTF-8 text resources into
strings:

" dis = Connector.openDataInputStream(resourcePath);
text = new byte[bytes];
dis.readFully(text, 0, bytes);
dis.close();
return new String(text); "

Actually, I took another look at the String API. Again, this is from
J2SE, so I don't know if it'll work for you, but apparently you can specify
the charset to use in the String constructor as well. So you might be able
to replace the last line with:

return new String(text, "UTF-8");

- Oliver
 

Chris Uppal

stefoid said:
I think it makes sense that there could be something weird at the start
of the text file. I may have to get a hex editor onto it. I printed
out the hex bytes I obtained from the string in the code and it looks
like UTF-8 to me (roughly).

Can you post the byte values?

It could be a BOM (Byte Order Mark) -- they are not recommended for use with
8-bit encodings like UTF-8, but some software adds one to the beginning of each
file anyway.

If it is a BOM, U+FEFF, then the first three bytes of the UTF-8 file will be

0xEF 0xBB 0xBF

That's the bytes of the /file/, not whatever ends up in Java after it's been
decoded.

If it is a BOM, then the easiest thing to do is just ignore it.
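
Ignoring it can be as simple as this (a sketch, assuming the decoder has
already turned the three bytes into a single U+FEFF char at the front of
the String; if the file was decoded with the wrong encoding you'll see
garbage characters instead):

<code>
public class BomStrip {
    // Drop a leading byte order mark (decoded as U+FEFF), if present.
    static String stripBom(String s) {
        if (s.length() > 0 && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(stripBom("\uFEFFhello")); // prints "hello"
    }
}
</code>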

-- chris
 

ddimitrov

I haven't done mobile Java for a long time, but as far as I remember
the encoding for iApplis is ShiftJIS. Internally Java still uses a
Unicode representation, but you have to make sure that all your
resources are encoded in ShiftJIS, and you might have to specify the
proper encoding when you read and write the strings to the scratchpad.
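
If so, the decode step would be something like this (a sketch; the charset
name "SJIS" is a guess - check the handset documentation for what its VM
actually accepts):

<code>
public class SjisDecode {
    // Decode raw resource bytes as Shift-JIS rather than the VM default.
    static String decode(byte[] rawBytes) throws java.io.UnsupportedEncodingException {
        return new String(rawBytes, "SJIS");
    }
}
</code>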
 
