Converting Unicode to UTF-8

peter10

Hi everybody,

I would like to convert unicode text (coming from a swing JTextPane -
I think that is unicode by default!?) to UTF-8. I tried the code
underneath, but the xml-database I am using still complains about
wrong characters (error message: "Invalid byte 2 of 3-byte UTF-8
sequence").

ByteArrayOutputStream out = new ByteArrayOutputStream();
DataOutputStream dataOut = new DataOutputStream(out);
dataOut.writeUTF(text_input);
String text_output = out.toString("UTF-8");

Can anybody tell me what the mistake is that I am making???

Thanks a lot for your help!

Peter
 
Steve Horsley

peter10 said:
Hi everybody,

I would like to convert unicode text (coming from a swing JTextPane -
I think that is unicode by default!?) to UTF-8. I tried the code
underneath, but the xml-database I am using still complains about
wrong characters (error message: "Invalid byte 2 of 3-byte UTF-8
sequence").

ByteArrayOutputStream out = new ByteArrayOutputStream();
DataOutputStream dataOut = new DataOutputStream(out);
dataOut.writeUTF(text_input);
String text_output = out.toString("UTF-8");

Can anybody tell me what the mistake is that I am making???

Thanks a lot for your help!

Peter

Look closely at the docs for writeUTF and you will find that it
also writes a 2-byte binary length indicator at the front. I guess
this is the problem. I suggest that you use an OutputStreamWriter
instead, like this:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(baos);
out.write(text_input);
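
The length indicator is easy to see by dumping the bytes. The following is only a minimal sketch, assuming Java 8 or later; the class name, the helper methods and the sample string are made up for illustration and are not from the original post.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class WriteUtfPrefixDemo {

    // Returns exactly what DataOutputStream.writeUTF puts on the stream.
    static byte[] writeUtfBytes(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream dataOut = new DataOutputStream(out)) {
            dataOut.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Interprets the first two bytes as an unsigned big-endian 16-bit length.
    static int lengthPrefix(byte[] writeUtfOutput) {
        return ((writeUtfOutput[0] & 0xFF) << 8) | (writeUtfOutput[1] & 0xFF);
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // hypothetical sample text
        byte[] framed = writeUtfBytes(text);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println("writeUTF bytes: " + framed.length);
        System.out.println("UTF-8 bytes:    " + utf8.length);
        System.out.println("length prefix:  " + lengthPrefix(framed));
    }
}
```

The two extra leading bytes are what a UTF-8 decoder on the receiving side would trip over.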

Steve
 
Boudewijn Dijkstra

peter10 said:
Hi everybody,

I would like to convert unicode text (coming from a swing JTextPane -
I think that is unicode by default!?) to UTF-8.

The U in UTF stands for 'Unicode', so you want to convert Unicode to Unicode.
More precisely, what you want is to encode Unicode characters as UTF-8 bytes.
 
Chris Uppal

peter10 said:
ByteArrayOutputStream out = new ByteArrayOutputStream();
DataOutputStream dataOut = new DataOutputStream(out);
dataOut.writeUTF(text_input);

The first problem here is that writeUTF() does /NOT/ write UTF-8. It's an
incredibly, unbelievably, stupidly, misleadingly-named method. What it does is
write a two-byte length (as Steve has already mentioned), giving the number of
bytes that follow, and then the string in "modified UTF-8": a format that is
closely related to UTF-8 but not interchangeable with it, since U+0000 and
characters outside the Basic Multilingual Plane are encoded differently.
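
To make the difference concrete, here is a rough sketch that compares the bytes writeUTF() emits (after its two-byte length) with what the real UTF-8 charset produces. It assumes Java 8 or later; the class and helper names are invented for this example, and the two sample strings (a NUL character and a character outside the Basic Multilingual Plane) are chosen because those are the cases where the formats are documented to differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {

    // writeUTF output with the two-byte length prefix stripped off.
    static byte[] writeUtfPayload(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream dataOut = new DataOutputStream(out)) {
            dataOut.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = out.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\u0000";             // U+0000
        String supplementary = "\uD83D\uDE00"; // U+1F600 as a surrogate pair

        for (String s : new String[] { nul, supplementary }) {
            System.out.println("writeUTF payload: " + hex(writeUtfPayload(s)));
            System.out.println("real UTF-8:       "
                    + hex(s.getBytes(StandardCharsets.UTF_8)));
        }
    }
}
```

For plain ASCII and most other BMP text the two columns come out identical, which is part of why the method name is so easy to trust.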

UTF-8 is a way of taking a stream/string of Unicode characters (and Java
Strings can be viewed as such, although the correspondence is not as close as
it looks), and representing them as bytes in a binary stream or similar. In
Java that conversion is ultimately provided by a "charset", specifically the
one named "UTF-8". Probably the easiest way for you to use that would be
either to ask your String for its bytes:
aString.getBytes("UTF-8");
or to use an OutputStreamWriter constructed with a 'charsetname' of "UTF-8".
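
Both routes can be sketched side by side as follows. This is only an illustration, assuming Java 8 or later (it uses the StandardCharsets constants rather than the "UTF-8" name string); the class name, method names and sample text are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8EncodeDemo {

    // Route 1: ask the String for its bytes in a named charset.
    static byte[] viaGetBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Route 2: push the characters through a Writer bound to UTF-8.
    static byte[] viaWriter(String text) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // Closing the writer flushes its internal buffer into baos.
        try (Writer out = new OutputStreamWriter(baos, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) {
        String text = "price: 5 \u20ac"; // hypothetical sample with a euro sign
        byte[] a = viaGetBytes(text);
        byte[] b = viaWriter(text);
        System.out.println("same bytes: " + Arrays.equals(a, b));
    }
}
```

getBytes() is the shorter choice when the whole text is already in memory; the writer is the better fit when the bytes go straight to a file or socket.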

-- chris
 
Chris Smith

Steve Horsley said:
Look closely at the docs for writeUTF and you will find that it
also writes a 2-byte binary length indicator at the front. I guess
this is the problem. I suggest that you use an OutputStreamWriter
instead, like this:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(baos);
out.write(text_input);

Since UTF-8 was explicitly requested, that should be:

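ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(baos, "UTF-8");
out.write(text_input);
out.close(); // flushes the writer's buffered bytes into baos
byte[] utf8 = baos.toByteArray(); // the UTF-8 encoded bytes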

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
peter10

Hallo!

Thanks to your code-snippet and with the getEncoding()-method of the
OutputStreamWriter I found out that the encoding that is apparently
being used inside the JTextPane is "Cp1252".

ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(baos);
out.write(input_string);
String encoding = out.getEncoding();

Now I have two - maybe stupid - questions:
1) How is that possible if the sun-documentation about Documents (used
in JTextPanes) reads as follows:

"To support internationalization, the Swing text model uses unicode
characters..." ???

2) how do I get a String out of the OutputStreamWriter as there is no
getText() method available?

Thanks for any help!

Peter
 
Boudewijn Dijkstra

peter10 said:
Hallo!

Thanks to your code-snippet and with the getEncoding()-method of the
OutputStreamWriter I found out that the encoding that is apparently
being used inside the JTextPane is "Cp1252".

ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(baos);
out.write(input_string);
String encoding = out.getEncoding();

Now I have two - maybe stupid - questions:
1) How is that possible if the sun-documentation about Documents (used
in JTextPanes) reads as follows:

"To support internationalization, the Swing text model uses unicode
characters..." ???

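OutputStreamWriter by default uses the *platform* default encoding, so
getEncoding() is reporting your platform's setting (Cp1252), not anything about
Swing. The text you get from the JTextPane is a Java String, which is Unicode.
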
2) how do I get a String out of the OutputStreamWriter as there is no
getText() method available?

If you decoded the bytes back into a String, you'd just get the original
input_string back. I recommend you use input_string.getBytes("UTF-8") instead,
which gives you the UTF-8 bytes directly as a byte[].
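
A small sketch of that point: the encoding only matters at the moment characters become bytes (or bytes become characters again). It assumes Java 7 or later; the class and method names are invented for the example, and ISO-8859-1 is used purely as a stand-in for a single-byte platform default such as Cp1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingBoundaryDemo {

    // Characters -> bytes, in an explicitly chosen charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Bytes -> characters; only correct if charset matches the one used to encode.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String input = "caf\u00e9"; // hypothetical text taken from a text component

        // What a no-argument OutputStreamWriter would pick up on this machine.
        System.out.println("platform default: " + Charset.defaultCharset());

        byte[] utf8 = encode(input, StandardCharsets.UTF_8);
        byte[] latin1 = encode(input, StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8 length:      " + utf8.length);
        System.out.println("ISO-8859-1 length: " + latin1.length);

        // Decoding with the matching charset returns the original String.
        System.out.println("round trip ok:  "
                + decode(utf8, StandardCharsets.UTF_8).equals(input));
        // Decoding with the wrong charset silently produces different text.
        System.out.println("wrong charset:  "
                + decode(utf8, StandardCharsets.ISO_8859_1).equals(input));
    }
}
```

The byte[] is the thing to hand to the database or to write to a stream; turning it back into a String before sending it only reintroduces the question of which encoding gets applied next.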
 
