How do I read and write a file using UTF-8?

Stryder

Hi. How do I read and write a file using UTF8?

I have a file that's UTF-8 - for development purposes it just consists
of a space, an em dash (U+2014, decimal 8212) and another space. I'm using the
following Java code...

import java.io.*;

class UTF8 {
    public static void main(String[] args) throws Exception {
        System.setProperty("file.encoding", "UTF-8");

        File xmlFile = new File("mdash.txt");
        FileInputStream fileInputStream = new FileInputStream(xmlFile);
        byte[] fileBufferByteArray = new byte[(int) xmlFile.length()];
        fileInputStream.read(fileBufferByteArray);
        String fileBufferString = new String(fileBufferByteArray, "UTF-8");
        PrintWriter p = new PrintWriter(System.out);
        p.print(fileBufferString);
        p.close();
    }
}

and running it like this...

java UTF8 > x

but x always ends up containing " ? " (a space, a question mark, then
a space). How can I make this work?

Thanks in advance for your help!

Ralph
 
Stryder

See the InputStreamReader class for converting binary, encoding-specific
data into equivalent Java text types (char and String).

There is an equivalent OutputStreamWriter for going the other direction.
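A minimal sketch of that suggestion applied to the program in the question, assuming Java 7 or later (for StandardCharsets and try-with-resources; older versions can pass the string "UTF-8" instead). The class and method names here are made up for illustration. The key point is that both the reader and the writer name UTF-8 explicitly, so the platform default is never consulted:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8Copy {
    // Decodes bytes from 'in' as UTF-8 and re-encodes the text to 'out' as UTF-8.
    static void copyUtf8(InputStream in, OutputStream out) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush(); // flush only; the caller owns and closes the streams
    }

    public static void main(String[] args) throws IOException {
        // "mdash.txt" is the file name used in the question.
        try (InputStream in = new FileInputStream("mdash.txt")) {
            copyUtf8(in, System.out);
        }
    }
}
```

Run as `java Utf8Copy > x`, this should leave the em dash intact in x, because System.out is wrapped in a writer with an explicit encoding rather than a PrintWriter using the default one.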

Awesome. Thanks a lot!
 
Lothar Kimmeringer

Stryder said:
Awesome. Thanks a lot!

And when reading and writing text from and to files you should
always use InputStreamReader and OutputStreamWriter and specify
a concrete encoding. Otherwise you might break your application
when running it on a different system.
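To illustrate why the encoding matters, here is a small sketch (the class name and hex helper are made up for this example) that encodes the three-character string from the question with two different charsets. An encoding that cannot represent the em dash substitutes a question mark, which matches the " ? " seen in the original output:

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = " \u2014 "; // space, em dash, space

        // UTF-8 can represent the em dash as three bytes.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));    // 20 E2 80 94 20

        // US-ASCII cannot, so getBytes substitutes '?' (0x3F).
        System.out.println(hex(text.getBytes(StandardCharsets.US_ASCII))); // 20 3F 20
    }
}
```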


Regards, Lothar
--
Lothar Kimmeringer E-Mail: (e-mail address removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
 
Roedy Green

How do I read and write a file using UTF8?

There are two different things in Java called UTF-8.
One is counted strings written with DataOutputStream.
The other is a text file encoded in UTF-8.

You can generate the code for either at

http://mindprod.com/applet/fileio.html
--
Roedy Green Canadian Mind Products
http://mindprod.com

"It wasn’t the Exxon Valdez captain’s driving that caused the Alaskan oil spill. It was yours."
~ Greenpeace advertisement New York Times 1990-02-25
 
Lothar Kimmeringer

Mark said:
No, on windows it's Latin1, for sure.

cp1252 != iso latin1 (8859_1)
cp1252 is sometimes called Windows Latin1, but when people say Latin1
on its own, ISO-8859-1 is generally meant.


Regards, Lothar
 
Lothar Kimmeringer

RedGrittyBrick said:
I can confirm that after
System.setProperty("file.encoding", "US-ASCII");

All the io classes I tried still wrote UTF-8 by default (on 32-bit
Windows Vista).

The system property is read once at JVM startup and cached internally.
So if you set the property on the command line instead, with
java -Dfile.encoding=ASCII MyClass
you would get different results.

RedGrittyBrick said:
Perhaps the default is UTF-8 for all platforms that support UTF-8?

No. Windows is Cp1252, Mac OS is something else (but not UTF-8), and Linux
nowadays is mostly UTF-8 but sometimes 8859_1.
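The startup behaviour described above can be checked with a short sketch like the following. The class and method names are made up, and it relies on the assumption that Charset.defaultCharset() (available since Java 5) reflects the cached default; on the JDKs I am aware of it does, but treat that as an assumption rather than a guarantee:

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Returns true if setting file.encoding at runtime changed the default charset.
    static boolean changesDefault(String encodingName) {
        Charset before = Charset.defaultCharset();
        String oldValue = System.getProperty("file.encoding");
        System.setProperty("file.encoding", encodingName);
        try {
            return !before.equals(Charset.defaultCharset());
        } finally {
            // Put the property back so the experiment has no lasting effect.
            if (oldValue == null) {
                System.clearProperty("file.encoding");
            } else {
                System.setProperty("file.encoding", oldValue);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("changed by setProperty: " + changesDefault("US-ASCII"));
    }
}
```

This is why the System.setProperty call in the original program has no visible effect, and why an explicit charset argument is the more dependable fix.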


Regards, Lothar
 
John B. Matthews

Lothar Kimmeringer said:
The System Property is read on startup and kept inside. So if
you set the property by doing
java -Dfile.encoding=ASCII MyClass
you would get different results.


No. Windows is Cp1252, Mac OS is something else (but not UTF-8), and Linux
nowadays is mostly UTF-8 but sometimes 8859_1.

On Mac OS, file.encoding defaults to "MacRoman":

<http://en.wikipedia.org/wiki/Mac_OS_Roman>
 
Mark Space

Lothar said:
cp1252 != iso latin1 (8859_1)
cp1252 is sometimes called Windows Latin1 but if you say Latin1
in general ISO-8859-1 is meant


Yeah, figures Windows would do their own thing....
 
Arne Vajhøj

Roedy said:
There are two different things in Java called UTF-8.
One is counted strings written with DataOutputStream.
The other is a text file encoded in UTF-8.

The fact that there exists a method called writeUTF
that writes a 2-byte length and the UTF-8 encoded bytes
of a String does not mean that Java considers that format
UTF-8.

(There is a method writeInt that writes ints in network
order - that does not mean that Java considers int to be
in network order.)
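The difference between the two formats is easy to see in the bytes. The sketch below (class and method names are made up for illustration) writes the same string once with DataOutputStream.writeUTF and once through an OutputStreamWriter with UTF-8. Note that the JDK documentation describes the writeUTF format as "modified UTF-8"; for this particular string the encoded bytes happen to match standard UTF-8, and only the length prefix differs:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class WriteUtfDemo {
    // Counted string: 2-byte big-endian byte length, then the encoded characters.
    static byte[] viaWriteUTF(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(bytes)) {
            data.writeUTF(s);
        }
        return bytes.toByteArray();
    }

    // Plain UTF-8 text: just the encoded characters, no length prefix.
    static byte[] viaWriter(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            writer.write(s);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = " \u2014 "; // space, em dash, space
        System.out.println(viaWriteUTF(text).length); // 7: 00 05 20 E2 80 94 20
        System.out.println(viaWriter(text).length);   // 5: 20 E2 80 94 20
    }
}
```

A file written with writeUTF is therefore meant to be read back with DataInputStream.readUTF, not opened as a UTF-8 text file.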

Arne
 
Arne Vajhøj

Lothar said:
cp1252 != iso latin1 (8859_1)

It is not exactly the same.

There is a difference for the C1 characters.

But ISO-8859-1, CP-1252, DECMCS and ISO-8859-15 are all close
enough to work for most text.
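As a concrete case of that C1 difference, the sketch below decodes the single byte 0x97 with both charsets. The class and method names are made up; it also assumes the JVM ships the "windows-1252" charset, which is common but not among the charsets every Java platform is required to support:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class C1Demo {
    // Decodes one byte with the given charset and returns the resulting code unit.
    static int decodeSingleByte(int b, Charset charset) {
        return new String(new byte[] {(byte) b}, charset).charAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this charset is available on the running JVM.
        Charset cp1252 = Charset.forName("windows-1252");

        // In Cp1252 the byte 0x97 is the em dash.
        System.out.printf("U+%04X%n", decodeSingleByte(0x97, cp1252));                      // U+2014

        // In ISO-8859-1 the same byte is a C1 control character.
        System.out.printf("U+%04X%n", decodeSingleByte(0x97, StandardCharsets.ISO_8859_1)); // U+0097
    }
}
```

So for the file in the question, the two encodings are not interchangeable: ISO-8859-1 has no em dash at all, while Cp1252 does.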

Arne
 
Roedy Green

The fact that there exists a method called writeUTF
that writes a 2-byte length and the UTF-8 encoded bytes
of a String does not mean that Java considers that format
UTF-8.

(There is a method writeInt that writes ints in network
order - that does not mean that Java considers int to be
in network order.)

These are fine distinctions unlikely to be appreciated by a newbie. I
don't know the full circumstances of his problem. Either method might
be what he is looking for so I mentioned both.
 
Arne Vajhøj

Roedy said:
These are fine distinctions unlikely to be appreciated by a newbie. I
don't know the full circumstances of his problem. Either method might
be what he is looking for so I mentioned both.

Most likely the original poster does not care in the least.

I just think the phrase "There are two different things in Java
called UTF-8. One is counted strings written with DataOutputStream."
is inaccurate.

Arne
 
