Reading MacRoman

cndc

Hi,

I have a text file created on a Macintosh and its encoding is
MacRoman. Unfortunately, I'm having difficulty working with this
encoding. As a test case, I wrote this simple class that should read
in the MacRoman file and produce an ISO-8859-1 file:

import java.io.*;
import java.nio.charset.*;

class Cheesy {
    public static void main(String[] args) {
        int i;
        for (i = 0; i < args.length; i++) {
            try {
                // Decode the i-th file named on the command line as MacRoman.
                InputStreamReader r = new InputStreamReader(new FileInputStream(args[i]), "MacRoman");
                // Re-encode as ISO-8859-1; unmappable characters come out as '?'.
                OutputStreamWriter o = new OutputStreamWriter(System.out, "8859_1");
                int c;
                while ((c = r.read()) != -1) {
                    o.write(c);
                }
                o.flush(); // push buffered output without closing System.out
                r.close();
            } catch (IOException e) {
                System.err.println(e.toString());
            }
        }
    }
}

Sadly, however, many of the weird characters in MacRoman continue to
be converted to question marks as opposed to their normal character.

Am I doing something wrong?

Thank you,
Elizabeth
 
Jon Skeet

cndc said:
Sadly, however, many of the weird characters in MacRoman continue to
be converted to question marks as opposed to their normal character.

Am I doing something wrong?

Well, are the characters you're reading actually *in* ISO-8859-1?
 
cndc

Jon said:
Well, are the characters you're reading actually *in* ISO-8859-1?

Hi Jon,

No. They're in MacRoman format. The idea of the code is to convert
the input stream from MacRoman and send it out in ISO-8859-1.

Elizabeth
 
cndc

Jon said:
I know they are originally - but my point was to ask whether or not
the actual character is in the ISO-8859-1 set as well.

I'm not sure whether or not it is.

I have a text file that was generated on a Macintosh and, having looked
at the Macintosh's character numberings, I have determined that it
uses the MacRoman charset. I'd like to be able to work with this data
internally, but due to the different charsets, some kind of translation
is necessary.

Jon said:
But my point is that you can't convert a character which doesn't
even *exist* in ISO-8859-1 into a value in that character encoding.
Which Unicode character is it you're trying to convert?

I'd like to change some of the characters used in MacRoman to character
entities, such as 0xD2 to &ldquo;, for example.

Does reading a file with its charset parameter set not
automatically convert the incoming stream to some kind of normalized,
internal format?

Thank you for your help,

Elizabeth
 
Jon A. Cruz

cndc said:
I'm not sure whether or not it is.

I have a text file that was generated on a Macintosh and, having looked
at the Macintosh's character numberings, I have determined that it
uses the MacRoman charset. I'd like to be able to work with this data
internally, but due to the different charsets, some kind of translation
is necessary.

Use Unicode.

cndc said:
I'd like to change some of the characters used in MacRoman to character
entities, such as 0xD2 to &ldquo;, for example.

Then do that before writing.

cndc said:
Does reading a file with its charset parameter set not
automatically convert the incoming stream to some kind of normalized,
internal format?

It does convert it.
The internal format as far as the Java programmer is concerned is always
Unicode.

So, in Java, all char's are Unicode. Once you read properly, that's what
you'll have.
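
As a rough illustration of that conversion (a sketch added for clarity, not code from the thread): decoding the MacRoman byte 0xD2 mentioned above should give a single Java char holding the Unicode left double quotation mark. The class name `MacRomanDecode` is made up for this example, and it assumes the runtime ships the `x-MacRoman` charset, which belongs to the JDK's extended charsets and may be absent from trimmed-down installations.

```java
import java.nio.charset.Charset;

class MacRomanDecode {
    // Decode raw MacRoman bytes into a Java String (Unicode internally).
    static String decode(byte[] macRomanBytes) {
        // Assumption: the extended charset "x-MacRoman" is installed.
        return new String(macRomanBytes, Charset.forName("x-MacRoman"));
    }

    public static void main(String[] args) {
        String s = decode(new byte[] { (byte) 0xD2 });
        // Print the code point rather than the character itself, so the
        // console encoding cannot distort the result.
        System.out.printf("U+%04X%n", (int) s.charAt(0));
    }
}
```

If the charset is available, this prints `U+201C`, a value well above 255.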

Here's the "official" MacRoman Unicode mapping.

http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT

However... you shouldn't need that.

If something has a Unicode value over 255, then it won't map into
Latin-1 (AKA ISO-8859-1). Simple, huh?
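
One way to check this without comparing numbers by hand is to ask the Latin-1 encoder directly. A small sketch, where `Latin1Check` and `fitsLatin1` are hypothetical names; `CharsetEncoder.canEncode` is standard `java.nio.charset` API.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Latin1Check {
    // True if the character has a representation in ISO-8859-1.
    static boolean fitsLatin1(char c) {
        CharsetEncoder enc = StandardCharsets.ISO_8859_1.newEncoder();
        return enc.canEncode(c);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1('\u00E9')); // e with acute accent: true
        System.out.println(fitsLatin1('\u201C')); // left double quote: false
    }
}
```

A character that fails this check is what an `OutputStreamWriter` set to ISO-8859-1 replaces with a question mark.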

This might help in those cases:
http://www.w3.org/TR/REC-html40/sgml/entities.html
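
Putting the two suggestions together (replace what will not fit, then write), a minimal sketch might look like the following. `EntityEscaper` and its four-entry lookup table are illustrative only; a real converter would presumably fill the table from the entity list linked above. Code points without a named entry fall back to a numeric character reference. Note that this sketch does not escape `&` itself.

```java
import java.util.HashMap;
import java.util.Map;

class EntityEscaper {
    // A few named entities from the HTML 4 list; deliberately incomplete.
    private static final Map<Integer, String> NAMED = new HashMap<>();
    static {
        NAMED.put(0x2018, "&lsquo;");
        NAMED.put(0x2019, "&rsquo;");
        NAMED.put(0x201C, "&ldquo;");
        NAMED.put(0x201D, "&rdquo;");
    }

    // Replace every code point above 255 so the result can be written
    // as ISO-8859-1 without losing anything to '?'.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp <= 255) {
                out.append((char) cp);
            } else if (NAMED.containsKey(cp)) {
                out.append(NAMED.get(cp));
            } else {
                out.append("&#").append(cp).append(';'); // numeric fallback
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("\u201CHi\u201D")); // prints &ldquo;Hi&rdquo;
    }
}
```

In the test class from the first post, the read loop could collect characters into a `StringBuilder` and pass the result through `escape` before handing it to the ISO-8859-1 writer.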
 
cndc

Jon said:
It does convert it. The internal format as far as the Java
programmer is concerned is always Unicode.

So, in Java, all char's are Unicode. Once you read properly, that's
what you'll have.

Thank you both Jons. Yes, it is very nice how Java converts it into
Unicode right from the get go.

Elizabeth
 
