length of char in bits differs on Win/Linux and Mac


Bart Rider

Hi all,

Last week I had to write a little homework program that opens
a file and counts all characters present in the file. I did it
using a counting array of size 256, increasing the specific
character's position by one whenever I read that character from
the file.
The file itself was opened via a FileReader/BufferedReader, and
the lines were read with readLine().

Now I observed the following. The character 'ä', stored in the
char variable c and used to index the counting array:
countingArray[c]++
caused no problems on Windows/Linux computers, but it did on Macs,
where the value 8240 (0x2030) was assigned to this char.

It seems to me that char on Mac computers is 16 bits wide.
Is this true?

On a Mac, even a double cast like
countingArray[(char)(int)c]++
did not work. And (c & 0xFF) was no option either, because then
I would count the 'ä' as '0' (0x30).

I solved the problem by using a try-catch block and counting
'other' characters through it. :)

Best regards,
Bart
 

Thomas Hawtin

Bart said:
Now I observed the following. The character 'ä', stored in the
char variable c and used to index the counting array:
countingArray[c]++
caused no problems on Windows/Linux computers, but it did on Macs,
where the value 8240 (0x2030) was assigned to this char.

It seems to me that char on Mac computers is 16 bits wide.
Is this true?

Windows is probably using a single-byte character encoding (probably
Cp1252 or similar), whereas Linux and Macs are probably using UTF-8,
which encodes ASCII characters as ASCII, but characters with codes of
128 or higher as sequences of two or more bytes.

http://en.wikipedia.org/wiki/UTF-8
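
To make that concrete, here is a minimal sketch of how the single
character 'ä' comes out as bytes under both encodings (Cp1252 and
UTF-8 are standard charset names on Sun JREs, but availability can
vary):

import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // 'ä' (U+00E4) is one byte in Cp1252, two bytes in UTF-8
        printBytes("\u00e4".getBytes("Cp1252")); // prints: e4
        printBytes("\u00e4".getBytes("UTF-8"));  // prints: c3 a4
    }

    static void printBytes(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            System.out.print(Integer.toHexString(bytes[i] & 0xff) + " ");
        }
        System.out.println();
    }
}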

On Linux, I believe the default is taken from the LANG environment
variable. If you type echo $LANG you should see something like
en_US.UTF-8 printed. You can get back to old-fashioned character sets
with export LANG=C (as it's an environment variable, it won't apply to
Java processes run from other shell processes).
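
You can also ask Java directly which default it ended up with. A quick
sketch (Charset.defaultCharset() needs Java 5; the file.encoding system
property also works on older versions):

import java.nio.charset.Charset;

public class DefaultCharset {
    public static void main(String[] args) {
        // The charset FileReader and friends will use implicitly
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}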

Tom Hawtin
 

Thomas Weidenfeller

Bart said:
Now I observed the following. The character 'ä', stored in the
char variable c and used to index the counting array:
countingArray[c]++
caused no problems on Windows/Linux computers, but it did on Macs,
where the value 8240 (0x2030) was assigned to this char.

It seems to me that char on Mac computers is 16 bits wide.
Is this true?

You were just lucky on Windows with your algorithm, and you used the
wrong encoding for reading on the Mac.

You were lucky on Windows, because Java uses Unicode for all characters.
Current Unicode standards support characters with code points beyond
2^16 (Unicode is not a 16-bit character standard) - although you have
trouble with Unicode beyond 2^16 in Java. But whatever Java version you
use, your 256-wide array could have fallen over at any time. You were
lucky, because your input didn't contain any character beyond the
Latin-1 range. If it had, your code would have blown up on Windows, too.
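
To illustrate the beyond-2^16 trouble: such characters take two Java
chars (a surrogate pair). A minimal sketch using the Java 5 API (the
clef character is just an arbitrary example):

public class SupplementaryDemo {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies beyond 2^16
        String s = new String(Character.toChars(0x1D11E));
        System.out.println(s.length());                      // 2 char values
        System.out.println(s.codePointCount(0, s.length())); // but 1 code point
    }
}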

Regarding the Mac result: You used the wrong encoding. When you read
text data into Java, Java needs to know in what encoding that data
comes, so it can be translated to Java's internal Unicode. You did use
an encoding (implicitly or explicitly) which triggered the translation
of some input data to the Unicode code point 0x2030. Since 0x2030 is the
Unicode code point for the permille sign, and not for a-umlaut, the
conversion was wrong.

You need to fix the encoding which you use for reading the data. All
your casting and bit-masking is nonsense; it will not fix the
encoding problem.

In general, even if you had fixed the encoding problem, your original
algorithm was faulty. It failed for everything beyond code point 255,
leaving roughly 96,000 assigned Unicode characters uncovered. Your
original algorithm handled only about 1/377th of all valid input
values.

You only partly fixed that by counting 'other' characters - partly,
because ...
I solved the problem by using a try-catch block and counting
'other' characters through it. :)

.... using exceptions to handle valid input data is bad. A simple
comparison checking whether a code point is greater than 255 would be
the right thing to do here.
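
A minimal sketch of that check (class and field names are mine, not
the OP's):

class Counter {
    private final int[] countingArray = new int[256];
    private int other = 0;

    void count(char c) {
        if (c < 256) {
            countingArray[c]++; // Latin-1 range, counted individually
        } else {
            other++;            // everything else goes into one bucket
        }
    }
}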

/Thomas
 

alexandre_paterson

Bart said:
Hi all,

Last week I had to write a little homework program that opens
a file and counts all characters present in the file.

Apparently that's not what your program is trying to do: your
program seems to be trying to count how many occurrences of each
character appear in the file.

The billion-dollar question: what is the encoding of the file
containing the characters you want to count?

I did it
using a counting array of size 256, increasing the specific
character's position by one whenever I read that character from
the file.

It could work the way you programmed it if you knew for sure
that your source file contains characters that could be mapped
to ISO-Latin-1 chars when "decoded"/recoded to Unicode.

If a Java char is between 0 and 127 you know that it is an
ASCII character (and hence also an ISO-Latin-1 character).

If a Java char is between 160 and 255 you know that you
have an ISO-Latin-1 character (128 through 159 being
control codes).
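
As a sketch, those ranges can be checked directly (the class and
helper names are mine):

class CharClassifier {
    static String classify(char c) {
        if (c < 128) {
            return "ASCII (and hence also ISO-Latin-1)";
        } else if (c >= 160) {
            return c <= 255 ? "ISO-Latin-1" : "beyond ISO-Latin-1";
        } else {
            return "control code range"; // 128 through 159
        }
    }
}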

If you read a file by specifying a wrong encoding (or by using
a default encoding that doesn't match your file's encoding),
you'll read meaningless char values...

If you read a file specifying a correct encoding, while having
your file containing characters not belonging in the ISO-Latin-1
range (which is completely legal), some of your chars *will*
be greater than 255 and hence your broken code *will*
throw ArrayIndexOutOfBoundsExceptions.

It seems to me, that char on mac computers is 16bit wide.
Is this true?

"char" in Java is always 16 bit wide (which is unfortunate btw
since since Unicode 3.1 this is not wide enough to represent
every Unicode code points, but this another topic).
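
(For the record, Java 5's Character.SIZE constant spells out the
width:)

public class CharWidth {
    public static void main(String[] args) {
        System.out.println(Character.SIZE);            // 16, on every platform
        System.out.println((int) Character.MAX_VALUE); // 65535
    }
}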

Your question shows one thing: you need to read up on Java's
primitive char type and on the various character encodings.

On a Mac, even a double cast like
countingArray[(char)(int)c]++
nonsense...


did not work. And (c & 0xFF) was no option either, because then
I would count the 'ä' as '0' (0x30).

0x2030 & 0xff gives indeed 0x30...

'ä' can be represented in ISO-Latin-1 and in Unicode by the value
0x00e4 (it cannot be represented in ASCII).

The problem is that you're using FileReader, which uses the
platform's default encoding (in this case "MacRoman") on a
file that is encoded in ISO-8859-1, hence the
conversion of 0x00e4 to 0x2030.

You should use an InputStreamReader and specify the correct
encoding:

InputStream is = new FileInputStream("/home/public/dl/tmp.txt");
InputStreamReader isr = new InputStreamReader(is, "ISO-8859-1");
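
Wrapped in a BufferedReader so readLine() still works, a fuller sketch
(file path from above; ISO-8859-1 is assumed to be the file's actual
encoding):

import java.io.*;

public class CountChars {
    public static void main(String[] args) throws IOException {
        InputStream is = new FileInputStream("/home/public/dl/tmp.txt");
        InputStreamReader isr = new InputStreamReader(is, "ISO-8859-1");
        BufferedReader br = new BufferedReader(isr);
        String line;
        while ((line = br.readLine()) != null) {
            // walk the chars of each line and count them here
        }
        br.close();
    }
}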

I solved the problem by using a try-catch-block and counting
'other' characters through it. :)

Using exceptions for flow control is a seriously broken way of
programming in Java...

You want to read up on "encoding". You want to know what the encoding
of the file you're trying to read is, and what your platform's default
encoding is. You want to understand what the char primitive in Java is.
You want to know that ISO-Latin-1 (aka ISO-8859-1) is a superset of
ASCII (using the same codes for the same characters), and that Unicode
is a superset of the ISO-Latin-1 characters (using the same "code
point" [though this is Unicode-specific terminology] for the same
characters).

As a last note, ASCII (aka US-ASCII) defines the position of
128 characters, not 256 as many people believe.

Hope it helps,

Alex
 

alexandre_paterson

Hi Thomas,

two really minor nitpicks...

(I thought the same "nonsense" about the OP's double cast ;)


Thomas Weidenfeller wrote:
....
Regarding the Mac result: You used the wrong encoding. When you read
text data into Java, Java needs to know in what encoding that data
comes, so it can be translated to Java's internal Unicode. You did use
an encoding (implicitly or explicitly) which triggered the translation
of some input data to the Unicode code point 0x2030. Since 0x2030 is the
Unicode code point for the permille sign, and not for a-umlaut, the
conversion was wrong.

Yup, wrong conversion, because FileReader uses the platform's default
encoding, "MacRoman" in his case, to read a file that is not encoded
in MacRoman.

... using exceptions to handle valid input data is bad. A simple
comparison checking whether a code point is greater than 255 would be
the right thing to do here.

The right thing to do here would be to use an InputStreamReader and
specify the correct file encoding (i.e. ISO-8859-1).
 

Thomas Weidenfeller

The right thing to do here would be to use an InputStreamReader and
specify the correct file encoding (i.e. ISO-8859-1).

Only if one knows that the input is indeed ISO-8859-1 - which the OP
didn't tell us. If the input contains data which, correctly decoded,
maps to Unicode code points greater than 255, you are back to the same
problem. The usage of an 'other' counter is IMHO a good idea.

/Thomas
 

Oliver Wong

Apparently that's not what your program is trying to do: your
program seems to be trying to count how many occurrences of each
character appear in the file.

This threw me off too. To the OP: Please be very precise about what your
program is supposed to do, or else I'll be very confused and my advice will
probably be less effective.

Perhaps the OP isn't trying to read characters at all, but instead is
reading in bytes. That is, the reader could stick with an array of size
256 and read in one byte at a time, counting how often each byte appears
in a file. That would remove the need for an encoding altogether, as
well as that "others" variable mentioned upthread.

- Oliver
 

Rogan Dawes

Oliver said:
This threw me off too. To the OP: Please be very precise about what
your program is supposed to do, or else I'll be very confused and my
advice will probably be less effective.


Perhaps the OP isn't trying to read characters at all, but instead is
reading in bytes. That is, the reader could stick with an array of size
256 and read in one byte at a time, counting how often each byte appears
in a file. That would remove the need for an encoding altogether, as
well as that "others" variable mentioned upthread.

- Oliver

As an additional aside, given that the OP will potentially be dealing
with far more characters than just 256, possibly quite sparsely
distributed, the better data structure would probably be a
Map<Character, Integer>.

Assuming he really IS interested in chars, not bytes, that is.
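
A minimal sketch of the map-based version (names are mine; the encoding
is assumed, as discussed upthread):

import java.io.*;
import java.util.*;

public class MapCount {
    public static void main(String[] args) throws IOException {
        Map<Character, Integer> counts = new HashMap<Character, Integer>();
        Reader in = new InputStreamReader(
                new FileInputStream(args[0]), "ISO-8859-1");
        int c;
        while ((c = in.read()) != -1) {
            Character key = Character.valueOf((char) c);
            Integer old = counts.get(key);
            counts.put(key, old == null ? 1 : old + 1);
        }
        in.close();
        System.out.println(counts);
    }
}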

FWIW.

Rogan
 

Bart Rider

Rogan said:
As an additional aside, given that the OP will potentially be dealing
with far more characters than just 256, possibly quite sparsely
distributed, the better data structure would probably be a
Map<Character, Integer>.

Assuming he really IS interested in chars, not bytes, that is.

FWIW.

Rogan

Thanks a lot for all your replies. They helped me a lot to
understand the flaws in my little program.

Actually, I really thought char is only 8 bits wide (I come from
C programming, where char is a replacement for byte ...).
But now, with your hints on Unicode and character mapping, I
will have to look more closely at every file I read and at what
I intend to do with it.

Thanks again,
Bart
 

Chris Uppal

Rogan said:
As an additional aside, given that the OP will potentially be dealing
with far more characters than just 256, possibly quite sparsely
distributed, the better data structure would probably be a
Map<Character, Integer>.

Or maybe even an int[] array for the first 128 code points and a
Map<Character, Integer> to handle the overflow.
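
A sketch of that hybrid (class and field names are mine):

import java.util.*;

class HybridCounter {
    private final int[] ascii = new int[128];  // dense: code points 0..127
    private final Map<Character, Integer> overflow =
            new HashMap<Character, Integer>(); // sparse: everything else

    void count(char c) {
        if (c < 128) {
            ascii[c]++;
        } else {
            Character key = Character.valueOf(c);
            Integer old = overflow.get(key);
            overflow.put(key, old == null ? 1 : old + 1);
        }
    }
}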

-- chris
 
