length of char in bits differs on Win/Linux and Mac

Discussion in 'Java' started by Bart Rider, May 29, 2006.

  1. Bart Rider

    Bart Rider Guest

    Hi all,

    last week i had to write a little homework program like open
    a file and count all characters present in this file. I did it
    using a counting array of size 256 and increasing the specific
    chars position by one, if i've read that character from the
    file.
    The file itself was opened via a FileReader/BufferedReader and
    the lines were read by readLine()

    Now i observed the following. The character 'ä' stored in the
    char variable c and used to access the counting array:
    countingArray[c]++
    caused no problems on Windows/Linux computers, but failed on Macs,
    where the value 8240 (0x2030) was assigned to this char.

    It seems to me, that char on mac computers is 16bit wide.
    Is this true?

    Using a mac even a double cast like
    countingArray[(char)(int)c]++
    did not work. And (c & 0xFF) was no option either, because now
    i would match the 'ä' to '0' (0x30).

    I solved the problem by using a try-catch-block and counting
    'other' characters through it. :)

    Best regards,
    Bart
     
    Bart Rider, May 29, 2006
    #1

  2. Bart Rider wrote:
    >
    > Now i observed the following. The character 'ä' stored in the
    > char variable c and used to access the counting array:
    > countingArray[c]++
    > caused no problems on Windows/Linux computers, but failed on Macs,
    > where the value 8240 (0x2030) was assigned to this char.
    >
    > It seems to me, that char on mac computers is 16bit wide.
    > Is this true?


    Windows is probably using a single byte character encoding (probably
    Cp1252 or similar), whereas Linux and Macs are probably using UTF-8,
    which encodes ASCII characters as ASCII, but characters with codes of
    128 or higher as sequences of two or more bytes.

    http://en.wikipedia.org/wiki/UTF-8

    Linux, I believe, uses the LANG environment variable by default. If you
    type echo $LANG you should see something like en_US.UTF-8 printed. You
    can get back to old-fashioned character sets with export LANG=C (as it's
    an environment variable, it won't apply to Java processes run from other
    shell processes).
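
    To see which encoding a given machine would pick, something like the
    following minimal sketch can be run on each platform. The class and method
    names are made up for illustration; Charset.defaultCharset() and the
    file.encoding system property are standard.

```java
import java.nio.charset.Charset;

// Hypothetical helper: reports the encoding a reader falls back to
// when no charset is given explicitly.
final class DefaultCharsetCheck {
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

    The output depends on the platform, the locale settings and the JDK
    version, which is exactly why relying on the default is fragile.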

    Tom Hawtin
    --
    Unemployed English Java programmer
    http://jroller.com/page/tackline/
     
    Thomas Hawtin, May 29, 2006
    #2

  3. Bart Rider wrote:
    > Now i observed the following. The character 'ä' stored in the
    > char variable c and used to access the counting array:
    > countingArray[c]++
    > caused no problems on Windows/Linux computers, but failed on Macs,
    > where the value 8240 (0x2030) was assigned to this char.
    >
    > It seems to me, that char on mac computers is 16bit wide.
    > Is this true?


    You were just lucky on Windows with your algorithm, and you used the
    wrong encoding for reading on the Mac.

    You were lucky on Windows, because Java uses Unicode for all characters.
    Current Unicode standards support characters with code points beyond
    2^16 (Unicode is not a 16-bit standard) - although you have
    trouble with Unicode beyond 2^16 in Java. But whatever Java version you
    use, your 256-element array could have failed at any time. You were lucky,
    because your input didn't contain any character beyond the Latin-1
    range. If it had, your code would have blown up on Windows too.

    Regarding the Mac result: You used the wrong encoding. When you read
    text data into Java, Java needs to know in what encoding that data
    comes, so it can be translated to Java's internal Unicode. You did use
    an encoding (implicitly or explicitly) which triggered the translation
    of some input data to the Unicode code point 0x2030. Since 0x2030 is the
    Unicode code point for the permille sign, and not for a-umlaut, the
    conversion was wrong.
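
    The effect can be reproduced with a single byte. This is only a sketch:
    it assumes the file stores a-umlaut as the byte 0xE4 (as ISO-8859-1 does)
    and that the optional "x-MacRoman" charset is available in the JDK at
    hand. The class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes one byte under different charsets.
final class DecodeDemo {
    // Returns the char value produced by decoding the single byte b with cs.
    static int decodeByte(int b, Charset cs) {
        return new String(new byte[] { (byte) b }, cs).charAt(0);
    }

    public static void main(String[] args) {
        System.out.printf("ISO-8859-1: U+%04X%n",
                decodeByte(0xE4, StandardCharsets.ISO_8859_1));
        // x-MacRoman is not guaranteed to exist on every Java runtime
        if (Charset.isSupported("x-MacRoman")) {
            System.out.printf("x-MacRoman: U+%04X%n",
                    decodeByte(0xE4, Charset.forName("x-MacRoman")));
        }
    }
}
```

    The same byte yields a different char depending on the charset used to
    decode it; the bytes in the file never change.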

    You need to fix the encoding which you use for reading the data. All
    your casting and bit-masking is nonsense; it will not fix the
    encoding problem.

    In general, even if you had fixed the encoding problem, your original
    algorithm was faulty. It failed for everything beyond code point 255,
    which are roughly 96000 possible characters your algorithm doesn't
    cover. Your original algorithm just handled about 1/377th of all valid
    input values.

    You only partly fixed that with the counting of 'other' characters,
    partly only because ...

    > I solved the problem by using a try-catch-block and counting
    > 'other' characters through it. :)


    ... using exceptions to handle valid input data is bad. A simple
    check whether a code point is greater than 255 would be the right thing
    to do here.
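
    A minimal sketch of that comparison, assuming the 256-slot array from the
    original post plus a single counter for everything else (BoundedCounter
    and its members are hypothetical names):

```java
// Hypothetical counter: Latin-1 range in an array, the rest in one bucket.
final class BoundedCounter {
    final int[] counts = new int[256]; // one slot per char value 0..255
    int other;                         // every char above 255

    void add(char c) {
        if (c < counts.length) {
            counts[c]++;
        } else {
            other++; // plain comparison, no exception involved
        }
    }

    void addAll(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            add(text.charAt(i));
        }
    }

    public static void main(String[] args) {
        BoundedCounter counter = new BoundedCounter();
        counter.addAll("a\u00E4\u2030a");
        System.out.println("a: " + counter.counts['a']
                + ", other: " + counter.other);
    }
}
```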

    /Thomas
    --
    The comp.lang.java.gui FAQ:
    ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
    http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/
     
    Thomas Weidenfeller, May 29, 2006
    #3
  4. Bart Rider

    Guest

    Bart Rider wrote:
    > Hi all,
    >
    > last week i had to write a little homework program like open
    > a file and count all characters present in this file.


    Apparently that's not what your program is trying to do: your
    program seems to be trying to count how many occurrences of each
    character appear in the file.

    The billion-dollar question: what is the encoding of the file
    containing the characters you want to count?


    > I did it
    > using a counting array of size 256 and increasing the specific
    > chars position by one, if i've read that character from the
    > file.


    It could work the way you programmed it if you knew for sure
    that your source file contains characters that could be mapped
    to ISO-Latin-1 chars when "decoded"/recoded to Unicode.

    If a Java char is between 0 and 127 you know that it is an
    ASCII character (and hence also an ISO-Latin-1 character).

    If a Java char is between 160 and 255 you know that you
    have an ISO-Latin-1 character (128 through 159 being
    control codes).

    If you read a file by specifying a wrong encoding (or by using
    a default encoding that doesn't match your file's encoding),
    you'll read meaningless char values...

    If you read a file specifying a correct encoding, while having
    your file containing characters not belonging in the ISO-Latin-1
    range (which is completely legal), some of your chars *will*
    be greater than 255 and hence your broken code *will*
    throw ArrayIndexOutOfBoundsExceptions.


    > It seems to me, that char on mac computers is 16bit wide.
    > Is this true?


    "char" in Java is always 16 bit wide (which is unfortunate btw
    since, as of Unicode 3.1, this is not wide enough to represent
    every Unicode code point, but this is another topic).
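
    Both points can be checked directly. The sketch below uses only standard
    java.lang methods; the class name and the helper charsNeeded are made up,
    and U+1D11E (MUSICAL SYMBOL G CLEF) is just one example of a code point
    above 0xFFFF.

```java
// Hypothetical demo class: shows the width of char and a code point
// that does not fit into a single char.
final class CharWidth {
    // Number of Java chars needed to hold the given code point (1 or 2).
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println("bits per char: " + Character.SIZE);
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println("chars used:    " + clef.length());
        System.out.println("code points:   " + clef.codePointCount(0, clef.length()));
    }
}
```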

    Your question shows one thing: you need to read up on Java's
    primitive char type and on the various character encodings.


    > Using a mac even a double cast like
    > countingArray[(char)(int)c]++


    nonsense...


    > did not work. And (c & 0xFF) was no option either, because now
    > i would match the 'ä' to '0' (0x30).


    0x2030 & 0xff gives indeed 0x30...

    'ä' can be represented in ISO-Latin-1 and in Unicode by the value
    0x00e4 (it cannot be represented in ASCII).

    The problem is that you're using FileReader, which uses the
    platform's default encoding (in this case "MACROMAN"), on a
    file that is encoded in ISO-8859-1, hence the byte 0xE4 is
    converted to 0x2030 instead of 0x00e4.

    You should use an InputStreamReader and specify the correct
    encoding:

    InputStream is = new FileInputStream("/home/public/dl/tmp.txt");
    InputStreamReader isr = new InputStreamReader(is, "ISO-8859-1");
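
    Put together, a complete version might look like the sketch below. It
    assumes the file really is ISO-8859-1 and keeps an extra slot (index 256)
    for anything outside that range; the class name and the count method are
    hypothetical. Unlike readLine(), read() also counts line terminators.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical example class: counts chars of a file decoded with an
// explicitly chosen charset instead of the platform default.
final class Latin1FileCount {
    // Slots 0..255 hold per-char counts, slot 256 holds all other chars.
    static int[] count(Path file, Charset cs) throws IOException {
        int[] counts = new int[257];
        try (Reader in = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), cs))) {
            int c;
            while ((c = in.read()) != -1) { // -1 signals end of stream
                counts[c < 256 ? c : 256]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: Latin1FileCount <file>");
            return;
        }
        int[] counts = count(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        System.out.println("a-umlaut: " + counts[0xE4]);
        System.out.println("other:    " + counts[256]);
    }
}
```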


    > I solved the problem by using a try-catch-block and counting
    > 'other' characters through it. :)


    Using exceptions for flow control is a seriously broken way of
    programming in Java...

    You want to read up on "encoding", you want to know what is
    the encoding of the file you're trying to read, you want to
    know what your platform's default encoding is, you want to
    understand what the char primitive in Java is, you want
    to know that ISO-Latin-1 (aka ISO-8859-1) is a superset
    of ASCII (using the same code for the same characters) and
    you want to know that Unicode is a superset of the
    ISO-Latin-1 characters (using the same "codepoint" [though
    this is Unicode-specific terminology] for same characters).

    As a last note, ASCII (aka US-ASCII) defines the position of
    128 characters, not 256 as many people believe.

    Hope it helps,

    Alex
     
    , May 29, 2006
    #4
  5. Bart Rider

    Guest

    Hi Thomas,

    two really minor nitpicks...

    (I thought the same "nonsense" about the OP's double cast ;)


    Thomas Weidenfeller wrote:
    ....
    > Regarding the Mac result: You used the wrong encoding. When you read
    > text data into Java, Java needs to know in what encoding that data
    > comes, so it can be translated to Java's internal Unicode. You did use
    > an encoding (implicitly or explicitly) which triggered the translation
    > of some input data to the Unicode code point 0x2030. Since 0x2030 is the
    > Unicode code point for the permille sign, and not for a-umlaut, the
    > conversion was wrong.


    yup, wrong conversion because FileReader uses the platform's default
    encoding, "MACROMAN" in his case, to read a file that is not encoded
    in MACROMAN.


    > > I solved the problem by using a try-catch-block and counting
    > > 'other' characters through it. :)

    >
    > ... using exceptions to handle valid input data is bad. A simple
    > comparison if a code point is greater 255 would be the right thing to do
    > here.


    The right thing to do here would be to use an InputStreamReader and
    specify the correct file encoding (ie ISO-8859-1).
     
    , May 29, 2006
    #5
  6. wrote:
    > The right thing to do here would be to use an InputStreamReader and
    > specify the correct file encoding (ie ISO-8859-1).


    Only if one knows that the input is indeed ISO-8859-1 - which the OP
    didn't tell us. If the input contains data which, if correctly decoded,
    maps to a Unicode code point greater than 255, you are back to the same
    problem. The usage of an 'other' counter is IMHO a good idea.

    /Thomas
    --
    The comp.lang.java.gui FAQ:
    ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
    http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/
     
    Thomas Weidenfeller, May 29, 2006
    #6
  7. Bart Rider

    Oliver Wong Guest

    <> wrote in message
    news:...
    > Bart Rider wrote:
    > > Hi all,
    > >
    > > last week i had to write a little homework program like open
    > > a file and count all characters present in this file.

    >
    > Apparently that's not what your program is trying to do: your
    > program seems to be trying to count how many occurrences of each
    > character appear in the file.


    This threw me off too. To the OP: Please be very precise about what your
    program is supposed to do, or else I'll be very confused and my advice will
    probably be less effective.

    > > I did it
    > > using a counting array of size 256 and increasing the specific
    > > chars position by one, if i've read that character from the
    > > file.


    Perhaps the OP isn't trying to read characters at all, but instead is
    reading in bytes. That is, the reader could stick with an array of size 256,
    and read in one byte at a time, counting how often each byte appears in a
    file. That would remove the need for an encoding altogether, as well as
    that "others" variable mentioned upthread.
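
    A sketch of that byte-oriented variant, assuming raw byte frequencies are
    what the assignment wants (ByteHistogram is a made-up name). No decoding
    takes place, so the result is the same on every platform:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical example class: counts how often each byte value occurs.
final class ByteHistogram {
    static long[] count(InputStream in) throws IOException {
        long[] counts = new long[256];
        int b;
        // InputStream.read() returns 0..255, or -1 at end of stream
        while ((b = in.read()) != -1) {
            counts[b]++;
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: ByteHistogram <file>");
            return;
        }
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            long[] counts = count(in);
            System.out.println("0xE4 bytes: " + counts[0xE4]);
        }
    }
}
```

    Note that in a UTF-8 file an 'ä' shows up as the two bytes 0xC3 0xA4, so
    byte counts and character counts are different things.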

    - Oliver
     
    Oliver Wong, May 29, 2006
    #7
  8. Bart Rider

    Rogan Dawes Guest

    Oliver Wong wrote:
    >
    > <> wrote in message
    > news:...
    >> Bart Rider wrote:
    >> > Hi all,
    >> >
    >> > last week i had to write a little homework program like open
    >> > a file and count all characters present in this file.

    >>
    >> Apparently that's not what your program is trying to do: your
    >> program seems to be trying to count how many occurrences of each
    >> character appear in the file.

    >
    > This threw me off too. To the OP: Please be very precise about what
    > your program is supposed to do, or else I'll be very confused and my
    > advice will probably be less effective.
    >
    >> > I did it
    >> > using a counting array of size 256 and increasing the specific
    >> > chars position by one, if i've read that character from the
    >> > file.

    >
    > Perhaps the OP isn't trying to read characters at all, but instead is
    > reading in bytes. That is, the reader could stick with an array of size
    > 256, and read in one byte at a time, counting how often each byte
    > appears in a file. That would remove the need for an encoding all
    > together, as well as that "others" variable mentioned upthread.
    >
    > - Oliver


    As an additional aside, given that the OP will be potentially dealing
    with far more characters than just 256, but possibly quite sparsely
    distributed, the better data structure would probably be a
    Map<Character, Integer>

    Assuming he really IS interested in chars, not bytes, that is.
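
    Roughly like this, as a minimal sketch. The class name is invented, and
    the choice of TreeMap (to get the entries sorted by char value) and of
    Map.merge, which needs Java 8 or later, are assumptions of the example:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical example class: sparse per-char counts in a map.
final class CharFrequency {
    static Map<Character, Integer> count(CharSequence text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (int i = 0; i < text.length(); i++) {
            // start at 1 for a new char, otherwise add 1 to the old count
            counts.merge(text.charAt(i), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("a\u00E4\u2030a"));
    }
}
```

    Only chars that actually occur take up space, so no bounds check and no
    'other' bucket is needed.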

    FWIW.

    Rogan
     
    Rogan Dawes, May 30, 2006
    #8
  9. Bart Rider

    Bart Rider Guest

    Rogan Dawes wrote:
    > Oliver Wong wrote:
    >
    >>
    >> <> wrote in message
    >> news:...
    >>
    >>> Bart Rider wrote:
    >>> > Hi all,
    >>> >
    >>> > last week i had to write a little homework program like open
    >>> > a file and count all characters present in this file.
    >>>
    >>> Apparently that's not what your program is trying to do: your
    >>> program seems to be trying to count how many occurrences of each
    >>> character appear in the file.

    >>
    >>
    >> This threw me off too. To the OP: Please be very precise about what
    >> your program is supposed to do, or else I'll be very confused and my
    >> advice will probably be less effective.
    >>
    >>> > I did it
    >>> > using a counting array of size 256 and increasing the specific
    >>> > chars position by one, if i've read that character from the
    >>> > file.

    >>
    >>
    >> Perhaps the OP isn't trying to read characters at all, but instead
    >> is reading in bytes. That is, the reader could stick with an array of
    >> size 256, and read in one byte at a time, counting how often each byte
    >> appears in a file. That would remove the need for an encoding all
    >> together, as well as that "others" variable mentioned upthread.
    >>
    >> - Oliver

    >
    >
    > As an additional aside, given that the OP will be potentially dealing
    > with far more characters than just 256, but possibly quite sparsely
    > distributed, the better data structure would probably be a
    > Map<Character, Integer>
    >
    > Assuming he really IS interested in chars, not bytes, that is.
    >
    > FWIW.
    >
    > Rogan


    Thanks a lot for all your replies. They helped me a lot to
    understand the flaws in my little program.

    Actually I really thought char is only 8 bits wide (I come from
    C programming, where char is a replacement for byte ...)
    But now, with your hints on Unicode and character mapping, I
    have to look more closely at every file I read and what I intend to
    do with it.

    Thanks again,
    Bart
     
    Bart Rider, May 30, 2006
    #9
  10. Bart Rider

    Chris Uppal Guest

    Rogan Dawes wrote:

    > As an additional aside, given that the OP will be potentially dealing
    > with far more characters than just 256, but possibly quite sparsely
    > distributed, the better data structure would probably be a
    > Map<Character, Integer>


    Or maybe even an int[] array for the first 128 code points (0 through 127)
    and a Map<Character, Integer> to handle the overflow.
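
    One way this hybrid could look, as a sketch only (HybridCounter and its
    methods are hypothetical names; the ASCII range 0..127 goes into the
    array, everything else into the map):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class: array for ASCII, map for all other chars.
final class HybridCounter {
    private final int[] ascii = new int[128];
    private final Map<Character, Integer> rest = new HashMap<>();

    void add(char c) {
        if (c < ascii.length) {
            ascii[c]++;                     // fast path, no boxing
        } else {
            rest.merge(c, 1, Integer::sum); // sparse overflow
        }
    }

    int get(char c) {
        return c < ascii.length ? ascii[c] : rest.getOrDefault(c, 0);
    }

    public static void main(String[] args) {
        HybridCounter counter = new HybridCounter();
        for (char c : "a\u00E4\u2030a".toCharArray()) {
            counter.add(c);
        }
        System.out.println("a: " + counter.get('a')
                + ", per mille: " + counter.get('\u2030'));
    }
}
```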

    -- chris
     
    Chris Uppal, May 30, 2006
    #10