converting from one charset encoding to another ...

Discussion in 'Java' started by Albretch Mueller, Nov 23, 2009.

  1. Some time ago I coded some methods for charset re-encoding. Say you
    get files in Cyrillic, encoded as “KOI8-R”, and you want them as UTF-8.

    What I did was basically open an InputStreamReader(FileInputStream
    FIS, String aEncoding1) and an OutputStreamWriter(FOS, “UTF-8”), then
    call InputStreamReader.read(char[] chrBffr) and
    OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop until it
    hit EOF.
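
    A minimal sketch of the read/write loop described above. The class
    name Reencode, the method name reencode and the 8192-char buffer size
    are illustrative choices, not taken from the original code, and the
    sketch passes Charset objects instead of encoding names.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Reencode {
    // Decodes bytes from 'in' using 'from' and re-encodes them to 'out' using 'to'.
    // The caller owns both streams and is responsible for closing them.
    static void reencode(InputStream in, Charset from, OutputStream out, Charset to)
            throws IOException {
        Reader reader = new InputStreamReader(in, from);
        Writer writer = new OutputStreamWriter(out, to);
        char[] buffer = new char[8192]; // hypothetical buffer size
        int charsRead;
        // read() returns the number of chars read, or -1 at end of stream
        while ((charsRead = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, charsRead);
        }
        writer.flush(); // push any bytes still buffered in the encoder
    }

    public static void main(String[] args) throws IOException {
        // args[0]: KOI8-R input file, args[1]: UTF-8 output file
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            reencode(in, Charset.forName("KOI8-R"), out, StandardCharsets.UTF_8);
        }
    }
}
```

    Note that the count returned by read(char[]) is a number of chars, not
    bytes, even though the original variable is called iRdByts.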

    That works just fine, yet I wonder if there are better/faster ways to
    do that using channels/memory-mapped files.

    Also, where can you get actual files in different encodings to test
    these methods?

    Thanks
    lbrtchx
    {comp.lang.java.programmer}
    Albretch Mueller, Nov 23, 2009
    #1

  2. Albretch Mueller wrote:
    > Some time ago I coded some methods for charset re-encoding. Say you
    > get files in Cyrillic, “KOI8-R”, and you want them as UTF-8
    >
    > What I did was basically opening an InputStreamReader(FileInputStream
    > FIS, String aEncoding1) and an OutputStreamWriter(FOS, “UTF-8”) and
    > went InputStreamReader.read(char[] chrBffr) and
    > OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop till
    > it hit an EOF
    >
    > That works just fine, yet I wonder if there are better/faster ways
    > to do that using channels/memory mapped files
    >
    > Also where can you get actual files with different types of
    > encodings to test these methods.


    You can create them easily enough with a FileWriter that writes to an
    OutputStreamWriter of the desired encoding.
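
    One way to sketch this in code is shown here. It wraps a
    FileOutputStream in an OutputStreamWriter, because before Java 11
    FileWriter always used the platform default charset. The class name
    MakeSamples, the sample text and the file names are made up for
    illustration, and the listed charsets are assumed to be available in
    your runtime.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.Paths;

class MakeSamples {
    // Writes 'text' to 'file' encoded with 'charset', replacing any existing file.
    static void writeSample(Path file, String text, Charset charset) throws IOException {
        try (Writer writer =
                 new OutputStreamWriter(new FileOutputStream(file.toFile()), charset)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // A short Cyrillic greeting, written with Unicode escapes so the source
        // file's own encoding does not matter.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440!\n";
        // Hypothetical selection of encodings to generate test files for.
        String[] names = {"KOI8-R", "windows-1251", "UTF-8", "UTF-16LE"};
        for (String name : names) {
            writeSample(Paths.get("sample-" + name + ".txt"), text, Charset.forName(name));
        }
    }
}
```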
    Mike Schilling, Nov 23, 2009
    #2

  3. On Nov 23, 5:54 am, "Mike Schilling" wrote:
    > Albretch Mueller wrote:
    > > Some time ago I coded some methods for charset re-encoding. Say you
    > > get files in Cyrillic, “KOI8-R”, and you want them as UTF-8
    > >
    > > What I did was basically opening an InputStreamReader(FileInputStream
    > > FIS, String aEncoding1) and an OutputStreamWriter(FOS, “UTF-8”) and
    > > went InputStreamReader.read(char[] chrBffr) and
    > > OutputStreamWriter.write(chrBffr, 0, iRdByts) in a while loop till
    > > it hit an EOF
    > >
    > > That works just fine, yet I wonder if there are better/faster ways
    > > to do that using channels/memory mapped files
    > >
    > > Also where can you get actual files with different types of
    > > encodings to test these methods.
    >
    > You can create them easily enough with a FileWriter that writes to an
    > OutputStreamWriter of the desired encoding.

    ~
    After checking the API I don't see what the difference would be
    between a plain reader and a FileOutputStream. What is it?

    Thank you
    lbrtchx
    Albretch Mueller, Nov 23, 2009
    #3
  4. Albretch Mueller wrote:
    > After checking the API I don't see what the difference would be
    > between a plain reader and a FileOutputStream. What is it?


    I'll assume you either meant a "plain writer" or a 'FileInputStream', but the
    question remains what you mean by a "plain reader/writer".

    'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.
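
    To make the char/byte distinction concrete, here is a small sketch
    that encodes the same six-character string with several charsets and
    counts the resulting bytes. The class and method names are
    hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BytesVsChars {
    // Number of bytes produced when 'text' is encoded with 'charset'.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Six Cyrillic letters, i.e. six Java chars.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442";
        System.out.println("chars:    " + text.length());
        System.out.println("KOI8-R:   " + encodedLength(text, Charset.forName("KOI8-R")));
        System.out.println("UTF-8:    " + encodedLength(text, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE: " + encodedLength(text, StandardCharsets.UTF_16BE));
    }
}
```

    KOI8-R uses one byte per Cyrillic letter, while UTF-8 and UTF-16 use
    two, so the same chars become different byte sequences depending on
    the charset the Writer was created with.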

    --
    Lew
    Lew, Nov 23, 2009
    #4
  5. Albretch Mueller wrote:
    >>
    >> You can create them easily enough with a FileWriter that writes to
    >> an
    >> OutputStreamWriter of the desired encoding.

    > ~
    > After checking the API I don't see what the difference would be
    > between a plain reader and a FileOutputStream. What is it?


    A Writer converts from characters (Unicode) to whatever encoding it
    was created with. An OutputStream just outputs bytes, with no
    conversion being done.
    Mike Schilling, Nov 23, 2009
    #5
  6. On Sun, 22 Nov 2009 19:02:36 -0800 (PST), Albretch Mueller
    wrote, quoted or indirectly quoted someone who said:

    >
    > That works just fine, yet I wonder if there are better/faster ways to
    >do that using channels/memory mapped files


    The thing I don't understand, is nio uses ordinary file i/o
    underneath. So how is it faster if you don't do something stupid with
    ordinary file i/o in a case where caching would not help?
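
    For reference, the channel/memory-mapped variant asked about at the
    top of the thread could be sketched roughly as follows. This is an
    illustration only and makes no claim about speed. The class name
    MappedReencode is made up, and the sketch assumes the whole file fits
    in memory and in a single mapping (under 2 GB).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class MappedReencode {
    // Maps the input file, decodes it in one go, and writes the re-encoded bytes.
    static void reencode(Path in, Charset from, Path out, Charset to) throws IOException {
        try (FileChannel src = FileChannel.open(in, StandardOpenOption.READ);
             FileChannel dst = FileChannel.open(out,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.TRUNCATE_EXISTING,
                     StandardOpenOption.WRITE)) {
            ByteBuffer mapped = src.map(FileChannel.MapMode.READ_ONLY, 0, src.size());
            // Charset.decode/encode replace malformed or unmappable input
            // instead of throwing.
            CharBuffer chars = from.decode(mapped);
            ByteBuffer encoded = to.encode(chars);
            while (encoded.hasRemaining()) {
                dst.write(encoded); // a single write may be partial
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: KOI8-R input file, args[1]: UTF-8 output file
        reencode(Paths.get(args[0]), Charset.forName("KOI8-R"),
                 Paths.get(args[1]), StandardCharsets.UTF_8);
    }
}
```

    Unlike the Reader/Writer loop, this version holds the entire decoded
    text in memory at once, so it trades streaming for simplicity.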
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    Finding a bug is a sign you were asleep at the switch when coding. Stop debugging, and go back over your code line by line.
    Roedy Green, Nov 23, 2009
    #6
  7. > I'll assume you either meant a "plain writer" or a 'FileInputStream'
    ~
    ;-)
    ~
    > 'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.

    ~
    but once you write to a file as I am doing it all becomes a stream of
    bytes anyway, till you eventually reopen the file using a Reader and
    specifying the charset to interpret chunks of bytes as they are being
    read into an array of chars, and as specified by the API:
    ~
    http://java.sun.com/javase/6/docs/api/java/lang/Character.html
    ~
    "The Java 2 platform uses the UTF-16 representation in char arrays
    and in the String and StringBuffer classes."
    ~
    So I think there is nothing fancy to be done when converting streams
    from one charset to another, as long as your OS/Java supports both
    encodings; it is by nature a serial process.
    ~
    Thank you
    lbrtchx
    Albretch Mueller, Nov 23, 2009
    #7
  8. Lew wrote:
    >> 'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.


    Albretch Mueller wrote:
    > but once you write to a file as I am doing it all becomes a stream of
    > bytes anyway, till you eventually reopen the file using a Reader and
    > specifying the charset to interpret chunks of bytes as they are being
    > read into an array of chars, and as specified by the API:


    The exact bytes written through a Writer depend on the encoding used. If you
    use a Reader with a different encoding, you'll get garbage.
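
    A small sketch of that effect: text encoded with one charset and
    decoded with another. The class and method names are hypothetical,
    and ISO-8859-1 is just an arbitrary "wrong" charset picked for the
    demonstration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharset {
    // Encodes 'text' with one charset and decodes the bytes with another.
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        byte[] bytes = text.getBytes(writeWith);
        return new String(bytes, readWith);
    }

    public static void main(String[] args) {
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442";
        Charset koi8 = Charset.forName("KOI8-R");
        // Same charset on both sides: the text survives.
        System.out.println(text.equals(roundTrip(text, koi8, koi8)));
        // Different charset on the reading side: the chars no longer match.
        System.out.println(text.equals(roundTrip(text, koi8, StandardCharsets.ISO_8859_1)));
    }
}
```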

    --
    Lew
    Lew, Nov 24, 2009
    #8
  9. On Nov 24, 1:45 am, Lew wrote:
    > Lew wrote:
    > >> 'Reader's and 'Writer's deal with encoded 'char's. Streams deal with raw bytes.
    >
    > Albretch Mueller wrote:
    > > but once you write to a file as I am doing it all becomes a stream of
    > > bytes anyway, till you eventually reopen the file using a Reader and
    > > specifying the charset to interpret chunks of bytes as they are being
    > > read into an array of chars, and as specified by the API:
    >
    > The exact bytes written through a Writer depend on the encoding used. If you
    > use a Reader with a different encoding, you'll get garbage.
    >
    > --
    > Lew


    OK, you have made me wonder what to do when you don't know the
    encoding of a file you got. As far as I know, this is not taken care
    of by Readers, even though some heuristics may be used.

    So, what do you do in those situations?

    Thank you
    lbrtchx
    Albretch Mueller, Nov 25, 2009
    #9
  10. Albretch Mueller wrote:
    >> Lew wrote:
    >> The exact bytes written through a Writer depend on the encoding used. If you
    >> use a Reader with a different encoding, you'll get garbage.
    >>
    >> --
    >> Lew


    Don't quote sigs.

    > OK, you have made me wonder what to do when you don't know the
    > encoding of a file you got. As far as I know, this is not taken care
    > of by Readers, even though some heuristics may be used
    >
    > So, what do you do in those situations?


    The editor in Rational Software Architect, an IDE built on Eclipse, simply
    reports that the file is not in the specified encoding. I haven't looked at
    its source, but I guess it notices illegal code points. Other editors just
    display the wrong thing.
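
    How that editor does it is not known here, but the JDK itself can
    report bytes that are not legal in a given encoding. A minimal sketch,
    assuming a strict CharsetDecoder is acceptable for the check (the
    class and method names are made up):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Returns true if 'data' decodes under 'charset' without any malformed
    // or unmappable byte sequences.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] koi8Bytes =
            "\u041f\u0440\u0438\u0432\u0435\u0442".getBytes(Charset.forName("KOI8-R"));
        System.out.println("valid as UTF-8?  " + decodesCleanly(koi8Bytes, StandardCharsets.UTF_8));
        System.out.println("valid as KOI8-R? " + decodesCleanly(koi8Bytes, Charset.forName("KOI8-R")));
    }
}
```

    A check like this can rule an encoding out, but it cannot confirm one:
    single-byte charsets accept nearly any byte sequence, so a clean
    decode does not prove the guess was right.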

    --
    Lew
    Don't quote sigs.
    Lew, Nov 25, 2009
    #10
  11. Albretch Mueller wrote:

    >
    > OK, you have made me wonder what to do when you don't know the
    > encoding of a file you got. As far as I know, this is not taken care
    > of by Readers, even though some heuristics may be used


    Readers assume that what you tell them is true. (If you don't create
    a Reader with an explicit charset, it uses the platform's default.)
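
    A quick way to see which default a Reader picks up on a given
    machine, as a sketch (the class and method names are hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Name of the encoding an InputStreamReader uses when none is specified.
    // getEncoding() may return a historical alias such as "UTF8".
    static String readerEncoding() {
        InputStreamReader reader =
            new InputStreamReader(new ByteArrayInputStream(new byte[0]));
        return reader.getEncoding();
    }

    public static void main(String[] args) {
        System.out.println("platform default: " + Charset.defaultCharset());
        System.out.println("reader encoding:  " + readerEncoding());
    }
}
```

    Because the default differs between machines, passing the charset
    explicitly is the safer habit when the file's encoding is known.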
    Mike Schilling, Nov 25, 2009
    #11
  12. On Wed, 25 Nov 2009 00:18:17 -0800 (PST), Albretch Mueller
    wrote, quoted or indirectly quoted someone who said:

    > OK, you have made me wonder what to do when you don't know the
    > encoding of a file you got. As far as I know, this is not taken care
    > of by Readers, even though some heuristics may be used


    see http://mindprod.com/applet/encodingrecogniser.html

    http://mindprod.com/project/encodingidentification.html
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    I mean the word proof not in the sense of the lawyers, who set two half proofs equal to a whole one, but in the sense of a mathematician, where half proof = 0, and it is demanded for proof that every doubt becomes impossible.
    ~ Carl Friedrich Gauss
    Roedy Green, Nov 25, 2009
    #12
  13. Albretch Mueller wrote:
    > OK, you have made me wonder what to do when you don't know the
    > encoding of a file you got. As far as I know, this is not taken care
    > of by Readers, even though some heuristics may be used
    >
    > So, what do you do in those situations?


    Ask for a specification.

    The same sequence of bytes can be several different sequences of
    chars depending on encoding.

    A specification is necessary.

    Arne
    Arne Vajhøj, Nov 25, 2009
    #13
