trying to parse lines of files with non-ASCII chars

Discussion in 'Java' started by lbrtchx@hotmail.com, Dec 23, 2006.

  1. Guest

    I have some text data in a file I need to parse.
    ..
    the file's data contains characters such as accents, ntildes, ...
    ..
    if I go "cat file" I can see all characters fine in the source file,
    but after I parse the data and save it in another file using:
    ..
    // - - - - - - - - - - - - - - - - - - - - - - - - - -
    String aEnc = "UTF-8";
    // __
    FileOutputStream FOStrm = new FileOutputStream((new File(aOFlNm)));

    OutputStreamWriter OStrmRdr = new OutputStreamWriter(FOStrm, aEnc);

    BffrWrtr = new BufferedWriter(OStrmRdr);
    // __
    FileInputStream FIStrm = new FileInputStream(Fl);
    InputStreamReader IStrmRdr = new InputStreamReader(FIStrm, aEnc);
    BffrRdr = new BufferedReader(IStrmRdr);
    // __
    aRdLn = BffrRdr.readLine();
    while(aRdLn != null){
    // . . .
    aRdLn = BffrRdr.readLine();
    }
    // __
    BffrWrtr.flush(); BffrWrtr.close();
    BffrRdr.close();
    // - - - - - - - - - - - - - - - - - - - - - - - - - -
    ..
    I don't see the non-ASCII characters right in the file, but all kinds
    of weird chars
    ..
    How can I fix this problem?
    ..
    thanks
    lbrtchx
     
    , Dec 23, 2006
    #1

  2. hiwa Guest

    wrote:
    > I have some text data in a file I need to parse.
    > .
    > the file's data contains characters such as accents, ntildes, ...
    > .
    > if I go "cat file" I can see all characters fine in the source file,
    > but after I parse the data and save it in another file using:
    > .
    > // - - - - - - - - - - - - - - - - - - - - - - - - - -
    > String aEnc = "UTF-8";
    > // __
    > FileOutputStream FOStrm = new FileOutputStream((new File(aOFlNm)));
    >
    > OutputStreamWriter OStrmRdr = new OutputStreamWriter(FOStrm, aEnc);
    >
    > BffrWrtr = new BufferedWriter(OStrmRdr);
    > // __
    > FileInputStream FIStrm = new FileInputStream(Fl);
    > InputStreamReader IStrmRdr = new InputStreamReader(FIStrm, aEnc);
    > BffrRdr = new BufferedReader(IStrmRdr);
    > // __
    > aRdLn = BffrRdr.readLine();
    > while(aRdLn != null){
    > // . . .
    > aRdLn = BffrRdr.readLine();
    > }
    > // __
    > BffrWrtr.flush(); BffrWrtr.close();
    > BffrRdr.close();
    > // - - - - - - - - - - - - - - - - - - - - - - - - - -
    > .
    > I don't see the non-ASCII characters right in the file, but all kinds
    > of weird chars
    > .
    > How can I fix this problem?
    > .
    > thanks
    > lbrtchx


    String aEnc = "UTF-8"; // !! use "UTF8" for java.io classes

    FileOutputStream FOStrm = new FileOutputStream((new File(aOFlNm)));
    OutputStreamWriter OStrmRdr = new OutputStreamWriter(FOStrm, aEnc);
    BffrWrtr = new BufferedWriter(OStrmRdr);

    FileInputStream FIStrm = new FileInputStream(Fl);
    // !! your input file may not be UTF-8, actually ...
    InputStreamReader IStrmRdr = new InputStreamReader(FIStrm, aEnc);
    BffrRdr = new BufferedReader(IStrmRdr);

    aRdLn = BffrRdr.readLine();
    while(aRdLn != null){
    aRdLn = BffrRdr.readLine(); // !! aRdLn is/are discarded ...
    }
    BffrWrtr.flush(); BffrWrtr.close();
    BffrRdr.close();
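To make the two remarks above concrete (the input may not be UTF-8, and each line read has to be written out again), here is a minimal sketch. It assumes Java 7 or later, unlike the 1.4.2 JVM used in this thread, and the class and method names (`RecodeLines`, `copyLines`) are made up for illustration. The input charset is a parameter because only the person who owns the file can say what it really is.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not taken from the original poster's code.
class RecodeLines {

    // Decodes 'in' with inCs and re-encodes every line into 'out' with outCs.
    static void copyLines(Path in, Charset inCs, Path out, Charset outCs)
            throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(Files.newInputStream(in), inCs));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(Files.newOutputStream(out), outCs))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Any parsing would happen here; the line must still be written.
                writer.write(line);
                writer.newLine(); // platform line separator
            }
        } // both streams are flushed and closed here
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: java RecodeLines <input> <output> <input-charset-name>
        copyLines(Paths.get(args[0]), Charset.forName(args[2]),
                  Paths.get(args[1]), StandardCharsets.UTF_8);
    }
}
```

If the output looks right when the input charset is given as, say, ISO-8859-1 but wrong with UTF-8, the source file was probably not UTF-8 in the first place.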
     
    hiwa, Dec 23, 2006
    #2

  3. Guest

    > // !! use "UTF8" for java.io classes
    : Well, actually I had tried both "UTF8" and "UTF-8", and Java appears
    to treat both as the same
    ..
    > // !! your input file may not be UTF-8, actually ...

    : This is the very first thing I checked using KDE's kate
    ..
    > // !! aRdLn is/are discarded ...

    : What do you mean? What I posted was some extract from my actual code
    ..
    the problem I am having might be related to the BOM ("byte order
    mark") under Linux/Knoppix, but I am not sure about it
    ..
    I see there was a much-despised Sun bug that was closed as "Closed,
    will not be fixed"
    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
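Since Java's UTF-8 decoder hands a leading BOM through to the caller as the character U+FEFF instead of dropping it, one common workaround is to skip it by hand. The following is only a sketch under that assumption; it targets Java 7 or later, and `BomSkippingReader` and `openUtf8` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: opens a UTF-8 reader and drops a leading BOM if present.
class BomSkippingReader {

    // The BOM decodes to this single character when the stream is read as UTF-8.
    static final char BOM = '\uFEFF';

    static BufferedReader openUtf8(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);            // remember the start of the stream
        int first = reader.read(); // peek at the first decoded character
        if (first != -1 && first != BOM) {
            reader.reset();        // not a BOM: push the character back
        }
        return reader;
    }
}
```

Without a step like this, the first line returned by `readLine()` starts with an invisible U+FEFF, which tends to break comparisons and parsing of that line only.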
    ..
    // __ I am using the following JVM
    sh-3.1# java -version
    java version "1.4.2_11"
    Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_11-b06)
    Java HotSpot(TM) Client VM (build 1.4.2_11-b06, mixed mode)
    ..
    // __ Default encoding is "ANSI_X3.4-1968"
    String aDefEnc = System.getProperty("file.encoding");
    System.out.println("// __ aDefEnc=" + aDefEnc);
    // __ aDefEnc=ANSI_X3.4-1968
    ..
    // __ if I use -Dfile.encoding=UTF-8 as JVM parameter
    String aDefEnc = System.getProperty("file.encoding");
    System.out.println("// __ aDefEnc=" + aDefEnc);
    // __ aDefEnc=UTF-8
    // __ OStrmRdr.getEncoding()=UTF8
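For what it's worth, ANSI_X3.4-1968 is another name for US-ASCII, so anything that falls back to the default encoding on that JVM cannot represent an accented character at all. A small sketch of the difference, assuming Java 7 or later; `DefaultCharsetDemo` and `hex` are names invented for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the same character encoded with two different charsets.
class DefaultCharsetDemo {

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00f1"; // n with tilde
        // US-ASCII has no mapping for it, so getBytes substitutes '?'.
        System.out.println(hex(s.getBytes(StandardCharsets.US_ASCII))); // 3F
        // UTF-8 encodes it as two bytes.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // C3 B1
    }
}
```

Readers and writers that are given an explicit encoding, as in the posted code, should not be affected by `file.encoding`; code paths that rely on the default (for example printing to the console) are.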
    ..
    // __ if I use aEnc="UTF-8";
    sh-3.1# java k_killed08Test
    // __ OStrmRdr.getEncoding()=UTF8
    ..
    // __ if I use aEnc="UTF8";
    sh-3.1# java k_killed08Test
    // __ OStrmRdr.getEncoding()=UTF8
    ..
    // __ if I use some nonsense string aEnc="8FTU";
    java.io.UnsupportedEncodingException: 8FTU
    at sun.io.Converters.getConverterClass(Converters.java:215)
    at sun.io.Converters.newConverter(Converters.java:248)
    at sun.io.CharToByteConverter.getConverter(CharToByteConverter.java:64)
    at sun.nio.cs.StreamEncoder$ConverterSE.<init>(StreamEncoder.java:189)
    at sun.nio.cs.StreamEncoder$ConverterSE.<init>(StreamEncoder.java:172)
    at sun.nio.cs.StreamEncoder.forOutputStreamWriter(StreamEncoder.java:72)
    at java.io.OutputStreamWriter.<init>(OutputStreamWriter.java:82)
    at k_killed08Test.parse(k_killed08Test.java:54)
    at k_killed08Test.main(k_killed08Test.java:26)
    ..
    lbrtchx
     
    , Dec 23, 2006
    #3
  4. hiwa Guest

    Does your document really have BOM?
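One way to answer that without trusting an editor is to look at the first three bytes of the file: a UTF-8 BOM is the byte sequence EF BB BF. A minimal sketch, assuming Java 7 or later; `BomCheck` and `hasUtf8Bom` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reports whether a file starts with the UTF-8 BOM.
class BomCheck {

    static boolean hasUtf8Bom(Path file) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // file is shorter than three bytes
                }
                n += r;
            }
        }
        return n == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: java BomCheck <file>
        System.out.println(hasUtf8Bom(Paths.get(args[0])));
    }
}
```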
     
    hiwa, Dec 24, 2006
    #4