How to identify File encoding in Java?

Discussion in 'Java' started by Perma, Apr 17, 2007.

  1. Perma

    Perma Guest

    Hi,
    I have a Java program which polls a directory for incoming files
    (zipped and text).
    When a new file arrives, I read it and post its outcome.

    Here I have some encoding problems. The text files are usually UTF-8,
    so I hard-code the encoding to UTF-8:

    Code extract:
    ....
    // trying to read the file "myFile"
    FileInputStream fi = new FileInputStream(myFile);
    // hardcoded UTF-8, how can I do this dynamically?
    InputStreamReader ir = new InputStreamReader(fi, "UTF8");
    ....

    I was expecting the zipped files to be UTF-8 as well, but it turned
    out not to be, so I get an:
    MalformedInputException at
    sun.io.ByteToCharUTF8.convert([BII[CII)I(ByteToCharUTF8

    So I have to handle the two separately and it troubles my code.

    I guess there's a smart way of doing this.
    Hope someone can give me some hint on this! :)

    Regards, Per Magnus
    Perma, Apr 17, 2007
    #1

  2. On 17 Apr 2007 09:25:47 -0700, Perma wrote:
    > I was expecting the zipped files to be UTF-8 as well, but it turned
    > out not to be, so I get an:


    > MalformedInputException at
    > sun.io.ByteToCharUTF8.convert([BII[CII)I(ByteToCharUTF8
    >
    > So I have to handle the two separately and it troubles my code.


    Character encoding is an attribute that describes how characters are
    represented as bytes in *text* files. Classes that are "readers" read
    from a stream and apply the specified encoding to obtain the text.
    This operation is only meaningful when applied to text files.

    Zipped files are binary, not text, so they don't contain characters
    and there is no character encoding. To read a zipped file, use an
    InputStream (to read raw bytes), or e.g. a ZipInputStream to get the
    unzipped contents.
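
    As a minimal sketch of the ZipInputStream route: the class name
    ZipEntryBytes and the method readEntries are made up for illustration,
    and the zip is built in memory only so the example runs without files
    on disk. It reads each entry as raw bytes and leaves any decoding to a
    later step.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of the original poster's program.
class ZipEntryBytes {

    /** Reads every file entry of a zip stream into memory as raw bytes, keyed by entry name. */
    static Map<String, byte[]> readEntries(InputStream zipped) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(zipped)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry, not of the whole zip
                while ((n = zin.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny zip in memory so the sketch is self-contained.
        ByteArrayOutputStream zipBytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(zipBytes)) {
            zout.putNextEntry(new ZipEntry("hello.txt"));
            zout.write("h\u00e9llo".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        }
        Map<String, byte[]> entries =
                readEntries(new ByteArrayInputStream(zipBytes.toByteArray()));
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().length + " bytes");
        }
    }
}
```

    Holding each entry in a byte array assumes the files are small enough
    to fit in memory; for large entries the bytes would have to be streamed
    instead.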

    /gordon

    --
    Gordon Beaton, Apr 17, 2007
    #2

  3. Kai Schwebke

    Kai Schwebke Guest

    Perma wrote:
    > I was expecting the zipped files to be UTF-8 as well, but it turned
    > out not to be, so I get an:
    > MalformedInputException at
    > sun.io.ByteToCharUTF8.convert([BII[CII)I(ByteToCharUTF8
    >
    > So I have to handle the two separately and it troubles my code.
    >
    > I guess there's a smart way of doing this.
    > Hope someone can give me some hint on this! :)


    There is no 100% solution if you have to guess the encoding.
    But you can exploit the fact that input which can be interpreted
    as UTF-8 without error is, in almost all cases, actually
    UTF-8 encoded.


    So just try to apply UTF-8, catch the exception or search for
    the invalid-input marker character (defaults to \uFFFD), and apply
    an alternate charset on error.
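
    The marker-character variant could look roughly like this. It is only
    a sketch: the class and method names are invented, and the fallback
    charset (ISO-8859-1 in the usage lines) is an assumption about what the
    non-UTF-8 files might be, not something stated in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the "look for U+FFFD" heuristic.
class ReplacementCharCheck {

    /**
     * Decodes the bytes as UTF-8; if the result contains the replacement
     * character U+FFFD, decodes them again with the fallback charset.
     */
    static String decode(byte[] data, Charset fallback) {
        // new String(bytes, charset) never throws: malformed input is
        // replaced with the charset's replacement string, U+FFFD for UTF-8.
        String asUtf8 = new String(data, StandardCharsets.UTF_8);
        if (asUtf8.indexOf('\uFFFD') < 0) {
            return asUtf8; // decoded cleanly, so very likely real UTF-8
        }
        return new String(data, fallback);
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Both calls should recover the same text.
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1));
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

    One known weakness of this heuristic: a valid UTF-8 file that really
    contains U+FFFD is treated as non-UTF-8.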



    Kai
    Kai Schwebke, Apr 17, 2007
    #3
  4. "Gordon Beaton" <> wrote in message
    news:4624f80e$0$24610$...
    >
    > Zipped files are binary, not text, so they don't contain characters
    > and there is no character encoding.


    That is, zip files are binary. Zipped files (the files that were processed
    to create the zip file) may be binary or text.
    Mike Schilling, Apr 17, 2007
    #4
  5. Oliver Wong

    Oliver Wong Guest

    "Perma" <> wrote in message
    news:...
    > Hi,
    > I have a Java program which polls a directory for incoming files
    > (zipped and text).
    > When a new file comes, I read it and post it's outcome.
    >
    > Here I have some encoding problems. The text files are usually UTF-8,
    > so I hard-code the encoding to UTF-8:
    >
    > Code extract:
    > ...
    > // trying to read the file "myFile"
    > FileInputStream fi = new FileInputStream(myFile);
    > InputStreamReader ir = new InputStreamReader(fi, "UTF8"); // hardcoded
    > UTF-8, how can I do this dynamically?
    > ...
    >
    > I was expecting the zipped files to be UTF-8 as well, but it turned
    > out not to be, so I get an:
    > MalformedInputException at
    > sun.io.ByteToCharUTF8.convert([BII[CII)I(ByteToCharUTF8
    >
    > So I have to handle the two separately and it troubles my code.
    >
    > I guess there's a smart way of doing this.
    > Hope someone can give me some hint on this! :)


    Maybe I'm misunderstanding something, but zip files are NOT text files
    encoded via the UTF-8 encoding. In fact, they're not text files at all,
    but binary files. Thus the question of "which encoding?" never has a
    chance to come up at all.

    - Oliver
    Oliver Wong, Apr 17, 2007
    #5
  6. Perma

    Perma Guest

    I am aware that a zip file is not a text file, but it contains a
    text file.
    When unzipping the text file, I need to figure out what encoding it
    is in.

    From the postings above, it sounds to me as if there is no method
    which can tell me whether it is "UTF-8" or not, so perhaps the best
    solution is to try to read the file as UTF-8, and handle non-UTF-8
    files by re-reading them in the exception handler.
    Example:

    InputStreamReader ir;
    try {
        // trying to read as UTF-8
        ir = new InputStreamReader(fi, "UTF8");
    } catch (Exception e) {
        // couldn't read file as UTF-8, therefore reading it with the
        // platform default (non-UTF-8) encoding
        ir = new InputStreamReader(fi);
    }
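
    One caveat with the snippet above: constructing the InputStreamReader
    does not decode anything, so a bad byte sequence can only show up later,
    while reading, and by then part of the stream has been consumed. As far
    as I know, current JDKs also replace malformed input with \uFFFD in
    InputStreamReader rather than throwing. A hedged sketch that avoids
    both issues reads the bytes first and decodes them with a strict
    CharsetDecoder; the class name StrictUtf8 and the ISO-8859-1 fallback
    in the usage lines are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 first, fallback charset on failure.
class StrictUtf8 {

    static String decode(byte[] data, Charset fallback) {
        // REPORT makes the decoder throw instead of silently replacing bad input.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: decode the same bytes again with the fallback.
            return new String(data, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1));
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

    Because the bytes are kept in memory, the fallback can decode them a
    second time without re-opening the file or the zip entry.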

    Thank you for responses!
    And please update if you have some alternative solutions to this.

    -Per Magnus

    Perma, Apr 24, 2007
    #6
  7. Perma wrote:
    > I am aware of that a zip file is not a text file, but it contains a
    > text file. When unzipping the text file, I need to figure out what encoding it
    > is.
    >

    Fire up your ZIP utility and take a good look at the file headers it
    lists from the zip archive. What you see there is about all it knows
    about compressed files. I don't think it knows or cares what's in the
    file apart from what's implied by the filename extension.

    I think you'll be better off extracting the file as a bytestream
    because I'm 99% certain that's what the zip archiver thought it had
    compressed.

    Don't forget that ASCII text, Unicode wordprocessor files,
    JPEG images and binary executables are all the same to the ZIP archiver.


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
    Martin Gregorie, Apr 24, 2007
    #7
  8. Perma

    Joined:
    Aug 3, 2010
    Messages:
    1
    Use the code below to read a pipe-delimited (|) text file and print each token:


    // needs: import java.io.*; import java.util.StringTokenizer;
    BufferedReader bufRdr = null;
    try {
        bufRdr = new BufferedReader(new FileReader(file));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    String line = null;
    // rowcount was not declared in the original post; it is never
    // incremented here, so the condition below does not limit the rows read
    int rowcount = 0;

    // read each line of the text file
    try {
        while (bufRdr != null && (line = bufRdr.readLine()) != null && rowcount < 1) {
            StringTokenizer st = new StringTokenizer(line, "|");
            while (st.hasMoreTokens()) {
                System.out.println("NextToken " + st.nextToken());
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    , Aug 4, 2010
    #8