TextStreamReader with transparent unicode BOM Support

Discussion in 'Java' started by X_AWemner_X, Jul 2, 2003.

  1. X_AWemner_X

    X_AWemner_X Guest

    Ok, here is a teaser for all Java I/O coders: make us all happy and create a
    FilterReader with proper Unicode BOM support.

    As you know, _we_ have to tell InputStreamReader which Unicode charset to use
    for read operations (UTF-8, UTF-16, ...). With the "UTF-16" charset name the
    reader does recognize the BOM and skips those first bytes, but we still have
    to ask for UTF-16 explicitly. With UTF-8 files it fails: the BOM is not
    skipped.

    Win2k Notepad stores a BOM at the start of UTF-8 files, and currently ISR
    cannot read such files properly.

    http://www.unicode.org/unicode/faq/utf_bom.html#22

    Now, does anyone have a stream reader that supports BOMs fully transparently,
    something like this?

    String defaultEnc = "UTF-8"; // Java itself falls back to the platform default charset
    Reader in = new BestUnicodeTextStreamReader(
        new FileInputStream("myfile.txt"), defaultEnc);

    This class would recognize every BOM automatically and use the matching
    encoding. If no BOM is found, it would use the given defaultEnc value.

    I am sure we n00b coders would love to use such a reader implementation.
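
The behaviour described above can be reproduced with a few in-memory bytes. This is only a minimal sketch; the class and method names (BomBehaviourDemo, decode) are made up for illustration and are not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

class BomBehaviourDemo {

    /** Decodes all bytes through an InputStreamReader using the named charset. */
    static String decode(byte[] data, String charsetName) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), charsetName)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "hi" as Notepad would save it: UTF-8 with a leading EF BB BF.
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // "hi" as UTF-16 little-endian with a leading FF FE.
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};

        String fromUtf8 = decode(utf8, "UTF-8");
        String fromUtf16 = decode(utf16le, "UTF-16");

        // The UTF-8 decoder hands the BOM through as the character U+FEFF.
        System.out.println(fromUtf8.length() + " " + (fromUtf8.charAt(0) == '\uFEFF'));
        // The "UTF-16" decoder consumes the BOM and uses it to pick the byte order.
        System.out.println(fromUtf16);
    }
}
```

Run as is, the first line printed is "3 true" (the BOM arrives as an extra character) and the second is "hi".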
     
    X_AWemner_X, Jul 2, 2003
    #1

  2. "X_AWemner_X" writes:
    > Ok, here is a teaser for all java io coders. Make us all happy and create a
    > filterreader with proper unicode bom support.

    [...]
    >
    > I am sure we n00b coders would love to use such reader implementation.
    >


    Doesn't your company (that's ZenPark, isn't it?) have its own software
    development department that can do this kind of work for you? Please tell
    me where I should send the bill for the following rough and inefficient
    sketch. :)

    I have left out all exception handling and minor details:

    class UnicodeReader extends Reader { // Reader is an abstract class, not an interface
        PushbackInputStream internalIn;
        InputStreamReader internalOut = null;
        String defaultEnc;

        private static final int BOM_SIZE = 3; // enough for UTF-8 and UTF-16

        UnicodeReader(InputStream in, String defaultEnc) {
            internalIn = new PushbackInputStream(in, BOM_SIZE);
            this.defaultEnc = defaultEnc;
        }

        protected void init() {
            if (internalOut != null) {
                return;
            }

            // Read up to BOM_SIZE bytes; a single read() may return fewer.
            byte bom[] = new byte[BOM_SIZE];
            int n;
            int pos = 0;
            while (pos < BOM_SIZE &&
                   (n = internalIn.read(bom, pos, BOM_SIZE - pos)) != -1)
            {
                pos += n;
            }
            // Note: this pushes the BOM itself back as well. For UTF-8 the
            // decoder does not consume it, so the BOM bytes should be skipped.
            internalIn.unread(bom, 0, pos);
            String encoding = ... // evaluate the content of bom[] here
                                  // revert to defaultEnc if nothing found
            internalOut = new InputStreamReader(internalIn, encoding);
        }

        //
        // For each abstract method of class Reader (read(char[], int, int)
        // and close()), implement the method as:
        //
        // method(...) {
        //     init();
        //     return internalOut.method(...);
        // }
        //
    }
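
One way the "evaluate the content of bom[]" step could be written is as a small lookup table. The following is a sketch under assumptions, not the sketch author's code: the names BomTable, match, encodingOf and bomLength are hypothetical, and it expects a 4-byte read-ahead buffer so that the UTF-32 marks fit (the BOM_SIZE of 3 above only covers UTF-8 and UTF-16).

```java
class BomTable {

    // Longest marks first: FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE),
    // so the 4-byte marks have to be tried before the 2-byte ones.
    private static final String[] NAMES = {
        "UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"
    };
    private static final int[][] MARKS = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF},
        {0xFE, 0xFF},
        {0xFF, 0xFE},
    };

    /** Returns the table index of the BOM found in head[0..n), or -1 if none. */
    static int match(byte[] head, int n) {
        for (int i = 0; i < MARKS.length; i++) {
            int[] mark = MARKS[i];
            if (n < mark.length) {
                continue; // not enough bytes were read for this mark
            }
            boolean same = true;
            for (int j = 0; j < mark.length; j++) {
                if ((head[j] & 0xFF) != mark[j]) { // compare as unsigned bytes
                    same = false;
                    break;
                }
            }
            if (same) {
                return i;
            }
        }
        return -1;
    }

    /** Charset name implied by the BOM, or defaultEnc when there is no BOM. */
    static String encodingOf(byte[] head, int n, String defaultEnc) {
        int i = match(head, n);
        return i < 0 ? defaultEnc : NAMES[i];
    }

    /** Number of leading bytes that belong to the BOM (0 when there is none). */
    static int bomLength(byte[] head, int n) {
        int i = match(head, n);
        return i < 0 ? 0 : MARKS[i].length;
    }
}
```

A caller would read up to four bytes, ask encodingOf for the charset, and push back everything after bomLength bytes. One caveat of this ordering: a UTF-16LE file whose first character is U+0000 would be taken for UTF-32LE.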


    /Thomas
     
    Thomas Weidenfeller, Jul 2, 2003
    #2

  3. Roedy Green

    Roedy Green Guest

    On Wed, 2 Jul 2003 12:27:56 +0300, "X_AWemner_X" wrote
    or quoted:

    >Win2k Notepad stores BOM mark at the start of UTF-8 files, and currently ISR
    >cannot read it properly.


    see http://mindprod.com/jgloss/encoding.html

    What happens if you use the UTF-8 or UTF-16 encoding with the code
    suggested by the File IO Amanuensis at
    http://mindprod.com/fileio.html?

    Java is not smart enough to flip between 8 and 16 bits automatically, but it
    is smart enough to deal with the UTF-16 endian markers, both BE and LE.

    Ideally this should be implemented as yet another encoding:
    Unicode-8-16. Does anyone know how you insert your own encoding into
    the official list? You can't pass any parameters to the encoding such
    as your preferred default big/little endian, so you must create
    variant names for all the combinations.


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jul 2, 2003
    #3
  4. Thx for the good tip, I was not aware of the PushbackInputStream class. It
    made everything really simple to do. Here is an implementation of what you
    suggested.


    /**
    Original pseudocode : Thomas Weidenfeller
    Implementation tweaked: Aki Nieminen

    http://www.unicode.org/unicode/faq/utf_bom.html
    BOMs:
    00 00 FE FF = UTF-32, big-endian
    FF FE 00 00 = UTF-32, little-endian
    FE FF = UTF-16, big-endian
    FF FE = UTF-16, little-endian
    EF BB BF = UTF-8

    Win2k Notepad:
    Unicode format = UTF-16LE
    ***/

    import java.io.*;

    /**
     * Generic Unicode text reader that uses the BOM, if present,
     * to identify the encoding to be used.
     */
    public class UnicodeReader extends Reader {
        PushbackInputStream internalIn;
        InputStreamReader internalIn2 = null;
        String defaultEnc;

        private static final int BOM_SIZE = 4;

        public UnicodeReader(InputStream in, String defaultEnc) {
            internalIn = new PushbackInputStream(in, BOM_SIZE);
            this.defaultEnc = defaultEnc;
        }

        public String getDefaultEncoding() {
            return defaultEnc;
        }

        /** Returns the encoding in use, or null before the first read. */
        public String getEncoding() {
            if (internalIn2 == null) return null;
            return internalIn2.getEncoding();
        }

        /**
         * Read ahead up to four bytes and check for a BOM. Extra bytes are
         * unread back to the stream; only the BOM bytes are skipped.
         */
        protected void init() throws IOException {
            if (internalIn2 != null) return;

            String encoding;
            byte[] bom = new byte[BOM_SIZE];
            int n = 0, r, unread;
            // A single read() may return fewer bytes than requested, so loop.
            while (n < bom.length
                    && (r = internalIn.read(bom, n, bom.length - n)) != -1) {
                n += r;
            }

            // Test UTF-32 before UTF-16: FF FE 00 00 also starts with FF FE.
            // The n >= ... guards keep the zero-filled tail of a short read
            // from being mistaken for BOM bytes.
            if ( n >= 4 && (bom[0] == (byte)0x00) && (bom[1] == (byte)0x00) &&
                    (bom[2] == (byte)0xFE) && (bom[3] == (byte)0xFF) ) {
                encoding = "UTF-32BE";
                unread = n - 4;
            } else if ( n >= 4 && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) &&
                    (bom[2] == (byte)0x00) && (bom[3] == (byte)0x00) ) {
                encoding = "UTF-32LE";
                unread = n - 4;
            } else if ( n >= 3 && (bom[0] == (byte)0xEF) && (bom[1] == (byte)0xBB) &&
                    (bom[2] == (byte)0xBF) ) {
                encoding = "UTF-8";
                unread = n - 3;
            } else if ( n >= 2 && (bom[0] == (byte)0xFE) && (bom[1] == (byte)0xFF) ) {
                encoding = "UTF-16BE";
                unread = n - 2;
            } else if ( n >= 2 && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) ) {
                encoding = "UTF-16LE";
                unread = n - 2;
            } else {
                // Unicode BOM not found, unread all bytes
                encoding = defaultEnc;
                unread = n;
            }

            // Push back everything that followed the BOM (or all bytes read).
            if (unread > 0) internalIn.unread(bom, (n - unread), unread);

            // Use the detected or given encoding; null means platform default
            if (encoding == null) {
                internalIn2 = new InputStreamReader(internalIn);
            } else {
                internalIn2 = new InputStreamReader(internalIn, encoding);
            }
        }

        public void close() throws IOException {
            // Do not trigger BOM detection just to close the stream.
            if (internalIn2 != null) internalIn2.close();
            else internalIn.close();
        }

        public int read(char[] cbuf, int off, int len) throws IOException {
            init();
            return internalIn2.read(cbuf, off, len);
        }

    }
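
The read-ahead and unread step is the part that is easiest to get wrong, so here it is in isolation. This is a reduced sketch that only handles the UTF-8 BOM; PushbackDemo, skipUtf8Bom and firstByte are hypothetical names used for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

class PushbackDemo {

    /** Returns a stream positioned after a UTF-8 BOM, or at the start if there is none. */
    static InputStream skipUtf8Bom(InputStream in) {
        try {
            // The pushback buffer must be able to hold everything we read ahead.
            PushbackInputStream pin = new PushbackInputStream(in, 3);
            byte[] head = new byte[3];
            int n = 0;
            int r;
            // read() may deliver fewer bytes than requested, so keep asking.
            while (n < head.length && (r = pin.read(head, n, head.length - n)) != -1) {
                n += r;
            }
            boolean bom = n == 3
                    && head[0] == (byte) 0xEF
                    && head[1] == (byte) 0xBB
                    && head[2] == (byte) 0xBF;
            if (!bom && n > 0) {
                pin.unread(head, 0, n); // not a BOM: give every byte back
            }
            return pin;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** First byte seen after BOM handling, or -1 for an empty stream. */
    static int firstByte(byte[] data) {
        try {
            return skipUtf8Bom(new ByteArrayInputStream(data)).read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println((char) firstByte(withBom));
        System.out.println((char) firstByte(withoutBom));
    }
}
```

Both calls in main print "h": with a BOM the three mark bytes are dropped, without one the bytes read ahead are pushed back untouched. A truncated mark such as EF BB alone is treated as ordinary data.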


    > I have left out all exception handling and minor details:
    >
    > class UnicodeReader implements Reader {
    > PushbackInputStream internalIn;
    > InputStreamReader internalOut = null;
    > String defaultEnc;


    <...clip clip...>
     
    NoName NoName, Jul 3, 2003
    #4
