TextStreamReader with transparent unicode BOM Support

X_AWemner_X · Jul 2, 2003

Ok, here is a teaser for all java io coders. Make us all happy and create a
filterreader with proper unicode bom support.

As you know, _we_ have to tell InputStreamReader which Unicode charset to use
for read operations (UTF-8, UTF-16, ...). The reader does support the BOM for
the UTF-16 keyword and skips the first bytes, but we still must tell it to use
UTF-16, and it fails with UTF-8 files.

Win2k Notepad stores BOM mark at the start of UTF-8 files, and currently ISR
cannot read it properly.

http://www.unicode.org/unicode/faq/utf_bom.html#22
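
To make the symptom concrete, here is a minimal sketch, assuming a Notepad-style file whose content is "hi" preceded by the UTF-8 BOM. The class name BomLeakDemo and the helper firstChar are made up for illustration. The BOM is not skipped by the UTF-8 decoder; it arrives as the first character, U+FEFF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the thread's code.
class BomLeakDemo {
    /** Decodes the bytes as UTF-8 and returns the first char, or -1 on empty input. */
    static int firstChar(byte[] data) {
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "hi" preceded by the UTF-8 BOM (EF BB BF)
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // The decoder hands the BOM through as an ordinary character
        System.out.printf("first char: U+%04X%n", firstChar(withBom));
    }
}
```

Any code that compares the first line of such a file against an expected string will fail unless it strips that character itself.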

Now, do you have a stream reader which supports BOMs fully transparently,
something like this?

String defaultEnc = "UTF-8"; // fallback; Java otherwise uses the platform default charset
Reader in = new BestUnicodeTextStreamReader(new FileInputStream("myfile.txt"), defaultEnc);
-> this class would recognize all BOMs automatically and use them. If no
BOM were found, it would use the given defaultEnc value.

I am sure we n00b coders would love to use such a reader implementation.

Thomas Weidenfeller · Jul 2, 2003

X_AWemner_X said:
Ok, here is a teaser for all java io coders. Make us all happy and create a
filterreader with proper unicode bom support. [...]

I am sure we n00b coders would love to use such a reader implementation.

Doesn't your company (that's ZenPark, isn't it?) have its own software
development department that can do such work? Please tell me where I
should send the bill for the following rough and inefficient sketch.

I have left out all exception handling and minor details:

class UnicodeReader extends Reader {
    PushbackInputStream internalIn;
    InputStreamReader internalOut = null;
    String defaultEnc;

    private static final int BOM_SIZE = 3; // enough for UTF-8 and UTF-16

    UnicodeReader(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    protected void init() {
        if (internalOut != null) {
            return;
        }

        byte bom[] = new byte[BOM_SIZE];
        int n;
        int pos = 0;
        while (pos < BOM_SIZE &&
               (n = internalIn.read(bom, pos, BOM_SIZE - pos)) != -1)
        {
            pos += n;
        }
        internalIn.unread(bom, 0, pos);
        String encoding = ... // evaluate the content of bom[] here
                              // revert to defaultEnc if nothing found
        internalOut = new InputStreamReader(internalIn, encoding);
    }

    //
    // For all abstract methods of class Reader, implement each method as:
    //
    // method(...) {
    //     init();
    //     internalOut.method(...);
    // }
    //
}
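
The read-ahead-and-unread step of the sketch can be tried in isolation. The following is a minimal example, assuming an in-memory stream; the class PushbackPeekDemo and its methods are hypothetical names. The pushback buffer passed to the constructor has to be at least as large as the number of bytes you intend to unread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of the thread's code.
class PushbackPeekDemo {
    /** Reads up to count bytes into buf, pushes them back, and returns how many were seen. */
    static int peek(PushbackInputStream in, byte[] buf, int count) throws IOException {
        int pos = 0;
        int n;
        // read() may return fewer bytes than asked for, so loop until full or EOF
        while (pos < count && (n = in.read(buf, pos, count - pos)) != -1) {
            pos += n;
        }
        in.unread(buf, 0, pos); // the stream is back where it started
        return pos;
    }

    /** Peeks at three bytes, then shows that the next read still starts at the beginning. */
    static String demo() {
        byte[] data = {'a', 'b', 'c', 'd', 'e'};
        try (PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), 3)) {
            byte[] head = new byte[3];
            int seen = peek(in, head, 3);
            return seen + ":" + (char) in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "3:a"
    }
}
```

To skip a detected BOM, unread only the bytes that follow it instead of everything that was read.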

/Thomas

Roedy Green · Jul 2, 2003

Win2k Notepad stores BOM mark at the start of UTF-8 files, and currently ISR
cannot read it properly.

see http://mindprod.com/jgloss/encoding.html

What happens if you use UTF-8 or UTF-16 encoding on the code
suggested by the File IO amanuensis at
http://mindprod.com/fileio.html?

Java is not smart enough to flip between 8-16 automatically, but it is
smart enough to deal with endian markers, both BE and LE.
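
A small sketch of that endian handling, assuming the generic "UTF-16" charset is requested (the class name Utf16BomDemo is made up): the decoder reads the BOM, picks the byte order from it, and does not return the BOM as a character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the thread's code.
class Utf16BomDemo {
    /** Decodes with the generic UTF-16 charset, which honours a leading BOM. */
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] bigEndian    = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41}; // BOM + 'A'
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // BOM + 'A'
        System.out.println(decode(bigEndian));    // prints "A"
        System.out.println(decode(littleEndian)); // prints "A"
    }
}
```

Without a BOM, the generic UTF-16 decoder assumes big-endian input.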

Ideally this should be implemented as yet another encoding:
Unicode-8-16. Does anyone know how you insert your own encoding into
the official list? You can't pass any parameters to the encoding such
as your preferred default big/little endian, so you must create
variant names for all the combinations.

NoName NoName · Jul 3, 2003

Thx for the good tip, I was not aware of the PushbackInputStream class. It
made everything really simple to do. Here is an implementation of what you
suggested.

/**
Original pseudocode : Thomas Weidenfeller
Implementation tweaked: Aki Nieminen

http://www.unicode.org/unicode/faq/utf_bom.html
BOMs:
00 00 FE FF = UTF-32, big-endian
FF FE 00 00 = UTF-32, little-endian
FE FF = UTF-16, big-endian
FF FE = UTF-16, little-endian
EF BB BF = UTF-8

Win2k Notepad:
Unicode format = UTF-16LE
***/

import java.io.*;

/**
 * Generic Unicode text reader, which uses the BOM
 * to identify the encoding to be used.
 */
public class UnicodeReader extends Reader {
    private static final int BOM_SIZE = 4;

    private final PushbackInputStream internalIn;
    private final String defaultEnc;
    private InputStreamReader internalIn2 = null;

    public UnicodeReader(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    public String getDefaultEncoding() {
        return defaultEnc;
    }

    /** Returns the encoding in use, or null before the first read. */
    public String getEncoding() {
        if (internalIn2 == null) return null;
        return internalIn2.getEncoding();
    }

    /**
     * Read ahead four bytes and check for a BOM. Extra bytes are
     * unread back to the stream; only the BOM bytes are skipped.
     */
    protected void init() throws IOException {
        if (internalIn2 != null) return;

        String encoding;
        byte[] bom = new byte[BOM_SIZE];
        int n = 0, r, unread;
        // A single read() may deliver fewer bytes than requested, so loop
        // until the buffer is full or the stream ends.
        while (n < bom.length
                && (r = internalIn.read(bom, n, bom.length - n)) != -1) {
            n += r;
        }

        // Test the UTF-32 BOMs before the UTF-16 ones: FF FE 00 00 (UTF-32LE)
        // starts with FF FE (UTF-16LE) and would otherwise never be reached.
        if ( (n >= 4) && (bom[0] == (byte)0x00) && (bom[1] == (byte)0x00) &&
             (bom[2] == (byte)0xFE) && (bom[3] == (byte)0xFF) ) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ( (n >= 4) && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) &&
                    (bom[2] == (byte)0x00) && (bom[3] == (byte)0x00) ) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ( (n >= 3) && (bom[0] == (byte)0xEF) && (bom[1] == (byte)0xBB) &&
                    (bom[2] == (byte)0xBF) ) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ( (n >= 2) && (bom[0] == (byte)0xFE) && (bom[1] == (byte)0xFF) ) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ( (n >= 2) && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) ) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // Unicode BOM not found, unread all bytes
            encoding = defaultEnc;
            unread = n;
        }

        // Push back everything that followed the BOM
        if (unread > 0) internalIn.unread(bom, (n - unread), unread);

        // Use the detected encoding, or the platform default if none is given
        if (encoding == null) {
            internalIn2 = new InputStreamReader(internalIn);
        } else {
            internalIn2 = new InputStreamReader(internalIn, encoding);
        }
    }

    public void close() throws IOException {
        init();
        internalIn2.close();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        init();
        return internalIn2.read(cbuf, off, len);
    }

}
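
One detail of the BOM list is easy to get wrong: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so the longer signatures have to be tested first. Below is a standalone, table-driven sketch of the detection step that can be checked without any stream. BomSniffer and detect are hypothetical names, not part of the class above.

```java
// Hypothetical helper: maps the first bytes of a file to a charset name.
class BomSniffer {
    // Longest signatures first, so FF FE 00 00 wins over FF FE.
    private static final String[] NAMES = {
        "UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"
    };
    private static final int[][] BOMS = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF},
        {0xFE, 0xFF},
        {0xFF, 0xFE},
    };

    /**
     * Returns the charset name implied by a BOM within the first len
     * bytes of head, or null if no BOM is present.
     */
    static String detect(byte[] head, int len) {
        for (int i = 0; i < BOMS.length; i++) {
            if (startsWith(head, len, BOMS[i])) {
                return NAMES[i];
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] head, int len, int[] bom) {
        if (len < bom.length) {
            return false; // not enough bytes were read for this signature
        }
        for (int j = 0; j < bom.length; j++) {
            if ((head[j] & 0xFF) != bom[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf32le = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00};
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(detect(utf32le, 4)); // prints "UTF-32LE"
        System.out.println(detect(utf16le, 4)); // prints "UTF-16LE"
    }
}
```

The ambiguity cannot be removed completely: a UTF-16LE file whose first character is U+0000 has the same four leading bytes as a UTF-32LE BOM, and this sketch would report it as UTF-32LE.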

