TextStreamReader with transparent unicode BOM Support


X_AWemner_X

OK, here is a teaser for all Java I/O coders. Make us all happy and create a
FilterReader with proper Unicode BOM support.

As you know, _we_ have to tell InputStreamReader which Unicode charset to use
for read operations (UTF-8, UTF-16, ...). The reader does recognize the BOM
when given the UTF-16 charset name and skips those first bytes, but we still
have to tell it to use UTF-16. With UTF-8 files the BOM is not handled at all.

Win2k Notepad stores a BOM at the start of UTF-8 files, and currently
InputStreamReader (ISR) cannot read it properly.

http://www.unicode.org/unicode/faq/utf_bom.html#22
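
To make the complaint concrete, here is a minimal sketch of what happens when a BOM-prefixed UTF-8 byte sequence is read through a plain InputStreamReader. The class name Utf8BomDemo and the sample bytes are made up for illustration; only standard java.io and java.nio.charset classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8BomDemo {
    /** Decodes the bytes as UTF-8 and returns the first char read, or -1 if empty. */
    static int firstChar(byte[] data) {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            return r.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 BOM, followed by the text "hi"
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // The BOM is not skipped: it shows up as the character U+FEFF
        System.out.printf("U+%04X%n", firstChar(withBom)); // prints "U+FEFF"
    }
}
```

The decoder itself works; the problem is that the BOM arrives as an ordinary first character that every caller then has to strip by hand.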

Now, does anyone have a stream reader that supports BOMs fully transparently,
something like this?

String defaultEnc = "UTF-8"; // fallback used when no BOM is found
Reader in = new BestUnicodeTextStreamReader(
        new FileInputStream("myfile.txt"), defaultEnc);

This class would recognize any BOM automatically and use the matching
encoding. If no BOM is found, it would use the given defaultEnc value.

I am sure we n00b coders would love to use such a reader implementation.
 

Thomas Weidenfeller

X_AWemner_X said:
Ok, here is a teaser for all java io coders. Make us all happy and create a
filterreader with proper unicode bom support. [...]

I am sure we n00b coders would love to use such reader implementation.

Doesn't your company (that's ZenPark, isn't it?) have its own software
development department that can do such work? Please tell me where I
should send the bill for the following rough and inefficient sketch. :)

I have left out all exception handling and minor details:

class UnicodeReader extends Reader {   // Reader is an abstract class, not an interface
    PushbackInputStream internalIn;
    InputStreamReader internalOut = null;
    String defaultEnc;

    private static final int BOM_SIZE = 3; // enough for UTF-8 and UTF-16

    UnicodeReader(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    protected void init() {
        if (internalOut != null) {
            return;
        }

        // Read up to BOM_SIZE bytes; a single read() may return fewer.
        byte bom[] = new byte[BOM_SIZE];
        int n;
        int pos = 0;
        while (pos < BOM_SIZE &&
               (n = internalIn.read(bom, pos, BOM_SIZE - pos)) != -1)
        {
            pos += n;
        }
        // Push everything back so the decoder sees the stream from the start.
        internalIn.unread(bom, 0, pos);
        String encoding = ... // evaluate the content of bom[] here
                              // revert to defaultEnc if nothing found
        internalOut = new InputStreamReader(internalIn, encoding);
    }

    //
    // For all methods in class Reader, implement each method as:
    //
    // method(...) {
    //     init();
    //     internalOut.method(...);
    // }
    //
}


/Thomas
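
The "evaluate the content of bom[]" step in the sketch above could be filled in with a small table lookup. The following is only one possible way to do it; the class BomSniffer and its method names are hypothetical and do not come from the thread. The signatures are the ones listed in the Unicode FAQ, with the four-byte UTF-32 marks tested before the two-byte UTF-16 marks because FF FE 00 00 also begins with FF FE.

```java
import java.util.Arrays;

class BomSniffer {
    // Longer signatures first, so UTF-32LE is not mistaken for UTF-16LE.
    private static final String[] NAMES =
        {"UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"};
    private static final byte[][] MARKS = {
        {0x00, 0x00, (byte) 0xFE, (byte) 0xFF},
        {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00},
        {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF},
        {(byte) 0xFE, (byte) 0xFF},
        {(byte) 0xFF, (byte) 0xFE},
    };

    /** Index of the BOM found in the first n bytes of head, or -1 if none. */
    private static int match(byte[] head, int n) {
        for (int i = 0; i < MARKS.length; i++) {
            byte[] mark = MARKS[i];
            if (n >= mark.length
                    && Arrays.equals(Arrays.copyOf(head, mark.length), mark)) {
                return i;
            }
        }
        return -1;
    }

    /** Charset name implied by the BOM, or defaultEnc if no BOM is present. */
    static String encodingFor(byte[] head, int n, String defaultEnc) {
        int i = match(head, n);
        return i < 0 ? defaultEnc : NAMES[i];
    }

    /** Number of leading bytes that belong to the BOM (0 if none). */
    static int bomLength(byte[] head, int n) {
        int i = match(head, n);
        return i < 0 ? 0 : MARKS[i].length;
    }
}
```

With BOM_SIZE = 3, as in the sketch, the UTF-32 entries can never match, so the buffer would need to be four bytes to use them. Note also that the sketch pushes back all bytes including the BOM; bomLength() tells the caller how many of them to leave out if the BOM should be skipped.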
 

Roedy Green

Win2k Notepad stores BOM mark at the start of UTF-8 files, and currently ISR
cannot read it properly.

see http://mindprod.com/jgloss/encoding.html

What happens if you use the UTF-8 or UTF-16 encoding with the code
suggested by the File I/O Amanuensis at
http://mindprod.com/fileio.html?

Java is not smart enough to switch between 8-bit and 16-bit encodings
automatically, but it is smart enough to deal with endian markers, both
BE and LE.
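
The endian-marker handling can be checked with a few bytes. This is a small sketch using the standard "UTF-16" charset; the class name Utf16BomDemo is invented for the example.

```java
import java.nio.charset.StandardCharsets;

class Utf16BomDemo {
    /** Decodes with the generic UTF-16 charset, which honours a leading BOM. */
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 'A', 0};
        byte[] bigEndian    = {(byte) 0xFE, (byte) 0xFF, 0, 'A'};
        // In both cases the BOM selects the byte order and is then dropped.
        System.out.println(decode(littleEndian)); // prints "A"
        System.out.println(decode(bigEndian));    // prints "A"
    }
}
```

The same charset name works for both byte orders, but the caller still has to know in advance that the file is UTF-16 rather than UTF-8.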

Ideally this should be implemented as yet another encoding:
Unicode-8-16. Does anyone know how you insert your own encoding into
the official list? You can't pass any parameters to the encoding such
as your preferred default big/little endian, so you must create
variant names for all the combinations.
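
On the question of adding an encoding to the list: java.nio (Java 1.4 and later) has a hook for this, java.nio.charset.spi.CharsetProvider. A provider is registered by naming its class in a file called META-INF/services/java.nio.charset.spi.CharsetProvider on the class path, after which Charset.forName() can find its charsets. The sketch below only shows the registration mechanics. The names AliasCharsetProvider and X-UTF-8-ALIAS are made up, and the charset simply delegates to UTF-8; a real "Unicode-8-16" charset would have to supply its own BOM-sniffing decoder in newDecoder().

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

class AliasCharsetProvider extends CharsetProvider {

    /** Hypothetical charset that borrows the UTF-8 encoder and decoder. */
    static final class Utf8Alias extends Charset {
        Utf8Alias() {
            super("X-UTF-8-ALIAS", new String[0]); // canonical name, no aliases
        }

        @Override
        public boolean contains(Charset cs) {
            return cs instanceof Utf8Alias || StandardCharsets.UTF_8.contains(cs);
        }

        @Override
        public CharsetDecoder newDecoder() {
            // A BOM-aware charset would return its own decoder here.
            return StandardCharsets.UTF_8.newDecoder();
        }

        @Override
        public CharsetEncoder newEncoder() {
            return StandardCharsets.UTF_8.newEncoder();
        }
    }

    private final Charset charset = new Utf8Alias();

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(charset).iterator();
    }

    @Override
    public Charset charsetForName(String name) {
        // Charset names are case-insensitive.
        return charset.name().equalsIgnoreCase(name) ? charset : null;
    }
}
```

As noted above, there is no way to pass parameters through a charset name, so each default (for example big-endian versus little-endian fallback) would still need its own registered name.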
 

NoName NoName

Thanks for the good tip, I was not aware of the PushbackInputStream class. It
made everything really simple to do. Here is an implementation of what you
suggested.


/**
Original pseudocode : Thomas Weidenfeller
Implementation tweaked: Aki Nieminen

http://www.unicode.org/unicode/faq/utf_bom.html
BOMs:
00 00 FE FF = UTF-32, big-endian
FF FE 00 00 = UTF-32, little-endian
FE FF = UTF-16, big-endian
FF FE = UTF-16, little-endian
EF BB BF = UTF-8

Win2k Notepad:
Unicode format = UTF-16LE
***/

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

/**
 * Generic Unicode text reader, which uses the BOM
 * to identify the encoding to be used.
 */
public class UnicodeReader extends Reader {
    PushbackInputStream internalIn;
    InputStreamReader internalIn2 = null;
    String defaultEnc;

    private static final int BOM_SIZE = 4;

    public UnicodeReader(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    public String getDefaultEncoding() {
        return defaultEnc;
    }

    /** Returns the detected encoding, or null if nothing has been read yet. */
    public String getEncoding() {
        if (internalIn2 == null) return null;
        return internalIn2.getEncoding();
    }

    /**
     * Read ahead up to four bytes and check for a BOM. Bytes that are not
     * part of the BOM are pushed back to the stream; only BOM bytes are skipped.
     */
    protected void init() throws IOException {
        if (internalIn2 != null) return;

        String encoding;
        byte[] bom = new byte[BOM_SIZE];

        // A single read() may return fewer bytes than requested, so loop.
        int n = 0;
        while (n < BOM_SIZE) {
            int r = internalIn.read(bom, n, BOM_SIZE - n);
            if (r == -1) break;
            n += r;
        }

        int skip; // number of leading bytes that belong to the BOM

        // UTF-32 must be tested before UTF-16: FF FE 00 00 also starts with FF FE.
        if (n >= 4 && bom[0] == (byte) 0x00 && bom[1] == (byte) 0x00
                && bom[2] == (byte) 0xFE && bom[3] == (byte) 0xFF) {
            encoding = "UTF-32BE";
            skip = 4;
        } else if (n >= 4 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE
                && bom[2] == (byte) 0x00 && bom[3] == (byte) 0x00) {
            encoding = "UTF-32LE";
            skip = 4;
        } else if (n >= 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB
                && bom[2] == (byte) 0xBF) {
            encoding = "UTF-8";
            skip = 3;
        } else if (n >= 2 && bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF) {
            encoding = "UTF-16BE";
            skip = 2;
        } else if (n >= 2 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE) {
            encoding = "UTF-16LE";
            skip = 2;
        } else {
            // Unicode BOM not found, push back all bytes
            encoding = defaultEnc;
            skip = 0;
        }

        // Push back everything that was read beyond the BOM.
        if (n > skip) internalIn.unread(bom, skip, n - skip);

        // Use the detected encoding, or the platform default if none was given
        if (encoding == null) {
            internalIn2 = new InputStreamReader(internalIn);
        } else {
            internalIn2 = new InputStreamReader(internalIn, encoding);
        }
    }

    public void close() throws IOException {
        // Closing should not trigger BOM detection; close the outermost stream.
        if (internalIn2 != null) internalIn2.close();
        else internalIn.close();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        init();
        return internalIn2.read(cbuf, off, len);
    }

}

 
