changing Reader midstream

Blah Blah

i have what seems to be a fairly common problem, but i can't seem to find an
answer... i have a java program which opens a connection to a webpage and
seeks to read the webpage in. if the charset is defined in the header, then
no problem - i simply set up my InputStreamReader using the correct Charset
and away i go. but if the charset is defined in a META tag, then i want to
be able to recognize that.

a good example of why this is important: Yahoo! Japan uses EUC-JP encoding,
and specifies it in the header. Excite Japan uses Shift_JIS encoding, and
specifies it in the META tag.

here's some pseudocode:

URL url = new URL(webpage);
URLConnection conn = url.openConnection();
String charset;
if (conn.getContentType() != null)
    charset = getCharset(conn.getContentType()); // yay!
else
    charset = "ISO-8859-1"; // default

InputStream in = conn.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
while (get data from reader) {
    if (data is META tag && content-type specifies a new charset) {
        reader = new BufferedReader(new InputStreamReader(in, newCharset));
        // this is the problem
    }
}

this, unfortunately, does not seem to work. is there a better (i.e.,
standard) way to do this?

thanks!

daniel
 
Blah Blah

After some experimenting it turns out that this code is logically correct,
and in fact works if I don't use the InputStreamReader. It looks like the
InputStreamReader does some internal buffering, which is causing a couple of
hundred bytes to be skipped. Which isn't good. So the question is, is the
only way to fix this to write my own version of InputStreamReader? I need
the ability to process multibyte (and multiple encoding) text. It seems like
every webspider would have to have solved this problem. Any thoughts?
Anyone?

I'd really like to avoid reinventing the wheel...

daniel
 
John C. Bollinger

Blah said:
After some experimenting it turns out that this code is logically correct,
and in fact works if I don't use the InputStreamReader. It looks like the
InputStreamReader does some internal buffering, which is causing a couple of
hundred bytes to be skipped.

I don't know about internal buffering, and I would be a bit miffed if it
were true. You are, however, explicitly buffering the input _external_
to the InputStreamReader by wrapping the ISR in a BufferedReader. I
imagine that is where the buffering is happening.

It is generally better to put the buffer as close to the input as
possible anyway, and if you did so then you would have a fairly easy
solution available (see below).

Blah said:
Which isn't good. So the question is, is the
only way to fix this to write my own version of InputStreamReader?

No. Consider that the charset specified by the content type, whether in
the HTTP header or in an HTML meta tag, applies to the _entire_ transfer
entity. Therefore, if you encounter one specified in a meta tag then
you in principle need to reread the entire entity from the beginning,
using the specified charset.

So how about this: wrap the InputStream from your Connection in a
BufferedInputStream with a suitably big buffer. (A wise web designer
would put the meta tag we're discussing first in the HTML header anyway,
so perhaps the buffer doesn't need to be all that big.) The first thing
you do to the buffered stream is to mark() it so that you can later
reset() it to the beginning (provided you haven't read so much data as
to discard the mark). You wrap the buffered stream in your
InputStreamReader, and read happily away. If you happen to read a new
charset specification from a meta tag on your first time through then
you reset the buffered stream, create a new ISR around it with the right
encoding, and read through again, ignoring the charset specification on
the second pass.
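
A rough sketch of that mark()/reset() idea, under some assumptions: the class name, the SNIFF_LIMIT constant, and the simplified META_CHARSET pattern are all made up for illustration. It also varies the approach slightly: the first pass scans the raw prefix bytes (decoded as ISO-8859-1, which maps every byte to a char) instead of going through a first InputStreamReader, so that any read-ahead by a reader cannot run past the mark.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {
    // Hypothetical limit: how far into the document we look for a meta tag.
    static final int SNIFF_LIMIT = 8192;

    // Simplified, illustrative pattern; real-world meta tags vary more than this.
    static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Decodes the whole stream, preferring a charset named in a meta tag over the fallback. */
    public static String readAll(InputStream raw, Charset fallback) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw, SNIFF_LIMIT);
        in.mark(SNIFF_LIMIT); // remember the start of the entity

        // First pass: pull at most SNIFF_LIMIT raw bytes, never more than the mark allows.
        byte[] head = new byte[SNIFF_LIMIT];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // end of stream
            n += r;
        }
        Charset charset = sniffMetaCharset(head, n, fallback);

        // Second pass: rewind and decode everything with the chosen charset.
        in.reset();
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int c;
            while ((c = reader.read(buf)) != -1) {
                out.append(buf, 0, c);
            }
        }
        return out.toString();
    }

    /** Looks for charset=... inside a meta tag in the first len bytes. */
    static Charset sniffMetaCharset(byte[] head, int len, Charset fallback) {
        String prefix = new String(head, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(prefix);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: keep the fallback.
            }
        }
        return fallback;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head>"
                + "<body>\u65e5\u672c\u8a9e</body></html>";
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);
        System.out.println(decoded.equals(html));
    }
}
```

In this sketch a meta tag that appears after the first SNIFF_LIMIT bytes is simply missed and the fallback charset is used. Letting the meta tag override the fallback is a choice made here for illustration; which declaration should win is the decision mentioned in the next paragraph.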

You have to decide what happens if there are multiple content types
specified; I imagine there is a defined behavior but I don't know
offhand what it is.


John Bollinger
 
Blah Blah

John C. Bollinger said:
I don't know about internal buffering, and I would be a bit miffed if it
were true. You are, however, explicitly buffering the input _external_
to the InputStreamReader by wrapping the ISR in a BufferedReader. I
imagine that is where the buffering is happening.

well, i've solved the problem, and although the BufferedReader added to it a
little, the primary issue was the way in which the InputStreamReader
converts bytes to chars. it reads in a buffer of bytes (in the 256-512
range), converts them using the java.nio.charset.Charset that you gave it
(or rather, the CharsetDecoder it gets from that), and then feeds them to
you until it needs to get more. so although it's not buffering for the sake
of buffering, it does read in more than it's ready to use.
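
the read-ahead is easy to see for yourself. here is a small sketch (the class and method names are made up for the demo): it reads a single char through an InputStreamReader and then asks the underlying stream how many bytes it still has. how many bytes get pulled in at once depends on the JDK version, so the demo only shows that it is more than the one byte actually needed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReadAheadDemo {
    /** Returns how many bytes remain in a 1000-byte stream after reading one char. */
    public static int bytesLeftAfterOneChar() throws IOException {
        byte[] data = new byte[1000];
        Arrays.fill(data, (byte) 'a');
        ByteArrayInputStream in = new ByteArrayInputStream(data);

        // ISO-8859-1 is one byte per char, so one char needs exactly one byte.
        InputStreamReader reader = new InputStreamReader(in, StandardCharsets.ISO_8859_1);
        reader.read();

        // If the reader took only what it needed, 999 bytes would be left.
        return in.available();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("bytes left after reading one char: " + bytesLeftAfterOneChar());
    }
}
```

any bytes the first reader has already pulled in are gone as far as a second InputStreamReader on the same stream is concerned, which is where the skipped bytes come from.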

solution: create a BufferedInputStream with a large capacity, then mark and
reset if necessary. OR, rewrite InputStreamReader to use a smaller buffer,
and accept a change in the Charset midway through. since memory is an issue,
i opted for the latter, and it appears to be working just fine (although
there are some performance issues).

daniel
 
