Blah Blah
I have what seems to be a fairly common problem, but I can't seem to find an
answer. I have a Java program that opens a connection to a webpage and reads
the page in. If the charset is defined in the HTTP Content-Type header, then
no problem: I simply set up my InputStreamReader using the correct Charset
and away I go. But if the charset is defined in a META tag, I want to be able
to recognize that and switch to it.
A good example of why this matters: Yahoo! Japan uses EUC-JP encoding and
specifies it in the header. Excite Japan uses Shift_JIS encoding and
specifies it in the META tag.
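For the header case, the charset extraction might look something like the sketch below. The class name and the fallback parameter are my own assumptions for illustration, not a standard API. It just picks the charset= parameter out of the Content-Type value and falls back when the name is missing or not supported by the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class; the name is an assumption for this sketch.
class ContentTypeCharset {

    /**
     * Returns the charset named in a Content-Type value such as
     * "text/html; charset=EUC-JP", or the fallback if none is usable.
     */
    static Charset getCharset(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by semicolons.
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        System.out.println(getCharset("text/html; charset=UTF-8", latin1));
        System.out.println(getCharset("text/html", latin1));
    }
}
```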
Here's some pseudocode:
URL url = new URL(webpage);
URLConnection conn = url.openConnection();
String charset;
if (conn.getContentType() != null)
    charset = getCharset(conn.getContentType()); // yay!
else
    charset = "ISO-8859-1"; // default
InputStream in = conn.getInputStream();
BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
while (get data from reader) {
    if (data is META tag && content-type specifies a new charset) {
        // swap in a reader with the new charset on the same stream
        reader = new BufferedReader(new InputStreamReader(in, newCharset));
        // this is the problem
    }
}
This, unfortunately, does not seem to work. Is there a better (i.e.,
standard) way to do this?
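My guess at why the swap fails: the first InputStreamReader and BufferedReader have already read ahead and buffered bytes from the stream, so a second reader on the same stream starts somewhere past the META tag. One direction I'm considering, as a rough sketch only, is to read the raw bytes into memory first, scan the start of the page as ISO-8859-1 for a META charset, and only then decode. The class name, the 4096-byte scan limit, and the regex are all assumptions of mine, not a standard API, and the regex is nowhere near a full HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names and limits are assumptions for this sketch.
class MetaCharsetSniffer {

    // Rough match for charset=NAME inside a <meta ...> tag, case-insensitive.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Reads the whole stream into memory so the bytes can be decoded later. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns the charset declared in a META tag near the start of the page, or the fallback. */
    static Charset sniff(byte[] page, Charset fallback) {
        // ISO-8859-1 maps every byte to a char, so ASCII markup stays scannable.
        int len = Math.min(page.length, 4096);
        String head = new String(page, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported name: fall through to the fallback.
            }
        }
        return fallback;
    }

    /** Buffers the stream, sniffs the META charset, then decodes the page once. */
    static String decode(InputStream in, Charset fallback) throws IOException {
        byte[] page = readAll(in);
        return new String(page, sniff(page, fallback));
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "<html><head><meta charset=\"UTF-8\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(sample, StandardCharsets.ISO_8859_1));
        System.out.println(decode(new ByteArrayInputStream(sample), StandardCharsets.ISO_8859_1));
    }
}
```

In my program I would only call decode when the header gave no charset, passing conn.getInputStream() and ISO-8859-1 as the fallback. The obvious cost is holding the whole page in memory.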
Thanks!
Daniel