XML inside a web page and encoding


6real

Dear all,

I am seeing some strange behavior and, to be honest, I don't know how
to solve my issue because I am not familiar with encoding problems.

Here is what I would like to do:
1 - Parse an HTML file
2 - Extract the part of this page that is an XML document
3 - Store this XML in a database

It seems simple, but I ran into an encoding issue.
The web page is served with the ISO-8859-1 charset.
The XML header (once extracted) specifies UTF-8 as its encoding.

Here is my code snippet to parse the web page:

URL url = new URL(getURLToUpdate());
URLConnection urlconn = url.openConnection();

Log.d("MGR", "open url");

Document doc = null;

try {
    // isolate the kml part
    String page = FormatUtility.slurp(urlconn.getInputStream());

    // index of KML start and stop
    int indexStartKML = page.indexOf(Constant.TAG_KML_START);
    int indexStopKML = page.indexOf(Constant.TAG_KML_STOP);

    String kml = page.substring(indexStartKML, indexStopKML + 6);

    // Remove the CDATA information
    kml = kml.replace("<![CDATA[", "");
    kml = kml.replace("]]>", "");

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();

    InputSource inStream = new InputSource();
    inStream.setCharacterStream(new StringReader(kml));

    doc = db.parse(inStream);
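
A note on the parsing step: according to the org.xml.sax.InputSource documentation, when a character stream is supplied the parser reads it directly and disregards the encoding named in the XML declaration. So the UTF-8 in the extracted header should not matter once the page is a String; what matters is how the bytes were turned into characters in the first place. Below is a minimal sketch of the extract-and-parse step under that assumption. The class name, the method names, and the tag values "<?xml" and "</kml>" are illustrative guesses (the real values of Constant.TAG_KML_START and Constant.TAG_KML_STOP are not shown in this post), and it uses the stop tag's length in place of the hard-coded 6.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class KmlExtractSketch {
    // Cuts out the text from startTag through the end of stopTag and
    // strips the CDATA markers, as the original snippet does.
    // Both tag values are supplied by the caller (assumed, not known).
    static String extract(String page, String startTag, String stopTag) {
        int start = page.indexOf(startTag);
        int stop = (start < 0) ? -1 : page.indexOf(stopTag, start);
        if (start < 0 || stop < 0) {
            throw new IllegalArgumentException("KML block not found in page");
        }
        String kml = page.substring(start, stop + stopTag.length());
        return kml.replace("<![CDATA[", "").replace("]]>", "");
    }

    // Parses already-decoded text. Because the InputSource wraps a Reader,
    // the encoding named in the XML declaration is not used for decoding.
    static Document parse(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Made-up page content, for illustration only.
        String page = "<html><body><![CDATA[<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<kml><name>caf\u00e9</name></kml>]]></body></html>";
        Document doc = parse(extract(page, "<?xml", "</kml>"));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Failing early when either tag is missing avoids a confusing StringIndexOutOfBoundsException from substring() when indexOf() returns -1.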

Here is the slurp() method:
public static String slurp (InputStream in) throws IOException {
    StringBuffer out = new StringBuffer();
    byte[] b = new byte[4096];
    for (int n; (n = in.read(b)) != -1;) {
        out.append(new String(b, 0, n));
    }
    return out.toString();
}
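
One likely culprit in slurp() is new String(b, 0, n): that constructor decodes with the platform default charset, which may not be ISO-8859-1, and decoding chunk by chunk can also split a multi-byte character across two reads. Here is a minimal sketch of a charset-aware variant, assuming the page really is ISO-8859-1 as its response header says. The class name CharsetSlurp and the extra charset parameter are illustrative and not part of the original FormatUtility.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSlurp {
    // Hypothetical replacement for slurp(): the caller names the charset
    // instead of relying on the platform default.
    static String slurp(InputStream in, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        // The reader decodes across chunk boundaries, so a character is
        // never cut in half between two reads.
        Reader reader = new InputStreamReader(in, charset);
        char[] buf = new char[4096];
        for (int n; (n = reader.read(buf)) != -1;) {
            out.append(buf, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};
        String text = slurp(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(text.length()); // prints 4
    }
}
```

In the snippet from this post, the call would then look something like slurp(urlconn.getInputStream(), StandardCharsets.ISO_8859_1), again assuming the declared charset is accurate.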

I tried to force the encoding, but with no success. I don't know where
to look now: when I load the page from the input stream, or when I
convert the stream into a String?

Any help or ideas would be highly appreciated!

Thanks for reading (this is for a freeware project ;-) )!

C.

PS: This is the response header of the web page:

Date Tue, 29 Jul 2008 21:16:23 GMT
Server Apache
X-Powered-By PHP/5.1.4
Expires Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma no-cache
Keep-Alive timeout=15, max=99
Connection Keep-Alive
Transfer-Encoding chunked
Content-Type text/html; charset=ISO-8859-1
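
Since the server announces the charset in Content-Type, the decoding charset could be taken from that header rather than hard-coded. The following is a rough sketch of such a lookup, not a full HTTP header parser. The class name ContentTypeCharset and the method charsetOf are made up for illustration, and the fallback charset is left to the caller.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or the fallback when the header is
    // missing, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String key = "charset=";
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, key, 0, key.length())) {
                String name = p.substring(key.length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(cs.name()); // prints ISO-8859-1
    }
}
```

With the connection from this post, the header value would come from urlconn.getContentType(), which returns null when the server does not send the header.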