Change/ignore XML encoding?

Travis Bell

Hey guys,

I think I am missing something very basic here. I have an XML request,
using the following code as an example:

require "rubygems"
require "xml/libxml"

movie = "sin+city"
search_url = 'http://www.movie-xml.com/interfaces/getmovie.php?moviename='
url = search_url+movie
doc = XML::Document.file(url)

Now, with most of the XML results I get from movie-xml.com, the default
UTF-8 is fine since they contain only ASCII characters. The result for
Sin City, however, contains non-ASCII bytes. Here's the response I get:

Input is not proper UTF-8, indicate encoding !

The source XML has an encoding declared as such:

<?xml version="1.0" encoding="ISO-8859-1"?>

So I should probably just decode as ISO-8859-1 as well. How the hell do
I do that? I have Googled the crap out of this and just can't seem to
find what I need here...
 
matt neuburg

Could this just be a bug in Libxml? REXML seems to do the right thing...
m.
 
Eric I.

matt neuburg said:
Could this just be a bug in Libxml? REXML seems to do the right thing...

Clearly libxml is expecting UTF-8, even though the XML file specifies
that it's encoded in ISO-8859-1. So that's a bug.

However, it appears that libxml is "correctly" rejecting data that is
not proper UTF-8 (independent of what it claims to be). Twice in the
XML data the word "vergüenza" appears, where the "ü" has hex code 0xFC,
which encodes a lowercase "u" with umlaut in ISO-8859-1. 0xFC cannot
appear in UTF-8 data, per RFC 3629.

libxml should work with ISO-8859-1 data much of the time, as long as
it doesn't contain 13 specific bytes (0xC0, 0xC1, 0xF5..0xFF).
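
To make the byte-level point concrete, here is a minimal Ruby sketch (plain Ruby, no libxml involved; the variable names are just for illustration). It lists the 13 bytes mentioned above and shows that the ISO-8859-1 bytes for "vergüenza" fail UTF-8 validation until they are transcoded. Note that other bytes above 0x7F are only valid in UTF-8 as part of a well-formed multi-byte sequence, so a lone 0xE9 ("é" in ISO-8859-1) fails validation as well.

```ruby
# The bytes that can never occur anywhere in well-formed UTF-8 (RFC 3629).
invalid_utf8_bytes = [0xC0, 0xC1, *(0xF5..0xFF)]
puts invalid_utf8_bytes.size                      # 13

# "vergüenza" as ISO-8859-1 bytes: the "ü" is the single byte 0xFC.
latin1 = "verg\xFCenza".b

# Reinterpreting those bytes as UTF-8 fails validation.
as_utf8 = latin1.dup.force_encoding(Encoding::UTF_8)
puts as_utf8.valid_encoding?                      # false

# A lone 0xE9 is not in the list of 13, but is still invalid on its own.
lone_e9 = "caf\xE9".b.force_encoding(Encoding::UTF_8)
puts lone_e9.valid_encoding?                      # false

# Tagging the bytes with their real encoding and transcoding gives valid UTF-8.
converted = latin1.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
puts converted.valid_encoding?                    # true
puts converted.bytes.map { |b| format("%02X", b) }.join(" ")
# 76 65 72 67 C3 BC 65 6E 7A 61
```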

Eric

 
Travis Bell

Eric said:
Clearly libxml is expecting UTF-8, even though the XML file specifies
that it's encoded in ISO-8859-1. So that's a bug.

libxml should work with ISO-8859-1 data much of the time, as long as
it doesn't contain 13 specific bytes (0xC0, 0xC1, 0xF5..0xFF).

Heh, so is there a way around this aside from using REXML? Are we
concluding this is a bug in libxml?
 
matt neuburg

Travis Bell said:
Heh, so is there a way around this aside from using REXML?

Well, if you really want to, I suppose you could parse the encoding info
yourself, convert the encoding of the entire text and change the
encoding info to utf8, and then open with libxml.
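
A rough sketch of that workaround in plain Ruby might look like the following. The helper name xml_to_utf8 is made up for illustration, and it assumes you have already fetched the raw response body as a string yourself (for example with Net::HTTP) instead of letting libxml open the URL. It is a sketch under those assumptions, not a tested recipe against movie-xml.com.

```ruby
# Hypothetical helper: transcode an XML string to UTF-8 based on the
# encoding named in its own XML declaration, then rewrite the declaration.
def xml_to_utf8(raw)
  # Work on a binary copy so the regexp never trips over invalid bytes.
  bytes = raw.dup.force_encoding(Encoding::BINARY)

  # Pull the encoding name out of the declaration; assume UTF-8 if absent.
  declared = bytes[/<\?xml[^>]*?encoding=["']([^"']+)["']/n, 1] || "UTF-8"

  # Tag the bytes with the declared encoding and transcode to UTF-8.
  utf8 = bytes.force_encoding(declared).encode(Encoding::UTF_8)

  # Make the declaration agree with the new byte encoding.
  utf8.sub(/(<\?xml[^>]*?encoding=["'])[^"']+(["'])/) { "#{$1}UTF-8#{$2}" }
end

# Stand-in for the real response body (ISO-8859-1 bytes, "ü" = 0xFC).
raw = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><title>verg\xFCenza</title>".b

fixed = xml_to_utf8(raw)
puts fixed
puts fixed.valid_encoding?
```

The resulting string could then be handed to whatever string-based parser entry point your libxml-ruby version provides (XML::Parser.string, if I remember the API correctly) rather than XML::Document.file.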

Travis Bell said:
Are we concluding this is a bug in libxml?

Not sure. Couldn't hurt to report it, though. It has its own google
group and its own bug reporting page... m.
 
Travis Bell

matt said:
Well, if you really want to, I suppose you could parse the encoding info
yourself, convert the encoding of the entire text and change the
encoding info to utf8, and then open with libxml.


Not sure. Couldn't hurt to report it, though. It has its own google
group and its own bug reporting page... m.

Right on. For now I just switched to REXML, and without any special
changes everything parses properly. Good for anyone else to know for
future reference.
 
