Encoding hell

Damphyr

OK, I am officially frustrated/lost/bewildered (take your pick) with all
this encoding/decoding of character sets.
I'm trying to grab some book data from a web service using ISBN numbers.
I'm using a simple GET HTTP request and on a query the service returns
the following:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www.webserviceX.NET">
&lt;ISBNORG&gt;
&lt;RECORD&gt;
&lt;ISBN&gt;0764558315&lt;/ISBN&gt;
&lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
&lt;FULLTITLE&gt;Expert one-on-one J2EE development without EJB / Rod
Johnson, with Juergen Hoeller.&lt;/FULLTITLE&gt;

&lt;SHORTTITLE&gt;Expert one-on-one J2EE development without EJB
/&lt;/SHORTTITLE&gt;
&lt;EDITION&gt;&lt;/EDITION&gt;
&lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;
&lt;DATE&gt;c2004.&lt;/DATE&gt;
&lt;SUBJECT&gt;Java (Computer program language)&lt;/SUBJECT&gt;

&lt;/RECORD&gt;
&lt;/ISBNORG&gt;
</string>

which I can't parse with REXML :(. If all the &lt; &gt; were < and >
then no prob, everything checks out fine. Same code with the above
snippet refuses to extract the data. Obviously I'm missing something.
Is there a way to parse this string so that all the escaped stuff goes
back to normal? Can REXML understand the ampersand thingies?
Any help will be appreciated,
Cheers,
V.-
P.S. I'd have used Pickaxe 2.ed for the example if only the book was in
their database :)

____________________________________________________________________
http://www.freemail.gr - free email service for the Greek-speaking.
 
James Edward Gray II

This is definitely a hack, but it's working for the data you showed:

require "rexml/document"

def unescape( xml )
  # Replace &amp; last so that a sequence like "&amp;lt;" is not unescaped twice.
  xml.gsub("&lt;", "<").gsub("&gt;", ">").gsub("&amp;", "&")
end

doc = REXML::Document.new(unescape(DATA.read))
doc.each_element("string/ISBNORG/RECORD/*") { |e| p e }

__END__
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www.webserviceX.NET">
&lt;ISBNORG&gt;
&lt;RECORD&gt;
&lt;ISBN&gt;0764558315&lt;/ISBN&gt;
&lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
&lt;FULLTITLE&gt;Expert one-on-one J2EE development without EJB / Rod
Johnson, with Juergen Hoeller.&lt;/FULLTITLE&gt;

&lt;SHORTTITLE&gt;Expert one-on-one J2EE development without EJB /
&lt;/SHORTTITLE&gt;
&lt;EDITION&gt;&lt;/EDITION&gt;
&lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;
&lt;DATE&gt;c2004.&lt;/DATE&gt;
&lt;SUBJECT&gt;Java (Computer program language)&lt;/SUBJECT&gt;

&lt;/RECORD&gt;
&lt;/ISBNORG&gt;
</string>

Hope that helps.

James Edward Gray II
 
Zach Dennis

You are seeing already escaped characters. You need to unescape them.

require "cgi"
require "rexml/document"

# string holds the raw response from the web service
str = CGI.unescapeHTML( string )
REXML::Document.new( str )
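
To see the unescaping step in isolation, here is a minimal sketch that applies it to a single line of the response. The variable names are only illustrative, and the input is one escaped line copied from the original post rather than the full reply from the service.

```ruby
require "cgi"
require "rexml/document"

# One escaped line taken from the service response in the original post.
escaped = "&lt;ISBN&gt;0764558315&lt;/ISBN&gt;"

# CGI.unescapeHTML turns the entity references back into literal characters.
unescaped = CGI.unescapeHTML(escaped)

# The unescaped fragment is ordinary XML again, so REXML can read it.
isbn = REXML::Document.new(unescaped).root.text

puts unescaped
puts isbn
```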

HTH,

Zach
 
Joshua Haberman

You are seeing already escaped characters. You need to unescape them.

str = CGI.unescapeHTML( string )
REXML::Document.new( str )

Another way of looking at this: you're getting one XML document
embedded in another:

enclosing_doc = REXML::Document.new(str)
real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)

Josh
 
Damphyr

Zach said:
You are seeing already escaped characters. You need to unescape them.


str = CGI.unescapeHTML( string )
REXML::Document.new( str )

Aaaaargh, I knew it. That's where I saw the 'double escaping' reference:
reading about the changes in CGI between 1.6 and 1.8.
Thanks, that's what I was looking for (sorry James, the whole purpose
for the mail was to avoid the hack you so kindly provided :) ).
Cheers,
V.-

 
Zach Dennis

Another way of looking at this: you're getting one XML document
embedded in another:

enclosing_doc = REXML::Document.new(str)
real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)

Good call. then unescape real_doc...

Zach
 
Joshua Haberman

Another way of looking at this: you're getting one XML document
embedded in another:

enclosing_doc = REXML::Document.new(str)
real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)

Good call. then unescape real_doc...

No, that's the whole point. All escapes were interpreted when you
parsed enclosing_doc, and replaced by their corresponding characters.

The text of the <string> element is itself a valid XML document, and
incidentally, the XML document you really care about.

Try "puts enclosing_doc.elements["/string"].text", and it should all
make more sense.
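
As a rough end-to-end sketch of the two-step parse described above: the response below is a trimmed-down copy of the one in the original post, and the variable names (response, inner_xml, isbn, date) are only illustrative. It uses root.text instead of the "/string" XPath purely to sidestep the default namespace on the outer element, and strips the surrounding whitespace before the second parse as a precaution.

```ruby
require "rexml/document"

# Trimmed-down copy of the service response from the original post.
response = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <string xmlns="http://www.webserviceX.NET">
  &lt;ISBNORG&gt;
  &lt;RECORD&gt;
  &lt;ISBN&gt;0764558315&lt;/ISBN&gt;
  &lt;DATE&gt;c2004.&lt;/DATE&gt;
  &lt;/RECORD&gt;
  &lt;/ISBNORG&gt;
  </string>
XML

# First parse: the outer document. Reading the text of <string> gives back
# the entity references as literal < and > characters.
enclosing_doc = REXML::Document.new(response)
inner_xml = enclosing_doc.root.text.strip

# Second parse: the text of <string> is itself an XML document.
real_doc = REXML::Document.new(inner_xml)

isbn = real_doc.elements["/ISBNORG/RECORD/ISBN"].text
date = real_doc.elements["/ISBNORG/RECORD/DATE"].text

puts inner_xml
puts isbn
puts date
```

No call to CGI.unescapeHTML and no gsub is needed here, because the first parse already resolves the entities.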

Josh

 
Zach Dennis

Try "puts enclosing_doc.elements["/string"].text", and it should all
make more sense.

Ah, yep, you're right. Doing that makes more sense. =) I didn't know
REXML would auto-unescape for you. Pretty cool. Thanks Josh!

Zach
 
