Encoding hell

Damphyr · Sep 5, 2005

OK, I am officially frustrated/lost/bewildered (take your pick) with all
this encoding/decoding of character sets.
I'm trying to grab some book data from a web service using ISBN numbers.
I'm using a simple GET HTTP request and on a query the service returns
the following:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www.webserviceX.NET">
&lt;ISBNORG&gt;
&lt;RECORD&gt;
&lt;ISBN&gt;0764558315&lt;/ISBN&gt;
&lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
&lt;FULLTITLE&gt;Expert one-on-one J2EE development without EJB / Rod
Johnson, with Juergen Hoeller.&lt;/FULLTITLE&gt;

&lt;SHORTTITLE&gt;Expert one-on-one J2EE development without EJB
/&lt;/SHORTTITLE&gt;
&lt;EDITION&gt;&lt;/EDITION&gt;
&lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;
&lt;DATE&gt;c2004.&lt;/DATE&gt;
&lt;SUBJECT&gt;Java (Computer program language)&lt;/SUBJECT&gt;

&lt;/RECORD&gt;
&lt;/ISBNORG&gt;
</string>

which I can't parse with REXML. If all the &lt; and &gt; were < and >,
then no problem, everything checks out fine. Same code with the above
snippet refuses to extract the data. Obviously I'm missing something.
Is there a way to parse this string so that all the escaped stuff goes
back to normal? Can REXML understand the ampersand thingies?
Any help will be appreciated,
Cheers,
V.-
P.S. I'd have used Pickaxe 2.ed for the example if only the book was in
their database

____________________________________________________________________
http://www.freemail.gr - free email service for the Greek-speaking.

James Edward Gray II · Sep 5, 2005

Damphyr said:
OK, I am officially frustrated/lost/bewildered (take your pick)
with all this encoding/decoding of character sets.
I'm trying to grab some book data from a web service using ISBN
numbers. I'm using a simple GET HTTP request and on a query the
service returns the following:

This is definitely a hack, but it's working for the data you showed:

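require "rexml/document"

# Undo one level of entity escaping. &amp; goes last so that
# something like "&amp;lt;" is not unescaped twice.
def unescape( xml )
  xml.gsub!("&lt;", "<")
  xml.gsub!("&gt;", ">")
  xml.gsub!("&amp;", "&")
  xml
end

doc = REXML::Document.new(unescape(DATA.read))
doc.each_element("string/ISBNORG/RECORD/*") { |e| p e }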

__END__
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www.webserviceX.NET">
&lt;ISBNORG&gt;
&lt;RECORD&gt;
&lt;ISBN&gt;0764558315&lt;/ISBN&gt;
&lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
&lt;FULLTITLE&gt;Expert one-on-one J2EE development without EJB / Rod
Johnson, with Juergen Hoeller.&lt;/FULLTITLE&gt;

&lt;SHORTTITLE&gt;Expert one-on-one J2EE development without EJB /
&lt;/SHORTTITLE&gt;
&lt;EDITION&gt;&lt;/EDITION&gt;
&lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;
&lt;DATE&gt;c2004.&lt;/DATE&gt;
&lt;SUBJECT&gt;Java (Computer program language)&lt;/SUBJECT&gt;

&lt;/RECORD&gt;
&lt;/ISBNORG&gt;
</string>

Hope that helps.

James Edward Gray II

Zach Dennis · Sep 5, 2005

You are seeing already escaped characters. You need to unescape them.

require "cgi"
require "rexml/document"

str = CGI.unescapeHTML( string )
REXML::Document.new( str )

HTH,

Zach
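
A minimal sketch of the unescape-then-parse approach Zach suggests. The response string here is a shortened, hypothetical stand-in for what the service returns, assuming the payload inside <string> arrives entity-escaped as discussed in this thread. The variable names (response, record, fields) are illustrative only.

```ruby
require "cgi"
require "rexml/document"

# Hypothetical, shortened stand-in for the service response (assumption:
# the payload inside <string> arrives entity-escaped, as in the thread).
response = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <string xmlns="http://www.webserviceX.NET">&lt;ISBNORG&gt;&lt;RECORD&gt;&lt;ISBN&gt;0764558315&lt;/ISBN&gt;&lt;DATE&gt;c2004.&lt;/DATE&gt;&lt;/RECORD&gt;&lt;/ISBNORG&gt;</string>
XML

# Turn &lt; &gt; &amp; back into literal characters, then parse once.
doc = REXML::Document.new(CGI.unescapeHTML(response))

# Walk by position (string -> ISBNORG -> RECORD) so the example does not
# depend on how XPath name tests treat the inherited default namespace.
record = doc.root.elements[1].elements[1]

# Collect each child element of RECORD as name => text.
fields = {}
record.each_element { |e| fields[e.name] = e.text }
p fields
```

Note that after unescaping, the formerly embedded elements become children of <string> and so inherit its default namespace.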

Joshua Haberman · Sep 5, 2005

Zach said:
You are seeing already escaped characters. You need to unescape them.

str = CGI.unescapeHTML( string )
REXML::Document.new( str )

Another way of looking at this: you're getting one XML document
embedded in another:

enclosing_doc = REXML::Document.new(str)
real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)

Josh

Damphyr · Sep 5, 2005

Zach said:
You are seeing already escaped characters. You need to unescape them.

str = CGI.unescapeHTML( string )
REXML::Document.new( str )

Aaaaargh, I knew it. That's where I saw the 'double escaping' reference:
reading about the changes in CGI between 1.6 and 1.8.
Thanks, that's what I was looking for (sorry James, the whole purpose
of the mail was to avoid the hack you so kindly provided).
Cheers,
V.-

Zach Dennis · Sep 5, 2005

Josh said:
Another way of looking at this: you're getting one XML document
embedded in another:

enclosing_doc = REXML::Document.new(str)
real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)

Good call. Then unescape real_doc...

Zach

Joshua Haberman · Sep 5, 2005

Zach said:
Good call. Then unescape real_doc...

No, that's the whole point. All escapes were interpreted when you
parsed enclosing_doc, and replaced by their corresponding characters.

The text of the <string> element is itself a valid XML document, and
incidentally, the XML document you really care about.

Try "puts enclosing_doc.elements["/string"].text", and it should all
make more sense.

Josh
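
A small sketch of the two-step parse Josh describes, with no manual unescaping. The response string is a shortened, hypothetical stand-in for the service output, and the names inner_xml, isbn and publisher are illustrative. It uses enclosing_doc.root, which for this sample is the <string> element.

```ruby
require "rexml/document"

# Hypothetical, shortened stand-in for the service response (assumption:
# the payload inside <string> arrives entity-escaped, as in the thread).
response = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <string xmlns="http://www.webserviceX.NET">&lt;ISBNORG&gt;&lt;RECORD&gt;&lt;ISBN&gt;0764558315&lt;/ISBN&gt;&lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;&lt;/RECORD&gt;&lt;/ISBNORG&gt;</string>
XML

# Step 1: parse the outer document. The parser resolves the entities,
# so the text of <string> comes back as plain markup.
enclosing_doc = REXML::Document.new(response)
inner_xml = enclosing_doc.root.text
puts inner_xml

# Step 2: parse that text as a document in its own right.
real_doc = REXML::Document.new(inner_xml)

# The inner document carries no namespace, so plain paths work.
isbn = real_doc.elements["ISBNORG/RECORD/ISBN"].text
publisher = real_doc.elements["ISBNORG/RECORD/PUBLISHER"].text
puts isbn
puts publisher
```

Because the inner text is parsed separately, its elements do not pick up the xmlns declared on the enclosing <string> element.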

Zach Dennis · Sep 5, 2005

Josh said:
Try "puts enclosing_doc.elements["/string"].text", and it should all
make more sense.

Ah, yep, you're right. Doing that makes more sense. =) I didn't know
REXML would auto-unescape for you. Pretty cool. Thanks Josh!

Zach
