Encoding hell

Discussion in 'Ruby' started by Damphyr, Sep 5, 2005.

  1. Damphyr

    Damphyr Guest

    OK, I am officially frustrated/lost/bewildered (take your pick) with all
    this encoding/decoding of character sets.
    I'm trying to grab some book data from a web service using ISBN numbers.
    I'm using a simple GET HTTP request and on a query the service returns
    the following:

    <?xml version="1.0" encoding="utf-8"?>
    <string xmlns="http://www.webserviceX.NET">
    &lt;ISBNORG&gt;
    &lt;RECORD&gt;
    &lt;ISBN&gt;0764558315&lt;/ISBN&gt;
    &lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
    &lt;FULLTITLE&gt;Expert one-on-one J2EE development without EJB / Rod
    Johnson, with Juergen Hoeller.&lt;/FULLTITLE&gt;

    &lt;SHORTTITLE&gt;Expert one-on-one J2EE development without EJB
    /&lt;/SHORTTITLE&gt;
    &lt;EDITION&gt;&lt;/EDITION&gt;
    &lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;
    &lt;DATE&gt;c2004.&lt;/DATE&gt;
    &lt;SUBJECT&gt;Java (Computer program language)&lt;/SUBJECT&gt;

    &lt;/RECORD&gt;
    &lt;/ISBNORG&gt;
    </string>

    which I can't parse with REXML :(. If all the &lt; &gt; were < and >
    then no prob, everything checks out fine. Same code with the above
    snippet refuses to extract the data. Obviously I'm missing something.
    Is there a way to parse this string so that all the escaped stuff goes
    back to normal? Can REXML understand the ampersand thingies?
    Any help will be appreciated,
    Cheers,
    V.-
    P.S. I'd have used Pickaxe 2.ed for the example if only the book was in
    their database :)

    ____________________________________________________________________
    http://www.freemail.gr - free email service for the Greek-speaking.
     
    Damphyr, Sep 5, 2005
    #1

  2. On Sep 5, 2005, at 8:34 AM, Damphyr wrote:

    > OK, I am officially frustrated/lost/bewildered (take your pick)
    > with all this encoding/decoding of character sets.
    > I'm trying to grab some book data from a web service using ISBN
    > numbers. I'm using a simple GET HTTP request and on a query the
    > service returns the following:


    This is definitely a hack, but it's working for the data you showed:

    require "rexml/document"

    def unescape( xml )
      # Replace &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<".
      xml.gsub!("&lt;", "<")
      xml.gsub!("&gt;", ">")
      xml.gsub!("&amp;", "&")
      xml
    end

    doc = REXML::Document.new(unescape(DATA.read))
    doc.each_element("string/ISBNORG/RECORD/*") { |e| p e }

    __END__
    <?xml version="1.0" encoding="utf-8"?>
    <string xmlns="http://www.webserviceX.NET">
    &lt;ISBNORG&gt;
    &lt;RECORD&gt;
    &lt;ISBN&gt;0764558315&lt;/ISBN&gt;
    &lt;AUTHOR&gt;Rod Johnson, with Juergen Hoeller.&lt;/AUTHOR&gt;
    &lt;FULLTITLE&gt;Expert one-on-one J2EE development without EJB / Rod
    Johnson, with Juergen Hoeller.&lt;/FULLTITLE&gt;

    &lt;SHORTTITLE&gt;Expert one-on-one J2EE development without EJB /
    &lt;/SHORTTITLE&gt;
    &lt;EDITION&gt;&lt;/EDITION&gt;
    &lt;PUBLISHER&gt;Wiley Pub./Wrox,&lt;/PUBLISHER&gt;
    &lt;DATE&gt;c2004.&lt;/DATE&gt;
    &lt;SUBJECT&gt;Java (Computer program language)&lt;/SUBJECT&gt;

    &lt;/RECORD&gt;
    &lt;/ISBNORG&gt;
    </string>

    Hope that helps.

    James Edward Gray II
     
    James Edward Gray II, Sep 5, 2005
    #2

  3. Zach Dennis

    Zach Dennis Guest

    On Mon, 2005-09-05 at 22:34 +0900, Damphyr wrote:
    > OK, I am officially frustrated/lost/bewildered (take your pick) with all
    > this encoding/decoding of character sets.
    > I'm trying to grab some book data from a web service using ISBN numbers.
    > I'm using a simple GET HTTP request and on a query the service returns
    > the following:


    You are seeing already escaped characters. You need to unescape them.

    require "cgi"
    require "rexml/document"

    str = CGI.unescapeHTML( string )
    REXML::Document.new( str )
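
    A minimal runnable sketch of the unescape-then-parse approach above. The
    response fragment is a hypothetical, shortened stand-in for the service
    output quoted in the first post, and the variable names simply follow the
    snippet above. Note that it unescapes the whole response, wrapper included.

```ruby
require "cgi"
require "rexml/document"

# Hypothetical, shortened stand-in for the web service response.
string = <<~XML
  <string xmlns="http://www.webserviceX.NET">
  &lt;RECORD&gt;&lt;ISBN&gt;0764558315&lt;/ISBN&gt;&lt;/RECORD&gt;
  </string>
XML

# Turn the &lt; / &gt; entities back into real markup, then parse the result.
str = CGI.unescapeHTML(string)
doc = REXML::Document.new(str)

# The unescaped RECORD element is now a child of the <string> wrapper.
isbn = doc.root.elements["RECORD/ISBN"].text
puts isbn
```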

    HTH,

    Zach
     
    Zach Dennis, Sep 5, 2005
    #3
  4. On Sep 5, 2005, at 8:25 AM, Zach Dennis wrote:

    > You are seeing already escaped characters. You need to unescape them.
    >
    > str = CGI.unescapeHTML( string )
    > REXML::Document.new( str )


    Another way of looking at this: you're getting one XML document
    embedded in another:

    enclosing_doc = REXML::Document.new(str)
    real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)

    Josh
     
    Joshua Haberman, Sep 5, 2005
    #4
  5. Damphyr

    Damphyr Guest

    Zach Dennis wrote:
    > You are seeing already escaped characters. You need to unescape them.
    >
    >
    > str = CGI.unescapeHTML( string )
    > REXML::Document.new( str )

    Aaaaargh, I knew it. That's where I saw the 'double escaping' reference:
    reading about the changes in CGI between 1.6 and 1.8.
    Thanks, that's what I was looking for (sorry James, the whole purpose
    for the mail was to avoid the hack you so kindly provided :) ).
    Cheers,
    V.-

     
    Damphyr, Sep 5, 2005
    #5
  6. Zach Dennis

    Zach Dennis Guest

    On Tue, 2005-09-06 at 00:58 +0900, Christian Neukirchen wrote:

    > >
    > > Another way of looking at this: you're getting one XML document
    > > embedded in another:
    > >
    > > enclosing_doc = REXML::Document.new(str)
    > > real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)
    > >


    Good call. Then unescape real_doc...

    Zach
     
    Zach Dennis, Sep 5, 2005
    #6
  7. On Sep 5, 2005, at 9:26 AM, Zach Dennis wrote:

    > On Tue, 2005-09-06 at 00:58 +0900, Christian Neukirchen wrote:
    >
    >>>
    >>> Another way of looking at this: you're getting one XML document
    >>> embedded in another:
    >>>
    >>> enclosing_doc = REXML::Document.new(str)
    >>> real_doc = REXML::Document.new(enclosing_doc.elements["/string"].text)
    >>>
    >>>

    >
    > Good call. then unescape real_doc...


    No, that's the whole point. All escapes were interpreted when you
    parsed enclosing_doc, and replaced by their corresponding characters.

    The text of the <string> element is itself a valid XML document, and
    incidentally, the XML document you really care about.

    Try "puts enclosing_doc.elements["/string"].text", and it should all
    make more sense.
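
    The two-step parse described above can be sketched end to end as follows.
    This is only an illustration: the response is a hypothetical, shortened
    copy of the data from the first post, `root.text` is used here in place of
    `elements["/string"].text` to reach the same wrapper element, and the
    `strip` call is an extra precaution rather than something the thread
    requires.

```ruby
require "rexml/document"

# Hypothetical, shortened copy of the service response from the first post.
response = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <string xmlns="http://www.webserviceX.NET">
  &lt;ISBNORG&gt;
  &lt;RECORD&gt;
  &lt;ISBN&gt;0764558315&lt;/ISBN&gt;
  &lt;DATE&gt;c2004.&lt;/DATE&gt;
  &lt;/RECORD&gt;
  &lt;/ISBNORG&gt;
  </string>
XML

# Step 1: parse the wrapper. Its text content is the inner document,
# with the entities already resolved by the parser.
enclosing_doc = REXML::Document.new(response)
inner_xml = enclosing_doc.root.text.strip

# Step 2: parse that text as a document in its own right.
real_doc = REXML::Document.new(inner_xml)
isbn = real_doc.elements["/ISBNORG/RECORD/ISBN"].text
date = real_doc.elements["/ISBNORG/RECORD/DATE"].text
puts isbn, date
```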

    Josh

     
    Joshua Haberman, Sep 5, 2005
    #7
  8. Zach Dennis

    Zach Dennis Guest

    On Tue, 2005-09-06 at 01:34 +0900, Joshua Haberman wrote:

    > Try "puts enclosing_doc.elements["/string"].text", and it should all
    > make more sense.


    Ah, yep, you're right. Doing that makes more sense. =) I didn't know
    REXML would auto-unescape for you. Pretty cool. Thanks Josh!
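
    A small sketch of that auto-unescaping, assuming a made-up one-element
    document: `Element#text` gives the text with entities resolved, while
    calling `to_s` on the underlying text node gives it back in escaped form.

```ruby
require "rexml/document"

# Made-up document whose text content contains escaped characters.
doc = REXML::Document.new("<a>1 &lt; 2 &amp; 3 &gt; 2</a>")

# Element#text resolves the entities for you.
unescaped = doc.root.text

# The text node itself can still be written out in escaped form.
escaped = doc.root.get_text.to_s

puts unescaped
puts escaped
```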

    Zach
     
    Zach Dennis, Sep 5, 2005
    #8