REXML & Extended characters - newbie question

Discussion in 'Ruby' started by Ralph Mason, Jan 12, 2004.

  1. Ralph Mason

    Ralph Mason Guest

    I am doing a quick and dirty automatic translation from English to
    Spanish of some text in an XML document.

    However, the translation returns characters outside the 7-bit range,
    which seems to create an invalid XML document. I need those strings
    UTF-8 encoded before I set the text of an element, but I can't see how
    to do this.

    Thanks for any help

    Regards
    Ralph


    A test doc looks like

    <?xml version='1.0' encoding='UTF-8'?>
    <text>Vehicle</text>

    Full code.

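    require 'net/http'
    require 'cgi'
    require 'rexml/document'

    # Fetch the Spanish translation of text by scraping the result page.
    def translate(text)
      puts "translating #{text}"
      ret = ""
      Net::HTTP.start('translate.google.com') { |session|
        session.get("/translate_t?langpair=en|es&hl=en&text=#{CGI.escape(text)}") { |result|
          ret << result
        }
      }
      # Capture the text between the tag carrying name=q and the next '<'.
      ret =~ /(name=q.*?>)(.*?)</
      $2
    end

    # Translate the text of node, then recurse into its child elements.
    def process(node)
      puts node.name
      # node.text is nil for elements without text, so guard before strip.
      node.text = translate(node.text) if node.text && node.text.strip != ""
      node.elements.each { |x| process x }
    end

    doc = REXML::Document.new(File.new("lang_eng.xml"))
    doc.elements.each { |x| process x }
    doc.write(File.new("lang_spn.xml", "w"), 0)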
     
    Ralph Mason, Jan 12, 2004
    #1

  2. Hi!

    * Ralph Mason:
    > I need those strings UTF-8 encoded before I set the text of an
    > element.


    IIRC the encoding used by Google defaults to ISO-8859-1 while adding
    an explicit 'en=utf-8' to the argument part of the URL makes it use
    utf-8.

    Josef 'Jupp' SCHUGT
    --
    http://oss.erdfunkstelle.de/ruby/ - German comp.lang.ruby-FAQ
    http://rubyforge.org/users/jupp/ - Ruby projects at Rubyforge
    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Germany 2004: To boldly spy where no GESTAPO / STASI has spied before
     
    Josef 'Jupp' SCHUGT, Jan 13, 2004
    #2
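
A minimal sketch of the suggestion above, assuming the query key really is 'en' as recalled in the reply (it is given there only as "IIRC" and is not verified here). The helper name build_translate_path and its keyword arguments are hypothetical; URI.encode_www_form is used only to escape the parameters, including the '|' in the language pair.

```ruby
require 'uri'

# Build the request path for the translation page.
# encoding_param is the query key recalled in the reply ("en"); it is an
# unverified assumption, so it can be renamed or disabled with nil.
def build_translate_path(text, encoding_param: "en", encoding: "utf-8")
  params = [["langpair", "en|es"], ["hl", "en"], ["text", text]]
  params << [encoding_param, encoding] if encoding_param
  "/translate_t?" + URI.encode_www_form(params)
end

path = build_translate_path("Vehicle")
puts path
```

If the service honours the parameter, the scraped text should already be UTF-8 and can be assigned to the element text without further conversion.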

  3. Ralph Mason

    Ralph Mason Guest

    Josef 'Jupp' SCHUGT wrote:

    > Hi!
    >
    > * Ralph Mason:
    >> I need those strings UTF-8 encoded before I set the text of an
    >> element.
    >
    > IIRC the encoding used by Google defaults to ISO-8859-1 while adding
    > an explicit 'en=utf-8' to the argument part of the URL makes it use
    > utf-8.
    >
    > Josef 'Jupp' SCHUGT

    Thanks for that, I'll give it a go. I had a workaround with

    node.text = str.unpack("C*").pack("U*")

    It would be good if there was some documentation somewhere about text
    conversions and REXML. Or some kind of encoding-aware string class
    that could act as an intermediary.

    Ralph
     
    Ralph Mason, Jan 13, 2004
    #3
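
The byte-level workaround from the post can be sketched as follows, assuming the fetched text is ISO-8859-1. Each ISO-8859-1 byte value equals its Unicode code point, so unpacking the bytes and repacking them as UTF-8 characters performs the conversion. The second helper does the same with the encoding-aware String API available since Ruby 1.9. Both helper names are hypothetical.

```ruby
# Convert ISO-8859-1 bytes to UTF-8 by unpacking each byte ("C*") and
# repacking the values as UTF-8 encoded code points ("U*").
def latin1_to_utf8(bytes)
  bytes.unpack("C*").pack("U*")
end

# Same conversion using String#force_encoding and String#encode (Ruby 1.9+).
def latin1_to_utf8_encode(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

latin1 = "Veh\xEDculo".b          # "Vehículo" as raw ISO-8859-1 bytes
utf8   = latin1_to_utf8(latin1)
puts utf8
```

Either result can be assigned to node.text in a document declared as UTF-8. The conversion is only correct if the input really is ISO-8859-1; applying it to text that is already UTF-8 would double-encode it.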
