Nokogiri SAX parser encoding problem

Discussion in 'Ruby' started by Michel Demazure, Aug 24, 2010.

  1. According to Nokogiri's documentation, it works internally in UTF-8.
    Running this:

    # encoding: utf-8

    require 'nokogiri'

    class MyDoc < Nokogiri::XML::SAX::Document
      def characters(string)
        puts string.encoding
        puts string
      end
    end

    puts RUBY_VERSION
    puts Encoding.default_external

    parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new, 'UTF-8')
    parser.parse('<foo>épée</foo>')

    gives :

    1.9.2
    UTF-8
    UTF-8
    épée

    Why ?
    _md
    --
    Posted via http://www.ruby-forum.com/.
     
    Michel Demazure, Aug 24, 2010
    #1

  2. Ryan Davis wrote:
    >
    > What if you redirect nokogiri's output to a file and view it in whatever
    > you entered the above string in?
    >
    > Chances are it is your terminal, not ruby.


    Yes, Ryan, you are right : writing to a utf-8 file gives the good
    answer.
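
One way to tell a display problem from a real encoding problem is to look at the bytes rather than at what the terminal prints. The following is only a minimal sketch: it uses a literal string instead of parser output, and the file name `out.txt` is a hypothetical placeholder.

```ruby
require 'tmpdir'

# "\u00E9" is the character é; escapes keep this source ASCII-only.
s = "\u00E9p\u00E9e"

puts s.encoding         # UTF-8
puts s.valid_encoding?  # true
p s.bytes               # [195, 169, 112, 195, 169, 101]

# Write the string to a UTF-8 file and read the raw bytes back,
# bypassing whatever the terminal does when displaying them.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'out.txt') # hypothetical file name
  File.open(path, 'w:UTF-8') { |f| f.write(s) }
  raw = File.binread(path)
  puts raw.bytes == s.bytes        # true
end
```

If the bytes are the expected UTF-8 sequence (195, 169 for each "é"), the string is fine and the garbling happens at display time.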

    Actually, in my project, I use the SAX parser to build complex ruby
    objects, which are marshaled to a file, and then used by a Shoes app.
    This app gets the wrong answer. The culprit may therefore be Marshal.
    I'll shift to YAML and report.
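
Before switching serializers, a quick round trip can show whether Marshal alters the strings. This is a sketch under assumptions: a plain Hash stands in for the app's objects, and `words.dump` is a hypothetical file name. The file is opened in binary mode ('wb' / 'rb'), since Marshal data is binary.

```ruby
require 'tmpdir'

# A Hash stands in for the "complex ruby objects" mentioned above.
original = { 'text' => "deuxi\u00E8me" } # "\u00E8" is è

restored = Dir.mktmpdir do |dir|
  path = File.join(dir, 'words.dump')   # hypothetical file name
  # Binary mode on both sides: Marshal output is not text.
  File.open(path, 'wb') { |f| Marshal.dump(original, f) }
  File.open(path, 'rb') { |f| Marshal.load(f) }
end

puts restored['text'] == original['text'] # true
puts restored['text'].encoding            # UTF-8
```

If both checks pass, Marshal kept the content and the encoding tag, and the problem lies elsewhere.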

    _md
     
    Michel Demazure, Aug 24, 2010
    #2

  3. Michel Demazure wrote:
    > Ryan Davis wrote:
    >>
    >> What if you redirect nokogiri's output to a file and view it in whatever
    >> you entered the above string in?
    >>
    >> Chances are it is your terminal, not ruby.

    >
    > Yes, Ryan, you are right : writing to a utf-8 file gives the good
    > answer.
    >


    Alas, no !

    This is strange : when writing to a file :
    1. by luck, for the example I gave ("épée"), I get back "épée"
    correctly,
    2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
    initial bug I discovered in my app).

    This is not the first time I see the "grave accented e" giving trouble
    when scanning or parsing in ruby, whatever tool is used...

    _md


     
    Michel Demazure, Aug 24, 2010
    #3
  4. Michel Demazure wrote:
    > Michel Demazure wrote:


    > 2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
    > initial bug I discovered in my app).
    >
    > This is not the first time I see the "grave accented e" giving trouble
    > when scanning or parsing in ruby, whatever tool is used...
    >

    Sorry for posting again. Actually, in this last example, 'characters' is
    called twice, the first call giving "deuxi", the second one "ème".
    Strange feature, still a bug (?), but one can do with...

    _md


     
    Michel Demazure, Aug 24, 2010
    #4
  5. On Aug 24, 2010, at 06:49, Michel Demazure wrote:

    > Michel Demazure wrote:
    >> Michel Demazure wrote:
    >
    >> 2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
    >> initial bug I discovered in my app).
    >>
    >> This is not the first time I see the "grave accented e" giving trouble
    >> when scanning or parsing in ruby, whatever tool is used...
    >>
    > Sorry for posting again. Actually, in this last example, 'characters' is
    > called twice, the first call giving "deuxi", the second one "ème".
    > Strange feature, still a bug (?), but one can do with...


    Yeah, that last part sounds like a bug. Unfortunately, Aaron Patterson
    is on an airplane for the next 12ish hours as he flies to rubykaigi.
    Mike may be able to help out here... otherwise I suggest you email the
    nokogiri mailing list with a minimal reproduction of the bug.
     
    Ryan Davis, Aug 24, 2010
    #5
  6. Hi,

    On 2010-08-24, at 9:49 AM, Michel Demazure wrote:

    > Michel Demazure wrote:
    >> Michel Demazure wrote:
    >
    >> 2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
    >> initial bug I discovered in my app).
    >>
    >> This is not the first time I see the "grave accented e" giving trouble
    >> when scanning or parsing in ruby, whatever tool is used...
    >>
    > Sorry for posting again. Actually, in this last example, 'characters' is
    > called twice, the first call giving "deuxi", the second one "ème".
    > Strange feature, still a bug (?), but one can do with...


    Actually this is allowed by the XML spec, annoying as it is. Many
    parsers do this when encountering an entity (e.g. &apos;) in the input
    stream (you get three strings, before, entity character, after). Some
    XML parsers have a parameter that tells it to join adjacent strings
    together before reporting a single string. I don't know if Nokogiri
    provides this functionality, but it might be worth a quick peek.

    Cheers,
    Bob



    ----
    Bob Hutchison
    Recursive Design Inc.
    http://www.recursive.ca/
    weblog: http://xampl.com/so
     
    Bob Hutchison, Aug 25, 2010
    #6
  7. Bob Hutchison wrote:
    >
    > Actually this is allowed by the XML spec, annoying as it is. Many
    > parsers do this when encountering an entity (e.g. &apos;) in the input
    > stream (you get three strings, before, entity character, after). Some
    > XML parsers have a parameter that tells it to join adjacent strings
    > together before reporting a single string. I don't know if Nokogiri
    > provides this functionality, but it might be worth a quick peek.
    >


    @Bob : Yes, it is allowed.

    From the nokogiri doc for the 'characters' method :

    "This method might be called multiple times given one contiguous string
    of characters."

    @Ryan : strange as it is, it's a feature. So, IMHO, no bug report.

    Actually, it is very strange. Parsing 'deuxième', you get two calls
    'deuxi' + 'ème', but parsing the more complex 'épée deuxième', you get
    only one ...
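
Since 'characters' may be called several times for one run of text, a common workaround is to buffer the chunks and only use them when the element ends. The sketch below is a plain-Ruby stand-in, not Nokogiri's own API: the class name `BufferingDoc` is hypothetical, and the callbacks are invoked by hand to reproduce the "deuxi" + "ème" split deterministically. With Nokogiri, the class would instead inherit from Nokogiri::XML::SAX::Document (as MyDoc does in the first post) and be passed to the parser. It assumes elements containing only text, no nesting.

```ruby
# Plain-Ruby stand-in for a SAX handler (hypothetical name). With Nokogiri,
# this would be: class BufferingDoc < Nokogiri::XML::SAX::Document
class BufferingDoc
  attr_reader :texts

  def initialize
    @texts = []   # one complete string per element
    @buffer = nil # nil while outside any element
  end

  def start_element(name, attrs = [])
    @buffer = ''.dup # fresh, mutable buffer for this element
  end

  # May be called several times for one contiguous run of text.
  def characters(string)
    @buffer << string if @buffer
  end

  def end_element(name)
    @texts << @buffer if @buffer
    @buffer = nil
  end
end

# Simulate the two calls reported in this thread.
doc = BufferingDoc.new
doc.start_element('foo')
doc.characters('deuxi')
doc.characters("\u00E8me") # "\u00E8" is è
doc.end_element('foo')

puts doc.texts.length       # 1
puts doc.texts.first.length # 8
```

The handler ends up with a single 8-character string, however the parser chose to split the text.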

    Thanks to both of you.
    _md
     
    Michel Demazure, Aug 25, 2010
    #7
