Nokogiri SAX parser encoding problem

Michel Demazure · Aug 24, 2010

According to Nokogiri's doc, it works internally in UTF-8.
Running this :

# encoding: utf-8

require 'nokogiri'

class MyDoc < Nokogiri::XML::SAX::Document
  def characters(string)
    puts string.encoding
    puts string
  end
end

puts RUBY_VERSION
puts Encoding.default_external

parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new, 'UTF-8')
parser.parse('<foo>épée</foo>')

gives :

1.9.2
UTF-8
UTF-8
Ã©pÃ©e

Why ?
_md

Michel Demazure · Aug 24, 2010

Ryan said:
What if you redirect nokogiri's output to a file and view it in whatever
you entered the above string in?

Chances are it is your terminal, not ruby.

Yes, Ryan, you are right: writing to a UTF-8 file gives the correct
answer.
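
A check that does not depend on the terminal at all is to look at the bytes rather than the rendered glyphs. The sketch below is only an illustration: `hex_bytes` is a hypothetical helper, and ISO-8859-1 is an assumed stand-in for whatever code page the console actually uses. It shows that correct UTF-8 for "é" is the two bytes C3 A9, and that reading those bytes as a single-byte encoding yields the "Ã©" pattern.

```ruby
# encoding: utf-8

# Hypothetical helper: hex dump of a string's bytes, independent of how
# the terminal chooses to render them.
def hex_bytes(string)
  string.bytes.map { |b| format('%02X', b) }.join(' ')
end

good = 'épée'

# Simulate a viewer that misreads the UTF-8 bytes as ISO-8859-1 (an
# assumption): relabel the bytes without converting them, then transcode
# the relabelled string to UTF-8.
misread = good.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts hex_bytes(good)     # C3 A9 70 C3 A9 65
puts misread             # Ã©pÃ©e
puts hex_bytes(misread)  # C3 83 C2 A9 70 C3 83 C2 A9 65
```

If the string handed to `characters` dumps as C3 A9 for each "é", the data is fine and only the display is wrong.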

Actually, in my project, I use the SAX parser to build complex ruby
objects, which are marshaled to a file, and then used by a Shoes app.
This app gets the wrong answer. The culprit may therefore be Marshal.
I'll shift to YAML and report.
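
The suspicion about Marshal can also be tested in isolation before switching serializers. This is a minimal sketch that only covers an in-memory round trip of a single string; the project's real objects and its file handling are not shown in the thread and are not modelled here.

```ruby
# encoding: utf-8

original = 'deuxième'

# Round-trip the string through Marshal, entirely in memory.
restored = Marshal.load(Marshal.dump(original))

puts restored.encoding                           # UTF-8
puts restored == original                        # true
puts restored.bytes.to_a == original.bytes.to_a  # true
```

If the in-memory round trip is clean but the data read back from disk is not, the way the file is opened (binary versus text mode) would be a reasonable next thing to check.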

_md

Michel Demazure · Aug 24, 2010

Michel said:
Yes, Ryan, you are right: writing to a UTF-8 file gives the correct
answer.

Alas, no !

This is strange: when writing to a file:
1. by luck, for the example I gave ("épée"), I get back "épée"
correctly,
2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
initial bug I discovered in my app).

This is not the first time I see the "grave accented e" giving trouble
when scanning or parsing in ruby, whatever tool is used...

_md

Michel Demazure · Aug 24, 2010

Michel said:
Michel Demazure wrote:

2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
initial bug I discovered in my app).

This is not the first time I see the "grave accented e" giving trouble
when scanning or parsing in ruby, whatever tool is used...

Sorry for posting again. Actually, in this last example, 'characters' is
called twice, the first call giving "deuxi", the second one "ème".
Strange feature, still a bug (?), but one can do with...

_md

Ryan Davis · Aug 24, 2010

Sorry for posting again. Actually, in this last example, 'characters' is
called twice, the first call giving "deuxi", the second one "ème".
Strange feature, still a bug (?), but one can do with...

Yeah, that last part sounds like a bug. Unfortunately, Aaron Patterson
is on an airplane for the next 12ish hours as he flies to RubyKaigi.
Mike may be able to help out here... otherwise I suggest you email the
nokogiri mailing list with a minimal reproduction of the bug.

Bob Hutchison · Aug 25, 2010

Hi,

Sorry for posting again. Actually, in this last example, 'characters' is
called twice, the first call giving "deuxi", the second one "ème".
Strange feature, still a bug (?), but one can do with...

Actually this is allowed by the XML spec, annoying as it is. Many
parsers do this when encountering an entity (e.g. &apos;) in the input
stream (you get three strings, before, entity character, after). Some
XML parsers have a parameter that tells it to join adjacent strings
together before reporting a single string. I don't know if Nokogiri
provides this functionality, but it might be worth a quick peek.

Cheers,
Bob

Michel Demazure · Aug 25, 2010

Bob said:
Actually this is allowed by the XML spec, annoying as it is. Many
parsers do this when encountering an entity (e.g. &apos;) in the input
stream (you get three strings, before, entity character, after). Some
XML parsers have a parameter that tells it to join adjacent strings
together before reporting a single string. I don't know if Nokogiri
provides this functionality, but it might be worth a quick peek.

@Bob : Yes, it is allowed.

From the nokogiri doc for the 'characters' method :

"This method might be called multiple times given one contiguous string
of characters."

@Ryan : strange as it is, it's a feature. So, IMHO, no bug report.

Actually, it is very strange. Parsing 'deuxième', you get two calls
'deuxi' + 'ème', but parsing the more complex 'épée deuxième', you get
only one ...

Thanks to both of you.
_md
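
Since `characters` may deliver one run of text in several pieces, a handler can join the pieces itself. The following is a sketch under stated assumptions, not a Nokogiri feature: `JoinedText`, `texts` and `Handler` are hypothetical names, and the callback sequence is replayed by hand so the example runs without the gem. It buffers in `characters` and flushes in `end_element`, so it only suits elements that contain text and no child elements.

```ruby
# encoding: utf-8

# Hypothetical mixin: joins every `characters` chunk reported between a
# start tag and its end tag into one string per element.
module JoinedText
  def texts
    @texts ||= []
  end

  def start_element(name, attrs = [])
    @buffer = ''.dup              # fresh, unfrozen buffer for this element
  end

  def characters(string)
    @buffer << string if @buffer  # ignore text outside any element
  end

  def end_element(name)
    texts << @buffer if @buffer
    @buffer = nil
  end
end

# Stand-in handler so the sketch runs without Nokogiri installed. With
# Nokogiri, the class would instead be declared as
#   class Handler < Nokogiri::XML::SAX::Document
class Handler
  include JoinedText
end

# Replay the calls reported in the thread for <foo>deuxième</foo>.
handler = Handler.new
handler.start_element('foo')
handler.characters('deuxi')
handler.characters('ème')
handler.end_element('foo')

puts handler.texts.first  # deuxième
```

With the real parser, an instance of such a handler would be passed to `Nokogiri::XML::SAX::Parser.new` exactly as `MyDoc.new` is in the first post, and `texts` read after `parse` returns.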
