Nokogiri SAX parser encoding problem

Michel Demazure

According to Nokogiri's documentation, it works internally in UTF-8.
Running this:

# encoding: utf-8

require 'nokogiri'

class MyDoc < Nokogiri::XML::SAX::Document
  def characters(string)
    puts string.encoding
    puts string
  end
end

puts RUBY_VERSION
puts Encoding.default_external

parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new, 'UTF-8')
parser.parse('<foo>épée</foo>')

gives:

1.9.2
UTF-8
UTF-8
épée

Why?
_md
 
Michel Demazure

Ryan said:
What if you redirect nokogiri's output to a file and view it in whatever
you entered the above string in?

Chances are it is your terminal, not ruby.

Yes, Ryan, you are right: writing to a UTF-8 file gives the correct
answer.
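
One way to tell a terminal rendering problem from real corruption is to
look at the string's bytes and to round-trip it through a file opened
explicitly as UTF-8. A minimal sketch, where `describe` is a hypothetical
helper and the temporary file name is arbitrary:

```ruby
require 'tmpdir'

# Hypothetical helper: report facts about a string that do not depend on
# how the terminal renders it.
def describe(string)
  {
    encoding: string.encoding.name,  # what Ruby believes the encoding is
    valid: string.valid_encoding?,   # whether the bytes are well formed
    hex: string.unpack('H*').first   # the raw bytes, as hexadecimal
  }
end

# "épée", written with \u escapes so that the encoding of this source
# file does not matter.
word = "\u00E9p\u00E9e"
info = describe(word)

# Write the string to a file opened as UTF-8 and read it back.
read_back = Dir.mktmpdir do |dir|
  path = File.join(dir, 'out.txt')
  File.open(path, 'w:UTF-8') { |f| f.write(word) }
  File.open(path, 'r:UTF-8') { |f| f.read }
end

puts info.inspect
puts read_back == word
```

If the hex dump reads c3a970c3a965, the string holds the correct UTF-8
bytes for "épée", and any garbling comes from whatever displays it.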

Actually, in my project, I use the SAX parser to build complex ruby
objects, which are marshaled to a file, and then used by a Shoes app.
This app gets the wrong answer. The culprit may therefore be Marshal.
I'll shift to YAML and report.

_md
 
Michel Demazure

Michel said:
Yes, Ryan, you are right: writing to a UTF-8 file gives the correct
answer.

Alas, no!

This is strange. When writing to a file:
1. by luck, for the example I gave ("épée"), I get back "épée"
correctly,
2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
initial bug I discovered in my app).

This is not the first time I have seen the grave-accented "e" giving
trouble when scanning or parsing in Ruby, whatever tool is used...

_md
 
Michel Demazure

Michel said:
2. but when parsing "<foo>deuxième</foo>", I get "ème" (this was the
initial bug I discovered in my app).

This is not the first time I have seen the grave-accented "e" giving
trouble when scanning or parsing in Ruby, whatever tool is used...

Sorry for posting again. Actually, in this last example, 'characters' is
called twice, the first call giving "deuxi", the second one "ème".
Strange feature, still a bug (?), but one can live with it...

_md
 
Ryan Davis

Michel said:
Sorry for posting again. Actually, in this last example, 'characters' is
called twice, the first call giving "deuxi", the second one "ème".
Strange feature, still a bug (?), but one can live with it...

Yeah, that last part sounds like a bug. Unfortunately, Aaron Patterson
is on an airplane for the next 12ish hours as he flies to RubyKaigi.
Mike may be able to help out here... otherwise I suggest you email the
Nokogiri mailing list with a minimal reproduction of the bug.
 
Bob Hutchison

Hi,

Michel said:
Sorry for posting again. Actually, in this last example, 'characters' is
called twice, the first call giving "deuxi", the second one "ème".
Strange feature, still a bug (?), but one can live with it...

Actually this is allowed by the XML spec, annoying as it is. Many
parsers do this when encountering an entity (e.g. &apos;) in the input
stream (you get three strings, before, entity character, after). Some
XML parsers have a parameter that tells it to join adjacent strings
together before reporting a single string. I don't know if Nokogiri
provides this functionality, but it might be worth a quick peek.

Cheers,
Bob
 
Michel Demazure

Bob said:
Actually this is allowed by the XML spec, annoying as it is. Many
parsers do this when encountering an entity (e.g. &apos;) in the input
stream (you get three strings, before, entity character, after). Some
XML parsers have a parameter that tells it to join adjacent strings
together before reporting a single string. I don't know if Nokogiri
provides this functionality, but it might be worth a quick peek.

@Bob: Yes, it is allowed.

From the Nokogiri documentation for the 'characters' method:

"This method might be called multiple times given one contiguous string
of characters."
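
Given that, a handler can collect the chunks itself and join them when
the element ends. A minimal sketch: `ChunkJoiner` and its `texts` reader
are hypothetical names, not part of Nokogiri, and the callbacks are driven
by hand so the example runs without the gem. It assumes elements that
contain only text (no mixed content).

```ruby
# Hypothetical mixin (not part of Nokogiri): joins the pieces of text that
# a SAX parser may deliver through several calls to `characters`.
module ChunkJoiner
  # Complete text runs seen so far, one entry per run.
  def texts
    @texts ||= []
  end

  # May be called several times for one contiguous run of text.
  def characters(string)
    (@chunks ||= []) << string
  end

  # An element boundary ends the current run, so flush it.
  def start_element(name, attrs = [])
    flush_chunks
  end

  def end_element(name)
    flush_chunks
  end

  private

  # Join the pending chunks into one string and record it.
  def flush_chunks
    return if @chunks.nil? || @chunks.empty?
    texts << @chunks.join
    @chunks = []
  end
end

# Stand-in handler so the sketch runs on its own.
class Handler
  include ChunkJoiner
end

handler = Handler.new
handler.start_element('foo')
handler.characters('deuxi')      # first call, as reported in this thread
handler.characters("\u00E8me")   # second call: "ème"
handler.end_element('foo')
puts handler.texts.inspect
```

With Nokogiri, the same module could presumably be included in a
Nokogiri::XML::SAX::Document subclass such as MyDoc above, so that the
split between 'deuxi' and 'ème' never reaches the rest of the application.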

@Ryan: strange as it is, it's a feature. So, IMHO, no bug report.

Actually, it is very strange. Parsing 'deuxième', you get two calls,
'deuxi' + 'ème', but parsing the more complex 'épée deuxième', you get
only one...

Thanks to both of you.
_md
 
