libxml's SaxParser and UTF-8 problem

Peter Higgins

I've written a small script to parse an XML doc with SaxParser, and
everything goes well until the parser encounters a non-ASCII character.
For example, in the following snippet:

<key>Name</key><string>90's Music</string>

In case it doesn't come through correctly, the "'" character above is an
apostrophe, represented as <E2><80><99> when I view the xml with less.

When the on_characters method is called for the string "90's Music", the
buffer only contains "90", with no error or warning being presented.
After this, parsing continues normally; I first noticed the bug when
some of my strings came out truncated. Is there some libxml or Ruby
setting I've overlooked that causes this behavior?
 
Peter Higgins

As part of researching the problem, I wrote a small test script with
REXML looking for that particular string, and it returned the correct,
full quote: "90’s Music". It looks like this is a bug with libxml then,
so I'll post on their mailing list.
 
Jenda Krynicky

Any chance the quote is passed to another call to on_characters? I do
believe SAX does not always return all the content of a tag in one call
to the handler, but sometimes calls the handler several times and you
have to put it all together yourself.
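
A minimal sketch of that "put it together yourself" approach, in case it helps. The callback names (on_start_element, on_characters, on_end_element) are meant to mirror the SAX callbacks the script already uses, but the StringCollector class is hypothetical and the parser is only simulated here by calling the handler directly, so this is an assumption about how the text might be split, not a confirmed diagnosis:

```ruby
# Hypothetical handler: it has no dependency on libxml, so it runs on its own.
# In a real script these methods would be the SAX callbacks.
class StringCollector
  attr_reader :strings

  def initialize
    @strings = []
    @buffer = nil
  end

  # Start a fresh buffer whenever a <string> element opens.
  def on_start_element(name, _attributes)
    @buffer = +"" if name == "string"
  end

  # Append every chunk; the parser may call this more than once per element.
  def on_characters(chars)
    @buffer << chars if @buffer
  end

  # Only when the element closes is the text known to be complete.
  def on_end_element(name)
    return unless name == "string" && @buffer
    @strings << @buffer
    @buffer = nil
  end
end

# Simulate a parser that delivers the text in two separate calls.
handler = StringCollector.new
handler.on_start_element("string", {})
handler.on_characters("90")
handler.on_characters("’s Music")
handler.on_end_element("string")
p handler.strings
```

If the real handler only looks at the first on_characters call for each element, it would see "90" and miss the rest, which would match the truncation described.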

Of course it could be Ruby unable to handle the UTF-8.

Jenda
 
