REXML: parsing a string with unescaped ampersand entities

Frank Reiff

Hi,

REXML seems to SOMETIMES choke on parsing ampersands within entities,
e.g.

require 'rexml/document'

string = '<?xml version="1.0" encoding="UTF-8"?><hello>hello&world</hello>'
doc = REXML::Document.new(string)
puts "#{doc}"

works fine (output below):
<?xml version='1.0' encoding='UTF-8'?><hello>hello&world</hello>

BUT:

string = '<?xml version="1.0" encoding="UTF-8"?><hello>hello& world</hello>'
doc = REXML::Document.new(string)
puts "#{doc}"

crashes out with:

REXML::ParseException: #<RuntimeError: Illegal character '&' in raw string "hello& world">
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/text.rb:91:in `initialize'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:43:in `new'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:43:in `parse'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:190:in `build'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:45:in `initialize'
/Users/frankreiff/Live Developments/ruby/analyze/xml_parser.rb:102:in `new'
/Users/frankreiff/Live Developments/ruby/analyze/xml_parser.rb:102 ...
Illegal character '&' in raw string "hello& world"
Line:
Position:
Last 80 unconsumed characters:
</hello>

The only difference is the space after the &.

What is going on, and how can I fix this?

Best regards,

Frank
 
Bob Hutchison

Hi,

> Hi,
>
> REXML seems to SOMETIMES choke on parsing ampersands within entities,
> e.g.
>
> string = '<?xml version="1.0" encoding="UTF-8"?><hello>hello&world</hello>'
> doc = Document.new(string)
> puts "#{doc}"
>
> works fine (output below):
> <?xml version='1.0' encoding='UTF-8'?><hello>hello&world</hello>
>
> BUT:
>
> string = '<?xml version="1.0" encoding="UTF-8"?><hello>hello& world</hello>'
> doc = Document.new(string)
> puts "#{doc}"
>
> [snip]
> What is going on? and how can I fix this?

Neither is legal XML, both should fail. You can either escape the
content or use a CDATA block.
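
Both options can be tried directly in REXML. The snippet below is only a minimal sketch and not something posted in the thread; the variable names are made up for the illustration.

```ruby
require 'rexml/document'

# Option 1: write the ampersand as the predefined entity &amp;
escaped = '<?xml version="1.0" encoding="UTF-8"?><hello>hello &amp; world</hello>'

# Option 2: wrap the text in a CDATA section, where & needs no escaping
cdata = '<?xml version="1.0" encoding="UTF-8"?><hello><![CDATA[hello & world]]></hello>'

escaped_text = REXML::Document.new(escaped).root.text
cdata_text   = REXML::Document.new(cdata).root.text

# Both variants parse, and both yield the same text content.
puts escaped_text
puts cdata_text
```

Either way the application sees a plain "&" in the element text; only the serialized form differs.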

Cheers,
Bob
 
Frank Reiff

> Neither is legal XML, both should fail. You can either escape the
> content or use a CDATA block.

You're of course right. Both are illegal.

Somebody suggested to me that the original problem might be caused by
incorrectly encoded entities (&amp; &quot;), and reading through the W3C
spec (always a bad idea) confused me to the extent of believing that
you only had to encode character entities in attribute values, which
isn't the case. It can't in fact be the case, otherwise the parser couldn't
differentiate between a "normal" ampersand and the beginning of a
character entity.

Which brings me back to my original problem of receiving a truncated XML
message as an HTTP POST (see my previous question). This ONLY HAPPENS when
there is an ampersand somewhere in the message.

Could it be that CGI.params behaves differently when there is an
ampersand in the request, e.g. it tries to parse the request into
key/value pairs and returns a hash rather than a simple string in that
case!?

I think I might be on to something there..
 
Frank Reiff

> I think I might be on to something there..

OK, it was in fact precisely that. When I call

cgi.params.to_s

on a message without an ampersand, I get the correctly formatted XML
message, but when there is an & in the message, the same call

cgi.params.to_s

produces erratic output.

This is of course because:

The method params() returns a hash of all parameters in the request as
name/value-list pairs, where the value-list is an Array of one or more
values. The CGI object itself also behaves as a hash of parameter names
to values, but only returns a single value (as a String) for each
parameter name.

The output is therefore a fluke that's solely based on the fact that
there is only one parameter.
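
The splitting can be reproduced outside a web server. The sketch below uses URI.decode_www_form from the standard library as a stand-in for the form decoding that CGI applies; that the two split in the same way is an assumption, and the sample bodies are invented for the illustration.

```ruby
require 'uri'

# Hypothetical POST bodies: the same XML with and without a bare ampersand.
without_amp = '<hello>cheese and coffee</hello>'
with_amp    = '<hello>cheese & coffee</hello>'

# Form decoding treats "&" as the separator between parameters.
one_pair  = URI.decode_www_form(without_amp)
two_pairs = URI.decode_www_form(with_amp)

puts one_pair.length         # the whole body ends up as a single "parameter"
puts two_pairs.length        # the body has been cut at the "&"
p two_pairs.map(&:first)     # the two fragments, now separate parameter names
```

With one parameter, gluing the hash back together happens to look like the original message; with two, the pieces can no longer be relied on to come back in a usable form.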

Now my FINAL question to all the Ruby gurus:

* How do I get the POST-ed message body without any clever splitting
into key/value pairs!?
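
One possible approach, offered as a sketch rather than a confirmed answer from the thread: under the CGI convention the web server passes the request body on standard input and its size in the CONTENT_LENGTH environment variable, so the body can be read directly. The method name raw_post_body and its parameters are hypothetical, and I believe this has to run before CGI.new, since that may consume standard input itself.

```ruby
require 'stringio'

# Read the raw request body from the stream a web server hands to a CGI
# script. +input+ and +env+ default to the real stdin and environment; they
# are parameters only so the method can be tried without a web server.
def raw_post_body(input = $stdin, env = ENV)
  length = env['CONTENT_LENGTH'].to_i
  return '' unless length > 0
  input.read(length).to_s
end

# Simulated request, for illustration only.
body = '<hello>cheese & coffee</hello>'
fake_env = { 'CONTENT_LENGTH' => body.bytesize.to_s }
puts raw_post_body(StringIO.new(body), fake_env)
```

The string returned here is untouched by any key/value parsing, so it can be handed to the XML parser as is (and will still need a well-formed document to parse).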
 
Engine Yard

Did anyone actually find a solution to this issue?

I have an ActiveResource object which returns some XML content (as
the result of a web service call) with an unescaped "& ", and this causes
the same issue.

I am wondering if we can customize the call to
REXML::Text.initialize() and set raw=>false.

I tried changing that setting in the REXML library that comes with the
Ruby distribution, but that did not help.

Is there a way to overcome this issue without patching any Ruby or
Rails code? I cannot change the content on the web service server.


Thanks,
Maruthy.
 
Eric Hodel

> Did anyone actually find out a solution to this issue.
>
> I have an ActiveResource object which returns back some xml content
> (as a result of a web service call) with unescaped "& " and this
> causes the same issue.

These two statements are contradictory. It can't both be XML and have
unescaped &.

> Wondering if we can customize the method call to
> REXML::Text.initialize() and set raw=>false.
>
> Tried to change the setting in the REXML library that comes with the
> Ruby distribution, but that did not help.
>
> Is there a way to overcome this issue without patching any of Ruby or
> Rails code. I cannot change the content on the web service server.


Use a parser that handles errors:

$ ruby -rubygems -e 'require "nokogiri"; d = Nokogiri::XML said:
</foo>"; p d.errors; p d'
[#<Nokogiri::XML::SyntaxError: xmlParseEntityRef: no name
<?xml version="1.0"?>
<foo>
<bar/>
</foo>

$

PS: The name on your email account is odd.
 
Engine Yard

Sorry, maybe I was unclear. What happens is this: the XML that is sent
back contains something like "cheese & coffee", which breaks REXML, and
since the call to REXML is made from inside the ActiveResource object
itself, we are unable to customize how REXML's initialize method is
called.

When I changed the line "cheese & coffee" to "cheese &amp; coffee",
everything worked fine. So I am pretty sure this issue is caused by the
unescaped ampersand contained in the XML. But I cannot modify the XML
content, as it comes to me from a third-party source over which we have
no control.

Here is the stack trace, which might help clear things up:

<code>
--- !ruby/exception:REXML::ParseException
message: |-
#<RuntimeError: Illegal character '&' in raw string "cheese & coffee
">
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/text.rb:91:in `initialize'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:43:in `new'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/parsers/treeparser.rb:43:in `parse'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:227:in `build'
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
/Users/mxx/.gem/ruby/1.8/gems/activesupport-2.3.2/lib/active_support/xml_mini/rexml.rb:17:in `new'
/Users/mxx/.gem/ruby/1.8/gems/activesupport-2.3.2/lib/active_support/xml_mini/rexml.rb:17:in `parse'
(__DELEGATION__):2:in `__send__'
(__DELEGATION__):2:in `parse'
/Users/mxx/.gem/ruby/1.8/gems/activesupport-2.3.2/lib/active_support/core_ext/hash/conversions.rb:154:in `from_xml'
/Library/Ruby/Gems/1.8/gems/activeresource-2.3.2/lib/active_resource/formats/xml_format.rb:19:in `decode'
/Library/Ruby/Gems/1.8/gems/activeresource-2.3.2/lib/active_resource/connection.rb:116:in `get'
/Library/Ruby/Gems/1.8/gems/activeresource-2.3.2/lib/active_resource/base.rb:587:in `find_one'
/Library/Ruby/Gems/1.8/gems/activeresource-2.3.2/lib/active_resource/base.rb:522:in `find'
</code>


How can we change which parser Rails uses? I would definitely
like to try out Nokogiri, but I am unsure how to make it work with
Rails (2.3.2), and specifically with ActiveResource.

Eric Hodel

> Sorry, may be I was unclear.

I understood.

> What happens is: the xml that is sent back contains something like
> "cheese & coffee"

Right. This is not XML because & is not escaped.

> so this breaks REXML, and since the call to REXML's method is made
> from the ActiveResource object itself we are unable to customize how
> REXML's initialize method is called.
>
> When I modified the line "cheese & coffee" to "cheese &amp; coffee"
> everything works fine

Yep, this is escaped and valid for XML. You should file a bug with
the website you're consuming telling them they have broken XML output.

That doesn't help you solve your problem though! :)

> So I am pretty sure this issue is being caused because of the
> unescaped ampersand that is contained in the xml. But I cannot
> modify the xml content as it comes to me from a third party source
> over which we have no control.
>
> Here is the stack trace that might help clearing up things:
>
> <code>
> --- !ruby/exception:REXML::ParseException
> message: |-
> #<RuntimeError: Illegal character '&' in raw string "cheese &
> coffee">
> [...]
> </code>

Yes. ActiveResource and REXML are behaving correctly. They're not
really sure what to do with text you've given them as it's not XML.

Since you have to deal with reality, you'll want a forgiving XML
parser that can handle some invalid XML when you really need to.

> How can we change what parser is being used by Rails. I would
> definitely like to try out Nokogiri but am unsure about how to make
> it work with Rails (2.3.2) and in specific with ActiveResource though.

I'm not sure either. You could either use a different tool than
ActiveResource or try contacting the ActiveResource maintainers for
help in adding the option of being forgiving of invalid XML.
(Nokogiri is good at correcting invalid XML, so I suggested it.)
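
If switching parsers is not an option, another workaround (not proposed in the thread, so treat it as an assumption-laden sketch) is to escape stray ampersands in the response text before a strict parser such as REXML sees it. The helper name escape_bare_ampersands and the sample input are hypothetical, and where to hook this into ActiveResource is left open.

```ruby
require 'rexml/document'

# Hypothetical helper: escape every "&" that does not already start an
# entity or character reference such as &amp; &#38; or &#x26;
def escape_bare_ampersands(xml)
  bare_ampersand = /&(?!(?:[A-Za-z_][\w.-]*|#\d+|#x\h+);)/
  xml.gsub(bare_ampersand, '&amp;')
end

broken = '<item>cheese & coffee</item>'
fixed  = escape_bare_ampersands(broken)

puts fixed
puts REXML::Document.new(fixed).root.text
```

This is a blunt instrument: it would also rewrite an ampersand inside a CDATA section, where a bare "&" is legal, so it only suits feeds known not to use CDATA.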
 
