Turning a non-ASCII character into an XML entity with REXML?

Francis Hwang

I asked this a little while back but maybe didn't ask the right way, so
maybe somebody can help me if I rephrase:

I'm trying to build an RSS feed that takes, in its item descriptions,
ISO-8859-1 text. (I'm using REXML for now.) I'd like to be able to take
a non-ASCII character and turn it into a usable XML entity. So, for
example, "\251" would get turned into "&#169;":

str = "\251 2004 Francis Hwang"
elt = REXML::Element.new( 'elt' )
elt.text = str
elt.to_s
=> "<elt>\251 2004 Francis Hwang</elt>"
# But I want "<elt>&#169; 2004 Francis Hwang</elt>"

Is there some sort of setting I can twiddle in REXML so that I can
assign a text that includes these sorts of characters, and REXML will
know to turn them into entities on output? I know I can do this by hand
and then prevent escaping by using the :raw flag, but I'd like to avoid
that if possible.

Francis
 
Patrick May

| I asked this a little while back but maybe didn't ask the right way,
| so maybe somebody can help me if I rephrase:
|
| I'm trying to build an RSS feed that takes, in its item descriptions,
| ISO-8859-1 text. (I'm using REXML for now.) I'd like to be able to
| take a non-ASCII character and turn it into a usable XML entity. So,
| for example, "\251" would get turned into "&#169;":
|
| str = "\251 2004 Francis Hwang"
| elt = REXML::Element.new( 'elt' )
| elt.text = str
| elt.to_s
| => "<elt>\251 2004 Francis Hwang</elt>"
| # But I want "<elt>&#169; 2004 Francis Hwang</elt>"
|
| Is there some sort of setting I can twiddle in REXML so that I can
| assign a text that includes these sorts of characters, and REXML will
| know to turn them into entities on output? I know I can do this by
| hand and then prevent escaping by using the :raw flag, but I'd like to
| avoid that if possible.

I think there's an escapeHTML method on the CGI class that might do it.
Of course, it will also convert < and > into &lt; and &gt;. You could
still lift the code from there.

~ pat
 
Francis Hwang

I just tried; it doesn't do it.

irb(main):004:0> CGI.escapeHTML( "<br>")
=> "&lt;br&gt;"
irb(main):005:0> CGI.escapeHTML( "<br>\251")
=> "&lt;br&gt;\251"
 
Brian Candler

| I'm trying to build an RSS feed that takes, in its item descriptions,
| ISO-8859-1 text. (I'm using REXML for now.) I'd like to be able to take
| a non-ASCII character and turn it into a usable XML entity. So, for
| example, "\251" would get turned into "&#169;"

Not exactly what you're asking for, but you could use Iconv to convert
ISO-8859-1 into UTF-8. It should be perfectly legal to include UTF-8
characters directly in XML, without turning them into character entities.
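
For anyone reading this on a current Ruby: Iconv was removed from the standard library in Ruby 2.0, and String#encode covers the same conversion. A minimal sketch of the ISO-8859-1 to UTF-8 step, assuming the input arrives as raw ISO-8859-1 bytes (the method name latin1_to_utf8 is made up for this example):

```ruby
# Hypothetical helper: reinterpret raw bytes as ISO-8859-1, then transcode
# them to UTF-8. The argument is copied so the caller's string is untouched.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# "\xA9" is the same byte as octal "\251", the copyright sign.
utf8 = latin1_to_utf8("\xA9 2004 Francis Hwang")
puts utf8
```

The result is an ordinary UTF-8 string, so it can be assigned as element text in a document that declares UTF-8 without any entity references.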

Alternatively, if it's sufficient to convert characters 160-255 straight
into numeric entity refs (which works if the top half of ISO-8859-1 maps
directly into Unicode, as I think it does), then how about

a = "Copyright \251 2004"
a.gsub!(/[\240-\377]/) { |c| "&#%d;" % c[0] }

# => "Copyright &#169; 2004"
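
The gsub above relies on Ruby 1.8, where c[0] returns the byte value as an integer. From Ruby 1.9 on, c[0] is a one-character string and strings carry an encoding, so the same idea needs a small adjustment. A rough sketch for current Ruby, assuming the input holds raw ISO-8859-1 bytes (latin1_to_entities is a name invented here, not part of REXML):

```ruby
# Hypothetical helper: replace every character in the upper half of
# ISO-8859-1 (code points 160..255) with a decimal character reference.
def latin1_to_entities(bytes)
  # Tag the bytes as ISO-8859-1 and transcode, so the regexp below runs
  # against a valid UTF-8 string.
  utf8 = bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
  # ISO-8859-1 byte values equal Unicode code points, so String#ord
  # yields the number to put in the reference.
  utf8.gsub(/[\u00A0-\u00FF]/) { |c| "&#%d;" % c.ord }
end

puts latin1_to_entities("Copyright \xA9 2004")
```

The output is pure ASCII, so it is safe in a feed whatever encoding the document declares. It still has to be handed to REXML as raw text, as mentioned at the top of the thread, or the ampersand gets escaped again.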

Regards,

Brian.