Escaping non-ASCII chars for RTF export

Dan Herrera

Greetings,

I'm attempting to convert non-ASCII characters to Unicode escape
sequences for export to RTF, and I haven't had much luck finding any
good information searching Google. Does anyone here have any good
resources for this sort of thing?

Thanks!

dan
 
Dan Herrera

http://ruby-rtf.rubyforge.org/

Ruby RTF library. Creates RTF documents... might be a good start.

Hi, thanks for taking a look at my problem.

I am currently using the Ruby RTF library to generate RTF files. The
trouble I'm running into is with strings like 'gør'. When you add
that ø character, it doesn't get converted to its Unicode counterpart
and the result is mangled when viewed.

Thanks again for your help,

dan
 
bbxx789_05ss

Dan said:
Hi, thanks for taking a look at my problem.

I am currently using the Ruby RTF library to generate RTF files. The
trouble I'm running into is with strings like 'gør'. When you add
that ø character, it doesn't get converted to its Unicode counterpart
and the result is mangled when viewed.

A Unicode character has to be converted into a byte format (called an
'encoding') that your display device can understand before the character
can be displayed. Common encodings are ASCII and UTF-8. It sounds like
the string you are starting with is in an encoding that your display
device doesn't understand.

Therefore, you need to figure out which encoding your display device
does understand. UTF-8 is pretty common, so you can start off by trying
to convert your strings to UTF-8 and then see if they display
correctly. But to convert your strings to UTF-8, you need to know the
encoding the string is currently in. If you don't know it, you can start
off by trying ISO-8859-15. The characters that make up ISO-8859-15 are
listed here:

http://en.wikipedia.org/wiki/ISO_8859-15


To convert from ISO-8859-15 to utf-8, you can do this:

str = "Hell\xf6 w\xf6rld" #\xf6 is 'o' with umlaut in ISO-8859-15
puts str

--output (which my display device shows me):--
Hell? w?rld #I see question marks instead of o's with umlauts

Therefore, my display device does not understand the ISO-8859-15
encoding. Since I want my display device to display the o's
with umlauts, I'll try converting the string to UTF-8:

require 'iconv' #'Internationalization converter'?

converter = Iconv.new('UTF-8', 'ISO-8859-15')
new_str = converter.iconv(str)
puts new_str

--output:--
Hellö wörld #I see o's with umlauts
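
If you're not sure which encoding a string is in, looking at its raw bytes may help. A minimal sketch, using the ø from the original question (the variable names are only for illustration): in ISO-8859-15 it is the single byte 0xF8, while in UTF-8 it is the two bytes 0xC3 0xB8.

```ruby
# The same word 'gør' written out byte by byte in two encodings.
latin_str = "g\xf8r"      # ISO-8859-15: ø is one byte
utf8_str  = "g\xc3\xb8r"  # UTF-8: ø is two bytes

# 'C*' reads every byte as an unsigned integer.
latin_bytes = latin_str.unpack('C*')
utf8_bytes  = utf8_str.unpack('C*')

p latin_bytes  # [103, 248, 114]
p utf8_bytes   # [103, 195, 184, 114]
```

A byte of 195 (0xC3) followed by another byte above 127 is a fairly strong hint that the string is already UTF-8.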
 
Dan Herrera

Hi,

This is great information, it's really helped me move in the right
direction. I haven't done enough testing yet, but here is what has
seemed to work.

Using an Iconv solution, where str is the string to convert:

require 'iconv'
converter = Iconv.new('ISO-8859-15', 'UTF-8')
converted_str = converter.iconv(str)

So a little backwards from what we were thinking. Looks like swapping
UTF-8 and ISO-8859-15 did the trick since it appears that the string
was in UTF-8 to begin with.
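
Note that Iconv was dropped from Ruby's standard library in 2.0. On Ruby 1.9 and later, String#encode should do the same job. A sketch, assuming the input really is UTF-8 as it turned out to be here:

```ruby
# Assumes str holds UTF-8 bytes, as it did in this thread.
str = "g\xc3\xb8r"

# encode(destination, source): transcode from UTF-8 to ISO-8859-15.
converted_str = str.encode('ISO-8859-15', 'UTF-8')

p converted_str.unpack('C*')  # [103, 248, 114]
```

Characters that have no slot in ISO-8859-15 raise Encoding::UndefinedConversionError, so this only covers text that fits in that character set.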

Thanks!

dan
 
bbxx789_05ss

Dan said:
This is great information, it's really helped me move in the right
direction.

Thanks!

There is one missing piece to the puzzle. This is what happens behind
the scenes when you convert from a string written in UTF-8 format to a
string written in ISO-8859-15 format:

UTF-8 encoded character
|
|
V
Unicode integer
|
|
V
ISO-8859-15 encoded character


If for some reason you ever need to get the Unicode integer (the 'code
point'), you can do this:

str = "\xc3\xb6" #'o' with umlaut encoded in utf-8
# 'U' gets the Unicode integer from a char encoded in *UTF-8* only
arr = str.unpack('U')

p arr #[246] --> unicode in decimal format

Since unicode integers are usually written in hex format, you can do the
following to get the unicode in hex format:

puts "%04x" % arr[0] #00f6
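
Coming back to the original question: RTF writes a non-ASCII character as \uN followed by a fallback character, where N is the code point as a signed 16-bit decimal number (values above 32767 are written as negative numbers). Here is a rough sketch built on unpack('U*'). rtf_unicode_escape is a made-up helper name, not something from the Ruby RTF library, and it assumes the input is UTF-8 and that the default of one fallback character is in effect.

```ruby
# Hypothetical helper: turn a UTF-8 string into ASCII-only RTF text
# using \uN? escapes. Not part of the Ruby RTF library.
def rtf_unicode_escape(utf8_str)
  utf8_str.unpack('U*').map do |cp|
    if cp < 0x80
      # Plain ASCII; backslash and braces are RTF control characters.
      ch = cp.chr
      ['\\', '{', '}'].include?(ch) ? "\\#{ch}" : ch
    else
      # Code points above U+FFFF are split into a UTF-16 surrogate pair.
      units = if cp > 0xFFFF
                v = cp - 0x10000
                [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
              else
                [cp]
              end
      # \uN takes a signed 16-bit decimal; '?' is the fallback character.
      units.map { |u| "\\u#{u > 0x7FFF ? u - 0x10000 : u}?" }.join
    end
  end.join
end

puts rtf_unicode_escape("g\xc3\xb8r")  # g\u248?r
```

This produces raw RTF text. A library that escapes backslashes itself would need the result inserted some other way, so treat it as a starting point.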
 
