character encoding question

Amishera Amishera · Mar 26, 2010

I have an html file which is encoded in UTF-8. The file contains the
following text:

It&#39;s a wonderful life

Now, character code 39 is the apostrophe in UTF-8. So suppose I extract
the 39 from the text using:

s="It&#39;s a wonderful life"

s.gsub(/&#(\d+);/, '\1')

The output is

It39s a wonderful life

So first, I am having trouble turning that into

It\39s a wonderful life

Second, I manually put this in test_utf8.rb:

puts "It\39s a wonderful life"

and ran it

ruby test_utf8.rb > utf8.txt

But when I open the file in OpenOffice with the encoding set to UTF-8,
the output is

It#9s a wonderful life

So how do I correctly parse the HTML character references, convert them
to UTF-8 encoded characters, and then save the file?

Thanks.
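
A note on the "It#9s" result: in a double-quoted Ruby string, "\39" is read as the octal escape \3 followed by a literal "9", not as character 39. The sketch below only illustrates that; the variable names are made up for the example.

```ruby
# "\39" is the octal escape \3 (byte 3, a control character) followed by
# a literal "9". It is not "character number 39".
literal = "It\39s"
literal_bytes = literal.bytes.to_a   # I, t, byte 3, "9", s

# Decimal 39 is 0x27 in hex and 047 in octal, so any of these is an apostrophe.
apostrophe     = 39.chr
same_via_hex   = "\x27"
same_via_octal = "\047"

puts literal_bytes.inspect
puts "It#{apostrophe}s a wonderful life"
```

The control character (byte 3) is most likely what OpenOffice displays as "#".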

David Springer · Mar 26, 2010

s="It&#39;s a wonderful life"

I stumbled across this:

David Springer · Mar 26, 2010

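Try something like this:
-------------------------------------
require 'cgi'
# CGI.unescapeHTML turns numeric character references (&#NNNN;) into
# the characters themselves; &#1040; through &#1071; are the Cyrillic capitals.
s="UPPERCASE Russian Alphabet\n".encode('utf-8')
s+=CGI.unescapeHTML("&#1040;&#1041;&#1042;&#1043;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1044;&#1045;&#1046;&#1047;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1048;&#1049;&#1050;&#1051;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1052;&#1053;&#1054;&#1055;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1056;&#1057;&#1058;&#1059;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1060;&#1061;&#1062;&#1063;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1064;&#1065;&#1066;&#1067;".encode('utf-8'))
s+=CGI.unescapeHTML("&#1068;&#1069;&#1070;&#1071;".encode('utf-8'))
puts s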
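
Applied to the string from the question, the same idea could look like the sketch below. It shows two routes: CGI.unescapeHTML, and gsub with a block that packs each code point as UTF-8. The helper name decode_numeric_refs and the temp-directory output path are assumptions made for the example, not something from the thread.

```ruby
require 'cgi'
require 'tmpdir'

# Hypothetical helper: replace each decimal reference such as &#39; with
# the character for that code point, encoded as UTF-8.
def decode_numeric_refs(text)
  text.gsub(/&#(\d+);/) { [$1.to_i].pack('U') }
end

s = "It&#39;s a wonderful life"
decoded = decode_numeric_refs(s)   # block form of gsub, unlike '\1'
via_cgi = CGI.unescapeHTML(s)      # library route, also handles &amp; etc.

# Save the result as UTF-8. The path is an assumption; any writable
# location works.
path = File.join(Dir.tmpdir, 'utf8.txt')
File.open(path, 'w:utf-8') { |f| f.puts decoded }

puts decoded
```

The block form of gsub matters here: a replacement string like '\1' can only paste the matched digits back in, while a block can convert them to a number and then to a character.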
