ruby 1.9.2 unicode with hex codepoint problem

Gilles Gilles

Hi,

I have the following test case:

$ cat test.in
Der gro\xdfe BilderSauger

$ cat test.rb
File.open('test.in', 'r').each_line do |line|
  puts line
  test = "Der gro\xdfe BilderSauger"
  puts test
end

The result is
Der gro\xdfe BilderSauger
Der gro?e BilderSauger

I have tried to put an encoding in the File.open() or line.encode()
without success. the '\' is recognized as a real '\', not as the
beginning of an hex escape sequence.

How can I get \xdf to be recognized as ß when reading from a file?

Thanks

--Gilles

Eric Hodel

Unicode codepoint 00df needs to be written in a particular encoding.
I'll choose UTF-8.

$ echo 'Der große BilderSauger' > test.in
$ hexdump -C test.in
00000000  44 65 72 20 67 72 6f c3  9f 65 20 42 69 6c 64 65  |Der gro..e Bilde|
00000010  72 53 61 75 67 65 72 0a                           |rSauger.|
00000018

In test.rb you also need to set UTF-8 for this to work:

$ cat test.rb
# coding: UTF-8

File.open('test.in', 'r').each_line do |line|
  puts line
  test = "Der große BilderSauger"
  puts test
end
$ ruby19 test.rb
Der große BilderSauger
Der große BilderSauger
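
For anyone who wants to reproduce the UTF-8 round trip without a shell, here is a minimal sketch. It assumes a temporary directory created with Dir.mktmpdir (my choice, not part of the original test case) and writes the bytes with File.binwrite so that no transcoding happens on the way out. The file is then read back with an explicit 'r:UTF-8' mode.

```ruby
require "tmpdir"

# Sample text with U+00DF written as a \u escape, so this source file
# itself stays pure ASCII.
text = "Der gro\u00dfe BilderSauger\n"
line = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "test.in")  # hypothetical location for the test file
  File.binwrite(path, text)         # write the UTF-8 bytes unchanged

  # Read the first line back, tagging it explicitly as UTF-8.
  File.open(path, "r:UTF-8") { |io| line = io.gets }
end

p line.encoding            # the encoding the line was tagged with
p line.bytes.to_a[7, 2]    # the two bytes that encode the sharp s
p line.chomp.length        # characters, not bytes
```

The two bytes at offset 7 should match the c3 9f pair in the hexdump above.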
 
Brian Candler

Gilles Devaux wrote in post #980948:
test = "Der gro\xdfe BilderSauger"

\xdf is the single byte DF, and your 'test' string will have encoding
ASCII-8BIT.

You'd use \u00df instead if you are using UTF-8 encoding (where that
character is encoded as two bytes).

Otherwise, if you are using ISO-8859-1 (say), where that codepoint *is*
the single byte DF, then you'll need to open your file with that
encoding specified.

Of course, test.in should not contain the character sequence "\" "x" "d"
"f" but rather the single byte (or two bytes, if you are using UTF-8
encoding).

Probably the simplest solution is to open a text editor and type in ß.
Then use hexdump -C on the file to see what it contains, byte by byte.
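
To make the one-byte versus two-byte difference concrete, here is a small sketch. The variable names are made up for illustration, and the raw byte is tagged explicitly with force_encoding, because the default tag of a "\xdf" literal depends on the Ruby version and source encoding.

```ruby
# The raw byte DF, explicitly tagged as binary (ASCII-8BIT).
single_byte = "\xdf".dup.force_encoding("ASCII-8BIT")

# U+00DF written with a \u escape is always a UTF-8 string.
utf8_char = "\u00df"

# The same character transcoded to ISO-8859-1 is the single byte DF again.
latin1_char = utf8_char.encode("ISO-8859-1")

p single_byte.bytes.to_a   # [223]
p utf8_char.bytes.to_a     # [195, 159]
p latin1_char.bytes.to_a   # [223]

# Going the other way: declare the raw byte to be ISO-8859-1, then
# transcode it to UTF-8.
converted = single_byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")
p converted == utf8_char   # true
```

Note that force_encoding only changes the tag on the bytes, while encode actually rewrites them.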

Gilles Devaux

Sorry I haven't responded earlier but it seems I'm not notified by
email.

The thing is I do not control the input, it is the browscap file found
here and is in ISO-8859-1
http://browsers.garykeith.com/downloads.asp

\xdf (decimal 223) is a valid ISO-8859-1 code point
(http://en.wikipedia.org/wiki/ISO/IEC_8859-1)
it appears as '?' because my terminal is UTF-8 but the bytes are there:

$ cat test.rb
a = "Der gro\xdfe BilderSauger"
a.each_byte { |b| puts b }

$ ruby test.rb
68
101
114
32
103
114
111
223 <- Here I am
101
32
66
105
108
100
101
114
83
97
117
103
101
114

You can also see that the length is 22, not 25.

Also, if I run
puts a.encode('UTF-8', 'ISO-8859-1')
I see the proper character in my terminal.

But when read from a file:

$ cat test.rb
File.open('test.in', 'r:ISO-8859-1').each_line do |l|
  puts l
  puts '***'
  puts l.length
  puts '***'
  l.each_byte { |b| puts b }
end

$ ruby test.rb
Der gro\xdfe BilderSauger
***
25
***
68
101
114
32
103
114
111
92 <- Here
120 <- we
100 <- are
102 <- as 4 ASCII chars '\xdf'
101
32
66
105
108
100
101
114
83
97
117
103
101
114

I also tried putting UTF-8 codepoints in the file and reading it as
UTF-8, without luck. It seems escape sequences are not interpreted when
reading from a stream, which I can understand.

What I can't figure out is how to interpret these escape sequences when
reading them from a file.

--Gilles
 
Brian Candler

You're doing two different things.
$ cat test.rb
a = "Der gro\xdfe BilderSauger"

That's a double-quoted string, and so Ruby is doing some translation of
the contents. A common example is \n meaning "newline"; in this case,
\xNN means the byte with hex code NN. So when you do each_byte, that's
what you get, a single byte.

Change the double-quotes to single-quotes and you'll actually get the
four separate characters.
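
A quick way to see the difference between the two kinds of quotes is to compare byte counts. This is just an illustrative sketch with made-up variable names.

```ruby
# In double quotes Ruby translates \xdf into the single byte 0xDF.
double_quoted = "Der gro\xdfe BilderSauger"

# In single quotes the backslash, 'x', 'd' and 'f' stay as four characters.
single_quoted = 'Der gro\xdfe BilderSauger'

puts double_quoted.bytesize          # 22
puts single_quoted.bytesize          # 25
p double_quoted.bytes.to_a[7]        # 223
p single_quoted.bytes.to_a[7, 4]     # [92, 120, 100, 102]
```

The single-quoted string has exactly the bytes that each_byte reports for the line read from the file.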
But when read from a file: ...
l.each_byte {|b| puts b} ...
92 <- Here
120 <- we
100 <- are
102 <- as 4 ASCII chars '\xdf'

That proves that the file actually contains the four characters
'\', 'x', 'd', 'f'. If you want further proof, try

hexdump -C test.in

to take Ruby out of the loop completely.

So there's neither UTF-8 nor ISO-8859-1 in that file, just plain ASCII
characters.

If you want to turn this into something else, you would have to process
it. For example:

l.gsub!(/\\x([0-9a-f]{2})/i) { $1.hex.chr }

# or in ruby 1.9, if you want to tag the encoding:

l.gsub!(/\\x([0-9a-f]{2})/i) { $1.hex.chr("ISO-8859-1") }
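
Putting it together, here is a self-contained sketch of the same substitution with a conversion to UTF-8 at the end. The helper name unescape_hex is hypothetical, and the sample line is built in memory rather than read from test.in. It uses the non-mutating gsub so the input is left alone.

```ruby
# Hypothetical helper: replace each literal "\xNN" escape in the line with
# the single character that has that byte value in the given encoding.
def unescape_hex(line, encoding = "ISO-8859-1")
  line.gsub(/\\x([0-9a-f]{2})/i) { $1.hex.chr(encoding) }
end

# Stand-in for a line read with File.open('test.in', 'r:ISO-8859-1').
# Single quotes keep the four characters \ x d f literal.
raw = 'Der gro\xdfe BilderSauger'.encode("ISO-8859-1")

decoded = unescape_hex(raw)       # still tagged ISO-8859-1
utf8    = decoded.encode("UTF-8") # transcode for a UTF-8 terminal

puts raw.length                   # 25
puts decoded.length               # 22
p decoded.bytes.to_a[7]           # 223
p utf8.bytes.to_a[7, 2]           # [195, 159]
```

Tagging the replacement with the same encoding as the line keeps the result valid when the line also contains real ISO-8859-1 bytes. An untagged (binary) replacement could raise a compatibility error in that case.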
 
Gilles Devaux

Stupid of me. The file is indeed ISO-8859-1 (some other characters are
encoded this way) just not this one, it's escaped.

This:
l.gsub!(/\\x([0-9a-f]{2})/i) { $1.hex.chr }
# or in ruby 1.9, if you want to tag the encoding:
l.gsub!(/\\x([0-9a-f]{2})/i) { $1.hex.chr("ISO-8859-1") }

is exactly what I want.

Thanks.
 
