StringScanner and UTF-8 in ruby 1.9

Stefano Crocco

I've recently switched to Ruby 1.9 and I'm having problems using its
multilingual features. Now I'm having a problem with StringScanner. It seems
that, when working with multi-byte encodings such as UTF-8, its pos method
returns a position in bytes rather than in characters. For instance, the
following code (when the source encoding is set to UTF-8) outputs 2:

require 'strscan'
s = StringScanner.new "èa"
s.scan(/./)   # matches the first character, "è"
puts s.pos    # prints 2

(If you can't see it correctly, the string passed to StringScanner.new is
made of two characters: the first is an "e" with a grave accent and the
second is an "a".) If I replace the first character with an ASCII character,
the output is 1. This clearly suggests that StringScanner#pos gives a
position in bytes rather than in characters. Does anyone know whether there's
a way to have it return the position in characters, or to convert it from
bytes to characters?
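
To make the conversion I'm asking about concrete, here is a rough sketch of one possible workaround: count the characters in the part of the string that has already been scanned. The char_pos helper is a hypothetical name of my own, not part of StringScanner, and it assumes String#byteslice is available (Ruby 1.9.3 or later).

```ruby
require 'strscan'

# Hypothetical helper (not part of StringScanner): converts the scanner's
# byte offset into a character offset by counting the characters in the
# already-scanned prefix. Assumes String#byteslice (Ruby 1.9.3 or later).
def char_pos(scanner)
  scanner.string.byteslice(0, scanner.pos).length
end

# "\u00E8" is the "e" with a grave accent, written as an escape so the
# example does not depend on the source file's encoding.
s = StringScanner.new("\u00E8a")
s.scan(/./)

byte_offset = s.pos        # offset counted in bytes
char_offset = char_pos(s)  # offset counted in characters
puts "bytes: #{byte_offset}, chars: #{char_offset}"
```

This rebuilds the prefix on every call, so it is probably only reasonable for short inputs. Later Ruby releases also ship a StringScanner#charpos method, so it is worth checking whether your version has it before relying on a helper like this.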

I noticed that StringScanner has a get_byte method and a getch method, which
return the next byte and the next character respectively, so I can't help
wondering why something similar hasn't been provided for pos. Do you think
there's a reason for this, or should it be reported as a bug? (Or am I
missing something obvious?)
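
For reference, a small sketch of the getch / get_byte difference I mean, using the same two-character string (the variable names are only for illustration):

```ruby
require 'strscan'

# Same string as above: "e" with a grave accent followed by "a".
by_char = StringScanner.new("\u00E8a")
first_char = by_char.getch       # consumes one whole character
pos_after_char = by_char.pos     # pos still counts bytes

by_byte = StringScanner.new("\u00E8a")
first_byte = by_byte.get_byte    # consumes a single byte only
pos_after_byte = by_byte.pos

puts "getch:    #{first_char.bytesize} byte(s) consumed, pos = #{pos_after_char}"
puts "get_byte: #{first_byte.bytesize} byte(s) consumed, pos = #{pos_after_byte}"
```

So the scanner can step by character or by byte, but in both cases pos reports the offset in bytes.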

Thanks in advance

Stefano
 
