StringScanner and UTF-8 in ruby 1.9

Stefano Crocco · Sep 16, 2009

I recently switched to Ruby 1.9 and I'm having trouble with its
multilingual features, in particular with StringScanner. It seems that,
when working with multi-byte encodings such as UTF-8, its pos method
returns a position in bytes rather than in characters. For instance, the
following code (when the source encoding is set to UTF-8) outputs 2:

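# coding: utf-8
require 'strscan'
s = StringScanner.new "èa"
s.scan(/./)  # consumes "è", which is two bytes in UTF-8
puts s.pos   # prints 2: a byte offset, not a character offset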

(In case it doesn't display correctly, the string passed to
StringScanner.new is made of two characters: the first is an "e" with a
grave accent and the second is an "a".) If I replace the first character
with an ASCII character, the output is 1. This strongly suggests that
StringScanner#pos gives a position in bytes rather than in characters.
Does anyone know whether there's a way to have it return the position in
characters, or to convert it from bytes to characters?
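
One possible way to do the conversion, as a minimal sketch: count the characters in the bytes consumed so far. The helper name char_pos is hypothetical (it is not part of StringScanner), and the sketch assumes String#byteslice is available, which requires Ruby 1.9.3 or later.

```ruby
require 'strscan'

# Hypothetical helper (not part of StringScanner): converts the scanner's
# byte offset into a character offset by counting the characters in the
# bytes consumed so far. Assumes String#byteslice (Ruby 1.9.3+) and that
# the scan pointer sits on a character boundary.
def char_pos(scanner)
  scanner.string.byteslice(0, scanner.pos).length
end

s = StringScanner.new("\u00E8a")  # "è" followed by "a", UTF-8 encoded
s.scan(/./)
puts s.pos        # byte offset
puts char_pos(s)  # character offset
```

This walks the consumed prefix on every call, so it costs time proportional to the current position. Later Ruby versions also ship StringScanner#charpos, which reports the character position directly.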

I noticed that StringScanner has a get_byte method and a getch method,
which return the next byte and the next character respectively, so I
can't help wondering why something similar hasn't been provided for pos.
Do you think there's a reason for this, or should it be reported as a
bug? (Or am I missing something obvious?)
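
To illustrate the difference between the two methods mentioned above, here is a small sketch on the same two-character string. The variable names are only for illustration.

```ruby
require 'strscan'

s = StringScanner.new("\u00E8a")  # "è" followed by "a", UTF-8 encoded

ch = s.getch             # next character: the whole two-byte "è"
pos_after_getch = s.pos  # pos has advanced by the character's byte length

s.reset                  # move the scan pointer back to the start
byte = s.get_byte        # next byte only: the first half of "è"
pos_after_get_byte = s.pos

puts ch.bytesize         # bytes consumed by getch
puts byte.bytesize       # bytes consumed by get_byte
```

Both methods advance the same byte-based pos; they differ only in how many bytes they consume per call.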

Thanks in advance

Stefano
