StringScanner and UTF-8 in ruby 1.9

Discussion in 'Ruby' started by Stefano Crocco, Sep 16, 2009.

  1. I've recently switched to Ruby 1.9 and I'm having trouble with its
    multilingual features, this time with StringScanner. It seems that, when
    working with multi-byte encodings such as UTF-8, its pos method returns a
    position in bytes rather than in characters. For instance, the following
    code (when the source encoding is set to UTF-8) outputs 2:

    require 'strscan'

    s = StringScanner.new "èa"
    s.scan(/./)
    puts s.pos  # prints 2 (a byte offset), although only one character was scanned

    (If you can't see it correctly, the string passed to StringScanner.new is
    made of two characters: the first is an "e" with a grave accent and the
    second is an "a"). If I replace the first character with an ASCII
    character, the output is 1. This clearly suggests that StringScanner#pos
    gives a position in bytes rather than in characters. Does anyone know
    whether there's a way to have it return the position in characters, or to
    convert it from bytes to characters?
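One possible way to do the conversion is to count the characters in the part of the string that lies before the byte offset. The following is only a minimal sketch: `char_pos` is a hypothetical helper name (not part of StringScanner), and it assumes `String#byteslice` is available, which as far as I know requires Ruby 1.9.3 or later. The string is written with a `\u` escape so that it is UTF-8 regardless of the source encoding.

```ruby
require 'strscan'

# Hypothetical helper (not part of StringScanner): counts the characters
# that precede the scanner's current byte offset.
def char_pos(scanner)
  scanner.string.byteslice(0, scanner.pos).length
end

s = StringScanner.new("\u00E8a")  # "è" followed by "a", encoded as UTF-8
s.scan(/./)
byte_offset = s.pos        # offset in bytes
char_offset = char_pos(s)  # offset in characters
puts "bytes: #{byte_offset}, characters: #{char_offset}"
# prints "bytes: 2, characters: 1"
```

Note that this re-counts the characters from the start of the string on every call, so it may be slow on long inputs. Later Ruby versions (2.0 and up) also provide StringScanner#charpos, which reports the character position directly.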

    I noticed that StringScanner has a get_byte method and a getch method,
    which return the next byte and the next character respectively, so I can't
    help wondering why something similar hasn't been provided for pos. Do you
    think there's a reason for this, or should it be reported as a bug? (Or am
    I missing something obvious?)
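To illustrate the get_byte/getch difference mentioned above, here is a small sketch using the same two-character string (again written with a `\u` escape so it is UTF-8 regardless of source encoding; the variable names are just for illustration). In both cases pos keeps counting bytes:

```ruby
require 'strscan'

text = "\u00E8a"  # "è" followed by "a", encoded as UTF-8

by_byte = StringScanner.new(text)
first_byte = by_byte.get_byte    # only the first byte of "è"
pos_after_byte = by_byte.pos     # the scanner advanced by one byte

by_char = StringScanner.new(text)
first_char = by_char.getch       # the whole character "è"
pos_after_char = by_char.pos     # the scanner advanced by two bytes

puts "get_byte: #{first_byte.bytesize} byte, pos = #{pos_after_byte}"
puts "getch:    #{first_char.length} character, pos = #{pos_after_char}"
```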

    Thanks in advance

    Stefano
     
    Stefano Crocco, Sep 16, 2009
    #1