Ruby 1.9 string slicing and StringScanner pointers

Discussion in 'Ruby' started by Caio Chassot, Dec 14, 2009.

  1. Caio Chassot

    Caio Chassot Guest

    Hi all,

    Earlier today I posted this question to Stack Overflow:

    http://stackoverflow.com/questions/1899999

    Basically, it boils down to this:
    While I welcome answers to that question here, I'm posting to ask
    something else:

    Should this be considered a bug in StringScanner? Wouldn't it make more
    sense for it to use character indexes?
     
    Caio Chassot, Dec 14, 2009
    #1

  2. I suspect the reason it does it this way is because it's very expensive
    in ruby 1.9 to jump to the Nth character. So if you were scanning a
    large string, it would get slower and slower as you scanned further
    along, calling #scan each time.

    I think what you're doing is the only option: tag the string as a
    single-byte encoding ("ASCII-8BIT" would be better than "US-ASCII"),
    select the range of bytes, and tag it back again, relying on the fact
    that strscan has chomped a whole number of characters.
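
    A minimal sketch of the retagging approach described above. The helper
    name slice_between and the sample text are assumptions made for
    illustration; they do not come from the original question. It relies on
    StringScanner#pos reporting byte offsets that fall on character
    boundaries.

```ruby
require 'strscan'

# Hypothetical helper (not part of strscan): returns the part of +str+
# between two byte offsets such as those reported by StringScanner#pos.
def slice_between(str, from_byte, to_byte)
  original = str.encoding
  # Work on a copy tagged as binary so that indexing counts bytes.
  bytes = str.dup.force_encoding("ASCII-8BIT")[from_byte...to_byte]
  # Tag the slice back with the original encoding.
  bytes.force_encoding(original)
end

text = "h\u00e9llo w\u00f6rld"   # "héllo wörld" as a UTF-8 string
scanner = StringScanner.new(text)

scanner.scan(/h\S+\s/)            # consume "héllo "
start = scanner.pos               # byte offset, not character offset
scanner.scan(/w\S+/)              # consume "wörld"
finish = scanner.pos

puts slice_between(text, start, finish)
```

    Slicing text[start...finish] directly would count characters instead of
    bytes and return the wrong range whenever multibyte characters precede
    the match.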
     
    Brian Candler, Dec 14, 2009
    #2

  3. It would seem so.
    I don't know StringScanner internals, but does this have to be so? I
    mean, with $' you get the remainder of the string so when not using
    positions you could handle it that way at the expense of an additional
    String instance for each match.
    Or use String#scan or another matching option, if that is possible.

    Kind regards

    robert
     
    Robert Klemme, Dec 14, 2009
    #3
  4. Even if it did, I think the point remains that StringScanner#pos
    wouldn't be of much use if it gave a character offset, since str[n..m]
    is an expensive operation in ruby 1.9.

    If all you want is the rest of the string, then StringScanner#post_match
    gives you that already, doesn't it? But I think the OP wanted to get the
    buffer between two arbitrary match positions.
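
    For the simpler case of wanting only the remainder, a short sketch
    (the sample string is an assumption for illustration):

```ruby
require 'strscan'

scanner = StringScanner.new("key: value")
scanner.scan(/\w+:\s*/)      # consume "key: "

after_match = scanner.post_match  # text following the last match
remainder   = scanner.rest        # text from the scan pointer onward

puts after_match
puts remainder
```

    Neither call needs a position, so the byte versus character question
    does not come up here.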
     
    Brian Candler, Dec 14, 2009
    #4
  5. You're right. Frankly, I did not read the stackoverflow question
    initially but from that it is obvious.

    Still this leaves an awkward feeling: you have a String which can
    informally be defined as a sequence of _characters_. Now, for most
    application cases accessing the nth character seems to be a more
    natural operation than accessing the nth byte. I know the internal
    reasons for the fact that accessing the nth character is expensive
    (variable length encodings) but from an interface perspective this is
    not good IMHO.

    Java did solve this with a specialized character type so you can have
    arrays of char, but from what I recall about Matz's comments the Java
    model is flawed because it does not work well with non-Western
    languages, namely Asian languages.

    Btw, although UTF-16 is a fixed length encoding, char based accesses
    are *really* slow:

    #!ruby19

    require 'benchmark'

    REP = 100

    # 10 million ASCII characters, transcoded into each encoding below.
    s1 = "abcdeABCDE" * 1_000_000

    encodings = ["ASCII", "UTF-8", "UTF-16BE", "UTF-16LE"]
    strings = {}

    encodings.each do |enc|
      strings[enc] = s1.encode(enc)
    end

    # Slice the last 10 characters, i.e. index near the end of the string.
    idx = s1.length - 10

    Benchmark.bmbm 30 do |b|
      encodings.each do |enc|
        str = strings[enc]
        # The UTF-16 runs are so slow that they get 1000 times fewer repetitions.
        rep = /16/ =~ enc ? REP : REP * 1000

        b.report "enc %-10s rep %11d" % [enc, rep] do
          rep.times do
            s = str[idx..-1]
          end
        end
      end
    end


    Cheers

    robert


    -- 
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Dec 14, 2009
    #5
  6. UTF-16 is not a fixed length encoding. A UTF-16 character may be
    encoded in two or four bytes.

    I believe UTF-32 is fixed length, but even then it would not be cheap to
    index due to Unicode's use of "combining characters." With them
    multiple codepoints may represent a single index.
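
    Both points can be checked with a short sketch. The sample characters
    are assumptions chosen for illustration.

```ruby
bmp      = "A"          # U+0041, inside the Basic Multilingual Plane
astral   = "\u{1D11E}"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
combined = "e\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

# UTF-16 uses one 16-bit unit for BMP characters and a surrogate pair
# (two units) for everything else.
bmp_bytes    = bmp.encode("UTF-16BE").bytesize
astral_bytes = astral.encode("UTF-16BE").bytesize

# UTF-32 is fixed width per codepoint, but the accented "e" that a reader
# sees as one character is still two codepoints.
utf32            = combined.encode("UTF-32BE")
utf32_codepoints = utf32.length
utf32_bytes      = utf32.bytesize

puts bmp_bytes         # 2
puts astral_bytes      # 4
puts utf32_codepoints  # 2
puts utf32_bytes       # 8
```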

    James Edward Gray II
     
    James Edward Gray II, Dec 14, 2009
    #6
  7. Thanks for the education, James! I would have sworn UTF-16 is fixed
    length...

    For reference of other readers:
    http://en.wikipedia.org/wiki/Utf-16
    http://en.wikipedia.org/wiki/Combining_characters

    Combining characters - oh what a mess. I18n is really a minefield.

    Kind regards

    robert
     
    Robert Klemme, Dec 14, 2009
    #7
