Ruby 1.9 string slicing and StringScanner pointers

Caio Chassot · Dec 14, 2009

Hi all,

Earlier today I posted this question to Stack Overflow:

http://stackoverflow.com/questions/1899999

Basically, it boils down to this:

I'm working with UTF-8 strings in Ruby 1.9. I need to get a slice using byte-based indexes, not char-based, because StringScanner's internal pointers are byte-based.

While I welcome answers to that question here, I'm posting to ask
something else:

Should this be considered a bug in StringScanner? Wouldn't it make more
sense for it to use character indexes?

Brian Candler · Dec 14, 2009

Caio said:
Should this be considered a bug in StringScanner? Wouldn't it make more
sense for it to use character indexes?

I suspect the reason it does it this way is because it's very expensive
in ruby 1.9 to jump to the Nth character. So if you were scanning a
large string, it would get slower and slower as you scanned further
along, calling #scan each time.

I think what you're doing is the only option: tag the string as a
single-byte encoding ("ASCII-8BIT" would be better than "US-ASCII"),
select the range of bytes, and tag it back again, relying on the fact
that strscan has chomped a whole number of characters.
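For readers following along, the re-tagging approach described above might look like the sketch below. The helper name slice_by_bytes and the sample text are made up for illustration; only StringScanner#scan_until, StringScanner#pos and String#force_encoding are real library calls. It assumes both offsets fall on character boundaries, which holds when they come from the scanner.

```ruby
require 'strscan'

# Hypothetical helper (not part of StringScanner): returns the bytes
# from...to of str, tagged again with str's original encoding.
def slice_by_bytes(str, from, to)
  binary = str.dup.force_encoding(Encoding::ASCII_8BIT)  # view as raw bytes
  binary[from...to].force_encoding(str.encoding)         # tag the slice back
end

text = "na\u00EFve caf\u00E9 au lait"   # "naïve café au lait" in UTF-8
scanner = StringScanner.new(text)

scanner.scan_until(/ /)        # consume "naïve "
start = scanner.pos            # byte offset 7 (6 characters, one is 2 bytes)
scanner.scan_until(/\u00E9/)   # consume "café"
stop = scanner.pos             # byte offset 12

between = slice_by_bytes(text, start, stop)   # "café"
wrong   = text[start...stop]                  # character-based slice: "afé a"
```

The dup keeps the caller's string untouched, since force_encoding modifies its receiver in place.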

Robert Klemme · Dec 14, 2009

It would seem so.

I suspect the reason it does it this way is because it's very expensive
in ruby 1.9 to jump to the Nth character. So if you were scanning a
large string, it would get slower and slower as you scanned further
along, calling #scan each time.

I don't know StringScanner internals, but does this have to be so? I
mean, with $' you get the remainder of the string so when not using
positions you could handle it that way at the expense of an additional
String instance for each match.

I think what you're doing is the only option: tag the string as a
single-byte encoding ("ASCII-8BIT" would be better than "US-ASCII"),
select the range of bytes, and tag it back again, relying on the fact
that strscan has chomped a whole number of characters.

Or use String#scan or another matching option, if that is possible.
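One possible shape for "another matching option", as a sketch only: MatchData#end reports character offsets, so they can be passed straight to String#[] without any re-tagging. The sample text and variable names are invented for illustration, and whether this fits depends on what the scanner was being used for.

```ruby
text = "na\u00EFve caf\u00E9 au lait"   # "naïve café au lait" in UTF-8

first  = text.match(/ /)
start  = first.end(0)                  # character offset just past the first space
second = text.match(/\u00E9/, start)   # resume matching from a character offset
stop   = second.end(0)

between = text[start...stop]           # "café"
```

This stays in character offsets throughout, so every slice still pays the indexing cost discussed in this thread.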

Kind regards

robert

Brian Candler · Dec 14, 2009

Robert said:
I don't know StringScanner internals, but does this have to be so? I
mean, with $' you get the remainder of the string so when not using
positions you could handle it that way at the expense of an additional
String instance for each match.

Even if it did, I think the point remains that StringScanner#pos
wouldn't be of much use if it gave a character offset, since str[n..m]
is an expensive operation in ruby 1.9.

If all you want is the rest of the string, then StringScanner#post_match
gives you that already, doesn't it? But I think the OP wanted to get the
buffer between two arbitrary match positions.
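A small sketch of the "rest of the string" case, using made-up sample text: after a successful scan, both StringScanner#post_match and StringScanner#rest return everything after the current position, with no offsets involved.

```ruby
require 'strscan'

scanner = StringScanner.new("caf\u00E9: au lait")   # "café: au lait"
scanner.scan_until(/: /)            # consume "café: "

after_match = scanner.post_match    # "au lait"
remainder   = scanner.rest          # "au lait"
```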

Robert Klemme · Dec 14, 2009

2009/12/14 Brian Candler said:
Robert said:

I don't know StringScanner internals, but does this have to be so? I
mean, with $' you get the remainder of the string so when not using
positions you could handle it that way at the expense of an additional
String instance for each match.

Even if it did, I think the point remains that StringScanner#pos
wouldn't be of much use if it gave a character offset, since str[n..m]
is an expensive operation in ruby 1.9.

If all you want is the rest of the string, then StringScanner#post_match
gives you that already, doesn't it? But I think the OP wanted to get the
buffer between two arbitrary match positions.

You're right. Frankly, I did not read the stackoverflow question
initially but from that it is obvious.

Still this leaves an awkward feeling: you have a String which can
informally be defined as a sequence of _characters_. Now, for most
application cases accessing the nth character seems to be a more
natural operation than accessing the nth byte. I know the internal
reasons for the fact that accessing the nth character is expensive
(variable length encodings) but from an interface perspective this is
not good IMHO.

Java did solve this with a specialized character type so you can have
arrays of char, but from what I recall about Matz's comments the Java
model is flawed because it does not work well with non western
languages, namely Asian languages.

Btw, although UTF-16 is a fixed length encoding, char based accesses
are *really* slow:

#!ruby19

require 'benchmark'

REP = 100

# 10 million ASCII-only characters
s1 = "abcdeABCDE" * 1_000_000

encodings = ["ASCII", "UTF-8", "UTF-16BE", "UTF-16LE"]
strings = {}

encodings.each do |enc|
  strings[enc] = s1.encode(enc)
end

# character index of the start of the last 10 characters
idx = s1.length - 10

Benchmark.bmbm 30 do |b|
  encodings.each do |enc|
    str = strings[enc]
    # the UTF-16 variants get 1000 times fewer repetitions
    rep = /16/ =~ enc ? REP : REP * 1000

    b.report "enc %-10s rep %11d" % [enc, rep] do
      rep.times do
        s = str[idx..-1]
      end
    end
  end
end

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

James Edward Gray II · Dec 14, 2009

Btw, although UTF-16 is a fixed length encoding,

UTF-16 is not a fixed length encoding. A UTF-16 character may be encoded in two or four bytes.

I believe UTF-32 is fixed length, but even then it would not be cheap to index due to Unicode's use of "combining characters." With them multiple codepoints may represent a single index.
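Both points can be checked from Ruby itself. The sketch below uses arbitrarily chosen sample characters: one inside the Basic Multilingual Plane, one outside it, and one base letter followed by a combining accent.

```ruby
bmp    = "\u00E9".encode("UTF-16BE")      # é, inside the Basic Multilingual Plane
astral = "\u{1D11E}".encode("UTF-16BE")   # musical G clef, outside the BMP

bmp_bytes    = bmp.bytesize       # 2
astral_bytes = astral.bytesize    # 4, stored as a surrogate pair
astral_chars = astral.length      # 1, still a single character

combined      = "e\u0301"         # "e" followed by COMBINING ACUTE ACCENT
code_points   = combined.length                         # 2 code points, one visible glyph
utf32_bytes   = combined.encode("UTF-32BE").bytesize    # 8, four bytes per code point
```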

James Edward Gray II

Robert Klemme · Dec 14, 2009

UTF-16 is not a fixed length encoding. A UTF-16 character may be encoded in two or four bytes.

I believe UTF-32 is fixed length, but even then it would not be cheap to index due to Unicode's use of "combining characters." With them multiple codepoints may represent a single index.

Thanks for the education, James! I would have sworn UTF-16 is fixed
length...

For reference of other readers:
http://en.wikipedia.org/wiki/Utf-16
http://en.wikipedia.org/wiki/Combining_characters

Combining characters - oh what a mess. I18n is really a minefield.

Kind regards

robert
