Ruby 1.9 string slicing and StringScanner pointers

Caio Chassot

Hi all,

Earlier today I posted this question to Stack Overflow:

http://stackoverflow.com/questions/1899999

Basically, it boils down to this:
I'm working with UTF-8 strings in Ruby 1.9. I need to get a slice using byte-based indexes, not char-based, because StringScanner's internal pointers are byte-based.

While I welcome answers to that question here, I'm posting to ask
something else:

Should this be considered a bug in StringScanner? Wouldn't it make more
sense for it to use character indexes?
 
Brian Candler

Caio said:
Should this be considered a bug in StringScanner? Wouldn't it make more
sense for it to use character indexes?

I suspect the reason it does it this way is because it's very expensive
in ruby 1.9 to jump to the Nth character. So if you were scanning a
large string, it would get slower and slower as you scanned further
along, calling #scan each time.

I think what you're doing is the only option: tag the string as a
single-byte encoding ("ASCII-8BIT" would be better than "US-ASCII"),
select the range of bytes, and tag it back again, relying on the fact
that strscan has chomped a whole number of characters.
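
For readers who want to see the retagging approach spelled out, here is a minimal sketch. The helper name `byte_slice` and the sample text are made up for illustration; they are not part of StringScanner or of the original question. It assumes both offsets come from StringScanner#pos, so they fall on character boundaries.

```ruby
require 'strscan'

# Hypothetical helper: returns bytes from...to of str, tagged with
# str's original encoding. Works on a copy, so str itself is untouched.
def byte_slice(str, from, to)
  bytes = str.dup.force_encoding("ASCII-8BIT")   # treat as raw bytes
  bytes[from...to].force_encoding(str.encoding)  # tag the slice back
end

text = "h\u00e9llo w\u00f6rld"      # "héllo wörld" in UTF-8
scanner = StringScanner.new(text)

scanner.scan(/\S+\s+/)              # consume the first word and the space
start = scanner.pos                 # byte offset, not character offset
scanner.scan(/\S+/)                 # consume the second word
finish = scanner.pos

slice = byte_slice(text, start, finish)
puts slice
```

If I remember correctly, Ruby 1.9.3 and later also provide String#byteslice, which would make the helper unnecessary where it is available.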
 
Robert Klemme

It would seem so.
I suspect the reason it does it this way is because it's very expensive
in ruby 1.9 to jump to the Nth character. So if you were scanning a
large string, it would get slower and slower as you scanned further
along, calling #scan each time.

I don't know StringScanner internals, but does this have to be so? I
mean, with $' you get the remainder of the string so when not using
positions you could handle it that way at the expense of an additional
String instance for each match.
I think what you're doing is the only option: tag the string as a
single-byte encoding ("ASCII-8BIT" would be better than "US-ASCII"),
select the range of bytes, and tag it back again, relying on the fact
that strscan has chomped a whole number of characters.

Or use String#scan or another matching option, if that is possible.

Kind regards

robert
 
Brian Candler

Robert said:
I don't know StringScanner internals, but does this have to be so? I
mean, with $' you get the remainder of the string so when not using
positions you could handle it that way at the expense of an additional
String instance for each match.

Even if it did, I think the point remains that StringScanner#pos
wouldn't be of much use if it gave a character offset, since str[n..m]
is an expensive operation in ruby 1.9.

If all you want is the rest of the string, then StringScanner#post_match
gives you that already, doesn't it? But I think the OP wanted to get the
buffer between two arbitrary match positions.
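
A small sketch of the difference being discussed, using an invented sample string: StringScanner#pos counts bytes, so feeding it to a character-based index drifts as soon as a multibyte character has been scanned, while StringScanner#post_match returns the remainder directly.

```ruby
require 'strscan'

text = "na\u00efve caf\u00e9"        # "naïve café" in UTF-8
s = StringScanner.new(text)

s.scan(/\S+/)                        # matches "naïve"
byte_pos = s.pos                     # bytes consumed so far
char_len = s.matched.length          # characters consumed so far
rest     = s.post_match              # remainder, no index arithmetic needed
wrong    = text[byte_pos..-1]        # byte offset misused as a character index

puts "pos=#{byte_pos} chars=#{char_len}"
puts "post_match=#{rest.inspect} text[pos..-1]=#{wrong.inspect}"
```

Here the byte position is one larger than the character count because "ï" takes two bytes in UTF-8, so the character-based slice skips the leading space.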
 
Robert Klemme

2009/12/14 Brian Candler said:
Robert said:
I don't know StringScanner internals, but does this have to be so? I
mean, with $' you get the remainder of the string so when not using
positions you could handle it that way at the expense of an additional
String instance for each match.

Even if it did, I think the point remains that StringScanner#pos
wouldn't be of much use if it gave a character offset, since str[n..m]
is an expensive operation in ruby 1.9.

If all you want is the rest of the string, then StringScanner#post_match
gives you that already, doesn't it? But I think the OP wanted to get the
buffer between two arbitrary match positions.

You're right. Frankly, I did not read the stackoverflow question
initially but from that it is obvious.

Still this leaves an awkward feeling: you have a String which can
informally be defined as a sequence of _characters_. Now, for most
application cases accessing the nth character seems to be a more
natural operation than accessing the nth byte. I know the internal
reasons for the fact that accessing the nth character is expensive
(variable length encodings) but from an interface perspective this is
not good IMHO.

Java did solve this with a specialized character type so you can have
arrays of char, but from what I recall about Matz's comments the Java
model is flawed because it does not work well with non western
languages, namely Asian languages.

Btw, although UTF-16 is a fixed length encoding, char based accesses
are *really* slow:

#!ruby19

require 'benchmark'

REP = 100

s1 = "abcdeABCDE" * 1_000_000

encodings = ["ASCII", "UTF-8", "UTF-16BE", "UTF-16LE"]
strings = {}

encodings.each do |enc|
  strings[enc] = s1.encode(enc)
end

idx = s1.length - 10

Benchmark.bmbm 30 do |b|
  encodings.each do |enc|
    str = strings[enc]
    rep = /16/ =~ enc ? REP : REP * 1000

    b.report "enc %-10s rep %11d" % [enc, rep] do
      rep.times do
        s = str[idx..-1]
      end
    end
  end
end


Cheers

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
James Edward Gray II

Btw, although UTF-16 is a fixed length encoding,

UTF-16 is not a fixed length encoding. A UTF-16 character may be
encoded in two or four bytes.

I believe UTF-32 is fixed length, but even then it would not be cheap to
index due to Unicode's use of "combining characters." With them
multiple codepoints may represent a single index.
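
Both points can be checked from irb. The following is only an illustrative sketch with sample characters chosen for this note: U+1D11E (musical G clef) lies outside the Basic Multilingual Plane and so needs a surrogate pair in UTF-16, and "e" followed by U+0301 (combining acute accent) displays as one character but is stored as two codepoints.

```ruby
# UTF-16: two bytes for BMP characters, four for everything else.
bmp_bytes  = "a".encode("UTF-16BE").bytesize
clef_bytes = "\u{1D11E}".encode("UTF-16BE").bytesize

# Combining characters: same visible character, different codepoint counts.
composed   = "\u00e9"      # é as a single codepoint
decomposed = "e\u0301"     # e + COMBINING ACUTE ACCENT

# UTF-32 is fixed width per codepoint, not per visible character.
utf32_bytes = decomposed.encode("UTF-32BE").bytesize

puts "UTF-16 bytes: a=#{bmp_bytes} clef=#{clef_bytes}"
puts "lengths: composed=#{composed.length} decomposed=#{decomposed.length}"
puts "UTF-32 bytes for decomposed form: #{utf32_bytes}"
```

So even with a fixed-width encoding, String#length and str[n] work in codepoints, and the two spellings of "é" do not compare equal without normalization.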

James Edward Gray II
 
Robert Klemme

UTF-16 is not a fixed length encoding. A UTF-16 character may be encoded in two or four bytes.

I believe UTF-32 is fixed length, but even then it would not be cheap to index due to Unicode's use of "combining characters." With them multiple codepoints may represent a single index.

Thanks for the education, James! I would have sworn UTF-16 is fixed
length...

For reference of other readers:
http://en.wikipedia.org/wiki/Utf-16
http://en.wikipedia.org/wiki/Combining_characters

Combining characters - oh what a mess. I18n is really a minefield.

Kind regards

robert
 
