James said:
I'm pretty sure you are in the minority with this opinion.
Quite possibly.
You really like this?
$ ruby -e 'p "Résumé"[0..1]'
"R\303"
How often is that going to be the desired result?
Well, if I were extracting the first two bytes from a JPEG header, that
would be exactly what I'd expect. I've very rarely wanted to extract the
first two *characters* from a string. I can think of one example: a
string truncation helper in a web page.
def trunc(string, maxlen=50)
  if string.length > maxlen
    string = string[0, maxlen-3] + "..."
  end
  string
end
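For contrast, here is a minimal sketch of how character slicing and byte slicing diverge on multi-byte input, assuming Ruby 1.9 semantics and a UTF-8 source file; String#byteslice (added in 1.9.3) is the byte-wise counterpart:

```ruby
# encoding: UTF-8
s = "Résumé"

# Character slicing (1.9 semantics): the first two *characters*
p s[0, 2]                            # => "Ré"

# Byte slicing (1.8-style semantics): the first two *bytes*,
# which cuts the two-byte UTF-8 character "é" in half
p s.byteslice(0, 2)
p s.byteslice(0, 2).valid_encoding?  # => false
```

The byte slice still carries the UTF-8 tag but is no longer valid UTF-8, which is exactly the failure mode a naive truncation helper can hit.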
I'll certainly agree that's something you'd want to do, and /.{,50}/u is
an ugly way of doing it. In any case, I'm not saying there shouldn't be
any m17n support, or even that tagging strings with encodings is in
itself wrong, as long as the semantic implications are made clear.
The number one bugbear I have is that, unless you take a number of
specific steps to avoid it, program behaviour is inconsistent. You can
run the *same* program with exactly the *same* input data on two
different machines, and they will process it differently, possibly even
crashing in one case. If someone has a problem running your app, it's
now insufficient just to ask what O/S and ruby version they are running
in order to be able to replicate the problem.
Consider an app which is bundled with HTML templates, which the app
reads using File.read(). The templates happen to be written using, say,
UTF-8. It all works fine on my machine, and passes all tests. However it
barfs when run on someone else's machine, because their environment
variables are different.
I think that LC_ALL is a very poor predictor of what encoding a specific
file is in. Ruby doesn't trust it for source files (it uses #encoding
tags instead), so why trust it for data?
Now, if the default external encoding were fixed as (say) UTF-8, that
would be more sane. The default behaviour would then be the same on any
machine where ruby is installed:
- File#gets returns a string with encoding='UTF-8'
- File#read returns a string with encoding='BINARY'
unless explicitly overridden, e.g. when the file is opened. So if these
hypothetical HTML templates are written in ISO-8859-15, you would be
forced to declare this in your program.
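Declaring the encoding at open time might look like the sketch below. The temp file stands in for the hypothetical ISO-8859-15 template; File.read accepts a mode/encoding option in 1.9:

```ruby
require "tempfile"

# Stand-in for the hypothetical ISO-8859-15 HTML template
text = Tempfile.create("template") do |f|
  f.binmode
  f.write("caf\xE9".b)   # "café" encoded as ISO-8859-15 bytes
  f.flush
  # Declare the encoding explicitly instead of trusting the locale:
  File.read(f.path, mode: "r:ISO-8859-15")
end

p text.encoding          # => #<Encoding:ISO-8859-15>
p text.encode("UTF-8")   # => "café"
```

With the encoding pinned in the program, the result no longer depends on the environment variables of whoever happens to run it.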
In any case, I'm used to having my data treated as binary unless I
explicitly ask otherwise. e.g.
$ echo "ßßß" | wc
1 1 7
$ echo "ßßß" | wc -m
4
[Ubuntu Hardy, default setup with LANG=en_GB.UTF-8]
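In Ruby 1.9 terms the same byte/character distinction is explicit on String (a sketch, assuming a UTF-8 source file):

```ruby
# encoding: UTF-8
s = "ßßß"
p s.bytesize   # => 6  (each ß is two bytes in UTF-8; cf. plain wc)
p s.length     # => 3  (characters; cf. wc -m, minus the newline)
```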
Can you list what's not yet covered in my blog series?
I've posted a bunch of lists before. Every time I try out some feature,
because it's undocumented, the test turns up more questions than it
answers. Maybe I really should go ahead and document it all, but that
would be a very large project.
Trying things out in irb used to be a good way to test ruby, but that's
no good in ruby 1.9 because it's not consistent with script behaviour. For
example:
$ irb19
irb(main):001:0> "foo".encoding
=> #<Encoding:US-ASCII>
irb(main):002:0> /foo/.encoding
=> #<Encoding:US-ASCII>
irb(main):003:0> "fooß".encoding
=> #<Encoding:UTF-8>
irb(main):004:0> /fooß/.encoding
=> #<Encoding:UTF-8>
Now try running this program:
p "foo".encoding
p /foo/.encoding
p "fooß".encoding
p /fooß/.encoding
It barfs on the multi-byte chars. That's reasonable in the absence of
knowledge about the source file, so now add an #encoding line:
#encoding: UTF-8
p "foo".encoding
p /foo/.encoding
p "fooß".encoding
p /fooß/.encoding
and you still get a different answer from IRB. The first string now gets
an encoding of UTF-8 instead of US-ASCII; and yet the /foo/ regexp gets
an encoding of US-ASCII in both cases.
This is compounded by the hidden state which remembers whether a
particular string is all 7-bit characters or not. That is, although
"foo" and "fooß" are both marked as having identical encoding UTF-8,
they are actually treated *differently* by the encoding rules. You have
to test using the #ascii_only? method. And yet a regexp literal
apparently follows a different rule. Except when you are in IRB.
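The hidden flag can be observed directly (a sketch; #ascii_only? and the regexp-literal behaviour shown are as of 1.9, with a UTF-8 source file):

```ruby
# encoding: UTF-8
p "foo".encoding        # => #<Encoding:UTF-8>
p "fooß".encoding       # => #<Encoding:UTF-8>

# Identical encoding tags, but different hidden state:
p "foo".ascii_only?     # => true
p "fooß".ascii_only?    # => false

# Regexp literals follow a different rule: a 7-bit literal is
# tagged US-ASCII regardless of the source encoding
p(/foo/.encoding)       # => #<Encoding:US-ASCII>
p(/fooß/.encoding)      # => #<Encoding:UTF-8>
```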
It means that I think your comments are doing harm to the 1.9
migration and I can't find the good you are doing to balance that.
I don't think what I'm saying would stop any library author from
modifying their library to work with 1.9 if they so wish. They have to
make up their own minds.
I believe the worst long-term problems are likely to come from C
extensions. I have seen no hints at all for C extension writers on how to
handle strings properly (especially the hidden ascii_only? state), so I
believe these are likely to have obscure bugs for some time.
Regards,
Brian.