John Ersatznom wrote:
[me:]
>> String.toUpperCase() does /not/ change the spelling of words (how could
>> it, it doesn't know anything about words ?). What it does follow are
>> the correct (insofar as the Unicode spec is correct) rules for mapping
>> lowercase to uppercase. It produces the /same/ word with the /same/
>> spelling[*], but (naturally) a different representation. In this case
>> the number of visually separable glyphs changes because the U+00DF
>> character (LATIN SMALL LETTER SHARP S) is a ligature of two logical
>> characters, long s and short s (U+017F and U+0073); there is no upper
>> case ligature for that combination (compare fi and FI in English
>> typography), so the correct uppercase version of those (logical)
>> characters is the sequence SS. (At least that's the theory the
>> Unicode people seem to be operating on -- they know more about it than
>> me so I'm willing to believe them).
> This seems to be excessively technical when the matter under discussion
> is simply capitalizing strings.
'fraid not. Case mapping is /NOT SIMPLE/, it never has been simple, and never
will be. The fact that case mapping in English /is/ simple is neither here nor
there. That fact has misled many English-speaking programmers into making
invalid assumptions about the complexity of case mapping (and other
orthographical operations), and in the process creating software which is
either inherently broken (in implementation or API design) or restricted to
English text. One example of that unfortunate process is
String.equalsIgnoreCase() -- which would be better named something like
equalsWhileIgnoringCaseAccordingToTheRulesOfEnglish(), except that it doesn't
actually implement the contract implied by that name /either/. In fact there
is no sensible name for what String.equalsIgnoreCase() does.
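
A small sketch of the mismatch (the class name is mine, and the expected
outputs are what I believe the default case mappings give, so treat them as
illustrative rather than gospel):

    public class CaseMismatch {
        public static void main(String[] args) {
            String sharp = "stra\u00DFe";   // "straße", contains U+00DF
            String caps  = "STRASSE";

            // Full case mapping says these are the "same" word...
            System.out.println(sharp.toUpperCase());                // STRASSE

            // ...but equalsIgnoreCase() works char by char, and the lengths
            // differ (6 vs 7), so it answers false.
            System.out.println(sharp.equalsIgnoreCase(caps));       // false
            System.out.println(sharp.toUpperCase().equals(caps));   // true
        }
    }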
> Also, I don't notice "fi"
> and "FI" producing strange behavior myself -- even if the letters are
> often run together so the 'i' hasn't got a separate dot *when typeset*,
> this doesn't affect the representation of a string in a computer, only
> the visually displayed output (and then usually only when serious
> typesetting software is used)
That is a fair criticism of the Unicode position. It may even be correct (I
don't know). The Unicode position is that it ignores ligatures (as a purely
display issue), /except/ where ligature characters are needed in order to
support round-tripping with other existing character sets. In this case U+00DF
/is/ needed for that purpose (and may also be well established as a regularly
used "character" even outside typographically advanced contexts -- I don't
know).
The fact is that there are rules to follow. If those rules strike you as
unnecessarily complicated, then that is your problem, not anyone else's (but
you are certainly not alone). But even if you do dislike the rules, do you
also want to write buggy software ? If you do write buggy software (in this
respect) then, again, you are certainly not alone -- but that doesn't make it
right.
> No, it is not erroneous to expect a method to do exactly and only what
> its name implies.
But it /does/ do exactly what its name implies. Only if you have an incomplete
idea of what case-mapping involves would you fail to understand the name and
its implications.
> So you at least agree with me that it should be consistent with
> toUpperCase (and toLowerCase) -- all strings should have a single
> canonical toUpperCase, a single canonical toLowerCase, both should
> define equivalence classes on the mixed-case input strings, these should
> be the SAME equivalence class, and equalsIgnoreCase should implement and
> embody the corresponding equivalence relation.
But where does the "should" come from ? You can set up that kind of structure
for English, no problem, but it doesn't generalise to other languages. No
matter how much you may /want/ it to, it simply doesn't...
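
To make that concrete (again, class name mine, outputs as I believe them to
be):

    public class NoSingleEquivalence {
        public static void main(String[] args) {
            String a = "ma\u00DF";    // "maß"
            String b = "mass";

            // toUpperCase() puts a and b into the same class...
            System.out.println(a.toUpperCase());          // MASS
            System.out.println(b.toUpperCase());          // MASS

            // ...but toLowerCase() does not -- you can never get "maß" back
            // from "MASS" -- and equalsIgnoreCase() disagrees with both.
            System.out.println("MASS".toLowerCase());     // mass
            System.out.println(a.equalsIgnoreCase(b));    // false
        }
    }

The three operations do not agree on a single notion of "the same string
ignoring case", which is exactly the structure you can't have in general.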
> The version that doesn't shouldn't
> surprise English speakers; the version that does shouldn't surprise
> anyone familiar with its locale-specific behavior for the locale
> actually used.
But there is /nothing/ about Java which implies that instances of
java.lang.String hold English text. Indeed there is everything to suggest
otherwise (why use Unicode at all, for instance ?).
Once you add in Locales then you get /another/ layer of complexity, in that the
case mapping may be Locale-dependent /as well/ as not fitting with the
preconceptions of English (only) speakers.
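
The usual demonstration is Turkish (class name mine; the dotted/dotless-i
mappings are what the Unicode tables specify, and the JDKs I've used do
follow them):

    import java.util.Locale;

    public class TurkishCase {
        public static void main(String[] args) {
            Locale turkish = new Locale("tr", "TR");

            // In Turkish the upper-case of 'i' is dotted capital I (U+0130),
            // and the lower-case of 'I' is dotless small i (U+0131).
            System.out.println("quit".toUpperCase(turkish));   // QUİT
            System.out.println("QUIT".toLowerCase(turkish));   // quıt

            // The no-argument versions quietly use the default Locale, so
            // the same code prints different things on a Turkish machine.
            System.out.println("quit".toUpperCase());
        }
    }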
> Having locale-dependent behavior invoked randomly without
> explicit use of Locale objects, and which furthermore doesn't use the
> system locale, is by itself a sign of a questionable design as well as a
> sure source of bugs and problems.
There's a good deal to be said for the idea that Locale-dependent operations
should either take an explicit Locale as a parameter, or should use a single,
/invariant/, default Locale (not installation dependent). Just as a great deal
of bother would be saved if String<->byte[] conversions didn't use an implicit,
and installation-dependent, character encoding. But even if the Java class
library was in that ideal state, case mapping would not be simple and would not
conform to the expectations of some English speaking programmers.
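
Something along these lines is what I mean (class name mine; Locale.ROOT needs
Java 6, on earlier versions you'd have to settle for, say, Locale.ENGLISH):

    import java.util.Locale;

    public class ExplicitEverything {
        public static void main(String[] args) throws Exception {
            String s = "stra\u00DFe";

            // Case mapping pinned to an invariant Locale, not to whatever
            // the installation default happens to be.
            String upper = s.toUpperCase(Locale.ROOT);    // STRASSE, everywhere

            // Likewise String <-> byte[] conversion with an explicit
            // encoding, rather than the installation-dependent default.
            byte[] bytes = s.getBytes("UTF-8");
            String back = new String(bytes, "UTF-8");

            System.out.println(upper + " / " + back.equals(s));  // STRASSE / true
        }
    }

Note that even toUpperCase(Locale.ROOT) still turns the one-char "ß" into the
two-char "SS"; pinning the Locale removes the installation dependency, not the
complexity.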
There are two problems here. One is that too many programmers expect complex
things to be simpler than they are (which is odd when you consider how
eager programmers and designers often are to make simple things complex). The
other is that we are using legacy libraries which in parts were designed by
programmers who were still holding on to that forlorn hope. The use of default
Locales is one example of that. String.equalsIgnoreCase() is another, and far
worse, example.
> I've even encountered somewhere a notion that aString.length() is not
> even accurate in current Java versions if a string contains obscure
> characters.
It depends on what you mean. String.length() returns, correctly, the number of
Java "char"s in the String. No problem there. What /is/ a problem is that
that is not the same as the number of characters in the Unicode text. That's a
problem caused by the mis-specification of Java's chars to be 16-bit
quantities. It is highly unfortunate, but there is very little that can be
done about it now. It means that correct programming is more difficult than it
looks, and also more difficult than it /should/ be. There is nothing in the
problem space that makes this difficult (well, actually there is, but we'll
pretend there isn't for now[*]), it's not an /inherently/ complex problem, but
historical mistakes in Java's design mean that the API mostly works in terms of
UTF-16 encoding (sequences of 16-bit values) rather than in terms of real
Unicode characters.
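
For example (class name mine; U+1D11E is just a convenient character from
outside the 16-bit range):

    public class TwoKindsOfLength {
        public static void main(String[] args) {
            // U+1D11E MUSICAL SYMBOL G CLEF doesn't fit in one 16-bit char,
            // so in UTF-16 it is stored as a surrogate pair.
            String clef = "\uD834\uDD1E";

            System.out.println(clef.length());                          // 2  Java chars
            System.out.println(clef.codePointCount(0, clef.length()));  // 1  Unicode character
        }
    }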
> It suggests aString.<something using the obscure term "code
> point", apparently just Unicode-geek for "character"> as its
> replacement, while of course there's a ton of legacy code using
> length().
For the most part, such code will remain correct. One way to think of it is
that instances of java.lang.String do not, despite the name, directly represent
Unicode strings (sequences of Unicode characters), but are UTF-16. I.e. only
the name of the class is wrong. Most operations on UTF-16 data "do the right
thing" for the Unicode information it represents -- concatenating two UTF-16
sequences, for instance. It's only operations which mess around taking strings
apart[**] which are likely to do something invalid unexpectedly, and even there
they quite often work correctly.
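
A sketch of the sort of thing that can go wrong, and of the Java 5 API for
avoiding it (class name and the particular indexes are mine):

    public class Splitting {
        public static void main(String[] args) {
            String s = "G clef: \uD834\uDD1E!";   // surrogate pair at chars 8 and 9

            // Cutting at an arbitrary char index can split the surrogate
            // pair, leaving an unpaired surrogate behind.
            String broken = s.substring(0, 9);

            // Cutting at a code point boundary avoids that.
            int boundary = s.offsetByCodePoints(0, 9);  // char index after 9 code points
            String ok = s.substring(0, boundary);

            System.out.println(broken.length() + " / " + ok.length());  // 9 / 10
        }
    }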
The situation is unfortunate, but it's not really fatal. If any programmer is
capable of understanding the difference between a sequence of characters and a
sequence of bytes in some encoding, in the first place (necessary to do textual
IO in Java at all), then adjusting to the deficiencies of the String class
should not be overwhelmingly difficult.
There are issues to understand, and knowledge to be acquired; that's all...
> I don't suppose it occurred to them that the new fancy-whosit
> should have been a replacement length() implementation instead of some
> new name that doesn't suggest anything to do with the length of a string
> to someone who doesn't care about all the Unicode bells and whistles and
> just wants to process strings while remaining agnostic about what they
> are ultimately used for or contain?
I think they did the best they could. A better (but impossible in practice)
solution would have been to redefine "char" to be a >=24 bit quantity (I'd have
chosen 32-bit signed, myself), and redefine String to contain the new "char"s.
It would have been nice to refactor String to separate the physical (internal)
representation of the data from the logical character-based API.
Unfortunately, that would have been impossible unless they made the change
/very/ early -- and they missed the short window of opportunity for that. The
scheme they came up with, effectively redefining what "String" and "char" mean,
is probably the best possible solution. It doesn't break existing code -- in
the sense that what worked before continues to work -- all that has changed is
the interpretation of that code.
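
So, for instance, both of these loops still compile and run; what has changed
is only what they should now be understood to iterate over (class name mine):

    public class TwoLoops {
        public static void main(String[] args) {
            String s = "a\uD834\uDD1Eb";  // 'a', G clef (one character, two chars), 'b'

            // The old idiom: one 16-bit char at a time -- four iterations,
            // two of which each see half a character.
            for (int i = 0; i < s.length(); i++) {
                System.out.println(Integer.toHexString(s.charAt(i)));
            }

            // The Java 5 idiom: one Unicode character (code point) at a
            // time -- three iterations.
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                System.out.println(Integer.toHexString(cp));
                i += Character.charCount(cp);
            }
        }
    }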
Code which /looks/ as if it will cope with all meaningful inputs does not (but
then, it never would have done). Not a satisfactory position, but the best we
are going to get.
There are issues to understand, and knowledge to be acquired; that's all...
-- chris
[*] The "length" of a Unicode string is somewhat problematical since some
characters qualify others (diacritical marks etc), and some "characters" are
not even characters at all. These issues are probably better thought of as
technical problems caused by the (unavoidable) compromises in Unicode's design
than something inherent to the problem space, but they are still issues for
creators of text-aware applications (few Java applications /are/ text-aware to
that degree).
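
For the curious, java.text.BreakIterator is the nearest the standard library
gets to counting those user-perceived characters (class name mine; the counts
are what I'd expect from the default character instance):

    import java.text.BreakIterator;

    public class PerceivedCharacters {
        public static void main(String[] args) {
            // 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code
            // points, but one visible character ("é").
            String s = "cafe\u0301";

            System.out.println(s.length());                        // 5 chars
            System.out.println(s.codePointCount(0, s.length()));   // 5 code points

            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int graphemes = 0;
            while (it.next() != BreakIterator.DONE) {
                graphemes++;
            }
            System.out.println(graphemes);    // 4, as a reader would count them
        }
    }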
[**] I should note that taking sequences of logical Unicode characters apart is
also non-trivial, quite independently of Java's representational deficiencies,
and may not fit with English speaking programmers' preconceptions. However,
that's a different kettle of problems and not really relevant here.