John Ersatznom wrote:
[me:]
>> String.toUpperCase() does /not/ change the spelling of words (how could
>> it, it doesn't know anything about words ?). What it does follow are
>> the correct (insofar as the Unicode spec is correct) rules for mapping
>> lowercase to uppercase. It produces the /same/ word with the /same/
>> spelling[*], but (naturally) a different representation. In this case
>> the number of visually separable glyphs changes because the U+00DF
>> character (LATIN SMALL LETTER SHARP S) is a ligature of two logical
>> characters, long s and short s (U+017F and U+0073); there is no upper
>> case ligature for that combination (compare fi and FI in English
>> typography), so the correct uppercase version of those (logical)
>> characters is the sequence SS. (At least that's the theory the
>> Unicode people seem to be operating on -- they know more about it than
>> me so I'm willing to believe them).
> This seems to be excessively technical when the matter under discussion
> is simply capitalizing strings.
'fraid not. Case mapping is /NOT SIMPLE/, it never has been simple, and never
will be. The fact that case mapping in English /is/ simple is neither here nor
there. That fact has misled many English-speaking programmers into making
invalid assumptions about the complexity of case mapping (and other
orthographical operations), and in the process creating software which is
either inherently broken (in implementation or API design) or restricted to
English text. One example of that unfortunate process is
String.equalsIgnoreCase() -- which would be better named something like
equalsWhileIgnoringCaseAccordingToTheRulesOfEnglish(), except that it doesn't
actually implement the contract implied by that name /either/. In fact there
is no sensible name for what String.equalsIgnoreCase() does.
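
A small sketch of the mismatch (the class name is mine, and the expected
outputs are what I believe the default case mappings give, so treat them as
illustrative rather than gospel):

    public class CaseMismatch {
        public static void main(String[] args) {
            String sharp = "stra\u00DFe";   // "straße", contains U+00DF
            String caps  = "STRASSE";

            // Full case mapping says these are the "same" word...
            System.out.println(sharp.toUpperCase());                // STRASSE

            // ...but equalsIgnoreCase() works char by char, and the lengths
            // differ (6 vs 7), so it answers false.
            System.out.println(sharp.equalsIgnoreCase(caps));       // false
            System.out.println(sharp.toUpperCase().equals(caps));   // true
        }
    }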
> Also, I don't notice "fi"
> and "FI" producing strange behavior myself -- even if the letters are
> often run together so the 'i' hasn't got a separate dot *when typeset*,
> this doesn't affect the representation of a string in a computer, only
> the visually displayed output (and then usually only when serious
> typesetting software is used)
That is a fair criticism of the Unicode position. It may even be correct (I
don't know). The Unicode position is that it ignores ligatures (as a purely
display issue), /except/ where ligature characters are needed in order to
support round-tripping with other existing character sets. In this case U+00DF
/is/ needed for that purpose (and may also be well established as a regularly
used "character" even outside typographically advanced contexts -- I don't
know).
The fact is that there are rules to follow. If those rules strike you as
unnecessarily complicated, then that is your problem, not anyone else's (but
you are certainly not alone). But even if you do dislike the rules, do you
also want to write buggy software ? If you do write buggy software (in this
respect) then, again, you are certainly not alone -- but that doesn't make it
right.
> No, it is not erroneous to expect a method to do exactly and only what
> its name implies.
But it /does/ do exactly what its name implies. Only if you have an incomplete
idea of what case-mapping involves would you fail to understand the name and
its implications.
> So you at least agree with me that it should be consistent with
> toUpperCase (and toLowerCase) -- all strings should have a single
> canonical toUpperCase, a single canonical toLowerCase, both should
> define equivalence classes on the mixed-case input strings, these should
> be the SAME equivalence class, and equalsIgnoreCase should implement and
> embody the corresponding equivalence relation.
But where does the "should" come from ? You can set up that kind of structure
for English, no problem, but it doesn't generalise to other languages. No
matter how much you may /want/ it to, it simply doesn't...
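
To make that concrete (again, class name mine, outputs as I believe them to
be):

    public class NoSingleEquivalence {
        public static void main(String[] args) {
            String a = "ma\u00DF";    // "maß"
            String b = "mass";

            // toUpperCase() puts a and b into the same class...
            System.out.println(a.toUpperCase());          // MASS
            System.out.println(b.toUpperCase());          // MASS

            // ...but toLowerCase() does not -- you can never get "maß" back
            // from "MASS" -- and equalsIgnoreCase() disagrees with both.
            System.out.println("MASS".toLowerCase());     // mass
            System.out.println(a.equalsIgnoreCase(b));    // false
        }
    }

The three operations do not agree on a single notion of "the same string
ignoring case", which is exactly the structure you can't have in general.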
> The version that doesn't shouldn't
> surprise English speakers; the version that does shouldn't surprise
> anyone familiar with its locale-specific behavior for the locale
> actually used.
But there is /nothing/ about Java which implies that instances of
java.lang.String hold English text. Indeed there is everything to suggest
otherwise (why use Unicode at all, for instance ?).
Once you add in Locales then you get /another/ layer of complexity, in that the
case mapping may be Locale-dependent /as well/ as not fitting with the
preconceptions of English (only) speakers.
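
The usual demonstration is Turkish (class name mine; the dotted/dotless-i
mappings are what the Unicode tables specify, and the JDKs I've used do
follow them):

    import java.util.Locale;

    public class TurkishCase {
        public static void main(String[] args) {
            Locale turkish = new Locale("tr", "TR");

            // In Turkish the upper-case of 'i' is dotted capital I (U+0130),
            // and the lower-case of 'I' is dotless small i (U+0131).
            System.out.println("quit".toUpperCase(turkish));   // QUİT
            System.out.println("QUIT".toLowerCase(turkish));   // quıt

            // The no-argument versions quietly use the default Locale, so
            // the same code prints different things on a Turkish machine.
            System.out.println("quit".toUpperCase());
        }
    }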
> Having locale-dependent behavior invoked randomly without
> explicit use of Locale objects, and which furthermore doesn't use the
> system locale, is by itself a sign of a questionable design as well as a
> sure source of bugs and problems.
There's a good deal to be said for the idea that Locale-dependent operations
should either take an explicit Locale as a parameter, or should use a single,
/invariant/, default Locale (not installation dependent). Just as a great deal
of bother would be saved if String<->byte[] conversions didn't use an implicit,
and installation-dependent, character encoding. But even if the Java class
library was in that ideal state, case mapping would not be simple and would not
conform to the expectations of some English speaking programmers.
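
Something along these lines is what I mean (class name mine; Locale.ROOT needs
Java 6, on earlier versions you'd have to settle for, say, Locale.ENGLISH):

    import java.util.Locale;

    public class ExplicitEverything {
        public static void main(String[] args) throws Exception {
            String s = "stra\u00DFe";

            // Case mapping pinned to an invariant Locale, not to whatever
            // the installation default happens to be.
            String upper = s.toUpperCase(Locale.ROOT);    // STRASSE, everywhere

            // Likewise String <-> byte[] conversion with an explicit
            // encoding, rather than the installation-dependent default.
            byte[] bytes = s.getBytes("UTF-8");
            String back = new String(bytes, "UTF-8");

            System.out.println(upper + " / " + back.equals(s));  // STRASSE / true
        }
    }

Note that even toUpperCase(Locale.ROOT) still turns the one-char "ß" into the
two-char "SS"; pinning the Locale removes the installation dependency, not the
complexity.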
There are two problems here. One is that too many programmers expect complex
things to be simpler than they are (which is odd when you consider how
eager programmers and designers often are to make simple things complex). The
other is that we are using legacy libraries which in parts were designed by
programmers who were still holding on to that forlorn hope. The use of default
Locales is one example of that. String.equalsIgnoreCase() is another, and far
worse, example.
> I've even encountered somewhere a notion that aString.length() is not
> even accurate in current Java versions if a string contains obscure
> characters.
It depends on what you mean. String.length() returns, correctly, the number of
Java "char"s in the String. No problem there. What /is/ a problem is that
that is not the same as the number of characters in the Unicode text. That's a
problem caused by the mis-specification of Java's chars to be 16-bit
quantities. It is highly unfortunate, but there is very little that can be
done about it now. It means that correct programming is more difficult than it
looks, and also more difficult than it /should/ be. There is nothing in the
problem space that makes this difficult (well, actually there is, but we'll
pretend there isn't for now[*]), it's not an /inherently/ complex problem, but
historical mistakes in Java's design mean that the API mostly works in terms of
UTF-16 encoding (sequences of 16-bit values) rather than in terms of real
Unicode characters.
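
For example (class name mine; U+1D11E is just a convenient character from
outside the 16-bit range):

    public class TwoKindsOfLength {
        public static void main(String[] args) {
            // U+1D11E MUSICAL SYMBOL G CLEF doesn't fit in one 16-bit char,
            // so in UTF-16 it is stored as a surrogate pair.
            String clef = "\uD834\uDD1E";

            System.out.println(clef.length());                          // 2  Java chars
            System.out.println(clef.codePointCount(0, clef.length()));  // 1  Unicode character
        }
    }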
> It suggests aString.<something using the obscure term "code
> point", apparently just Unicode-geek for "character"> as its
> replacement, while of course there's a ton of legacy code using
> length().
For the most part, such code will remain correct. One way to think of it is
that instances of java.lang.String do not, despite the name, directly represent
Unicode strings (sequences of Unicode characters), but are UTF-16. I.e. only
the name of the class is wrong. Most operations on UTF-16 data "do the right
thing" for the Unicode information it represents -- concatenating two UTF-16
sequences, for instance. It's only operations which mess around taking strings
apart[**] which are likely to do something invalid unexpectedly, and even there
they quite often work correctly.
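
A sketch of the sort of thing that can go wrong, and of the Java 5 API for
avoiding it (class name and the particular indexes are mine):

    public class Splitting {
        public static void main(String[] args) {
            String s = "G clef: \uD834\uDD1E!";   // surrogate pair at chars 8 and 9

            // Cutting at an arbitrary char index can split the surrogate
            // pair, leaving an unpaired surrogate behind.
            String broken = s.substring(0, 9);

            // Cutting at a code point boundary avoids that.
            int boundary = s.offsetByCodePoints(0, 9);  // char index after 9 code points
            String ok = s.substring(0, boundary);

            System.out.println(broken.length() + " / " + ok.length());  // 9 / 10
        }
    }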
The situation is unfortunate, but it's not really fatal. If any programmer is
capable of understanding the difference between a sequence of characters and a
sequence of bytes in some encoding, in the first place (necessary to do textual
IO in Java at all), then adjusting to the deficiencies of the String class
should not be overwhelmingly difficult.
There are issues to understand, and knowledge to be acquired; that's all...
> I don't suppose it occurred to them that the new fancy-whosit
> should have been a replacement length() implementation instead of some
> new name that doesn't suggest anything to do with the length of a string
> to someone who doesn't care about all the Unicode bells and whistles and
> just wants to process strings while remaining agnostic about what they
> are ultimately used for or contain?
I think they did the best they could. A better (but impossible in practice)
solution would have been to redefine "char" to be a >=24 bit quantity (I'd have
chosen 32-bit signed, myself), and redefine String to contain the new "char"s.
It would have been nice to refactor String to separate the physical (internal)
representation of the data from the logical character-based API.
Unfortunately, that would have been impossible unless they made the change
/very/ early -- and they missed the short window of opportunity for that. The
scheme they came up with, effectively redefining what "String" and "char" mean,
is probably the best possible solution. It doesn't break existing code -- in
the sense that what worked before continues to work -- all that has changed is
the interpretation of that code.
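
So, for instance, both of these loops still compile and run; what has changed
is only what they should now be understood to iterate over (class name mine):

    public class TwoLoops {
        public static void main(String[] args) {
            String s = "a\uD834\uDD1Eb";  // 'a', G clef (one character, two chars), 'b'

            // The old idiom: one 16-bit char at a time -- four iterations,
            // two of which each see half a character.
            for (int i = 0; i < s.length(); i++) {
                System.out.println(Integer.toHexString(s.charAt(i)));
            }

            // The Java 5 idiom: one Unicode character (code point) at a
            // time -- three iterations.
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                System.out.println(Integer.toHexString(cp));
                i += Character.charCount(cp);
            }
        }
    }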
Code which /looks/ as if it will cope with all meaningful inputs does not (but
then, it never would have done). Not a satisfactory position, but the best we
are going to get.
There are issues to understand, and knowledge to be acquired; that's all...
-- chris
[*] The "length" of a Unicode string is somewhat problematical since some
characters qualify others (diacritical marks etc), and some "characters" are
not even characters at all. These issues are probably better thought of as
technical problems caused by the (unavoidable) compromises in Unicode's design
than something inherent to the problem space, but they are still issues for
creators of text-aware applications (few Java applications /are/ text-aware to
that degree).
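
For the curious, java.text.BreakIterator is the nearest the standard library
gets to counting those user-perceived characters (class name mine; the counts
are what I'd expect from the default character instance):

    import java.text.BreakIterator;

    public class PerceivedCharacters {
        public static void main(String[] args) {
            // 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code
            // points, but one visible character ("é").
            String s = "cafe\u0301";

            System.out.println(s.length());                        // 5 chars
            System.out.println(s.codePointCount(0, s.length()));   // 5 code points

            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int graphemes = 0;
            while (it.next() != BreakIterator.DONE) {
                graphemes++;
            }
            System.out.println(graphemes);    // 4, as a reader would count them
        }
    }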
[**] I should note that taking sequences of logical Unicode characters apart is
also non-trivial, quite independently of Java's representational deficiencies,
and may not fit with English speaking programmers' preconceptions. However,
that's a different kettle of problems and not really relevant here.