How to check variables for uniqueness?

Lew · Jan 18, 2007

John said:
Returning to the original context of this discussion, I repeat my
assertion that it also /should/ be simple to use strings as map keys,
and to do so case-insensitively if you so desire. AFAICT, in fact, using
toUpperCase might be the way to do it, avoiding equalsIgnoreCase and
toLowerCase. The weird German word with the untypable character becomes
"BEISSEN" and so presumably does "beissen", so all variations on that
become one key value. I guess x.toUpperCase().equals(y.toUpperCase())
has to be used as the "real" equalsIgnoreCase() then. And to get a
canonical lower case form, x.toUpperCase().toLowerCase(), which will
turn any spelling of that same word into "beissen". That becomes the
"real" toLowerCase() then.

Your observations are correct. It is not generally true that for non-null
String foo

foo.toUpperCase().toLowerCase().equals( foo );

nor that

foo.toUpperCase().length() == foo.length()

nor that

foo.equalsIgnoreCase( foo.toUpperCase() )

This is by design. Nowhere does anyone claim that these equalities hold. You
may lament all you wish that they do not.

- Lew
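
For readers who want to see the three non-identities above fail in practice, here is a minimal sketch using the "beißen" example from the quoted post. The class name `CaseRoundTrip` and the helper `upper` are made up for illustration, and `Locale.ROOT` is an assumption added so the result does not depend on the machine's default locale.

```java
import java.util.Locale;

class CaseRoundTrip {
    // "beißen", written with a Unicode escape so the source file stays ASCII.
    static final String FOO = "bei\u00DFen";

    // Hypothetical helper: locale-independent upper-casing.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String up = upper(FOO);                     // "BEISSEN"
        String back = up.toLowerCase(Locale.ROOT);  // "beissen"

        // None of the three equalities holds for this input.
        System.out.println(back.equals(FOO));             // false
        System.out.println(up.length() == FOO.length());  // false (7 vs 6)
        System.out.println(FOO.equalsIgnoreCase(up));     // false (lengths differ)
    }
}
```

The sharp s has no single-character upper-case form in this mapping, so the string grows by one character and the round trip cannot restore it.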

Chris Uppal · Jan 19, 2007

John said:
I checked the history of this thread again and saw that it started with
a post by one Oliver Wong. I then googled this bloke, and found in this
same newsgroup a thread of around 500 articles half of them authored by
him. I get the impression he's an extremely argumentative, arrogant and
condescending man whose primary mission in life is to find postings in
this newsgroup and attack them accusing the author of making mistakes if
he finds anything in them that differs in the slightest from his
personal beliefs.

Look, we all know who you are.

I, personally, was willing to assume that your new nom de plume reflected a
desire on your part to start afresh here, without the baggage of your previous
(occasionally atrocious) behaviour. I have been, despite slight misgivings,
happy to interact with "John Ersatznom" as if he were a brand new member of
this community.

I would /still/ be willing to act on that assumption, even if you want to
provoke acrimonious dispute (though somehow I doubt if you'd find it easy to
persuade Oliver to join in), but this kind of glove-puppetry is just sickening.
The point is not the slur against Oliver (although I respect him, and don't
want to see him slagged off, I respect him enough to think that he can look
after himself in these matters) but the above quoted paragraph is an insult to
every reader's intelligence.

How, or even whether, other people choose to react is their affair, but you
have passed the bounds of /my/ tolerance.

-- chris

John W. Kennedy · Jan 19, 2007

John said:
I don't think this is relevant here. Someone familiar with FP math won't
be surprised by the behavior of the above. But a programmer using
toUpperCase on strings to key a hash table for case-insensitive lookup
is going to be surprised if they do weird things like change length,
compare equal for strings that aren't equalsIgnoreCase(), and the like.
Remember, most programmers a) are English speaking and b) have
backgrounds in various programming languages, often including ones with
ASCII string classes and case-transforming methods that behave in the
"usual" way -- that is, each output letter corresponds to 1 input letter
under a fairly basic transformation rule.

Ineducable.

*PLONK*

Ian Wilson · Jan 19, 2007

John said:
Ian said:

[American newspapers] might use the 6 character a\uFB04uent in article text but the 8
character AFFLUENT in headlines.

??

Encoding not apparently supported at my end, sorry.

Apparently you are wrong.

The quoted text above is all encoded in ASCII. It wasn't intended to
present an ffl ligature on your screen. It contains an ASCII
representation of the Unicode code-point for an ffl ligature, in a form
(\uXXXX) that should be familiar to readers of Java newsgroups.

My message had this encoding:
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Let's look at your headers:
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Your headers also indicate you're using Thunderbird, as am I.

ISO-8859-1 is a superset of ASCII, so has no problems with the ASCII
text of my message.

Your newsreader only needs to be ASCII compatible to display the six
ASCII characters backslash u F B zero four.

The encodings I used ARE supported at your end, either in Thunderbird or
in Java.
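
As a small illustration of the point about the escape form, the sketch below contrasts the escape as Java source (which the compiler turns into the single character U+FB04, LATIN SMALL LIGATURE FFL) with the same six ASCII characters kept as literal text. The class and constant names are hypothetical, and `Locale.ROOT` is an assumption added for reproducibility.

```java
import java.util.Locale;

class LigatureEscape {
    // The compiler replaces the escape with the single character U+FB04.
    static final String LIGATED = "a\uFB04uent";

    // A doubled backslash keeps the six ASCII characters as plain text.
    static final String ESCAPE_TEXT = "\\uFB04";

    public static void main(String[] args) {
        System.out.println(LIGATED.length());      // 6
        System.out.println(ESCAPE_TEXT.length());  // 6

        // Upper-casing expands the ligature into three separate letters.
        System.out.println(LIGATED.toUpperCase(Locale.ROOT));  // AFFLUENT
    }
}
```

This matches the headline example: six characters in running text, eight once upper-cased.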

Lew · Jan 19, 2007

Chris said:
Look, we all know who you are.

I *thought* so!

- Lew

Oliver Wong · Jan 19, 2007

John Ersatznom said:
What you have not done is explain why you attacked one of my posts earlier
in the thread. That is what started this whole sideline, which is
irrelevant to the OP's problem.

I fear I'm going to open up a whole can of twisty little worms with this
one, but... Can you cite what it is I said that you consider to be an
"attack"?

What is surprising (and violates the Principle of Least Surprise) is the
following:

x.toFooCase().equals(y.toFooCase()) != x.equalsIgnoreCase(y)
x.toFooCase().length() != x.length()

for some choices of x, y, and Foo.

If you are not surprised by the fact that "ß".toUpperCase() yields "SS",
then you should not be surprised that there exist some values for x such
that x.toUpperCase().length() != x.length().

[Snip "everyday blokes" argument]

I don't think this is relevant here.

The relevance is this: You claim that the behaviour of toUpperCase
should change because it's surprising to everyday blokes. I am arguing that
this is not a valid reason for changing the behaviour of toUpperCase,
because everyday blokes, not being linguists, are unqualified to make
linguistic rules that may have widespread implications for languages other
than their own.

[...]

Remember, most programmers a) are English speaking and b) have backgrounds
in various programming languages, often including ones with ASCII string
classes and case-transforming methods that behave in the "usual" way --
that is, each output letter corresponds to 1 input letter under a fairly
basic transformation rule.

Are you sure about these assertions? Do you not think that there might
be more Chinese/Japanese programmers than English programmers, given the
huge population of Asia as compared to the western countries, and the recent
economic growth spurt in Asia? And what about India?

Yes, but you're weird, and apparently multilingual rather than *unilingual
English*.

I claim I am not the only programmer in the world who is unilingual
English.

[...]

It suffices to mention the axiom that words with different numbers of
letters are spelled differently.

Two issues:

(1) Your axiom fails to satisfy my requirement that your definition must
be outside the context of any one particular language. Chinese characters,
for example, are not composed of letters, and so speaking about "number of
letters in a word" is meaningless there.

(2) That wasn't what I was reluctant to agree with anyway. I am not
arguing against the idea that "color" and "colour" are spelt differently.
However, I *AM* arguing against the idea that "color" and "colour" are the
same word (depending on your definition of "word" which I am awaiting), and
I am arguing against the idea that "a concept like 'same spelling' can't be
flawed" (depending on your definition of spelling, which I am awaiting).

Recall that there exist languages where words are not written using
letters. So any definition of "spelling" which depends on "letters" is
inherently flawed.

- Oliver

Oliver Wong · Jan 19, 2007

John Ersatznom said:
I guess x.toUpperCase().equals(y.toUpperCase()) has to be used as the
"real" equalsIgnoreCase() then. And to get a canonical lower case form,
x.toUpperCase().toLowerCase(), which will turn any spelling of that same
word into "beissen". That becomes the "real" toLowerCase() then.

I don't think this works in general (though if you're sure all your
input is going to be ASCII, English, German, or some other subset of the
full Unicode standard, you can probably get away with it).
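
For the restricted case described above (input known to be ASCII, English or German), the upper-then-lower canonical form from the quoted post can be sketched as a map key. This is only an illustration under that assumption, not a general solution: `FoldedKeyMap` and `foldKey` are invented names, and `Locale.ROOT` is an added assumption to keep the folding independent of the default locale.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class FoldedKeyMap {
    // Hypothetical helper: upper-case first, then lower-case, as the quoted post suggests.
    static String foldKey(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    private final Map<String, Integer> counts = new HashMap<>();

    // Count a word under its folded key.
    void add(String word) {
        counts.merge(foldKey(word), 1, Integer::sum);
    }

    // Look up how many times any spelling with the same folded key was added.
    int count(String word) {
        return counts.getOrDefault(foldKey(word), 0);
    }

    public static void main(String[] args) {
        FoldedKeyMap m = new FoldedKeyMap();
        m.add("bei\u00DFen");  // "beißen"
        m.add("BEISSEN");
        m.add("Beissen");
        System.out.println(m.count("beissen"));  // 3
    }
}
```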

If you look at the source code for String.equalsIgnoreCase(), you'll see
it calls some helper methods, and in one of these methods, you have the
following comment:

<quote>
// Unfortunately, conversion to uppercase does not work properly
// for the Georgian alphabet, which has strange rules about case
// conversion. So we need to make one last check before
// exiting.
</quote>

I'm not familiar with the Georgian alphabet, so I don't know of any
examples that clearly demonstrate the problem with
x.toUpperCase().equals(y.toUpperCase()), but suffice it to say that there
exist languages out there for which this won't work.

- Oliver
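
The sketch below shows the two comparisons disagreeing in both directions. It does not use Georgian text (no Georgian example is given in the thread); instead it assumes the Turkish dotted capital I (U+0130) and the German sharp s as stand-ins. The class name and the `upperEquals` helper are hypothetical.

```java
import java.util.Locale;

class IgnoreCaseVsUpper {
    // Hypothetical helper: the "compare upper-cased strings" approach from the quoted post.
    static boolean upperEquals(String x, String y) {
        return x.toUpperCase(Locale.ROOT).equals(y.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String dottedCapitalI = "\u0130";  // LATIN CAPITAL LETTER I WITH DOT ABOVE
        String smallI = "i";

        // Upper-casing leaves U+0130 alone but maps "i" to "I", so the strings differ.
        System.out.println(upperEquals(dottedCapitalI, smallI));      // false
        // equalsIgnoreCase also compares lower-cased characters as a last check.
        System.out.println(dottedCapitalI.equalsIgnoreCase(smallI));  // true

        String sharpS = "bei\u00DFen";  // "beißen"
        System.out.println(upperEquals(sharpS, "BEISSEN"));      // true
        System.out.println(sharpS.equalsIgnoreCase("BEISSEN"));  // false (lengths differ)
    }
}
```

Because `equalsIgnoreCase` works character by character, it can never match strings of different lengths, while whole-string upper-casing can change the length.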

Mark Thornton · Jan 19, 2007

John said:
What is surprising (and violates the Principle of Least Surprise) is the
following:

x.toFooCase().equals(y.toFooCase()) != x.equalsIgnoreCase(y)
x.toFooCase().length() != x.length()

for some choices of x, y, and Foo.

The trouble is that some (human) languages are evidently surprising to
those not aware of them. Java can't change the fact that German and
Georgian exist, nor can it change how these languages behave. For me, to
not uppercase ß as SS would be surprising. (Although English is my
native tongue, I did learn German at school some 30 years ago.)

> x.toFooCase().equals(y.toFooCase()) != x.equalsIgnoreCase(y)

I believe this problem arises because some languages effectively have
more than two cases. An identity that seems obvious in a two case world,
ceases to be meaningful in a more complex situation.

Mark Thornton
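
One possible reading of "more than two cases" is Unicode title case, which Java exposes through `Character.isTitleCase` and `Character.toTitleCase`. The post does not say which languages it has in mind, so treat the digraph below (used for Croatian) as an assumed example; the class and constant names are hypothetical.

```java
class ThirdCase {
    static final char DZ_UPPER = '\u01C4';  // LATIN CAPITAL LETTER DZ WITH CARON
    static final char DZ_TITLE = '\u01C5';  // LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
    static final char DZ_LOWER = '\u01C6';  // LATIN SMALL LETTER DZ WITH CARON

    public static void main(String[] args) {
        // The title-case form is neither upper case nor lower case.
        System.out.println(Character.isTitleCase(DZ_TITLE));  // true
        System.out.println(Character.isUpperCase(DZ_TITLE));  // false
        System.out.println(Character.isLowerCase(DZ_TITLE));  // false

        // All three forms map onto each other.
        System.out.println(Character.toUpperCase(DZ_TITLE) == DZ_UPPER);  // true
        System.out.println(Character.toLowerCase(DZ_TITLE) == DZ_LOWER);  // true
        System.out.println(Character.toTitleCase(DZ_LOWER) == DZ_TITLE);  // true
    }
}
```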

Mark Thornton · Jan 19, 2007

Chris said:
I didn't know that; thanks. Do you have any kind of a reference ?

I can't find the article I wanted, but this gives some clues:

http://blogs.msdn.com/michkap/archive/2005/10/17/481600.aspx

http://blogs.msdn.com/michkap/archive/2005/01/16/353873.aspx

Mark Thornton

Mike Schilling · Feb 18, 2007

Chris said:
Mark Thornton wrote:

[me:]

When UNICODE was first proposed it was expected that 16 bits would be
enough. The designers of Java believed them. I'm not sure of the
exact timing of Unicode's extension beyond 16 bits relative to Java's
development.

According to a previous post of mine (i.e. I can't be bothered to go
back and re-check the facts), Unicode 2.0.0, by which version the
16-bit idea was definitely dead, was published in July '96. JDK 1.0
had been released in January of the same year. But I don't think the
window of opportunity was as small as that suggests. For one thing I
think Java was standardised too early (why standardise something
/before/ anyone has used it ?). For another, Sun's central
involvement with the development of Unicode certainly put them in a
position where it was /possible/ that Gosling etc. could have known
what was coming up, even if the general Unicode-using population
wasn't paying attention. (Not to say that they /should/ have known,
only that they /might/ have done, and that it's unfortunate that they
didn't.)

Even if Gosling et al had known that Unicode would grow
beyond 16 bits, it might still have been correct to use 16 bits for
Java characters. Even as it was there was a fair bit of muttering
about the space used by these wide characters.

That's pretty much the reason why I think the physical representation
of the data in a String should have been separated from its logical
contents. There are too many tradeoffs for any one representation to
have a convincing advantage (and if there was one convincing "best"
representation, I doubt whether it would be UTF16 -- combining as it
does most of the space costs of a constant-width representation with
most of the time costs of a variable-width encoding...).

I believe that the existing mess could have been avoided with precisely one
change to the existing definition of Java: rather than define a char as a
16-bit integral type, define it as a non-integral type of an unspecified
size. (Exactly like boolean.) As it stands today the only Java that would
have trouble with 21-bit characters is code which converts chars to shorts
and back.
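
To make the 16-bit limitation concrete, here is a minimal sketch of how a character beyond 16 bits appears in Java as it stands. The code point U+1D11E (MUSICAL SYMBOL G CLEF) is just a convenient assumed example, and the class and method names are hypothetical.

```java
class SupplementaryChars {
    // U+1D11E MUSICAL SYMBOL G CLEF: a code point that does not fit in 16 bits.
    static final int G_CLEF = 0x1D11E;

    // Hypothetical helper: build a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(G_CLEF);
        System.out.println(s.length());                              // 2 (a surrogate pair)
        System.out.println(s.codePointCount(0, s.length()));         // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(s.codePointAt(0) == G_CLEF);              // true

        // The char-to-short round trip works only because char is exactly 16 bits.
        char c = '\u00DF';
        short asShort = (short) c;
        System.out.println((char) asShort == c);                     // true
    }
}
```

Code that indexes by `char` sees two units here, while the code-point methods see one character.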

Chris Uppal · Feb 19, 2007

Mike said:
I believe that the existing mess could have been avoided with precisely
one change to the existing definition of Java: rather than define a char
as a 16-bit integral type, define it as a non-integral type of an
unspecified size. (Exactly like boolean.)

Or even as an unsigned integral type with a size guaranteed to be <= 31 bits.

There would have to have been small changes to the JVM spec, and to the
serialisation spec too (and to JNI -- though that didn't come out until later).
But nothing of staggering difficulty. Even the JVM /implementation/ would be
largely unchanged since chars are represented in 32-bit slots on the stack
anyway...

.... sigh ...

-- chris

Mike Schilling · Feb 19, 2007

Chris said:
Or even as an unsigned integral type with a size guaranteed to be <=
31 bits.

I considered that, but didn't see a definite need for even that loose
guarantee. (Why 31 and not 32, by the way? What makes the guarantee
useful, I think, is that chars can be losslessly converted to ints.)

Mike Schilling · Feb 19, 2007

Mike said:
I considered that, but didn't see a definite need for even that loose
guarantee. (Why 31 and not 32, by the way? What makes the guarantee
useful, I think, is that chars can be losslessly converted to ints.)

Oh, "unsigned". Never mind.

Arne Vajhøj · Feb 20, 2007

Chris said:
Or even as an unsigned integral type with a size guaranteed to be <= 31 bits.

There would have to have been small changes to the JVM spec, and to the
serialisation spec too (and to JNI -- though that didn't come out until later).
But nothing of staggering difficulty. Even the JVM /implementation/ would be
largely unchanged since chars are represented in 32-bit slots on the stack
anyway...

For proper interoperability it has to be specified what it is.

Arne

Mike Schilling · Feb 20, 2007

Arne said:
For proper interoperability it has to be specified what it is.

Or, to be precise, how it's represented externally. This could be, for
instance, as the character's UTF-8 representation.
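
A rough sketch of what exchanging characters by their UTF-8 representation looks like with today's API: the in-memory width of a character never reaches the wire, only an agreed byte encoding does. The class name and the `encode`/`decode` helpers are hypothetical, and the sample characters are assumptions chosen to show different encoded lengths.

```java
import java.nio.charset.StandardCharsets;

class Utf8Exchange {
    // Hypothetical helpers: the external form is bytes in an agreed encoding.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ascii = "A";
        String sharpS = "\u00DF";                                // U+00DF
        String gClef = new String(Character.toChars(0x1D11E));   // U+1D11E

        // One character each, but a different number of bytes on the wire.
        System.out.println(encode(ascii).length);   // 1
        System.out.println(encode(sharpS).length);  // 2
        System.out.println(encode(gClef).length);   // 4

        // Decoding restores the original text regardless of internal char width.
        System.out.println(decode(encode(gClef)).equals(gClef));  // true
    }
}
```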

Chris Uppal · Feb 20, 2007

Arne Vajhøj wrote:

[me:]

For proper interoperability it has to be specified what it is.

We were considering what changes would have been necessary back at the
beginning of Java's history for this UTF-16 mess to have been avoided, or
avoidable. I agree that we are in fact stuck with what we've got.

Way back then there was no interoperability, since there was nothing to
interoperate /with/ ;-)

-- chris

Arne Vajhøj · Feb 21, 2007

Chris said:
Arne Vajhøj wrote:
[me:]

For proper interoperability it has to be specified what it is.

We were considering what changes would have been necessary back at the
beginning of Java's history for this UTF-16 mess to have been avoided, or
avoidable. I agree that we are in fact stuck with what we've got.

Way back then there was no interoperability, since there was nothing to
interoperate /with/ ;-)

It is not backwards compatibility I am talking about.

I am talking about interoperability between JVM's from
different vendors.

If you have a SUN and IBM Java exchanging binary data, then
it is very beneficial that the number of bits in a char is
well defined - not at least X bits as we all know it from C/C++.

Arne

Chris Uppal · Feb 21, 2007

Arne Vajhøj wrote:

[me:]

It is not backwards compatibility I am talking about.

I am talking about interoperability between JVM's from
different vendors.

If you have a SUN and IBM Java exchanging binary data, then
it is very beneficial that the number of bits in a char is
well defined - not at least X bits as we all know it from C/C++.

Ah, I had misunderstood you. Sorry.

But I don't think it would have been possible to allow for binary compatibility
(in that sense) /and/ have the Java spec worded in such a way that it didn't
make it impossible to fix up future problems.

At least, not without building extra (explicit) flexibility into each binary
spec.

-- chris

Mike Schilling · Feb 21, 2007

Chris Uppal said:
Arne Vajhøj wrote:

[me:]

It is not backwards compatibility I am talking about.

I am talking about interoperability between JVM's from
different vendors.

If you have a SUN and IBM Java exchanging binary data, then
it is very beneficial that the number of bits in a char is
well defined - not at least X bits as we all know it from C/C++.

Ah, I had misunderstood you. Sorry.

But I don't think it would have been possible to allow for binary compatibility
(in that sense) /and/ have the Java spec worded in such a way that it didn't
make it impossible to fix up future problems.

At least, not without building extra (explicit) flexibility into each binary
spec.

That is, Java wouldn't define a binary format for chars; chars would be
exchanged (as Strings are) via some encoding of chars into bytes.
