How to check variables for uniqueness ?

E

Ed

Lew skrev:

(Please do not embed TAB characters in newsgroup postings.)

You could use a HashMap if you wanted to know how many times each word occurred:
snip
- Lew

Indeed.

And in case anyone's interested, here are the times for HashMap. Looks
like Map is in the league of Set, and not the slow-moving List. (These
times are longer than the previous times because of current CPU
loading; relativity is the key.)

522393 duplicated words. Using java.util.HashSet, time = 789ms.
522393 duplicated words. Using java.util.TreeSet, time = 2168ms.
522393 duplicated words. Using Map , time = 1180ms.
522393 duplicated words. Using java.util.ArrayList, time = 183795ms.
522393 duplicated words. Using java.util.LinkedList, time = 274781ms.

Apologies to Patricia: I see I mis-attributed her post, yet again. And
Lew, I've now become fast friends now with Linux's expand(). Let's see
whether I purged those nasty TABs:

import java.util.*;
import java.io.*;

class Test {
private static String TEXT_BOOK_NAME = "war-and-peace.txt";

public static void main(String[] args) {
try {
String text = readText(); // Read text into RAM
countDuplicateWords(text, new HashSet());
countDuplicateWords(text, new TreeSet());
countDuplicateWordsMap(text);
countDuplicateWords(text, new ArrayList());
countDuplicateWords(text, new LinkedList());
} catch (Throwable t) {
System.out.println(t.toString());
}
}

private static String readText() throws Throwable {
BufferedReader reader =
new BufferedReader(new FileReader(TEXT_BOOK_NAME));
String line = null;
StringBuffer text = new StringBuffer();
while ((line = reader.readLine()) != null) {
text.append(line + " ");
}
return text.toString();
}

private static void countDuplicateWords(String text,
Collection listOfWords) {
int numDuplicatedWords = 0;
long startTime = System.currentTimeMillis();
for (StringTokenizer i = new StringTokenizer(text);
i.hasMoreElements();) {
String word = i.nextToken();
if (listOfWords.contains(word)) {
numDuplicatedWords++;
} else {
listOfWords.add(word);
}
}
long endTime = System.currentTimeMillis();
System.out.println(numDuplicatedWords + " duplicated words. " +
"Using " + listOfWords.getClass().getName() +
", time = " + (endTime - startTime) + "ms.");
}

private static void countDuplicateWordsMap(String text) {
int numDuplicatedWords = 0;
Map wordsToFrequency = new HashMap();
long startTime = System.currentTimeMillis();
for (StringTokenizer i = new StringTokenizer(text);
i.hasMoreElements();) {
String word = i.nextToken();
Integer frequency = (Integer)wordsToFrequency.get(word);
if (frequency == null) {
wordsToFrequency.put(word, new Integer(0));
} else {
int value = frequency.intValue();
wordsToFrequency.put(word, new Integer(value + 1));
numDuplicatedWords++;
}
}
long endTime = System.currentTimeMillis();
System.out.println(numDuplicatedWords + " duplicated words. " +
"Using Map " +
", time = " + (endTime - startTime) + "ms.");
}
}



..ed
 
L

Lew

Ed said:
And in case anyone's interested, here are the times for HashMap. Looks
like Map is in the league of Set, and not the slow-moving List. (These
times are longer than the previous times because of current CPU
loading; relativity is the key.)

522393 duplicated words. Using java.util.HashSet, time = 789ms.
522393 duplicated words. Using java.util.TreeSet, time = 2168ms.
522393 duplicated words. Using Map , time = 1180ms.
522393 duplicated words. Using java.util.ArrayList, time = 183795ms.
522393 duplicated words. Using java.util.LinkedList, time = 274781ms.

These times are extremely interesting.

I speculate that the greater part of the difference between HashMap and
HashSet would be the second loop through the Map. Note that though the Map was
slightly slower than the Set, it delivers more information. With the Set you
only knew how many words were duplicated; with the Map you can also figure out
which words were, and how many times each one occurred.

You could, for example, use the Map to deliver the words in order of
frequency, given the right comparator over the entry set.

- Lew
 
J

John Ersatznom

Lew said:
These times are extremely interesting.

I speculate that the greater part of the difference between HashMap and
HashSet would be the second loop through the Map. Note that though the
Map was slightly slower than the Set, it delivers more information. With
the Set you only knew how many words were duplicated; with the Map you
can also figure out which words were, and how many times each one occurred.

You could, for example, use the Map to deliver the words in order of
frequency, given the right comparator over the entry set.

A lot of the Map slowness is probably the churn of Integer objects
created. Using an int[1] as a "mutable Integer" would work far better
(although mutable objects in collections is normally bad, mutable values
in a map isn't generally a problem, so long as you don't have mutable keys).

On the subject of tabs, my copy of Thunderbird seems to be quietly
converting tabs into spaces, though I can't find the setting for it.
Posts apparently originally containing tabs (e.g. Ed's earlier) have
spaces when I view them, and my own posts written with tabs don't make
you complain. :) The curious thing is that incoming posts seem to have
tab->4 spaces and the editor shows tabs as 8 spaces, but they become 4
in the actual sent posting...and none of the options in Thunderbird say
anything about conversion of tabs at all, either to set their displayed
width or to actually change tabs to certain numbers of spaces. Hrm. The
"online help" doesn't open a help window, but rather hijacks my open
Firefox window, and the search there is useless on this topic too...
 
J

John Ersatznom

Oliver said:
You weren't intended to.

Then you're missing the point entirely. "COLOR" and "colour" differ only
by capitalization while "beissen" and "beißen" differ by spelling in a
manner similar to "color" vs. "colour". Alternate spellings of the same
word can't in general be idenfitied as identical by a computer -- not
without a trip through a spellchecking dictionary or the like, anyway. I
think you may be expecting too much of Java's humble string classes.
Perhaps Collator is smart enough for you?
 
A

Andrew Thompson

John Ersatznom wrote:
....
Then you're missing the point entirely. "COLOR" and "colour" differ only
by capitalization ..

As well as the 'u' in the second word. And from my
vague recollections of this thread (that I am not prepared
to review at this instant) - a misunderstanding between
the spelling observed, might actually explain this (sub)
thread..(?)

Andrew T.
 
E

Ehsan Khoddam mohammadi

I prefer to implement a BST for that, create a BST yourself, insert
vars to them , it uses
O(logn) to compare vars.

class BST {
Node root;
Node leftChild, rightChild;
int insert(String s){
return insert (new Node (String s),root);
}

int insert (Node node,Node node2insert){
if (node2insert==null){
node2insert=node;
return 1;
}
else {

if (node < node2insert){
return insert (node,node2insert.leftChild);

}
else if (node > node2insert){
return insert (node,node2insert.leftChild);
}
else return 0;
}
}



you insert all vars, which nodes that the insert return 0 is duplicated


}
 
O

Oliver Wong

John Ersatznom said:
Then you're missing the point entirely.

Must be, because I was under the impression I was making a point to you,
as opposed to the other way around. I thought you were curious as to how
manually doing case-insensitive conversions could fail, as opposed to using
the build in equalsIgnoreCase().


"COLOR" and "colour" differ only by capitalization while "beissen" and
"beißen" differ by spelling in a manner similar to "color" vs. "colour".

I disagree.
Alternate spellings of the same word can't in general be idenfitied as
identical by a computer -- not without a trip through a spellchecking
dictionary or the like, anyway. I think you may be expecting too much of
Java's humble string classes. Perhaps Collator is smart enough for you?

You should take the code I posted and put it in your favorite IDE, fix
the compile errors (apparently, it's toLowerCase, not toLowercase), and run
it. You might find the results enlightening. If those results surprise you,
add a few System.out.println(a) to see what's going on.

- Oliver
 
J

John Ersatznom

Andrew said:
John Ersatznom wrote:
...



As well as the 'u' in the second word.

That wasn't supposed to be there, though the later "color" vs "colour"
(all lower case) is correct. :p I trust my meaning is still easy to glean.
 
J

John Ersatznom

Oliver said:
Must be, because I was under the impression I was making a point to you,
as opposed to the other way around. I thought you were curious as to how
manually doing case-insensitive conversions could fail, as opposed to using
the build in equalsIgnoreCase().

Both will fail when you want words spelled differently to compare equal,
though Collator may have more smarts in that area.
I disagree.

On what basis? The typo I made? It was meant to say "COLOR" and "color"
differ only by capitalization while "beissen" and "beißen" differ by
spelling in a manner similar to "color" vs. "colour".

In fact the analogy goes so far as for the number of letters in the
latter two examples to differ by one in both cases, and for a two letter
region in one to correspond to a single letter at the same place in the
other in particular. And (presumably -- I don't know the German word(s))
they are in both cases variant spellings of a different word --
differing in more than just capitalization, but used interchangeably or
as regional variants rather than having distinct meanings.
You should take the code I posted and put it in your favorite IDE, fix
the compile errors (apparently, it's toLowerCase, not toLowercase), and run
it.

It would have been nice if Sun had been consistent about their own
capitalization. There's also Character.isWhitespace (in the same class!
Note lowercase s) and System.arraycopy (note lowercase c), at minimum.
:p Maybe they need to implement an isCamelCase method (note second
capital C)... :)

In any event, I suppose the real lesson here is that String (and
friends) get you primitive ordering and comparisons, perhaps somewhat
Anglocentric, and you need to use Collator and relatives for serious
language-and-locale-sensitive comparisons. I don't know the extent to
which even the latter will cope with variant spellings, mind you. There
is also a where-do-you-draw-the-line issue -- from case to slight
variations in the actual sequence of letters used on to more overt
differences, as between "huge" and "giant" -- when should those be
considered synonyms, and when different? -- and on until if you broaden
your requirements enough solving the NLP seems to be a required
component of any conforming implementation. :) Language has a fuzziness
in it in actual human usage that computers have trouble with. It's
curiously not unlike the problems that arose elsewhere here today with
float and double comparisons. You can't rely usefully on == for the most
part, and using Math.abs(x - y) < someThreshold gives an "equality" test
that's more meaninful in some ways but is not transitive any more.
Eventually linguistic equality loses transitivity too -- you can play
all kinds of games of picking close synonyms of the previous word to
grow a chain that can end in a fairly good approximation to an antonym
for your starting word, in most any language, using either phonemic
proximity or lexical proximity, and get different results with each besides.

The real upshot is simply "computers, at present, don't have the ability
to really model things in linguistics". But they know about abstract
sequences of discrete, wholly-distinct characters that happen to stand
for graphical squiggles meaningful to humans.

Play to their strengths -- the computers' *and* the humans'. :)
 
O

Oliver Wong

John Ersatznom said:
Both will fail when you want words spelled differently to compare equal,
though Collator may have more smarts in that area.

I don't know how you define "fail" or "not fail" in this context, but the
point that I'm trying to make is that the two methods do not give the same
results and are thus not equivalent. Try running the example I provided
earlier, or try this example:

{
System.out.println("beißen".equalsIgnoreCase("BEISSEN"));
System.out.println("beißen".toUpperCase().equals("BEISSEN"));
}
On what basis?

Replace the "beißen" by "colour" and "BEISSEN" by "COLOR", and you will
see get different results, thus showing that the difference between "COLOR"
and "colour" is not of the same nature as that between "beißen" and
"BEISSEN".

- Oliver
 
J

John Ersatznom

Oliver said:
I don't know how you define "fail" or "not fail" in this context, but the
point that I'm trying to make is that the two methods do not give the same
results and are thus not equivalent. Try running the example I provided
earlier, or try this example:

{
System.out.println("beißen".equalsIgnoreCase("BEISSEN"));
System.out.println("beißen".toUpperCase().equals("BEISSEN"));
}



Replace the "beißen" by "colour" and "BEISSEN" by "COLOR", and you will
see get different results, thus showing that the difference between "COLOR"
and "colour" is not of the same nature as that between "beißen" and
"BEISSEN".

This may show that it is "not of the same nature" as defined by certain
Java library functions, but I don't see how this is really meaningful to
people, except insofar as it "means" that the standard library has a bug
or at least a wart or misfeature of some kind. The "equalsIgnoreCase"
method should ignore case, but not spelling. It shouldn't consider
"color" equal to "colour" and it shouldn't consider "beißen" equal to
"beissen" either. Why? Because those pairs differ by spelling and not
just capitalization!
 
L

Lew

John said:
... The "equalsIgnoreCase"
method should ignore case, but not spelling. It shouldn't consider
... "beißen" equal to "beissen" either.
Why? Because those pairs differ by spelling and not
just capitalization!

That is how equalsIgnoreCase() works:

"beißen".equalsIgnoreCase("BEISSEN"): false

- Lew
 
J

John Ersatznom

Lew said:
That is how equalsIgnoreCase() works:

"beißen".equalsIgnoreCase("BEISSEN"): false

Well, then, either Wong is completely nuts, or we're using different JDK
versions (1.6 here), or (seems least likely) toUpperCase actually alters
the spelling of some words(!) rather than just changing a-z to A-Z
(likewise accented equivalents) while leaving the rest alone.
 
L

Lew

John said:
Well, then, either Wong is completely nuts,

The result agrees with Oliver's assertion.
or we're using different JDK
versions (1.6 here), or (seems least likely) toUpperCase actually alters
the spelling of some words(!) rather than just changing a-z to A-Z
(likewise accented equivalents) while leaving the rest alone.

AFAIK, toUpperCase() follows the socially-determined locale rules. What is the
upper case of "beißen" in German? (E.g., what would a German newspaper do?)

- Lew
 
C

Chris Uppal

John said:
Well, then, either Wong is completely nuts, or we're using different JDK
versions (1.6 here),

You mean you've tried this and found that your version gives different results
? I find that hard to believe unless its a side effect of attemting to use
non-ASCII characters in the input to javac. Try being explicit about using the
Unicode character (well, UTF16 value).

public class Test
{
public static void
main(String[] args)
{
System.out.println("bei\u00DFen -> " +
"bei\u00DFen".toUpperCase());
System.out.println("BEISSEN".equalsIgnoreCase("bei\u00DFen"));
System.out.println("BEISSEN".equals("bei\u00DFen".toUpperCase()));

// or equivalently, but using octal string escapes
System.out.println("bei\337en -> " + "bei\337en".toUpperCase());
System.out.println("BEISSEN".equalsIgnoreCase("bei\337en"));
System.out.println("BEISSEN".equals("bei\337en".toUpperCase()));
}
}


(Tested on 1.4.2, 1.5.0, and 1.6.0)

or (seems least likely) toUpperCase actually alters
the spelling of some words(!) rather than just changing a-z to A-Z
(likewise accented equivalents) while leaving the rest alone.

That sounds as if you /haven't/ actually tried it. (Nor read the documentation
for String.toUpperCase() which expounds on this subject).

String.toUpperCase() does /not/ change the spelling of words (how could it, it
doesn't know anything about words ?). What it does follow are the correct
(insofar as the Unicode spec is correct) rules for mapping lowercase to
uppercase. It produces the /same/ word with the /same/ spelling[*], but
(naturally) a different representation. In this case the number of visually
separable glyphs changes because the U+00DF character (LATIN SMALL LETTER SHARP
S) is a ligature of two logical characters, long s and short s (U+017F and
U+0073), there is no upper case ligature for that combination (compare fi and
FI in English typography), so the correct uppercase version of those (logical)
characters is the sequence SS. (At least that's the theory the Uncicode people
seem to be operating on -- they know more about it than me so I'm willing to
believe them).

It is simply erroneous to expect String.toUpperCase() to map characters
one-to-one in the way that English case mapping works. I can't, it isn't
supposed to, and it doesn't...

String.equalsIgnoreCase(), on the other hand, is badly broken in that it does
/not/ follow those rules. Or, since it's behaviour is clearly documented,
perhaps "broken" is too strong a term -- "badly misleading" might be preferred.

-- chris

[*] Arguably the concept "same spelling" is flawed in the context of Unicode
case mapping.
 
J

John W. Kennedy

John said:
Well, then, either Wong is completely nuts, or we're using different JDK
versions (1.6 here), or (seems least likely) toUpperCase actually alters
the spelling of some words(!) rather than just changing a-z to A-Z
(likewise accented equivalents) while leaving the rest alone.

No, but toUpperCase translates "ß" to "SS" because that is the only
uppercase that "ß" has. It doesn't work the other way because "SS" can
be either "ss" or "ß", depending on the word. This is documented behavior.
 
P

Philipp

John said:
No, but toUpperCase translates "ß" to "SS" because that is the only
uppercase that "ß" has. It doesn't work the other way because "SS" can
be either "ss" or "ß", depending on the word. This is documented behavior.

FYI ß is just a modern callygraphy (ie. a ligature) of the ancient
german "ss". (the first s being a "long s", "Å¿" ). It is one of the
rare cases where a ligature has actually been accepted as a full letter.

See http://en.wikipedia.org/wiki/ß for more info

Phil
 
J

John Ersatznom

Lew said:
AFAIK, toUpperCase() follows the socially-determined locale rules. What
is the upper case of "beißen" in German? (E.g., what would a German
newspaper do?)

Well, it certainly shouldn't actually use a different spelling. Would an
American newspaper use "color" in article text but "COLOUR" in headlines? :)

Regardless, even if toUpperCase makes changes other than to case, even
altering the number of symbols, what does toLowerCase do, and why isn't
equalsIgnoreCase consistent with them? It should consider any two
strings equal whose toUpperCase()s are equal as decided by equals() or
whose toLowerCase()s are equal likewise, and extend this transitively as
necessary. Otherwise, equalsIgnoreCase is really equalsIgnoreFoo and
toLowerCase and toUpperCase are toLowerBar and toUpperBar -- the word
"case" is not talking about the same thing in the one as it is in the
other, and in at least one it isn't even talking about "case" at all, as
that term is commonly understood. The methods should then be renamed to
make it clear what they are really talking about -- at least the ones
that aren't really talking about "case". In this case (no pun intended),
that set apparently includes toUpperCase, which seems to make other
transformations than capital letter substitution, and should maybe be
named toAllCapsTitle, with a more logically-behaving toUpperCase also
made available.
 
J

John Ersatznom

Chris said:
String.toUpperCase() does /not/ change the spelling of words (how could it, it
doesn't know anything about words ?). What it does follow are the correct
(insofar as the Unicode spec is correct) rules for mapping lowercase to
uppercase. It produces the /same/ word with the /same/ spelling[*], but
(naturally) a different representation. In this case the number of visually
separable glyphs changes because the U+00DF character (LATIN SMALL LETTER SHARP
S) is a ligature of two logical characters, long s and short s (U+017F and
U+0073), there is no upper case ligature for that combination (compare fi and
FI in English typography), so the correct uppercase version of those (logical)
characters is the sequence SS. (At least that's the theory the Uncicode people
seem to be operating on -- they know more about it than me so I'm willing to
believe them).

This seems to be excessively technical when the matter under discussion
is simply capitalizing strings. In any event, equalsIgnoreCase should
collapse these "ligatures" of yours as well. Also, I don't notice "fi"
and "FI" producing strange behavior myself -- even if the letters are
often run together so the 'i' hasn't got a separate dot *when typeset*,
this doesn't affect the representation of a string in a computer, only
the visually displayed output (and then usually only when serious
typesetting software is used). Likewise, it makes sense to represent any
other logical sequence of characters in a sensible way under the hood,
regardless of any rendering fanciness that is done when presenting them
to the user.
It is simply erroneous to expect String.toUpperCase() to map characters
one-to-one in the way that English case mapping works. I can't, it isn't
supposed to, and it doesn't...

No, it is not erroneous to expect a method to do exactly and only what
its name implies. It is erroneous, of course, to give a method a name
that is misleading. If toUpperCase needs a lengthy documentation block
explaining why its behavior is surprising, then it's a sure bet that it
should not have been named that, since it's apparently really
toUpperCaseAndDoesSomeExtraStuffToo.
String.equalsIgnoreCase(), on the other hand, is badly broken in that it does
/not/ follow those rules.

So you at least agree with me that it should be consistent with
toUpperCase (and toLowerCase) -- all strings should have a single
canonical toUpperCase, a single canonical toLowerCase, both should
define equivalence classes on the mixed-case input strings, these should
be the SAME equivalence class, and equalsIgnoreCase should implement and
embody the corresponding equivalence relation.
Or, since it's behaviour is clearly documented,
perhaps "broken" is too strong a term -- "badly misleading" might be preferred.

It sounds like toUpperCase has a "badly misleading" name since it
(supposedly) does transformations that go well beyond what is normally
meant by everyday blokes by "to upper case", and the method name is
supposed to be a reasonably meaningful capsule summary for everyday
blokes of what the method does. If a method is supposed to do behavior
that's surprising for any English speaker but not for a German speaker,
maybe it should have a German rather than an English name? :) If it's
supposed to do locale-dependent stuff, then it should have a version
that accepts a Locale object. The version that doesn't shouldn't
surprise English speakers; the version that does shouldn't surprise
anyone familiar with its locale-specific behavior for the locale
actually used. Having locale-dependent behavior invoked randomly without
explicit use of Locale objects, and which furthermore doesn't use the
system locale, is by itself a sign of a questionable design as well as a
sure source of bugs and problems.

I've even encountered somewhere a notion that aString.length() is not
even accurate in current Java versions if a string contains obscure
characters. It suggests aString.<something using the obscure term "code
point", apparently just Unicode-geek for "character"> as its
replacement, while of course there's a ton of legacy code using
length(). I don't suppose it occurred to them that the new fancy-whosit
should have been a replacement length() implementation instead of some
new name that doesn't suggest anything to do with the length of a string
to someone who doesn't care about all the Unicode bells and whistles and
just wants to process strings while remaining agnostic about what they
are ultimately used for or contain? Those users will gravitate to
length() (plus all that legacy code), not caring about the actual
storage length of the internal representation but the length in
characters of their data as a general rule. So there should be a
length() method that returns the true length of the string, and if
necessary a getSize() method that returns the representation's size in
bytes or whatever in case someone needs such low level data. (If they
persist strings as UTF-8 in a text format file that is parsed, or use
serialization, then they don't.)
[*] Arguably the concept "same spelling" is flawed in the context of Unicode
case mapping.

A concept like "same spelling" can't be flawed. It's generally accepted
that "color" and "colour" are the same word, but have different
spellings, right? While "two" and "too" are different words spelled
differently that sound the same, "tomato" and "tomato" are the same word
spelled the same but pronounced differently, and "ant" (the bug) and
"ant" (the build tool) are different words both spelled and pronounced
the same.
 
J

John Ersatznom

John said:
No, but toUpperCase translates "ß" to "SS" because that is the only
uppercase that "ß" has. It doesn't work the other way because "SS" can
be either "ss" or "ß", depending on the word. This is documented behavior.

Just because a behavior is documented doesn't mean it isn't broken. Either:

*equalsIgnoreCase should consider all these spelling variants equal,
*toUpperCase should leave ß alone, e.g. Beißen -> BEIßEN while a
toAllCapsTitle or similar makes it BEISSEN, or
*the behavior should be based on the system default locale, with
versions of all the methods that accept a Locale and use it in place of
the system default, consistency among equalsIgnoreCase, toUpperCase, and
toLowerCase for a given locale, and unsurprising behavior to people from
that locale.

Independently of which, a bunch of these methods and others around the
system need renaming for case consistency in a very different way -- a
bunch of Java methods don't follow the usual camelCase convention. A
partial list seems to include

String.toUppercase (should be toUpperCase)
String.toLowercase (analogously; and equalsIgnoreCase is already as it
should be!)
Character.isWhitespace (isWhiteSpace)
System.arraycopy (arrayCopy)

That's at least four where the start of a non-initial word isn't
capitalized (case, case, space, and copy, respectively).

These are an endless source of fun compile errors and debugging sessions
and really they shouldn't be. Even in eclipse, they manage to annoy
without being so extreme, and also hinder documentation search and
completion (put System.arrayc, try to autocomplete it, and it balks; try
to view the documentation to see if there's something you've missed and
it also balks; eclipse could definitely use case-insensitive completion
when there's no ambiguity as a result of case-insensitivity).

Another fun one:

HttpURLConnection

Really, it should be HTTPURLConnection or HttpUrlConnection, probably
the former. HTTPConnection would be best, since it's certainly a network
(i.e. URL) connection. :p
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,065
Latest member
OrderGreenAcreCBD

Latest Threads

Top