Get list of unique words in a string

Thomas 'PointedEars' Lahn

Dr J R Stockton wrote:
JavaScript, including its internals, was written by and designed to be
read by Americans. Therefore, all that is needed is to "put in" the
Object not the "word" itself but the concatenation of the "word", an
underline, and some word which is exceedingly offensive in their
language, so that it can never have been used within JavaScript itself.

This has nothing to do with American or not. ECMAScript implementations may
be written and used by anyone regardless of nationality (even you).

Go away.
You can enter the word by String.fromCharCode() so that it is not
directly visible.

You miss the point. It doesn't matter how the property name is constructed.
Actually, using just "\u0175" should suffice; and the following seems to
be a legal way of assigning a zero to X : X = (\u0175 = 0, 33, \u0175).

Of course it is. Read the ECMAScript Language Specification, the ISO
version if you must. Or is this not British enough?
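
A minimal sketch of the construct under discussion, for anyone who wants to try it: a Unicode escape sequence is allowed in an identifier and names the same variable as the literal character, and the comma operator yields its last operand. The `var` declaration is an addition so that the snippet also runs in strict mode; the quoted one-liner relies on an implicit global instead.

```javascript
// \u0175 is the escape for "ŵ" (U+0175), a letter, so it may start an identifier.
var \u0175;

// The comma operator evaluates left to right and yields the last operand:
// assign 0, discard 33, then read the variable back.
var X = (\u0175 = 0, 33, \u0175);

// The escaped and the literal spelling refer to the same variable.
var sameVariable = (ŵ === \u0175);
```

After this runs, `X` holds 0 and `sameVariable` is true.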


PointedEars
 
Thomas 'PointedEars' Lahn

kangax said:
Thomas said:
Thomas said:
Thomas 'PointedEars' Lahn wrote:
kangax wrote:
function uniquify(arr) {
  var obj = {},
      MARKED = {},
      // random prefix so that no word can address a built-in property
      token = '_' + (Math.random()+'').slice(2);
  for (var i = 0; i < arr.length;) {
    var string = arr[i];
    if (obj[token + string] === MARKED) {
      // duplicate: overwrite it with the last element and shrink the array
      arr[i] = arr[arr.length - 1];
      arr.length -= 1;
    } else {
      obj[token + string] = MARKED;
      i++;
    }
  }
  return arr;
}

This makes it possible to work reliably with names such as `__proto__`:

var arr = ['__proto__', 'a', 'a'];
uniquify(arr); // ["__proto__", "a"]
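
To illustrate why the keys are prefixed at all, here is a minimal sketch (not kangax's code) that collects the unique words of a string. The name `uniqueWords`, the whitespace-based definition of a word, and the trailing underscore in the token are assumptions made for this example only.

```javascript
function uniqueWords(text) {
  var seen = {};
  // Random prefix, as in uniquify() above, so that a word such as
  // "constructor" or "__proto__" never addresses an inherited property.
  var token = '_' + String(Math.random()).slice(2) + '_';
  var words = text.split(/\s+/);
  var result = [];

  for (var i = 0; i < words.length; i++) {
    var word = words[i];
    if (word === '') continue;      // produced by leading/trailing whitespace
    if (seen[token + word] !== true) {
      seen[token + word] = true;
      result.push(word);
    }
  }
  return result;
}

// Without the prefix, ({})['constructor'] is truthy before any word is stored.
var words = uniqueWords('the constructor of the __proto__ constructor');
// words: ["the", "constructor", "of", "__proto__"]
```

Unlike uniquify(), this sketch leaves its input untouched and keeps the words in order of first occurrence.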
It is not that easy :) Consider this:

var arr = ['_proto__', 'a', 'a'];
Ahh, Math.random() (String(x) is more efficient than x+'', BTW.)


Strange. I always thought the opposite.


You find a more detailed explanation about it from me in the archives.
I have not done benchmark tests, though.
Did you mean `x.substring(2)` ?

No, but what I said would apply there as well. As for compatibility,
however maybe irrelevant these days, the ECMAScript Support Matrix lists
String.prototype.slice() as available from JavaScript 1.0, JScript *3.0*,
and ECMAScript Ed. 3 (currently online: "?" for unknown; other
implementations have not yet been under closer research for that feature).
As for efficiency, you could compare the specified algorithms and run
benchmark tests to confirm or disprove what I said.
On the other hand, slicing is not really
needed here in the first place; it doesn't matter if a "foo" value
becomes - "0.18932621243836711foo" - or a - "18932621243836711foo".
ACK


I'd love to take a look at your implementation.

Me too. I am still working on it, because it is more general than what is
needed in this particular example (it is supposed to be something like Map
in Java).


PointedEars
 
Thomas 'PointedEars' Lahn

kangax said:
Thomas said:
kangax said:
Thomas 'PointedEars' Lahn wrote: [...]
And x.charAt(2) is probably more efficient and compatible than x.slice(2).
Did you mean `x.substring(2)` ?
No, but what I said would apply there as well. As for compatibility,

`String(Math.random()).charAt(2)` always returns a string of a length of
*1 character*. `String(Math.random()).slice(2)`, on the other hand, a
length of 16-17 characters. I don't see how `charAt` helps here (or can
be used instead). If anything, it only defeats the purpose of augmenting
a key before marking it seen in an `obj` object.

Am I missing something?
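
The difference is easy to check with a fixed value in place of `Math.random()`. The number is the one used as an example earlier in the thread; the variable names are illustrative only.

```javascript
var s = String(0.18932621243836711);   // a string that starts with "0."

var oneChar = s.charAt(2);     // only the single character at index 2
var tail = s.slice(2);         // everything from index 2 to the end
var tail2 = s.substring(2);    // same as slice(2) for a non-negative index
```

Here `oneChar` is "1", while `tail` and `tail2` both hold all of the digits after the decimal point.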

Yes, "there" applies to the quoted, `x.substring(2)'.
Oh yes. Good thing we have JSLitmus these days!

We'll see. One really doesn't need a full-blown library like JSLitmus for
that (a few lines of script suffice), and I rather trust my own code than
some library Jorge picked.
Ok. I've seen some implementations floating around on the web (e.g.
<http://www.timdown.co.uk/jshashtable/>), but haven't had a chance to
look closer at them.

I'm not going for a Hashtable implementation yet (maybe later), but this
looks not bad and gives me some ideas. Thanks.


PointedEars
 
Dr J R Stockton

In comp.lang.javascript message <[email protected]
Bother.

According to a few of the regulars over there, not even the Japanese
agree on what is a 'word.' Is it a single glyph? Or is it enough glyphs
to form a single 'idea', similar to the contrast here:

A word is anything that is or would be printed in contiguous bold in
Chambers Dictionary (or in the corresponding notation in the OED),
except where one or more separators are included and the separated parts
are words.

There "would be" additionally accommodates all languages other than the
Queen's English and common lower-class words.

Nothing short of well-informed and educated human judgement can really
decide on all of the borderline cases : how many words in "VI CARSON"? -
2, undoubtedly - and in "GEORGE VI"? How about "QE2", "Queen Elizabeth
2" and "Queen Elizabeth II"? In "James I & VI"?
 
Lasse Reichstein Nielsen

Thomas 'PointedEars' Lahn said:
You find a more detailed explanation about it from me in the archives.
I have not done benchmark tests, though.

A quick benchmark shows that x+'' is actually more efficient than
String(x) in all my browsers, except Opera where there is no noticeable
difference.

-- quick benchmark --
var x = 42;
var t0 = new Date();
for (var i = 0; i < 1000000; i++) {
  var y = String(x);   // conversion via function call
}
var t1 = new Date();
for (var i = 0; i < 1000000; i++) {
  var y = x + '';      // conversion via concatenation
}
var t2 = new Date();
[t1-t0,t2-t1]          // elapsed milliseconds: [String(x), x + '']
-- end --

The largest difference was in Firefox, which reported "1153,516".

An explanation is probably that calling String is a function call,
which is hard to optimize statically, compared to an operator - in
particular the plus operator with one operand being a statically known
string - which does that same thing without the overhead of the
function call.

In general, operators are faster than function calls, which is also
why the prefix "+" is faster than using the Number function.

Unless performance is a problem, I'd go for the more readable version,
though (String and Number in these cases).
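
The same measurement can be wrapped in a small helper so that further variants are easy to add. This is a sketch only: `timeIt` is a hypothetical name, the iteration count is arbitrary, and the extra function call around each variant adds overhead of its own, so the numbers are a rough indication at best and will differ between engines and runs.

```javascript
// Returns the elapsed milliseconds for `iterations` calls of `fn`.
function timeIt(fn, iterations) {
  var start = new Date().getTime();
  for (var i = 0; i < iterations; i++) {
    fn(i);
  }
  return new Date().getTime() - start;
}

var n = 100000;
var timings = {
  stringCall: timeIt(function (x) { return String(x); }, n),
  concat:     timeIt(function (x) { return x + ''; }, n),
  numberCall: timeIt(function () { return Number('42'); }, n),
  unaryPlus:  timeIt(function () { return +'42'; }, n)
};
```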

/L
 
Dr J R Stockton

In comp.lang.javascript message <[email protected]>, Sat, 18
Apr 2009 20:12:53 said:
You find a more detailed explanation about it from me in the archives.
I have not done benchmark tests, though.

NOTE : That exemplifies the Lahn attitude : that the published standards
(we can assume that he does not rely on books) are all that matters, and
the real world is not significant. He does not need to test his
assertions.

It also shows carelessness and a reluctance to follow his own advice :
he so often refers others to the archives, and they contain, somewhere,
previous discussion about Number-to-String (possibly in respect of StrU,
if a clue is needed).

There must be some reason why Standard 11.9.3 NOTE includes :
String comparison can be forced by: "" + a == "" + b
and not String(a) == String(b).

Perhaps he forgot that line.



BTW, that is followed by
Numeric comparison can be forced by: a - 0 == b - 0
about which ECMA will soon hear (well, read) from me.
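
For readers who have not met that NOTE, a short sketch of what the two forcing idioms change. The values are chosen for illustration only.

```javascript
var a = 1, b = '1.0';

var loose    = (a == b);            // true: the string is converted to a number
var asString = ('' + a == '' + b);  // false: "1" is compared with "1.0"

var c = '1', d = '1.0';

var looseStr = (c == d);            // false: both are strings, compared as strings
var asNumber = (c - 0 == d - 0);    // true: 1 is compared with 1
```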
 
Dr J R Stockton

In comp.lang.javascript message <[email protected]>, Sat,
18 Apr 2009 19:51:58 said:
Dr J R Stockton wrote:

This has nothing to do with American or not. ECMAScript implementations may
be written and used by anyone regardless of nationality (even you).

You are being silly. Engage all available intellect, and read my
article again. JavaScript was written by Americans, with the intent
that it be used by Americans. Their intent with respect to the rest of
the world is not germane (had they cared, they would have made provision
for DD/MM/YYYY). They would certainly not have chosen to include,
without extremely good reason, a word that they considered to be
offensive or religious (or a character not on their keyboards) in the
implementation or definition of the language. Neither, of course, would
anyone else have knowingly done so.

They might, of course, use a word offensive in an unknown language; ISTR
that for that reason one Rolls-Royce model sold less well in Central
Europe than they might have hoped.

You miss the point. It doesn't matter how the property name is constructed.

No; you are missing it. A normal human author wanting to insert an
offensive word in the run-time material might very well prefer not to
type it /en clair/ in the source. It does not matter to the software;
but it might well matter to normal people.

Of course it is. Read the ECMAScript Language Specification, the ISO
version if you must. Or is this not British enough?

That is what I was reading when I thought of using such a character -
ECMA FD 5, in fact. However, when reading such a specification, I do
not invariably assume that I have understood it fully.

And, of course, in the real world one must not presume that all relevant
software is fully compliant with even a long-established standard such
as ECMA 3. As LRHN has observed, it fails in Opera.
 
Thomas 'PointedEars' Lahn

Dr J R Stockton said:
Thomas 'PointedEars' Lahn posted:

You are being silly. Engage all available intellect, and read my
article again. JavaScript was written by Americans,

JavaScript was written by Brendan Eich, then a citizen of the United States
of America. That fact doesn't matter, though.
with the intent that it be used by Americans.

Intentional fallacy.

That you are utterly wrong here is merely proved by the fact that Eich wrote
JavaScript for Netscape Navigator 1.0 (released 1994-10 CE) of which history
teaches us that it was a World Wide Web Browser that was already distributed
world-wide, at least in the U.S. of A. and Central Europe, where, at CERN in
Geneva, Tim Berners-Lee started the World Wide Web project in 1991.

And shortly after, a successful attempt was made to standardize JavaScript
1.1 and its emerging concurrent copycat Microsoft JScript with the
then-still *European* Computer Manufacturers Association (ECMA). So much
for "American".

Unsurprisingly, your apparently incurable xenophobia blinds you to reality
again.
Their intent with respect to the rest of the world is not germane (had
they cared, they would have made provision for DD/MM/YYYY).
[snip fallacies]

Carelessness is not an indication for intent per se.
No; you are missing it. A normal human author wanting to insert an
offensive word in the run-time material might very well prefer not to
type it /en clair/ in the source. It does not matter to the software;
but it might well matter to normal people.

The problem of the OP was to determine "unique words in a string". Whatever
their definition of "word", it does not matter how the property name to mark
the occurrence of the word is constructed; what matters is that the name of
that property is not the name of a built-in property. And

,-[ECMAScript Ed. 1 to 3, Conformance section]
|
| [...]
| A conforming implementation of ECMAScript is permitted to provide
| additional types, values, objects, properties, and functions beyond
| those described in this specification. In particular, a conforming
| implementation of ECMAScript is permitted to provide properties not
| described in this specification, and values for those properties,
| for objects that are described in this specification.

So, for finding a reliable algorithm which is what this thread is all about,
it matters not who the original author of JavaScript was, what his possible
intentions were, who the original authors of ECMAScript were, and what their
possible intentions were. Not at all.
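
In practice this means the set of names already present on a plain object cannot be known in advance. A minimal sketch that reports which candidate words collide in the engine running it; `collidingNames` is a hypothetical helper, not part of any code posted in this thread.

```javascript
function collidingNames(words) {
  var probe = {};
  var hits = [];
  for (var i = 0; i < words.length; i++) {
    // `in` also sees inherited properties, whether specified or not.
    if (words[i] in probe) {
      hits.push(words[i]);
    }
  }
  return hits;
}

var hits = collidingNames(['toString', 'constructor', '__proto__', 'word']);
// "toString" and "constructor" are specified and always reported;
// whether "__proto__" is reported depends on the implementation.
```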
That is what I was reading when I thought of using such a character -
ECMA FD 5, in fact.

I don't see the relevance of a draft document here anyway.
However, when reading such a specification, I do
not invariably assume that I have understood it fully.

And, of course, in the real world one must not presume that all relevant
software is fully compliant with even a long-established standard such
as ECMA 3. As LRHN has observed, it fails in Opera.

So what? We would have discovered yet another bug in Opera's ECMAScript
implementation. A bug that should be fixed and, given the history of Opera
bugs, is likely to be fixed.


PointedEars
 
Thomas 'PointedEars' Lahn

Dr J R Stockton said:
Thomas 'PointedEars' Lahn posted:

NOTE : That exemplifies the Lahn attitude : that the published standards
(we can assume that he does not rely on books) are all that matters, and
the real world is not significant. He does not need to test his
assertions.

To anyone who can use their brain (so not you), my explicit mentioning that
I have not done benchmark tests (yet) means the exact opposite of what you
are implying, namely that I am well aware that standards do not always
describe reality. FOAD.


PointedEars
 
Tim Down

I'm not going for a Hashtable implementation yet (maybe later), but this
looks not bad and gives me some ideas.  Thanks.

PointedEars


Likewise this thread (particularly the double hashing discussion) has
given me some ideas, so thank you for that.

Tim
 
Dr J R Stockton

In comp.lang.javascript message <[email protected]>, Sun,
19 Apr 2009 20:49:56 said:
To anyone who can use their brain (so not you), my explicit mentioning that
I have not done benchmark tests (yet) means the exact opposite of what you
are implying, namely that I am well aware that standards do not always
describe reality. FOAD.


You had earlier written, in comp.lang.javascript message <49E98D24.70205
(e-mail address removed)>, Sat, 18 Apr 2009 10:19:48 :
... (String(x) is more efficient than x+'', BTW.)

Now it is sufficiently well established that the above is a definite
terminological inexactitude.

You should now see the folly of making a definite assertion which you
have not yourself tested, especially when the archives will reveal that
others have done the tests and got the opposite result.
 
Dr J R Stockton

In comp.lang.javascript message <[email protected]>, Sun,
19 Apr 2009 20:41:15 said:
JavaScript was written by Brendan Eich, then a citizen of the United States
of America. That fact doesn't matter, though.


Intentional fallacy.

That you are utterly wrong here is merely proved by the fact that Eich wrote
JavaScript for Netscape Navigator 1.0 (released 1994-10 CE) of which history
teaches us that it was a World Wide Web Browser that was already distributed
world-wide,

Correct. America is part of the world. I wrote "with the intent that it
be used by Americans", not "with the intent that it be used only by
Americans". That something would annoy their friends and colleagues would
provide a sufficient incentive to avoid doing it; the corresponding effect
on others would be superfluous.

at least in the U.S. of A. and Central Europe, where, at CERN in
Geneva, Tim Berners-Lee started the World Wide Web project in 1991.

And shortly after, a successful attempt was made to standardize JavaScript
1.1 and its emerging concurrent copycat Microsoft JScript with the
then-still *European* Computer Manufacturers Association (ECMA). So much
for "American".

Unsurprisingly, your apparently incurable xenophobia blinds you to reality
again.
Their intent with respect to the rest of the world is not germane (had
they cared, they would have made provision for DD/MM/YYYY).
[snip fallacies]

Carelessness is not an indication for intent per se.

You may prefer to attribute what they did to relevant stupidity or
ignorance; either would prevent them caring.

The problem of the OP was to determine "unique words in a string". Whatever
their definition of "word", it does not matter how the property name to mark
the occurrence of the word is constructed; what matters is that the name of
that property is not the name of a built-in property.

Therefore, it is necessary either to check that the word is not the name
of a built-in property and modify it if so; or to modify all those words
so that they cannot be such. Adding a word or character that will
certainly not occur in such names (even if permitted by the standard)
will do that in all fully standard-compliant browsers. Unicode provides
a more than adequate choice.


I don't see the relevance of a draft document here anyway.

FACT : it is what I was reading at the time. Having actually seen it
there, it's really very easy to look into the corresponding part of
ISO/IEC 16262; it has the same section number.
 
Thomas 'PointedEars' Lahn

Dr J R Stockton said:
Thomas 'PointedEars' Lahn posted:

Therefore, it is necessary either to check that the word is not the name
of a built-in property and modify it if so; or to modify all those words
so that they cannot be such. Adding a word or character that will
certainly not occur in such names (even if permitted by the standard)
will do that in all fully standard-compliant browsers. Unicode provides
a more than adequate choice.

There is no certainty where arbitrariness is specified as conforming.


PointedEars
 
Thomas 'PointedEars' Lahn

kangax said:
Is jsx.object available publicly? I see that Map is using its `isMethod`
only (?) but I would be interested to look at it overall.

Yes, consider the `script' elements. (It's a draft as well. `jsx.object'
is going to be the additional "namespace" provided by the next version of
object.js to avoid in-library incompatibilities with foreign scripts that
may declare/define the same identifiers. `jsx' is going to be the
additional "namespace" for all libraries of PointedEars' JavaScript
Extensions (as I have come to call it) for the same reason. Maybe I will
eventually abandon the global "namespace", but it is going to be supported
for a while longer for compatibility.)
A couple of questions regarding implementation:

1) Why do you declare functions in Map which don't use any of map
instance private variables? _hasOwnProperty, _maxAliasLength, _Value,
_Value.isInstance and few others could all be taken out of constructor
into the enclosing "wrapping" scope. Declaring them in Map seems
unnecessary and inefficient. Why waste time and memory redeclaring same
function objects over and over again?

_maxAliasLength is not a function but a "private property", so (AFAIK) it
must be declared in the constructor, accessible only through "public"
methods. _hasOwnProperty() accesses the "private" _items. And _Value()
(and _Value.isInstance) are defined within the constructor so that they are
unique for each Map object.
var Map = (function(){

  ...
  var _hasOwnProperty = (function() {
    return (jsx.object.isMethod(_items, "hasOwnProperty"))
      ? function(o, p) {
          return o.hasOwnProperty(p);
        }
      : function(o, p) {
          return typeof o[p] != "undefined";
        };
  })();

  ...
  function Map(){}

  ...
  return Map;

})();

Or am I missing something?

Yes, I think so.
2) Don't you think that _hasOwnProperty fallback - typeof o[p] !=
"undefined" - is a bit weak? Shouldn't you (at least) be comparing
property value to the value of the same named property of object's
`constructor.prototype`?

No, that would be the opposite of an equivalent to
Object.prototype.hasOwnProperty(). Please observe
that _hasOwnProperty() is called in _getSafeKey()
in a specific way.
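
For reference, a small sketch of where the two tests from the quoted code give different answers when called on an arbitrary object. The function names `byTypeof` and `byMethod` are made up for this comparison; whether the difference matters depends on how the check is actually called, which is the point made above.

```javascript
var o = { own: 1, empty: undefined };

function byTypeof(obj, p) { return typeof obj[p] != 'undefined'; }
function byMethod(obj, p) { return obj.hasOwnProperty(p); }

var results = {
  own:       [byTypeof(o, 'own'),      byMethod(o, 'own')],      // [true, true]
  inherited: [byTypeof(o, 'toString'), byMethod(o, 'toString')], // [true, false]
  undefVal:  [byTypeof(o, 'empty'),    byMethod(o, 'empty')],    // [false, true]
  missing:   [byTypeof(o, 'nope'),     byMethod(o, 'nope')]      // [false, false]
};
```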
3) I find it convenient to decouple unit tests as much as possible. Your
tests seem to depend on each other more than needed, so changing one
will affect another. Instead, why not create "clean" map in test
runner's "setup" method (I assume JSUnit should have such facility).

Good idea, will do.


PointedEars
 
Thomas 'PointedEars' Lahn

Thomas said:
kangax said:
Thomas said:
You can find a first draft here:

<http://pointedears.de/scripts/test/map>
1) Why do you declare functions in Map which don't use any of map
instance private variables? [...] _Value, _Value.isInstance and few
others could all be taken out of constructor into the enclosing
"wrapping" scope. Declaring them in Map seems unnecessary and
inefficient. Why waste time and memory redeclaring same function
objects over and over again?

[...] And _Value() (and _Value.isInstance) are defined within
the constructor so that they are unique for each Map object.

The more I think about it, they do not need to be. In fact, all Maps SHOULD
share the _Value "type", so I will use the pattern below as you suggested
which will indeed save memory as more Maps are created. Good catch, thank you.
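
The effect of moving such a helper out of the constructor can be seen in a reduced sketch. `PerInstanceMap`, `SharedMap`, `wrap` and `isWrapped` are hypothetical names for illustration and do not appear in the draft.

```javascript
// Variant 1: the helper type is created inside the constructor, once per instance.
function PerInstanceMap() {
  function Value(v) { this.value = v; }
  this.wrap = function (v) { return new Value(v); };
  this.isWrapped = function (v) { return !!v && v.constructor === Value; };
}

// Variant 2: the helper type is created once in the enclosing scope and shared.
var SharedMap = (function () {
  function Value(v) { this.value = v; }

  function SharedMap() {}
  SharedMap.prototype.wrap = function (v) { return new Value(v); };
  SharedMap.prototype.isWrapped = function (v) {
    return !!v && v.constructor === Value;
  };
  return SharedMap;
})();

var a = new PerInstanceMap(), b = new PerInstanceMap();
var crossPerInstance = b.isWrapped(a.wrap(1));  // false: each map has its own Value

var c = new SharedMap(), d = new SharedMap();
var crossShared = d.isWrapped(c.wrap(1));       // true: one Value for all maps
```

Besides the memory saved, the shared variant lets one map recognise values wrapped by another.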

What "others" do you mean, and how do you suggest to take them out of the
constructor?

As for the "private properties" accessible only through "public" getters and
setters, I do not see a better alternative short of foregoing information
hiding altogether.

/**
 * A value in the map, to distinguish it from built-in types
 *
 * @param v Value to be stored
 * @private
 */
function _Value(v) {

  /**
   * Stored value
   */
  this.value = v;
}

_Value.isInstance = function(v) {
  return !!v && v.constructor === this;
};
[...]
...
function Map(){}

...
return Map;

})();
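
A short usage sketch of the two definitions above, repeated here without the surrounding Map code so that it runs on its own. The variable names are illustrative only.

```javascript
function _Value(v) {
  this.value = v;
}

_Value.isInstance = function (v) {
  // `this` is _Value when the function is called as _Value.isInstance(...)
  return !!v && v.constructor === this;
};

var wrapped   = _Value.isInstance(new _Value(42));         // true
var plain     = _Value.isInstance({ value: 42 });          // false: constructor is Object
var nothing   = _Value.isInstance(null);                   // false: !!null stops the test
var storedUndefined = _Value.isInstance(new _Value(undefined)); // true
```

The last line shows what the wrapper buys: a stored `undefined` remains distinguishable from a key that was never set.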


Regards,

PointedEars
 
Thomas 'PointedEars' Lahn

Conrad said:
var Map = (function(){

  var _hasOwnProperty = (function() {
    return (jsx.object.isMethod({ }, "hasOwnProperty"))
      ? function(o, p) {
          return o.hasOwnProperty(p);
        }
      : function(o, p) {
          return typeof o[p] != "undefined";
        };
  })();
....
Now Map just got a bit lighter :)

The anonymous function is unnecessary:

var _hasOwnProperty = jsx.object.isMethod({ }, "hasOwnProperty")
  ? function(o, p) {
      return o.hasOwnProperty(p);
    }
  : function(o, p) {
      return typeof o[p] != "undefined";
    };

The both of you have made me think. Either of your approaches and the
current approach will result in a runtime error if one deletes or modifies
Object.prototype.hasOwnProperty() after creating a Map, and one of the Map's
methods that call _getSafeKey() is then called. I am therefore inclined to make
_hasOwnProperty() a wrapper instead that checks on every call of it if
o.hasOwnProperty() is likely to be callable. What do you think?
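
Such a wrapper might look roughly like the sketch below. It is an assumption-laden illustration, not the draft's code: a plain `typeof` test stands in for jsx.object.isMethod(), whose exact behaviour is not shown in this thread.

```javascript
function _hasOwnProperty(o, p) {
  // Checked on every call, so a later deletion or replacement of
  // hasOwnProperty is noticed instead of causing a runtime error.
  if (typeof o.hasOwnProperty == "function") {
    return o.hasOwnProperty(p);
  }

  // Fallback from the quoted code; it cannot tell own from inherited properties.
  return typeof o[p] != "undefined";
}

var normal = { a: 1 };
var shadowed = { a: 1, hasOwnProperty: null };  // method no longer callable

var viaMethod   = _hasOwnProperty(normal, "a");     // true, through the method
var viaFallback = _hasOwnProperty(shadowed, "a");   // true, through the fallback
```

The price is one extra type test per call, which is the trade-off the question above is about.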


PointedEars
 
