Get list of unique words in a string

G

Gordon

Hi

Is there a method for getting all the unique words in a string? I
want to build a filter for a list of items that will hide items that
don't contain keywords entered into a text field. I've not started
writing anything yet, at the moment I'm just looking into the
feasibility of it.
 
M

Martin Honnen

Gordon said:
Is there a method for getting all the unique words in a string? I
want to build a filter for a list of items that will hide items that
don't contain keywords entered into a text field. I've not started
writing anything yet, at the moment I'm just looking into the
feasibility of it.

Try to split the value property of the text field on white space or
whatever else separates the keywords:
var words = 'foo bar foobar'.split(/\s+/);
 
L

Lasse Reichstein Nielsen

Gordon said:
Is there a method for getting all the unique words in a string?
Yes.

I want to build a filter for a list of items that will hide items
that don't contain keywords entered into a text field. I've not
started writing anything yet, at the moment I'm just looking into
the feasibility of it.

I'm guessing you want to extract the entered keywords from the text
field. For checking the items against it, I would probably create
a RegExp from the keywords and use it on each item, and not try
to parse each item.

To find the keywords, you need to decide on a valid input format,
and parse that. A very simple parsing could be:

function extractWords(string) {
return string.match(/\w+/g);
}

This function gives an array of sequences of word characters in the
input, and discarding everything else.

To make them unique, there are again several ways. One elaborate:

function uniquify(stringArray) {
var obj = {};
for (var i = 0; i < stringArray.length;) {
var string = stringArray;
if (obj[string] === obj) {
// Seen already - remove from array.
stringArray = stringArray[stringArray.length - 1];
stringArray.length -= 1;
} else {
// First occurence, mark as seen.
obj[string] = obj;
i++;
}
}
}

This function removes duplicates from an array of strings. It modifies
the array destructively.

A somewhat simpler version, using a (still) non-standard array
extension, is:

function uniquify2(stringArray) {
stringArray.sort();
return stringArray.filter(function(s,i,a) { return a[i + 1] != s; });
}


Building a regexp from that should be very simple:

var re = RegExp("\\b(?:" + uniqueArray.join("|") + ")\b");

Best of luck.
/L 'code NOT tested, use at own risk!'
 
T

Tad J McClellan

Gordon said:
Is there a method for getting all the unique words in a string?


var str = 'foo bar foo again baz foo bar other foo';
var seen = {};
var words = str.split(/\s+/);
for (var i=0; i < words.length; i++ ) {
seen[ words ]++;
}

for (var word in seen ) { // show the uniqified list
document.writeln(word);
}
 
J

Jeremy J Starcher

I'm guessing you want to extract the entered keywords from the text
field. For checking the items against it, I would probably create a
RegExp from the keywords and use it on each item, and not try to parse
each item.

To find the keywords, you need to decide on a valid input format, and
parse that. A very simple parsing could be:

function extractWords(string) {
return string.match(/\w+/g);
}

Note how this differs from some of the other replies. This matches word
characters instead of breaking on white space.

The regular expressions that match against \s are looking for white-space
characters and include punctuation as part of the word.
This function gives an array of sequences of word characters in the
input, and discarding everything else.

To make them unique, there are again several ways. One elaborate:

function uniquify(stringArray) {
var obj = {};
for (var i = 0; i < stringArray.length;) {
var string = stringArray;
if (obj[string] === obj) {
// Seen already - remove from array.
stringArray = stringArray[stringArray.length - 1];
stringArray.length -= 1;
} else {
// First occurence, mark as seen.
obj[string] = obj;
i++;
}
}
}

This function removes duplicates from an array of strings. It modifies
the array destructively.

A somewhat simpler version, using a (still) non-standard array
extension, is:

function uniquify2(stringArray) {
stringArray.sort();
return stringArray.filter(function(s,i,a) { return a[i + 1] != s; });
}


Building a regexp from that should be very simple:

var re = RegExp("\\b(?:" + uniqueArray.join("|") + ")\b");

Best of luck.
/L 'code NOT tested, use at own risk!'
 
L

Lasse Reichstein Nielsen

kangax said:
Lasse Reichstein Nielsen wrote:

It might be a good idea to escape `uniqueArray` characters against
regex special ones - "|", "?", "+", etc. as simply joining arbitrary
values seems a bit risky.

Absolutely. The uniqueArray values in *this* case were expected to
come from the previous functions, so it should only be word
characters.

In general, each "word" should have all RegExp-meaningfull characters
escaped (something like xxx.replace(/[|()[\]^$.?+*{}\\]/g, "\\$&") )

/L
 
D

Dr J R Stockton

In comp.lang.javascript message <[email protected].
Note how this differs from some of the other replies. This matches word
characters instead of breaking on white space.

The regular expressions that match against \s are looking for white-space
characters and include punctuation as part of the word.


But \w matches A-Za-z0-9_, and digits and underscore are rarely
considered as parts of English words. "Matches any word character
including underscore." is a shoddy description. The word "word" occurs
only twice in ECMA 5 FD; nowhere near the RegExp part.

And in JavaScript \w does not even match all characters allowed in
identifiers; it rejects $ and accented characters such as SAM or another
Owain Glynd?r might like to use.

It does not match apostrophe, so that genitive singulars are not
accepted. It does not match hyphen, so that Na-dene is not accepted,
nor Dow-Jones, nor dog-ear, nor one-base, nor once-over. It does not
match accented characters, so the other spellings of naif and naive are
not accepted. Those all appear in Lexicon Webster; a couple may not be
in Chambers; but that has other examples, such as nicky-tam, écarté &
Führer with their proper accents, no-man's-land, no-ball & many others.

And what if the OP has dealings with d?r Cymru (for which Courier New
may be helpful)?

ISTM that, unless the input character set is restricted, the matching
will have to be done with something like /[-a-zA-Z' ... ]/g - and
Westward Ho! should not be forgotten, nor M&S.
 
T

Thomas 'PointedEars' Lahn

kangax said:
Lasse said:
Gordon said:
Is there a method for getting all the unique words in a string?
[...]
To make them unique, there are again several ways. One elaborate:

function uniquify(stringArray) {
var obj = {};
for (var i = 0; i < stringArray.length;) {
var string = stringArray;
if (obj[string] === obj) {
// Seen already - remove from array.
stringArray = stringArray[stringArray.length - 1];
stringArray.length -= 1;
} else {
// First occurence, mark as seen.
obj[string] = obj;
i++;
}
}
}


I love the use of `obj` as a unique "key" for marking "seen" values. A
great example of keeping the code safe from `Object.prototype`
augmentation or `Object.prototype` properties collisions. Kind of like a
more compatible alternative to `Object.prototype.hasOwnProperty` :)


Sorry to disappoint you, but the value used to mark strings as already seen
does not matter with regard to collisions with built-in or inherited
properties. Insofar, the circular reference created by assigning the
reference to the object as value of one of its properties does not really
make sense, and I would not consider this approach more elaborate than what
was posted before.


PointedEars
 
T

Thomas 'PointedEars' Lahn

kangax said:
Thomas said:
kangax said:
Lasse Reichstein Nielsen wrote:
Is there a method for getting all the unique words in a string?
[...]
To make them unique, there are again several ways. One elaborate:

function uniquify(stringArray) {
var obj = {};
for (var i = 0; i < stringArray.length;) {
var string = stringArray;
if (obj[string] === obj) {
// Seen already - remove from array.
stringArray = stringArray[stringArray.length - 1];
stringArray.length -= 1;
} else {
// First occurence, mark as seen.
obj[string] = obj;
i++;
}
}
}
I love the use of `obj` as a unique "key" for marking "seen" values. A
great example of keeping the code safe from `Object.prototype`
augmentation or `Object.prototype` properties collisions. Kind of like a
more compatible alternative to `Object.prototype.hasOwnProperty` :)

Sorry to disappoint you, but the value used to mark strings as already seen
does not matter with regard to collisions with built-in or inherited
properties. Insofar, the circular reference created by assigning the


How so?


I thought it to be obvious by now.
As I understand it, the core of the issue boils down to being able to
reliably determine a presence of a user-defined property in a
newly-created Object object, since Object object's properties is what's
being used for storing "seen" values.

Becausev there is no collision test, in the first iteration a built-in
property can be overwritten or shadowed, so it does not matter with which
value that happens, and it does not really matter against which value in
following iterations is being compared.
Consider an alternative "solution" that a person might come up with -
creating a "truthy" property for every "seen" value and then performing
a plain boolean type conversion to determine value uniqueness:

....
if (obj[string]) {
....
else {
obj[string] = true;
....

It's clear that this approach falls short with values identical to any
of the `Object.prototype` properties' names, since such values would be
considered "seen" by a boolean type conversion (after being resolved
from `Object.prototype` as Function objects and so type converted to
`true` as any other object): -

var arr = ['toString'];
uniquify(arr);
arr; // []

An obvious "fix" for this might be to not perform type conversion, and
instead compare values explicitly:

....
if (obj[string] === true) {
....
else {
obj[string] = true;
....

While better than a previous "solution", this one still has a chance of
collisions, and a quite large one at that: -

Object.prototype.foo = true;
var arr = ['foo'];
uniquify(arr);
arr; // []


An `Object.prototype.hasOwnProperty` would of course solve this problem,
but is unfortunately not always available in modern (and not so modern)
browsers (Safari 2.x comes to mind):

....
if(Object.prototype.hasOwnProperty.call(obj, string)) {
...
}
else {
obj[string] = true; // or any value for that matter,
// even `undefined` will do :)
}

Unique "key", therefore, solves this problem quite painlessly.

No, it doesn't. It merely precludes the possibility that a built-in
property with a true-value (first alternative) or `true' (second
alternative) as value is regarded as a string being seen, but it is
still far from being a solution, much less a painless one.
Did I miss anything?

Yes.


PointedEars
 
T

Thomas 'PointedEars' Lahn

kangax said:
Thomas said:
kangax wrote: [...]
Unique "key", therefore, solves this problem quite painlessly.
No, it doesn't. It merely precludes the possibility that a built-in
property with a true-value (first alternative) or `true' (second
alternative) as value is regarded as a string being seen, but it is
still far from being a solution, much less a painless one.

Not sure I understand. Could you demonstrate the shortcomings of
KEY-based approach with an example?

For starters, debug with "prototype" as input.
Which solution(s) would you recommend?

That which have been recommended before: Object.prototype.hasOwnProperty()
to determine if a built-in property would not be overwritten or shadowed,
and the less reliable typeof ... == "undefined" test where the former method
is not provided. Then double hashing to find an unused property name in
case of a collision. The object reference as unique marker value would
certainly be a plus, but I would use an "empty" object, not the same that
serves as hash table.


PointedEars
 
T

Thomas 'PointedEars' Lahn

Thomas said:
kangax said:
Thomas said:
kangax wrote: [...]
Unique "key", therefore, solves this problem quite painlessly.
No, it doesn't. It merely precludes the possibility that a built-in
property with a true-value (first alternative) or `true' (second
alternative) as value is regarded as a string being seen, but it is
still far from being a solution, much less a painless one.
Not sure I understand. Could you demonstrate the shortcomings of
KEY-based approach with an example?

For starters, debug with "prototype" as input.

Err, that would not do much harm. Try "__proto__" or you-know-what instead.


PointedEars
 
J

Jeremy J Starcher

In comp.lang.javascript message <[email protected].
But \w matches A-Za-z0-9_, and digits and underscore are rarely
considered as parts of English words.

Thats the geek in me. I forgot that [0-9_] aren't part of English
words. I'll be sure to remind 007.
And in JavaScript \w does not even match all characters allowed in
identifiers; it rejects $ and accented characters such as SAM or another
Owain Glynd?r might like to use.

Accented characters crossed my mine AFTER I posted. I was double-
checking things when I caught this post.
It does not match apostrophe, so that genitive singulars are not
accepted. It does not match hyphen, so that Na-dene is not accepted,
nor Dow-Jones, nor dog-ear, nor one-base, nor once-over. It does not
match accented characters, so the other spellings of naif and naive are
not accepted. Those all appear in Lexicon Webster; a couple may not be
in Chambers; but that has other examples, such as nicky-tam, écarté &
Führer with their proper accents, no-man's-land, no-ball & many others.

There as a long discussion in alt.html over what made a paragraph and
what made a word.
 
J

Jeremy J Starcher

There as a long discussion in alt.html over what made a paragraph and
what made a word.

Bother.

According to a few of the regulars over there, not even the Japanese
agree on what is a 'word.' Is it a single glyph? Or is it enough glyphs
to form a single 'idea', similar to the contrast here:

"Time flies like an arrow."
"Fruit flies like a banana."
 
T

Thomas 'PointedEars' Lahn

kangax said:
Works as expected (FF 3.0.8/FB 1.3X.3 on MacOSX 10.5.6):

Yes, I picked a bad example. Mark my correction, please.
function uniquify(arr) {
var obj = {}, MARKED = {};
for (var i = 0; i < arr.length;) {
var string = arr;
if (obj[string] === MARKED) {
arr = arr[arr.length - 1];
arr.length -= 1;
} else {
obj[string] = MARKED;
i++;
}
}
}

var arr = ['prototype', 'foo', 'prototype', 'bar'];
uniquify(arr);
arr; // ["prototype", "foo", "bar"]


And I don't see why it wouldn't. There should be no built-in property
with a value of a unique object that we are comparing it to. That's what
makes it unique, after all.


It does not matter for collisions. Let the input be "foo", and let "foo" be
the name of a built-in property of the object that `obj' refers to or an
inherited property of Object objects. In the first iteration the result of
`obj["foo"]' does not (cannot) strictly equals the value of `MARKED', so the
`else' branch is taken where the value of the `foo' property is either
replaced or, if it is inherited, shadowed by the creation of a `foo'
property on `obj', *unconditionally*.
Ok. That makes sense.


Double hashing?

Double hashing is a technique to avoid collisions in a hash table. With
ECMAScript native objects as hash tables, it merely boils down to finding an
unused property name that serves as an alias for the actual key. Suppose
the key is "toString", which place in that table is taken because the
property of the same name is inherited from Object.prototype, an alias could
be as simple as "_toString". There is, however, a caveat to this technique:
one has to define a cutoff at which no more alternatives are tried and the
storage attempt must be considered failed, or in the Worst Case the
algorithm would run forever.
As in my latter example which uses a separate `MARKED` object?

Yes, I overlooked that.
On the other hand, what's wrong with circular reference in this case?

AFAIK it is considered bad practice in OOP.
There are no host objects involved. What harm does an extra property on
an object with a reference to itself cause?

That would probably depend on whether the object that `obj' refers to can be
universally garbage-collected after control left the local execution context
and the `obj' reference is no more. AFAIK, an object is not marked for
garbage collection if there are still references to it; in this case,
however, there would be several.


PointedEars
 
L

Lasse Reichstein Nielsen

Thomas 'PointedEars' Lahn said:
That would probably depend on whether the object that `obj' refers to can be
universally garbage-collected after control left the local execution context
and the `obj' reference is no more. AFAIK, an object is not marked for
garbage collection if there are still references to it; in this case,
however, there would be several.

That would only be a problem if the Javascript engine uses reference
counting garbage collection. I'm not aware of any implementation that
uses reference counting for native JavaScript objects (although there
are some that uses it for DOM objects).
Modern garbage colelction techniques can handle cyclic references
without problems (otherwise recursive functions would be a problem).

/L
 
T

Thomas 'PointedEars' Lahn

kangax said:
Thomas said:
kangax wrote: [...]
function uniquify(arr) {
var obj = {}, MARKED = {};
for (var i = 0; i < arr.length;) {
var string = arr;
if (obj[string] === MARKED) {
arr = arr[arr.length - 1];
arr.length -= 1;
} else {
obj[string] = MARKED;
i++;
}
}
}

var arr = ['prototype', 'foo', 'prototype', 'bar'];
uniquify(arr);
arr; // ["prototype", "foo", "bar"]


And I don't see why it wouldn't. There should be no built-in property
with a value of a unique object that we are comparing it to. That's what
makes it unique, after all.

It does not matter for collisions. Let the input be "foo", and let "foo" be
the name of a built-in property of the object that `obj' refers to or an
inherited property of Object objects. In the first iteration the result of
`obj["foo"]' does not (cannot) strictly equals the value of `MARKED', so the
`else' branch is taken where the value of the `foo' property is either
replaced or, if it is inherited, shadowed by the creation of a `foo'
property on `obj', *unconditionally*.


Yes, indeed. A built-in property can be overwritten by an arbitrary
value. But is it a bad thing in this particular context? Is
shadowing/overwriting relevant to the problem of creating a robust
`uniquify` function?


Probably yes. However, what I recognized later, too, is that (effectively)
ReadOnly built-in properties pose much more of a problem:
Thinking more about it, I see a possibility of an environment
implementing proprietary ReadOnly properties on an Object object (ala
`__proto__`). This means that setting a certain property to a `MARKED`
value *does not guarantee* that the property is indeed changed and that
its value is equal to `MARKED`. This is getting tricky :)

To work around such deficiencies, it should be possible to employ a
usual name augmentation by, say, prepending a value string with a
sufficiently random string (every time before it's being accessed):

function uniquify(arr) {
var obj = {},
MARKED = {},
token = '_' + (Math.random()+'').slice(2);
for (var i = 0; i < arr.length;) {
var string = arr;
if (obj[token + string] === MARKED) {
arr = arr[arr.length - 1];
arr.length -= 1;
} else {
obj[token + string] = MARKED;
i++;
}
}
return arr;
}


This makes it possible to work reliably with names such as `__proto__`:


var arr = ['__proto__', 'a', 'a'];
uniquify(arr); // ["__proto__", "a"]


It is not that easy :) Consider this:

var arr = ['_proto__', 'a', 'a'];

That is why I proposed "double hashing". Also, in my "double-hashed" "hash
table" implementation (that I wrote this morning) I use trailing underscores
instead because it is less likely that such property names are already used.
Just remembered that `hasOwnProperty` alone can not be relied on either:

({}).hasOwnProperty('__proto__'); // `true` in FF

I don't see the problem there. The method does not return a false positive,
so `__proto__' will not be used as key.


PointedEars
 
T

Thomas 'PointedEars' Lahn

Thomas said:
kangax said:
function uniquify(arr) {
var obj = {},
MARKED = {},
token = '_' + (Math.random()+'').slice(2);
for (var i = 0; i < arr.length;) {
var string = arr;
if (obj[token + string] === MARKED) {
arr = arr[arr.length - 1];
arr.length -= 1;
} else {
obj[token + string] = MARKED;
i++;
}
}
return arr;
}

This makes it possible to work reliably with names such as `__proto__`:

var arr = ['__proto__', 'a', 'a'];
uniquify(arr); // ["__proto__", "a"]


It is not that easy :) Consider this:

var arr = ['_proto__', 'a', 'a'];


Ahh, Math.random() (String(x) is more efficient than x+'', BTW.)
That is why I proposed "double hashing". Also, in my "double-hashed" "hash
table" implementation (that I wrote this morning) I use trailing underscores
instead because it is less likely that such property names are already used.

I'd still go for this solution instead, it is much more robust.


PointedEars
 
T

Thomas 'PointedEars' Lahn

Thomas said:
Thomas said:
kangax said:
function uniquify(arr) {
var obj = {},
MARKED = {},
token = '_' + (Math.random()+'').slice(2);
for (var i = 0; i < arr.length;) {
var string = arr;
if (obj[token + string] === MARKED) {
arr = arr[arr.length - 1];
arr.length -= 1;
} else {
obj[token + string] = MARKED;
i++;
}
}
return arr;
}

This makes it possible to work reliably with names such as `__proto__`:

var arr = ['__proto__', 'a', 'a'];
uniquify(arr); // ["__proto__", "a"]

It is not that easy :) Consider this:

var arr = ['_proto__', 'a', 'a'];


Ahh, Math.random() (String(x) is more efficient than x+'', BTW.)


And x.charAt(2) is probably more efficient and compatible than x.slice(2).
I'd still go for this solution instead, it is much more robust.

PointedEars
 
D

Dr J R Stockton

In comp.lang.javascript message <7vydnWj1-sroanXUnZ2dnUVZ_rWdnZ2d@gigane
It's clear that this approach falls short with values identical to any
of the `Object.prototype` properties' names, since such values would be
considered "seen" by a boolean type conversion (after being resolved
from `Object.prototype` as Function objects and so type converted to
`true` as any other object): -

JavaScript, including its internals, was written by and designed to be
read by Americans. Therefore, all that is needed is to "put in" the
Object not the "word" itself but the concatenation of the "word", an
underline, and some word which is exceedingly offensive in their
language, so that it can never have been used within JavaScript itself.
You can enter the word by String.fromCharCode() so that it is not
directly visible.

Actually, using just "\u0175" should suffice; and the following seems to
be a legal way of assigning a zero to X : X = (\u0175 = 0, 33, \u0175).
 
L

Lasse Reichstein Nielsen

Dr J R Stockton said:
Actually, using just "\u0175" should suffice; and the following seems to
be a legal way of assigning a zero to X : X = (\u0175 = 0, 33, \u0175).

It should be, but not all browsers/Javascript implementations allow
full unicode for variable names. It fails in Opera.

The assignment expression
X = (\u0175 = 0, 33, \u0175)
works because \u0175 (aka. "w with a hat") is a valid identifier, so it's
equivalent to:
X = (w = 0, 33, w)
where the parenthesis contains three expressions: an assignment expression
that assigns 0 to "w", a constant that does nothing, and a read of the
variable "w" which becomes the value of the "comma-expression".

So, the above does indeed assign the value zero to X, but it also
creates a global variable called "w-hat" with the value zero.

/L
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,767
Messages
2,569,570
Members
45,045
Latest member
DRCM

Latest Threads

Top