searching an array, string compare functions, string sorting

C

ckirchho

Hello,

I would like to use an array as a search index. The code defining the
array is not generated with JavaScript.

When searching a word in the index, I do a quick binary search. Which
fails, of course, if the order of the words in the array is not in
exactly the way as JavaScript would sort them.

The words in the index can contain special characters, e.g. from the
german, french or spanish alphabet.

I wonder what might be easier: To teach the external application to
sort the index array in a way that equals the JavaScript sorting, OR
to teach JavaScript to compare strings in a way that the binary search
works correctly.

By the way: Which is the JavaScript way of sorting? It seems that all
standard capital letters come in front of lower case letters, and all
non standard letters come at the end. e.g. ABC...XYZabc...xyzÄÉÖÜßäéöü

Right now I load the array an let JavaScript sort it, which of course
consumes time.

Would be great if anybody could name some sources of information, or
give me some techniques.

Best regards,

Christian Kirchhoff
 
D

Dr J R Stockton

In comp.lang.javascript message <[email protected]
glegroups.com>, Wed, 10 Oct 2007 03:12:40, (e-mail address removed)
posted:
I would like to use an array as a search index. The code defining the
array is not generated with JavaScript.

When searching a word in the index, I do a quick binary search. Which
fails, of course, if the order of the words in the array is not in
exactly the way as JavaScript would sort them.

The words in the index can contain special characters, e.g. from the
german, french or spanish alphabet.

Where an accented character looks the same in two such languages, the
identical representation is probably used. But if something in Arabic
looks just like something else in Tagalog, the representation would
differ.
I wonder what might be easier: To teach the external application to
sort the index array in a way that equals the JavaScript sorting, OR
to teach JavaScript to compare strings in a way that the binary search
works correctly.

Javascript sort can be given a user-written compare function, and a u-w
c f can be used instead of < = > in your binary search.

However, if a dataset is searched much after transfer to JS, it will
probably be worth pre-processing it so that a natural sort or a trivial
compare can be used.

If we think of your dataset as an array A with elements A[0] to A[N],
you can process that into an array B[0] to B[N] of objects such that
B[k] is {a:A[k], b:Fn(A[k])} where Fn is a function transforming A[k]
into a key such that comparing keys has the desired effect.

For example, if the elements are letters, and Fn converts letters to
lower case, that which naturally would sort to ABCabc would then be
sorted so that the A & a came first (in either order), B & b next ("), C
& c last (").

Your compare function then takes two array elements and compares their
keys with < = >.

Note : if you have fewer than approaching 2^64 possible words, you can
in principle transform each different word to a different signed Number.
Comparing Numbers must be fastest in your search - but the preparation
overhead would be enormous. A mere Gedankenexperiment.
By the way: Which is the JavaScript way of sorting? It seems that all
standard capital letters come in front of lower case letters, and all
non standard letters come at the end. e.g. ABC...XYZabc...xyzÄÉÖÜßäéöü

Our Dutch, etc., members might debate "non-standard"; otherwise, that
order is as implied below.
Right now I load the array an let JavaScript sort it, which of course
consumes time.

If you needs are only as stated, that will be fastest if each loading of
the array is binary-searched a large number of times; but if it is
binary-searched only once then you will do better to use a you-written
comparison function in your binary search such that the loaded array is
already sorted by the criteria of the function. [1]
Would be great if anybody could name some sources of information, or
give me some techniques.

The newsgroup FAQ fails to cite ISO/IEC 16262, which is subsequent to
ECMA-262 which it does cite. My site cites both, but 262 will suffice.
Seek "sort", find 15.4.4.11, which says that the default comparison is
as < = >. Find Relational Operators in 11.8, proceed to 11.8.5, steps
1, 2, 3, 16-21. Strings are compared in the obvious manner, character
by character; and characters are compared by code point order. And code
point is in 6 para 3 - UTF-16 as a 16-bit word.



[1] George Bernard Shaw (1856-1950) :
"The reasonable man adapts himself to the world ;
the unreasonable one persists in trying to adapt the world to himself.
Therefore all progress depends on the unreasonable man."
/Man and Superman/

Adapting the world takes time; so, for one search, be reasonable; for
many, adapt the world.

It's a good idea to read the newsgroup c.l.j and its FAQ. See below.
 
R

RobG

Hello,

I would like to use an array as a search index. The code defining the
array is not generated with JavaScript.

When searching a word in the index, I do a quick binary search. Which
fails, of course, if the order of the words in the array is not in
exactly the way as JavaScript would sort them.

The words in the index can contain special characters, e.g. from the
german, french or spanish alphabet.

I wonder what might be easier: To teach the external application to
sort the index array in a way that equals the JavaScript sorting, OR
to teach JavaScript to compare strings in a way that the binary search
works correctly.

If you don't need to display the "correctly" sorted array, just use
the javascript sort order. Otherwise, you have to create a javascript
compare function that compares the elements the way you want so that
your binary search works.

By the way: Which is the JavaScript way of sorting? It seems that all
standard capital letters come in front of lower case letters, and all
non standard letters come at the end. e.g. ABC...XYZabc...xyzÄÉÖÜßäéöü

Javascript sorts strings essentially by converting the characters to
their Unicode equivalent and sorting that, so:

['a','\u00e4','b'].sort() ==> ['a','b','\u00e4']

Right now I load the array an let JavaScript sort it, which of course
consumes time.

If you can work OK with it in the javascript sort order, I'd expect it
to be better to sort it on the server according to the javascript
rules, then use the native javascript sort on the client.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,013
Latest member
KatriceSwa

Latest Threads

Top