RegExp split for Spell Check

SmokeWilliams

Hi,
I am working on a Spell checker for my richtext editor. I cannot use
any open source, and must develop everything myself. I need a RegExp
pattern to split text into a word array. I have been doing it by
splitting by spaces or <p> tags. I run into a problem with the
richtext part of my editor. When I change the font, it wraps the text
in a tag. The tag has something like <font face="arial">some words</font>.
This splits the text at font^face so I need to split on spaces
unless they are within the HTML tag. I am just looking for the
pattern for my regExp. I know there may be better ways for me to do
it, but right now I just need help with this issue.

Thanks in advance.

Pete
 
Evertjan.

SmokeWilliams wrote on 23 nov 2007 in comp.lang.javascript:
I am working on a Spell checker for my richtext editor.
I cannot use any open source, and must develop everything myself.

Why? At least look at all the code you can find. Coming up with complex
code from scratch does not give you the benefit of years of code
experimentation of the collective of world's programmers.
I need a RegExp pattern to split text into a word array.

Why? Does it matter how you do it? Parsing seems so much simpler.
I have been doing it by
splitting by spaces or <p> tags. I run into a problem with the
richtext part of my editor. When I change the font, it wraps the text
in a tag.
the tag has something like <font face="arial">some words</font>

That is last century's code. Why not use said:
This splits the text at font^face so I need to split on spaces
unless they are within the HTML tag.
I am just looking for the pattern for my regExp.
I know there may be better ways for me to do
it, but right now I just need help with this issue.

I think that by stipulating the above unnecessary constraints, you will
get yourself into much trouble.

However try this:

var wordArrray = textString.replace(/(<[^>]*>)/g,' ').split(/\s+/)
 
Dr J R Stockton

In comp.lang.javascript message <[email protected]>
However try this:

var wordArrray = textString.replace(/(<[^>]*>)/g,' ').split(/\s+/)

If the page contains <script>...<\/script> then ISTM that the script
will be spell-checked; likewise the content of any textarea and possibly
others.

Could one write the full text to a page or div as HTML (useful anyway)
and read it back as .innerText for spell-checking ?
 
Evertjan.

Randy Webb wrote on 23 nov 2007 in comp.lang.javascript:
Evertjan. said the following on 11/23/2007 1:49 PM:

Because that is what the browsers put in the code in a contentEditable
element :)

So why use contentEditable if you cannot control it?

Wouldn't a simple <div> with onkeypress do?
 
Dr J R Stockton

In comp.lang.javascript message said:
The idea of spell-checking, in the sense of a true spell-checker is
almost impossible to implement in a browser due to the inherent size of
the dictionary that you must use.

Fifty thousand words is sufficient for ordinary use. I have to hand a
"Universal" pocket dictionary of a language resembling English, with 407
pages of two columns of about 15 words each; so about a quarter of that
size. The Little Oxford dictionary, 606 * 2 * 20, is about 25000 words.

I have to hand the New Testament in Basic English; its Note refers to
Basic English having 850 words, and to the NT using another 150
particular to the topic. It lacks the richness of the King James
version; but the text looks quite normal.

A spell-checker for use by the younger half of school-children would not
need very many words.

An alphabetical list of words, compressed, should not need much more
than two bytes per word.

So the list of words need be no longer than my largest Web page,
currently 105000 bytes; and that's quite acceptable over broadband if
expected and cached properly.

There should be plenty of room to store such data in Javascript, from
what I've read here in other threads.

Lookup needs be no faster than typing, and properly implemented should
need only O(log2(N)) comparisons when using the main dictionary. It
would seem a potentially smart move to cache in a sub-dictionary the
words actually already seen (right or wrong) in the current text, since
words are often repeated. FAQ 2.3 contains about 675 words, but only
about 343 different ones. One third of its words are in the Top 8, "the
to and of in not a is".
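
A minimal sketch of the lookup strategy described above, assuming a main dictionary held as a sorted array and searched by bisection, with a small cache of words already seen in the current text. The names WORDS, seen, inMainDictionary and isKnownWord are made up for illustration, and the eight-word list merely stands in for a real dictionary.

```javascript
// Hypothetical main dictionary; must be sorted in code-unit order.
var WORDS = ['a', 'and', 'in', 'is', 'not', 'of', 'the', 'to'];

// Sub-dictionary of words already checked in this text (right or wrong).
var seen = {};

// Binary search: roughly log2(N) string comparisons per lookup.
function inMainDictionary(word) {
  var lo = 0, hi = WORDS.length - 1;
  while (lo <= hi) {
    var mid = (lo + hi) >> 1;
    if (WORDS[mid] === word) return true;
    if (WORDS[mid] < word) lo = mid + 1; else hi = mid - 1;
  }
  return false;
}

// Consult the cache first; fall back to the main dictionary once per word.
function isKnownWord(word) {
  var key = ' ' + word;   // prefix keeps keys clear of built-in property names
  if (seen.hasOwnProperty(key)) return seen[key];
  return (seen[key] = inMainDictionary(word));
}
```

Because misspellings are cached as well, a repeated wrong word costs only one search of the main dictionary.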

The sub-dictionary can be pre-loaded with the commonest good and bad
spellings, if that helps.

Of course, it would be quite wrong to impose the full OED on an
unsuspecting dial-up user.
 
Dr J R Stockton

In comp.lang.javascript message said:
Dr J R Stockton said the following on 11/24/2007 4:32 PM:

I found a text file after looking for almost an hour. It has 213,558
words in it. The text file is 2.4 mbs. The biggest problem with even a
25,000 word dictionary is going to be lookup time. That can be helped a
lot by splitting it up into 26 dictionaries by beginning letter.

and maybe 26^2 by splitting those. I was presuming an intelligent
lookup strategy, not just a linear search along the whole of a list.
Too bad I can't look up half of what those words mean to know what they
mean. What the heck is a zakkeu?

Insert it into an IE6 address bar, and go. It seems to be perhaps
(possibly outdated) Dutch. However, Google Translate does not know it
(correct spelling may be zakeu, also not known). It may be an "error"
inserted to detect copying, as in "The Annihilation of Angkor Apeiron",
and in Wikipedia "Fictitious entry".

But if you don't know what half of those words mean, the dictionary must
be at least about twice as big as is needed to check your writing, I
hope.


But the following code generates an Object containing about 100,000
arbitrarily-named properties, and looks up one of them. The lookup
appears to take no time at all, meaning under about 15 ms. That's good
enough (caveat - building the dictionary is slower!). P4/3G, XP.

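var Dic = {}, J, X
// Build a dictionary of 100,000 arbitrarily named properties.
for (J=0 ; J<1e5; J++) Dic[String(Math.sin(J))] = J
var T = new Date()
// Time 10,000 lookups of keys that should all be present.
J=1e4 ; while (J--) X = Dic[String(Math.sin(98765-3*J))]
var Y = [new Date()-T, X]   // [elapsed ms, value of the last lookup]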

Y becomes [172, 98765] or [156, 98765]; each lookup takes about 16
microseconds in a dictionary of 100,000- or is the code not testing it
correctly?

That suggests that neither dictionary size nor lookup time need be a
major problem, if the dictionary is not vast.
 
Dr J R Stockton

In comp.lang.javascript message said:
Dr J R Stockton said the following on 11/26/2007 11:27 AM:
I have wanted a customized personal dictionary of my own for a while
now. The biggest problem I have had was trying to find a text file that
had a word list in it that I could trust.

Try Google for a combination of two or three entirely unrelated unusual
words, and you'll start finding possible lists. Taghairm Octothorpe
seems a bit too obscure a pair; but maybe you don't do taghairm in the
USA (it's not in my Websters).

^^^^^^^^^^^
That bit applies to an earlier version of the code :-(.
appears to take no time at all, meaning under about 15 ms. That's good
enough (caveat - building the dictionary is slower!). P4/3G, XP.
Dic = {}
for (J=0 ; J<1e5; J++) Dic[String(Math.sin(J))] = J
T = new Date()
J=1e4 ; while (J--) X = Dic[String(Math.sin(98765-3*J))]
Y = [new Date()-T, X]
Y becomes [172, 98765] or [156, 98765]; each lookup takes about 16
microseconds in a dictionary of 100,000- or is the code not testing it
correctly?

It isn't doing a comparison. It creates the dictionary then it sets the
var X to the value of a possible entry 1000 times. It is looking up
1,000 entries which is 999 more than it has to look up.

The code is designed to do lookups for timing, without bothering with
the trivial matter of reporting success in an appropriate form.

I'd not thought it necessary to explain that doing 1000 different
lookups was in order to take a measurable total time. X is the value of
the last lookup, as a check. BTW, changing the 1e4 did change the time
proportionately (as was confidently expected) and changing the 1e5
changed it much less (as was less confidently expected).

In the timed part, words that should be present are found. For those
who don't make too many errors, the time for a failed lookup is less
important. New test : insert +0.5 after 3*J . Virtually all lookups
now fail. Time taken is unchanged.
The flaw in the test is what made me realize how to do a dictionary
and make it simple and fast. Instead of setting the Dic entry to J, set
it to 1. Then, to find out if a word exists in the dictionary or not
you simply test for it:
if(Dic['word here'])

I merely found it more convincing for the lookup to find the position.
Your code fragment only does the lookup, and does it in the same way as
my code. One can set the entries to true, and return either true or
undefined.
If the entry is there, it will return 1, convert it to true. If the
entry doesn't exist, then it returns undefined and converts it to
false. Let the browser do the lookup.


To correct myself, and admit I was thinking about it wrong, I don't
think the lookup is a problem. A 214,000 word dictionary is roughly 4.5
mbs so a 25,000 word dictionary should, guessing, be around 500kb or
so. Not bad on a broadband connection but murder on a dialup
connection.

A dictionary should compress automatically over modern dial-up, if in
alphabetical order; and one can write algorithms to compress this
special case better. For example, if the first N letters of a word are
the same as the first N of the previous word (including the 26 instances
of N=0), replace them by N encoded in base-36. So I think 500kB, if you
mean that, is an over-estimate. It's still a lot for an arbitrary Web
page; but not unreasonable even on dial-up if fetched on knowing demand
and cached.
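
The prefix scheme just described might be sketched as follows. This assumes a sorted list of words and caps N at 35 so that it always fits in a single base-36 character; compressWords and expandWords are hypothetical names, not an existing library.

```javascript
// Replace the letters shared with the previous word by their count N,
// written as one base-36 digit at the start of each entry.
function compressWords(sortedWords) {
  var out = [], prev = '';
  for (var i = 0; i < sortedWords.length; i++) {
    var word = sortedWords[i], n = 0;
    // Count shared leading letters; cap at 35 so N stays one digit.
    while (n < 35 && n < word.length && n < prev.length &&
           word.charAt(n) === prev.charAt(n)) n++;
    out.push(n.toString(36) + word.substring(n));
    prev = word;
  }
  return out;
}

// Reverse the process: copy N letters from the previous word, then append.
function expandWords(entries) {
  var out = [], prev = '';
  for (var i = 0; i < entries.length; i++) {
    var n = parseInt(entries[i].charAt(0), 36);
    prev = prev.substring(0, n) + entries[i].substring(1);
    out.push(prev);
  }
  return out;
}
```

For example, compressWords(['spell', 'spelled', 'speller', 'spelling', 'spend']) gives ['0spell', '5ed', '6r', '5ing', '3nd'], which is 18 characters instead of 32 before any general-purpose compression.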
Any idea where to find a reliable 25,000 word list?

If you are prepared to consider the spell-checker in a word processor
reliable, then just grab large quantities of plain text off the Net
(Project Gutenberg should have largely correct spellings, as should the
reports of your legislature), sort, deduplicate, and edit in the word
processor. If you take only lower-case words, you'll miss most proper
names.

I don't know how many items DOS sort or javascript sort will do in a
reasonable time, but there's always overnight. Via sig line 3, DEDUPE
is a DOS file line-deduplicator.

Actually, you don't *need* a word list. Any spelling checker should be
able to be told that the word it is currently complaining about is in
fact good, and to remember that either in the current document or
permanently. Start with an empty list, and after a few paragraphs it'll
know the words you commonly use, with your preferred spelling. You'll
just need one Webster lookup for each new word that you're not certain
how to spell.
 
Dr J R Stockton

In comp.lang.javascript message said:
I guessed at the 500kb based on 215,000 entries being 4.5Mb. Creating a
test file with 25,000 entries in it where each entry is 6 characters
long - to create an "average" word length - the file was 439Kb so I
wasn't far off. Of course, the actual size would depend on the 25,000
words you used.

25000 6-character words in 7-bit ASCII, with CRLF separators, needs
exactly 200kB. It may use more if created in Word, or if encoded in a
manner allowing letters other than A-Z.

For ordinary English, one only needs A to Z - ' and a separator, so
5-bit characters could be used by mere packing - 25000*5*6/8 -> under
100 kbytes, before any additional compression.
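
The arithmetic behind those two estimates, worked through as a short script. The variable names are illustrative, and the last figure adds an assumption of one 5-bit separator per word, which the 25000*5*6/8 estimate leaves out.

```javascript
var wordCount = 25000;     // figures taken from the posts above
var lettersPerWord = 6;

// One byte per character, plus CR and LF after each word.
var plainBytes = wordCount * (lettersPerWord + 2);                   // 200000

// Five bits per letter, letters only: 25000*5*6/8.
var packedBytes = wordCount * 5 * lettersPerWord / 8;                // 93750

// Same packing with one 5-bit separator per word (added assumption).
var packedWithSeparator = wordCount * 5 * (lettersPerWord + 1) / 8;  // 109375
```

Five bits give 32 codes, enough for A to Z, the hyphen, the apostrophe and a separator.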

Of course, dictionary words are longer than the average.
 
SmokeWilliams

This is exactly what I was afraid of. I know it isn't the best
solution. I know there are better ways. I need a pattern to be used
only in the split because I need to maintain the length of the
string. So again, if anyone knows how to make the pattern to split
text by spaces or carriage returns "\r| " this is the split I am using
now. But as I stated above I need to ignore the spaces within HTML
tags. Please help me. Just the simple pattern will do. Thanks.

Pete
 
SmokeWilliams

Hello Evertjan, thanks for replying.
However try this:

var wordArrray = textString.replace(/(<[^>]*>)/g,' ').split(/\s+/)

I need a pattern that will split without replacing. So I need to
split on spaces or carriage returns, but not spaces that are within
html tags. I know there are better ways, but I am using an IFrame in
IE and I work for a government agency which doesn't allow me to use
open source. I am depending on a RegEx wizard out there to supply me
with the pattern.

So I need a pattern that matches any space or carriage return that is
not within an html tag.

<font face="arial" size=2>test</font><p>yo this is a test

Splitting this text should return an array containing:
1: <font face="arial" size=2>test</font>
2: yo
3: this
4: is
5: a
6: test

Thanks for your help.

Pete
 
Evertjan.

SmokeWilliams wrote on 03 dec 2007 in comp.lang.javascript:
Hello Evertjan, thanks for replying.
However try this:

var wordArrray = textString.replace(/(<[^>]*>)/g,' ').split(/\s+/)

I need a pattern that will split without replacing. So I need to
split on spaces or carriage returns, but not spaces that are within
html tags. I know there are better ways, but I am using an IFrame in
IE and I work for a government agency which doesn't allow me to use
open source. I am depending on a RegEx wizard out there to supply me
with the pattern.

IFrame and IE do not make any difference, meseems
So I need a pattern that matches any space or carriage return that is
not within an html tag.

<font face="arial" size=2>test</font><p>yo this is a test

Splitting this text should return an array containing:
1: <font face="arial" size=2>test</font>
2: yo
3: this
4: is
5: a
6: test

But why is <p> whitespace?

It can be done:

<script type='text/javascript'>

var t;
t = '<font face="arial" size=2>test</font><p>yo this is a test';

t = t.replace(/<p>/g,' ');
t = t.replace(/(<[^>]*>)/g,
function(a){return a.replace(/\s+/g,'^^#^^')});
t = t.replace(/\s+/g,'^^^^^');
t = t.replace(/\^\^#\^\^/g,' ');
t = t.replace(/</g,'&lt;');
t = t.split(/\^\^\^\^\^/);

document.write(t.join('<br>======<br>'));

</script>
 
pr

SmokeWilliams said:
<font face="arial" size=2>test</font><p>yo this is a test

Splitting this text should return an array containing:
1: <font face="arial" size=2>test</font>
2: yo
3: this
4: is
5: a
6: test

Try:

alert('<font face="arial" size=2>test</font><p>yo this is a test'
  .replace(/\s(?=[^<]*>)/g, "~").split(/<p>|\s/).join("\n"));

You can either replace the '~'s or leave them in; either way, your
string lengths are the same as the original HTML (as long as you clear
up the <p> != whitespace issue).
 
Thomas 'PointedEars' Lahn

pr said:
SmokeWilliams said:
<font face="arial" size=2>test</font><p>yo this is a test

Splitting this text should return an array containing:
1: <font face="arial" size=2>test</font>
2: yo
3: this
4: is
5: a
6: test

Try:

alert('<font face="arial" size=2>test</font><p>yo this is a
test'.replace(/\s(?=[^<]*>)/g, "~").split(/<p>|\s/).join("\n"));
^^^^^^^
| I need a pattern that will split without replacing.


PointedEars
 
Thomas 'PointedEars' Lahn

SmokeWilliams said:
I need a pattern that will split without replacing. So I need to
split on spaces or carriage returns, but not spaces that are within
html tags. I know there are better ways, but I am using an IFrame in
IE and I work for a government agency which doesn't allow me to use
open source. I am depending on a RegEx wizard out there to supply me
with the pattern.

So I need a pattern that matches any space or carriage return that is
not within an html tag.

<font face="arial" size=2>test</font><p>yo this is a test

Splitting this text should return an array containing:
1: <font face="arial" size=2>test</font>
2: yo
3: this
4: is
5: a
6: test

Suppose you have

var s = '<font face="arial" size=2>test</font><p>yo this is a test';

Either you have a weird idea of "html tag" (HTML is an acronym, BTW),
or (which is more likely) instead you want the resulting array to be

['', 'test', '', 'yo', 'this', 'is', 'a', 'test']

This could be achieved by using tags as additional delimiters:

var a = s.split(/<[^>]+>|\s+/);

Microsoft JScript will not include the empty strings in the array.
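
A minimal sketch of that approach which behaves the same whether or not the engine keeps the empty strings: split on tags and whitespace, then drop any empty entries. The function name extractWords is made up for illustration.

```javascript
// Split on tags or runs of whitespace, then discard empty entries so the
// result does not depend on how the script engine treats them.
function extractWords(html) {
  var parts = html.split(/<[^>]+>|\s+/);
  var words = [];
  for (var i = 0; i < parts.length; i++) {
    if (parts[i] !== '') words.push(parts[i]);
  }
  return words;
}

var sample = '<font face="arial" size=2>test</font><p>yo this is a test';
```

Here extractWords(sample) gives ['test', 'yo', 'this', 'is', 'a', 'test']. The positions of the words in the original string are lost, so this suits checking the words rather than writing corrections back.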


PointedEars
 
pr

Thomas said:
pr said:
alert('<font face="arial" size=2>test</font><p>yo this is a
test'.replace(/\s(?=[^<]*>)/g, "~").split(/<p>|\s/).join("\n"));
^^^^^^^
| I need a pattern that will split without replacing.

Hanged if I can think of a good reason why, but well spotted, Thomas,
this is more efficient in any case:

alert('<font face="arial" size=2>test</font><p>yo this is a test'
  .split(/\s(?![^<]*>)|<p>/).join("\n"));
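
The same pattern wrapped in a function, as a sketch rather than a finished tokenizer. It assumes well-formed markup with no bare ">" in the text, since the lookahead treats any whitespace followed by a ">" before the next "<" as being inside a tag. It also needs an engine that supports negative lookahead. The name splitOutsideTags is illustrative.

```javascript
// \s(?![^<]*>) matches whitespace that is NOT followed by a ">" before the
// next "<", that is, whitespace outside a tag. <p> is an extra delimiter.
function splitOutsideTags(html) {
  return html.split(/\s(?![^<]*>)|<p>/);
}

var sample = '<font face="arial" size=2>test</font><p>yo this is a test';
```

For the sample string this returns the six entries asked for: '<font face="arial" size=2>test</font>', 'yo', 'this', 'is', 'a', 'test'.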
 
Thomas 'PointedEars' Lahn

Randy said:
Thomas 'PointedEars' Lahn said the following on 12/4/2007 2:38 PM:
SmokeWilliams said:
I need a pattern that will split without replacing. So I need to
split on spaces or carriage returns, but not spaces that are within
html tags. I know there are better ways, but I am using an IFrame in
IE and I work for a government agency which doesn't allow me to use
open source. I am depending on a RegEx wizard out there to supply me
with the pattern.

So I need a pattern that matches any space or carriage return that is
not within an html tag.

<font face="arial" size=2>test</font><p>yo this is a test

Splitting this text should return an array containing:
1: <font face="arial" size=2>test</font>
2: yo
3: this
4: is
5: a
6: test
Suppose you have

var s = '<font face="arial" size=2>test</font><p>yo this is a test';

Either you have a weird idea of "html tag" (HTML is an acronym, BTW),
or (which is more likely) instead you want the resulting array to be

['', 'test', '', 'yo', 'this', 'is', 'a', 'test']

No, that is not what he said.

Are you stupid or what? I *know* that this is not what he said. However, I
don't think he really knows what he wants. Because it does not make sense
for a spell checker in a structural editor to ignore HTML element content.
And therefore, I posted my solution as it is.
Perhaps you should try reading what he wrote

You should read what I wrote, not what you wanted me to have written.

So much for reading.
and the intended results. Your "solution" leaves out the 1. listed
above.

I know.


PointedEars
 
Thomas 'PointedEars' Lahn

Randy said:
Thomas 'PointedEars' Lahn said the following on 12/4/2007 4:25 PM:
Randy said:
Thomas 'PointedEars' Lahn said the following on 12/4/2007 2:38 PM:
SmokeWilliams wrote:
I need a pattern that will split without replacing. So I need to
split on spaces or carriage returns, but not spaces that are within
html tags. [...]

So I need a pattern that matches any space or carriage return that is
not within an html tag.

<font face="arial" size=2>test</font><p>yo this is a test

Splitting this text should return an array containing:
1: <font face="arial" size=2>test</font>
2: yo
3: this
4: is
5: a
6: test
Suppose you have

var s = '<font face="arial" size=2>test</font><p>yo this is a test';

Either you have a weird idea of "html tag" (HTML is an acronym, BTW),
or (which is more likely) instead you want the resulting array to be

['', 'test', '', 'yo', 'this', 'is', 'a', 'test']
No, that is not what he said.
Are you stupid or what?

If imitation is the sincerest form of flattery, you flatter the shit out
of me sometimes.

I have tried your lower level of language this time so that you may better
understand me. Obviously, I am not very good at it. Sorry.
He knows *exactly* what he wants,

No, he does not. He has the idea of a spell checker and the problem that he
can not simply split on whitespace because of whitespace in HTML tags:

But his example says otherwise. So I assumed that what he really wants
is to exclude the tags from consideration which leaves only the plain text
for the spell check. And that my solution allows. It is also a solution
that works with any script engine that supports regular expressions, while
solutions including negative lookahead or non-greedy matching do not.
However, these solutions posted so far have assumed that he wants exactly
the result he has posted; they did not take the practical application, or
rather the lack thereof, of that result into account, and they did not take
into account that he may have posted merely a bad example.

Of course, much of that remains speculation until he clears that up. But I
have explicitly stated in my posting that my solution was _not_ to provide
the result that he posted last. And so your followup to that was
unnecessary and the style in which it was written was completely uncalled
for. If you only had read not only *his* postings, but also *my* posting
*properly*.
he just isn't sure how to implement it.

He is pretty much unsure about anything so far.
Subtle difference my friend.

Don't be familiar with me until you have earned it.


PointedEars
 
Thomas 'PointedEars' Lahn

Randy said:
Thomas 'PointedEars' Lahn said the following on 12/4/2007 7:48 PM: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
He wants a spell checker. He knows what he wants, he just doesn't know
the best way to implement it. And, the "best solution" doesn't involve a
regular expression, just a simple split on the text.

Learn to read.


PointedEars
 
