How to pick out words from a non-delimited string?

J

Julian Hsiao

Hi,

Since Perl excels at string parsing, I thought this would be the right
group to ask my question. If not, I apologize for the confusion.

I want to know if there's an algorithm that picks out the words from a
string that doesn't delimit each words. For example, given
"helloworld" outputs "hello world". Ideally, it should also be
resistant to spelling errors.

So, does such algorithm exist? Better yet, has someone already
implemented it?

Thanks.

Julian Hsiao
(e-mail address removed)
 
E

Evgeny Trifuntov

Hello, Julian!
You wrote on 26 Jul 2003 23:21:52 -0700:

JH> I want to know if there's an algorithm that picks out the words from a
JH> string that doesn't delimit each words. For example, given
JH> "helloworld" outputs "hello world". Ideally, it should also be
JH> resistant to spelling errors.

Well, at least you should make up kind of "dictionary" for your program with
all the words that can be met in your text.


Evgeny Trifuntov
Technion. Israel Institute of Technology
E-mail: (e-mail address removed)
Good Luck, All.
 
J

Julian Hsiao

Tassilo v. Parseval said:
Also sprach Julian Hsiao:


Such an algorithm could only be used as a preprocessors. The result
would still need to be verified by a human. An approach would be to take
a dictionary file and scan the text for any words that are in the
dictionary. If a word has been found, surround it with spaces.

I've thought about that, except that this approach doesn't handle
errors gracefully. In the event that a search fails (most likely due
to spelling errors), this algorithm will either have to halt, or risk
parsing the rest of the string incorrectly.

As an example, if at the first syntax error, a C compiler halts, or
parses the rest of the code incorrectly, it wouldn't be very useful at
all. Of course, the latter still happens sometimes, but at least most
compilers stll parses *some* of the remaining code correctly.

So the algorithm needs to know when to "jump ahead" a few characters
in case of errors.
Another, more general approach, would try to do the splitting based on
morphological rules for a given language.

Can you please elaborate on what you mean? In my case, the string
will always be in English.
But take the word "deallocate": the above dictionary approach could turn
this into "deal locate".

I don't think this is a problem, since I will be using the algorithm
only as a preprocessor, as you've suggested.

Thanks.

Julian Hsiao
(e-mail address removed)
 
P

Purl Gurl

Julian said:
(snipped)
I don't think this is a problem, since I will be using the algorithm
only as a preprocessor, as you've suggested.


Parsing "deallocate" via dictionary matching would return
several dozen valid words. I can think of six or seven words
spelled "d" right off, such as the musical note d, a letter
grade d, the number 500, d as a noun along with others.

d de dea deal al all ll allocate lo loca oc oca ca cat cate at ate

Many of those represent multiple words.

This strikes me as a rather noticeable problem.


Purl Gurl
 
J

Julian Hsiao

Purl Gurl said:
Parsing "deallocate" via dictionary matching would return
several dozen valid words. I can think of six or seven words
spelled "d" right off, such as the musical note d, a letter
grade d, the number 500, d as a noun along with others.

d de dea deal al all ll allocate lo loca oc oca ca cat cate at ate

Many of those represent multiple words.

This strikes me as a rather noticeable problem.

Hmm...I guess you're right. It's probably easier for me to avoid
having to parse such strings, than trying to come up with an algorithm
to handle them. That's what I'll do instead.

Thanks.

Julian Hsiao
(e-mail address removed)
 
T

Tassilo v. Parseval

Also sprach Julian Hsiao:
Can you please elaborate on what you mean? In my case, the string
will always be in English.

I am not familiar enough with English morphology. However, the idea is
that every language has certain rules how words are built. Certain
combinations of letters never show up in a language (or, if they do,
they must belong to two adjoining syllables; in German for instance
there can't be a 'k' directly following a 'Sch' within a syllable
because the result 'Schk' is no valid German phonem which means it would
be unpronounceable, therefore the word 'Tischkante' has three syllables:
Tisch-kan-te). The word-splitting algorithms in text-processors work
that way.

There are mainly two problems when trying to do it that way: First of
all, you need to have all the rules (they are quite complex for English
as they are for most other languages) and secondly, these rules are only
valid within one word because languages tend to allow any sequence of
words (whereas they do not necessarily allow any sequence of letters
within one word).

So better go for a dictionary-approach. This itself will be hard enough.
I just did a few local tests with a text from the Gutenberg projects and
an English dictionary file. Whichever implementation you choose, the
result will probably be dog-slow and get it wrong the first couple of
attempts. Even when sorting out the brain-dead attempts, the list of
words it spits out is highly suspect.

Tassilo
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,043
Latest member
CannalabsCBDReview

Latest Threads

Top