How to pick out words from a non-delimited string?

Julian Hsiao · Jul 27, 2003

Hi,

Since Perl excels at string parsing, I thought this would be the right
group to ask my question. If not, I apologize for the confusion.

I want to know if there's an algorithm that picks out the words from a
string that doesn't delimit each words. For example, given
"helloworld" outputs "hello world". Ideally, it should also be
resistant to spelling errors.

So, does such algorithm exist? Better yet, has someone already
implemented it?

Thanks.

Julian Hsiao
(e-mail address removed)

Gregory Toomey · Jul 27, 2003

Does "therapist" become "the rapist" ???

gtoomey

Evgeny Trifuntov · Jul 27, 2003

Hello, Julian!
You wrote on 26 Jul 2003 23:21:52 -0700:

JH> I want to know if there's an algorithm that picks out the words from a
JH> string that doesn't delimit each words. For example, given
JH> "helloworld" outputs "hello world". Ideally, it should also be
JH> resistant to spelling errors.

Well, at least you should make up kind of "dictionary" for your program with
all the words that can be met in your text.

Evgeny Trifuntov
Technion. Israel Institute of Technology
E-mail: (e-mail address removed)
Good Luck, All.

Julian Hsiao · Jul 28, 2003

Tassilo v. Parseval said:
Also sprach Julian Hsiao:

Such an algorithm could only be used as a preprocessors. The result
would still need to be verified by a human. An approach would be to take
a dictionary file and scan the text for any words that are in the
dictionary. If a word has been found, surround it with spaces.

I've thought about that, except that this approach doesn't handle
errors gracefully. In the event that a search fails (most likely due
to spelling errors), this algorithm will either have to halt, or risk
parsing the rest of the string incorrectly.

As an example, if at the first syntax error, a C compiler halts, or
parses the rest of the code incorrectly, it wouldn't be very useful at
all. Of course, the latter still happens sometimes, but at least most
compilers stll parses *some* of the remaining code correctly.

So the algorithm needs to know when to "jump ahead" a few characters
in case of errors.

Another, more general approach, would try to do the splitting based on
morphological rules for a given language.

Can you please elaborate on what you mean? In my case, the string
will always be in English.

But take the word "deallocate": the above dictionary approach could turn
this into "deal locate".

I don't think this is a problem, since I will be using the algorithm
only as a preprocessor, as you've suggested.

Thanks.

Julian Hsiao
(e-mail address removed)

Purl Gurl · Jul 29, 2003

Julian said:
(snipped)

I don't think this is a problem, since I will be using the algorithm
only as a preprocessor, as you've suggested.

Parsing "deallocate" via dictionary matching would return
several dozen valid words. I can think of six or seven words
spelled "d" right off, such as the musical note d, a letter
grade d, the number 500, d as a noun along with others.

d de dea deal al all ll allocate lo loca oc oca ca cat cate at ate

Many of those represent multiple words.

This strikes me as a rather noticeable problem.

Purl Gurl

Julian Hsiao · Jul 29, 2003

Purl Gurl said:
Parsing "deallocate" via dictionary matching would return
several dozen valid words. I can think of six or seven words
spelled "d" right off, such as the musical note d, a letter
grade d, the number 500, d as a noun along with others.

d de dea deal al all ll allocate lo loca oc oca ca cat cate at ate

Many of those represent multiple words.

This strikes me as a rather noticeable problem.

Hmm...I guess you're right. It's probably easier for me to avoid
having to parse such strings, than trying to come up with an algorithm
to handle them. That's what I'll do instead.

Thanks.

Julian Hsiao
(e-mail address removed)

Tassilo v. Parseval · Jul 29, 2003

Also sprach Julian Hsiao:

Can you please elaborate on what you mean? In my case, the string
will always be in English.

I am not familiar enough with English morphology. However, the idea is
that every language has certain rules how words are built. Certain
combinations of letters never show up in a language (or, if they do,
they must belong to two adjoining syllables; in German for instance
there can't be a 'k' directly following a 'Sch' within a syllable
because the result 'Schk' is no valid German phonem which means it would
be unpronounceable, therefore the word 'Tischkante' has three syllables:
Tisch-kan-te). The word-splitting algorithms in text-processors work
that way.

There are mainly two problems when trying to do it that way: First of
all, you need to have all the rules (they are quite complex for English
as they are for most other languages) and secondly, these rules are only
valid within one word because languages tend to allow any sequence of
words (whereas they do not necessarily allow any sequence of letters
within one word).

So better go for a dictionary-approach. This itself will be hard enough.
I just did a few local tests with a text from the Gutenberg projects and
an English dictionary file. Whichever implementation you choose, the
result will probably be dog-slow and get it wrong the first couple of
attempts. Even when sorting out the brain-dead attempts, the list of
words it spits out is highly suspect.

Tassilo

Finding out the String-based Exceptions for a Module?	2	Dec 12, 2006
How to time out a read from socket	2	Oct 1, 2004
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	15	Apr 1, 2006
No-syntax Web-programming-IDE (was: Does turtle graphics have the wrong associations?)	0	Nov 22, 2009
The devolution of English language and slothful c.l.p behaviors exposed!	50	Jan 24, 2012
word_set = set() def should_preceed_with_an(phrase): first_word =	1	Jan 26, 2013
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Mar 1, 2008
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Apr 1, 2008

How to pick out words from a non-delimited string?

Julian Hsiao

Gregory Toomey

Evgeny Trifuntov

Julian Hsiao

Purl Gurl

Julian Hsiao

Tassilo v. Parseval

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads