How to pick out words from a non-delimited string?

Discussion in 'Perl Misc' started by Julian Hsiao, Jul 27, 2003.

  1. Julian Hsiao

    Julian Hsiao Guest

    Hi,

    Since Perl excels at string parsing, I thought this would be the right
    group to ask my question. If not, I apologize for the confusion.

    I want to know if there's an algorithm that picks out the words from a
    string that doesn't delimit each words. For example, given
    "helloworld" outputs "hello world". Ideally, it should also be
    resistant to spelling errors.

    So, does such algorithm exist? Better yet, has someone already
    implemented it?

    Thanks.

    Julian Hsiao
     
    Julian Hsiao, Jul 27, 2003
    #1
    1. Advertising

  2. Does "therapist" become "the rapist" ???

    gtoomey
     
    Gregory Toomey, Jul 27, 2003
    #2
    1. Advertising

  3. Hello, Julian!
    You wrote on 26 Jul 2003 23:21:52 -0700:

    JH> I want to know if there's an algorithm that picks out the words from a
    JH> string that doesn't delimit each words. For example, given
    JH> "helloworld" outputs "hello world". Ideally, it should also be
    JH> resistant to spelling errors.

    Well, at least you should make up kind of "dictionary" for your program with
    all the words that can be met in your text.


    Evgeny Trifuntov
    Technion. Israel Institute of Technology
    E-mail:
    Good Luck, All.
     
    Evgeny Trifuntov, Jul 27, 2003
    #3
  4. Julian Hsiao

    Julian Hsiao Guest

    "Tassilo v. Parseval" <> wrote:

    > Also sprach Julian Hsiao:
    >
    >> I want to know if there's an algorithm that picks out the words

    from a
    >> string that doesn't delimit each words. For example, given
    >> "helloworld" outputs "hello world". Ideally, it should also be
    >> resistant to spelling errors.

    >
    > Such an algorithm could only be used as a preprocessors. The result
    > would still need to be verified by a human. An approach would be to take
    > a dictionary file and scan the text for any words that are in the
    > dictionary. If a word has been found, surround it with spaces.


    I've thought about that, except that this approach doesn't handle
    errors gracefully. In the event that a search fails (most likely due
    to spelling errors), this algorithm will either have to halt, or risk
    parsing the rest of the string incorrectly.

    As an example, if at the first syntax error, a C compiler halts, or
    parses the rest of the code incorrectly, it wouldn't be very useful at
    all. Of course, the latter still happens sometimes, but at least most
    compilers stll parses *some* of the remaining code correctly.

    So the algorithm needs to know when to "jump ahead" a few characters
    in case of errors.

    > Another, more general approach, would try to do the splitting based on
    > morphological rules for a given language.


    Can you please elaborate on what you mean? In my case, the string
    will always be in English.

    > But take the word "deallocate": the above dictionary approach could turn
    > this into "deal locate".


    I don't think this is a problem, since I will be using the algorithm
    only as a preprocessor, as you've suggested.

    Thanks.

    Julian Hsiao
     
    Julian Hsiao, Jul 28, 2003
    #4
  5. Julian Hsiao

    Purl Gurl Guest

    Julian Hsiao wrote:

    > Tassilo v. Parseval wrote:
    > > Julian Hsiao wrote:


    (snipped)

    > >> I want to know if there's an algorithm that picks out the words
    > >> from a string that doesn't delimit each words. For example, given
    > >> "helloworld" outputs "hello world". Ideally, it should also be
    > >> resistant to spelling errors.


    > > But take the word "deallocate": the above dictionary approach could turn
    > > this into "deal locate".


    > I don't think this is a problem, since I will be using the algorithm
    > only as a preprocessor, as you've suggested.



    Parsing "deallocate" via dictionary matching would return
    several dozen valid words. I can think of six or seven words
    spelled "d" right off, such as the musical note d, a letter
    grade d, the number 500, d as a noun along with others.

    d de dea deal al all ll allocate lo loca oc oca ca cat cate at ate

    Many of those represent multiple words.

    This strikes me as a rather noticeable problem.


    Purl Gurl
     
    Purl Gurl, Jul 29, 2003
    #5
  6. Julian Hsiao

    Julian Hsiao Guest

    Purl Gurl <> wrote:

    > Parsing "deallocate" via dictionary matching would return
    > several dozen valid words. I can think of six or seven words
    > spelled "d" right off, such as the musical note d, a letter
    > grade d, the number 500, d as a noun along with others.
    >
    > d de dea deal al all ll allocate lo loca oc oca ca cat cate at ate
    >
    > Many of those represent multiple words.
    >
    > This strikes me as a rather noticeable problem.


    Hmm...I guess you're right. It's probably easier for me to avoid
    having to parse such strings, than trying to come up with an algorithm
    to handle them. That's what I'll do instead.

    Thanks.

    Julian Hsiao
     
    Julian Hsiao, Jul 29, 2003
    #6
  7. Also sprach Julian Hsiao:

    > "Tassilo v. Parseval" <> wrote:


    >> Another, more general approach, would try to do the splitting based on
    >> morphological rules for a given language.

    >
    > Can you please elaborate on what you mean? In my case, the string
    > will always be in English.


    I am not familiar enough with English morphology. However, the idea is
    that every language has certain rules how words are built. Certain
    combinations of letters never show up in a language (or, if they do,
    they must belong to two adjoining syllables; in German for instance
    there can't be a 'k' directly following a 'Sch' within a syllable
    because the result 'Schk' is no valid German phonem which means it would
    be unpronounceable, therefore the word 'Tischkante' has three syllables:
    Tisch-kan-te). The word-splitting algorithms in text-processors work
    that way.

    There are mainly two problems when trying to do it that way: First of
    all, you need to have all the rules (they are quite complex for English
    as they are for most other languages) and secondly, these rules are only
    valid within one word because languages tend to allow any sequence of
    words (whereas they do not necessarily allow any sequence of letters
    within one word).

    So better go for a dictionary-approach. This itself will be hard enough.
    I just did a few local tests with a text from the Gutenberg projects and
    an English dictionary file. Whichever implementation you choose, the
    result will probably be dog-slow and get it wrong the first couple of
    attempts. Even when sorting out the brain-dead attempts, the list of
    words it spits out is highly suspect.

    Tassilo
    --
    $_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
    pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
    $_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
     
    Tassilo v. Parseval, Jul 29, 2003
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Peter Strøiman
    Replies:
    1
    Views:
    2,090
    Peter Strøiman
    Aug 23, 2005
  2. tom c
    Replies:
    5
    Views:
    399
    tom c
    Nov 1, 2006
  3. RyanL
    Replies:
    6
    Views:
    690
    Paul McGuire
    Aug 28, 2007
  4. BerlinBrown
    Replies:
    6
    Views:
    4,502
  5. pantagruel
    Replies:
    8
    Views:
    449
    Dr John Stockton
    Jul 22, 2006
Loading...

Share This Page