Parser Generator?

S

Steven Bethard

Paul said:
In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.

The closest analog would be to ask pyparsing to find the words in the
following sentence:

Thepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthe­grammardirectlyinPythoncode.

Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.

You mean like this?

from pyparsing import *

knownWords = ['of', 'grammar', 'construct', 'classes', 'a',
'client', 'pyparsing', 'directly', 'the', 'module', 'uses',
'that', 'in', 'python', 'library', 'provides', 'code', 'to']

knownWord = oneOf( knownWords, caseless=True )
sentence = OneOrMore( knownWord ) + "."

mush =
"ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode."

print sentence.parseString( mush )

prints:

['the', 'pyparsing', 'module', 'provides', 'a', 'library', 'of',
'classes', 'that', 'client', 'code', 'uses', 'to', 'construct',
'the', 'grammar', 'directly', 'in', 'python', 'code', '.']

In fact, this is almost the exact scheme used by Zhpy for extracting
Chinese versions of Python keywords, and mapping them back to English/
Latin words. Of course, this is not practical for natural language
processing, as the vocabulary gets too large. And you can get
ambiguous matches, such as a vocabulary containing the words ['in',
'to', 'into'] - the runtogether "into" will always be assumed to be
"into", and never "in to".

Yep, and these kinds of things occur quite frequently with Chinese and
Japanese. The point was not that pyparsing couldn't do it for a small
subset of characters/words, but that pyparsing is probably not the right
solution for general purpose Japanese/Chinese tokenization.

Steve
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,780
Messages
2,569,611
Members
45,279
Latest member
LaRoseDermaBottle

Latest Threads

Top