Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string "y=a*x**2+b*x+c":
['y', '=', 'a', '*', 'x', '**', '2', '+', 'b', '*', 'x', '+', 'c']
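A grammar along these lines would do it (just a rough sketch; the
breakdown into character classes is one of several possible):

from pyparsing import Word, OneOrMore, alphas, nums, oneOf

# identifiers and numbers are runs of a single character class;
# oneOf tests "**" before "*" so the longer operator is not masked
ident = Word(alphas)
number = Word(nums)
operator = oneOf("** * + - =")
tokenizer = OneOrMore(ident | number | operator)

print(tokenizer.parseString("y=a*x**2+b*x+c").asList())

prints:
['y', '=', 'a', '*', 'x', '**', '2', '+', 'b', '*', 'x', '+', 'c']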
The difference is that in the expression above (and in many other
tokenization problems) you can determine "word" boundaries by looking at
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.
In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.
The closest analog would be to ask pyparsing to find the words in the
following sentence:
ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.
Most approaches that have been even marginally successful on these kinds
of tasks rely on statistical machine learning.
STeVe
Steve -
You mean like this?
from pyparsing import *

# vocabulary of every word that can appear in the run-together sentence
knownWords = ['of', 'grammar', 'construct', 'classes', 'a',
              'client', 'pyparsing', 'directly', 'the', 'module', 'uses',
              'that', 'in', 'python', 'library', 'provides', 'code', 'to']

# oneOf matches any word in the list, ignoring case
knownWord = oneOf(knownWords, caseless=True)
sentence = OneOrMore(knownWord) + "."

mush = "ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode."

print(sentence.parseString(mush))
prints:
['the', 'pyparsing', 'module', 'provides', 'a', 'library', 'of',
'classes', 'that', 'client', 'code', 'uses', 'to', 'construct',
'the', 'grammar', 'directly', 'in', 'python', 'code', '.']
In fact, this is almost the exact scheme used by Zhpy for extracting
Chinese versions of Python keywords, and mapping them back to English/
Latin words. Of course, this is not practical for natural language
processing, as the vocabulary gets too large. And you can get
ambiguous matches, such as a vocabulary containing the words ['in',
'to', 'into'] - the run-together "into" will always be assumed to be
"into", and never "in to". Fortunately (for pyparsing), your example
was sufficiently friendly as to avoid ambiguities. But if you can
select a suitable vocabulary, even a run-on mush is parseable.
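To illustrate the kind of ambiguity I mean (a toy sketch, not code
from Zhpy or anything real):

from pyparsing import OneOrMore, oneOf

# with 'in', 'to', and 'into' all in the vocabulary, the longest
# alternative wins at each position and the parser never backs up,
# so a run-together "into" can never be read as "in" + "to"
vocab = oneOf(['in', 'to', 'into'])
print(OneOrMore(vocab).parseString("into"))

prints:
['into']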
-- Paul