Parser Generator?


Jack

Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.
 

Diez B. Roggisch

Jack said:
Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.

There are several options. I personally like spark.py, the most common
answer is pyparsing, and don't forget to check out NLTK, the natural
language toolkit.

Diez
 

beginner

Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.

Antlr seems to be able to generate Python code, too.
 

Tommy Nordgren

Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.


Antlr can generate Python code.
However, I don't think a parser generator is suitable for generating
natural language parsers.
They are intended to generate code for computer language parsers.
However, for examples of parsing imperative English sentences, I
suggest taking a look
at the class library for TADS 3 (Text Adventure Development System)
<http://www.tads.org>
The language has a syntax reminiscent of C++ and Java.
-----------------------------------------------------
An astronomer to a colleague:
-I can't understand how you can go to the brothel as often as you
do. Not only is it a filthy habit, but it must cost a lot of money too.
-That's no problem. I've got a big government grant for the study of
black holes.
Tommy Nordgren
 

samwyse

Jack said:
Thanks for all the replies!

SPARK looks promising. Its doc doesn't say if it handles Unicode
(CJK in particular) encoding though.

Yapps also looks powerful: http://theory.stanford.edu/~amitp/yapps/

There's also PyGgy http://lava.net/~newsham/pyggy/

I may also give Antlr a try.

If anyone has experiences using any of the parser generators with CJK
languages, I'd be very interested in hearing that.

I'm going to echo Tommy's reply. If you want to parse natural language,
conventional parsers are going to be worse than useless (because you'll
keep thinking, "Just one more tweak and this time it'll work for
sure!"). Instead, go look at what the interactive fiction community
uses. They analyse the statement in multiple passes, first picking out
the verbs, then the noun phrases. Some of their parsers can do
on-the-fly domain-specific spelling correction, etc, and all of them can
ask the user for clarification. (I'm currently cobbling together
something similar for pre-teen users.)
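
A toy sketch of that multi-pass approach (the verb list and the
clarification prompt here are invented for illustration):

import re

# Hypothetical two-pass command parser: pass 1 picks out a known verb,
# pass 2 treats whatever remains as the noun phrase; if no verb is
# found, ask the user for clarification instead of guessing.
KNOWN_VERBS = {"get", "open", "look", "go", "take"}

def parse_command(text):
    words = re.findall(r"\w+", text.lower())
    verbs = [w for w in words if w in KNOWN_VERBS]
    if not verbs:
        return None, "Sorry, what do you want to do?"
    verb = verbs[0]
    noun_phrase = " ".join(w for w in words if w != verb)
    return verb, noun_phrase

print(parse_command("open the door"))   # ('open', 'the door')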
 

Jack

Thanks for the suggestion. I understand that more work is needed for natural
language understanding. What I want to do is actually very simple - I
pre-screen the user-typed text. If it's a simple syntax my code understands,
like "Weather in London", I'll redirect it to a weather site. Or, if it's
"What is ...", I'll probably redirect it to Wikipedia. Otherwise, I'll throw
it to a search engine. So, extremely simple stuff ...
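
Probably nothing fancier than a couple of anchored patterns, something
like this sketch (the patterns and routing targets are placeholders):

import re

# Hypothetical pre-screen: match a couple of known query shapes and
# fall back to a search engine for everything else.
RULES = [
    (re.compile(r"^weather (?:of|in) (.+)$", re.I), "weather"),
    (re.compile(r"^what is (.+)$", re.I), "wikipedia"),
]

def route(query):
    for pattern, target in RULES:
        m = pattern.match(query.strip())
        if m:
            return target, m.group(1)
    return "search", query

print(route("Weather in London"))   # ('weather', 'London')
print(route("What is the time"))    # ('wikipedia', 'the time')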
 

Alex Martelli

Jack said:
Thanks for the suggestion. I understand that more work is needed for natural
language understanding. What I want to do is actually very simple - I
pre-screen the user-typed text. If it's a simple syntax my code understands,
like "Weather in London", I'll redirect it to a weather site. Or, if it's
"What is ...", I'll probably redirect it to Wikipedia. Otherwise, I'll throw
it to a search engine. So, extremely simple stuff ...

<http://nltk.sourceforge.net/index.php/Main_Page>

"""
NLTK — the Natural Language Toolkit — is a suite of open source Python
modules, data sets and tutorials supporting research and development in
natural language processing.
"""


Alex
 

Jason Evans

Hi all, I need to do syntax parsing of simple natural languages,
for example, "weather of London" or "what is the time", simple
things like these, with Unicode support in the syntax.

In Java, there are JavaCC, Antlr, etc. I wonder what people use
in Python? Antlr also has Python support but I'm not sure how good
it is. Comments/hints are welcome.

I use Parsing.py. I like it a lot, probably because I wrote it.

http://www.canonware.com/Parsing/

Jason
 

Jack

Thanks Jason. Does Parsing.py support Unicode characters (especially CJK)?
I'll take a look.
 

Paul McGuire

Thanks for all the replies!

SPARK looks promising. Its doc doesn't say if it handles Unicode
(CJK in particular) encoding though.

Yapps also looks powerful: http://theory.stanford.edu/~amitp/yapps/

There's also PyGgy http://lava.net/~newsham/pyggy/

I may also give Antlr a try.

If anyone has experiences using any of the parser generators with CJK
languages, I'd be very interested in hearing that.

Jack

Jack -

Pyparsing was already mentioned once on this thread. Here is an
application built on pyparsing that parses Python code written with
Chinese keywords and converts it to standard English-keyword Python.

http://pypi.python.org/pypi/zhpy/0.5

-- Paul
 

Jason Evans

Thanks Jason. Does Parsing.py support Unicode characters (especially CJK)?
I'll take a look.

Parsers typically deal with tokens rather than individual characters,
so the scanner that creates the tokens is the main thing that Unicode
matters to. I have written Unicode-aware scanners for use with
Parsing-based parsers, with no problems. This is pretty easy to do,
since Python has built-in support for Unicode strings.
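
A minimal sketch of the idea (illustrative only, not one of the actual
scanners mentioned above):

import re

def scan(text):
    # \w+ grabs runs of word characters - with Unicode strings this
    # includes CJK characters - and \S catches any remaining
    # non-space symbol as its own token
    return re.findall(r"\w+|\S", text, re.UNICODE)

print(scan("weather of 倫敦"))   # ['weather', 'of', '倫敦']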

Jason
 

Jack

Thanks Jason. There seem to be a few options that I can pursue. Having a
hard time choosing one now :)
 

Ryan Ginstrom

On Behalf Of Jason Evans
Parsers typically deal with tokens rather than individual
characters, so the scanner that creates the tokens is the
main thing that Unicode matters to. I have written
Unicode-aware scanners for use with Parsing-based parsers,
with no problems. This is pretty easy to do, since Python
has built-in support for Unicode strings.

The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.

Regards,
Ryan Ginstrom
 

Paul McGuire

The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.

Regards,
Ryan Ginstrom

Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string:

y=a*x**2+b*x+c

as

['y','=','a','*','x','**','2','+','b','*','x','+','c']

even though there is not a single delimiting space. But pyparsing
will also render this as a nested parse tree, reflecting the
precedence of operations:

['y', '=', [['a', '*', ['x', '**', 2]], '+', ['b', '*', 'x'], '+', 'c']]

and will allow you to access individual tokens by field name:
- lhs: y
- rhs: [['a', '*', ['x', '**', 2]], '+', ['b', '*', 'x'], '+', 'c']
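
Here is a rough sketch of a grammar that produces that kind of result
(written with infixNotation, the modern name for what older pyparsing
releases call operatorPrecedence):

from pyparsing import Word, alphas, nums, oneOf, infixNotation, opAssoc

ident = Word(alphas)
number = Word(nums)
operand = ident | number

# precedence levels, tightest-binding first
expr = infixNotation(operand, [
    ("**", 2, opAssoc.RIGHT),
    (oneOf("* /"), 2, opAssoc.LEFT),
    (oneOf("+ -"), 2, opAssoc.LEFT),
])
assignment = ident("lhs") + "=" + expr("rhs")

result = assignment.parseString("y=a*x**2+b*x+c")
print(result.lhs)   # y
print(result.rhs)   # nested groups reflecting operator precedence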

Please feel free to look through the posted examples on the pyparsing
wiki at http://pyparsing.wikispaces.com/Examples, or some of the
applications currently using pyparsing at http://pyparsing.wikispaces.com/WhosUsingPyparsing,
and you might get a better feel for what kind of tasks pyparsing is
capable of.

-- Paul
 

Steven Bethard

Paul said:
The only caveat being that since Chinese and Japanese scripts don't
typically delimit "words" with spaces, I think you'd have to pass the text
through a tokenizer (like ChaSen for Japanese) before using PyParsing.

Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string:

y=a*x**2+b*x+c

as

['y','=','a','*','x','**','2','+','b','*','x','+','c']

The difference is that in the expression above (and in many other
tokenization problems) you can determine "word" boundaries by looking at
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.
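
You can see this with a regular expression that does nothing but
alternate between two character classes:

import re

# token boundaries fall wherever the character class flips, so
# alternating runs of alphanumerics and non-alphanumerics recover
# every token in the expression
text = "y=a*x**2+b*x+c"
print(re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text))
# ['y', '=', 'a', '*', 'x', '**', '2', '+', 'b', '*', 'x', '+', 'c']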

In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.

The closest analog would be to ask pyparsing to find the words in the
following sentence:

ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.

Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.

STeVe
 

Ryan Ginstrom

On Behalf Of Paul McGuire

Did you think pyparsing is so mundane as to require spaces
between tokens? Pyparsing has been doing this type of
token-recognition since Day 1.

Cool! I stand happily corrected. I did write "I think" because although I
couldn't find a way to do it, there might well actually be one <g>. I'll
keep looking to find some examples of parsing Japanese.

BTW, I think PyParsing is great, and I use it for several tasks. I just
could never figure out a way to use it with Japanese (at least on the
applications I had in mind).

Regards,
Ryan Ginstrom
 

Paul McGuire

Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string:
y=a*x**2+b*x+c

['y','=','a','*','x','**','2','+','b','*','x','+','c']

The difference is that in the expression above (and in many other
tokenization problems) you can determine "word" boundaries by looking at
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.

In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.

The closest analog would be to ask pyparsing to find the words in the
following sentence:

ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.

Most approaches that have been even marginally successful on these kinds
of tasks have used statistical machine learning approaches.

STeVe

Steve -

You mean like this?

from pyparsing import *

knownWords = ['of', 'grammar', 'construct', 'classes', 'a',
'client', 'pyparsing', 'directly', 'the', 'module', 'uses',
'that', 'in', 'python', 'library', 'provides', 'code', 'to']

knownWord = oneOf( knownWords, caseless=True )
sentence = OneOrMore( knownWord ) + "."

mush = "ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode."

print(sentence.parseString(mush))

prints:

['the', 'pyparsing', 'module', 'provides', 'a', 'library', 'of',
'classes', 'that', 'client', 'code', 'uses', 'to', 'construct',
'the', 'grammar', 'directly', 'in', 'python', 'code', '.']

In fact, this is almost the exact scheme used by Zhpy for extracting
Chinese versions of Python keywords, and mapping them back to English/
Latin words. Of course, this is not practical for natural language
processing, as the vocabulary gets too large. And you can get
ambiguous matches, such as a vocabulary containing the words ['in',
'to', 'into'] - the run-together "into" will always be assumed to be
"into", and never "in to". Fortunately (for pyparsing), your example
was sufficiently friendly as to avoid ambiguities. But if you can
select a suitable vocabulary, even a run-on mush is parseable.
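
A quick demonstration of that ambiguity:

from pyparsing import OneOrMore, oneOf

vocab = oneOf(['in', 'to', 'into'], caseless=True)
# oneOf tries longer alternatives first, so the run-together string
# always matches as the single word "into", never as "in" plus "to"
print(OneOrMore(vocab).parseString("into"))   # ['into']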

-- Paul
 
