Did you think pyparsing is so mundane as to require spaces between
tokens? Pyparsing has been doing this type of token-recognition since
Day 1. Looking for tokens without delimiting spaces was one of the
first applications for pyparsing. This issue is not unique to Chinese
or Japanese text. Pyparsing will easily find the tokens in this
string "y=a*x**2+b*x+c":
['y', '=', 'a', '*', 'x', '**', '2', '+', 'b', '*', 'x', '+', 'c']
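A grammar along these lines would do it (just a rough sketch; the
breakdown into character classes is one of several possible):

from pyparsing import Word, OneOrMore, alphas, nums, oneOf

# identifiers and numbers are runs of a single character class;
# oneOf tests "**" before "*" so the longer operator is not masked
ident = Word(alphas)
number = Word(nums)
operator = oneOf("** * + - =")
tokenizer = OneOrMore(ident | number | operator)

print(tokenizer.parseString("y=a*x**2+b*x+c").asList())

prints:
['y', '=', 'a', '*', 'x', '**', '2', '+', 'b', '*', 'x', '+', 'c']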
The difference is that in the expression above (and in many other
tokenization problems) you can determine "word" boundaries by looking at
the class of character, e.g. alphanumeric vs. punctuation vs. whatever.
In Japanese and Chinese tokenization, word boundaries are not marked by
different classes of characters. They only exist in the mind of the
reader who knows which sequences of characters could be words given the
context, and which sequences of characters couldn't.
The closest analog would be to ask pyparsing to find the words in the
following sentence:
ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode.
Most approaches that have been even marginally successful on these kinds
of tasks rely on statistical machine learning.
STeVe
Steve -
You mean like this?
from pyparsing import *

# vocabulary of every word that can appear in the run-together sentence
knownWords = ['of', 'grammar', 'construct', 'classes', 'a',
              'client', 'pyparsing', 'directly', 'the', 'module', 'uses',
              'that', 'in', 'python', 'library', 'provides', 'code', 'to']

# oneOf matches any word in the list, ignoring case
knownWord = oneOf(knownWords, caseless=True)
sentence = OneOrMore(knownWord) + "."

mush = "ThepyparsingmoduleprovidesalibraryofclassesthatclientcodeusestoconstructthegrammardirectlyinPythoncode."

print(sentence.parseString(mush))
prints:
['the', 'pyparsing', 'module', 'provides', 'a', 'library', 'of',
'classes', 'that', 'client', 'code', 'uses', 'to', 'construct',
'the', 'grammar', 'directly', 'in', 'python', 'code', '.']
In fact, this is almost the exact scheme used by Zhpy for extracting
Chinese versions of Python keywords, and mapping them back to English/
Latin words. Of course, this is not practical for natural language
processing, as the vocabulary gets too large. And you can get
ambiguous matches, such as a vocabulary containing the words ['in',
'to', 'into'] - the run-together "into" will always be assumed to be
"into", and never "in to". Fortunately (for pyparsing), your example
was sufficiently friendly as to avoid ambiguities. But if you can
select a suitable vocabulary, even a run-on mush is parseable.
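To illustrate the kind of ambiguity I mean (a toy sketch, not code
from Zhpy or anything real):

from pyparsing import OneOrMore, oneOf

# with 'in', 'to', and 'into' all in the vocabulary, the longest
# alternative wins at each position and the parser never backs up,
# so a run-together "into" can never be read as "in" + "to"
vocab = oneOf(['in', 'to', 'into'])
print(OneOrMore(vocab).parseString("into"))

prints:
['into']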
-- Paul