Comparison of parsers in Python?

Nobody

I did a Google search and found various parsers in Python that can be
used to parse different kinds of files in various situations. I don't
see a page that summarizes and compares all the available parsers in
Python, from simple and easy-to-use ones to complex and powerful ones.

I am wondering if somebody could list all the available parsers and
compare them.

I have a similar question.

What I want: a tokeniser generator which can take a lex-style grammar (not
necessarily lex syntax, but a set of token specifications defined by
REs, BNF, or whatever), generate a DFA, then run the DFA on sequences of
bytes. It must allow the syntax to be defined at run-time.

What I don't want: anything written by someone who doesn't understand the
field (i.e. anything which doesn't use a DFA).
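
To make the "run the DFA on sequences of bytes" part concrete, here is a
minimal, hand-written sketch of a table-driven DFA scanner in plain
Python 3 (the transition table below is built by hand; in the tool being
asked for it would be generated at run time from RE/BNF token
specifications):

def run_dfa(table, accepting, data):
    # longest-match scan of the bytes object `data` from position 0;
    # returns the end of the longest accepted prefix, or -1 if none.
    # (iterating a bytes object yields ints in Python 3.)
    state, last_accept = 0, -1
    for i, byte in enumerate(data):
        state = table.get((state, byte))
        if state is None:
            break
        if state in accepting:
            last_accept = i + 1
    return last_accept

# hand-built DFA for [0-9]+ : state 0 -> 1 on a digit, 1 -> 1 on a digit
digits = {(0, b): 1 for b in range(ord('0'), ord('9') + 1)}
digits.update({(1, b): 1 for b in range(ord('0'), ord('9') + 1)})

print(run_dfa(digits, {1}, b"123abc"))   # 3, i.e. b"123" matched
print(run_dfa(digits, {1}, b"abc"))      # -1, no match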
 

greg

Nobody said:
What I want: a tokeniser generator which can take a lex-style grammar (not
necessarily lex syntax, but a set of token specifications defined by
REs, BNF, or whatever), generate a DFA, then run the DFA on sequences of
bytes. It must allow the syntax to be defined at run-time.

You might find my Plex package useful:

http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

It was written some time ago, so it doesn't know about
the new bytes type yet, but it shouldn't be hard to
adapt it for that if you need to.

Nobody said:
What I don't want: anything written by someone who doesn't understand the
field (i.e. anything which doesn't use a DFA).

Plex uses a DFA.
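
For reference, a minimal lexer in the style of the Plex tutorial looks
roughly like this (a sketch only: the names should match Plex's documented
interface, but check them against the version you download, and remember
the Python 2 heritage mentioned above):

from io import StringIO
from Plex import *

# token specifications are ordinary Python objects, so they can be
# built at run time
letter = Range("AZaz")
digit  = Range("09")
name   = letter + Rep(letter | digit)
number = Rep1(digit)
space  = Any(" \t\n")

lexicon = Lexicon([
    (name,   'ident'),
    (number, 'int'),
    (space,  IGNORE),    # skip whitespace
])

scanner = Scanner(lexicon, StringIO(u"foo 42 bar"), "example")
while True:
    value, text = scanner.read()
    if value is None:    # end of input
        break
    print(value, text)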
 

andrew cooke

Nobody said:
I have a similar question.

What I want: a tokeniser generator which can take a lex-style grammar (not
necessarily lex syntax, but a set of token specifications defined by
REs, BNF, or whatever), generate a DFA, then run the DFA on sequences of
bytes. It must allow the syntax to be defined at run-time.

What I don't want: anything written by someone who doesn't understand the
field (i.e. anything which doesn't use a DFA).

lepl will do this, but it's integrated with the rest of the parser
(which is recursive descent).

for example:

float = Token(Float())
word = Token(Word(Lower()))
punctuation = ~Token(r'[\.,]')

line = (float | word)[:, punctuation]
parser = line.string_parser()

will generate a lexer with three tokens. here two are specified using
lepl's matchers and one using a regexp, but in all three cases they
are converted to dfas internally.

then a parser is generated that will match a sequence of floats and
words, separated by punctuation. spaces are discarded by the lexer by
default, but that can be changed through the configuration (which
would be passed to the string_parser method).
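
for illustration, calling the parser above looks something like this
(hypothetical input and output - the exact shape of the result depends
on the configuration):

results = parser('1.5 apples, 2.25 pears.')
print(results)
# something like ['1.5', 'apples', '2.25', 'pears'] - the punctuation
# and spaces have already been dropped by the lexer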

it's also possible to specify everything using matchers and then get
lepl to compile "as much as possible" of the matcher graph to nfas
before matching (nfas rather than dfas because they are implemented
with a stack to preserve the backtracking abilities of the recursive
descent parser they replace). the problem with that approach is that
not all matchers can be converted (matchers can contain arbitrary
python functions, which my nfa+dfa implementations cannot, and my
"compiler" isn't very smart). using tokens explicitly, on the other
hand, gives you an error if the automatic compilation fails (in which
case the simple fix is to just give the regexp).

(also, you say "sequence of bytes" rather than strings - lepl will
parse the bytes type in python3 and even has support for matching
binary values).

disclaimer: newish library, python 2.6+ only, and while i have quite a
few users (or, at least, downloads), i doubt that many use these more
advanced features, and everything is pure python with little
performance tuning so far.

andrew
 
