Comparison of parsers in python?


Peng Yu

Hi,

I did a Google search and found various parsers in Python that can be
used to parse different files in various situations. I don't see a page
that summarizes and compares all the available parsers in Python, from
simple and easy-to-use ones to complex and powerful ones.

I am wondering if somebody could list all the available parsers and
compare them.

Regards,
Peng
 

Robert Kern

Peng said:
Hi,

I did a Google search and found various parsers in Python that can be
used to parse different files in various situations. I don't see a page
that summarizes and compares all the available parsers in Python, from
simple and easy-to-use ones to complex and powerful ones.

Second hit for "python parser":

http://nedbatchelder.com/text/python-parsers.html

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

TerryP

Peng said:
This is more or less just a list of parsers. I would like some detailed
guidelines on which one to choose for various parsing problems.

Regards,
Peng


It depends on the parsing problem.

Obviously you're not going to use an INI parser to work with XML, or
vice versa. Likewise, some formats can be parsed in different ways: XML
parsers, for example, are often built around a SAX or DOM model. The
differences between them (hit Wikipedia) can affect the performance of
your application more than learning how to use an XML parser's API can
affect the hair on your head.

For flat data, a simple unix-style rc or dos-style ini file will often
suffice, and writing a parser for one is fairly trivial; in fact, writing a
config file parser is an excellent learning exercise for getting a feel
for a given language's standard I/O, string handling, and type
conversion features. These kinds of parsers tend to be pretty quick
because of their simplicity, and writing a small but extremely fast
one can be enjoyable at times; one of these days I need to do it in
x86 assembly just for the hell of it. Python includes an INI parser in
the standard library.
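As a minimal sketch of that standard-library INI parser (called configparser in Python 3, ConfigParser in Python 2; the section and keys below are made up for illustration):

```python
import configparser  # ConfigParser in Python 2

# a small ini file, read from a string here for illustration
text = """
[server]
host = example.org
port = 8080
"""

config = configparser.ConfigParser()
config.read_string(text)

host = config.get("server", "host")
port = config.getint("server", "port")  # type conversion is built in
print(host, port)
```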

XML serves well for hierarchical data models, but it can be a royal pain
to write code around the parsers (IMHO anyway!), though it is often handy.
Popular parsers for XML include expat and libxml2; there is also a
more "Pythonic" wrapper for libxml/libxslt called py-lxml, and Python
itself comes with parsers for XML. Other formats such as JSON, YAML, heck,
even S-expressions could be used and parsed. Some programs only parse
enough to slurp up code and eval it (not always smart, but sometimes
useful).
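For instance, one of the XML parsers that comes with Python is xml.etree.ElementTree; a quick sketch on a made-up hierarchical document:

```python
import xml.etree.ElementTree as ET

# a small hierarchical document, parsed from a string for illustration
xml = "<tracks><track name='a' span='10'/><track name='b' span='20'/></tracks>"
root = ET.fromstring(xml)

# walk the tree and pull out attributes from each child element
for track in root.findall("track"):
    print(track.get("name"), track.get("span"))
```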

In general, the issues to consider when selecting a parser for a given
format involve speed, size, and time: how long does it take to
process the data set, how much memory (size) does it consume, and how
much bloody time will it take to learn the API ;).


The best way to choose a parser is to experiment with several, test (and
profile!) them against your project, then pick the one you like
best out of those that are suitable for the task. Profiling can be
very important.
 

andrew cooke

This is more or less just a list of parsers. I would like some detailed
guidelines on which one to choose for various parsing problems.

it would be simpler if you described what you want to do - parsers can
be used for a lot of problems.

also, the parsers do not exist in isolation - you need to worry about
whether they are supported, how good the documentation is, etc.

and different parsers handle different grammars - see
http://wiki.python.org/moin/LanguageParsing - so if you already have a
particular grammar then your life is simpler if you choose a parser
that matches.


these are the three that i know most about - i think all three are
currently maintained:

for simple parsing problems, i think pyparsing is the most commonly
used - http://pyparsing.wikispaces.com/

my own lepl - http://www.acooke.org/lepl/ - tries to combine ease of
use with some more advanced features

the nltk - http://www.nltk.org/ - is particularly targeted at parsing
natural languages and includes a wide variety of tools.


but for parsing a large project you might be better interfacing to a
compiled parser (lepl has memoisation, so should scale quite well, but
it's not something i've looked at in detail yet).

andrew
 

andrew cooke

For flat data, simple unix style rc or dos style ini file will often
suffice, and writing a parser is fairly trivial; in fact writing a
[...]

python already includes parsers for ".ini" configuration files.

[...]
The best way to choose a parser is to experiment with several, test (and
profile!) them against your project, then pick the one you like
best out of those that are suitable for the task. Profiling can be
very important.

profiling is going to show you the constant factors, but - unless
you think hard - it's not going to explain how a parser will scale
(how performance changes as the amount of text to be parsed
increases). for that, you need to look at the algorithm used, which
is usually documented somewhere. there are going to be trade-offs -
parsers that handle large texts better could well be more complex and
slower on small texts.
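a quick sketch of that kind of scaling check - time the same parser on inputs of increasing size and compare the ratios (the split-based parser here is a trivial stand-in, not any of the libraries mentioned):

```python
import time

def parse(text):
    # trivial stand-in parser: split each line into a tuple of fields
    return [tuple(line.split()) for line in text.splitlines()]

for n in (10_000, 20_000, 40_000):
    text = "\n".join("%d 1" % i for i in range(n))
    start = time.perf_counter()
    parse(text)
    elapsed = time.perf_counter() - start
    # for a linear parser, elapsed should roughly double as n doubles
    print(n, elapsed)
```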

andrew
 

Peng Yu

it would be simpler if you described what you want to do - parsers can
be used for a lot of problems.

I have never used any parser package. The task at hand right now is to
parse this: http://genome.ucsc.edu/goldenPath/help/wiggle.html, which
is fairly simple even without any parser package.

I think it is worthwhile for me to learn some parser packages by
trying to parse this format, so that I may in future parse more complex
syntax. Do you have any suggestion for which parser I should use for now?

Regards,
Peng
 

andrew cooke

I have never used any parser package. The task at hand right now is to
parse this: http://genome.ucsc.edu/goldenPath/help/wiggle.html, which
is fairly simple even without any parser package.

I think it is worthwhile for me to learn some parser packages by
trying to parse this format, so that I may in future parse more complex
syntax. Do you have any suggestion for which parser I should use for now?

pyparsing would work fine for that, and has a broad community of users
that will probably be helpful.

i am currently working on an extension to lepl that is related, and i
may use that format as an example. if so, i'll tell you. but for
now, i think pyparsing makes more sense for you.

andrew
 

andrew cooke

One word of warning - the documentation for that format says at the
beginning that it is compressed in some way. I am not sure if that
means within some program, or on disk. But most parsers will not be
much use with a compressed file - you will need to uncompress it first.
 

Peng Yu

pyparsing would work fine for that, and has a broad community of users
that will probably be helpful.

i am currently working on an extension to lepl that is related, and i
may use that format as an example.  if so, i'll tell you.  but for
now, i think pyparsing makes more sense for you.

The file size of a wig file can be very large (GB). Most tasks on this
file format do not need the parser to keep all the lines read from
the file in memory to produce the parsing result. I'm wondering if
pyparsing is capable of parsing large wig files while keeping only the
minimum required information in memory.

Regards,
Peng
 

andrew cooke

The file size of a wig file can be very large (GB). Most tasks on this
file format do not need the parser to keep all the lines read from
the file in memory to produce the parsing result. I'm wondering if
pyparsing is capable of parsing large wig files while keeping only the
minimum required information in memory.

ok, now you are getting into the kind of detail where you will need to
ask the authors of individual packages.

lepl is stream oriented and should behave as you want (it will only
keep in memory what it needs, and will read data gradually from a
file) but (1) it's fairly new and i have not tested the memory use -
there may be some unexpected memory leak; (2) it's python 2.6/3 only;
(3) parsing line-based formats like this is not yet supported very
well (you can do it, but you have to explicitly match the newline
character to find the end of line); (4) the community for support is
small.

so i would suggest asking on the pyparsing list for advice on using
that with large data files (you are getting closer to the point where
i would recommend lepl - but of course i am biased as i wrote it).

andrew

ps is there somewhere i can download example files? this would be
useful for my own testing. thanks.
 

andrew cooke

also, parsing large files may be slow. in which case you may be
better with a non-python solution (even if you call it from python).

your file format is so simple that you may find a lexer is enough for
what you want, and they should be stream oriented. have a look at the
"shlex" package that is already in python. will that help?
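a sketch of what the standard-library shlex can do with a track definition line (the sample line below is in the wiggle style; shlex.split honours the double quotes, which is the work it saves you):

```python
import shlex

line = 'track type=wiggle_0 name="MACS_counts_after_shifting" description="H3K4me1B"'

# shlex.split keeps quoted values as single tokens and strips the quotes
tokens = shlex.split(line)

# skip the leading "track" word, then split each key=value token
fields = dict(token.split("=", 1) for token in tokens[1:])
print(fields["name"])  # MACS_counts_after_shifting
```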

alternatively, perhaps plex - http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
- that is pure python, but greg ewing is a good programmer and he says
on that page it is as fast as possible for python, so it is probably
going to be quite fast.

andrew

ps maybe you already know, but a lexer is simpler than a parser in
that it doesn't use the context to decide how to treat things. so it
can recognise something is a number, or a word, or a quoted string,
but not whether it is part of a track definition line or a data value,
for example. but in this case the format is so simple that a lexer
might do quite a lot of what you want, and would make the remaining
plain python program very simple.
 

Peng Yu

also, parsing large files may be slow.  in which case you may be
better with a non-python solution (even if you call it from python).

your file format is so simple that you may find a lexer is enough for
what you want, and they should be stream oriented.  have a look at the
"shlex" package that is already in python.  will that help?

alternatively, perhaps plex - http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
- that is pure python, but greg ewing is a good programmer and he says
on that page it is as fast as possible for python, so it is probably
going to be quite fast.

andrew

ps maybe you already know, but a lexer is simpler than a parser in
that it doesn't use the context to decide how to treat things.  so it
can recognise something is a number, or a word, or a quoted string,
but not whether it is part of a track definition line or a data value,
for example.  but in this case the format is so simple that a lexer
might do quite a lot of what you want, and would make the remaining
plain python program very simple.

I don't quite understand this point. If I don't use a parser, since
Python can read numbers line by line, why would I need a lexer package?

Regards,
Peng
 

andrew cooke

I don't quite understand this point. If I don't use a parser, since
Python can read numbers line by line, why would I need a lexer package?

for the lines of numbers it would make no difference; for the track
definition lines it would save you some work.

as you said, this is a simple format, so the case for any tool is
marginal - i'm just exploring the options.

andrew
 

Peng Yu

for the lines of numbers it would make no difference; for the track
definition lines it would save you some work.

So for the track definition lines, using a lexer package would be better
than using regexes in Python, right?
 

andrew cooke

So for the track definition lines, using a lexer package would be better
than using regexes in Python, right?

they are similar. a lexer is really just a library that packages
regular expressions in a certain way. so you could write your own
code and you would really be writing a simple lexer. the advantage of
writing your own code is that it will be easier to modify and you will
get more experience with regular expressions. the advantage of using
a library (a lexer) is that it has already been tested by other
people, it is already packaged and so your code will be better
structured, and it may have features (perhaps logging, or handling of
quoted strings, for example) that will save you some work in the
future.
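a sketch of the hand-rolled alternative with python's re module - pulling key=value pairs, quoted or bare, out of a track line (illustrative only, not tested against the full wiggle spec; the sample line is made up):

```python
import re

# matches key=value where value is either quoted (may contain spaces) or bare
PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')

def parse_track_line(line):
    # group 2 is the quoted form, group 3 the bare form; exactly one matches
    return {m.group(1): m.group(2) if m.group(2) is not None else m.group(3)
            for m in PAIR.finditer(line)}

attrs = parse_track_line('track type=wiggle_0 name="MACS counts" span=10')
print(attrs)
```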

andrew
 

Robert Kern

Peng said:
The file size of a wig file can be very large (GB). Most tasks on this
file format do not need the parser to keep all the lines read from
the file in memory to produce the parsing result. I'm wondering if
pyparsing is capable of parsing large wig files while keeping only the
minimum required information in memory.

I cannot recommend pyparsing for large amounts of text. Even before you hit
memory limits, you will run into the problem that pyparsing runs many functions
for each character of text. Python function calls are expensive.

Since the format is line-oriented, one option is to use pyparsing or another
parser to handle the track definition lines and just str.split(), float() and
int() for the data lines.
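A sketch of that split-and-convert approach for the data lines (assuming the variableStep subset shown elsewhere in the thread; it is a generator, so only one line is in memory at a time):

```python
def read_wig(lines):
    # yields (position, value) pairs, streaming line by line
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "variableStep")):
            continue  # header lines would go to a real parser instead
        pos, value = line.split()
        yield int(pos), float(value)

data = ["variableStep chrom=chr10 span=10", "3001871 1", "3001881 1"]
print(list(read_wig(data)))  # [(3001871, 1.0), (3001881, 1.0)]
```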

--
Robert Kern

 

andrew cooke

ps is there somewhere i can download example files? this would be
useful for my own testing. thanks.

i replied to a lot of your questions here; any chance you could reply
to this one of mine?

the wig format looks like it could be a good test for lepl.

thanks,
andrew
 

Peng Yu

i replied to a lot of your questions here; any chance you could reply
to this one of mine?

the wig format looks like it could be a good test for lepl.

I missed your question. I only have some files, and they use just a
subset of the wig syntax. Here is one example.



track type=wiggle_0 name="MACS_counts_after_shifting" description="H3K4me1B"
variableStep chrom=chr10 span=10
3001871 1
3001881 1
3001891 1
3001901 1
track type=wiggle_0 name="MACS_counts_after_shifting" description="H3K4me1B"
variableStep chrom=chr11 span=10
3000331 3
3000341 3
3000351 3
3000361 3
3000371 3
3000381 3
 

andrew cooke

I missed your question. I only have some files, and they use just a
subset of the wig syntax. Here is one example.

ah, thanks. i'll see if i can find something on the 'net - i am
hoping to test how / whether gigabytes of data can be parsed.

andrew
 
