Comparison of parsers in python?

Discussion in 'Python' started by Peng Yu, Sep 19, 2009.

  1. Peng Yu

    Peng Yu Guest

    Hi,

    I did a google search and found various parser in python that can be
    used to parse different files in various situation. I don't see a page
    that summarizes and compares all the available parsers in python, from
    simple and easy-to-use ones to complex and powerful ones.

    I am wondering if somebody could list all the available parsers and
    compare them.

    Regards,
    Peng
    Peng Yu, Sep 19, 2009
    #1
    1. Advertising

  2. Peng Yu

    Robert Kern Guest

    Peng Yu wrote:
    > Hi,
    >
    > I did a google search and found various parser in python that can be
    > used to parse different files in various situation. I don't see a page
    > that summarizes and compares all the available parsers in python, from
    > simple and easy-to-use ones to complex and powerful ones.


    Second hit for "python parser":

    http://nedbatchelder.com/text/python-parsers.html

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
    Robert Kern, Sep 20, 2009
    #2
    1. Advertising

  3. Peng Yu

    Peng Yu Guest

    On Sep 19, 6:05 pm, Robert Kern <> wrote:
    > Peng Yu wrote:
    > > Hi,

    >
    > > I did a google search and found various parser in python that can be
    > > used to parse different files in various situation. I don't see a page
    > > that summarizes and compares all the available parsers in python, from
    > > simple and easy-to-use ones to complex and powerful ones.

    >
    > Second hit for "python parser":
    >
    > http://nedbatchelder.com/text/python-parsers.html


    This is more a less just a list of parsers. I would like some detailed
    guidelines on which one to choose for various parsing problems.

    Regards,
    Peng
    Peng Yu, Sep 20, 2009
    #3
  4. Peng Yu

    TerryP Guest

    Peng Yu wrote:
    > This is more a less just a list of parsers. I would like some detailed
    > guidelines on which one to choose for various parsing problems.
    >
    > Regards,
    > Peng



    It depends on the parsing problem.

    Obviously your not going to use an INI parser to work with XML, or
    vice versa. Likewise some formats can be parsed in different ways, XML
    parsers for example are often build around a SAX or DOM model. The
    differences between them (hit Wikipedia) can effect the performance of
    your application, more then learning how to use an XML parsers API can
    effect the hair on your head.

    For flat data, simple unix style rc or dos style ini file will often
    suffice, and writing a parser is fairly trivial; in fact writing a
    config file parser is an excellent learning exercise, to get a feel
    for a given languages standard I/O, string handling, and type
    conversion features. These kind of parsers tend to be pretty quick
    because of their simplicity, and writing a small but extremely fast
    one can be enjoyable at times; one of these days I need to do it in
    X86 assembly just for the hell of it. Python includes an INI parser in
    the standard library.

    XML serves well for hierarchical data models, but can be a royal pain
    to write code around the parsers (IMHO anyway!), but often is handy.
    Popular parsers for XML include expat and libxml2 - there is also a
    more "Pythonic" wrapper for libxml/libxslt called py-lxml; Python also
    comes with parsers for XML. Other formats such as JSON, YAML, heck
    even S-expressions could be used and parsed. Some programs only parse
    enough to slup up code and eval it (not always smart, but sometimes
    useful).

    In general the issues to consider when selecting a parser for a given
    format, involve: speed, size, and time. How long does it take to
    process the data set, how much memory (size) does it consume, and how
    much bloody time will it take to learn the API ;).


    The best way to choose a parser, is experiment with several, test (and
    profile!) them according to the project, then pick the one you like
    best, out of those that are suitable for the task. Profiling can be
    very important.


    --
    TerryP
    TerryP, Sep 20, 2009
    #4
  5. Peng Yu

    andrew cooke Guest

    On Sep 19, 9:34 pm, Peng Yu <> wrote:
    > On Sep 19, 6:05 pm, Robert Kern <> wrote:
    > >http://nedbatchelder.com/text/python-parsers.html

    >
    > This is more a less just a list of parsers. I would like some detailed
    > guidelines on which one to choose for various parsing problems.


    it would be simpler if you described what you want to do - parsers can
    be used for a lot of problems.

    also, the parsers do not exist in isolation - you need to worry about
    whether they are supported, how good the documentation is, etc.

    and different parsers handle different grammars - see
    http://wiki.python.org/moin/LanguageParsing - so if you already have a
    particular grammar then your life is simpler if you choose a parser
    that matches.


    these are the three that i know most about - i think all three are
    currently maintained:

    for simple parsing problems, i think pyparsing is the most commonly
    used - http://pyparsing.wikispaces.com/

    my own lepl - http://www.acooke.org/lepl/ - tries to combine ease of
    use with some more advanced features

    the nltk - http://www.nltk.org/ - is particularly targeted at parsing
    natural languages and includes a wide variety of tools.


    but for parsing a large project you might be better interfacing to a
    compiled parser (lepl has memoisation, so should scale quite well, but
    it's not something i've looked at in detail yet).

    andrew
    andrew cooke, Sep 20, 2009
    #5
  6. Peng Yu

    andrew cooke Guest

    On Sep 19, 11:39 pm, TerryP <> wrote:
    [...]
    > For flat data, simple unix style rc or dos style ini file will often
    > suffice, and writing a parser is fairly trivial; in fact writing a

    [...]

    python already includes parsers for ".ini" configuration files.

    [...]
    > The best way to choose a parser, is experiment with several, test (and
    > profile!) them according to the project, then pick the one you like
    > best, out of those that are suitable for the task. Profiling can be
    > very important.


    profiling is going to show you the constant complexity, but - unless
    you think hard - it's not going to explain how a parser will scale
    (how performance changes as the amount of text to be parsed
    increases). for that, you need to look at the algorithm used, which
    is usually documented somewhere. there's going to be trade-offs -
    parsers that handle large texts better could well be more complex and
    slower on small texts.

    andrew
    andrew cooke, Sep 20, 2009
    #6
  7. Peng Yu

    Peng Yu Guest

    On Sun, Sep 20, 2009 at 6:50 AM, andrew cooke <> wrote:
    > On Sep 19, 9:34 pm, Peng Yu <> wrote:
    >> On Sep 19, 6:05 pm, Robert Kern <> wrote:
    >> >http://nedbatchelder.com/text/python-parsers.html

    >>
    >> This is more a less just a list of parsers. I would like some detailed
    >> guidelines on which one to choose for various parsing problems.

    >
    > it would be simpler if you described what you want to do - parsers can
    > be used for a lot of problems.


    I have never used any parser. The task at my hand right now is to
    parse this http://genome.ucsc.edu/goldenPath/help/wiggle.html, which
    is a fairly simple even without any parser package.

    I think that it is worthwhile for me to learn some parser packages to
    try to parse this format. So that I may in future parse more complex
    syntax. Do you have any suggestion what parser I should use for now?

    Regards,
    Peng

    > also, the parsers do not exist in isolation - you need to worry about
    > whether they are supported, how good the documentation is, etc.
    >
    > and different parsers handle different grammars - see
    > http://wiki.python.org/moin/LanguageParsing - so if you already have a
    > particular grammar then your life is simpler if you choose a parser
    > that matches.
    >
    >
    > these are the three that i know most about - i think all three are
    > currently maintained:
    >
    > for simple parsing problems, i think pyparsing is the most commonly
    > used - http://pyparsing.wikispaces.com/
    >
    > my own lepl - http://www.acooke.org/lepl/ - tries to combine ease of
    > use with some more advanced features
    >
    > the nltk - http://www.nltk.org/ - is particularly targeted at parsing
    > natural languages and includes a wide variety of tools.
    >
    >
    > but for parsing a large project you might be better interfacing to a
    > compiled parser (lepl has memoisation, so should scale quite well, but
    > it's not something i've looked at in detail yet).
    >
    > andrew
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >
    Peng Yu, Sep 20, 2009
    #7
  8. Peng Yu

    andrew cooke Guest

    On Sep 20, 8:11 am, Peng Yu <> wrote:
    > On Sun, Sep 20, 2009 at 6:50 AM, andrew cooke <> wrote:
    > > On Sep 19, 9:34 pm, Peng Yu <> wrote:
    > >> On Sep 19, 6:05 pm, Robert Kern <> wrote:
    > >> >http://nedbatchelder.com/text/python-parsers.html

    >
    > >> This is more a less just a list of parsers. I would like some detailed
    > >> guidelines on which one to choose for various parsing problems.

    >
    > > it would be simpler if you described what you want to do - parsers can
    > > be used for a lot of problems.

    >
    > I have never used any parser. The task at my hand right now is to
    > parse thishttp://genome.ucsc.edu/goldenPath/help/wiggle.html, which
    > is a fairly simple even without any parser package.
    >
    > I think that it is worthwhile for me to learn some parser packages to
    > try to parse this format. So that I may in future parse more complex
    > syntax. Do you have any suggestion what parser I should use for now?


    pyparsing would work fine for that, and has a broad community of users
    that will probably be helpful.

    i am currently working on an extension to lepl that is related, and i
    may use that format as an example. if so, i'll tell you. but for
    now, i think pyparsing makes more sense for you.

    andrew
    andrew cooke, Sep 20, 2009
    #8
  9. Peng Yu

    andrew cooke Guest

    One word of warning - the documentation for that format says at the
    beginning that it is compressed in some way. I am not sure if that
    means within some program, or on disk. But most parsers will not be
    much use with a compressed file - you will need to uncompress it first.
    andrew cooke, Sep 20, 2009
    #9
  10. Peng Yu

    Peng Yu Guest

    On Sun, Sep 20, 2009 at 7:20 AM, andrew cooke <> wrote:
    > On Sep 20, 8:11 am, Peng Yu <> wrote:
    >> On Sun, Sep 20, 2009 at 6:50 AM, andrew cooke <> wrote:
    >> > On Sep 19, 9:34 pm, Peng Yu <> wrote:
    >> >> On Sep 19, 6:05 pm, Robert Kern <> wrote:
    >> >> >http://nedbatchelder.com/text/python-parsers.html

    >>
    >> >> This is more a less just a list of parsers. I would like some detailed
    >> >> guidelines on which one to choose for various parsing problems.

    >>
    >> > it would be simpler if you described what you want to do - parsers can
    >> > be used for a lot of problems.

    >>
    >> I have never used any parser. The task at my hand right now is to
    >> parse thishttp://genome.ucsc.edu/goldenPath/help/wiggle.html, which
    >> is a fairly simple even without any parser package.
    >>
    >> I think that it is worthwhile for me to learn some parser packages to
    >> try to parse this format. So that I may in future parse more complex
    >> syntax. Do you have any suggestion what parser I should use for now?

    >
    > pyparsing would work fine for that, and has a broad community of users
    > that will probably be helpful.
    >
    > i am currently working on an extension to lepl that is related, and i
    > may use that format as an example.  if so, i'll tell you.  but for
    > now, i think pyparsing makes more sense for you.


    The file size of a wig file can be very large (GB). Most tasks on this
    file format does not need the parser to save all the lines read from
    the file in the memory to produce the parsing result. I'm wondering if
    pyparsing is capable of parsing large wig files by keeping only
    minimum required information in the memory.

    Regards,
    Peng
    Peng Yu, Sep 20, 2009
    #10
  11. Peng Yu

    andrew cooke Guest

    > The file size of a wig file can be very large (GB). Most tasks on this
    > file format does not need the parser to save all the lines read from
    > the file in the memory to produce the parsing result. I'm wondering if
    > pyparsing is capable of parsing large wig files by keeping only
    > minimum required information in the memory.


    ok, now you are getting into the kind of detail where you will need to
    ask the authors of individual packages.

    lepl is stream oriented and should behave as you want (it will only
    keep in memory what it needs, and will read data gradually from a
    file) but (1) it's fairly new and i have not tested the memory use -
    there may be some unexpected memory leak; (2) it's python 2.6/3 only;
    (3) parsing line-based formats like this is not yet supported very
    well (you can do it, but you have to explicitly match the newline
    character to find the end of line); (4) the community for support is
    small.

    so i would suggest asking on the pyparsing list for advice on using
    that with large data files (you are getting closer to the point where
    i would recommend lepl - but of course i am biased as i wrote it).

    andrew

    ps is there somewhere can download example files? this would be
    useful for my own testing. thanks.
    andrew cooke, Sep 20, 2009
    #11
  12. Peng Yu

    andrew cooke Guest

    also, parsing large files may be slow. in which case you may be
    better with a non-python solution (even if you call it from python).

    your file format is so simple that you may find a lexer is enough for
    what you want, and they should be stream oriented. have a look at the
    "shlex" package that is already in python. will that help?

    alternatively, perhaps plex - http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
    - that is pure python, but greg ewing is a good programmer and he says
    on that page it is as fast as possible for python, so it is probably
    going to be quite fast.

    andrew

    ps maybe you already know, but a lexer is simpler than a parser in
    that it doesn't use the context to decide how to treat things. so it
    can recognise something is a number, or a word, or a quoted string,
    but not whether it is part of a track definition line or a data value,
    for example. but in this case the format is so simple that a lexer
    might do quite a ot of what you want, and would make the remaining
    plain python program very simple.
    andrew cooke, Sep 20, 2009
    #12
  13. Peng Yu

    Peng Yu Guest

    On Sun, Sep 20, 2009 at 8:19 AM, andrew cooke <> wrote:
    >
    > also, parsing large files may be slow.  in which case you may be
    > better with a non-python solution (even if you call it from python).
    >
    > your file format is so simple that you may find a lexer is enough for
    > what you want, and they should be stream oriented.  have a look at the
    > "shlex" package that is already in python.  will that help?
    >
    > alternatively, perhaps plex - http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
    > - that is pure python, but greg ewing is a good programmer and he says
    > on that page it is as fast as possible for python, so it is probably
    > going to be quite fast.
    >
    > andrew
    >
    > ps maybe you already know, but a lexer is simpler than a parser in
    > that it doesn't use the context to decide how to treat things.  so it
    > can recognise something is a number, or a word, or a quoted string,
    > but not whether it is part of a track definition line or a data value,
    > for example.  but in this case the format is so simple that a lexer
    > might do quite a ot of what you want, and would make the remaining
    > plain python program very simple.


    I don't quite understand this point. If I don't use a parser, since
    python can read numbers line by line, why I need a lexer package?

    Regards,
    Peng
    Peng Yu, Sep 20, 2009
    #13
  14. Peng Yu

    andrew cooke Guest

    > I don't quite understand this point.  If I don't use a parser, since
    > python can read numbers line by line, why I need a lexer package?


    for the lines of numbers it would make no difference; for the track
    definition lines it would save you some work.

    as you said, this is a simple format, so the case for any tool is
    marginal - i'm just exploring the options.

    andrew
    andrew cooke, Sep 20, 2009
    #14
  15. Peng Yu

    Peng Yu Guest

    On Sun, Sep 20, 2009 at 8:49 AM, andrew cooke <> wrote:
    >> I don't quite understand this point.  If I don't use a parser, since
    >> python can read numbers line by line, why I need a lexer package?

    >
    > for the lines of numbers it would make no difference; for the track
    > definition lines it would save you some work.


    So for the track definition, using a lexer package would be better
    than using regex in python, right?

    > as you said, this is a simple format, so the case for any tool is
    > marginal - i'm just exploring the options.
    >
    > andrew
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >
    Peng Yu, Sep 20, 2009
    #15
  16. Peng Yu

    andrew cooke Guest

    > So for the track definition, using a lexer package would be better
    > than using regex in python, right?


    they are similar. a lexer is really just a library that packages
    regular expressions in a certain way. so you could write your own
    code and you would really be writing a simple lexer. the advantage of
    writing your own code is that it will be easier to modify and you will
    get more experience with regular expressions. the advantage of using
    a library (a lexer) is that it has already been tested by other
    people, it is already packaged and so your code will be better
    structured, and it may have features (perhaps logging, or handling of
    quoted strings, for example) that will save you some work in the
    future.

    andrew
    andrew cooke, Sep 20, 2009
    #16
  17. Peng Yu

    Robert Kern Guest

    Peng Yu wrote:

    > The file size of a wig file can be very large (GB). Most tasks on this
    > file format does not need the parser to save all the lines read from
    > the file in the memory to produce the parsing result. I'm wondering if
    > pyparsing is capable of parsing large wig files by keeping only
    > minimum required information in the memory.


    I cannot recommend pyparsing for large amounts of text. Even before you hit
    memory limits, you will run into the problem that pyparsing runs many functions
    for each character of text. Python function calls are expensive.

    Since the format is line-oriented, one option is to use pyparsing or other
    parser to handle the track definition lines and just str.split(), float() and
    int() for the data lines.

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
    Robert Kern, Sep 20, 2009
    #17
  18. Peng Yu

    andrew cooke Guest

    On Sep 20, 9:12 am, andrew cooke <> wrote:
    > ps is there somewhere can download example files?  this would be
    > useful for my own testing.  thanks.


    i replied to a lot of your questions here; any chance you could reply
    to this one of mine?

    the wig format looks like it could be a good test for lepl.

    thanks,
    andrew
    andrew cooke, Sep 20, 2009
    #18
  19. Peng Yu

    Peng Yu Guest

    On Sun, Sep 20, 2009 at 1:35 PM, andrew cooke <> wrote:
    > On Sep 20, 9:12 am, andrew cooke <> wrote:
    >> ps is there somewhere can download example files?  this would be
    >> useful for my own testing.  thanks.

    >
    > i replied to a lot of your questions here; any chance you could reply
    > to this one of mine?
    >
    > the wig format looks like it could be a good test for lepl.


    I missed your question. I have only some files that I only use a
    subset of the syntax in wig. Here is one example.



    track type=wiggle_0 name="MACS_counts_after_shifting" description="H3K4me1B"
    variableStep chrom=chr10 span=10
    3001871 1
    3001881 1
    3001891 1
    3001901 1
    track type=wiggle_0 name="MACS_counts_after_shifting" description="H3K4me1B"
    variableStep chrom=chr11 span=10
    3000331 3
    3000341 3
    3000351 3
    3000361 3
    3000371 3
    3000381 3
    Peng Yu, Sep 20, 2009
    #19
  20. Peng Yu

    andrew cooke Guest

    On Sep 20, 3:16 pm, Peng Yu <> wrote:
    > On Sun, Sep 20, 2009 at 1:35 PM, andrew cooke <> wrote:
    > > On Sep 20, 9:12 am, andrew cooke <> wrote:
    > >> ps is there somewhere can download example files?  this would be
    > >> useful for my own testing.  thanks.

    >
    > > i replied to a lot of your questions here; any chance you could reply
    > > to this one of mine?

    >
    > > the wig format looks like it could be a good test for lepl.

    >
    > I missed your question. I have only some files that I only use a
    > subset of the syntax in wig. Here is one example.


    ah, thanks. i'll see if i can find something on the 'net - i am
    hoping to test how / whether gigabytes of data can be parsed.

    andrew
    andrew cooke, Sep 20, 2009
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. John J. Lee
    Replies:
    3
    Views:
    903
    John J. Lee
    Jun 26, 2003
  2. Paul McGuire
    Replies:
    0
    Views:
    277
    Paul McGuire
    Dec 16, 2003
  3. Daniel Carrera

    Parsers with Python

    Daniel Carrera, Feb 17, 2004, in forum: Python
    Replies:
    3
    Views:
    293
    Robert E. Brown
    Feb 18, 2004
  4. bio_enthusiast
    Replies:
    5
    Views:
    422
    Ramon Diaz-Uriarte
    Aug 26, 2006
  5. Deepu
    Replies:
    1
    Views:
    217
    ccc31807
    Feb 7, 2011
Loading...

Share This Page