Python parser that records source ranges

Discussion in 'Python' started by Jonathan Edwards, Sep 29, 2003.

  1. The parser library module only records source line numbers for tokens. I
    need a parser that records ranges of line and character locations for
    each AST node, so I can map back to the source. Does anyone know of such
    a thing? Thanks

    Jonathan
    Jonathan Edwards, Sep 29, 2003
    #1

  2. The tokenize module will give column information for each token, but
    it produces only a stream of tokens, not an AST.

    Jeff
    Jeff Epler, Sep 29, 2003
    #2
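A minimal sketch of what Jeff describes, written against the Python 3 form of the tokenize API (the 2003-era interface differed). The helper name `token_ranges` is made up for illustration. Each token carries a (row, column) start and end, but nothing about the tree structure.

```python
import io
import tokenize

def token_ranges(source):
    """Return (token_name, text, start, end) for every token in source.

    start and end are (row, column) pairs; rows are 1-based, columns 0-based.
    """
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string, tok.start, tok.end)
        for tok in tokenize.generate_tokens(readline)
    ]

for name, text, start, end in token_ranges("a = f(20, val)\n"):
    print(name, repr(text), start, end)
```

The result is a flat list, so grouping tokens into expressions or statements still has to come from a parser.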

  3. Jonathan Edwards wrote in message news:<qRKdb.456249$Oz4.260848@rwcrnsc54>...
    > The parser library module only records source line numbers for tokens. I
    > need a parser that records ranges of line and character locations for
    > each AST node, so I can map back to the source. Does anyone know of such
    > a thing? Thanks
    >
    > Jonathan


    You know there's not going to be a one-to-one relationship, right?
    Most AST nodes are symbols and aren't going to match any tokens.
    Python ASTs also use a lot of intermediate nodes to enforce operator
    precedence.

    Anyway, I have some rather specialized code in PyXR that syncs tokens
    to an ast. You probably won't be able to use it out of the box but it
    should give you a good start:

    http://www.cathoderaymission.net/~logistix/PyXR/

    The source file of particular interest to you would be astToHtml.py:

    http://tinyurl.com/p3cn
    logistix at cathoderaymission.net, Sep 29, 2003
    #3
  4. So the basic idea is to match up the leaves of the AST with the list of
    tokens from the tokenize module, which do contain location info. I had
    thought of that, but was hoping there was a more informative parser out
    there. Thanks.

    Jonathan


    Jonathan Edwards, Oct 1, 2003
    #4
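For readers on a current Python, a note that goes beyond what the thread had available: since Python 3.8 the standard ast module (not the parser module discussed here, which was removed in Python 3.10) records start and end positions on statement and expression nodes. The sketch below is one possible way to list them; `node_ranges` is a hypothetical helper name.

```python
import ast

def node_ranges(source):
    """Yield (node_type, start, end, text) for every positioned AST node.

    start and end are (line, column) pairs. Lines are 1-based; columns are
    0-based UTF-8 byte offsets, as documented for the ast module.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Module, operator and context nodes carry no position attributes.
        if getattr(node, "end_col_offset", None) is None:
            continue
        yield (
            type(node).__name__,
            (node.lineno, node.col_offset),
            (node.end_lineno, node.end_col_offset),
            ast.get_source_segment(source, node),
        )

for entry in node_ranges("a = f(20, val)\n"):
    print(entry)
```

`ast.get_source_segment` does the byte-offset to text conversion, which matters once the source contains non-ASCII characters.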
  5. Jonathan Edwards wrote:
    > So the basic idea is to match up the leaves of the AST with the list of
    > tokens from tokenizer, which do contain location info. I had thought of
    > that, but was hoping there was a more informative parser out there.
    > Thanks.
    >
    > Jonathan

    It's really not that bad. The more I think about it, the code
    reference I sent you is way overcomplicated. General pseudocode for
    walking ASTs generated via parser.ast2tuple(parser.suite(code)) is:

    def walk_node(node):
        # A token is a (type, string) pair; anything else is a symbol node.
        if len(node) == 2 and type(node[1]) is not tuple:
            walk_token(node)
        else:
            walk_symbol(node)

    def walk_symbol(node):
        symbol_type = node[0]
        symbol_leaves = node[1:]
        for leaf in symbol_leaves:
            walk_node(leaf)

    def walk_token(node):
        token_type = node[0]
        token_value = node[1]
    logistix at cathoderaymission.net, Oct 1, 2003
    #5
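A rough sketch of the leaf-to-token syncing idea, under some loud assumptions: the tree here is a hand-written stand-in with string labels rather than real parser.ast2tuple output (which uses numeric symbol and token IDs and may include empty NEWLINE and ENDMARKER leaves), and `leaves`, `sync`, `SKIP` and `TREE` are names invented for this example. It uses the Python 3 tokenize API.

```python
import io
import tokenize

# Token types with no counterpart among the stand-in tree's leaves.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
        tokenize.COMMENT, tokenize.ENDMARKER}

def leaves(node):
    """Yield the text of each leaf of a nested-tuple tree, left to right."""
    if len(node) == 2 and not isinstance(node[1], tuple):
        yield node[1]                      # (label, text) is a token leaf
    else:
        for child in node[1:]:             # (label, child, ...) is a symbol
            yield from leaves(child)

def sync(tree, source):
    """Pair each leaf with the start and end of the token with the same text."""
    readline = io.StringIO(source).readline
    tokens = [t for t in tokenize.generate_tokens(readline)
              if t.type not in SKIP]
    pairs = []
    for text, tok in zip(leaves(tree), tokens):
        if text != tok.string:
            raise ValueError("tree and token stream disagree at %r" % (text,))
        pairs.append((text, tok.start, tok.end))
    return pairs

# Illustrative tree for "a = f(20, val)"; not produced by any real parser.
TREE = ("assign",
        ("name", "a"),
        ("op", "="),
        ("call", ("name", "f"), ("op", "("), ("number", "20"), ("op", ","),
         ("name", "val"), ("op", ")")))

for pair in sync(TREE, "a = f(20, val)\n"):
    print(pair)
```

Once every leaf has a position, a range for a symbol node can be taken from the start of its first leaf and the end of its last leaf.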
  6. "Jonathan Edwards" wrote in message
    news:qRKdb.456249$Oz4.260848@rwcrnsc54...
    > The parser library module only records source line numbers for tokens. I
    > need a parser that records ranges of line and character locations for
    > each AST node, so I can map back to the source. Does anyone know of such
    > a thing? Thanks
    >
    > Jonathan
    >


    If I understand you correctly, then the SimpleParse parser may be just
    what you are looking for:

    http://simpleparse.sourceforge.net

    It is very powerful but still easy to use. The AST it produces gives the
    start and end points of the matching tokens. Below is an example for
    parsing a statement (from a VB grammar). You will see that each node is a
    tuple of (token_name, start_char, end_char, [sub_node1, sub_node2, ...]).

    The example below looks rather complex because of the grammar, but you can
    see that many of the nested sub_node matches cover the same characters in
    the source. You can easily match each token to the corresponding text in
    the source.

    Paul

    >>> c("a = f(20, val)", verbose=1)

    1 15
    [('line_body',
      0,
      15,
      [('single_statement',
        0,
        14,
        [('assignment_statement',
          0,
          14,
          [('object', 0, 1, [('primary', 0, 1, [('identifier', 0, 1, [])])]),
           ('expression',
            4,
            14,
            [('par_expression',
              4,
              14,
              [('base_expression',
                4,
                14,
                [('simple_expr',
                  4,
                  14,
                  [('call',
                    4,
                    14,
                    [('object',
                      4,
                      14,
                      [('primary',
                        4,
                        5,
                        [('identifier', 4, 5, [])]),
                       ('parameter_list',
                        5,
                        14,
                        [('list',
                          5,
                          14,
                          [('bare_list',
                            6,
                            13,
                            [('bare_list_item',
                              6,
                              8,
                              [('expression',
                                6,
                                8,
                                [('par_expression',
                                  6,
                                  8,
                                  [('base_expression',
                                    6,
                                    8,
                                    [('simple_expr',
                                      6,
                                      8,
                                      [('atom',
                                        6,
                                        8,
                                        [('literal',
                                          6,
                                          8,
                                          [('integer',
                                            6,
                                            8,
                                            [('decimalinteger',
                                              6,
                                              8,
                                              None)])])])])])])])]),
                             ('bare_list_item',
                              10,
                              13,
                              [('expression',
                                10,
                                13,
                                [('par_expression',
                                  10,
                                  13,
                                  [('base_expression',
                                    10,
                                    13,
                                    [('simple_expr',
                                      10,
                                      13,
                                      [('call',
                                        10,
                                        13,
                                        [('object',
                                          10,
                                          13,
                                          [('primary',
                                            10,
                                            13,
                                            [('identifier',
                                              10,
                                              13,
                                              [])])])])])])])])])])])])])])])])])])])]),
       ('line_end', 14, 15, [('NEWLINE', 14, 15, None)])])]
    Paul Paterson, Oct 2, 2003
    #6
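To make the "match each token to the corresponding text" step concrete, here is a small sketch that only assumes the tuple shape Paul describes, (token_name, start_char, end_char, sub_nodes). It does not call SimpleParse at all. The tree is a handful of nodes copied from the output above and re-nested for brevity, and `spans` is a hypothetical helper name.

```python
SOURCE = "a = f(20, val)\n"

# Offsets are taken from the posted output; intermediate grammar nodes
# have been dropped, so this is not what SimpleParse itself would return.
TREE = ("assignment_statement", 0, 14, [
    ("identifier", 0, 1, []),
    ("call", 4, 14, [
        ("identifier", 4, 5, []),
        ("parameter_list", 5, 14, [
            ("decimalinteger", 6, 8, None),
            ("identifier", 10, 13, []),
        ]),
    ]),
])

def spans(node, source, depth=0):
    """Yield (depth, name, text) for a node and all of its descendants."""
    name, start, end, children = node
    yield depth, name, source[start:end]
    for child in children or []:       # leaves carry either None or []
        yield from spans(child, source, depth + 1)

for depth, name, text in spans(TREE, SOURCE):
    print("  " * depth + name, repr(text))
```

Because every node stores plain character offsets, mapping back to the source is a single slice, with no token stream to keep in sync.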