Pyparsing help

Discussion in 'Python' started by rh0dium, Mar 22, 2008.

  1. rh0dium

    rh0dium Guest

    Hi all,

    I am struggling with parsing the following data:

    test1 = """
    Technology {
    name = "gtc"
    dielectric = 2.75e-05
    unitTimeName = "ns"
    timePrecision = 1000
    unitLengthName = "micron"
    lengthPrecision = 1000
    gridResolution = 5
    unitVoltageName = "v"
    voltagePrecision = 1000000
    unitCurrentName = "ma"
    currentPrecision = 1000
    unitPowerName = "pw"
    powerPrecision = 1000
    unitResistanceName = "kohm"
    resistancePrecision = 10000000
    unitCapacitanceName = "pf"
    capacitancePrecision = 10000000
    unitInductanceName = "nh"
    inductancePrecision = 100
    }

    Tile "unit" {
    width = 0.22
    height = 1.69
    }

    Layer "PRBOUNDARY" {
    layerNumber = 0
    maskName = ""
    visible = 1
    selectable = 1
    blink = 0
    color = "cyan"
    lineStyle = "solid"
    pattern = "blank"
    pitch = 0
    defaultWidth = 0
    minWidth = 0
    minSpacing = 0
    }

    Layer "METAL2" {
    layerNumber = 36
    maskName = "metal2"
    isDefaultLayer = 1
    visible = 1
    selectable = 1
    blink = 0
    color = "yellow"
    lineStyle = "solid"
    pattern = "blank"
    pitch = 0.46
    defaultWidth = 0.2
    minWidth = 0.2
    minSpacing = 0.21
    fatContactThreshold = 1.4
    maxSegLenForRC = 2000
    unitMinResistance = 6.1e-05
    unitNomResistance = 6.3e-05
    unitMaxResistance = 6.9e-05
    unitMinHeightFromSub = 1.21
    unitNomHeightFromSub = 1.237
    unitMaxHeightFromSub = 1.267
    unitMinThickness = 0.25
    unitNomThickness = 0.475
    unitMaxThickness = 0.75
    fatTblDimension = 3
    fatTblThreshold = (0,0.39,10.005)
    fatTblParallelLength = (0,1,0)
    fatTblSpacing = (0.21,0.24,0.6,
    0.24,0.24,0.6,
    0.6,0.6,0.6)
    minArea = 0.144
    }
    """

    Working from the inside out, each entry is a key and a value, where
    the value can be a quoted string, a Word(nums), or a list of numbers.

    My code to handle this looks like this:

    from pyparsing import *

    atflist = Suppress("(") + commaSeparatedList + Suppress(")")
    atfstr = quotedString.setParseAction(removeQuotes)
    atfvalues = ( Word(nums) | atfstr | atflist )

    l = ("36", '"metal2"', '(0.21,0.24,0.6,0.24,0.24,0.6)')

    for x in l:
        print(atfvalues.parseString(x))

    But this isn't parsing the list with commaSeparatedList. Can someone
    point out my errors?

    As a side note: is this the right approach to using pyparsing? Do we
    start from the inside and work our way out, or should I have started
    by looking at the bigger picture ( keyword + "{" + OneOrMore key /
    vals + "}" )? I started there but couldn't figure out how to handle
    multiple lines - I'm assuming I'd just join them all up?

    Thanks
    rh0dium, Mar 22, 2008
    #1

  2. Paul McGuire

    Paul McGuire Guest

    On Mar 22, 4:11 pm, rh0dium <> wrote:
    > Hi all,
    >
    > I am struggling with parsing the following data:
    >

    <snip>
    > As a side note:  Is this the right approach to using pyparsing.  Do we
    > start from the inside and work our way out or should I have started
    > with looking at the bigger picture ( keyword + "{" + OneOrMore key /
    > vals + "}" + )  I started there but could figure out how to look
    > multiline - I'm assuming I'd just join them all up?
    >
    > Thanks


    I think your "inside-out" approach is just fine. Start by composing
    expressions for the different "pieces" of your input text, then
    steadily build up more and more complex forms.

    I think the main complication you have is that of using
    commaSeparatedList for your list of real numbers. commaSeparatedList
    is a very generic helper expression. From the online example
    (http://pyparsing.wikispaces.com/space/showimage/commasep.py), here is
    a sample of the data that commaSeparatedList will handle:

    "a,b,c,100.2,,3",
    "d, e, j k , m ",
    "'Hello, World', f, g , , 5.1,x",
    "John Doe, 123 Main St., Cleveland, Ohio",
    "Jane Doe, 456 St. James St., Los Angeles , California ",

    In other words, the content of the items between commas is pretty much
    anything that is *not* a comma. If you change your definition of
    atflist to:

    atflist = Suppress("(") + commaSeparatedList # + Suppress(")")

    (that is, comment out the trailing right paren), you'll get this
    successful parse result:

    ['0.21', '0.24', '0.6', '0.24', '0.24', '0.6)']

    In your example, you are parsing a list of floating point numbers, in
    a list delimited by commas, surrounded by parens. This definition of
    atflist should give you more control over the parsing process, and
    give you real floats to boot:

    floatnum = Combine(Word(nums) + "." + Word(nums) +
        Optional('e'+oneOf("+ -")+Word(nums)))
    floatnum.setParseAction(lambda t:float(t[0]))
    atflist = Suppress("(") + delimitedList(floatnum) + Suppress(")")

    Now I get this output for your parse test:

    [0.20999999999999999, 0.23999999999999999, 0.59999999999999998,
    0.23999999999999999, 0.23999999999999999, 0.59999999999999998]

    So you can see that this has actually parsed the numbers and converted
    them to floats.
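A note on the long digit strings in that output: they come from how older Python interpreters printed floats (17 significant digits), not from any loss of precision during parsing. A minimal sketch, using only the standard library, that reproduces both spellings of the same value:

```python
# Older interpreters printed floats with 17 significant digits; newer ones
# print the shortest string that converts back to the same float.
value = float("0.21")

old_style = "%.17g" % value   # 17 significant digits, as in the output above
new_style = repr(value)       # shortest round-tripping form on current Pythons

print(old_style)
print(new_style)

# Both strings denote exactly the same float.
assert float(old_style) == float(new_style) == value
```

Either way, the parse action has produced a real float rather than a string.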

    I went ahead and added support for scientific notation in floatnum,
    since I see that you have several atfvalues that are standalone
    floats, some using scientific notation. To add these, just expand
    atfvalues to:

    atfvalues = ( floatnum | Word(nums) | atfstr | atflist )
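Pulling the pieces so far into one script might look like the sketch below. It assumes pyparsing is installed and still exposes the camelCase names used in this thread. The `samples` list and the `.copy()` call on quotedString (so the parse action is not attached to the shared module-level object) are additions for illustration only.

```python
from pyparsing import (Combine, Optional, Suppress, Word, delimitedList,
                       nums, oneOf, quotedString, removeQuotes)

# A float needs a decimal point; the exponent part is optional.
floatnum = Combine(Word(nums) + "." + Word(nums) +
                   Optional("e" + oneOf("+ -") + Word(nums)))
floatnum.setParseAction(lambda t: float(t[0]))

# A parenthesised, comma-delimited list of floats.
atflist = Suppress("(") + delimitedList(floatnum) + Suppress(")")

# Quoted strings, with the surrounding quotes stripped.
atfstr = quotedString.copy().setParseAction(removeQuotes)

# floatnum comes first so "2.75e-05" is not cut short by Word(nums).
atfvalues = floatnum | Word(nums) | atfstr | atflist

samples = ["36", '"metal2"', "2.75e-05", "(0.21,0.24,0.6,0.24,0.24,0.6)"]
results = [atfvalues.parseString(s).asList() for s in samples]
for sample, result in zip(samples, results):
    print(sample, "->", result)
```

Note that plain integers still come back as strings here, because Word(nums) has no parse action attached.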

    (At this point, I'll go on to show how to parse the rest of the data
    structure - if you want to take a stab at it yourself, stop reading
    here, and then come back to compare your results with my approach.)

    To parse the overall structure, now that you have expressions for the
    different component pieces, look into using Dict (or more simply using
    the helper function dictOf) to define results names automagically for
    you based on the attribute names in the input. Dict does *not* change
    any of the parsing or matching logic, it just adds named fields in the
    parsed results corresponding to the key names found in the input.

    Dict is a complex pyparsing class, but dictOf simplifies things.
    dictOf takes two arguments:

    dictOf(keyExpression, valueExpression)

    This translates to:

    Dict( OneOrMore( Group(keyExpression + valueExpression) ) )

    For example, to parse the lists of entries that look like:

    name = "gtc"
    dielectric = 2.75e-05
    unitTimeName = "ns"
    timePrecision = 1000
    unitLengthName = "micron"
    etc.

    just define that this is "a dict of entries each composed of a key
    consisting of a Word(alphas), followed by a suppressed '=' sign and an
    atfvalues", that is:

    attrDict = dictOf(Word(alphas), Suppress("=") + atfvalues)
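To see what dictOf adds on top of plain matching, here is a small self-contained sketch. It assumes pyparsing is installed with the camelCase API; the simplified `value` expression and the two-line input are stand-ins for illustration, not the full grammar from this thread.

```python
from pyparsing import (Suppress, Word, alphas, dictOf, nums, quotedString,
                       removeQuotes)

# Simplified value expression: an integer string or a quoted string.
value = Word(nums) | quotedString.copy().setParseAction(removeQuotes)

# Each entry is "key = value"; dictOf handles the grouping and repetition.
attrs = dictOf(Word(alphas), Suppress("=") + value)

result = attrs.parseString('name = "gtc"\ntimePrecision = 1000')

as_list = result.asList()        # the matched tokens, grouped per entry
name = result["name"]            # dict-style access by key
precision = result.timePrecision # attribute-style access by key

print(as_list)
print(name, precision)
```

The token list is unchanged by Dict; the named lookups are layered on top of it.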

    dictOf takes care of all of the repetition and grouping necessary for
    Dict to do its work. These attribute dicts are nested within an outer
    main dict, which is "a dict of entries, each with a key of
    Word(alphas), and a value of an optional quotedString (an alias,
    perhaps?), a left brace, an attrDict, and a right brace," or:

    mainDict = dictOf(
        Word(alphas),
        Optional(quotedString)("alias") +
        Suppress("{") + attrDict + Suppress("}")
        )

    By adding this code to what you already have:

    attrDict = dictOf(Word(alphas), Suppress("=") + atfvalues)
    mainDict = dictOf(
        Word(alphas),
        Optional(quotedString)("alias") +
        Suppress("{") + attrDict + Suppress("}")
        )

    You can now write:

    md = mainDict.parseString(test1)
    print(md.dump())
    print(md.Layer.lineStyle)

    and get this output:

    [['Technology', ['name', 'gtc'], ['dielectric',
    2.7500000000000001e-005], ['unitTimeName', 'ns'], ['timePrecision',
    '1000'], ['unitLengthName', 'micron'], ['lengthPrecision', '1000'],
    ['gridResolution', '5'], ['unitVoltageName', 'v'],
    ['voltagePrecision', '1000000'], ['unitCurrentName', 'ma'],
    ['currentPrecision', '1000'], ['unitPowerName', 'pw'],
    ['powerPrecision', '1000'], ['unitResistanceName', 'kohm'],
    ['resistancePrecision', '10000000'], ['unitCapacitanceName', 'pf'],
    ['capacitancePrecision', '10000000'], ['unitInductanceName', 'nh'],
    ['inductancePrecision', '100']], ['Tile', 'unit', ['width', 0.22],
    ['height', 1.6899999999999999]], ['Layer', 'PRBOUNDARY',
    ['layerNumber', '0'], ['maskName', ''], ['visible', '1'],
    ['selectable', '1'], ['blink', '0'], ['color', 'cyan'], ['lineStyle',
    'solid'], ['pattern', 'blank'], ['pitch', '0'], ['defaultWidth', '0'],
    ['minWidth', '0'], ['minSpacing', '0']]]
    - Layer: ['PRBOUNDARY', ['layerNumber', '0'], ['maskName', ''],
    ['visible', '1'], ['selectable', '1'], ['blink', '0'], ['color',
    'cyan'], ['lineStyle', 'solid'], ['pattern', 'blank'], ['pitch', '0'],
    ['defaultWidth', '0'], ['minWidth', '0'], ['minSpacing', '0']]
      - alias: PRBOUNDARY
      - blink: 0
      - color: cyan
      - defaultWidth: 0
      - layerNumber: 0
      - lineStyle: solid
      - maskName:
      - minSpacing: 0
      - minWidth: 0
      - pattern: blank
      - pitch: 0
      - selectable: 1
      - visible: 1
    - Technology: [['name', 'gtc'], ['dielectric',
    2.7500000000000001e-005], ['unitTimeName', 'ns'], ['timePrecision',
    '1000'], ['unitLengthName', 'micron'], ['lengthPrecision', '1000'],
    ['gridResolution', '5'], ['unitVoltageName', 'v'],
    ['voltagePrecision', '1000000'], ['unitCurrentName', 'ma'],
    ['currentPrecision', '1000'], ['unitPowerName', 'pw'],
    ['powerPrecision', '1000'], ['unitResistanceName', 'kohm'],
    ['resistancePrecision', '10000000'], ['unitCapacitanceName', 'pf'],
    ['capacitancePrecision', '10000000'], ['unitInductanceName', 'nh'],
    ['inductancePrecision', '100']]
      - capacitancePrecision: 10000000
      - currentPrecision: 1000
      - dielectric: 2.75e-005
      - gridResolution: 5
      - inductancePrecision: 100
      - lengthPrecision: 1000
      - name: gtc
      - powerPrecision: 1000
      - resistancePrecision: 10000000
      - timePrecision: 1000
      - unitCapacitanceName: pf
      - unitCurrentName: ma
      - unitInductanceName: nh
      - unitLengthName: micron
      - unitPowerName: pw
      - unitResistanceName: kohm
      - unitTimeName: ns
      - unitVoltageName: v
      - voltagePrecision: 1000000
    - Tile: ['unit', ['width', 0.22], ['height', 1.6899999999999999]]
      - alias: unit
      - height: 1.69
      - width: 0.22
    solid

    Cheers!
    -- Paul
    Paul McGuire, Mar 22, 2008
    #2

  3. Paul McGuire

    Paul McGuire Guest

    Oof, I see that you have multiple "Layer" entries, with different
    qualifying labels. Since the dicts use "Layer" as the key, you only
    get the last "Layer" value, with qualifier "PRBOUNDARY", and lose the
    "Layer" for "METAL2". To fix this, you'll have to move the optional
    alias term to the key, and merge "Layer" and "PRBOUNDARY" into a
    single key, perhaps "Layer/PRBOUNDARY" or "Layer(PRBOUNDARY)" - a
    parse action should take care of this for you. Unfortunately, these
    forms will not allow you to use object attribute form
    (md.Layer.lineStyle); you will have to use dict access form
    (md["Layer(PRBOUNDARY)"].lineStyle), since these keys contain
    characters that are not valid in attribute names.

    Or you could add one more level of Dict nesting to your grammar, to
    permit access like "md.Layer.PRBOUNDARY.lineStyle".
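The key collision is easy to reproduce with plain Python dicts, independent of pyparsing. The sketch below uses hypothetical, trimmed-down entries and only illustrates the idea of a composite key. With a plain dict, whichever entry is stored last is the one that survives.

```python
# Hypothetical, trimmed-down entries: (section, label, attributes).
entries = [
    ("Layer", "PRBOUNDARY", {"lineStyle": "solid", "layerNumber": 0}),
    ("Layer", "METAL2", {"lineStyle": "solid", "layerNumber": 36}),
]

# Keyed by section name alone: the two "Layer" entries collide.
flat = {}
for section, label, attrs in entries:
    flat[section] = attrs

# Keyed by a composite "Section(label)" string: both entries survive.
composite = {}
for section, label, attrs in entries:
    composite["%s(%s)" % (section, label)] = attrs

print(sorted(flat))
print(sorted(composite))
print(composite["Layer(METAL2)"]["layerNumber"])
```

The composite keys contain parentheses, which is why only item access (square brackets) works for them.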

    -- Paul
    Paul McGuire, Mar 23, 2008
    #3
  4. rh0dium

    rh0dium Guest

    OK - well, I got as far as you did, but I did it a bit differently.
    Then I merged some of your data with my data. But now I am at the
    point of adding another level of the dict and am struggling. Here is
    what I have:

    # punctuation literals
    LPAR = Literal("(")
    RPAR = Literal(")")
    LBRACE = Literal("{")
    RBRACE = Literal("}")
    EQUAL = Literal("=")

    # This will get the values all figured out..
    # "metal2" 1 6.05E-05 30
    cvtInt = lambda toks: int(toks[0])
    cvtReal = lambda toks: float(toks[0])

    integer = Combine(Optional(oneOf("+ -")) + Word(nums))\
        .setParseAction( cvtInt )
    real = Combine(Optional(oneOf("+ -")) + Word(nums) + "." +
                   Optional(Word(nums)) +
                   Optional(oneOf("e E")+Optional(oneOf("+ -"))
                            +Word(nums)))\
        .setParseAction( cvtReal )
    atfstr = quotedString.setParseAction(removeQuotes)
    atflist = Group( LPAR.suppress() +
                     delimitedList(real, ",") +
                     RPAR.suppress() )

    atfvalues = ( real | integer | atfstr | atflist )

    # Now this should work out a single line inside a section
    # maskName = "metal2"
    # isDefaultLayer = 1
    # visible = 1
    # fatTblSpacing = (0.21,0.24,0.6,
    # 0.6,0.6,0.6)
    # minArea = 0.144
    atfkeys = Word(alphanums)
    attrDict = dictOf( atfkeys , EQUAL.suppress() + atfvalues)

    # Now we need to take care of the "Metal2" { one or more attrDict }
    # "METAL2" {
    #     layerNumber = 36
    #     maskName = "metal2"
    #     isDefaultLayer = 1
    #     visible = 1
    #     fatTblSpacing = (0.21,0.24,0.6,
    #                      0.24,0.24,0.6,
    #                      0.6,0.6,0.6)
    #     minArea = 0.144
    # }
    attrType = dictOf(atfstr, LBRACE.suppress() + attrDict +
                      RBRACE.suppress())

    # Lastly we need to get the ones without attributes (Technology)
    attrType2 = LBRACE.suppress() + attrDict + RBRACE.suppress()
    mainDict = dictOf(atfkeys, attrType2 | attrType )

    md = mainDict.parseString(test1)


    But I too am only getting the last layer. I thought that if I broke
    out the "alias" area and then built on that I'd be set, but I did
    something wrong.
    rh0dium, Mar 23, 2008
    #4
  5. Paul McGuire

    Paul McGuire Guest

    There are a couple of bugs in our program so far.

    First of all, our grammar isn't parsing the METAL2 entry at all. We
    should change this line:

    md = mainDict.parseString(test1)

    to

    md = (mainDict+stringEnd).parseString(test1)

    The parser is reading as far as it can, but then stopping once
    successful parsing is no longer possible. Since there is at least one
    valid entry matching the OneOrMore expression, then parseString raises
    no errors. By adding "+stringEnd" to our expression to be parsed, we
    are saying "once parsing is finished, we should be at the end of the
    input string". By making this change, we now get this parse
    exception:

    pyparsing.ParseException: Expected stringEnd (at char 1948), (line:54,
    col:1)
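The silent partial match is easy to reproduce on a toy grammar. The sketch below assumes pyparsing is installed with the camelCase API; the `numbers` expression and the input text are made up for illustration.

```python
from pyparsing import OneOrMore, ParseException, StringEnd, Word, nums

numbers = OneOrMore(Word(nums))
text = "1 2 3 oops 4"

# Without an end-of-input check, parsing stops quietly before "oops".
partial = numbers.parseString(text).asList()
print(partial)

# Requiring the end of the string turns the leftover text into an error.
try:
    (numbers + StringEnd()).parseString(text)
    failed = False
except ParseException as err:
    failed = True
    print(err)
```

Recent pyparsing releases also accept `parseString(text, parseAll=True)`, which has the same effect as appending the end-of-string expression.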

    So what is the matter with the METAL2 entries? After using brute
    force "divide and conquer" (I deleted half of the entries and got a
    successful parse, then restored half of the entries I removed, until I
    added back the entry that caused the parse to fail), I found these
    lines in the input:

    fatTblThreshold = (0,0.39,10.005)
    fatTblParallelLength = (0,1,0)

    Both of these violate the atflist definition, because they contain
    integers, not just floatnums. So we need to expand the definition of
    atflist:

    floatnum = Combine(Word(nums) + "." + Word(nums) +
        Optional('e'+oneOf("+ -")+Word(nums)))
    floatnum.setParseAction(lambda t:float(t[0]))
    integer = Word(nums).setParseAction(lambda t:int(t[0]))
    atflist = Suppress("(") + delimitedList(floatnum|integer) + \
                Suppress(")")
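The "divide and conquer" hunt for the offending line can also be automated. The following is a rough sketch using only the standard library: `first_failing_line` and the `floats_only` predicate are hypothetical helpers, and the predicate is just a stand-in for calling a real parser on a chunk of input. It assumes that once the bad line is included, every longer prefix fails too.

```python
def first_failing_line(lines, parses):
    """Return the index of the first line whose inclusion makes parsing
    fail, or None if all lines parse. Assumes every prefix that stops
    before that line parses and every prefix that contains it fails."""
    if parses(lines):
        return None
    lo, hi = 0, len(lines)      # lines[:lo] parses, lines[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if parses(lines[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1


def floats_only(chunk):
    # Stand-in for a real parser: reject "(...)" lists holding a bare integer.
    for line in chunk:
        if "(" in line:
            items = line[line.index("(") + 1:line.index(")")].split(",")
            if any("." not in item for item in items):
                return False
    return True


lines = ["pitch = 0.46", "minWidth = 0.2",
         "fatTblThreshold = (0,0.39,10.005)", "minArea = 0.144"]
bad = first_failing_line(lines, floats_only)
print(lines[bad])
```

With a real grammar, the predicate would wrap a parse call in a try/except and return False on a parse error.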

    Then we need to tackle the issue of adding nesting for those entries
    that have sub-keys. This is actually kind of tricky for your data
    example, because nesting within Dict expects input data to be nested.
    That is, nesting Dict's is normally done with data that is input like:

    main
      Technology
      Layer
        PRBOUNDARY
        METAL2
      Tile
        unit

    But your data is structured slightly differently:

    main
      Technology
      Layer PRBOUNDARY
      Layer METAL2
      Tile unit

    Because Layer is repeated, the second entry creates a new node named
    "Layer" at the second level, and the first "Layer" entry is lost. To
    fix this, we need to combine Layer and the layer id into a composite-
    type of key. I did this by using Group, and adding the Optional alias
    (which I see now is a poor name, "layerId" would be better) as a
    second element of the key:

    mainDict = dictOf(
        Group(Word(alphas)+Optional(quotedString)),
        Suppress("{") + attrDict + Suppress("}")
        )

    But now if we parse the input with this mainDict, we see that the keys
    are no longer nice simple strings, but they are 1- or 2-element
    ParseResults objects. Here is what I get from the command "print
    md.keys()":

    [(['Technology'], {}), (['Tile', 'unit'], {}), (['Layer',
    'PRBOUNDARY'], {}), (['Layer', 'METAL2'], {})]

    So to finally clear this up, we need one more parse action, attached
    to the mainDict expression, that rearranges the subdicts using the
    elements in the keys. The parse action looks like this, and it will
    process the overall parse results for the entire data structure:

    def rearrangeSubDicts(toks):
        # iterate over a snapshot of all key-value pairs in the dict,
        # since toks is modified inside the loop
        for key,value in list(toks.items()):
            # key is of the form ['name'] or ['name', 'name2']
            # and the value is the attrDict

            # if key has just one element, use it to define
            # a simple string key
            if len(key)==1:
                toks[key[0]] = value
            else:
                # if the key has two elements, create a
                # subnode with the first element
                if key[0] not in toks:
                    toks[key[0]] = ParseResults([])

                # add an entry for the second key element
                toks[key[0]][key[1]] = value

            # now delete the original key that is the form
            # ['name'] or ['name', 'name2']
            del toks[key]

    It looks a bit messy, but the point is to modify the tokens in place,
    by rearranging the attrdicts to nodes with simple string keys, instead
    of keys nested in structures.

    Lastly, we attach the parse action in the usual way:

    mainDict.setParseAction(rearrangeSubDicts)

    Now you can access the fields of the different layers as:

    print(md.Layer.METAL2.lineStyle)

    I guess this all looks pretty convoluted. You might be better off
    just doing your own Group'ing, and then navigating the nested lists to
    build your own dict or other data structure.
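As a rough idea of what that do-it-yourself route could look like, the sketch below builds a nested dict from plain nested lists. The list shapes ([section, pairs] or [section, label, pairs]) are an assumption about how the grouped results might be arranged, and `build_dict` is a hypothetical helper, not part of pyparsing.

```python
def build_dict(sections):
    """Build a nested dict from lists shaped like [section, pairs] or
    [section, label, pairs], where pairs is a list of [key, value]."""
    result = {}
    for section in sections:
        attrs = dict(section[-1])          # [key, value] pairs -> dict
        if len(section) == 2:
            result[section[0]] = attrs     # e.g. Technology
        else:
            # e.g. Layer "METAL2": nest under the section name by label
            result.setdefault(section[0], {})[section[1]] = attrs
    return result


# Hypothetical grouped results, trimmed down from the data in this thread.
grouped = [
    ["Technology", [["name", "gtc"], ["gridResolution", 5]]],
    ["Layer", "PRBOUNDARY", [["layerNumber", 0], ["lineStyle", "solid"]]],
    ["Layer", "METAL2", [["layerNumber", 36], ["lineStyle", "solid"]]],
]
tech = build_dict(grouped)
print(tech["Layer"]["METAL2"]["layerNumber"])
```

Because the result is an ordinary dict, repeated section names simply gain more labels instead of overwriting each other.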

    -- Paul
    Paul McGuire, Mar 23, 2008
    #5
  6. On Mar 22, 9:11 pm, rh0dium <> wrote:
    > Hi all,


    Hi,

    > I am struggling with parsing the following data:
    >
    > test1 = """
    > Technology      {
    >                 name                            = "gtc"
    >                 dielectric                      = 2.75e-05

    [...]

    I know it's cheating, but the grammar of your example is actually
    quite simple and the values are valid python expressions, so here is a
    solution without pyparsing (or regexps, for that matter). *WARNING*
    it uses the exec statement.

    from textwrap import dedent

    def parse(txt):
        globs, parsed = {}, {}
        units = txt.strip().split('}')[:-1]
        for unit in units:
            label, params = unit.split('{')
            paramdict = {}
            # the exec(code, globals, locals) form runs on Python 2 and 3
            exec(dedent(params), globs, paramdict)
            try:
                label, key = label.split()
                parsed.setdefault(label, {})[eval(key)] = paramdict
            except ValueError:
                parsed[label.strip()] = paramdict
        return parsed

    >>> p = parse(test1)
    >>> p['Layer']['PRBOUNDARY']
    {'maskName': '', 'defaultWidth': 0, 'color': 'cyan', 'pattern':
    'blank', 'layerNumber': 0, 'minSpacing': 0, 'blink': 0, 'minWidth': 0,
    'visible': 1, 'pitch': 0, 'selectable': 1, 'lineStyle': 'solid'}
    >>> p['Layer']['METAL2']['maskName']
    'metal2'
    >>> p['Technology']['gridResolution']
    5
    >>>
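If running arbitrary input through exec is a concern, one possible variation (a swapped-in technique, not what the post above does) is to evaluate each value with ast.literal_eval, which only accepts Python literals. The sketch below handles a single block body; `parse_block` is a hypothetical helper, and its parenthesis counting would be fooled by parentheses inside quoted strings.

```python
import ast


def parse_block(body):
    """Parse 'key = value' lines into a dict, joining continuation
    lines until the parentheses balance."""
    params, pending = {}, ""
    for line in body.splitlines():
        pending += line.strip()
        # Skip blank lines; keep accumulating while a "(" is still open.
        if not pending or pending.count("(") > pending.count(")"):
            continue
        key, _, value = pending.partition("=")
        # literal_eval accepts only literals (numbers, strings, tuples...).
        params[key.strip()] = ast.literal_eval(value.strip())
        pending = ""
    return params


body = '''
maskName = "metal2"
pitch = 0.46
fatTblSpacing = (0.21,0.24,0.6,
0.24,0.24,0.6)
'''
params = parse_block(body)
print(params)
```

Splitting the file into sections and labels would still be done as in parse() above.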


    HTH

    --
    Arnaud
    Arnaud Delobelle, Mar 23, 2008
    #6
  7. rh0dium

    rh0dium Guest

    Hi Paul,

    Before I continue, I must thank you for your help. You really did
    an outstanding job on this code, and it is really straightforward
    to use and learn from. This was a fun weekend task and I really
    wanted to use pyparsing for it, because this is one of several types
    of files I want to parse. I (as I'm sure you would agree) think
    rearrangeSubDicts is a bit of a hack, but nevertheless absolutely
    required due to the limitations of the data I am parsing. Once
    again, thanks for your great help. Now the problem..

    I attempted to use this code on another test case. This test case had
    tabs in it. I think 1.4.11 is missing the expandtabs attribute. I
    ran my code (which had tabs) and I got this:

    AttributeError: 'builtin_function_or_method' object has no attribute
    'expandtabs'

    Uh oh. Is this a pyparsing problem or am I just an idiot?

    Thanks again!
    rh0dium, Mar 23, 2008
    #7
  8. rh0dium

    rh0dium Guest

    On Mar 23, 1:48 pm, rh0dium <> wrote:
    > On Mar 23, 12:26 am, Paul McGuire <> wrote:
    >
    >
    >
    > > There are a couple of bugs in our program so far.

    >
    > > First of all, our grammar isn't parsing the METAL2 entry at all.  We
    > > should change this line:

    >
    > >     md = mainDict.parseString(test1)

    >
    > > to

    >
    > >     md = (mainDict+stringEnd).parseString(test1)

    >
    > > The parser is reading as far as it can, but then stopping once
    > > successful parsing is no longer possible.  Since there is at least one
    > > valid entry matching the OneOrMore expression, then parseString raises
    > > no errors.  By adding "+stringEnd" to our expression to be parsed, we
    > > are saying "once parsing is finished, we should be at the end of the
    > > input string".  By making this change, we now get this parse
    > > exception:

    >
    > > pyparsing.ParseException: Expected stringEnd (at char 1948), (line:54,
    > > col:1)

    >
    > > So what is the matter with the METAL2 entries?  After using brute
    > > force "divide and conquer" (I deleted half of the entries and got a
    > > successful parse, then restored half of the entries I removed, until I
    > > added back the entry that caused the parse to fail), I found these
    > > lines in the input:

    >
    > >     fatTblThreshold                 = (0,0.39,10.005)
    > >     fatTblParallelLength            = (0,1,0)

    >
    > > Both of these violate the atflist definition, because they contain
    > > integers, not just floatnums.  So we need to expand the definition of
    > > aftlist:

    >
    > >     floatnum = Combine(Word(nums) + "." + Word(nums) +
    > >         Optional('e'+oneOf("+ -")+Word(nums)))
    > >     floatnum.setParseAction(lambda t:float(t[0]))
    > >     integer = Word(nums).setParseAction(lambda t:int(t[0]))
    > >     atflist = Suppress("(") + delimitedList(floatnum|integer) + \
    > >                 Suppress(")")

    >
    > > Then we need to tackle the issue of adding nesting for those entries
    > > that have sub-keys.  This is actually kind of tricky for your data
    > > example, because nesting within Dict expects input data to be nested.
    > > That is, nesting Dict's is normally done with data that is input like:

    >
    > > main
    > >   Technology
    > >   Layer
    > >     PRBOUNDARY
    > >     METAL2
    > >   Tile
    > >     unit

    >
    > > But your data is structured slightly differently:

    >
    > > main
    > >   Technology
    > >   Layer PRBOUNDARY
    > >   Layer METAL2
    > >   Tile unit

    >
    > > Because Layer is repeated, the second entry creates a new node named
    > > "Layer" at the second level, and the first "Layer" entry is lost.  To
    > > fix this, we need to combine Layer and the layer id into a composite-
    > > type of key.  I did this by using Group, and adding the Optional alias
    > > (which I see now is a poor name, "layerId" would be better) as a
    > > second element of the key:
    > >
    > >     mainDict = dictOf(
    > >         Group(Word(alphas)+Optional(quotedString)),
    > >         Suppress("{") + attrDict + Suppress("}")
    > >         )
    > >
    > > But now if we parse the input with this mainDict, we see that the keys
    > > are no longer nice simple strings, but they are 1- or 2-element
    > > ParseResults objects.  Here is what I get from the command "print
    > > md.keys()":
    > >
    > > [(['Technology'], {}), (['Tile', 'unit'], {}), (['Layer',
    > > 'PRBOUNDARY'], {}), (['Layer', 'METAL2'], {})]
    > >
    > > So to finally clear this up, we need one more parse action, attached
    > > to the mainDict expression, that rearranges the subdicts using the
    > > elements in the keys.  The parse action looks like this, and it will
    > > process the overall parse results for the entire data structure:
    > >
    > >     def rearrangeSubDicts(toks):
    > >         # iterate over all key-value pairs in the dict
    > >         for key,value in toks.items():
    > >             # key is of the form ['name'] or ['name', 'name2']
    > >             # and the value is the attrDict
    > >
    > >             # if key has just one element, use it to define
    > >             # a simple string key
    > >             if len(key)==1:
    > >                 toks[key[0]] = value
    > >             else:
    > >                 # if the key has two elements, create a
    > >                 # subnode with the first element
    > >                 if key[0] not in toks:
    > >                     toks[key[0]] = ParseResults([])
    > >
    > >                 # add an entry for the second key element
    > >                 toks[key[0]][key[1]] = value
    > >
    > >             # now delete the original key that is the form
    > >             # ['name'] or ['name', 'name2']
    > >             del toks[key]
    > >
    > > It looks a bit messy, but the point is to modify the tokens in place,
    > > by rearranging the attrdicts to nodes with simple string keys, instead
    > > of keys nested in structures.
    > >
    > > Lastly, we attach the parse action in the usual way:
    > >
    > >     mainDict.setParseAction(rearrangeSubDicts)
    > >
    > > Now you can access the fields of the different layers as:
    > >
    > >     print md.Layer.METAL2.lineStyle
    > >
    > > I guess this all looks pretty convoluted.  You might be better off
    > > just doing your own Group'ing, and then navigating the nested lists to
    > > build your own dict or other data structure.
    > >
    > > -- Paul
    >
    > Hi Paul,
    >
    > Before I continue this I must thank you for your help.  You really did
    > do an outstanding job on this code and it is really straightforward
    > to use and learn from.  This was a fun weekend task and I really
    > wanted to use pyparsing to do it, because this is one of several types
    > of files I want to parse.  I (as I'm sure you would agree) think
    > rearrangeSubDicts is a bit of a hack, but nevertheless absolutely
    > required due to the limitations of the data I am parsing.  Once
    > again thanks for your great help.  Now the problem..
    >
    > I attempted to use this code on another testcase.  This testcase had
    > tabs in it.  I think 1.4.11 is missing the expandtabs attribute.  I
    > ran my code (which had tabs) and I got this..
    >
    > AttributeError: 'builtin_function_or_method' object has no attribute
    > 'expandtabs'
    >
    > Uh oh.  Is this a pyparsing problem or am I just an idiot?
    >
    > Thanks again!


    Doh!! Never mind, I am an idiot. I got it.. what a bonehead.

    I needed to tweak it a bit to ignore the comments. Namely, this fixed
    it up:

        mainDict = dictOf(
            Group(Word(alphas)+Optional(quotedString)),
            Suppress("{") + attrDict + Suppress("}")
            ) | cStyleComment.suppress()

    Thanks again. Now I just need to figure out how to use your dicts to
    do some work..
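
    For anyone who hits the same AttributeError: the post does not say what
    the actual mistake was, but a message of this form is what Python
    produces when a method object (for example f.read without the
    parentheses) is passed where a string is expected. pyparsing's
    parseString expands tabs in its input by default, so a non-string
    argument fails at that step. A minimal stdlib-only sketch, where
    expand_like_parser is a hypothetical stand-in for that step and not
    pyparsing code:

```python
import io


def expand_like_parser(instring):
    # Hypothetical stand-in: mimics only the tab expansion that a parser
    # applies to its input string before matching.
    return instring.expandtabs()


source = io.StringIO("width\t=\t0.22\n")

# Mistake: passing the bound method itself instead of the text it returns.
try:
    expand_like_parser(source.read)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Correct: call read() so that a str is passed.
source.seek(0)
expanded = expand_like_parser(source.read())

print(error_message)
print(repr(expanded))
```

    If the traceback names a function or method object rather than str,
    checking for a missing pair of parentheses is a cheap first step.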
    rh0dium, Mar 23, 2008
    #8
  9. On Sat, 22 Mar 2008 14:11:16 -0700, rh0dium wrote:

    > Hi all,
    >
    > I am struggling with parsing the following data:
    >
    > test1 = """
    > Technology {
    > name = "gtc"
    > dielectric = 2.75e-05
    > unitTimeName = "ns"
    > timePrecision = 1000
    > unitLengthName = "micron"
    > lengthPrecision = 1000
    > gridResolution = 5
    > unitVoltageName = "v"
    > voltagePrecision = 1000000
    > unitCurrentName = "ma"
    > currentPrecision = 1000
    > unitPowerName = "pw"
    > powerPrecision = 1000
    > unitResistanceName = "kohm"
    > resistancePrecision = 10000000
    > unitCapacitanceName = "pf"
    > capacitancePrecision = 10000000
    > unitInductanceName = "nh"
    > inductancePrecision = 100
    > }
    >
    > Tile "unit" {
    > width = 0.22
    > height = 1.69
    > }
    >
    >


    Did you think of using something a bit more sophisticated than pyparsing?
    I have had good experience using ply, a pure-Python implementation
    of the yacc/lex tools, which I used to extract significant data from C
    programs to automate documentation.

    I had never used yacc or similar tools before, but having a bit of
    experience with BNF notation, I found ply easy enough. In my case, the
    major problem was coping with yacc's limitations in describing C syntax
    (which I solved by "relaxing" the rules a bit, since I was only going to
    process already-compiled C code). In your much simpler case, I'd say
    that a few production rules should be enough.

    P.S.: there are other, faster and maybe more complete Python parsers, but
    as I said ply is pure Python: no external libraries, and it runs everywhere.

    Ciao
    -------
    FB
    Francesco Bochicchio, Mar 24, 2008
    #9
  10. Paul McGuire

    Paul McGuire Guest

    On Mar 23, 4:04 pm, rh0dium wrote:
    >
    > I needed to tweak it a bit to ignore the comments..  Namely this fixed
    > it up..
    >
    >     mainDict = dictOf(
    >             Group(Word(alphas)+Optional(quotedString)),
    >             Suppress("{") + attrDict + Suppress("}")
    >             ) | cStyleComment.suppress()
    >
    > Thanks again.  Now I just need to figure out how to use your dicts to
    > do some work..


    I'm glad this is coming around to some reasonable degree of completion
    for you. One last thought - your handling of comments is a bit crude,
    and will not handle comments that crop up in the middle of dict
    entries, as in:

        color = /* using non-standard color during testing */
            "plum"

    The more comprehensive way to handle comments is to call ignore.
    Using ignore will propagate the comment handling to all embedded
    expressions, so you only need to call ignore once on the top-most
    pyparsing expression, as in:

        mainDict.ignore(cStyleComment)

    Also, ignore does token suppression automatically.
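
    A minimal sketch of the ignore approach, assuming a cut-down grammar:
    the names value, attr, and block below are illustrative stand-ins, not
    the attrDict and mainDict defined earlier in the thread. It parses one
    brace-delimited block whose entry has a comment between the "=" and
    the value:

```python
from pyparsing import (Dict, Group, Suppress, Word, ZeroOrMore, alphas,
                       alphanums, cStyleComment, quotedString, removeQuotes)

# Illustrative stand-in grammar (an assumption, not the thread's full grammar).
value = (quotedString.copy().setParseAction(removeQuotes)
         | Word(alphanums + ".-+"))
attr = Group(Word(alphas, alphanums) + Suppress("=") + value)
block = Suppress("{") + Dict(ZeroOrMore(attr)) + Suppress("}")

# One call on the top-most expression; the setting propagates to the
# expressions it contains, so comments are skipped wherever they appear.
block.ignore(cStyleComment)

sample = """
{
    color = /* using non-standard color during testing */
        "plum"
    pitch = 0.46
}
"""

result = block.parseString(sample)
print(result["color"])
print(result["pitch"])
```

    Note that the values come back as strings here, since this sketch
    attaches no numeric conversion to the unquoted values.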

    -- Paul
    Paul McGuire, Mar 24, 2008
    #10