Pyparsing help

rh0dium

Hi all,

I am struggling with parsing the following data:

test1 = """
Technology {
name = "gtc"
dielectric = 2.75e-05
unitTimeName = "ns"
timePrecision = 1000
unitLengthName = "micron"
lengthPrecision = 1000
gridResolution = 5
unitVoltageName = "v"
voltagePrecision = 1000000
unitCurrentName = "ma"
currentPrecision = 1000
unitPowerName = "pw"
powerPrecision = 1000
unitResistanceName = "kohm"
resistancePrecision = 10000000
unitCapacitanceName = "pf"
capacitancePrecision = 10000000
unitInductanceName = "nh"
inductancePrecision = 100
}

Tile "unit" {
width = 0.22
height = 1.69
}

Layer "PRBOUNDARY" {
layerNumber = 0
maskName = ""
visible = 1
selectable = 1
blink = 0
color = "cyan"
lineStyle = "solid"
pattern = "blank"
pitch = 0
defaultWidth = 0
minWidth = 0
minSpacing = 0
}

Layer "METAL2" {
layerNumber = 36
maskName = "metal2"
isDefaultLayer = 1
visible = 1
selectable = 1
blink = 0
color = "yellow"
lineStyle = "solid"
pattern = "blank"
pitch = 0.46
defaultWidth = 0.2
minWidth = 0.2
minSpacing = 0.21
fatContactThreshold = 1.4
maxSegLenForRC = 2000
unitMinResistance = 6.1e-05
unitNomResistance = 6.3e-05
unitMaxResistance = 6.9e-05
unitMinHeightFromSub = 1.21
unitNomHeightFromSub = 1.237
unitMaxHeightFromSub = 1.267
unitMinThickness = 0.25
unitNomThickness = 0.475
unitMaxThickness = 0.75
fatTblDimension = 3
fatTblThreshold = (0,0.39,10.005)
fatTblParallelLength = (0,1,0)
fatTblSpacing = (0.21,0.24,0.6,
0.24,0.24,0.6,
0.6,0.6,0.6)
minArea = 0.144
}
"""

So, starting from the inside out, it looks like I have a key and a
value, where the value can be a QuotedString, a Word(nums), or a list
of numbers.

So my code to catch this looks like this..

atflist = Suppress("(") + commaSeparatedList + Suppress(")")
atfstr = quotedString.setParseAction(removeQuotes)
atfvalues = ( Word(nums) | atfstr | atflist )

l = ("36", '"metal2"', '(0.21,0.24,0.6,0.24,0.24,0.6)')

for x in l:
    print atfvalues.parseString(x)

But this isn't parsing the list with commaSeparatedList. Can someone
point out my errors?

As a side note: Is this the right approach to using pyparsing? Do we
start from the inside and work our way out, or should I have started
by looking at the bigger picture ( keyword + "{" + OneOrMore key /
vals + "}" )? I started there but couldn't figure out how to handle
multiple lines - I'm assuming I'd just join them all up?

Thanks
 
Paul McGuire

I think your "inside-out" approach is just fine. Start by composing
expressions for the different "pieces" of your input text, then
steadily build up more and more complex forms.

I think the main complication you have is that of using
commaSeparatedList for your list of real numbers. commaSeparatedList
is a very generic helper expression. From the online example
(http://pyparsing.wikispaces.com/space/showimage/commasep.py), here is
a sample of the data that commaSeparatedList will handle:

"a,b,c,100.2,,3",
"d, e, j k , m ",
"'Hello, World', f, g , , 5.1,x",
"John Doe, 123 Main St., Cleveland, Ohio",
"Jane Doe, 456 St. James St., Los Angeles , California ",

In other words, the content of the items between commas is pretty much
anything that is *not* a comma. If you change your definition of
atflist to:

atflist = Suppress("(") + commaSeparatedList # + Suppress(")")

(that is, comment out the trailing right paren), you'll get this
successful parse result:

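['0.21', '0.24', '0.6', '0.24', '0.24', '0.6)']

Note the last item, '0.6)': commaSeparatedList swallowed the closing
paren as part of that item, so the trailing Suppress(")") had nothing
left to match.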

In your example, you are parsing a list of floating point numbers, in
a list delimited by commas, surrounded by parens. This definition of
atflist should give you more control over the parsing process, and
give you real floats to boot:

floatnum = Combine(Word(nums) + "." + Word(nums) +
    Optional('e'+oneOf("+ -")+Word(nums)))
floatnum.setParseAction(lambda t:float(t[0]))
atflist = Suppress("(") + delimitedList(floatnum) + Suppress(")")

Now I get this output for your parse test:

[0.20999999999999999, 0.23999999999999999, 0.59999999999999998,
0.23999999999999999, 0.23999999999999999, 0.59999999999999998]

So you can see that this has actually parsed the numbers and converted
them to floats.

I went ahead and added support for scientific notation in floatnum,
since I see that you have several atfvalues that are standalone
floats, some using scientific notation. To add these, just expand
atfvalues to:

atfvalues = ( floatnum | Word(nums) | atfstr | atflist )
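
Put together, the value expressions so far can be tried out as follows. This is only a minimal sketch written for Python 3; it assumes a pyparsing release that still provides the camelCase names used in this thread, and the use of quotedString.copy() is a choice made for this example so that the shared quotedString object is left unmodified.

```python
from pyparsing import (Combine, Optional, Suppress, Word, delimitedList,
                       nums, oneOf, quotedString, removeQuotes)

# A real number: digits, ".", digits, then an optional signed exponent.
floatnum = Combine(Word(nums) + "." + Word(nums) +
                   Optional("e" + oneOf("+ -") + Word(nums)))
floatnum.setParseAction(lambda t: float(t[0]))

# Copy quotedString so the parse action is not attached to the shared object.
atfstr = quotedString.copy().setParseAction(removeQuotes)
atflist = Suppress("(") + delimitedList(floatnum) + Suppress(")")

# floatnum must come before Word(nums); otherwise "0.21" would match as "0".
atfvalues = floatnum | Word(nums) | atfstr | atflist

samples = ['36', '"metal2"', '2.75e-05', '(0.21,0.24,0.6,0.24,0.24,0.6)']
parsed = [atfvalues.parseString(s).asList() for s in samples]
for s, p in zip(samples, parsed):
    print(s, "->", p)
```

In this sketch the integer stays a string because Word(nums) has no parse action attached, while the floats and the list items are converted by floatnum.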

(At this point, I'll go on to show how to parse the rest of the data
structure - if you want to take a stab at it yourself, stop reading
here, and then come back to compare your results with my approach.)

To parse the overall structure, now that you have expressions for the
different component pieces, look into using Dict (or more simply using
the helper function dictOf) to define results names automagically for
you based on the attribute names in the input. Dict does *not* change
any of the parsing or matching logic, it just adds named fields in the
parsed results corresponding to the key names found in the input.

Dict is a complex pyparsing class, but dictOf simplifies things.
dictOf takes two arguments:

dictOf(keyExpression, valueExpression)

This translates to:

Dict( OneOrMore( Group(keyExpression + valueExpression) ) )

For example, to parse the lists of entries that look like:

name = "gtc"
dielectric = 2.75e-05
unitTimeName = "ns"
timePrecision = 1000
unitLengthName = "micron"
etc.

just define that this is "a dict of entries each composed of a key
consisting of a Word(alphas), followed by a suppressed '=' sign and an
atfvalues", that is:

attrDict = dictOf(Word(alphas), Suppress("=") + atfvalues)
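
To see what dictOf adds, here is a small self-contained sketch for Python 3. The value expression is deliberately simplified (integers and quoted strings only) and the sample text is made up for the illustration; only the dictOf call itself mirrors the line above.

```python
from pyparsing import (Suppress, Word, alphas, dictOf, nums, quotedString,
                       removeQuotes)

# Simplified value set for this illustration: integers or quoted strings.
value = Word(nums) | quotedString.copy().setParseAction(removeQuotes)
attrDict = dictOf(Word(alphas), Suppress("=") + value)

sample = '''
    name          = "gtc"
    timePrecision = 1000
    unitTimeName  = "ns"
'''
result = attrDict.parseString(sample)

print(result.asList())       # the matched tokens, grouped per entry
print(result["name"])        # dict-style access by key
print(result.timePrecision)  # attribute-style access to the same data
```

The token list is the same as it would be without Dict; the keyed access is layered on top of it.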

dictOf takes care of all of the repetition and grouping necessary for
Dict to do its work. These attribute dicts are nested within an outer
main dict, which is "a dict of entries, each with a key of
Word(alphas), and a value of an optional quotedString (an alias,
perhaps?), a left brace, an attrDict, and a right brace," or:

mainDict = dictOf(
    Word(alphas),
    Optional(quotedString)("alias") +
    Suppress("{") + attrDict + Suppress("}")
    )

By adding this code to what you already have:

attrDict = dictOf(Word(alphas), Suppress("=") + atfvalues)
mainDict = dictOf(
    Word(alphas),
    Optional(quotedString)("alias") +
    Suppress("{") + attrDict + Suppress("}")
    )

You can now write:

md = mainDict.parseString(test1)
print md.dump()
print md.Layer.lineStyle

and get this output:

[['Technology', ['name', 'gtc'], ['dielectric',
2.7500000000000001e-005], ['unitTimeName', 'ns'], ['timePrecision',
'1000'], ['unitLengthName', 'micron'], ['lengthPrecision', '1000'],
['gridResolution', '5'], ['unitVoltageName', 'v'],
['voltagePrecision', '1000000'], ['unitCurrentName', 'ma'],
['currentPrecision', '1000'], ['unitPowerName', 'pw'],
['powerPrecision', '1000'], ['unitResistanceName', 'kohm'],
['resistancePrecision', '10000000'], ['unitCapacitanceName', 'pf'],
['capacitancePrecision', '10000000'], ['unitInductanceName', 'nh'],
['inductancePrecision', '100']], ['Tile', 'unit', ['width', 0.22],
['height', 1.6899999999999999]], ['Layer', 'PRBOUNDARY',
['layerNumber', '0'], ['maskName', ''], ['visible', '1'],
['selectable', '1'], ['blink', '0'], ['color', 'cyan'], ['lineStyle',
'solid'], ['pattern', 'blank'], ['pitch', '0'], ['defaultWidth', '0'],
['minWidth', '0'], ['minSpacing', '0']]]
- Layer: ['PRBOUNDARY', ['layerNumber', '0'], ['maskName', ''],
['visible', '1'], ['selectable', '1'], ['blink', '0'], ['color',
'cyan'], ['lineStyle', 'solid'], ['pattern', 'blank'], ['pitch', '0'],
['defaultWidth', '0'], ['minWidth', '0'], ['minSpacing', '0']]
- alias: PRBOUNDARY
- blink: 0
- color: cyan
- defaultWidth: 0
- layerNumber: 0
- lineStyle: solid
- maskName:
- minSpacing: 0
- minWidth: 0
- pattern: blank
- pitch: 0
- selectable: 1
- visible: 1
- Technology: [['name', 'gtc'], ['dielectric',
2.7500000000000001e-005], ['unitTimeName', 'ns'], ['timePrecision',
'1000'], ['unitLengthName', 'micron'], ['lengthPrecision', '1000'],
['gridResolution', '5'], ['unitVoltageName', 'v'],
['voltagePrecision', '1000000'], ['unitCurrentName', 'ma'],
['currentPrecision', '1000'], ['unitPowerName', 'pw'],
['powerPrecision', '1000'], ['unitResistanceName', 'kohm'],
['resistancePrecision', '10000000'], ['unitCapacitanceName', 'pf'],
['capacitancePrecision', '10000000'], ['unitInductanceName', 'nh'],
['inductancePrecision', '100']]
- capacitancePrecision: 10000000
- currentPrecision: 1000
- dielectric: 2.75e-005
- gridResolution: 5
- inductancePrecision: 100
- lengthPrecision: 1000
- name: gtc
- powerPrecision: 1000
- resistancePrecision: 10000000
- timePrecision: 1000
- unitCapacitanceName: pf
- unitCurrentName: ma
- unitInductanceName: nh
- unitLengthName: micron
- unitPowerName: pw
- unitResistanceName: kohm
- unitTimeName: ns
- unitVoltageName: v
- voltagePrecision: 1000000
- Tile: ['unit', ['width', 0.22], ['height', 1.6899999999999999]]
- alias: unit
- height: 1.69
- width: 0.22
solid

Cheers!
-- Paul
 
Paul McGuire

Oof, I see that you have multiple "Layer" entries, with different
qualifying labels. Since the dicts use "Layer" as the key, you only
get the last "Layer" value, with qualifier "PRBOUNDARY", and lose the
"Layer" for "METAL2". To fix this, you'll have to move the optional
alias term to the key, and merge "Layer" and "PRBOUNDARY" into a
single key, perhaps "Layer/PRBOUNDARY" or "Layer(PRBOUNDARY)" - a
parse action should take care of this for you. Unfortunately, these
forms will not allow you to use object attribute form
(md.Layer.lineStyle); you will have to use dict access form
(md["Layer(PRBOUNDARY)"].lineStyle), since these keys have characters
that are not valid attribute name characters.

Or you could add one more level of Dict nesting to your grammar, to
permit access like "md.Layer.PRBOUNDARY.lineStyle".
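
The merged-key idea might look something like the sketch below (Python 3). The mergeKey function, the sectionKey name, the simplified value expression, and the sample text are all made up for this illustration; it is one possible way to write such a parse action, not the code used later in the thread.

```python
from pyparsing import (Optional, Suppress, Word, alphas, dictOf, nums,
                       quotedString, removeQuotes)

atfstr = quotedString.copy().setParseAction(removeQuotes)
attrDict = dictOf(Word(alphas), Suppress("=") + (Word(nums) | atfstr))

def mergeKey(tokens):
    # ['Layer', 'METAL2'] becomes 'Layer(METAL2)'.
    # Returning None for ['Technology'] leaves the tokens unchanged.
    if len(tokens) == 2:
        return "%s(%s)" % (tokens[0], tokens[1])

sectionKey = (Word(alphas) + Optional(atfstr)).setParseAction(mergeKey)
mainDict = dictOf(sectionKey, Suppress("{") + attrDict + Suppress("}"))

sample = '''
Technology { name = "gtc" gridResolution = 5 }
Layer "PRBOUNDARY" { layerNumber = 0 lineStyle = "solid" }
Layer "METAL2" { layerNumber = 36 lineStyle = "solid" }
'''
md = mainDict.parseString(sample)
print(sorted(md.keys()))
print(md["Layer(METAL2)"]["layerNumber"])
```

Both Layer sections survive because their keys now differ, at the cost of dict-style access for the merged keys.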

-- Paul
 
rh0dium

OK - Well, I got as far as you did, but I did it a bit differently.
Then I merged some of your data with my data. But now I am at the
point of adding another level to the dict and am struggling. Here is
what I have:

# punctuation literals
LPAR = Literal("(")
RPAR = Literal(")")
LBRACE = Literal("{")
RBRACE = Literal("}")
EQUAL = Literal("=")

# This will get the values all figured out..
# "metal2" 1 6.05E-05 30
cvtInt = lambda toks: int(toks[0])
cvtReal = lambda toks: float(toks[0])

integer = Combine(Optional(oneOf("+ -")) + Word(nums))\
    .setParseAction( cvtInt )
real = Combine(Optional(oneOf("+ -")) + Word(nums) + "." +
               Optional(Word(nums)) +
               Optional(oneOf("e E")+Optional(oneOf("+ -"))
                        +Word(nums)))\
    .setParseAction( cvtReal )
atfstr = quotedString.setParseAction(removeQuotes)
atflist = Group( LPAR.suppress() +
                 delimitedList(real, ",") +
                 RPAR.suppress() )

atfvalues = ( real | integer | atfstr | atflist )

# Now this should work out a single line inside a section
# maskName = "metal2"
# isDefaultLayer = 1
# visible = 1
# fatTblSpacing = (0.21,0.24,0.6,
# 0.6,0.6,0.6)
# minArea = 0.144
atfkeys = Word(alphanums)
attrDict = dictOf( atfkeys , EQUAL.suppress() + atfvalues)

# Now we need to take care of the "Metal2" { one or more attrDict }
# "METAL2" {
# layerNumber = 36
# maskName = "metal2"
# isDefaultLayer = 1
# visible = 1
# fatTblSpacing = (0.21,0.24,0.6,
#                  0.24,0.24,0.6,
#                  0.6,0.6,0.6)
# minArea = 0.144
# }
attrType = dictOf(atfstr, LBRACE.suppress() + attrDict +
                  RBRACE.suppress())

# Lastly we need to get the ones without attributes (Technology)
attrType2 = LBRACE.suppress() + attrDict + RBRACE.suppress()
mainDict = dictOf(atfkeys, attrType2 | attrType )

md = mainDict.parseString(test1)


But I too am only getting the last layer. I thought that if I broke
out the "alias" area and then built on that I'd be set, but I did
something wrong.
 
Paul McGuire

There are a couple of bugs in our program so far.

First of all, our grammar isn't parsing the METAL2 entry at all. We
should change this line:

md = mainDict.parseString(test1)

to

md = (mainDict+stringEnd).parseString(test1)

The parser is reading as far as it can, but then stopping once
successful parsing is no longer possible. Since there is at least one
valid entry matching the OneOrMore expression, then parseString raises
no errors. By adding "+stringEnd" to our expression to be parsed, we
are saying "once parsing is finished, we should be at the end of the
input string". By making this change, we now get this parse
exception:

pyparsing.ParseException: Expected stringEnd (at char 1948), (line:54,
col:1)
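
The silent early stop is easy to reproduce on a tiny grammar. The sketch below (Python 3) uses a made-up list-of-numbers grammar rather than the technology file, purely to show the difference that stringEnd makes.

```python
from pyparsing import OneOrMore, ParseException, Word, nums, stringEnd

numbers = OneOrMore(Word(nums))
text = "10 20 thirty 40"

# Without an end-of-input check, parsing stops quietly at the first mismatch
# and returns only what matched up to that point.
partial = numbers.parseString(text).asList()
print(partial)

# Requiring stringEnd turns the silent truncation into an exception.
try:
    (numbers + stringEnd).parseString(text)
    error = None
except ParseException as err:
    error = err
    print("parse failed at line %d, col %d" % (err.lineno, err.col))
```

Passing parseAll=True to parseString should have the same effect as appending stringEnd to the expression.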

So what is the matter with the METAL2 entries? After using brute
force "divide and conquer" (I deleted half of the entries and got a
successful parse, then restored half of the entries I removed, until I
added back the entry that caused the parse to fail), I found these
lines in the input:

fatTblThreshold = (0,0.39,10.005)
fatTblParallelLength = (0,1,0)

Both of these violate the atflist definition, because they contain
integers, not just floatnums. So we need to expand the definition of
atflist:

floatnum = Combine(Word(nums) + "." + Word(nums) +
    Optional('e'+oneOf("+ -")+Word(nums)))
floatnum.setParseAction(lambda t:float(t[0]))
integer = Word(nums).setParseAction(lambda t:int(t[0]))
atflist = Suppress("(") + delimitedList(floatnum|integer) + \
            Suppress(")")

Then we need to tackle the issue of adding nesting for those entries
that have sub-keys. This is actually kind of tricky for your data
example, because nesting within Dict expects input data to be nested.
That is, nesting Dict's is normally done with data that is input like:

main
  Technology
  Layer
    PRBOUNDARY
    METAL2
  Tile
    unit

But your data is structured slightly differently:

main
  Technology
  Layer PRBOUNDARY
  Layer METAL2
  Tile unit

Because Layer is repeated, the second entry creates a new node named
"Layer" at the second level, and the first "Layer" entry is lost. To
fix this, we need to combine Layer and the layer id into a composite
type of key. I did this by using Group, and adding the Optional alias
(which I see now is a poor name, "layerId" would be better) as a
second element of the key:

mainDict = dictOf(
    Group(Word(alphas)+Optional(quotedString)),
    Suppress("{") + attrDict + Suppress("}")
    )

But now if we parse the input with this mainDict, we see that the keys
are no longer nice simple strings, but they are 1- or 2-element
ParseResults objects. Here is what I get from the command "print
md.keys()":

[(['Technology'], {}), (['Tile', 'unit'], {}), (['Layer',
'PRBOUNDARY'], {}), (['Layer', 'METAL2'], {})]

So to finally clear this up, we need one more parse action, attached
to the mainDict expression, that rearranges the subdicts using the
elements in the keys. The parse action looks like this, and it will
process the overall parse results for the entire data structure:

def rearrangeSubDicts(toks):
    # iterate over all key-value pairs in the dict
    for key,value in toks.items():
        # key is of the form ['name'] or ['name', 'name2']
        # and the value is the attrDict

        # if key has just one element, use it to define
        # a simple string key
        if len(key)==1:
            toks[key[0]] = value
        else:
            # if the key has two elements, create a
            # subnode with the first element
            if key[0] not in toks:
                toks[key[0]] = ParseResults([])

            # add an entry for the second key element
            toks[key[0]][key[1]] = value

        # now delete the original key that is the form
        # ['name'] or ['name', 'name2']
        del toks[key]

It looks a bit messy, but the point is to modify the tokens in place,
by rearranging the attrdicts to nodes with simple string keys, instead
of keys nested in structures.

Lastly, we attach the parse action in the usual way:

mainDict.setParseAction(rearrangeSubDicts)

Now you can access the fields of the different layers as:

print md.Layer.METAL2.lineStyle

I guess this all looks pretty convoluted. You might be better off
just doing your own Group'ing, and then navigating the nested lists to
build your own dict or other data structure.
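
A rough sketch of that alternative, for Python 3: plain Groups, then an ordinary loop that builds nested Python dicts. The grammar is cut down to integers and quoted strings, and the names attr, section, grammar, and tree are invented for this example.

```python
from pyparsing import (Group, Optional, Suppress, Word, ZeroOrMore, alphas,
                       nums, quotedString, removeQuotes)

atfstr = quotedString.copy().setParseAction(removeQuotes)
value = Word(nums) | atfstr                     # simplified value set
attr = Group(Word(alphas) + Suppress("=") + value)
section = Group(Word(alphas) + Optional(atfstr) +
                Suppress("{") + Group(ZeroOrMore(attr)) + Suppress("}"))
grammar = ZeroOrMore(section)

sample = '''
Technology { name = "gtc" }
Layer "PRBOUNDARY" { layerNumber = 0 color = "cyan" }
Layer "METAL2" { layerNumber = 36 color = "yellow" }
'''

tree = {}
for tokens in grammar.parseString(sample).asList():
    attrs = dict(tokens[-1])      # last element is a list of [key, value]
    if len(tokens) == 2:          # [name, attrs]
        tree[tokens[0]] = attrs
    else:                         # [name, label, attrs]
        tree.setdefault(tokens[0], {})[tokens[1]] = attrs

print(tree["Layer"]["METAL2"]["color"])
```

Because the result is an ordinary dict of dicts, repeated section names such as Layer simply accumulate under one key, with no parse action needed to rearrange anything afterwards.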

-- Paul
 
Arnaud Delobelle

Hi all,
Hi,

I am struggling with parsing the following data:

test1 = """
Technology      {
                name                            = "gtc"
                dielectric                      = 2.75e-05
[...]

I know it's cheating, but the grammar of your example is actually
quite simple and the values are valid python expressions, so here is a
solution without pyparsing (or regexps, for that matter). *WARNING*
it uses the exec statement.

from textwrap import dedent

def parse(txt):
    globs, parsed = {}, {}
    units = txt.strip().split('}')[:-1]
    for unit in units:
        label, params = unit.split('{')
        paramdict = {}
        exec dedent(params) in globs, paramdict
        try:
            label, key = label.split()
            parsed.setdefault(label, {})[eval(key)] = paramdict
        except ValueError:
            parsed[label.strip()] = paramdict
    return parsed

>>> p = parse(test1)
>>> p['Layer']['PRBOUNDARY']
{'maskName': '', 'defaultWidth': 0, 'color': 'cyan', 'pattern':
'blank', 'layerNumber': 0, 'minSpacing': 0, 'blink': 0, 'minWidth': 0,
'visible': 1, 'pitch': 0, 'selectable': 1, 'lineStyle': 'solid'}
>>> p['Layer']['METAL2']['maskName']
'metal2'
>>> p['Technology']['gridResolution']
5
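
For anyone wary of exec, the same splitting idea can be sketched in Python 3 with the standard ast module, which parses the assignments without executing them. This is a variant written for illustration, not the code above: it swaps exec and eval for ast.parse and ast.literal_eval, uses a shortened sample, and still assumes that no value contains a brace.

```python
import ast
from textwrap import dedent

def parse(txt):
    parsed = {}
    for unit in txt.strip().split('}')[:-1]:
        label, params = unit.split('{')
        paramdict = {}
        # Parse the body as Python source, but never execute it.
        for stmt in ast.parse(dedent(params)).body:
            is_simple = (isinstance(stmt, ast.Assign)
                         and len(stmt.targets) == 1
                         and isinstance(stmt.targets[0], ast.Name))
            if not is_simple:
                raise ValueError("expected 'name = literal'")
            # literal_eval accepts only literals (numbers, strings, tuples).
            paramdict[stmt.targets[0].id] = ast.literal_eval(stmt.value)
        parts = label.split()
        if len(parts) == 2:
            key = ast.literal_eval(parts[1])   # strip the quotes from "METAL2"
            parsed.setdefault(parts[0], {})[key] = paramdict
        else:
            parsed[label.strip()] = paramdict
    return parsed

sample = '''
Technology {
    name = "gtc"
    dielectric = 2.75e-05
}

Layer "METAL2" {
    layerNumber = 36
    fatTblThreshold = (0,0.39,10.005)
    fatTblSpacing = (0.21,0.24,0.6,
                     0.24,0.24,0.6)
}
'''
p = parse(sample)
print(p['Technology']['dielectric'])
print(p['Layer']['METAL2']['fatTblThreshold'])
```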

HTH
 
rh0dium

Hi Paul,

Before I continue, I must thank you for your help. You really did an
outstanding job on this code, and it is really straightforward to use
and learn from. This was a fun weekend task, and I really wanted to
use pyparsing for it because this is one of several types of files I
want to parse. I think (as I'm sure you would agree) that
rearrangeSubDicts is a bit of a hack, but it is nevertheless
absolutely required given the limitations of the data I am parsing.
Once again, thanks for your great help. Now the problem...

I attempted to use this code on another testcase. This testcase had
tabs in it. I think 1.4.11 is missing the expandtabs attribute. I
ran my code (which had tabs) and I got this..

AttributeError: 'builtin_function_or_method' object has no attribute
'expandtabs'

Uh oh. Is this a pyparsing problem, or am I just an idiot?

Thanks again!
 
rh0dium

Doh!! Never mind, I am an idiot. I got it - what a bonehead.

I needed to tweak it a bit to ignore the comments. Namely, this fixed
it up:

mainDict = dictOf(
    Group(Word(alphas)+Optional(quotedString)),
    Suppress("{") + attrDict + Suppress("}")
    ) | cStyleComment.suppress()

Thanks again. Now I just need to figure out how to use your dicts to
do some work..
 
Francesco Bochicchio

On Sat, 22 Mar 2008 14:11:16 -0700, rh0dium wrote:
Hi all,

I am struggling with parsing the following data:

test1 = """
Technology {
name = "gtc"
dielectric = 2.75e-05
[...]

Did you think of using something a bit more sophisticated than pyparsing?
I have had a good experience using ply, a pure-Python implementation
of the yacc/lex tools, which I used to extract significant data from C
programs to automate documentation.

I had never used yacc or similar tools before, but having a bit of
experience with BNF notation, I found ply easy enough. In my case, the
major problem was coping with yacc's limitations in describing C syntax
(which I solved by "relaxing" the rules a bit, since I was only going to
process already-compiled C code). In your much simpler case, I'd say
that a few production rules should be enough.

P.S.: there are other, faster and maybe more complete Python parsers, but
as I said ply is pure Python: no external libraries, and it runs everywhere.

Ciao
 
Paul McGuire

I'm glad this is coming around to some reasonable degree of completion
for you. One last thought - your handling of comments is a bit crude,
and will not handle comments that crop up in the middle of dict
entries, as in:

color = /* using non-standard color during testing */
"plum"

The more comprehensive way to handle comments is to call ignore.
Using ignore will propagate the comment handling to all embedded
expressions, so you only need to call ignore once on the top-most
pyparsing expression, as in:

mainDict.ignore(cStyleComment)

Also, ignore does token suppression automatically.
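
As a quick check of the mid-entry case, here is a minimal sketch for Python 3. The attrDict here is a simplified stand-in that accepts only quoted-string values, and the sample text is invented around the "plum" example above.

```python
from pyparsing import (Suppress, Word, alphas, cStyleComment, dictOf,
                       quotedString, removeQuotes)

atfstr = quotedString.copy().setParseAction(removeQuotes)
attrDict = dictOf(Word(alphas), Suppress("=") + atfstr)

# One call on the outermost expression; it is passed down to the
# expressions it contains.
attrDict.ignore(cStyleComment)

sample = '''
color = /* using non-standard color during testing */
        "plum"
lineStyle = "solid"
'''
result = attrDict.parseString(sample)
print(result.asList())
```

The comment sits between the "=" and the value, where the alternative-based approach would not look for it, yet it does not appear in the results.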

-- Paul
 
