Parsing files -- pyparsing to the rescue?

rh0dium

Hi all,

I have a file which I need to parse and I need to be able to break it
down by sections. I know it's possible but I can't seem to figure this
out.

The sections are broken by <> with one or more keywords in the <>.
What I want to do is to be able to parse a particular section of the
file. So for example I need to be able to look at the SYSLIB section.
Presumably the sections look like this:


<SYSLIB>
Sys Data
Sys-Data
asdkData
Data
<LOGLVS>
Data
Data
Data
Data
<SOME SECTION>
Data
Data
Data
Data
<NETLIST>
Data
Data
Data
Data
<NET>

So if I wanted to break them down, the section headers are matched by
this:

import pyparsing

secH = (pyparsing.LineStart()
        + pyparsing.Suppress(pyparsing.Literal("<"))
        + pyparsing.OneOrMore(pyparsing.Word(pyparsing.alphanums))
        + pyparsing.Suppress(pyparsing.Literal(">")))

But how do I say that <SECTIONn> stops at the start of the next
<SECTIONm>?
 
Giovanni Bajo

rh0dium said:
I have a file which I need to parse and I need to be able to break it
down by sections. I know it's possible but I can't seem to figure this
out.

The sections are broken by <> with one or more keywords in the <>.
What I want to do is to be able to parse a particular section of the
file. So for example I need to be able to look at the SYSLIB section.
Presumably the sections look like this:


<SYSLIB>
Sys Data
Sys-Data
asdkData
Data
<LOGLVS>
Data
Data
Data
Data
<SOME SECTION>
Data
Data
Data
Data
<NETLIST>
Data
Data
Data
Data
<NET>

Given your description, pyparsing doesn't feel like the correct tool:

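import re

secs = {}
for L in open("foo.txt"):
    L = L.rstrip("\n")
    # a line of the form <NAME> starts a new section
    if re.match(r"<.*>", L):
        name = L[1:-1]
        secs[name] = []
    else:
        # assumes the file starts with a section header, so name is set
        secs[name].append(L)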
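
The loop above can also be wrapped in a function that tolerates lines appearing before the first header. This is only a sketch: the names split_sections and HEADER are made up for illustration, and it assumes a header is a whole line of the form <KEYWORDS>.

```python
import re

# Hypothetical pattern: a header is an entire line of the form <KEYWORDS>.
HEADER = re.compile(r"<([^<>]+)>\s*$")

def split_sections(lines):
    """Map each section name to the list of data lines that follow it."""
    sections = {}
    current = None  # no section header seen yet
    for line in lines:
        line = line.rstrip("\n")
        m = HEADER.match(line)
        if m:
            current = m.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
        # lines before the first header are silently skipped
    return sections

sample = """<SYSLIB>
Sys Data
Sys-Data
<LOGLVS>
Data
<NET>
"""
secs = split_sections(sample.splitlines())
print(secs["SYSLIB"])  # ['Sys Data', 'Sys-Data']
```

Passing any iterable of lines (an open file works too) keeps the function easy to try out on a string first.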
 
Paul McGuire

rh0dium said:
Hi all,

I have a file which I need to parse and I need to be able to break it
down by sections. I know it's possible but I can't seem to figure this
out.

The sections are broken by <> with one or more keywords in the <>.
But how do I say that <SECTIONn> stops at the start of the next
<SECTIONm>?

See the attached working example - the comments and definition of dataLine
show how this is done.

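This is something of a trick in pyparsing, but it follows from a basic
characteristic of the pyparsing recursive descent parser: it does no
lookahead on its own, so dataLine has to say explicitly what it must
*not* match (the next section label, or the end of the string).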

-- Paul

data="""<SYSLIB>
Sys Data
Sys-Data
asdkData
Data
<LOGLVS>
Data
Data
Data
Data
<SOME SECTION>
Data
Data
Data
Data
<NETLIST>
Data
Data
Data
Data
<NET>
"""

from pyparsing import *

# basic pyparsing version
secLabel = (Suppress("<") + OneOrMore(Word(alphas)) + Suppress(">") +
            LineEnd().suppress())
# need to indicate which entries are *not* valid datalines - next secLabel,
# or end of string
dataLine = ~secLabel + ~StringEnd() + restOfLine + LineEnd().suppress()

# a data section is a section label, followed by zero or more data lines
section = Group(secLabel + ZeroOrMore(dataLine))

# a config data contains one or more sections
configData = OneOrMore(section)

# parse the input data and print the results
res = configData.parseString(data)
print(res)

# prints:
# [['SYSLIB', 'Sys Data', 'Sys-Data', 'asdkData', 'Data'],
#  ['LOGLVS', 'Data', 'Data', 'Data', 'Data'],
#  ['SOME', 'SECTION', 'Data', 'Data', 'Data', 'Data'],
#  ['NETLIST', 'Data', 'Data', 'Data', 'Data'],
#  ['NET']]


# enhanced version, constructing a ParseResults with dict-like access
# (reuses previous expression definitions)

# combine multiword keys into a single string
# - want <SOME SECTION> to return 'SOME SECTION', not
# 'SOME', 'SECTION'
def joinKeyWords(s,l,t):
    return " ".join(t)
secLabel.setParseAction(joinKeyWords)
section = Group(secLabel + ZeroOrMore(dataLine))
configData = Dict(OneOrMore(section))

# parse the input data, and access the results by section name
res = configData.parseString(data)
print(res)
print(res["SYSLIB"])
print(res["SOME SECTION"])
print(res.keys())


# prints:
# [['SYSLIB', 'Sys Data', 'Sys-Data', 'asdkData', 'Data'],
#  ['LOGLVS', 'Data', 'Data', 'Data', 'Data'],
#  ['SOME SECTION', 'Data', 'Data', 'Data', 'Data'],
#  ['NETLIST', 'Data', 'Data', 'Data', 'Data'],
#  ['NET']]
# ['Sys Data', 'Sys-Data', 'asdkData', 'Data']
# ['Data', 'Data', 'Data', 'Data']
# ['LOGLVS', 'NET', 'NETLIST', 'SYSLIB', 'SOME SECTION']
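
To see what the ~secLabel + ~StringEnd() lookahead in dataLine is doing, it may help to write the same check by hand. The following is a rough sketch that does not use pyparsing at all: parse_sections and HEADER are hypothetical names, and the header pattern assumes labels made of letters and spaces only.

```python
import re

# Hypothetical stand-in for secLabel: letters and spaces between < and >.
HEADER = re.compile(r"<([A-Za-z ]+)>$")

def parse_sections(text):
    """Hand-written analogue of OneOrMore(Group(secLabel + ZeroOrMore(dataLine)))."""
    lines = text.splitlines()
    pos = 0
    result = []
    while pos < len(lines):
        m = HEADER.match(lines[pos])
        if m is None:
            raise ValueError("expected a section label at line %d" % (pos + 1))
        name = " ".join(m.group(1).split())  # join multiword keys with one space
        pos += 1
        body = []
        # Negative lookahead: peek at the next line without consuming it.
        # A data line is accepted only if we are not at the end of the input
        # and the line is not a section label.
        while pos < len(lines) and not HEADER.match(lines[pos]):
            body.append(lines[pos])
            pos += 1
        result.append((name, body))
    return result

parsed = parse_sections("<SYSLIB>\nSys Data\n<SOME SECTION>\nData\n<NET>\n")
print(parsed)
# [('SYSLIB', ['Sys Data']), ('SOME SECTION', ['Data']), ('NET', [])]
```

Without the two conditions on the inner loop, the first section would swallow every remaining line, which is the behaviour the lookahead in dataLine prevents.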
 
Allan Zhang

Try this

code
=====
import re

p = re.compile(r'<SYSLIB>([^<]*)<')
s = open("file").read()
m = p.search(s)
if m:
    # group 1 is everything between <SYSLIB> and the next "<"
    res = m.group(1).strip("\n")
    print(res)


result:
=======
%python parser.py
Sys Data
Sys-Data
asdkData
Data
%

Thanks
Allan
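
The same regex idea can be generalised to any section name. What follows is only a sketch under a few assumptions: get_section is a made-up helper name, section bodies never contain a "<", and, unlike the pattern above, the body is allowed to run to the end of the text so that the last section can be fetched too.

```python
import re

def get_section(text, name):
    """Return the body of <name> as a string, or None if the section is absent."""
    # re.escape guards against special characters in the section name.
    # The body runs up to the next "<" or to the end of the text.
    pattern = re.compile(r"^<%s>\n([^<]*)" % re.escape(name), re.MULTILINE)
    m = pattern.search(text)
    if m is None:
        return None
    return m.group(1).strip("\n")

text = "<SYSLIB>\nSys Data\nSys-Data\n<SOME SECTION>\nData\n<NET>\n"
print(get_section(text, "SYSLIB"))
```

Returning None for a missing section lets the caller tell "no such section" apart from an empty one, which comes back as an empty string.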
 
