pyparsing question

H

hubritic

I am trying to parse data that looks like this:

IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
2BFA76F6 1208230607 T S SYSPROC SYSTEM
SHUTDOWN BY USER
A6D1BD62 1215230807 I
H Firmware Event

My problem is that sometimes there is a RESOURCE_NAME and sometimes
not, so I wind up with "Firmware" as my RESOURCE_NAME and "Event" as
my DESCRIPTION. The formating seems to use a set number of spaces.

I have tried making RESOURCE_NAME an Optional(Word(alphanums))) and
Description OneOrMore(Word(alphas) + LineEnd(). So the question is,
how can I avoid having the first word of Description sucked into
RESOURCE_NAME when that field should be blank?


The data I have has a fixed number of characters per field, so I could
split it up that way, but wouldn't that defeat the purpose of using a
parser? I am determined to become proficient with pyparsing so I am
using it even when it could be considered overkill; thus, it has gone
past mere utility now, this is a matter of principle!

thanks
 
J

John Machin

The data I have has a fixed number of characters per field, so I could
split it up that way, but wouldn't that defeat the purpose of using a
parser?

The purpose of a parser is to parse. Data in fixed columns does not
need parsing.
I am determined to become proficient with pyparsing so I am
using it even when it could be considered overkill; thus, it has gone
past mere utility now, this is a matter of principle!

An extremely misguided "principle". Would you use an AK47 on the
flies around your barbecue? A better principle is to choose the best
tool for the job.
 
H

hubritic

The purpose of a parser is to parse. Data in fixed columns does not
need parsing.


An extremely misguided "principle". Would you use an AK47 on the
flies around your barbecue? A better principle is to choose the best
tool for the job.

Your principle is no doubt the saner one for the real world, but your
example of AK47 is a bit off.
We generally know enough about an AK47 to know that it is not
something to kill flies with. Consider, though, if
someone unfamiliar with the concept of guns and mayhem got an AK47 for
xmas and was only told that it was
really good for killing things. He would try it out and would discover
that indeed it kills all sorts of things.
So he might try killing flies. Then he would discover the limitations;
those already familiar with guns would wonder
why he would waste his time.
 
P

Paul McGuire

I am trying to parse data that looks like this:

IDENTIFIER    TIMESTAMP   T  C   RESOURCE_NAME   DESCRIPTION
2BFA76F6     1208230607   T   S   SYSPROC                    SYSTEM
SHUTDOWN BY USER
A6D1BD62   1215230807     I
H                                            Firmware Event
The data I have has a fixed number of characters per field, so I could
split it up that way, but wouldn't that defeat the purpose of using a
parser?  

I think you have this backwards. I use pyparsing for a lot of text
processing, but if it is not a good fit, or if str.split is all that
is required, there is no real rationale for using anything more
complicated.
I am determined to become proficient with pyparsing so I am
using it even when it could be considered overkill; thus, it has gone
past mere utility now, this is a matter of principle!

Well, I'm glad you are driven to learn pyparsing if it kills you, but
John Machin has a good point. This data is really so amenable to
something as simple as:

for line in logfile:
id,timestamp,t,c resource_and_description = line.split(None,4)

that it is difficult to recommend pyparsing for this case. The sample
you posted was space-delimited, but if it is tab-delimited, and there
is a pair of tabs between the "H" and "Firmware Event" on the second
line, then just use split("\t") for your data and be done.

Still, pyparsing may be helpful in disambiguating that RESOURCE_NAME
and DESCRIPTION text. One approach would be to enumerate (if
possible) the different values of RESOURCE_NAME. Something like this:

ident = Word(alphanums)
timestamp = Word(nums,exact=10)

# I don't know what these are, I'm just getting the values
# from the sample text you posted
t_field = oneOf("T I")
c_field = oneOf("S H")

# I'm just guessing here, you'll need to provide the actual
# values from your log file
resource_name = oneOf("SYSPROC USERPROC IOSUBSYS whatever")

logline = ident("identifier") + timestamp("time") + \
t_field("T") + c_field("C") + \
Optional(resource_name, default="")("resource") + \
Optional(restOfLine, default="")("description")


Another tack to take might be to use a parse action on the resource
name, to verify the column position of the found token by using the
pyparsing method col:

def matchOnlyAtCol(n):
def verifyCol(strg,locn,toks):
if col(locn,strg) != n: raise
ParseException(strg,locn,"matched token not at column %d" % n)
return verifyCol

resource_name = Word(alphas).setParseAction(matchOnlyAtCol(35))

This will only work if your data really is columnar - the example text
that you posted isn't. (Hmm, I like that matchOnlyAtCol method, I
think I'll add that to the next release of pyparsing...)

Here are some similar parsers that might give you some other ideas:
http://pyparsing.wikispaces.com/space/showimage/httpServerLogParser.py
http://mail.python.org/pipermail/python-list/2005-January/thread.html#301450

In the second link, I made a similar remark, that pyparsing may not be
the first tool to try, but the variability of the input file made the
non-pyparsing options pretty hairy-looking with special case code, so
in the end, pyparsing was no more complex to use.

Good luck!
-- Paul
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

Forum statistics

Threads
473,770
Messages
2,569,584
Members
45,075
Latest member
MakersCBDBloodSupport

Latest Threads

Top