pyparsing: how to negate a grammar

knguyen

Hi,

I want to define a rule for a line that does NOT start with a given
Literal. How do I do that? I tried the following, and my program just
hangs:

BodyLine = ~Literal("HTTP/1.1") + restOfLine

Thanks,
Khoa
 
Paul McGuire

Khoa -

pyparsing can be run in several modes: one tokenizes and extracts
data according to a given grammar, one scans for pattern matches,
and one translates matched patterns into other patterns. It's not
clear from your e-mail what you are trying to do. There is nothing in your
statement that would cause Python to "just hang there". What else is your
program doing?
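
For reference, the three modes correspond, as far as I know, to the parseString, scanString and transformString methods. Here is a minimal sketch of the difference; the integer expression and the sample text are made up for illustration and are not taken from your program:

```python
from pyparsing import Word, nums

# Hypothetical expression and input, chosen only to illustrate the modes.
integer = Word(nums)
text = "10 apples and 20 pears"

# Mode 1: parse from the start of the string according to the grammar.
parsed = integer.parseString(text).asList()

# Mode 2: scan the whole string for every place the pattern matches.
scanned = [toks[0] for toks, start, end in integer.scanString(text)]

# Mode 3: replace each match with whatever the parse action returns.
masker = integer.copy().setParseAction(lambda toks: "N")
masked = masker.transformString(text)

print(parsed)
print(scanned)
print(masked)
```

Which mode fits depends on whether you want to describe the whole input or just pick pieces out of it.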

-- Paul
 
knguyen

Hi Paul,

I am trying to extract HTTP response codes from an HTTP page sent from
a web server. Below is my test program. The program just hangs.

Thanks,
Khoa
##################################################

#!/usr/bin/python

from pyparsing import (ParseException, Dict, CharsNotIn,
    Group, Literal, Word, ZeroOrMore, OneOrMore,
    Suppress, nums, alphas, alphanums, printables, restOfLine)


data = """HTTP/1.1 200 OK
body line some text here
body line some text here
HTTP/1.1 400 Bad request
body line some text here
body line some text here

HTTP/1.1 500 Bad request
body line some text here
body line some text here
"""

print("=================")
print(data)
print("=================")

HTTPVersion = (Literal("HTTP/1.1")).setResultsName("HTTPVersion")
StatusCode = (Word(nums)).setResultsName("StatusCode")
ReasonPhrase = restOfLine.setResultsName("ReasonPhrase")
StatusLine = Group(HTTPVersion + StatusCode + ReasonPhrase)

nonHTTP = ~Literal("HTTP/1.1")
BodyLine = Group(nonHTTP + restOfLine)
Response = OneOrMore(StatusLine + ZeroOrMore(BodyLine))
respFields = Response.parseString(data)
print(respFields)
 
Paul McGuire

Khoa -

Thanks for supplying a little more information to go on. The problem you
are struggling with has to do with pyparsing's handling or non-handling of
whitespace, which I'll admit takes some getting used to.

In general, pyparsing works its way through the input string, matching input
characters against the defined pattern. This gets a little tricky when
dealing with whitespace (which includes '\n' characters). In particular,
restOfLine will read up to the next '\n' but will not go past it, AND
restOfLine will match an empty string. So if you have a grammar that
includes repetition, such as OneOrMore(restOfLine), it will read up to the
next '\n' and then keep matching the empty string at that position forever,
never advancing. This is just about the case you have in your code,
ZeroOrMore(BodyLine), in which BodyLine is

BodyLine = Group(nonHTTP + restOfLine)

You need to include something to consume the terminating '\n', which is the
purpose of the LineEnd() class. Change BodyLine to

BodyLine = Group(nonHTTP + restOfLine + LineEnd())

and this will break the infinite looping that occurs at the end of the first
body line. (If you like, use LineEnd().suppress() to keep the '\n' tokens
from getting included with your other parsed data.)

Now there is one more problem - another infinite loop at the end of the
string. By similar reasoning, it is resolved by changing
nonHTTP = ~Literal("HTTP/1.1")
to
nonHTTP = ~Literal("HTTP/1.1") + ~StringEnd()

After making those two changes, your program runs to completion on my
system.
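
Putting the two changes together, the corrected grammar might look like the sketch below. It reuses the names and sample data from your program. The status_codes and body_lines variables at the end are my own additions for pulling out the results, and the camelCase method names assume a pyparsing version that still accepts them.

```python
from pyparsing import (Group, LineEnd, Literal, OneOrMore, StringEnd,
                       Word, ZeroOrMore, nums, restOfLine)

data = """HTTP/1.1 200 OK
body line some text here
body line some text here
HTTP/1.1 400 Bad request
body line some text here
body line some text here

HTTP/1.1 500 Bad request
body line some text here
body line some text here
"""

HTTPVersion = Literal("HTTP/1.1").setResultsName("HTTPVersion")
StatusCode = Word(nums).setResultsName("StatusCode")
ReasonPhrase = restOfLine.setResultsName("ReasonPhrase")
StatusLine = Group(HTTPVersion + StatusCode + ReasonPhrase)

# Change 2: a body line must not start with "HTTP/1.1" and must not be
# at the very end of the input string.
nonHTTP = ~Literal("HTTP/1.1") + ~StringEnd()
# Change 1: consume the terminating '\n' so each repetition advances.
BodyLine = Group(nonHTTP + restOfLine + LineEnd().suppress())
Response = OneOrMore(StatusLine + ZeroOrMore(BodyLine))

respFields = Response.parseString(data)

# Hypothetical post-processing: status groups carry a StatusCode name,
# body groups do not.
status_codes = [g.StatusCode for g in respFields if "StatusCode" in g]
body_lines = ["".join(g) for g in respFields if "StatusCode" not in g]
body_lines = [line for line in body_lines if line.strip()]

print(status_codes)
print(len(body_lines))
```

Note that the '\n' ending each status line is itself picked up as an empty body line by this grammar, which is why the sketch filters out blank entries.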

Usually, when someone has some problems with this kind of "line-sensitive"
parsing, I recommend that they consider using pyparsing in a different
manner, or use some other technique. For instance, you might use
pyparsing's scanString generator to match on the HTTP lines, as in

for toks, start, end in StatusLine.scanString(data):
    print(toks, toks[0].StatusCode, toks[0].ReasonPhrase)
    print(start, end)

which gives
[['HTTP/1.1', '200', ' OK']] 200 OK
0 15
[['HTTP/1.1', '400', ' Bad request']] 400 Bad request
66 90
[['HTTP/1.1', '500', ' Bad request']] 500 Bad request
142 166

If you need the intervening body text, you can use the start and end values
to extract it in slices from the input data string.
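
A rough sketch of that slicing, assuming the body of each response runs from the end of one status line to the start of the next (or to the end of the data). The responses list and the loop bookkeeping are hypothetical names of mine, not part of pyparsing:

```python
from pyparsing import Group, Literal, Word, nums, restOfLine

data = """HTTP/1.1 200 OK
body line some text here
body line some text here
HTTP/1.1 400 Bad request
body line some text here
body line some text here

HTTP/1.1 500 Bad request
body line some text here
body line some text here
"""

StatusLine = Group(Literal("HTTP/1.1").setResultsName("HTTPVersion")
                   + Word(nums).setResultsName("StatusCode")
                   + restOfLine.setResultsName("ReasonPhrase"))

# Collect all matches first so each one can look ahead to the next start.
matches = list(StatusLine.scanString(data))

responses = []
for i, (toks, start, end) in enumerate(matches):
    # The body ends where the next status line starts, or at end of data.
    next_start = matches[i + 1][1] if i + 1 < len(matches) else len(data)
    body = data[end:next_start].strip().splitlines()
    responses.append((toks[0].StatusCode, body))

for code, body in responses:
    print(code, body)
```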

Or, since your data is reasonably well-formed, you could just use readlines,
or data.split('\n'), and find the HTTP lines using startswith(). While this
is a brute force approach, it will certainly run many times faster than
pyparsing.
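
The brute force version needs no pyparsing at all. A minimal sketch, assuming every status line has the form "HTTP/1.1 <code> <reason>" as in your sample data (the status_lines name is mine):

```python
data = """HTTP/1.1 200 OK
body line some text here
body line some text here
HTTP/1.1 400 Bad request
body line some text here
body line some text here

HTTP/1.1 500 Bad request
body line some text here
body line some text here
"""

status_lines = []
for line in data.split("\n"):
    if line.startswith("HTTP/1.1"):
        # Split into version, status code, and the rest as the reason phrase.
        version, code, reason = line.split(" ", 2)
        status_lines.append((code, reason))

print(status_lines)
# [('200', 'OK'), ('400', 'Bad request'), ('500', 'Bad request')]
```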

In any event, best of luck using pyparsing, and write back if you have other
questions.

-- Paul
 
