regular expression extracting groups

clawsicus · Aug 10, 2008

Hi list,

I'm trying to use regular expressions to help me quickly extract the
contents of messages that my application will receive. I have worked
out most of the regex but the last section of the message has me
stumped. This is mostly because I want to pull the content out into
regex groups that I can easily access later. I have a regex to extract
the key/value pairs but it ends up with only the contents of the last
key/value pair encountered.

An example of the section of the message that is troubling me appears
like this:

{
option=value
foo=bar
another=42
option=7
}

So it's basically a bunch of lines. Every line is terminated with a
'\n' character. The number of key/value fields changes depending on
the particular message. Also notice that there are two 'option' keys.
This is allowable and I need to cater for it.
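
A minimal sketch of reading such a section without a regex, assuming the block arrives as a single string with '\n' line endings. The function name parse_body is hypothetical. It returns the pairs as an ordered list, so the repeated 'option' key is preserved.

```python
def parse_body(section):
    """Return (key, value) pairs from a '{...}' block, keeping order and duplicates."""
    pairs = []
    for line in section.split("\n"):
        # skip the brace lines and the empty string left after the final '\n'
        if line in ("{", "}", ""):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("not a key=value line: %r" % line)
        pairs.append((key, value))
    return pairs


body = "{\noption=value\nfoo=bar\nanother=42\noption=7\n}\n"
pairs = parse_body(body)
print(pairs)
```

A list of tuples is used here rather than a dict because a dict would silently drop the first 'option' entry.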

A couple of example messages are:
xpl-stat\n{\nhop=1\nsource=vendor-device.instance\ntarget=*\n}\nhbeat.basic\n{\ninterval=10\n}\n

xpl-stat\n{\nhop=1\nsource=vendor-device.instance\ntarget=vendor-device.instance\n}\nconfig.list\n{\nreconf=newconf\noption=interval\noption=group[16]\noption=filter[16]\n}\n

As all messages follow the same pattern, I'm hoping to develop one
generic regex that can pull a message from a received packet, instead
of a separate regex for each message kind (there are many).

The regex I came up with looks like this:
import re

# This should match any xPL message

GROUP_MESSAGE_TYPE = 'message_type'
GROUP_HOP = 'hop'
GROUP_SOURCE = 'source'
GROUP_TARGET = 'target'
GROUP_SRC_VENDOR_ID = 'source_vendor_id'
GROUP_SRC_DEVICE_ID = 'source_device_id'
GROUP_SRC_INSTANCE_ID = 'source_instance_id'
GROUP_TGT_VENDOR_ID = 'target_vendor_id'
GROUP_TGT_DEVICE_ID = 'target_device_id'
GROUP_TGT_INSTANCE_ID = 'target_instance_id'
GROUP_IDENTIFIER_TYPE = 'identifier_type'
GROUP_SCHEMA = 'schema'
GROUP_SCHEMA_CLASS = 'schema_class'
GROUP_SCHEMA_TYPE = 'schema_type'
GROUP_OPTION_KEY = 'key'
GROUP_OPTION_VALUE = 'value'

XplMessageGroupsRe = r'''(?P<%s>xpl-(cmnd|stat|trig))\n    # message type
\{\n
hop=(?P<%s>[1-9]{1})\n    # hop count
source=(?P<%s>(?P<%s>[a-z0-9]{1,8})-(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,16}))\n    # source identifier
target=(?P<%s>(\*|(?P<%s>[a-z0-9]{1,8})-(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,16})))\n    # target identifier
\}\n
(?P<%s>(?P<%s>[a-z0-9]{1,8})\.(?P<%s>[a-z0-9]{1,8}))\n    # schema
\{\n
(?:(?P<%s>[a-z0-9\-]{1,16})=(?P<%s>[\x20-\x7E]{0,128})\n){1,64}    # key/value pairs
\}\n''' % (GROUP_MESSAGE_TYPE,
GROUP_HOP,
GROUP_SOURCE,
GROUP_SRC_VENDOR_ID,
GROUP_SRC_DEVICE_ID,
GROUP_SRC_INSTANCE_ID,
GROUP_TARGET,
GROUP_TGT_VENDOR_ID,
GROUP_TGT_DEVICE_ID,
GROUP_TGT_INSTANCE_ID,
GROUP_SCHEMA,
GROUP_SCHEMA_CLASS,
GROUP_SCHEMA_TYPE,
GROUP_OPTION_KEY,
GROUP_OPTION_VALUE)

XplMessageGroups = re.compile(XplMessageGroupsRe, re.VERBOSE |
re.DOTALL)

If I pass the second example message through this regex, the 'key'
group ends up containing 'option' and the 'value' group ends up
containing 'filter[16]', which is the last key/value pair in that
message.
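
The behaviour described above can be reproduced with a much smaller pattern. This is only a sketch: the pattern below is a hypothetical reduction of the key/value part of the full regex, applied to the body lines of the second example message. Python's re module keeps only the final capture of a group that sits inside a repetition.

```python
import re

# Hypothetical cut-down pattern: just the repeated key=value part of the full regex.
pattern = re.compile(
    r'(?:(?P<key>[a-z0-9\-]{1,16})=(?P<value>[\x20-\x7E]{0,128})\n){1,64}')

body = "reconf=newconf\noption=interval\noption=group[16]\noption=filter[16]\n"
m = pattern.match(body)

# The whole body is consumed, but each named group holds only its last capture.
last_key = m.group("key")
last_value = m.group("value")
print(last_key, last_value)   # option filter[16]
```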

So the problem lies in the key/value section of the regex. It matches
multiple occurrences of the pattern, but each match overwrites the
same single key/value group, so I can't extract and access all the
fields.

Is there some other way to do this which allows me to store all the
key/value pairs into the regex match object for later retrieval?
Perhaps using the standard unnamed number groups?

Thanks,
Chris
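
One possible way around this with the standard re module, offered as a sketch rather than a definitive answer: capture the whole key/value block as a single group, then run a second, simpler regex over that block with findall. The names MESSAGE_RE, PAIR_RE and the 'body' group are assumptions made for this illustration, and the header block is deliberately not parsed here.

```python
import re

# Stage 1 (assumed simplification): find the schema line and capture the
# entire key/value block that follows it as one group.
MESSAGE_RE = re.compile(
    r'(?P<schema>[a-z0-9]{1,8}\.[a-z0-9]{1,8})\n'
    r'\{\n'
    r'(?P<body>(?:[a-z0-9\-]{1,16}=[\x20-\x7E]{0,128}\n){1,64})'
    r'\}\n')

# Stage 2: one key=value pair per line of the captured block.
PAIR_RE = re.compile(r'^([a-z0-9\-]{1,16})=([\x20-\x7E]{0,128})$', re.MULTILINE)

message = ("xpl-stat\n{\nhop=1\nsource=vendor-device.instance\n"
           "target=vendor-device.instance\n}\nconfig.list\n{\n"
           "reconf=newconf\noption=interval\noption=group[16]\n"
           "option=filter[16]\n}\n")

m = MESSAGE_RE.search(message)
schema = m.group("schema")
pairs = PAIR_RE.findall(m.group("body"))   # list of (key, value) tuples

print(schema)
for key, value in pairs:
    print(key, "=", value)
```

findall returns every match rather than only the last one, so all three 'option' lines are available afterwards.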

Paul Hankin · Aug 10, 2008

> I'm trying to use regular expressions to help me quickly extract the
> contents of messages that my application will receive.

Don't use regexps for parsing complex data; they're limited,
completely unreadable, and hugely difficult to debug. Your code is
well written, and you've already reached the limits of the power of
regexps, and it's difficult to read.

Have a look at pyparsing for a simple solution to your problem.
http://pyparsing.wikispaces.com/

Paul McGuire · Aug 10, 2008

> Don't use regexps for parsing complex data; they're limited,
> completely unreadable, and hugely difficult to debug. Your code is
> well written, and you've already reached the limits of the power of
> regexps, and it's difficult to read.
>
> Have a look at pyparsing for a simple solution to your problem.
> http://pyparsing.wikispaces.com/

Well, predictably, the pyparsing solution is simple UNTIL we get to
the "multidict" options field. Pyparsing has a Dict construct that
has the same limitations as Python's dict - only the last key-value
would be retained. So I had to write a parse action to manually
stitch the key-value groups into the parsed tokens' internal key-value
dict.
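
The dict limitation mentioned here can be illustrated in plain Python, without pyparsing. The sketch below is not the parse action from this post. It only shows the same "multidict" idea using collections.defaultdict, with the key/value pairs of the second example message written out by hand.

```python
from collections import defaultdict

pairs = [("reconf", "newconf"), ("option", "interval"),
         ("option", "group[16]"), ("option", "filter[16]")]

# A plain dict keeps only the last value seen for a repeated key.
plain = dict(pairs)

# Collecting values into lists keeps every occurrence, in order.
multi = defaultdict(list)
for key, value in pairs:
    multi[key].append(value)

print(plain["option"])   # filter[16]
print(multi["option"])   # ['interval', 'group[16]', 'filter[16]']
```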

With the basic grammar implemented in pyparsing, it would now be very
easy to make some of these internal expressions optional (using
Optional wrappers), or parseable in any order (using '&' operator
instead of '+' - '&' enforces presence of all values, but in any
order).

-- Paul

from pyparsing import (Suppress, Literal, Combine, oneOf, Word, alphanums,
                       restOfLine, ZeroOrMore, Group, ParseResults)

# the GROUP_* names are the constants defined in the original post

LBRACE,RBRACE,EQ = map(Suppress,"{}=")
keylabel = lambda s : Literal(s) + EQ
grp_msg_type = Combine("xpl-" + oneOf("cmnd stat trig"))(GROUP_MESSAGE_TYPE)
grp_hop = keylabel("hop") + Word("123456789",exact=1)(GROUP_HOP)
grp_source = keylabel("source") + Combine(
    Word(alphanums,max=8)(GROUP_SRC_VENDOR_ID) + '-' +
    Word(alphanums,max=8)(GROUP_SRC_DEVICE_ID) + '.' +
    Word(alphanums,max=16)(GROUP_SRC_INSTANCE_ID)
    )(GROUP_SOURCE)
grp_target = keylabel("target") + Combine(
    '*' |
    Word(alphanums,max=8)(GROUP_TGT_VENDOR_ID) + '-' +
    Word(alphanums,max=8)(GROUP_TGT_DEVICE_ID) + '.' +
    Word(alphanums,max=16)(GROUP_TGT_INSTANCE_ID)
    )(GROUP_TARGET)
grp_schema = Combine(
    Word(alphanums,max=8)(GROUP_SCHEMA_CLASS) + '.' +
    Word(alphanums,max=8)(GROUP_SCHEMA_TYPE)
    )(GROUP_SCHEMA)

option_key = Word(alphanums+'-',max=16)
#~ option_val = Word(printables+' ',max=64)
option_val = restOfLine
options = (LBRACE +
    ZeroOrMore(Group(option_key("key") + EQ + option_val("value"))) +
    RBRACE)("options")

# this parse action will take the raw key=value groups and add them to
# the current results' named tokens
def make_options_dict(tokens):
    for k,v in tokens.asList():
        if k not in tokens:
            tokens[k] = ParseResults([])
        tokens[k] += ParseResults(v)
    # delete redundant key-value created by pyparsing
    del tokens["options"]
    return tokens
options.setParseAction(make_options_dict)

msgFormat = (grp_msg_type +
    LBRACE + grp_hop + grp_source + grp_target + RBRACE +
    grp_schema +
    options)

# parse each message (msgdata is the list of raw message strings)
for msgstr in msgdata:
    msg = msgFormat.parseString(msgstr)
    #~ print msg.dump()
    print "Message type:", msg.message_type
    print "Hop:", msg.hop
    print "Options:"
    print msg.options.dump()
    print

Prints:

Message type: xpl-stat
Hop: 1
Options:
[['interval', '10']]
- interval: ['10']

Message type: xpl-stat
Hop: 1
Options:
[['reconf', 'newconf'], ['option', 'interval '],
['option', 'group[16]'], ['option', 'filter[16]']]
- option: ['interval ', 'group[16]', 'filter[16]']
- reconf: ['newconf']

clawsicus · Aug 11, 2008

Thanks all for your responses, especially Paul McGuire for the
excellent example usage of pyparsing.
I'm off to check out pyparsing.

Thanks,
Chris
