Ideas for parsing this text?

Eric Wertman

I have a set of files with this kind of content (it's dumped from WebSphere):

[propertySet "[[resourceProperties "[[[description "This is a required
property. This is an actual database name, and its not the locally
catalogued database name. The Universal JDBC Driver does not rely on
information catalogued in the DB2 database directory."]
[name databaseName]
[required true]
[type java.lang.String]
[value DB2Foo]] [[description "The JDBC connectivity-type of a data
source. If you want to use a type 4 driver, set the value to 4. If you
want to use a type 2 driver, set the value to 2. Use of driverType 2
is not supported on WAS z/OS."]
[name driverType]
[required true]
[type java.lang.Integer]
[value 4]] [[description "The TCP/IP address or host name for the DRDA server."]
[name serverName]
[required false]
[type java.lang.String]
[value ServerFoo]] [[description "The TCP/IP port number where the
DRDA server resides."]
[name portNumber]
[required false]
[type java.lang.Integer]
[value 007]] [[description "The description of this datasource."]
[name description]
[required false]
[type java.lang.String]
[value []]] [[description "The DB2 trace level for logging to the
logWriter or trace file. Possible trace levels are: TRACE_NONE =
0,TRACE_CONNECTION_CALLS = 1,TRACE_STATEMENT_CALLS =
2,TRACE_RESULT_SET_CALLS = 4,TRACE_DRIVER_CONFIGURATION =
16,TRACE_CONNECTS = 32,TRACE_DRDA_FLOWS =
64,TRACE_RESULT_SET_META_DATA = 128,TRACE_PARAMETER_META_DATA =
256,TRACE_DIAGNOSTICS = 512,TRACE_SQLJ = 1024,TRACE_ALL = -1, ."]
[name traceLevel]
[required false]
[type java.lang.Integer]
[value []]] [[description "The trace file to store the trace output.
If you specify the trace file, the DB2 Jcc trace will be logged in
this trace file. If this property is not specified and the
WAS.database trace group is enabled, then both WebSphere trace and DB2
trace will be logged into the WebSphere trace file."]

I'm trying to figure out the best way to feed it all into
dictionaries, without having to know exactly what the contents of the
file are. There are a number of things going on. The nesting is
preserved in [] pairs, and in some cases between double quotes.
There are also cases where double quotes are only there to preserve
spaces in a string, though. I managed to get what I needed in the
short term by just stripping the nesting altogether and flattening
out the key/value pairs, but I had to do some things that were
specific to the file contents to make it work.

Any ideas? I was considering making a list of string combinations, like so:

junk = ['[[','"[',']]']

and just using re.sub to convert them into a single character that I
could start to do split() actions on. There must be something else I
can do; those brackets can't be a coincidence. The output came from
a Jython script.

Thanks!
 
George Sakkis

I have a set of files with this kind of content (it's dumped from WebSphere):

[snipped]

I'm trying to figure out the best way to feed it all into
dictionaries, without having to know exactly what the contents of the
file are.  

It would be pretty pointless if you had to know in advance the exact
file content, but you still have to know the structure of the files,
that is the grammar they conform to.

Any ideas?  I was considering making a list of string combinations, like so:

junk = ['[[','"[',']]']

and just using re.sub to convert them into a single character that I
could start to do split() actions on.  There must be something else I
can do..

Yes, find out the formal grammar of these files and use a parser
generator [1] to specify it.

HTH,
George

[1] http://wiki.python.org/moin/LanguageParsing
 
Paul McGuire

I have a set of files with this kind of content (it's dumped from WebSphere):

[propertySet "[[resourceProperties "[[[description "This is a required
property. This is an actual database name, and its not the locally
catalogued database name. The Universal JDBC Driver does not rely on
...

A couple of comments first:
- What is the significance of '"[' vs. '[' ? I stripped them all out
using
text = text.replace('"[','[')
- Your input text was missing 5 trailing ]'s.
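
A count like this can be checked with a naive depth counter. The following is only a sketch, not part of the original post: the function name `unclosed_brackets` is made up for illustration, and it counts every bracket it sees, including any that might sit inside a quoted description.

```python
def unclosed_brackets(text):
    """Return how many '[' are still open at the end of text.

    Naive on purpose: brackets inside quoted strings are counted too.
    """
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("closing bracket without a matching opener")
    return depth


# A shortened stand-in for the start of the dump: six openers, one closer.
sample = '[propertySet "[[resourceProperties "[[[description "x"]'
print(unclosed_brackets(sample))
```

A result of zero means the brackets balance; a positive number is the count of trailing ]'s that would have to be appended.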

Here's the parser I used, using pyparsing:


from pyparsing import nestedExpr,Word,alphanums,QuotedString
from pprint import pprint

content = Word(alphanums+"_.") | QuotedString('"',multiline=True)
structure = nestedExpr("[", "]", content).parseString(text)

pprint(structure.asList())


Prints (I've truncated the long lines, but the long quoted strings do
parse intact):

[['propertySet',
[['resourceProperties',
[[['description',
'This is a required \nproperty. This is an actual data...
['name', 'databaseName'],
['required', 'true'],
['type', 'java.lang.String'],
['value', 'DB2Foo']],
[['description',
'The JDBC connectivity-type of a data \nsource. If you...
['name', 'driverType'],
['required', 'true'],
['type', 'java.lang.Integer'],
['value', '4']],
[['description',
'"The TCP/IP address or host name for the DRDA server."'],
['name', 'serverName'],
['required', 'false'],
['type', 'java.lang.String'],
['value', 'ServerFoo']],
[['description',
'The TCP/IP port number where the \nDRDA server resides.'],
['name', 'portNumber'],
['required', 'false'],
['type', 'java.lang.Integer'],
['value', '007']],
[['description', '"The description of this datasource."'],
['name', 'description'],
['required', 'false'],
['type', 'java.lang.String'],
['value', []]],
[['description',
'The DB2 trace level for logging to the \nlogWriter ...
['name', 'traceLevel'],
['required', 'false'],
['type', 'java.lang.Integer'],
['value', []]],
[['description',
'The trace file to store the trace output. \nIf you ...
]]]]]]]

-- Paul
The pyparsing wiki is at http://pyparsing.wikispaces.com.
 
Gerard Flanagan

I have a set of files with this kind of content (it's dumped from WebSphere):
[propertySet "[[resourceProperties "[[[description "This is a required
property. This is an actual database name, and its not the locally
catalogued database name. The Universal JDBC Driver does not rely on
...

A couple of comments first:
- What is the significance of '"[' vs. '[' ? I stripped them all out
using

The data can be thought of as a serialised object. A simple attribute
looks like:

[name someWebsphereObject]

or

[jndiName []]

if 'jndiName is None'.

A complex attribute is an attribute whose value is itself an object
(or dict if you prefer). The *value* is indicated with "[...]":

[connectionPool "[[agedTimeout 0]
[connectionTimeout 180]
[freePoolDistributionTableSize 0]
[maxConnections 10]
[minConnections 1]
[numberOfFreePoolPartitions 0]
[numberOfSharedPoolPartitions 0]
[unusedTimeout 1800]]"]

However, 'propertySet' is effectively a keyword and its value may be
thought of as a 'data table' or 'list of data rows', where 'data row'
== dict/object

You can see how the posted example is incomplete because the last
'row' is missing all but one 'column'.

text = text.replace('"[','[')
- Your input text was missing 5 trailing ]'s.

I think only 2 (the original isn't Python). To fix the example, remove
the last 'description' and add two ]'s.

Here's the parser I used, using pyparsing:

from pyparsing import nestedExpr,Word,alphanums,QuotedString
from pprint import pprint

content = Word(alphanums+"_.") | QuotedString('"',multiline=True)
structure = nestedExpr("[", "]", content).parseString(text)

pprint(structure.asList())

By the way, I think this would be a good example for the pyparsing
recipes page (even an IBM developerworks article?)

http://www.ibm.com/developerworks/websphere/library/techarticles/0801_simms/0801_simms.html

Gerard

example data (copied and pasted; doesn't have the case where a complex
attribute has a complex attribute):

[authDataAlias []]
[authMechanismPreference BASIC_PASSWORD]
[connectionPool "[[agedTimeout 0]
[connectionTimeout 180]
[freePoolDistributionTableSize 0]
[maxConnections 10]
[minConnections 1]
[numberOfUnsharedPoolPartitions 0]
[properties []]
[purgePolicy FailingConnectionOnly]
[reapTime 180]
[surgeThreshold -1]
[testConnection false]
[testConnectionInterval 0]
[unusedTimeout 1800]]"]
[propertySet "[[resourceProperties "[[[description "This is a required
property. This is an actual database name, and its not the locally
catalogued database name. The Universal JDBC Driver does not rely on
information catalogued in the DB2 database directory."]
[name databaseName]
[required true]
[type java.lang.String]
[value DB2Foo]] [[description "The JDBC connectivity-type of a data
source. If you want to use a type 4 driver, set the value to 4. If you
want to use a type 2 driver, set the value to 2. Use of driverType 2
is not supported on WAS z/OS."]
[name driverType]
[required true]
[type java.lang.Integer]
[value 4]] [[description "The TCP/IP address or name for the DRDA
server."]
[name serverName]
[required false]
[type java.lang.String]
[value ServerFoo]] [[description "The TCP/IP port number where the
DRDA server resides."]
[name portNumber]
[required false]
[type java.lang.Integer]
[value 007]] [[description "The description of this datasource."]
[name description]
[required false]
[type java.lang.String]
[value []]] [[description "The DB2 trace level for logging to the
logWriter or trace file. Possible trace levels are: TRACE_NONE =
0,TRACE_CONNECTION_CALLS = 1,TRACE_STATEMENT_CALLS =
2,TRACE_RESULT_SET_CALLS = 4,TRACE_DRIVER_CONFIGURATION =
16,TRACE_CONNECTS = 32,TRACE_DRDA_FLOWS =
64,TRACE_RESULT_SET_META_DATA = 128,TRACE_PARAMETER_META_DATA =
256,TRACE_DIAGNOSTICS = 512,TRACE_SQLJ = 1024,TRACE_ALL = -1, ."]
[name traceLevel]
[required false]
[type java.lang.Integer]
[value []]]
]]
 
Mark Wooding

Eric Wertman said:
I have a set of files with this kind of content (it's dumped from
WebSphere):

[propertySet "[[resourceProperties "[[[description "This is a required
property. This is an actual database name, and its not the locally
catalogued database name. The Universal JDBC Driver does not rely on
information catalogued in the DB2 database directory."]
[name databaseName]
[required true]
[type java.lang.String]
[value DB2Foo]] ...

Looks to me like S-expressions with square brackets instead of the
normal round ones. I'll bet that the correct lexical analysis is
approximately

  [                    open-list
  propertySet          symbol
  "                    open-string
  [                    open-list
  [                    open-list
  resourceProperties   symbol
  "                    open-string (not close-string!)
  ...

so it also looks as if strings aren't properly escaped.

This is definitely not a pretty syntax. I'd suggest an initial
tokenization pass for the lexical syntax

  [               open-list
  ]               close-list
  "[              open-qlist
  ]"              close-qlist
  "..."           string
  whitespace      ignore
  anything-else   symbol

Correct nesting should give you two kinds of lists -- which I've shown
as `list' and `qlist' (for quoted-list), though given the nastiness of
the dump you showed, there's no guarantee of correctness.
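
A minimal sketch of such a tokenization pass, using only the standard `re` module. The token kind names mirror the table above; the function name `tokenize` and the exact regular expressions are assumptions made for illustration, and a quoted string that happens to begin with `[` would be misread as an open-qlist.

```python
import re

# Alternatives are tried in order: the two-character qlist delimiters must
# come before the plain brackets and before the string pattern.
TOKEN_RE = re.compile(r'''
    (?P<open_qlist>"\[)
  | (?P<close_qlist>\]")
  | (?P<open_list>\[)
  | (?P<close_list>\])
  | (?P<string>"[^"]*")
  | (?P<ws>\s+)
  | (?P<symbol>[^\s\[\]"]+)
''', re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) pairs for the dump text; whitespace is dropped."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            # Typically an unterminated double quote.
            raise ValueError("cannot tokenize at offset %d" % pos)
        pos = m.end()
        kind = m.lastgroup
        if kind == "ws":
            continue
        value = m.group()
        if kind == "string":
            value = value[1:-1]  # strip the surrounding double quotes
        yield kind, value


for token in tokenize('[connectionPool "[[agedTimeout 0]]"]'):
    print(token)
```

Because the string pattern allows newlines, multi-line descriptions come through as a single string token.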

Turn the input string (or file) into a list (generator?) of lexical
objects above; then scan that recursively. The lists (or qlists) seem
to have two basic forms:

* properties, that is a list of the form [SYMBOL VALUE ...] which can
  be thought of as a declaration that some property, named by the
  SYMBOL, has a particular VALUE (or maybe VALUEs); and

* property lists, which are just lists of properties.

Property lists can be usefully turned into Python dictionaries, indexed
by their SYMBOLs, assuming that they don't try to declare the same
property twice.

There are, alas, other kinds of lists too -- one of the property lists
contains a property `[value []]' which simply contains an empty list.

The right first-cut rule for disambiguation is probably that a property
list is a non-empty list, all of whose items look like properties, and a
property is an entry in a property list, and (initially at least)
restrict properties to the simple form [SYMBOL VALUE] rather than
allowing multiple values.
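
The scan and the first-cut disambiguation rule could be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: tokens are assumed to be (kind, value) pairs with the kinds open_list, close_list, open_qlist, close_qlist, string and symbol; list and qlist are treated alike; an explicit stack is used instead of recursion for the nesting step; and all function names are made up.

```python
def parse(tokens):
    """Build nested Python lists from a flat sequence of (kind, value) tokens."""
    stack = [[]]
    for kind, value in tokens:
        if kind in ("open_list", "open_qlist"):
            stack.append([])
        elif kind in ("close_list", "close_qlist"):
            if len(stack) == 1:
                raise ValueError("closing bracket without a matching opener")
            finished = stack.pop()
            stack[-1].append(finished)
        else:  # string or symbol
            stack[-1].append(value)
    if len(stack) != 1:
        raise ValueError("missing %d closing bracket(s)" % (len(stack) - 1))
    return stack[0]


def is_property(item):
    """A property is restricted to the simple form [SYMBOL VALUE]."""
    return isinstance(item, list) and len(item) == 2 and isinstance(item[0], str)


def is_property_list(item):
    """A property list is a non-empty list whose items all look like properties."""
    return isinstance(item, list) and len(item) > 0 and all(is_property(p) for p in item)


def to_python(value):
    """Turn property lists into dicts; leave other lists and strings as they are."""
    if is_property_list(value):
        result = {}
        for name, val in value:
            if name in result:
                raise ValueError("property %r declared twice" % name)
            result[name] = to_python(val)
        return result
    if isinstance(value, list):
        return [to_python(v) for v in value]
    return value


# Hand-written tokens for:
#   [jndiName []] [connectionPool "[[agedTimeout 0] [maxConnections 10]]"]
tokens = [
    ("open_list", "["), ("symbol", "jndiName"),
    ("open_list", "["), ("close_list", "]"), ("close_list", "]"),
    ("open_list", "["), ("symbol", "connectionPool"), ("open_qlist", '"['),
    ("open_list", "["), ("symbol", "agedTimeout"), ("symbol", "0"), ("close_list", "]"),
    ("open_list", "["), ("symbol", "maxConnections"), ("symbol", "10"), ("close_list", "]"),
    ("close_qlist", ']"'), ("close_list", "]"),
]
print(to_python(parse(tokens)))
# {'jndiName': [], 'connectionPool': {'agedTimeout': '0', 'maxConnections': '10'}}
```

The empty list in `[jndiName []]` is kept as an empty Python list here; mapping it to None instead would be a one-line change in `to_python`.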

Does any of this help?

(In fact, this syntax looks so much like a demented kind of S-expression
that I'd probably try to parse it, initially at least, by using a Common
Lisp system's reader and a custom readtable, but that may not be useful
to you.)

-- [mdw]
 
Eric Wertman

Thanks to everyone for the help and feedback. It's amazing to me that
I've been dealing with odd log files and other outputs for quite a
while, and never really stumbled onto a parser as a solution.


I got this far, with Paul's help, which manages my current set of files:

from pyparsing import nestedExpr, Word, alphanums, QuotedString
from pprint import pprint
import re
import glob

files = glob.glob('wsout/*')

for path in files:
    with open(path) as f:
        text = f.read()
    text = re.sub(r'"\[', ' [', text)     # These 2 lines just drop double quotes
    text = re.sub(r'\]"', '] ', text)     # that aren't related to a string
    text = re.sub(r'\[\]', 'None', text)  # this replaces the empty [] with None
    text = '[ ' + text + ' ]'             # Needs an outer layer

    # Parentheses let the expression continue onto the next line
    content = (Word(alphanums + "-_./()*=#\\${}| :,;\t\n\r@?&%%")
               | QuotedString('"', multiline=True))
    structure = nestedExpr("[", "]", content).parseString(text)

    pprint(structure[0].asList())

I'm sure there are cooler ways to do some of that. I spent most of my
time expanding the characters that constitute content. I'm concerned
that over time I'll have things break as other characters show up.
Specifically, a few of the nodes are of German locale, so I could get
some odd international characters.

It looks like pyparsing has a constant for printable characters. I'm
not sure if I can just use that, without worrying about it?

At any rate, thumbs up on the parser! Definitely going to add to my toolbox.


 
Paul McGuire

I'm sure there are cooler ways to do some of that.  I spent most of my
time expanding the characters that constitute content.  I'm concerned
that over time I'll have things break as other characters show up.
Specifically a few of the nodes are of German locale.. so I could get
some odd international characters.

If you want to add international characters without going to Unicode,
a first cut would be to add pyparsing's string constant "alphas8bit".

It looks like pyparsing has a constant for printable characters.  I'm
not sure if I can just use that, without worrying about it?

I would discourage you from using printables, since it also includes
'[', ']', and '"', which are significant to other elements of the
parser (but you could create your own variable initialized with
printables, and then use replace("[","") etc. to strip out the
offending characters). I'm also a little concerned that you needed to
add \t and \n to the content word - was this really necessary? None
of your examples showed such words, and I would rather have you let
pyparsing skip over the whitespace as is its natural behavior.
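
The "own variable initialized with printables" idea could look roughly like this. To keep the sketch runnable without pyparsing installed, `printables` is rebuilt here from the standard `string` module, on the assumption that pyparsing's constant is the non-whitespace part of `string.printable`; the names `reserved` and `content_chars` are made up for illustration.

```python
import string

# Stand-in for pyparsing's `printables` (assumed to be every character of
# string.printable that is not whitespace).
printables = "".join(c for c in string.printable if c not in string.whitespace)

# Characters that carry structural meaning in the dump must not be part of a word.
reserved = '[]"'
content_chars = "".join(c for c in printables if c not in reserved)

print(content_chars)
# With pyparsing, this string would then replace the hand-built character
# list, for example: Word(content_chars)
```

Filtering with a generator expression does the same job as chaining several replace() calls, and it leaves whitespace out of the word characters so the parser can skip it on its own.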

-- Paul
 
Eric Wertman

I would discourage you from using printables, since it also includes
'[', ']', and '"', which are significant to other elements of the
parser (but you could create your own variable initialized with
printables, and then use replace("[","") etc. to strip out the
offending characters). I'm also a little concerned that you needed to
add \t and \n to the content word - was this really necessary? None
of your examples showed such words, and I would rather have you let
pyparsing skip over the whitespace as is its natural behavior.

-- Paul

You are right... I have taken those out and it still works. I was
adding everything I could think of at one point in trying to determine
what was breaking the parser. Some of the data in there is free-form
input, which means that any developer could have put just about
anything in there. I find a lot of ^M stuff from day to day in
other places.
 
