Using Regular Expressions to Parse SQL

ct60 · Feb 5, 2008

Hello again -

I do not seem to be able to get a handle on non-greedy pattern
matching.

I am trying to parse the following - note that there are no line
breaks in the string:

" FROM ((qry_Scores_Lookup1 INNER JOIN CSS_Rpt1 ON
(qry_Scores_Lookup1.desc = CSS_Rpt1.desc) AND
(qry_Scores_Lookup1.lastcdu = CSS_Rpt1.lastcdu))"

What I would like to do is be able to split on the "AND" and the "="
to come up with the following:
qry_Scores_Lookup1.desc
CSS_Rpt1.desc
qry_Scores_Lookup1.lastcdu
CSS_Rpt1.lastcdu

The following is one of my many attempts to do this:

import re

s= " FROM ((qry_Scores_Lookup1 INNER JOIN CSS_Rpt1 ON
(qry_Scores_Lookup1.desc = CSS_Rpt1.desc) AND
(qry_Scores_Lookup1.lastcdu = CSS_Rpt1.lastcdu))"

pat = " FROM .+ (?:INNER|LEFT|RIGHT) JOIN .+ ON (?:AND)*?((.+)=(.+))"

m = re.match(pat, s)

if m is None:
print "No Match"
else:
for mat in m.groups():
print mat

My pattern does not even come close.

Any help would be greatly appreciated. My goal is to analyse a large
number of SQL querys to try to identify the join field and see where
indexing might make sense.

While I am mostly interested in understanding regular expressions, I
would also be interested in knowing about any Python SQL parsers out
there.

Thanks in advance.

Chris ([email protected])

Paul McGuire · Feb 5, 2008

Hello again -

I do not seem to be able to get a handle on non-greedy pattern
matching.

Regexps wont cut it when you have to parse nested ()'s in a logical
expression.

Here is a pyparsing solution. For your specific application, you will
need to sift through WHERE clauses too, you should be able to use this
as a starting point for creating a WHERE clause parser.

-- Paul

from pyparsing import *

# define terminal items
identifier = Word(alphas,alphanums+'_')
column_name = Combine( identifier + ZeroOrMore('.' + identifier) )
integer = Word(nums)
value = integer | sglQuotedString

relationalOperator = oneOf("< = > >= <= != <>")
simple_comparison = Group(column_name + relationalOperator +
(column_name | value))

AND_cl = CaselessLiteral("AND")
OR_cl = CaselessLiteral("OR")
NOT_cl = CaselessLiteral("NOT")

logical_expression = operatorPrecedence( simple_comparison,
[
(NOT_cl, 1, opAssoc.RIGHT),
(OR_cl, 2, opAssoc.LEFT),
(AND_cl, 2, opAssoc.LEFT),
])

# syntax for FROM ... join expression
FROM_cl = CaselessLiteral("FROM")
ON_cl = CaselessLiteral("ON")
INNER_cl = CaselessLiteral("INNER")
JOIN_cl = CaselessLiteral("JOIN")

join_expr = identifier('table1') + INNER_cl + JOIN_cl + \
identifier('table2') + ON_cl + \
logical_expression('join_condition')
from_clause = FROM_cl + '(' + join_expr + ')'

test = """ FROM (qry_Scores_Lookup1 INNER JOIN CSS_Rpt1 ON
(qry_Scores_Lookup1.desc = CSS_Rpt1.desc) AND
(qry_Scores_Lookup1.lastcdu = CSS_Rpt1.lastcdu))"""

# parsing the FROM clause normally
join_fields = from_clause.parseString(test)
print "\nParsed tokens, plus any named fields"
print join_fields.dump()
print "\nAccessing some named fields directly"
print "table1 =", join_fields.table1
print "join_condition =", join_fields.join_condition

# create parse action to tally up column names as they are found
# during the parsing process
def tally_names(tokens):
column_names.append(tokens[0])
column_name.setParseAction(tally_names)

# parse the FROM clause, but just let the parse action do the work
print "\nExtracting the column names"
column_names = []
from_clause.parseString(test)
print column_names

Prints:

Parsed tokens, plus any named fields
['FROM', '(', 'qry_Scores_Lookup1', 'INNER', 'JOIN', 'CSS_Rpt1', 'ON',
[['qry_Scores_Lookup1.desc', '=', 'CSS_Rpt1.desc'], 'AND',
['qry_Scores_Lookup1.lastcdu', '=', 'CSS_Rpt1.lastcdu']], ')']
- join_condition: [['qry_Scores_Lookup1.desc', '=', 'CSS_Rpt1.desc'],
'AND', ['qry_Scores_Lookup1.lastcdu', '=', 'CSS_Rpt1.lastcdu']]
- table1: qry_Scores_Lookup1
- table2: CSS_Rpt1

Accessing some named fields directly
table1 = qry_Scores_Lookup1
join_condition = [['qry_Scores_Lookup1.desc', '=', 'CSS_Rpt1.desc'],
'AND', ['qry_Scores_Lookup1.lastcdu', '=', 'CSS_Rpt1.lastcdu']]

Extracting the column names
['qry_Scores_Lookup1.desc', 'CSS_Rpt1.desc',
'qry_Scores_Lookup1.lastcdu', 'CSS_Rpt1.lastcdu']

Reedick, Andrew · Feb 5, 2008

-----Original Message-----
From: [email protected] [mailtoython-
[email protected]] On Behalf Of (e-mail address removed)
Sent: Tuesday, February 05, 2008 9:31 AM
To: (e-mail address removed)
Subject: Using Regular Expressions to Parse SQL

My pattern does not even come close.

Any help would be greatly appreciated. My goal is to analyse a large
number of SQL querys to try to identify the join field and see where
indexing might make sense.

While I am mostly interested in understanding regular expressions, I
would also be interested in knowing about any Python SQL parsers out
there.

Python's regexes are a tad odd compared to Perl (or Perl's regexes are
odd. Relativity.) You need re.DOTALL to handle newlines in the sql:
DOTALL
Make the "." special character match any character at all,
including a newline; without this flag, "." will match anything except a
newline.

import re

s= ''' FROM ((qry_Scores_Lookup1 INNER JOIN CSS_Rpt1 ON
(qry_Scores_Lookup1.desc = CSS_Rpt1.desc) AND
(qry_Scores_Lookup1.lastcdu = CSS_Rpt1.lastcdu))'''

pat =
r"(^|\s+)FROM\s.+\s(?:INNER|LEFT|RIGHT)\s+JOIN\s.+\sON\s+((\s*(?:\S+)\s*
=\s*(?:\S+))(?:\s+|\s+AND\s+|$))+"
m = re.compile(pat, re.DOTALL).match(s)
if m:
print m

('(qry_Scores_Lookup1.desc = CSS_Rpt1.desc) ', '(qry_Scores_Lookup1.desc
= CSS_Rpt1.desc)')

You should be able to tweak the regex to get the exact matches you want.

An alternative to writing a lengthy regex might be to just grab
everything after the ON and then string split on "AND".

*****

The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential, proprietary, and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from all computers. GA623

ct60 · Feb 5, 2008

Firstly, thanks to those who posted.

I just do not understand how the non-greedy operator works.

Using the following code:

import re

s = "qry_Lookup.desc = CSS_Rpt1.desc AND qry_Lookup.lcdu1 =
CSS_Rpt1.lcdu"

pat = "(.+=)+?(.+)"

m = re.match(pat, s)

if m is None:
print "No Match"
else:
for mat in m.groups():
print mat

I would expect that the first group would match one or more times with
the fewest amount of text. However the result is:

qry_Lookup.desc = CSS_Rpt1.desc AND qry_Lookup.lcdu1 =
CSS_Rpt1.lcdu

The first match of the "=" char is still greedy. I would have
expected:
qry_Lookup.desc =

CSS_Rpt1.desc AND qry_Lookup.lcdu1
=
CSS_Rpt1.lcdu

I'm obviously missing something because the non-greedy match seems to
not be acting as expected.

Any insights would be greatly appreciated.

Thanks in advance,

Chris ([email protected])

MRAB · Feb 6, 2008

Firstly, thanks to those who posted.

I just do not understand how the non-greedy operator works.

Using the following code:

import re

s = "qry_Lookup.desc = CSS_Rpt1.desc AND qry_Lookup.lcdu1 =
CSS_Rpt1.lcdu"

pat = "(.+=)+?(.+)"

^^
| This is your problem.

The ".+" is greedy and is matching up to the last "=". Try "(.+?=)(.
+)" instead.

Python Regular Expressions	4	Jun 22, 2011
I keep getting this error when im trying to show category name.	0	Dec 26, 2023
Large regular expressions	1	Mar 15, 2010
Using Xpath to parse a Yahoo Finance page	4	Dec 3, 2012
Regular Expressions to parse HTML	2	May 2, 2006
Processing regular expressions?	2	Oct 15, 2010
Python regular expressions just ain't PCRE	13	May 5, 2007
regexp(ing) Backus-Naurish expressions ...	7	Mar 13, 2013

Using Regular Expressions to Parse SQL

ct60

Paul McGuire

Reedick, Andrew

ct60

MRAB

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads