Regexp: unexpected splitting of string into several groups

Piet

Hello,
I have a very strange problem with regular expressions. I am
analyzing the properties of the columns of a MySQL database.
When I request the column type, I get back a string of the following
form:
vartype(width[,decimals]|list) further variable attributes.
vartype is a simple string (varchar, tinyint ...) which might be
followed by a string in curved brackets. This bracketed string is
composed of either a single number, two numbers separated by a comma,
or a list of strings separated by commas. After the bracketed string,
there might be a list of further strings (separated by blanks)
describing some more properties of the column.
Typical examples are:
char(30) binary
int(10) zerofill
float(3,2)...
I would like to extract the vartype, the bracketed string and the
further properties separately and thus defined the following regular
expression:
#snip
vartypePattern = re.compile("([a-zA-Z]+)(\(.*\))*([^(].*[^)])")
vartypeSplit = vartypePattern.match("float(3,2) not null")
#snip
That works for some expressions with a bracketed part. E.g. the
above expression gives back:
vartypeSplit.groups() = ('float', '(3,2)', ' not null').
However, simple one-word expressions like
vartypeSplit = vartypePattern.match("float")
are always split into two strings. The result is:
vartypeSplit.groups() = ('flo', None, 'at').
I would have expected either ('float',None,None) or ('float','','').
For other strings, the last two characters also end up in a
separate group.
Is this a bug or a feature? ;-)
Can anybody point me in the right direction to solve the problem?
Many thanks
Piet
 
Roy Smith

vartypePattern = re.compile("([a-zA-Z]+)(\(.*\))*([^(].*[^)])")
[...]
simple one-string expressions like
vartypeSplit = vartypePattern.match("float")
are always split into two strings. The result is:
vartypeSplit.groups() = ('flo', None, 'at').

I think I see your problem.

Let's take a simpler pattern, "(a*)(a*)", that says to match "zero or
more a's followed by zero or more a's". If you feed it a string like
"aaa", it's ambiguous. Any of the following matches would satisfy the
basic pattern:

()(aaa)
(a)(aa)
(aa)(a)
(aaa)()
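
As a side note, Python's re module does not choose among these at random. Greedy quantifiers are tried from left to right, each taking as much as it can and giving characters back only when the rest of the pattern would otherwise fail. A quick check (a minimal sketch, nothing more):

```python
import re

# The first a* is greedy and grabs all three a's;
# the second a* is then satisfied by the empty string.
m = re.match(r"(a*)(a*)", "aaa")
print(m.groups())  # ('aaa', '')
```

So of the four candidates listed above, the engine settles on the last one.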

This is sort of what's going on here. Your regex is:

([a-zA-Z]+)(\(.*\))*([^(].*[^)])

which breaks down into three groups:

([a-zA-Z]+) # one or more letters
(\(.*\))* # any string inside ()'s, zero or more times
([^(].*[^)]) # any char not '(', any string, any char not ')'

In your case, the first group matches "flo", the next group matches
nothing, and the third group matches "at". It's not what you expected,
but you've got an ambiguous RE, and this is one of the (many) ways the
group matches could be satisfied.

Looking at your English description of the pattern:
vartype is a simple string (varchar, tinyint ...) which might be
followed by a string in curved brackets. This bracketed string is
either composed of a single number, two numbers separated by a comma,
or a list of strings separated by a comma. After the bracketed string,
there might be a list of further strings (separated by blanks)

I don't think you described it right. If the bracketed string is
missing, and "list of further strings" is present, there needs to be
whitespace between the vartype and the beginning of the list, right?
I'm not completely sure this can be expressed in a single regex. I
suspect it can, but I also suspect it's more trouble than it's worth.

I think it would be simpler (and clearer) to break this up into a
couple of steps. First match the vartype and remove it from the string
(re.split, slicing, whatever). Then see if what you've got left starts
with a '('. If it does, match everything up to the ')' and remove it
from the string. What's left is the "list of further strings". It's a
bit more verbose than one huge regex, and most likely slower too, but
it'll be a lot easier to read and debug. You should only worry about it
being slower if profiling shows that this is a critical section of code.
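
The steps above could be sketched roughly as follows. This is only an illustration: the function name split_column_type is made up, and it assumes the vartype consists of plain letters and that the bracketed part contains no ')' of its own (an enum value containing ')' would break it).

```python
import re

def split_column_type(column_type):
    """Split e.g. 'float(3,2) not null' into (vartype, bracketed, extras).

    Hypothetical helper, not part of any library.
    """
    # Step 1: match the vartype and remove it from the string.
    m = re.match(r"[a-zA-Z]+", column_type)
    if m is None:
        raise ValueError("no vartype found in %r" % column_type)
    vartype = m.group()
    rest = column_type[m.end():]

    # Step 2: if what is left starts with '(', take everything up to ')'.
    bracketed = None
    if rest.startswith("("):
        close = rest.find(")")
        if close == -1:
            raise ValueError("missing ')' in %r" % column_type)
        bracketed = rest[1:close]
        rest = rest[close + 1:]

    # Step 3: what is left is the blank-separated list of further strings.
    return vartype, bracketed, rest.split()

print(split_column_type("float(3,2) not null"))  # ('float', '3,2', ['not', 'null'])
print(split_column_type("float"))                # ('float', None, [])
```

Returning the extras as a list is a design choice of this sketch; joining them back with " ".join(...) would give a single string instead.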
 
Christos TZOTZIOY Georgiou

vartype is a simple string (varchar, tinyint ...) which might be
followed by a string in curved brackets. This bracketed string is
either composed of a single number, two numbers separated by a comma,
or a list of strings separated by a comma. After the bracketed string,
there might be a list of further strings (separated by blanks)
describing some more properties of the column.
Typical examples are:
char(30) binary
int(10) zerofill
float(3,2)...
I would like to extract the vartype, the bracketed string and the
further properties separately and thus defined the following regular
expression:

Does this RE work for you?

import re

tre = re.compile(r"(\w+)"                             # vartype
                 r"(?:\(([\d\w]+(?:,[\d\w]+)*)\))?"   # optional (a,b,...) part
                 r"(\s+\w+)*")                        # further properties

For your last example, float(3,2), tre.match(...).groups() gives:
('float', '3,2', None)

PS1 if you make the re slightly more complex, you can avoid the initial
space in the third "properties" group. I also assumed no space between
the "vartype" and the left parenthesis (if it is there).
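
One possible way to do what PS1 hints at is sketched below. The name tre2 and the rewritten third group are assumptions for illustration, not something tested against real MySQL output. The sketch also shows a caveat of the pattern above: a repeated capturing group such as (\s+\w+)* keeps only its last repetition, so "not null" comes back as ' null'.

```python
import re

# The pattern from above, unchanged.
tre = re.compile(r"(\w+)"
                 r"(?:\(([\d\w]+(?:,[\d\w]+)*)\))?"
                 r"(\s+\w+)*")

# Variant (assumption): keep the whitespace outside the capturing group
# and capture all trailing words in one go.
tre2 = re.compile(r"(\w+)"
                  r"(?:\(([\d\w]+(?:,[\d\w]+)*)\))?"
                  r"(?:\s+(\w+(?:\s+\w+)*))?")

print(tre.match("float(3,2) not null").groups())   # ('float', '3,2', ' null')
print(tre2.match("float(3,2) not null").groups())  # ('float', '3,2', 'not null')
print(tre2.match("char(30) binary").groups())      # ('char', '30', 'binary')
print(tre2.match("float").groups())                # ('float', None, None)
```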

PS2 redemo.py, somewhere in your Python installation, is a good friend
of yours.

PS3 I have been a fan of regular expressions for years, and I often
overuse them. Perhaps somebody else can give you better advice than I can.
 
Piet van Oostrum

Piet (P) wrote:

P> vartypePattern = re.compile("([a-zA-Z]+)(\(.*\))*([^(].*[^)])")

P> However, simple one-string expressions like
P> vartypeSplit = vartypePattern.match("float")
P> are always split into two strings. The result is:
P> vartypeSplit.groups() = ('flo', None, 'at').
P> I would have either expected ('float',None,None) or ('float','','').
P> For other strings, the last two characters are also found in a
P> separate group.
P> Is this a bug or a feature? ;-)

It is a feature:
The last part: [^(].*[^)] says: a character which is not (, possibly more
characters and a character which is not ). So at least two characters.
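
To make that concrete, here is a small check (a sketch; the names third and full are only for illustration). Because the last group cannot match fewer than two characters, the first group has to give two letters back, and a two-letter type cannot match at all:

```python
import re

# The third group on its own: it needs at least two characters.
third = re.compile(r"[^(].*[^)]")
print(third.fullmatch("t"))               # None
print(third.fullmatch("at") is not None)  # True

# The full pattern from the question.
full = re.compile(r"([a-zA-Z]+)(\(.*\))*([^(].*[^)])")
print(full.match("float").groups())       # ('flo', None, 'at')
print(full.match("ab"))                   # None: nothing is left for group 1
```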

Maybe you mean something like [^()]*
Or would you like to accept )xxx) or )yyy(?
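
With [^()]* as the last group, and everything else left as in the original pattern, the examples from the question come out as follows (a minimal sketch; the name fixed is only for illustration):

```python
import re

# Only the third group is changed: it may now be empty.
fixed = re.compile(r"([a-zA-Z]+)(\(.*\))*([^()]*)")

print(fixed.match("float").groups())                # ('float', None, '')
print(fixed.match("float(3,2) not null").groups())  # ('float', '(3,2)', ' not null')
print(fixed.match("char(30) binary").groups())      # ('char', '(30)', ' binary')
```

The leading blank in the last group is still there; calling .strip() on it is one simple way to get rid of it.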
 