splitting words with brackets

Qiangning Hong

I've got some strings to split. They are mainly words, but some words
are inside a pair of brackets and should be treated as one unit. I'd
prefer to use re.split, but haven't managed to write a working pattern
after hours of work.

Example:

"a (b c) d [e f g] h i"
should be split into
["a", "(b c)", "d", "[e f g]", "h", "i"]

As speed is a factor, it would be best if a single-line regular
expression could handle this. I tried this, but it failed:
re.split(r"(?![\(\[].*?)\s+(?!.*?[\)\]])", s). It works for "(a b) c"
but not for "a (b c)" :(

Any hint?
 
faulkner

re.findall('\([^\)]*\)|\[[^\]]*|\S+', s)

er, the second alternative should end with \]:
...|\[[^\]]*\]|...
^_^

 
Qiangning Hong

faulkner said:
re.findall('\([^\)]*\)|\[[^\]]*|\S+', s)

Sorry, I forgot to mention a constraint: if a letter is next to a
bracket, they should be considered one word, i.e.:
"a(b c) d" becomes ["a(b c)", "d"]
because there is no blank between "a" and "(".
 
Justin Azoff

faulkner said:
er,
...|\[[^\]]*\]|...
^_^

That's why it is nice to use re.VERBOSE:

import re

def splitup(s):
    return re.findall(r'''
        \( [^\)]* \) |
        \[ [^\]]* \] |
        \S+
    ''', s, re.VERBOSE)

Much less error-prone this way.
 
Tim Chase

> "a (b c) d [e f g] h i"
> should be split into
> ["a", "(b c)", "d", "[e f g]", "h", "i"]
>
> As speed is a factor to consider, it's best if there is a
> single line regular expression can handle this. I tried
> this but failed:
> re.split(r"(?![\(\[].*?)\s+(?!.*?[\)\]])", s). It work
> for "(a b) c" but not work "a (b c)" :(
>
> Any hint?

[and later added]
> sorry i forgot to give a limitation: if a letter is next
> to a bracket, they should be considered as one word. i.e.:
> "a(b c) d" becomes ["a(b c)", "d"] because there is no
> blank between "a" and "(".
>>> import re
>>> s ='a (b c) d [e f g] h ia abcd(b c)xyz d [e f g] h i'
>>> r = re.compile(r'(?:\S*(?:\([^\)]*\)|\[[^\]]*\])\S*)|\S+')
>>> r.findall(s)
['a', '(b c)', 'd', '[e f g]', 'h', 'ia', 'abcd(b c)xyz', 'd',
'[e f g]', 'h', 'i']

I'm sure there's a *much* more elegant pyparsing solution to
this, but I don't have the pyparsing module on this machine.
A pyparsing grammar would be much clearer and far more readable
when you come back to it later.

However, the above monstrosity passes the tests I threw at
it.

-tkc
 
Simon Forman

Qiangning said:
faulkner said:
re.findall('\([^\)]*\)|\[[^\]]*|\S+', s)

sorry i forgot to give a limitation: if a letter is next to a bracket,
they should be considered as one word. i.e.:
"a(b c) d" becomes ["a(b c)", "d"]
because there is no blank between "a" and "(".

This variation seems to do it:

import re

s = "a (b c) d [e f g] h i(j k) l [m n o]p q"

def splitup(s):
    return re.findall(r'''
        \S*\( [^\)]* \)\S* |
        \S*\[ [^\]]* \]\S* |
        \S+
    ''', s, re.VERBOSE)

print(splitup(s))

# Prints

['a', '(b c)', 'd', '[e f g]', 'h', 'i(j k)', 'l', '[m n o]p', 'q']


Peace,
~Simon
 
Qiangning Hong

Tim said:
import re
s ='a (b c) d [e f g] h ia abcd(b c)xyz d [e f g] h i'
r = re.compile(r'(?:\S*(?:\([^\)]*\)|\[[^\]]*\])\S*)|\S+')
r.findall(s)
['a', '(b c)', 'd', '[e f g]', 'h', 'ia', 'abcd(b c)xyz', 'd',
'[e f g]', 'h', 'i']
[...]
However, the above monstrosity passes the tests I threw at
it.

But it fails on this one: "(a c)b(c d) e"
The regex above gives ['(a c)b(c', 'd)', 'e'], but the correct result
should be ['(a c)b(c d)', 'e'].
 
Qiangning Hong

Simon said:
def splitup(s):
    return re.findall('''
        \S*\( [^\)]* \)\S* |
        \S*\[ [^\]]* \]\S* |
        \S+
    ''', s, re.VERBOSE)

Yours behaves the same as Tim's: it can't handle a word with two or
more bracket pairs either.

I tried to change the "\S*\([^\)]*\)\S*" part to "(\S|\([^\)]*\))*",
but it turned into a mess.
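
One likely contributor to that mess, offered as a guess rather than a diagnosis of the exact attempt: re.findall returns the contents of a capturing group instead of the whole match, and a repeated group keeps only its last repetition. A minimal sketch on a simpler pattern (the sample string is made up for illustration):

```python
import re

sample = "ab cd"  # hypothetical input, not taken from the thread

# Capturing group: findall returns group 1, i.e. only the last repetition.
capturing = re.findall(r"(\S)+", sample)

# Non-capturing group: findall returns each whole match.
non_capturing = re.findall(r"(?:\S)+", sample)

print(capturing)      # ['b', 'd']
print(non_capturing)  # ['ab', 'cd']
```

Repeating the group with "*" rather than "+" also lets the pattern match the empty string, which adds empty entries to the result list.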
 
Simon Forman

Qiangning said:
but it can't pass this one: "(a c)b(c d) e"
the above regex gives out ['(a c)b(c', 'd)', 'e'], but the correct one
should be ['(a c)b(c d)', 'e']

What are the desired results in cases like this:

"(a b)[c d]" or "(a b)(c d)" ?
 
Tim Chase

but it can't pass this one: "(a c)b(c d) e" the above regex
gives out ['(a c)b(c', 'd)', 'e'], but the correct one should
be ['(a c)b(c d)', 'e']

Ah...the picture is becoming a little more clear:
>>> s = "(a c)b(c d) e"
>>> r = re.compile(r'(?:\([^\)]*\)|\[[^\]]*\]|\S)+')
>>> r.findall(s)
['(a c)b(c d)', 'e']

It also works on my original test data, and is a cleaner regexp
than the original.

The clearer the problem, the clearer the answer. :)

-tkc
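
For anyone who wants to re-check that pattern against every example raised so far, here is a small sketch. The pattern is the one posted above; the names TOKEN, splitup and CASES are only illustrative.

```python
import re

# A token is one or more of: a "(...)" group, a "[...]" group,
# or a single non-space character.
TOKEN = re.compile(r'(?:\([^\)]*\)|\[[^\]]*\]|\S)+')

def splitup(s):
    return TOKEN.findall(s)

# Inputs and expected outputs quoted from earlier posts in the thread.
CASES = [
    ("a (b c) d [e f g] h i", ["a", "(b c)", "d", "[e f g]", "h", "i"]),
    ("a(b c) d", ["a(b c)", "d"]),
    ("(a c)b(c d) e", ["(a c)b(c d)", "e"]),
]

for text, expected in CASES:
    result = splitup(text)
    print(text, "->", result, "ok" if result == expected else "MISMATCH")
```

With this pattern, "(a b)[c d]" and "(a b)(c d)" each come back as a single token. Nested brackets are not treated as a unit: "(a (b c) d)" is split at the space after the first closing parenthesis.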
 
Qiangning Hong

Tim said:
Ah...the picture is becoming a little more clear:
r = re.compile(r'(?:\([^\)]*\)|\[[^\]]*\]|\S)+')
r.findall(s)
['(a c)b(c d)', 'e']

It also works on my original test data, and is a cleaner regexp
than the original.

The clearer the problem, the clearer the answer. :)

Ah, it's exactly what I want! I thought the left and right sides of
"|" were equal, but that is not true. I think I must sleep right now;
lack of sleep is making me dull :p. Thank you and Simon for your kind
help!
 
Paul McGuire

Tim Chase said:
I'm sure there's a *much* more elegant pyparsing solution to
this, but I don't have the pyparsing module on this machine.
It's much better/clearer and will be far more readable when
you come back to it later.

However, the above monstrosity passes the tests I threw at
it.

-tkc

:) Cute! (but how come no pyparsing on your machine?)

Ok, I confess I looked at the pyparsing list parser to see how it compares.
Pyparsing's examples include a list parser that comprehends nested lists
within lists, but this is a bit different, and really more straightforward.

Here's my test program for this modified case:

from pyparsing import Word, alphas, Combine, Optional, SkipTo, ZeroOrMore

wrd = Word(alphas)
parenList = Combine(Optional(wrd) + "(" + SkipTo(")") + ")" +
                    Optional(wrd))
brackList = Combine(Optional(wrd) + "[" + SkipTo("]") + "]" +
                    Optional(wrd))
listExpr = ZeroOrMore(parenList | brackList | wrd)

txt = "a (b c) d [e f g] h i(j k) l [m n o]p q"
print(listExpr.parseString(txt))

Gives:
['a', '(b c)', 'd', '[e f g]', 'h', 'i(j k)', 'l', '[m n o]p', 'q']


Comparative timing of pyparsing vs. re comes in at about 2ms for pyparsing
vs. 0.13ms for re's, so re's are about 15x faster. If psyco is used (and we
skip the first call, which incurs all the compiling overhead), the speed
difference drops to about 7-10x. I did try compiling the re, but this
didn't appear to make any difference - probably user error.

Since the OP indicates a concern for speed (he must be compiling a lot of
strings, I guess), it would be tough to recommend pyparsing - especially in
the face of a working re that so neatly does the trick. But if at some
point it became necessary to add support for {}'s and <>'s, or quoted
strings, I'd rather be working with a pyparsing grammar than that crazy re
gibberish!

-- Paul
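
For comparison, here is a rough sketch of how the regex approach could be stretched to extra bracket types such as {} and <> by generating the pattern from a list of pairs. The helper name make_splitter and the sample string are hypothetical, and quoted strings or nesting are not covered.

```python
import re

def make_splitter(pairs=("()", "[]", "{}", "<>")):
    """Build a pattern in the style used in this thread for the given pairs."""
    alternatives = []
    for open_ch, close_ch in pairs:
        o, c = re.escape(open_ch), re.escape(close_ch)
        # Opener, then anything that is not the closer, then the closer.
        alternatives.append(o + "[^" + c + "]*" + c)
    # A lone non-space character goes last so bracket groups are tried first.
    alternatives.append(r"\S")
    return re.compile("(?:" + "|".join(alternatives) + ")+")

splitter = make_splitter()
print(splitter.findall("a {b c} d <e f>g (h i)[j k]"))
# ['a', '{b c}', 'd', '<e f>g', '(h i)[j k]']
```

Generating the pattern keeps each alternative in the same shape, at the cost of hiding the final regex from the reader.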
 
Paul McGuire

Ah, I had just made the same change!


from pyparsing import *

wrd = Word(alphas)
parenList = "(" + SkipTo(")") + ")"
brackList = "[" + SkipTo("]") + "]"
listExpr = ZeroOrMore( Combine( OneOrMore( parenList | brackList | wrd ) ) )

t = "a (b c) d [e f g] h i(j k) l [m n o]p q r (t u)v(w) (x)(y)z"
print(listExpr.parseString(t))


Gives:
['a', '(b c)', 'd', '[e f g]', 'h', 'i(j k)', 'l', '[m n o]p', 'q', 'r',
'(t u)v(w)', '(x)(y)z']
 
Tim Chase

r = re.compile(r'(?:\([^\)]*\)|\[[^\]]*\]|\S)+')
r.findall(s)
['(a c)b(c d)', 'e']

Ah, it's exactly what I want! I thought the left and right
sides of "|" are equal, but it is not true.

In theory, they *should* be equal. I was baffled by the nonparity
of the situation. You *should* be able to swap the two sides of
the "|" and have it treated the same. Yet, when I tried it with
the above regexp, putting the \S first, it seemed to choke and
give different results. I'd love to know why.

Thank you and Simon for your kindly help!

My pleasure. A nice diversion from swatting spammers and getting
our network back up and running today. I had hoped to actually
get something productive done (i.e. writing some python code)
rather than putting out fires. Sigh.

-tkc
 
Justin Azoff

Paul said:
Comparative timing of pyparsing vs. re comes in at about 2ms for pyparsing
vs. 0.13ms for re's, so re's are about 15x faster. If psyco is used (and we
skip the first call, which incurs all the compiling overhead), the speed
difference drops to about 7-10x. I did try compiling the re, but this
didn't appear to make any difference - probably user error.

That is because of how the methods in the sre module are implemented...
Compiling a regex really just saves you a dictionary lookup.

def findall(pattern, string, flags=0):
    """snip"""
    return _compile(pattern, flags).findall(string)

def compile(pattern, flags=0):
    """snip"""
    return _compile(pattern, flags)

def _compile(*key):
    # internal: compile pattern
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None:
        return p
    #snip
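
A quick way to check this on your own machine is to time both call styles, as in the sketch below. The function names and the repeat count are arbitrary choices, and no timings are claimed here; the numbers depend on the interpreter and version.

```python
import re
import timeit

PATTERN = r'(?:\([^\)]*\)|\[[^\]]*\]|\S)+'
TEXT = "a (b c) d [e f g] h i(j k) l [m n o]p q"

compiled = re.compile(PATTERN)

def with_module_function():
    # Looks the pattern up in the module's cache on every call.
    return re.findall(PATTERN, TEXT)

def with_compiled_pattern():
    # Calls the method on the already compiled pattern object.
    return compiled.findall(TEXT)

if __name__ == "__main__":
    for fn in (with_module_function, with_compiled_pattern):
        elapsed = timeit.timeit(fn, number=20000)
        print("%-22s %.4f s" % (fn.__name__, elapsed))
```

Both functions return the same list, so any difference in the timings is the cost of the cache lookup rather than of recompiling the pattern.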
 
Paul McGuire

Tim Chase said:
r = re.compile(r'(?:\([^\)]*\)|\[[^\]]*\]|\S)+')
r.findall(s)
['(a c)b(c d)', 'e']

Ah, it's exactly what I want! I thought the left and right
sides of "|" are equal, but it is not true.

In theory, they *should* be equal. I was baffled by the nonparity
of the situation. You *should* be able to swap the two sides of
the "|" and have it treated the same. Yet, when I tried it with
the above regexp, putting the \S first, it seemed to choke and
give different results. I'd love to know why.

Does the re do left-to-right matching? If so, then the \S will eat the
opening parens/brackets, and never get into the other alternative patterns.
\S is the most "matchable" pattern, so if it comes ahead of the other
alternatives, then it will always be the one matched. My guess is that if
you put \S first, you will only get the contiguous character groups,
regardless of ()'s and []'s. The expression might as well just be \S+.

Or I could be completely wrong...

-- Paul
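
That guess matches how Python's re module behaves: alternatives separated by "|" are tried from left to right, and the first one that matches is taken. A short sketch comparing the two orderings on the test string from earlier in the thread (the variable names are illustrative):

```python
import re

s = "(a c)b(c d) e"

# Bracket groups first, lone non-space character last.
brackets_first = re.compile(r'(?:\([^\)]*\)|\[[^\]]*\]|\S)+')

# Same alternatives with \S moved to the front.
nonspace_first = re.compile(r'(?:\S|\([^\)]*\)|\[[^\]]*\])+')

print(brackets_first.findall(s))  # ['(a c)b(c d)', 'e']
print(nonspace_first.findall(s))  # ['(a', 'c)b(c', 'd)', 'e']
print(re.findall(r'\S+', s))      # ['(a', 'c)b(c', 'd)', 'e']
```

With \S in front, an opening bracket is consumed as an ordinary character, so the bracket alternatives are only tried at a space, where they cannot match either.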
 
