Issue with regular expressions

Julien · Apr 29, 2008

Hi,

I'm fairly new to Python and I haven't used regular expressions
enough to be able to achieve what I want.
I'd like to select terms in a string, so I can then do a search in my
database.

query = ' " some words" with and "without quotes " '
p = re.compile(magic_regular_expression)  # <--- the magic happens
m = p.match(query)

I'd like m.groups() to return:
('some words', 'with', 'and', 'without quotes')

Is that achievable with a single regular expression, and if so, what
would it be?

Any help would be much appreciated.

Thanks!!

Julien

Paul McGuire · Apr 29, 2008

Julien -

I dabbled with re's for a few minutes trying to get your solution,
then punted and used pyparsing instead. Pyparsing will run slower
than re, but many people find it much easier to work with readable
class names and instances rather than re's typoglyphics:

from pyparsing import (OneOrMore, Word, printables, dblQuotedString,
                       removeQuotes)

# when a quoted string is found, remove the quotes,
# then strip whitespace from the contents
dblQuotedString.setParseAction(removeQuotes,
                               lambda s: s[0].strip())

# define terms to be found in query string
term = dblQuotedString | Word(printables)
query_terms = OneOrMore(term)

# parse query string to extract terms
query = ' " some words" with and "without quotes " '
print(tuple(query_terms.parseString(query)))

Gives:
('some words', 'with', 'and', 'without quotes')

The pyparsing wiki is at http://pyparsing.wikispaces.com. You'll find
an examples page that includes a search query parser, and pointers to
a number of online documentation and presentation sources.

-- Paul

Paul Melis · Apr 29, 2008

Here's one way with a single regexp plus an extra filter function.

>>> import re
>>> q = ' " some words" with and "without quotes " '
>>> p = re.compile('("([^"]+)")|([^ \t]+)')
>>> m = p.findall(q)
>>> m
[('" some words"', ' some words', ''), ('', '', 'with'), ('', '',
'and'), ('"without quotes "', 'without quotes ', '')]
>>> def f(t):
...     if t[0] == '':
...         return t[2]
...     else:
...         return t[1]
...
>>> [f(t) for t in m]
[' some words', 'with', 'and', 'without quotes ']

If you want to strip away the leading/trailing whitespace from the
quoted strings, then change the last return statement to
be "return t[1].strip()".

Paul

Robert Bossy · Apr 29, 2008

Hi,

I think re is not the best tool for you. Maybe there's a regular
expression that does what you want but it will be quite complex and hard
to maintain.
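
To illustrate why a single match() call cannot hand back one group per
term: the number of groups is fixed by the pattern, and a group that is
repeated keeps only its last capture. A minimal sketch, using a pattern
made up purely for this illustration:

```python
import re

query = ' " some words" with and "without quotes " '

# Illustrative pattern (an assumption, not from the original post):
# one capturing group, repeated once per term.
p = re.compile(r'\s*(?:("[^"]*"|\S+)\s*)+')
m = p.match(query)

# The pattern contains exactly one group, so groups() has one element,
# and it holds only the last term that the repetition captured.
print(p.groups)    # number of groups defined by the pattern
print(m.groups())  # only the final capture survives
```

This is why the other replies reach for findall() or a small parser
rather than match().groups().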

I suggest you split the query with the double quotes and process
alternate inside/outside chunks. Something like:

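import re

def spulit(s):
    inq = False  # True while we are inside a quoted chunk
    for term in s.split('"'):
        if inq:
            # quoted chunk: one term, with whitespace normalised
            yield re.sub(r'\s+', ' ', term.strip())
        else:
            # unquoted chunk: every word is its own term
            for word in term.split():
                yield word
        inq = not inq

for token in spulit(' " some words" with and "without quotes " '):
    print(token)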

Cheers,
RB

Hrvoje Niksic · Apr 29, 2008

I don't think you can achieve this with a single regular expression.
Your best bet is to use p.findall() to find all plausible matches, and
then rework them a bit. For example:

>>> p = re.compile(r'"[^"]*"|[\S]+')
>>> p.findall(query)
['" some words"', 'with', 'and', '"without quotes "']

At that point, you can easily iterate through the list and remove the
quotes and excess whitespace.
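
That clean-up step might look something like the following. This is only
a sketch: the helper name clean_terms is hypothetical, and dropping
empty quoted strings is an assumption about what a search query wants.

```python
import re

def clean_terms(query):
    """Hypothetical helper: find candidate terms, then tidy each one."""
    raw = re.findall(r'"[^"]*"|\S+', query)
    terms = []
    for item in raw:
        # Strip the surrounding quotes from quoted matches only.
        if len(item) >= 2 and item.startswith('"') and item.endswith('"'):
            item = item[1:-1]
        # Trim the ends and collapse interior runs of whitespace.
        item = ' '.join(item.split())
        if item:  # assumption: empty quoted strings are not useful terms
            terms.append(item)
    return terms

print(clean_terms(' " some words" with and "without quotes " '))
```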

cokofreedom · Apr 29, 2008

harvey.thomas · Apr 29, 2008

You can't do it simply and completely with regular expressions alone
because of the requirement to strip the quotes and normalize
whitespace, but it's not too hard to write a function to do it. Viz:

import re

wordre = re.compile('"[^"]+"|[a-zA-Z]+').findall

def findwords(src):
    ret = []
    for x in wordre(src):
        if x[0] == '"':
            # strip off the quotes and normalise spaces
            ret.append(' '.join(x[1:-1].split()))
        else:
            ret.append(x)
    return ret

query = ' " Some words" with and "without quotes " '
print(findwords(query))

Running this gives
['Some words', 'with', 'and', 'without quotes']

HTH

Harvey

Matimus · Apr 29, 2008

I don't know if it is possible to do it all with one regex, but it
doesn't seem practical. I would check out the shlex module.

>>> import shlex
>>> query = ' " some words" with and "without quotes " '
>>> shlex.split(query)
[' some words', 'with', 'and', 'without quotes ']

To get rid of the leading and trailing space you can then use strip:

>>> [s.strip() for s in shlex.split(query)]
['some words', 'with', 'and', 'without quotes']

The only problem is getting rid of the extra white-space in the middle
of the expression, for which re might still be a good solution.

>>> import re
>>> [re.sub(r"\s+", ' ', s.strip()) for s in shlex.split(query)]
['some words', 'with', 'and', 'without quotes']
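
One caveat worth keeping in mind: shlex.split() raises ValueError when a
quote is left unclosed, and it treats apostrophes as quote characters
too, which can easily happen in user-typed search queries. A rough
sketch of a wrapper follows; the name split_query and the fallback
behaviour are assumptions chosen for illustration, not part of shlex.

```python
import shlex

def split_query(query):
    """Hypothetical wrapper around shlex.split for user-typed queries."""
    try:
        parts = shlex.split(query)
    except ValueError:
        # Unbalanced quote or stray apostrophe. Assumed fallback:
        # drop the double quotes and split on whitespace.
        parts = query.replace('"', ' ').split()
    # Trim each term, collapse interior whitespace, skip empty terms.
    return [' '.join(p.split()) for p in parts if p.strip()]

print(split_query(' " some words" with and "without quotes " '))
print(split_query('foo "bar baz'))
```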

Matt

Paul McGuire · Apr 29, 2008

Oh! It wasn't until Matimus's post that I saw that you wanted the
interior whitespace within the quoted strings collapsed also. Just
add another parse action to the chain of functions on dblQuotedString:

# when a quoted string is found, remove the quotes,
# then strip whitespace from the contents, then
# collapse interior whitespace
dblQuotedString.setParseAction(removeQuotes,
                               lambda s: s[0].strip(),
                               lambda s: " ".join(s[0].split()))

Plugging this into the previous script now gives:
('some words', 'with', 'and', 'without quotes')

-- Paul

George Sakkis · Apr 29, 2008

As other replies mention, there is no single expression since you are
doing two things: find all matches and substitute extra spaces within
the quoted matches. It can be done with two expressions though:

import re

def normquery(text, findterms=re.compile(r'"([^"]+)"|(\S+)').findall,
              normspace=re.compile(r'\s{2,}').sub):
    return [normspace(' ', (t[0] or t[1]).strip())
            for t in findterms(text)]

>>> normquery(' "some words" with and "without quotes " ')
['some words', 'with', 'and', 'without quotes']

HTH,
George

Gerard Flanagan · Apr 30, 2008

With simpleparse:

----------------------------------------------------------

from simpleparse.parser import Parser
from simpleparse.common import strings
from simpleparse.dispatchprocessor import DispatchProcessor, getString

grammar = '''
text := (quoted / unquoted / ws)+
quoted := string
unquoted := -ws+
ws := [ \t\r\n]+
'''

class MyProcessor(DispatchProcessor):

    def __init__(self, groups):
        self.groups = groups

    def quoted(self, val, buffer):
        # drop the quotes and normalise the whitespace inside them
        self.groups.append(' '.join(getString(val, buffer)[1:-1].split()))

    def unquoted(self, val, buffer):
        self.groups.append(getString(val, buffer))

    def ws(self, val, buffer):
        pass

query = ' " some words" with and "without quotes " '
groups = []
parser = Parser(grammar, 'text')
proc = MyProcessor(groups)
parser.parse(query, processor=proc)

print(groups)
