Issue with regular expressions

Julien

Hi,

I'm fairly new in Python and I haven't used the regular expressions
enough to be able to achieve what I want.
I'd like to select terms in a string, so I can then do a search in my
database.

query = '   "  some words"  with and "without    quotes   "  '
p = re.compile(magic_regular_expression)  # <--- the magic happens
m = p.match(query)

I'd like m.groups() to return:
('some words', 'with', 'and', 'without quotes')

Is that achievable with a single regular expression, and if so, what
would it be?

Any help would be much appreciated.

Thanks!!

Julien
 
Paul McGuire

Julien said:
I'd like to select terms in a string, so I can then do a search in my
database.

query = '   "  some words"  with and "without    quotes   "  '
p = re.compile(magic_regular_expression)  # <--- the magic happens
m = p.match(query)

I'd like m.groups() to return:
('some words', 'with', 'and', 'without quotes')

Is that achievable with a single regular expression, and if so, what
would it be?

Julien -

I dabbled with re's for a few minutes trying to get your solution,
then punted and used pyparsing instead. Pyparsing will run slower
than re, but many people find it much easier to work with readable
class names and instances rather than re's typoglyphics:

from pyparsing import (OneOrMore, Word, printables, dblQuotedString,
                       removeQuotes)

# when a quoted string is found, remove the quotes,
# then strip whitespace from the contents
dblQuotedString.setParseAction(removeQuotes,
                               lambda s: s[0].strip())

# define terms to be found in query string
term = dblQuotedString | Word(printables)
query_terms = OneOrMore(term)

# parse query string to extract terms
query = ' " some words" with and "without quotes " '
print(tuple(query_terms.parseString(query)))

Gives:
('some words', 'with', 'and', 'without quotes')

The pyparsing wiki is at http://pyparsing.wikispaces.com. You'll find
an examples page that includes a search query parser, and pointers to
a number of online documentation and presentation sources.

-- Paul
 
Paul Melis

Here's one way with a single regexp plus an extra filter function.
>>> import re
>>> q = ' " some words" with and "without quotes " '
>>> p = re.compile('("([^"]+)")|([^ \t]+)')
>>> m = p.findall(q)
>>> m
[('" some words"', ' some words', ''), ('', '', 'with'), ('', '', 'and'), ('"without quotes "', 'without quotes ', '')]
>>> def pick(t):  # function header lost from the post; the name is assumed
...     if t[0] == '':
...         return t[2]
...     else:
...         return t[1]
...
>>> [pick(t) for t in m]
[' some words', 'with', 'and', 'without quotes ']

If you want to strip away the leading/trailing whitespace from the
quoted strings, then change the last return statement to
be "return t[1].strip()".
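
For reference, here is a minimal sketch of the whole approach with the strip() change applied, written as a script rather than an interactive session. The names pick and terms are illustrative only, and the query is the one with collapsed spacing used in this reply.

```python
import re

# Same pattern as in the session above: group 2 holds the text inside
# double quotes, group 3 holds a bare (unquoted) word.
p = re.compile('("([^"]+)")|([^ \t]+)')

def pick(t):
    # t is one (quoted_with_quotes, quoted_inner, bare_word) tuple from findall()
    if t[0] == '':
        return t[2]
    else:
        return t[1].strip()

q = ' " some words" with and "without quotes " '
terms = [pick(t) for t in p.findall(q)]
print(terms)
```

Note that strip() only removes leading and trailing whitespace; runs of spaces inside a quoted phrase would still be left as they are.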

Paul
 
Robert Bossy

Hi,

I think re is not the best tool for you. Maybe there's a regular
expression that does what you want but it will be quite complex and hard
to maintain.

I suggest you split the query with the double quotes and process
alternate inside/outside chunks. Something like:

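import re

def spulit(s):
    # chunks alternate between outside quotes (inq False) and inside quotes (inq True)
    inq = False
    for term in s.split('"'):
        if inq:
            # quoted chunk: trim it and collapse interior whitespace
            yield re.sub(r'\s+', ' ', term.strip())
        else:
            # unquoted chunk: every word is its own term
            for word in term.split():
                yield word
        inq = not inq

for token in spulit(' " some words" with and "without quotes " '):
    print(token)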


Cheers,
RB
 
Hrvoje Niksic

I don't think you can achieve this with a single regular expression.
Your best bet is to use p.findall() to find all plausible matches, and
then rework them a bit. For example:

p = re.compile(r'"[^"]*"|[\S]+')
p.findall(query)
['" some words"', 'with', 'and', '"without quotes "']

At that point, you can easily iterate through the list and remove the
quotes and excess whitespace.
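
A minimal sketch of that clean-up step, assuming the same pattern as above. The helper name clean and the variable terms are made up for illustration, and the query uses the extra spacing from the original question.

```python
import re

p = re.compile(r'"[^"]*"|[\S]+')
query = '   "  some words"  with and "without    quotes   "  '

def clean(token):
    # Drop the surrounding double quotes, if present.
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        token = token[1:-1]
    # split() with no argument splits on any run of whitespace, so joining
    # with a single space both trims the ends and collapses the middle.
    return ' '.join(token.split())

terms = [clean(tok) for tok in p.findall(query)]
print(terms)
```

An empty quoted string ("") would come through as an empty term here, so a real search form may want to filter those out.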
 
cokofreedom

| # ---- Double Quote Text ----
| " # match a double quote
| ( # - Two Possibilities:
| \\. # match a backslash followed by any character (including newline)
| | # OR
| [^"] # match any character except a double quote
| )* # - from zero to many
| " # finally match a double quote
|
| | # ======== OR ========
|
| # ---- Single Quote Text ----
| ' # match a single quote
| ( # - Two Possibilities:
| \\. # match a backslash followed by any character (including newline)
| | # OR
| [^'] # match any character except a single quote
| )* # - from zero to many
| ' # finally match a single quote
| """, DOTALL|VERBOSE)

I've used this before (minus those | at the beginning) to find
double-quoted and single-quoted strings in a file (there is more to it
that looks for C++ and C style quotes, but that isn't needed here).
Perhaps you can take it a step further so that these matches are left
unchanged?

r""""(\\.|[^"])*"|'(\\.|[^'])*'""", DOTALL)

is the same thing on a single line :)
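
To show what the backslash branch buys you, here is a small sketch that runs the single-line pattern over a string containing an escaped quote. The names quoted, sample and found are illustrative, and finditer() is used instead of findall() because the pattern contains groups, which would make findall() return the group contents rather than the whole match.

```python
import re

# The single-line pattern from above, compiled with DOTALL so that "."
# after a backslash can also match a newline.
quoted = re.compile(r""""(\\.|[^"])*"|'(\\.|[^'])*'""", re.DOTALL)

# A sample containing a backslash-escaped double quote inside a quoted string.
sample = r'''say "a \"nested\" quote" and 'single' here'''

# group(0) is the whole match, including the surrounding quote characters.
found = [m.group(0) for m in quoted.finditer(sample)]
print(found)
```

Unquoted words such as say and here are not matched at all, so for the original question this pattern would still need a third alternative for bare words.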
 
harvey.thomas

You can't do it simply and completely with regular expressions alone
because of the requirement to strip the quotes and normalize
whitespace, but it's not too hard to write a function to do it. Viz:

import re

# a term is either a double-quoted phrase or a run of letters
wordre = re.compile('"[^"]+"|[a-zA-Z]+').findall

def findwords(src):
    ret = []
    for x in wordre(src):
        if x[0] == '"':
            # strip off the quotes and normalise spaces
            ret.append(' '.join(x[1:-1].split()))
        else:
            ret.append(x)
    return ret

query = ' " Some words" with and "without quotes " '
print(findwords(query))

Running this gives
['Some words', 'with', 'and', 'without quotes']

HTH

Harvey
 
Matimus

I don't know if it is possible to do it all with one regex, but it
doesn't seem practical. I would check out the shlex module.
>>> import shlex
>>> query = ' " some words" with and "without quotes " '
>>> shlex.split(query)
[' some words', 'with', 'and', 'without quotes ']

To get rid of the leading and trailing space you can then use strip:
>>> [s.strip() for s in shlex.split(query)]
['some words', 'with', 'and', 'without quotes']

The only problem is getting rid of the extra white-space in the middle
of the expression, for which re might still be a good solution.
>>> import re
>>> [re.sub(r"\s+", ' ', s.strip()) for s in shlex.split(query)]
['some words', 'with', 'and', 'without quotes']
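
One thing to watch for with user-typed queries: shlex.split() raises ValueError when a quote is never closed, and in its default mode an apostrophe (as in don't) also opens a quote. A rough sketch, where split_query and the other names are made up for illustration:

```python
import re
import shlex

def split_query(query):
    # shlex.split handles the quoting; the regex collapses runs of whitespace.
    return [re.sub(r"\s+", " ", s.strip()) for s in shlex.split(query)]

terms = split_query('   "  some words"  with and "without    quotes   "  ')
print(terms)

# A quote that is never closed makes shlex.split raise ValueError,
# so a search form would probably want to catch it.
try:
    split_query('"unterminated phrase')
    unbalanced_ok = True
except ValueError:
    unbalanced_ok = False
print(unbalanced_ok)
```

How to recover from the error (drop the stray quote, treat the rest as plain words, or report it to the user) is a design choice left open here.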

Matt
 
Paul McGuire

Oh! It wasn't until Matimus's post that I saw that you wanted the
interior whitespace within the quoted strings collapsed also. Just
add another parse action to the chain of functions on dblQuotedString:

# when a quoted string is found, remove the quotes,
# then strip whitespace from the contents, then
# collapse interior whitespace
dblQuotedString.setParseAction(removeQuotes,
                               lambda s: s[0].strip(),
                               lambda s: " ".join(s[0].split()))

Plugging this into the previous script now gives:
('some words', 'with', 'and', 'without quotes')

-- Paul
 
George Sakkis

As other replies mention, there is no single expression since you are
doing two things: find all matches and substitute extra spaces within
the quoted matches. It can be done with two expressions though:

import re

def normquery(text, findterms=re.compile(r'"([^"]+)"|(\S+)').findall,
              normspace=re.compile(r'\s{2,}').sub):
    # t[0] is the quoted text (without quotes), t[1] is a bare word
    return [normspace(' ', (t[0] or t[1]).strip()) for t in findterms(text)]

>>> normquery(' "some words" with and "without quotes " ')
['some words', 'with', 'and', 'without quotes']
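
There is also a more basic obstacle to the p.match(query) plus m.groups() form asked for in the question: groups() returns one entry per capturing group in the pattern, not one per repetition, so a pattern with a fixed number of groups cannot hand back a variable number of terms. A small sketch of that effect, using a made-up pattern p that consumes the whole query in a single match:

```python
import re

query = ' "some words" with and "without quotes" '

# One pattern that consumes every term in a single match. It contains exactly
# two capturing groups: one for quoted text and one for bare words.
p = re.compile(r'\s*(?:(?:"([^"]+)"|(\S+))\s*)+$')
m = p.match(query)

# The match succeeds, but groups() has only two slots no matter how many
# terms were consumed; each slot keeps just the last text it captured.
print(len(m.groups()))
print(m.group(1))
```

This is why the replies in this thread use findall() and post-process the results instead.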


HTH,
George
 
Gerard Flanagan

With simpleparse:

----------------------------------------------------------

from simpleparse.parser import Parser
from simpleparse.common import strings
from simpleparse.dispatchprocessor import DispatchProcessor, getString


grammar = '''
text := (quoted / unquoted / ws)+
quoted := string
unquoted := -ws+
ws := [ \t\r\n]+
'''

class MyProcessor(DispatchProcessor):

    def __init__(self, groups):
        self.groups = groups

    def quoted(self, val, buffer):
        # strip the quotes and normalise interior whitespace
        self.groups.append(' '.join(getString(val, buffer)[1:-1].split()))

    def unquoted(self, val, buffer):
        self.groups.append(getString(val, buffer))

    def ws(self, val, buffer):
        pass

groups = []
parser = Parser(grammar, 'text')
proc = MyProcessor(groups)
# TESTS is not defined in this post; TESTS[1][1][0] is presumably the query string
parser.parse(TESTS[1][1][0], processor=proc)

print(groups)
 
