regexp for sequence of quoted strings

gry

I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}

[FYI: this is postgresql database "array" field output format]

which I'm trying to parse with the re module.
A single quoted string would, I think, be:
r"\{'([^']|\\')*'\}"

but how do I represent a *sequence* of these separated
by commas? I guess I can artificially tack a comma on the
end of the input string and do:

r"\{('([^']|\\')*',)\}"

but that seems like an ugly hack...

I want to end up with a python array of strings like:

['the', "dog's", 'bite']

Any simple clear way of parsing this in python would be
great; I just assume that "re" is the appropriate technique.
Performance is not an issue.

-- George
 
Steven Bethard

I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}
[snip]

I want to end up with a python array of strings like:

['the', "dog's", 'bite']

Any simple clear way of parsing this in python would be
great; I just assume that "re" is the appropriate technique.
Performance is not an issue.


py> s = "{'the','dog\'s','bite'}"
py> s
"{'the','dog's','bite'}"
py> s[1:-1]
"'the','dog's','bite'"
py> s[1:-1].split(',')
["'the'", "'dog's'", "'bite'"]
py> [item[1:-1] for item in s[1:-1].split(',')]
['the', "dog's", 'bite']

py> s = "{'the'}"
py> [item[1:-1] for item in s[1:-1].split(',')]
['the']

py> s = "{}"
py> [item[1:-1] for item in s[1:-1].split(',')]
['']

Not sure what you want in the last case, but if you want an empty list,
you can probably add a simple if-statement to check if s[1:-1] is non-empty.
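
A minimal sketch of that check, wrapped in a function (the name split_array is only for illustration). It keeps the same naive split-on-comma approach as the session above, so it assumes no item contains a comma and that the backslash has already been consumed as in the example:

```python
def split_array(s):
    """Split "{'a','b'}" into ['a', 'b'] by stripping braces and quotes."""
    inner = s[1:-1]              # drop the surrounding braces
    if not inner:                # "{}" gives an empty list rather than ['']
        return []
    # Naive: breaks if an item itself contains a comma.
    return [item[1:-1] for item in inner.split(',')]
```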

HTH,

STeVe
 
Alexander Schmolck

I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}

[FYI: this is postgresql database "array" field output format]

which I'm trying to parse with the re module.
A single quoted string would, I think, be:
r"\{'([^']|\\')*'\}"

what about {'dog \\', ...} ?

If you don't need to validate anything you can just forget about the commas
etc and extract all the 'strings' with findall,

The regexp below is a bit too complicated (adapted from something else) but I
think will work:

In [90]:rex = re.compile(r"'(?:[^\n]|(?<!\\)(?:\\)(?:\\\\)*\n)*?(?<!\\)(?:\\\\)*?'")

In [91]:rex.findall(r"{'the','dog\'s','bite'}")
Out[91]:["'the'", "'dog\\'s'", "'bite'"]

Otherwise just add something like ",|}$" to deal with the final } instead of a
comma.

Alternatively, you could also write a regexp to split on the "','" bit and trim
the first and the last split.
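
For comparison, a shorter findall-based sketch of the same idea. This is not the regexp above but a simpler pattern, and the names QUOTED and extract_strings are made up for the example. It treats a backslash plus any following character as an escaped pair, so an item ending in an escaped backslash such as {'dog \\'} is still closed correctly. The unescaping step only drops the backslash; it does not interpret sequences like \n.

```python
import re

# One quoted item: an opening quote, then any mix of ordinary characters
# (neither quote nor backslash) and backslash-escaped pairs, then the
# closing quote. The group captures the body without the quotes.
QUOTED = re.compile(r"'((?:[^'\\]|\\.)*)'")

def extract_strings(s):
    """Return the unescaped bodies of all single-quoted items in s."""
    # Drop the backslash from each escaped pair: \' -> ' and \\ -> \
    return [re.sub(r"\\(.)", r"\1", body) for body in QUOTED.findall(s)]
```

Called on r"{'the','dog\'s','bite'}" this should give ['the', "dog's", 'bite'], and on "{}" an empty list. No validation of the braces or commas is attempted.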

'as
 
Paul McGuire

Pyparsing includes some built-in quoted string support that might
simplify this problem. Of course, if you prefer regexp's, I'm by no
means offended!

Check out my Python console session below. (You may need to expand the
unquote method to do more handling of backslash escapes.)

-- Paul
(Download pyparsing at http://pyparsing.sourceforge.net.)
...     t2 = t[0][1:-1]
...     return t2.replace("\\'","'")
...
['the', "dog's", 'bite']
 
Steven Bethard

Paul said:
... t2 = t[0][1:-1]
... return t2.replace("\\'","'")
...

Note also, that the codec 'string-escape' can be used to do what's done
with str.replace in this example:

py> s
"'the','dog\\'s','bite'"
py> s.replace("\\'", "'")
"'the','dog's','bite'"
py> s.decode('string-escape')
"'the','dog's','bite'"

Using str.decode() is a little more general as it will also decode other
escaped characters. This may be good or bad depending on your needs.
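
The session above relies on Python 2, where str has a decode() method and the 'string-escape' codec exists. A rough Python 3 counterpart, assuming the 'unicode_escape' codec is an acceptable substitute (the helper name unescape is hypothetical), might look like this:

```python
def unescape(s):
    """Decode backslash escapes in a str (Python 3 sketch)."""
    # Assumption: 'unicode_escape' is close enough to the old
    # 'string-escape' codec for this purpose. Encoding with
    # backslashreplace first turns non-ASCII characters into escapes,
    # so they survive the round trip through bytes.
    return s.encode("ascii", "backslashreplace").decode("unicode_escape")
```

Like str.decode('string-escape'), this decodes every escape it recognises (\n, \t, \x41 and so on), not just \'.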

STeVe
 
Paul McGuire

Ah, this is much better than my crude replace technique. I forgot
about str.decode().

Thanks!
-- Paul
 
gry

PyParsing rocks! Here's what I ended up with:

def unpack_sql_array(s):
    import pyparsing as pp
    withquotes = pp.dblQuotedString.setParseAction(pp.removeQuotes)
    withoutquotes = pp.CharsNotIn('",')
    parser = pp.StringStart() + \
             pp.Word('{').suppress() + \
             pp.delimitedList(withquotes ^ withoutquotes) + \
             pp.Word('}').suppress() + \
             pp.StringEnd()
    return parser.parseString(s).asList()

unpack_sql_array('{the,dog\'s,"foo,"}')
['the', "dog's", 'foo,']

[[Yes, this input is not what I stated originally. Someday, when I
reach a higher plane of existence, I will post a *complete* and
*correct* query to usenet...]]

Does the above seem fragile or questionable in any way?
Thanks all for your comments!

-- George
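
For the revised input format (bare items, with double quotes around items that contain a comma), a standard-library alternative is to treat the text between the braces as a single CSV row. This is only a sketch under assumptions: the function name unpack_sql_array_csv is hypothetical, a backslash is taken to escape the next character, and nested arrays and NULL entries are not handled.

```python
import csv

def unpack_sql_array_csv(s):
    """Stdlib-only sketch: parse '{a,b,"c,d"}' into a list of strings."""
    if not (s.startswith('{') and s.endswith('}')):
        raise ValueError("expected a string wrapped in braces: %r" % (s,))
    # Treat the body as one CSV row: '"' quotes a field and a backslash
    # escapes the character that follows it.
    rows = csv.reader([s[1:-1]], quotechar='"', escapechar='\\',
                      doublequote=False)
    return next(rows)
```

Called on '{the,dog\'s,"foo,"}' this should return the same list as the pyparsing version, and an empty list for '{}'.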
 
Paul McGuire

George -

Thanks for your enthusiastic endorsement!

Here are some quibbles about your pyparsing grammar (but really, not
bad for a first timer):
1. The Word class is used to define "words" or collective groups of
characters, by specifying what sets of characters are valid as leading
and/or body chars, as in:
integer = Word(digitsFrom0to9)
firstName = Word(upcaseAlphas, lowcaseAlphas)
In your parser, I think you want the Literal class instead, to match
the literal string '{'.

2. I don't think there is any chance to confuse a withQuotes with a
withoutQuotes, so you can try using the "match first" operator '|',
rather than the greedy matching "match longest" operator '^'.

3. Lastly, don't be too quick to use asList() to convert parse results
into lists - parse results already have most of the list accessors
people would need to access the returned matched tokens. asList() just
cleans up the output a bit.

Good luck, and thanks for trying pyparsing!
-- Paul
 
Magnus Lycka

I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{} ....
I want to end up with a python array of strings like:

['the', "dog's", 'bite']

Assuming that you trust the input, you could always use eval,
but since it seems fairly easy to solve anyway, that might
not be the best (at least not safest) solution.
>>> strings = [r'''{'the','dog\'s','bite'}''', '''{'the'}''', '''{}''']
>>> for s in strings:
...     print eval('['+s[1:-1]+']')
...
['the', "dog's", 'bite']
['the']
[]
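
If the safety of eval is the concern, ast.literal_eval from the standard library accepts only Python literals (strings, numbers, lists and so on) and raises an error for anything else. A sketch of the same brace-to-bracket trick using it (the function name is made up for the example):

```python
import ast

def parse_with_literal_eval(s):
    """Parse "{'a','b'}" by rewriting it as a Python list literal."""
    # Swap the braces for brackets so the text reads as a list literal.
    # literal_eval evaluates literals only, never function calls or names.
    return ast.literal_eval('[' + s[1:-1] + ']')
```

This still depends on the items following Python's own quoting and escaping rules, which happens to hold for the examples in this thread.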
 
