Escaping commas within parens in CSV parsing?

felciano

Hi --

I am trying to use the csv module to parse a column of values
containing comma-delimited values with unusual escaping:

AAA, BBB, CCC (some text, right here), DDD

I want this to come back as:

["AAA", "BBB", "CCC (some text, right here)", "DDD"]

I think this is probably non-standard escaping, as I can't figure out
how to structure a csv dialect to handle it correctly. I can probably
hack this with regular expressions but I thought I'd check to see if
anyone had any quick suggestions for how to do this elegantly first.

Thanks!

Ramon
 
Skip Montanaro

Ramon> I am trying to use the csv module to parse a column of values
Ramon> containing comma-delimited values with unusual escaping:

Ramon> AAA, BBB, CCC (some text, right here), DDD

Ramon> I want this to come back as:

Ramon> ["AAA", "BBB", "CCC (some text, right here)", "DDD"]

Alas, there's no "escaping" at all in the line above. I see no obvious way
to distinguish one comma from another in this example. If you mean the fact
that the comma you want to retain is in parens, that's not escaping. Escape
characters don't appear in the output as they do in your example.
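To make the distinction concrete, here is a small sketch of what the csv module does with the sample line as it stands, and with two variants that carry real quoting or escaping. The quoted and backslash-escaped lines are made up for illustration; they are not in the original data.

```python
import csv

line = 'AAA, BBB, CCC (some text, right here), DDD'

# No quoting or escaping: every comma is a delimiter, so five fields come back.
naive = next(csv.reader([line], skipinitialspace=True))

# Hypothetical input with standard quoting: the quotes are consumed by the parser.
quoted_line = 'AAA, BBB, "CCC (some text, right here)", DDD'
quoted = next(csv.reader([quoted_line], skipinitialspace=True))

# Hypothetical input with an escape character: the backslash is consumed too.
escaped_line = 'AAA, BBB, CCC (some text\\, right here), DDD'
escaped = next(csv.reader([escaped_line], skipinitialspace=True, escapechar='\\'))

print(naive)
print(quoted)
print(escaped)
```

In both variants the marker that protects the comma disappears from the output, which is exactly what the parentheses in the sample line do not do.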

Ramon> I can probably hack this with regular expressions but I thought
Ramon> I'd check to see if anyone had any quick suggestions for how to
Ramon> do this elegantly first.

I see nothing obvious unless you truly mean that the beginning of each field
is all caps. In that case you could wrap a file object so that each field
comes out quoted:

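import csv
import re

class FunnyWrapper:
    """untested"""
    def __init__(self, f):
        self.f = f

    def __iter__(self):
        return self

    def __next__(self):
        # Strip the line ending so the closing quote lands right after the last field.
        line = next(self.f).rstrip("\r\n")
        return '"' + re.sub(r',( *[A-Z]+)', r'","\1', line) + '"'

and use it like so:

# Python 3 wants text mode with newline=""; the Python 2 call used "rb".
with open("somefile.csv", newline="") as f:
    for row in csv.reader(FunnyWrapper(f)):
        print(row)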

(I'm not sure what the ramifications are of iterating over a file opened in
binary mode.)
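Since csv.reader accepts any iterable of strings, the same idea can be tried without a file or a wrapper class. The sketch below is only an illustration: the generator name quote_fields and the in-memory sample are assumptions, and it still relies on every field starting with capital letters.

```python
import csv
import io
import re

def quote_fields(lines):
    """Yield each line with every field wrapped in double quotes.

    Assumes a new field always starts with capital letters, as in the sample.
    """
    for line in lines:
        body = line.rstrip("\r\n")
        yield '"' + re.sub(r',( *[A-Z]+)', r'","\1', body) + '"'

# In-memory stand-in for the real file.
sample = io.StringIO("AAA, BBB, CCC (some text, right here), DDD\n")

# The fields keep their leading space, so strip them afterwards.
rows = [[field.strip() for field in row]
        for row in csv.reader(quote_fields(sample))]
print(rows)
```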

Skip
 
Devan L

Oops, the above code doesn't quite work. Use this one instead (the fields
keep their leading spaces, so strip them afterwards):
re.findall(r'(.+?(?: \(.+?\))?)(?:,|$)', yourtexthere)
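Another regex route, sketched here as an alternative rather than as anyone's posted solution, is to split on commas that are not inside parentheses by using a negative lookahead. It assumes parentheses are balanced and never nested; the function name is made up for the example.

```python
import re

# A comma is treated as a separator only if the text after it does not reach
# a ")" before reaching a "(". That is true exactly when the comma sits
# outside a (non-nested) pair of parentheses.
SEPARATOR = re.compile(r',\s*(?![^()]*\))')

def split_on_outer_commas(text):
    return [field.strip() for field in SEPARATOR.split(text)]

print(split_on_outer_commas("AAA, BBB, CCC (some text, right here), DDD"))
```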
 
Paul McGuire

Well, this doesn't have the terseness of an re solution, but it
shouldn't be hard to follow.
-- Paul

#~ This is a very crude first pass. It does not handle nested
#~ ()'s, nor ()'s inside quotes. But if your data does not
#~ stray too far from the example, this will probably do the job.

#~ Download pyparsing at http://pyparsing.sourceforge.net.
import pyparsing as pp

test = "AAA, BBB , CCC (some text, right here), DDD"

COMMA = pp.Literal(",")
LPAREN = pp.Literal("(")
RPAREN = pp.Literal(")")
parenthesizedText = LPAREN + pp.SkipTo(RPAREN) + RPAREN

# Printable ASCII (including space) except the comma and the parens.
nonCommaChars = "".join(chr(c) for c in range(32, 127)
                        if chr(c) not in ",()")
nonCommaText = pp.Word(nonCommaChars)

commaListEntry = pp.Combine(pp.OneOrMore(parenthesizedText | nonCommaText),
                            adjacent=False)
commaListEntry.setParseAction(lambda s, l, t: t[0].strip())

csvList = pp.delimitedList(commaListEntry)
print(csvList.parseString(test))
 
Edvard Majakari

Ramon> I am trying to use the csv module to parse a column of values
Ramon> containing comma-delimited values with unusual escaping:

Ramon> AAA, BBB, CCC (some text, right here), DDD

Ramon> I want this to come back as:

Ramon> ["AAA", "BBB", "CCC (some text, right here)", "DDD"]

Quick and somewhat dirty: change your delimiter to a char that never exists in
fields (eg. null character '\0').

Example:
>>> s = 'AAA\0 BBB\0 CCC (some text, right here)\0 DDD'
>>> [f.strip() for f in s.split('\0')]
['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']

But then you'd need to be certain there's no null character in the input
lines by checking it:

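colsep = '\0'

class IllegalCharException(ValueError):
    """Raised when a field already contains the column separator."""

for field in inputs:  # inputs: the raw field values, before joining
    if colsep in field:
        raise IllegalCharException('invalid chars in field %r' % field)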

If you need to stick with comma as a separator and the format is relatively
fixed, I'd probably use some parser module instead. Regular expressions are
nice too, but it is easy to make a mistake with those, and for non-trivial
stuff they tend to become write-only.
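As a middle ground between a regex and a full parser module, a hand-written scanner that tracks parenthesis depth is short and easy to read. This is a minimal sketch under the assumption that only parentheses protect commas; the function name is invented for the example.

```python
def split_outside_parens(text, sep=","):
    """Split text on sep, ignoring separators inside parentheses."""
    fields, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch == sep and depth == 0:
            # A separator at depth 0 ends the current field.
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    fields.append("".join(current).strip())
    return fields

print(split_outside_parens("AAA, BBB, CCC (some text, right here), DDD"))
```

Because it counts depth, this version also copes with nested parentheses, which the regex attempts in this thread do not.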

--
# Edvard Majakari Software Engineer
# PGP PUBLIC KEY available Soli Deo Gloria!

$_ = '456476617264204d616a616b6172692c20612043687269737469616e20'; print
join('',map{chr hex}(split/(\w{2})/)),uc substr(crypt(60281449,'es'),2,4),"\n";
 
felciano

Thanks for all the postings. I can't change delimiter in the source
itself, so I'm doing it temporarily just to handle the escaping:

import re

def splitWithEscapedCommasInParens(s, trim=False):
    # Temporarily turn commas inside parens into "|" (assumes no "|" in the data).
    pat = re.compile(r"(.+?\([^\(\),]*?),(.+?\).*)")
    while pat.search(s):
        s = pat.sub(r"\1|\2", s)
    fields = [x.replace("|", ",") for x in s.split(",")]
    return [x.strip() for x in fields] if trim else fields

Probably not the most efficient, but it's "the simplest thing that
works" for me :)
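The same temporary-delimiter trick can be done in a single pass by letting re.sub call a function on each parenthesized group. The sketch below is an illustrative variant, not the code posted above: it borrows the null character suggested earlier in the thread as the placeholder, and it assumes the data contains no null characters and no nested parentheses.

```python
import re

PLACEHOLDER = "\0"  # assumed never to occur in the data

def split_protecting_parens(s):
    """Split on commas, keeping commas inside (non-nested) parentheses."""
    # Hide the commas inside each "(...)" group behind the placeholder.
    protected = re.sub(r"\([^()]*\)",
                       lambda m: m.group(0).replace(",", PLACEHOLDER),
                       s)
    # Split on the remaining commas, then put the hidden ones back.
    return [field.strip().replace(PLACEHOLDER, ",")
            for field in protected.split(",")]

print(split_protecting_parens("AAA, BBB, CCC (some text, right here), DDD"))
```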

Thanks again for all the quick responses.

Ramon
 
William Park

How about changing '(' or ')' into three double-quotes '"""'? That will
solve the splitting issue. But I'm not sure how you would get '(' or ')'
back without much coding.

--
William Park, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
 
