split string at commas respecting quotes when string not in csv format

Discussion in 'Python' started by R. David Murray, Mar 26, 2009.

  1. OK, I've got a little problem that I'd like to ask the assembled minds
    for help with. I can write code to parse this, but I'm thinking it may
    be possible to do it with regexes. My regex-fu isn't that good, so if
    anyone is willing to help (or offer an alternate parsing suggestion)
    I would be grateful. (This has to be stdlib only, by the way, I
    can't introduce any new modules into the application so pyparsing is
    not an option.)

    The challenge is to turn a string like this:

    a=1,b="0234,)#($)@", k="7"

    into this:

    [("a", "1"), ("b", "0234,)#($)#"), ("k", "7")]

    --
    R. David Murray http://www.bitdance.com
     
    R. David Murray, Mar 26, 2009
    #1

  2. John Machin

    Re: split string at commas respecting quotes when string not in csv format

    On Mar 27, 6:51 am, "R. David Murray" <> wrote:
    > OK, I've got a little problem that I'd like to ask the assembled minds
    > for help with.  I can write code to parse this, but I'm thinking it may
    > be possible to do it with regexes.  My regex-fu isn't that good, so if
    > anyone is willing to help (or offer an alternate parsing suggestion)
    > I would be grateful.  (This has to be stdlib only, by the way, I
    > can't introduce any new modules into the application so pyparsing is
    > not an option.)
    >
    > The challenge is to turn a string like this:
    >
    >     a=1,b="0234,)#($)@", k="7"
    >
    > into this:
    >
    >     [("a", "1"), ("b", "0234,)#($)#"), ("k", "7")]


    The challenge is for you to explain unambiguously what you want.

    1. a=1 => "1" and k="7" => "7" ... is this a mistake or are the quotes
    optional in the original string when not required to protect a comma?

    2. What is the rule that explains the transmogrification of @ to # in
    your example?

    3. Is the input guaranteed to be syntactically correct?

    The following should do close enough to what you want; adjust as
    appropriate.

    >>> import re
    >>> s = """a=1,b="0234,)#($)@", k="7" """
    >>> rx = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')
    >>> rx.findall(s)
    [('a', '1'), ('b', '"0234,)#($)@"'), ('k', '"7"')]
    >>> rx.findall('a=1, *DODGY*SYNTAX* b=2')
    [('a', '1'), ('b', '2')]


    HTH,
    John
     
    John Machin, Mar 26, 2009
    #2
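
The session in post #2 leaves the double quotes attached to quoted values. A minimal sketch of one way to drop them afterwards, assuming a value counts as quoted only when it both starts and ends with a double quote. The helper name `parse_pairs` is hypothetical and does not appear in the thread.

```python
import re

# Pattern taken from post #2.
rx = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')

def parse_pairs(text):
    """Return (name, value) tuples with surrounding double quotes removed."""
    pairs = []
    for name, value in rx.findall(text):
        # Assumption: a value is "quoted" only if it starts and ends with '"'.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        pairs.append((name, value))
    return pairs

print(parse_pairs('a=1,b="0234,)#($)@", k="7"'))
# [('a', '1'), ('b', '0234,)#($)@'), ('k', '7')]
```

Stripping after the match keeps the regex itself unchanged, which may matter when the fix has to stay a small local patch.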

  3. Paul McGuire

    Re: split string at commas respecting quotes when string not in csv format

    On Mar 26, 2:51 pm, "R. David Murray" <> wrote:
    > OK, I've got a little problem that I'd like to ask the assembled minds
    > for help with.  I can write code to parse this, but I'm thinking it may
    > be possible to do it with regexes.  My regex-fu isn't that good, so if
    > anyone is willing to help (or offer an alternate parsing suggestion)
    > I would be grateful.  (This has to be stdlib only, by the way, I
    > can't introduce any new modules into the application so pyparsing is
    > not an option.)
    >
    > The challenge is to turn a string like this:
    >
    >     a=1,b="0234,)#($)@", k="7"
    >
    > into this:
    >
    >     [("a", "1"), ("b", "0234,)#($)#"), ("k", "7")]
    >
    > --
    > R. David Murray            http://www.bitdance.com


    If you must cram all your code into a single source file, then
    pyparsing would be problematic. But pyparsing's installation
    footprint is really quite small, just a single Python source file. So
    if your program spans more than one file, just add pyparsing.py into
    the local directory along with everything else.

    Then you could write this little parser and be done (note the
    differentiation between 1 and "7"):

    # The sample input from the original post.
    test = 'a=1,b="0234,)#($)@", k="7"'

    from pyparsing import Suppress, Word, alphas, alphanums, \
        nums, quotedString, removeQuotes, Group, delimitedList

    EQ = Suppress('=')                  # match '=' but drop it from the results
    varname = Word(alphas, alphanums)   # a letter, then letters or digits
    integer = Word(nums).setParseAction(lambda t: int(t[0]))  # digits -> int
    varvalue = integer | quotedString.setParseAction(removeQuotes)
    var_assignment = varname("name") + EQ + varvalue("rhs")
    expr = delimitedList(Group(var_assignment))  # comma-separated assignments

    results = expr.parseString(test)
    print(results.asList())
    for assignment in results:
        print(assignment.name, '<-', repr(assignment.rhs))

    Prints:

    [['a', 1], ['b', '0234,)#($)@'], ['k', '7']]
    a <- 1
    b <- '0234,)#($)@'
    k <- '7'

    -- Paul
     
    Paul McGuire, Mar 26, 2009
    #3
  4. Re: split string at commas respecting quotes when string not in csv format

    John Machin <> wrote:
    > On Mar 27, 6:51 am, "R. David Murray" <> wrote:
    > > OK, I've got a little problem that I'd like to ask the assembled minds
    > > for help with.  I can write code to parse this, but I'm thinking it may
    > > be possible to do it with regexes.  My regex-fu isn't that good, so if
    > > anyone is willing to help (or offer an alternate parsing suggestion)
    > > I would be grateful.  (This has to be stdlib only, by the way, I
    > > can't introduce any new modules into the application so pyparsing is
    > > not an option.)
    > >
    > > The challenge is to turn a string like this:
    > >
    > >     a=1,b="0234,)#($)@", k="7"
    > >
    > > into this:
    > >
    > >     [("a", "1"), ("b", "0234,)#($)#"), ("k", "7")]

    >
    > The challenge is for you to explain unambiguously what you want.
    >
    > 1. a=1 => "1" and k="7" => "7" ... is this a mistake or are the quotes
    > optional in the original string when not required to protect a comma?


    optional.

    > 2. What is the rule that explains the transmogrification of @ to # in
    > your example?


    Now that's a mistake :)

    > 3. Is the input guaranteed to be syntactically correct?


    If it's not, it's the customer that gets to deal with the error.

    > The following should do close enough to what you want; adjust as
    > appropriate.
    >
    > >>> import re
    > >>> s = """a=1,b="0234,)#($)@", k="7" """
    > >>> rx = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')
    > >>> rx.findall(s)

    > [('a', '1'), ('b', '"0234,)#($)@"'), ('k', '"7"')]
    > >>> rx.findall('a=1, *DODGY*SYNTAX* b=2')

    > [('a', '1'), ('b', '2')]
    > >>>


    I'm going to save this one and study it, too. I'd like to learn
    to use regexes better, even if I do try to avoid them when possible :)

    --
    R. David Murray http://www.bitdance.com
     
    R. David Murray, Mar 27, 2009
    #4
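
On point 3: `findall` silently skips text it cannot match, as the `*DODGY*SYNTAX*` example shows. If a hard failure on malformed input were preferred, a minimal sketch could walk the string with `match` instead. Nobody in the thread posted this; `parse_strict` is a hypothetical name, and the pattern is simply the one from post #2.

```python
import re

# One name=value item; same pattern as in post #2.
item = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')

def parse_strict(text):
    """Parse every item in order, raising ValueError on unparsable text."""
    pairs = []
    pos = 0
    while pos < len(text):
        m = item.match(text, pos)   # must match exactly at pos, no skipping
        if m is None:
            raise ValueError("cannot parse input at offset %d: %r"
                             % (pos, text[pos:]))
        pairs.append(m.groups())    # (name, value), quotes still attached
        pos = m.end()
    return pairs

print(parse_strict('a=1,b="0234,)#($)@", k="7"'))
```

With this version, `parse_strict('a=1, *DODGY*SYNTAX* b=2')` raises `ValueError` rather than returning two pairs.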
  5. Re: split string at commas respecting quotes when string not in csv format

    Paul McGuire <> wrote:
    > On Mar 26, 2:51 pm, "R. David Murray" <> wrote:
    > > OK, I've got a little problem that I'd like to ask the assembled minds
    > > for help with.  I can write code to parse this, but I'm thinking it may
    > > be possible to do it with regexes.  My regex-fu isn't that good, so if
    > > anyone is willing to help (or offer an alternate parsing suggestion)
    > > I would be grateful.  (This has to be stdlib only, by the way, I
    > > can't introduce any new modules into the application so pyparsing is
    > > not an option.)

    >
    > If you must cram all your code into a single source file, then
    > pyparsing would be problematic. But pyparsing's installation
    > footprint is really quite small, just a single Python source file. So
    > if your program spans more than one file, just add pyparsing.py into
    > the local directory along with everything else.


    It isn't a matter of wanting to cram the code into a single source file.
    I'm fixing a bug in a vendor-installed application. A ten line locally
    maintained patch is bad enough, installing a whole new external dependency
    is just Not An Option :)

    --
    R. David Murray http://www.bitdance.com
     
    R. David Murray, Mar 27, 2009
    #5
  6. Tim Chase

    Re: split string at commas respecting quotes when string not in csv format

    >> >>> import re
    >> >>> s = """a=1,b="0234,)#($)@", k="7" """
    >> >>> rx = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')
    >> >>> rx.findall(s)

    >> [('a', '1'), ('b', '"0234,)#($)@"'), ('k', '"7"')]
    >> >>> rx.findall('a=1, *DODGY*SYNTAX* b=2')

    >> [('a', '1'), ('b', '2')]
    >> >>>

    >
    > I'm going to save this one and study it, too. I'd like to learn
    > to use regexes better, even if I do try to avoid them when possible :)


    This regexp is fairly close to the one I used, but I employed the
    re.VERBOSE flag to split it out for readability. The above
    breaks down as

    [ ]*       # optional whitespace, traditionally "\s*"
    (\w+)      # tag the variable name as one or more "word" chars
    =          # the literal equals sign
    (          # tag the value
    [^",]+     # one or more non-[quote/comma] chars
    |          # or
    "[^"]*"    # quotes around a bunch of non-quote chars
    )          # end of the value being tagged
    [ ]*       # same as previously, optional whitespace ("\s*")
    (?:        # a non-capturing group (why?)
    ,          # a literal comma
    |          # or
    $          # the end-of-line/string
    )          # end of the non-capturing group

    Hope this helps,

    -tkc
     
    Tim Chase, Mar 27, 2009
    #6
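
For reference, the breakdown in post #6 can be compiled as written by passing `re.VERBOSE`. This is a sketch of that idea, not Tim's own pattern (which he did not post in this excerpt); the comments are reworded slightly.

```python
import re

rx = re.compile(r'''
    [ ]*         # optional spaces (a space inside [] survives VERBOSE mode)
    (\w+)        # the variable name: one or more "word" characters
    =            # a literal equals sign
    (            # start of the captured value
      [^",]+     #   one or more characters that are neither quote nor comma
      |          #   or
      "[^"]*"    #   a double-quoted run of non-quote characters
    )            # end of the captured value
    [ ]*         # optional spaces again
    (?: , | $ )  # a comma or the end of the string, not captured
''', re.VERBOSE)

print(rx.findall('a=1,b="0234,)#($)@", k="7" '))
# [('a', '1'), ('b', '"0234,)#($)@"'), ('k', '"7"')]
```

Under `re.VERBOSE`, whitespace outside a character class is ignored, which is why the space has to be written as `[ ]` (or escaped) to stay significant.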
  7. John Machin

    Re: split string at commas respecting quotes when string not in csv format

    On Mar 27, 9:19 pm, Tim Chase <> wrote:
    > >>  >>> import re
    > >>  >>> s = """a=1,b="0234,)#($)@", k="7" """
    > >>  >>> rx = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')
    > >>  >>> rx.findall(s)
    > >>  [('a', '1'), ('b', '"0234,)#($)@"'), ('k', '"7"')]
    > >>  >>> rx.findall('a=1, *DODGY*SYNTAX* b=2')
    > >>  [('a', '1'), ('b', '2')]

    >
    > > I'm going to save this one and study it, too.  I'd like to learn
    > > to use regexes better, even if I do try to avoid them when possible :)

    >
    > This regexp is fairly close to the one I used, but I employed the
    > re.VERBOSE flag to split it out for readability.  The above
    > breaks down as
    >
    >   [ ]*       # optional whitespace, traditionally "\s*"


    No, it's optional space characters -- I'd regard any other type of
    whitespace there as a stuff-up.

    >   (\w+)      # tag the variable name as one or more "word" chars
    >   =          # the literal equals sign
    >   (          # tag the value
    >   [^",]+     # one or more non-[quote/comma] chars
    >   |          # or
    >   "[^"]*"    # quotes around a bunch of non-quote chars
    >   )          # end of the value being tagged
    >   [ ]*       # same as previously, optional whitespace  ("\s*")


    same correction as previously

    >   (?:        # a non-capturing group (why?)


    a group because I couldn't be bothered thinking too hard about the
    precedence of the | operator, and non-capturing because the OP didn't
    want it captured.

    >   ,          # a literal comma
    >   |          # or
    >   $          # the end-of-line/string
    >   )          # end of the non-capturing group
    >
    > Hope this helps,


    Me too :)

    Cheers,
    John
     
    John Machin, Mar 27, 2009
    #7
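
The distinction between `[ ]*` and `\s*` drawn in post #7 shows up as soon as the input contains a tab. A small sketch, assuming a made-up sample string with a tab between a closing quote and the comma; the names `spaces_only` and `any_whitespace` are illustrative only.

```python
import re

spaces_only = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')
any_whitespace = re.compile(r'\s*(\w+)=([^",]+|"[^"]*")\s*(?:,|$)')

# A tab sits between the closing quote and the comma.
sample = 'a="1"\t,b=2'

print(spaces_only.findall(sample))     # [('b', '2')]
print(any_whitespace.findall(sample))  # [('a', '"1"'), ('b', '2')]
```

With `[ ]*` the first item fails to match and `findall` moves on, so the stray tab costs a pair; with `\s*` the tab is absorbed and both pairs come back.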
  8. Paul McGuire

    Re: split string at commas respecting quotes when string not in csv format

    On Mar 27, 5:19 am, Tim Chase <> wrote:
    > >>  >>> import re
    > >>  >>> s = """a=1,b="0234,)#($)@", k="7" """
    > >>  >>> rx = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')
    > >>  >>> rx.findall(s)
    > >>  [('a', '1'), ('b', '"0234,)#($)@"'), ('k', '"7"')]
    > >>  >>> rx.findall('a=1, *DODGY*SYNTAX* b=2')
    > >>  [('a', '1'), ('b', '2')]

    >
    > > I'm going to save this one and study it, too.  I'd like to learn
    > > to use regexes better, even if I do try to avoid them when possible :)

    >
    > This regexp is fairly close to the one I used, but I employed the
    > re.VERBOSE flag to split it out for readability.  The above
    > breaks down as
    >
    >   [ ]*       # optional whitespace, traditionally "\s*"
    >   (\w+)      # tag the variable name as one or more "word" chars
    >   =          # the literal equals sign
    >   (          # tag the value
    >   [^",]+     # one or more non-[quote/comma] chars
    >   |          # or
    >   "[^"]*"    # quotes around a bunch of non-quote chars
    >   )          # end of the value being tagged
    >   [ ]*       # same as previously, optional whitespace  ("\s*")
    >   (?:        # a non-capturing group (why?)
    >   ,          # a literal comma
    >   |          # or
    >   $          # the end-of-line/string
    >   )          # end of the non-capturing group
    >
    > Hope this helps,
    >
    > -tkc


    Mightn't there be whitespace on either side of the '=' sign? And if
    you are using findall, why is the bit with the delimiting commas or
    end of line/string necessary? I should think findall would just skip
    over this stuff, like it skips over *DODGY*SYNTAX* in your example.

    -- Paul
     
    Paul McGuire, Mar 27, 2009
    #8
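
Both of Paul's points in post #8 can be tried out in a few lines. This is a sketch under two assumptions that nobody in the thread confirmed: that only plain spaces may surround the `=`, and that it is acceptable to drop the trailing comma/end-of-string check. The names `loose` and `parse_loose` are hypothetical.

```python
import re

# Assumption: spaces may surround '=', and no trailing delimiter is checked.
loose = re.compile(r'(\w+)[ ]*=[ ]*([^",]+|"[^"]*")')

def parse_loose(text):
    """Return (name, value) tuples, trimming spaces and surrounding quotes."""
    pairs = []
    for name, value in loose.findall(text):
        # Unquoted values run up to the next comma, so they can end in spaces.
        value = value.strip(' ')
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        pairs.append((name, value))
    return pairs

print(parse_loose('a = 1 , b = "0234,)#($)@", k="7"'))
# [('a', '1'), ('b', '0234,)#($)@'), ('k', '7')]
```

The trade-off is that without the delimiter group, text following a quoted value is no longer checked at all, so this variant is even more forgiving of malformed input than the original.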
  9. Tim Chase

    Re: split string at commas respecting quotes when string not in csv format

    Paul McGuire wrote:
    > On Mar 27, 5:19 am, Tim Chase <> wrote:
    >>>> >>> import re
    >>>> >>> s = """a=1,b="0234,)#($)@", k="7" """
    >>>> >>> rx = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')
    >>>> >>> rx.findall(s)
    >>>> [('a', '1'), ('b', '"0234,)#($)@"'), ('k', '"7"')]
    >>>> >>> rx.findall('a=1, *DODGY*SYNTAX* b=2')
    >>>> [('a', '1'), ('b', '2')]
    >>> I'm going to save this one and study it, too. I'd like to learn
    >>> to use regexes better, even if I do try to avoid them when possible :)

    >> This regexp is fairly close to the one I used, but I employed the
    >> re.VERBOSE flag to split it out for readability. The above
    >> breaks down as
    >>
    >> [ ]* # optional whitespace, traditionally "\s*"
    >> (\w+) # tag the variable name as one or more "word" chars
    >> = # the literal equals sign
    >> ( # tag the value
    >> [^",]+ # one or more non-[quote/comma] chars
    >> | # or
    >> "[^"]*" # quotes around a bunch of non-quote chars
    >> ) # end of the value being tagged
    >> [ ]* # same as previously, optional whitespace ("\s*")
    >> (?: # a non-capturing group (why?)
    >> , # a literal comma
    >> | # or
    >> $ # the end-of-line/string
    >> ) # end of the non-capturing group

    >
    > Mightn't there be whitespace on either side of the '=' sign? And if
    > you are using findall, why is the bit with the delimiting commas or
    > end of line/string necessary? I should think findall would just skip
    > over this stuff, like it skips over *DODGY*SYNTAX* in your example.


    Which would leave you with the solution(s) fairly close to what I
    originally posited ;-)

    (my comment about the "non-capturing group (why?)" was in
    relation to not needing to find the EOL/comma because findall()
    doesn't need it, as Paul points out, not the precedence of the
    "|" operator.)

    -tkc
     
    Tim Chase, Mar 27, 2009
    #9