Splitting a quoted string.

Discussion in 'Python' started by mosscliffe, May 16, 2007.

  1. mosscliffe

    mosscliffe Guest

    I am looking for a simple split function that creates a list of entries
    from a string containing quoted elements, as in a Google search.

    e.g. string = 'bob john "johnny cash" 234 june'

    and I want to have a list of ['bob', 'john', 'johnny cash', '234',
    'june']

    I wondered about using the csv routines, but I thought I would ask the
    experts first.

    There may be a simple function, but as yet I have not found it.

    Thanks

    Richard
    mosscliffe, May 16, 2007
    #1

  2. Paul Melis

    Paul Melis Guest

    Hi,

    Here is a not-so-simple function using regular expressions. It repeatedly
    matches two regexps, one that matches any sequence of characters except
    a space and one that matches a double-quoted string. If both match, the
    one occurring first in the string is taken and the matching part of the
    string is cut off. This is repeated until the whole string is consumed.
    If both match at the same point in the string, the longer of the two
    matches is taken. (This can't be done with a single regexp using the A|B
    operator, because alternation is tried left to right: if A matches, it
    is returned even if B would match a longer string.)
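    The left-to-right behaviour of A|B can be seen in a minimal sketch. The
    patterns here are illustrative only and are not the ones used in the
    function in this post:

```python
import re

# Alternation is ordered: the first alternative that matches wins,
# even when a later alternative would match more text.
first = re.match(r'a|ab', 'ab').group(0)

# Swapping the alternatives lets the longer one be tried first.
second = re.match(r'ab|a', 'ab').group(0)

print(first)   # a
print(second)  # ab
```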

    import re

    def split_string(s):

        pat1 = re.compile('[^ ]+')
        pat2 = re.compile('"[^"]*"')

        parts = []

        m1 = pat1.search(s)
        m2 = pat2.search(s)
        while m1 or m2:

            if m1 and m2:
                # Both match, take match occurring earliest in the string
                p1 = m1.group(0)
                p2 = m2.group(0)
                if m1.start(0) < m2.start(0):
                    part = p1
                    s = s[m1.end(0):]
                elif m2.start(0) < m1.start(0):
                    part = p2
                    s = s[m2.end(0):]
                else:
                    # Both match at the same string position, take longest match
                    if len(p1) > len(p2):
                        part = p1
                        s = s[m1.end(0):]
                    else:
                        part = p2
                        s = s[m2.end(0):]
            elif m1:
                part = m1.group(0)
                s = s[m1.end(0):]
            else:
                part = m2.group(0)
                s = s[m2.end(0):]

            parts.append(part)

            m1 = pat1.search(s)
            m2 = pat2.search(s)

        return parts

    >>> s = 'bob john "johnny cash" 234 june'
    >>> split_string(s)
    ['bob', 'john', '"johnny cash"', '234', 'june']
    >>>



    Paul
    Paul Melis, May 16, 2007
    #2

  3. Paul Melis

    Paul Melis Guest

    Here is a slightly improved version which is a bit more compact and which
    removes the quotes from the matched quoted string.

    import re

    def split_string(s):

        # pat1: a run of characters that are neither quotes nor spaces
        pat1 = re.compile('[^" ]+')
        # pat2: a double-quoted string; group 1 is the text without the quotes
        pat2 = re.compile('"([^"]*)"')

        parts = []

        m1 = pat1.search(s)
        m2 = pat2.search(s)
        while m1 or m2:

            if m1 and m2:
                if m1.start(0) < m2.start(0):
                    match = 1
                elif m2.start(0) < m1.start(0):
                    match = 2
                else:
                    if len(m1.group(0)) > len(m2.group(0)):
                        match = 1
                    else:
                        match = 2
            elif m1:
                match = 1
            else:
                match = 2

            if match == 1:
                part = m1.group(0)
                s = s[m1.end(0):]
            else:
                part = m2.group(1)
                s = s[m2.end(0):]

            parts.append(part)

            m1 = pat1.search(s)
            m2 = pat2.search(s)

        return parts

    print(split_string('bob john "johnny cash" 234 june'))
    print(split_string('"abc""abc"'))
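    With the two patterns of this improved version, an unquoted run can
    never start with a quote, so the two patterns cannot match at the same
    position. Under that assumption, a single alternation scanned with
    re.finditer appears to produce the same tokens. This is only a sketch;
    the names TOKEN and split_string_single_pass are made up for the
    example:

```python
import re

# Hypothetical pattern: a quoted string (group 1, quotes stripped)
# or a run of characters that are neither quotes nor spaces (group 2).
TOKEN = re.compile(r'"([^"]*)"|([^" ]+)')

def split_string_single_pass(s):
    parts = []
    for m in TOKEN.finditer(s):
        quoted = m.group(1)
        # group(1) is None when the unquoted alternative matched
        parts.append(quoted if quoted is not None else m.group(2))
    return parts

print(split_string_single_pass('bob john "johnny cash" 234 june'))
# ['bob', 'john', 'johnny cash', '234', 'june']
print(split_string_single_pass('"abc""abc"'))
# ['abc', 'abc']
```

    Like the loop version, this silently drops an unmatched quote character
    instead of reporting it.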
    Paul Melis, May 16, 2007
    #3
  4. Duncan Booth

    Duncan Booth Guest

    You probably need to specify the problem more completely. For example,
    can the quoted parts of the string contain quote marks? If so, what are
    the rules for escaping them? Do two spaces between words mean an empty
    field, or still a single delimiter?

    Once you've worked that out you can either use re.split with a suitable
    regular expression, or use the csv module, specifying your desired dialect:

    >>> import csv
    >>> class mosscliffe(csv.Dialect):
    ...     delimiter = ' '
    ...     quotechar = '"'
    ...     doublequote = False
    ...     skipinitialspace = False
    ...     lineterminator = '\r\n'
    ...     quoting = csv.QUOTE_MINIMAL
    ...
    >>> csv.register_dialect("mosscliffe", mosscliffe)
    >>> string = 'bob john "johnny cash" 234 june'
    >>> for row in csv.reader([string], dialect="mosscliffe"):
    ...     print(row)
    ...
    ['bob', 'john', 'johnny cash', '234', 'june']
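    The question about two spaces matters for the csv approach. As a rough
    sketch (passing the delimiter and quotechar as keyword arguments rather
    than registering a dialect, and using a sample string made up for the
    example), two adjacent delimiters come back as an empty field:

```python
import csv

s = 'bob  john "johnny cash"'   # note the two spaces after 'bob'

# Each space is a delimiter, so the doubled space yields an empty field.
row = next(csv.reader([s], delimiter=' ', quotechar='"'))
print(row)      # ['bob', '', 'john', 'johnny cash']

# One possible workaround: drop empty fields afterwards.
# Caveat: this also drops fields that were explicitly written as "".
compact = [field for field in row if field]
print(compact)  # ['bob', 'john', 'johnny cash']
```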
    Duncan Booth, May 16, 2007
    #4
  5. mosscliffe

    mosscliffe Guest

    Thank you all very much for your replies.

    I am now much wiser about using regexes and the csv module.

    As I am quite a newbie, I have had my 'class' education improved as
    well.

    Many thanks again

    Richard

    mosscliffe, May 16, 2007
    #5
  6. Gerard Flanagan

    See 'split' from the 'shlex' module:

    >>> s = 'bob john "johnny cash" 234 june'
    >>> import shlex
    >>> shlex.split(s)

    ['bob', 'john', 'johnny cash', '234', 'june']
    >>>
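    A few edge cases are worth checking before relying on shlex.split, since
    it follows shell-like rules rather than search-box rules. The sample
    strings below are made up for illustration:

```python
import shlex

# Runs of spaces are treated as a single separator.
collapsed = shlex.split('bob  john "johnny cash"')
print(collapsed)   # ['bob', 'john', 'johnny cash']

# Inside double quotes, a backslash escapes a quote character.
escaped = shlex.split(r'say "a \"quoted\" word"')
print(escaped)     # ['say', 'a "quoted" word']

# Single quotes also group words, so an apostrophe opens a quoted string.
grouped = shlex.split("bob 'johnny cash' june")
print(grouped)     # ['bob', 'johnny cash', 'june']

# An unbalanced quote (here an apostrophe) raises ValueError.
try:
    shlex.split("it's broken")
    error_raised = False
except ValueError:
    error_raised = True
print(error_raised)  # True
```

    If the input is free text typed by users, the apostrophe case probably
    needs handling, for example by catching ValueError and falling back to a
    plain str.split().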
    Gerard Flanagan, May 16, 2007
    #6