Split string but ignore quotes

Discussion in 'Python' started by Scooter, Sep 29, 2009.

  1. Scooter

    Scooter Guest

    I'm trying to reformat an Apache log file that was written with a
    custom output format, converting it to W3C format with a Python
    script. The problem I'm having is the field-to-field matching. In my
    Python code I'm using split with spaces as my delimiter, but it
    fails when it reaches the user agent because that field itself
    contains spaces. The user agent is enclosed in double quotes, though.
    So is there a way to split on a certain delimiter but not split
    within quoted strings?

    For example, a line might look like

    2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0;
    Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC
    5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200
    1923 1360 31715 -
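
    To make the failure concrete, here is a minimal sketch of what a plain
    whitespace split does to such a line. The sample line is shortened from
    the one above for readability; the variable names are illustrative only.

```python
# Shortened sample line: 8 intended fields, one of them quoted.
line = ('2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0)" '
        'http://somehost.com 200')

parts = line.split()

# The quoted user agent is broken apart at every space inside it,
# so the fields after it no longer line up with their columns.
print(len(parts))
print(parts[5:9])
```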
     
    Scooter, Sep 29, 2009
    #1

  2. Björn Lindqvist

    Try shlex:

    >>> import shlex
    >>> s = '2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200'
    >>> shlex.split(s)

    ['2009-09-29', '12:00:00', '-', 'GET', '/', 'Mozilla/4.0 (compatible;
    MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media
    Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)',
    'http://somehost.com', '200']
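
    One thing to keep in mind when running this over a whole log file:
    shlex follows shell quoting rules, so a line with an unbalanced quote
    raises ValueError. A minimal sketch of a wrapper that tolerates such
    lines follows; the helper name split_log_line and the choice to return
    None are assumptions for illustration, not part of shlex.

```python
import shlex

def split_log_line(line):
    """Return the fields of one log line, or None if its quoting is unbalanced."""
    try:
        # Quoted fields come back as single items with the quotes removed.
        return shlex.split(line)
    except ValueError:
        # shlex raises ValueError when a quote is never closed.
        return None

line = ('2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; '
        'Windows NT 6.0)" http://somehost.com 200 1923 1360 31715 -')
fields = split_log_line(line)
user_agent = fields[5]
```

    Lines for which the helper returns None can then be logged or skipped
    instead of aborting the whole conversion.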





    --
    mvh Björn
     
    Björn Lindqvist, Sep 29, 2009
    #2

  3. MRAB

    MRAB Guest

    The regex solution is:

    >>> import re
    >>> s = '2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200'
    >>> re.findall(r'".*?"|\S+', s)

    ['2009-09-29', '12:00:00', '-', 'GET', '/', '"Mozilla/4.0 (compatible;
    MSIE 7.0; Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center
    PC 5.0; .NET CLR 3.0.04506; .NET CLR 3.5.21022)"',
    'http://somehost.com', '200']
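
    Note that this pattern keeps the surrounding double quotes on the
    user-agent field. If they are not wanted, one possible variation is to
    capture the quoted and unquoted alternatives in separate groups. This
    is a sketch, not the pattern from the post; the names TOKEN and
    split_fields are made up for the example.

```python
import re

# Group 1: text inside double quotes; group 2: a bare run of non-whitespace.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')

def split_fields(line):
    """Split on whitespace, keeping double-quoted fields whole and unquoted."""
    fields = []
    for match in TOKEN.finditer(line):
        quoted, bare = match.groups()
        # Exactly one of the two groups matched; the other is None.
        fields.append(quoted if quoted is not None else bare)
    return fields

print(split_fields('GET / "Mozilla/4.0 (compatible; MSIE 7.0)" 200'))
```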
     
    MRAB, Sep 29, 2009
    #3
  4. Simon Forman

    Simon Forman Guest

    s = '''2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0;
    Windows NT 6.0; GTB5; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0;
    .NET CLR 3.0.04506; .NET CLR 3.5.21022)" http://somehost.com 200 1923
    1360 31715 -'''


    initial, user_agent, trailing = s.split('"')

    # Then depending on what you want to do with them...
    foo = initial.split() + [user_agent] + trailing.split()
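
    The three-way unpacking above raises ValueError unless the line
    contains exactly two double quotes. A slightly more defensive sketch
    of the same idea, using str.partition and str.rpartition, might look
    like this. The function name is hypothetical, and the sketch assumes
    at most one quoted field per line: everything between the first and
    the last quote is treated as that field.

```python
def split_one_quoted_field(line):
    """Split on whitespace, keeping one double-quoted field intact."""
    initial, open_quote, rest = line.partition('"')
    if not open_quote:
        # No quoted field at all: plain whitespace splitting is enough.
        return line.split()
    quoted, close_quote, trailing = rest.rpartition('"')
    if not close_quote:
        raise ValueError('unbalanced double quote in line')
    return initial.split() + [quoted] + trailing.split()

print(split_one_quoted_field('GET / "Mozilla/4.0 (compatible)" 200 1923'))
```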
     
    Simon Forman, Sep 29, 2009
    #4
  5. BJ Swope

    BJ Swope Guest

    Would the csv module be appropriate?
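
    As a rough sketch of what that could look like: csv.reader accepts any
    iterable of lines, and can be told to use a space as the delimiter and
    a double quote as the quote character. The sample line is shortened
    from the one in the question, and the variable names are illustrative.

```python
import csv

log_lines = [
    '2009-09-29 12:00:00 - GET / "Mozilla/4.0 (compatible; MSIE 7.0; '
    'Windows NT 6.0)" http://somehost.com 200 1923 1360 31715 -',
]

# A list of strings stands in for an open log file here.
reader = csv.reader(log_lines, delimiter=' ', quotechar='"')
rows = list(reader)

# The quoted user agent is returned as one field, without its quotes.
print(rows[0][5])
```

    With a space as the delimiter, runs of consecutive spaces would show up
    as empty fields, so this fits best when fields are separated by exactly
    one space.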

     
    BJ Swope, Sep 30, 2009
    #5
  6. Processor-Dev1l

    The best option for you is to use the shlex module, as Björn said.
    This is quite a simple question, and you would surely have found the
    answer yourself by searching the Python docs a little :)
     
    Processor-Dev1l, Sep 30, 2009
    #6