Splitting the words of a line

Discussion in 'Python' started by Sumit, Dec 6, 2007.

  1. Sumit

    Sumit Guest

    Hi,
    I am trying to split a line which is in the format below:

    AzAccept PLYSSTM01 [23/Sep/2005:16:14:28 -0500] "162.44.245.32 CN=dddd
    cojack (890),OU=1,OU=Customers,OU=ISM-Users,OU=kkk
    Secure,DC=customer,DC=rxcorp,DC=com" "plysmhc03zp GET /mci/performance/
    SelectProducts.aspx?
    p=0&V=C&a=29&menu=adhoc" [d4b62ca2-09a0-4334622b-0e1c-03c42ba5] [0]

    Here are all the strings I want to split it into:
    ---------------------------------
    AzAccept
    PLYSSTM01
    [23/Sep/2005:16:14:28 -0500]
    162.44.245.32
    CN=dddd cojack (890),OU=1,OU=Customers,OU=ISM-Users,OU=kkk
    Secure,DC=customer,DC=rxcorp,DC=com"
    GET
    /mci/performance/SelectProducts.aspx?p=0&V=C&a=29&menu=adhoc
    d4b62ca2-09a0-4334622b-0e1c-03c42ba5
    0
    --------------------------------

    I am trying to use the re.split() method to split them, but I am unable to get
    the exact result.

    Any help on this is highly appreciated.

    Thanks
    Sumit
    Sumit, Dec 6, 2007
    #1

  2. John Machin

    John Machin Guest

    On Dec 7, 2:21 am, Sumit <> wrote:
    > Hi,
    > I am trying to split a line which is in the format below:
    >
    > AzAccept PLYSSTM01 [23/Sep/2005:16:14:28 -0500] "162.44.245.32 CN=dddd
    > cojack (890),OU=1,OU=Customers,OU=ISM-Users,OU=kkk
    > Secure,DC=customer,DC=rxcorp,DC=com" "plysmhc03zp GET /mci/performance/
    > SelectProducts.aspx?
    > p=0&V=C&a=29&menu=adhoc" [d4b62ca2-09a0-4334622b-0e1c-03c42ba5] [0]


    Because lines are mangled in transmission, it is rather difficult to
    guess exactly what you have in your input and what your expected
    results are.

    Also you don't show exactly what you have tried.

    At the end is a small script that contains my guess as to your input
    and expected results, shows an example of what the re.VERBOSE flag is
    intended for, and how you might debug your results.

    So that you don't get your homework done 100% for free, I haven't
    corrected the last mistake I made.

    As usual, re may not be the best way of doing this exercise. Your
    *single* piece of evidence may not be enough. It appears to be a
    horrid conglomeration of instances of different things, each with its
    own grammar. You may find that something like PyParsing would be more
    legible and more robust.

    >
    > Here are all the strings I want to split it into:
    > ---------------------------------
    > AzAccept
    > PLYSSTM01
    > [23/Sep/2005:16:14:28 -0500]
    > 162.44.245.32
    > CN=dddd cojack (890),OU=1,OU=Customers,OU=ISM-Users,OU=kkk
    > Secure,DC=customer,DC=rxcorp,DC=com"
    > GET
    > /mci/performance/SelectProducts.aspx?p=0&V=C&a=29&menu=adhoc
    > d4b62ca2-09a0-4334622b-0e1c-03c42ba5
    > 0
    > --------------------------------
    >
    > I am trying to use the re.split() method to split them, but I am unable to get
    > the exact result.
    >


    C:\junk>type sumit.py
    import re

    # The input is one logical line; it is split across literals only for width.
    textin = \
        """AzAccept PLYSSTM01 [23/Sep/2005:16:14:28 -0500] "162.44.245.32 CN=dddd """ \
        """cojack (890),OU=1,OU=Customers,OU=ISM-Users,OU=kkk """ \
        """Secure,DC=customer,DC=rxcorp,DC=com" "plysmhc03zp GET /mci/performance/""" \
        """SelectProducts.aspx?""" \
        """p=0&V=C&a=29&menu=adhoc" [d4b62ca2-09a0-4334622b-0e1c-03c42ba5] [0]"""

    expected = [
        "AzAccept",
        "PLYSSTM01",
        "23/Sep/2005:16:14:28 -0500",
        "162.44.245.32",
        "CN=dddd cojack (890),OU=1,OU=Customers,OU=ISM-Users,OU=kkk "
        "Secure,DC=customer,DC=rxcorp,DC=com",
        "plysmhc03zp",
        "GET",
        "/mci/performance/SelectProducts.aspx?p=0&V=C&a=29&menu=adhoc",
        "d4b62ca2-09a0-4334622b-0e1c-03c42ba5",
        "0",
        ]

    # With re.VERBOSE, whitespace and #-comments inside the pattern are ignored.
    pattern = r"""
        (\S+) # AzAccept
        \s+
        (\S+) # PLYSSTM01
        \s+\[
        ([^]]+) # 23/Sep/2005:16:14:28 -0500
        ]\s+"
        (\S+) # 162.44.245.32
        \s+
        ([^"]+) # CN=dddd cojack (890),OU=1, etc etc,DC=rxcorp,DC=com
        "\s+"
        (\S+) # plysmhc03zp
        \s+
        (\S+) # GET
        \s+
        (\S+) # /mci/performance/ ... menu=adhoc
        \s+\[
        ([^]]+) # d4b62ca2-09a0-4334622b-0e1c-03c42ba5
        ]\s+\[
        ([^]]+) # 0
        ]$
        """

    mobj = re.match(pattern, textin, re.VERBOSE)
    if not mobj:
        print("Bzzzt!")
    else:
        result = mobj.groups()
        print("len check", len(result) == len(expected), len(result),
              len(expected))
        # Compare field by field so any mismatch is easy to spot.
        for a, b in zip(result, expected):
            print(a == b, repr(a), repr(b))




    C:\junk>python sumit.py
    len check True 10 10
    True 'AzAccept' 'AzAccept'
    True 'PLYSSTM01' 'PLYSSTM01'
    True '23/Sep/2005:16:14:28 -0500' '23/Sep/2005:16:14:28 -0500'
    True '162.44.245.32' '162.44.245.32'
    True 'CN=dddd cojack (890),OU=1,OU=Customers,OU=ISM-Users,OU=kkk Secure,DC=customer,DC=rxcorp,DC=com' 'CN=dddd cojack (890),OU=1,OU=Customers,OU=ISM-Users,OU=kkk Secure,DC=customer,DC=rxcorp,DC=com'
    True 'plysmhc03zp' 'plysmhc03zp'
    True 'GET' 'GET'
    False '/mci/performance/SelectProducts.aspx?p=0&V=C&a=29&menu=adhoc"' '/mci/performance/SelectProducts.aspx?p=0&V=C&a=29&menu=adhoc'
    True 'd4b62ca2-09a0-4334622b-0e1c-03c42ba5' 'd4b62ca2-09a0-4334622b-0e1c-03c42ba5'
    True '0' '0'

    C:\junk>
    John Machin, Dec 6, 2007
    #2
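
The single False line in the output above comes from the URL group: its `\S+` also swallows the closing double quote of the request string. Below is a minimal sketch of one possible correction, not necessarily the fix John Machin had in mind. It assumes the URL never contains a double quote or whitespace, and the names `line`, `fixed_pattern` and `fields` are made up for this illustration.

```python
import re

# The sample log line from the thread, rejoined into one logical line.
line = (
    'AzAccept PLYSSTM01 [23/Sep/2005:16:14:28 -0500] "162.44.245.32 CN=dddd '
    'cojack (890),OU=1,OU=Customers,OU=ISM-Users,OU=kkk '
    'Secure,DC=customer,DC=rxcorp,DC=com" "plysmhc03zp GET /mci/performance/'
    'SelectProducts.aspx?p=0&V=C&a=29&menu=adhoc" '
    '[d4b62ca2-09a0-4334622b-0e1c-03c42ba5] [0]'
)

# Same structure as the pattern in the post, but the URL group stops
# before the closing quote, which is then matched explicitly.
fixed_pattern = r"""
    (\S+) \s+ (\S+)                    # AzAccept, PLYSSTM01
    \s+ \[ ([^]]+) ]                   # timestamp
    \s+ " (\S+) \s+ ([^"]+) "          # IP address, distinguished name
    \s+ " (\S+) \s+ (\S+) \s+
          ([^"\s]+) "                  # host, method, URL without the quote
    \s+ \[ ([^]]+) ]                   # request id
    \s+ \[ ([^]]+) ] $                 # trailing number
"""

mobj = re.match(fixed_pattern, line, re.VERBOSE)
fields = mobj.groups() if mobj else ()
for field in fields:
    print(repr(field))
```

Excluding the quote inside the character class keeps the pattern anchored on the delimiters, so a missing closing quote makes the match fail outright instead of silently producing a field with stray punctuation.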

  3. Paul McGuire

    Paul McGuire Guest

    On Dec 6, 9:21 am, Sumit <> wrote:
    > Hi,
    > I am trying to split a line which is in the format below:
    >
    > AzAccept PLYSSTM01 [23/Sep/2005:16:14:28 -0500] "162.44.245.32 CN=dddd
    > cojack (890),OU=1,OU=Customers,OU=ISM-Users,OU=kkk
    > Secure,DC=customer,DC=rxcorp,DC=com" "plysmhc03zp GET /mci/performance/
    > SelectProducts.aspx?
    > p=0&V=C&a=29&menu=adhoc" [d4b62ca2-09a0-4334622b-0e1c-03c42ba5] [0]
    >


    As John Machin mentioned, pyparsing may be helpful to you. Here is a
    simple version:

    # The log entry is one logical line; it is split across literals only for width.
    data = (
        'AzAccept PLYSSTM01 [23/Sep/2005:16:14:28 -0500] '
        '"162.44.245.32 CN=dddd cojack (890),OU=1,OU=Customers,OU=ISM-Users,'
        'OU=kkk Secure,DC=customer,DC=rxcorp,DC=com" '
        '"plysmhc03zp GET /mci/performance/SelectProducts.aspx?'
        'p=0&V=C&a=29&menu=adhoc" [d4b62ca2-09a0-4334622b-0e1c-03c42ba5] [0]'
    )

    # Version 1 - simple
    from pyparsing import *
    LBRACK,RBRACK,COMMA = map(Suppress,"[],")
    num = Word(nums)
    # originalTextFor is the replacement for the deprecated keepOriginalText
    # parse action; it returns the matched date text as a single string.
    date = originalTextFor(
        Combine(num+"/"+Word(alphas)+"/"+num+":"+num+":"+num+":"+num) +
        oneOf("+ -") + num)
    uuid = delimitedList(Word(hexnums),"-",combine=True)
    logString = Word(alphas,alphanums) + Word(alphas,alphanums) + \
        LBRACK + date + RBRACK + quotedString + quotedString + \
        LBRACK + uuid + RBRACK + LBRACK + Word(nums) + RBRACK

    print(logString.parseString(data))

    Prints out:
    ['AzAccept', 'PLYSSTM01', '23/Sep/2005:16:14:28 -0500',
    '"162.44.245.32 CN=dddd cojack (890),OU=1,OU=Customers,OU=ISM-Users,OU=kkk Secure,DC=customer,DC=rxcorp,DC=com"',
    '"plysmhc03zp GET /mci/performance/SelectProducts.aspx?p=0&V=C&a=29&menu=adhoc"',
    'd4b62ca2-09a0-4334622b-0e1c-03c42ba5', '0']


    And here is a slightly fancier version, which parses the quoted
    strings (uses the pprint pretty-printing module to show structure of
    the parsed results):

    # Version 2 - fancy
    from pyparsing import *
    LBRACK,RBRACK,COMMA = map(Suppress,"[],")
    num = Word(nums)
    # originalTextFor is the replacement for the deprecated keepOriginalText
    # parse action; it returns the matched date text as a single string.
    date = originalTextFor(
        Combine(num+"/"+Word(alphas)+"/"+num+":"+num+":"+num+":"+num) +
        oneOf("+ -") + num)
    uuid = delimitedList(Word(hexnums),"-",combine=True)

    # Grammar for the contents of the first quoted string: IP, then key=value list
    ipAddr = delimitedList(Word(nums),".",combine=True)
    keyExpr=Word(alphas.upper())
    valExpr=CharsNotIn(',')
    qs1Expr = ipAddr + Group(delimitedList(Combine(keyExpr + '=' + valExpr)))
    def parseQS1(t):
        return qs1Expr.parseString(t[0])
    def parseQS2(t):
        # second quoted string: host, method and URL separated by whitespace
        return t[0].split()

    qs1 = quotedString.copy().setParseAction(removeQuotes, parseQS1)
    qs2 = quotedString.copy().setParseAction(removeQuotes, parseQS2)

    logString = Word(alphas,alphanums) + Word(alphas,alphanums) + \
        LBRACK + date + RBRACK + qs1 + qs2 + \
        LBRACK + uuid + RBRACK + LBRACK + Word(nums) + RBRACK

    from pprint import pprint
    pprint(logString.parseString(data).asList())

    Prints:
    ['AzAccept',
     'PLYSSTM01',
     '23/Sep/2005:16:14:28 -0500',
     '162.44.245.32',
     ['CN=dddd cojack (890)',
      'OU=1',
      'OU=Customers',
      'OU=ISM-Users',
      'OU=kkk Secure',
      'DC=customer',
      'DC=rxcorp',
      'DC=com'],
     'plysmhc03zp',
     'GET',
     '/mci/performance/SelectProducts.aspx?p=0&V=C&a=29&menu=adhoc',
     'd4b62ca2-09a0-4334622b-0e1c-03c42ba5',
     '0']

    Find more about pyparsing at http://pyparsing.wikispaces.com.

    -- Paul
    Paul McGuire, Dec 7, 2007
    #3
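
Since the original question asked about re.split(), here is a rough regex-only alternative for comparison. It is a different technique from both replies: instead of splitting on separators, it uses finditer to pull out tokens that are either bracketed, double-quoted, or bare words, then splits the two quoted strings on whitespace. The names `token_re` and `tokenize` are hypothetical, and the sketch assumes brackets and quotes are never nested or escaped, which the single sample line cannot confirm.

```python
import re

# The sample log line from the thread, rejoined into one logical line.
line = (
    'AzAccept PLYSSTM01 [23/Sep/2005:16:14:28 -0500] "162.44.245.32 CN=dddd '
    'cojack (890),OU=1,OU=Customers,OU=ISM-Users,OU=kkk '
    'Secure,DC=customer,DC=rxcorp,DC=com" "plysmhc03zp GET /mci/performance/'
    'SelectProducts.aspx?p=0&V=C&a=29&menu=adhoc" '
    '[d4b62ca2-09a0-4334622b-0e1c-03c42ba5] [0]'
)

# One alternative per token kind: [bracketed], "quoted", or a bare word.
# Each alternative has exactly one capturing group.
token_re = re.compile(r'\[([^\]]*)\]|"([^"]*)"|(\S+)')

def tokenize(text):
    """Return the top-level fields with brackets and quotes stripped."""
    # m.lastindex is the number of the one group that took part in the match.
    return [m.group(m.lastindex) for m in token_re.finditer(text)]

tokens = tokenize(line)

# The two quoted fields hold several values each; split them further.
ip, dn = tokens[3].split(None, 1)      # IP address, then the rest
host, method, url = tokens[4].split()  # three whitespace-separated parts

for value in tokens[:3] + [ip, dn, host, method, url] + tokens[5:]:
    print(repr(value))
```

This is more forgiving than one anchored pattern, which also means a malformed line produces odd tokens rather than a clean failure, so it suits quick exploration better than validation.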