Regular expression that skips single line comments?

Discussion in 'Python' started by martinjamesevans@gmail.com, Jan 19, 2009.

  1. Guest

    I am trying to parse a set of files that have a simple syntax using
    RE. I'm interested in counting '$' expansions in the files, with one
    minor consideration. A line becomes a comment if the first non-white
    space character is a semicolon.

    e.g. tests 1 and 2 should be ignored

    sInput = """
    ; $1 test1
    ; test2 $2
    test3 ; $3 $3 $3
    test4
    $5 test5
    $6
    test7 $7 test7
    """

    Required output: ['$3', '$3', '$3', '$5', '$6', '$7']


    The following RE works fine but does not deal with the commented
    lines:

    re.findall(r"(\$.)", sInput, re.I)

    e.g. ['$1', '$2', '$3', '$3', '$3', '$5', '$6', '$7']


    My attempts at trying to use (?!;) type expressions keep failing.

    I'm not convinced this is suitable for a single expression, so I have
    also attempted to first find-replace any commented lines out without
    much luck.

    e.g. re.sub(r"^[\t ]*?;.*?$", r"", sInput, re.I+re.M)
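    A possible reason the substitution above appears to do nothing: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a fourth positional argument is read as count rather than as flags, and without re.M the ^ anchor only matches at the very start of the string. A minimal sketch of the two-pass idea with the flags passed by keyword (the variable names stripped and expansions are illustrative, not from the post; the flags keyword needs Python 2.7/3.1 or later):

```python
import re

sInput = """
; $1 test1
; test2 $2
test3 ; $3 $3 $3
test4
$5 test5
$6
test7 $7 test7
"""

# Pass 1: blank out whole-line comments. flags= must be given by keyword;
# a fourth positional argument would be interpreted as count.
stripped = re.sub(r"^[\t ]*?;.*?$", "", sInput, flags=re.M)

# Pass 2: collect the '$' expansions from what is left.
expansions = re.findall(r"\$.", stripped)
print(expansions)
```

    The re.I flag is left out here because the pattern contains no letters.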


    Any suggestions would be appreciated. Thanks

    Martin
     
    , Jan 19, 2009
    #1

  2. MRAB Guest

    martinjamesevans wrote:
    > I am trying to parse a set of files that have a simple syntax using
    > RE. I'm interested in counting '$' expansions in the files, with one
    > minor consideration. A line becomes a comment if the first non-white
    > space character is a semicolon.
    >
    > e.g. tests 1 and 2 should be ignored
    >
    > sInput = """
    > ; $1 test1
    > ; test2 $2
    > test3 ; $3 $3 $3
    > test4
    > $5 test5
    > $6
    > test7 $7 test7
    > """
    >
    > Required output: ['$3', '$3', '$3', '$5', '$6', '$7']
    >
    >
    > The following RE works fine but does not deal with the commented
    > lines:
    >
    > re.findall(r"(\$.)", sInput, re.I)
    >
    > e.g. ['$1', '$2', '$3', '$3', '$3', '$5', '$6', '$7']
    >
    >
    > My attempts at trying to use (?!;) type expressions keep failing.
    >
    > I'm not convinced this is suitable for a single expression, so I have
    > also attempted to first find-replace any commented lines out without
    > much luck.
    >
    > e.g. re.sub(r"^[\t ]*?;.*?$", r"", sInput, re.I+re.M)
    >
    >
    > Any suggestions would be appreciated. Thanks
    >

    You could use:

    >>> re.findall(r"^\s*;.*|(\$.)", sInput, re.M)

    ['', '', '$3', '$3', '$3', '$5', '$6', '$7']

    and then ignore the empty strings.
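    Dropping the empty strings can be done with a list comprehension. This is only a sketch of that last step: the first alternative swallows each comment line without capturing anything, so findall reports an empty group for it. The name matches is illustrative.

```python
import re

sInput = """
; $1 test1
; test2 $2
test3 ; $3 $3 $3
test4
$5 test5
$6
test7 $7 test7
"""

# Comment lines match the first alternative and yield '' for group 1;
# real expansions match the second alternative and yield '$x'.
raw = re.findall(r"^\s*;.*|(\$.)", sInput, re.M)

# Keep only the non-empty captures.
matches = [m for m in raw if m]
print(matches)
```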
     
    MRAB, Jan 19, 2009
    #2

  3. Tim Chase Guest

    > I am trying to parse a set of files that have a simple syntax using
    > RE. I'm interested in counting '$' expansions in the files, with one
    > minor consideration. A line becomes a comment if the first non-white
    > space character is a semicolon.
    >
    > e.g. tests 1 and 2 should be ignored
    >
    > sInput = """
    > ; $1 test1
    > ; test2 $2
    > test3 ; $3 $3 $3
    > test4
    > $5 test5
    > $6
    > test7 $7 test7
    > """
    >
    > Required output: ['$3', '$3', '$3', '$5', '$6', '$7']


    We're interested in two things: comments and "dollar-something"s

    >>> import re
    >>> r_comment = re.compile(r'\s*;')
    >>> r_dollar = re.compile(r'\$\d+')


    Then remove comment lines and find the matching '$' expansions:

    >>> [r_dollar.findall(line) for line in sInput.splitlines()
    ...  if not r_comment.match(line)]
    [[], ['$3', '$3', '$3'], [], ['$5'], ['$6'], ['$7']]

    Finally, roll each line's results into a single list by slightly
    abusing sum()

    >>> sum((r_dollar.findall(line) for line in sInput.splitlines()
    ...      if not r_comment.match(line)), [])
    ['$3', '$3', '$3', '$5', '$6', '$7']

    Adjust the r_dollar if your variable pattern differs (such as
    reverting to your previous r'\$.' pattern if you prefer, or using
    r'\$\w+' for multi-character variables).
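    If the sum() trick feels too much like abuse, itertools.chain.from_iterable flattens the per-line results as well and avoids building intermediate lists. A sketch reusing the same two patterns (this is an alternative flattening step, not part of the original suggestion):

```python
import re
from itertools import chain

sInput = """
; $1 test1
; test2 $2
test3 ; $3 $3 $3
test4
$5 test5
$6
test7 $7 test7
"""

r_comment = re.compile(r'\s*;')
r_dollar = re.compile(r'\$\d+')

# Flatten the per-line findall() results into one list,
# skipping any line whose first non-whitespace character is ';'.
expansions = list(chain.from_iterable(
    r_dollar.findall(line)
    for line in sInput.splitlines()
    if not r_comment.match(line)
))
print(expansions)
```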

    -tkc
     
    Tim Chase, Jan 19, 2009
    #3
  4. Casey Guest

    Another option (I cheated a little and turned sInput into a sequence
    of lines, similar to what you would get reading a text file):

    sInput = [
    '; $1 test1',
    ' ; test2 $2',
    ' test3 ; $3 $3 $3',
    'test4',
    '$5 test5',
    ' $6',
    ' test7 $7 test7',
    ]

    import re
    re_exp = re.compile(r'(\$.)')
    re_cmt = re.compile(r'\s*;')
    expansions = [exp for line in sInput if not re_cmt.match(line)
                  for exp in re_exp.findall(line)]
    print(expansions)

    ['$3', '$3', '$3', '$5', '$6', '$7']
     
    Casey, Jan 19, 2009
    #4
  5. Steven D'Aprano Guest

    On Mon, 19 Jan 2009 08:08:01 -0800, martinjamesevans wrote:

    > I am trying to parse a set of files that have a simple syntax using RE.
    > I'm interested in counting '$' expansions in the files, with one minor
    > consideration. A line becomes a comment if the first non-white space
    > character is a semicolon.


    Since your data is line-based, surely the simplest, clearest and most
    natural solution is to parse each line individually instead of trying to
    process the entire input with a single RE?


    import re

    def extract_dollar_expansions(sInput):
        accumulator = []
        for line in sInput.split('\n'):
            # A line is a comment if its first non-whitespace character is ';'
            line = line.lstrip()
            if line.startswith(';'):
                continue
            accumulator.extend(re.findall(r"(\$.)", line))
        return accumulator


    (Aside: why are you doing a case-insensitive match for a non-letter? Are
    there different upper- and lower-case dollar signs?)



    >>> extract_dollar_expansions(sInput)

    ['$3', '$3', '$3', '$5', '$6', '$7']




    --
    Steven
     
    Steven D'Aprano, Jan 19, 2009
    #5
  6. Guest

    Firstly, a huge thanks to all for the solutions! Just what I was
    looking for.



    > (Aside: why are you doing a case-insensitive match for a non-letter? Are
    > there different upper- and lower-case dollar signs?)


    As you can probably imagine, I had simplified the problem slightly:
    the language uses a couple of different introducers and also uses both
    numbers and letters (but only single characters).
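    For illustration only, the per-line approach could be extended to that fuller syntax roughly as below. The actual introducers are not named in this thread, so '%' is a made-up stand-in for the second one, and INTRODUCERS, r_expansion and extract_expansions are hypothetical names.

```python
import re

# ASSUMPTION: '$' comes from the thread; '%' is an invented second
# introducer used purely as a placeholder.
INTRODUCERS = "$%"

# One introducer followed by exactly one letter or digit.
r_expansion = re.compile("[" + re.escape(INTRODUCERS) + "][0-9A-Za-z]")

def extract_expansions(text):
    """Collect expansions from every line that is not a ';' comment."""
    found = []
    for line in text.splitlines():
        if line.lstrip().startswith(";"):
            continue  # first non-whitespace character is ';'
        found.extend(r_expansion.findall(line))
    return found

sample = "; $1 skipped\nfoo $a %B $_ ; $9\n"
print(extract_expansions(sample))
```

    Because the character class already lists both cases of each letter, no re.I flag is needed.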

    I was going to go with a similar idea of parsing per line but was
    trying to give RE another chance. I've used RE often in the past but
    for some reason this one had got under my skin.

    I found this to be quite an interesting little tool:
    http://www.gskinner.com/RegExr/

    Martin
     
    , Jan 20, 2009
    #6
