Regular expression that skips single line comments?

martinjamesevans · Jan 19, 2009

I am trying to parse a set of files that have a simple syntax using
RE. I'm interested in counting '$' expansions in the files, with one
minor consideration. A line becomes a comment if the first non-white
space character is a semicolon.

e.g. tests 1 and 2 should be ignored

sInput = """
; $1 test1
; test2 $2
test3 ; $3 $3 $3
test4
$5 test5
$6
test7 $7 test7
"""

Required output: ['$3', '$3', '$3', '$5', '$6', '$7']

The following RE works fine but does not deal with the commented
lines:

re.findall(r"(\$.)", sInput, re.I)

e.g. ['$1', '$2', '$3', '$3', '$3', '$5', '$6', '$7']

My attempts at trying to use (?! type expressions keep failing.

I'm not convinced this is suitable for a single expression, so I have
also attempted to first find-replace any commented lines out without
much luck.

e.g. re.sub(r"^[\t ]*?;.*?$", r"", sInput, re.I+re.M)

Any suggestions would be appreciated. Thanks

Martin
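
A note on the find-replace attempt in the question: one likely reason it had no effect is that the fourth positional parameter of re.sub is count, not flags, so re.I+re.M was never applied as flags and ^ only matched at the very start of the string. A minimal sketch of the same idea with the flags passed by keyword (this assumes a Python version where re.sub accepts flags=; the pattern is slightly simplified from the one in the question):

```python
import re

sInput = """
; $1 test1
; test2 $2
test3 ; $3 $3 $3
test4
$5 test5
$6
test7 $7 test7
"""

# Pass flags by keyword: the fourth positional parameter of re.sub is count.
# re.M makes ^ and $ match at every line boundary, not only at the string ends.
without_comments = re.sub(r"^[\t ]*;.*$", "", sInput, flags=re.M)

# With the comment lines blanked out, the original findall does the rest.
expansions = re.findall(r"\$.", without_comments)
print(expansions)  # ['$3', '$3', '$3', '$5', '$6', '$7']
```

The comment lines are replaced by empty lines rather than removed, which does not matter when only counting expansions.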

MRAB · Jan 19, 2009

You could use:
['', '', '$3', '$3', '$3', '$5', '$6', '$7']

and then ignore the empty strings.
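
The expression itself does not appear in this copy of the post. As a sketch only, here is one pattern that produces exactly the output shown; it is an assumption, not necessarily the expression MRAB posted. The first alternative swallows a whole comment line without capturing anything, so findall reports an empty string for it, and the second alternative captures a '$' expansion:

```python
import re

sInput = """
; $1 test1
; test2 $2
test3 ; $3 $3 $3
test4
$5 test5
$6
test7 $7 test7
"""

# Hypothetical reconstruction: alternative 1 consumes a comment line
# (no capture), alternative 2 captures a '$' expansion in group 1.
pattern = re.compile(r"^[ \t]*;.*|(\$.)", re.M)

raw = pattern.findall(sInput)       # '' for every comment line matched
expansions = [m for m in raw if m]  # ignore the empty strings

print(raw)         # ['', '', '$3', '$3', '$3', '$5', '$6', '$7']
print(expansions)  # ['$3', '$3', '$3', '$5', '$6', '$7']
```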

Tim Chase · Jan 19, 2009

We're interested in two things: comments and "dollar-something"s
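
The definitions of the two patterns are not shown in this copy of the post. A minimal sketch of what they might look like, assuming a comment is optional whitespace followed by ';' and a variable is '$' plus one digit (both patterns are assumptions; only the names r_comment and r_dollar come from the post):

```python
import re

# Assumed definitions; the originals are not shown in the post.
r_comment = re.compile(r"\s*;")  # optional leading whitespace, then ';'
r_dollar = re.compile(r"\$\d")   # '$' followed by a single digit

# match() anchors at the start of the line, so a ';' later in the
# line does not turn the line into a comment.
print(bool(r_comment.match("  ; test2 $2")))  # True
print(bool(r_comment.match("test3 ; $3")))    # False
print(r_dollar.findall("test3 ; $3 $3 $3"))   # ['$3', '$3', '$3']
```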

Then remove comment lines and find the matching '$' expansions:

>>> [r_dollar.findall(line) for line in sInput.splitlines() if
...  not r_comment.match(line)]
[[], ['$3', '$3', '$3'], [], ['$5'], ['$6'], ['$7']]

Finally, roll each line's results into a single list by slightly
abusing sum():

>>> sum((r_dollar.findall(line) for line in sInput.splitlines()
...      if not r_comment.match(line)), [])
['$3', '$3', '$3', '$5', '$6', '$7']

Adjust the r_dollar if your variable pattern differs (such as
reverting to your previous r'\$.' pattern if you prefer, or using
r'\$\w+' for multi-character variables).
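
To illustrate the difference between those two variable patterns, here is a small sketch; the input line is made up for the example:

```python
import re

line = "copy $src to $dst, keep $1"  # hypothetical input line

single_char = re.compile(r"\$.")   # '$' plus exactly one character
multi_char = re.compile(r"\$\w+")  # '$' plus one or more word characters

print(single_char.findall(line))  # ['$s', '$d', '$1']
print(multi_char.findall(line))   # ['$src', '$dst', '$1']
```

The single-character pattern truncates longer names, so it only fits a syntax where every variable really is one character long.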

-tkc

Casey · Jan 19, 2009

Another option (I cheated a little and turned sInput into a sequence
of lines, similar to what you would get reading a text file):

sInput = [
    '; $1 test1',
    ' ; test2 $2',
    ' test3 ; $3 $3 $3',
    'test4',
    '$5 test5',
    ' $6',
    ' test7 $7 test7',
]

import re
re_exp = re.compile(r'(\$.)')
re_cmt = re.compile(r'\s*;')
expansions = [exp for line in sInput for exp in re_exp.findall(line)
              if not re_cmt.match(line)]
print(expansions)

['$3', '$3', '$3', '$5', '$6', '$7']


Steven D'Aprano · Jan 19, 2009

I am trying to parse a set of files that have a simple syntax using RE.
I'm interested in counting '$' expansions in the files, with one minor
consideration. A line becomes a comment if the first non-white space
character is a semicolon.

Since your data is line-based, surely the simplest, clearest and most
natural solution is to parse each line individually instead of trying to
process the entire input with a single RE?

import re

def extract_dollar_expansions(sInput):
    accumulator = []
    for line in sInput.split('\n'):
        line = line.lstrip()
        if line.startswith(';'):
            continue  # skip comment lines
        accumulator.extend(re.findall(r"(\$.)", line))
    return accumulator

(Aside: why are you doing a case-insensitive match for a non-letter? Are
there different upper- and lower-case dollar signs?)

['$3', '$3', '$3', '$5', '$6', '$7']

martinjamesevans · Jan 20, 2009

Firstly, a huge thanks to all for the solutions! Just what I was
looking for.

(Aside: why are you doing a case-insensitive match for a non-letter? Are
there different upper- and lower-case dollar signs?)

As you can probably imagine, I had simplified the problem slightly,
the language uses a couple of different introducers and also uses both
numbers and letters (but only single characters).
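
For a syntax like that, a character class can cover both the introducers and the letter case without re.I. A minimal sketch, assuming '%' as the second introducer (purely hypothetical, since the real introducers are not named here):

```python
import re

# '%' is a hypothetical second introducer; substitute the real ones.
# The class lists upper and lower case explicitly, so re.I is not needed.
r_expansion = re.compile(r"[$%][0-9A-Za-z]")

print(r_expansion.findall("test3 ; $3 %a $B"))  # ['$3', '%a', '$B']
```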

I was going to go with a similar idea of parsing per line but was
trying to give RE another chance. I've used RE often in the past but
for some reason this one had got under my skin.

I found this to be quite an interesting little tool:
http://www.gskinner.com/RegExr/

Martin
