Regular expression that skips single line comments?

martinjamesevans

I am trying to parse a set of files that have a simple syntax using
regular expressions. I'm interested in counting '$' expansions in the
files, with one minor consideration: a line is a comment if its first
non-whitespace character is a semicolon.

e.g. tests 1 and 2 should be ignored

sInput = """
; $1 test1
; test2 $2
test3 ; $3 $3 $3
test4
$5 test5
$6
test7 $7 test7
"""

Required output: ['$3', '$3', '$3', '$5', '$6', '$7']


The following RE works fine but does not deal with the commented
lines:

re.findall(r"(\$.)", sInput, re.I)

e.g. ['$1', '$2', '$3', '$3', '$3', '$5', '$6', '$7']


My attempts at trying to use (?!;) type expressions keep failing.

I'm not convinced this is suitable for a single expression, so I have
also attempted to first find-replace any commented lines out without
much luck.

e.g. re.sub(r"^[\t ]*?;.*?$", r"", sInput, re.I+re.M)
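One likely reason the substitution has no effect: the fourth positional parameter of re.sub is count, not flags, so re.I+re.M is read as a replacement limit and ^ only matches at the very start of the string. A minimal sketch, assuming that is the cause, which passes the flags by keyword and then runs the original findall:

```python
import re

sInput = """
; $1 test1
; test2 $2
test3 ; $3 $3 $3
test4
$5 test5
$6
test7 $7 test7
"""

# flags must be passed by keyword here: the 4th positional argument is count.
# With re.M, ^ and $ match at the start and end of every line.
stripped = re.sub(r"^[\t ]*;.*$", "", sInput, flags=re.M)

# Comment lines are now empty, so the simple pattern is enough.
expansions = re.findall(r"\$.", stripped)
print(expansions)
```

re.I is dropped in this sketch because the pattern contains no letters.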


Any suggestions would be appreciated. Thanks

Martin
 
MRAB

You could use:
['', '', '$3', '$3', '$3', '$5', '$6', '$7']

and then ignore the empty strings.
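The expression that produced this list is not shown in the post. One pattern consistent with that output, offered as an assumption rather than as the original, uses an alternation: the first branch consumes a whole comment line without capturing anything, and the second branch captures an expansion.

```python
import re

sInput = """
; $1 test1
; test2 $2
test3 ; $3 $3 $3
test4
$5 test5
$6
test7 $7 test7
"""

# Assumed pattern, not taken from the post.
# Branch 1 swallows a comment line; its match leaves group 1 empty.
# Branch 2 captures a '$' followed by one character.
pattern = re.compile(r"^[ \t]*;.*|(\$.)", re.M)

raw = pattern.findall(sInput)          # contains '' for each comment line
expansions = [m for m in raw if m]     # drop the empty strings
print(raw)
print(expansions)
```

Because findall returns the contents of the single group, every comment line contributes an empty string, which the list comprehension filters out.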
 
Tim Chase

We're interested in two things: comments and "dollar-something"s
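The definitions of r_comment and r_dollar are not shown in this post. A minimal sketch of what they might look like; both patterns are assumptions:

```python
import re

# Assumed definitions; the originals are not shown in the post.
r_comment = re.compile(r"\s*;")   # used with match(), so it is anchored at line start
r_dollar = re.compile(r"\$\d+")   # a dollar sign followed by one or more digits

# Quick check on two lines from the sample input.
print(bool(r_comment.match("; $1 test1")))
print(r_dollar.findall("test3 ; $3 $3 $3"))
```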

Then remove comment lines and find the matching '$' expansions:
>>> [r_dollar.findall(line) for line in sInput.splitlines()
...  if not r_comment.match(line)]
[[], ['$3', '$3', '$3'], [], ['$5'], ['$6'], ['$7']]

Finally, roll each line's results into a single list by slightly
abusing sum():
>>> sum((r_dollar.findall(line) for line in sInput.splitlines()
...      if not r_comment.match(line)), [])
['$3', '$3', '$3', '$5', '$6', '$7']
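sum() with a list start value builds a new list at every step, so for large inputs itertools.chain.from_iterable is a common alternative for flattening. A sketch; the two patterns are defined here as assumptions so the block runs on its own:

```python
import re
from itertools import chain

sInput = """
; $1 test1
; test2 $2
test3 ; $3 $3 $3
test4
$5 test5
$6
test7 $7 test7
"""

# Assumed patterns for illustration.
r_comment = re.compile(r"\s*;")
r_dollar = re.compile(r"\$.")

# chain.from_iterable concatenates the per-line result lists lazily.
flat = list(chain.from_iterable(
    r_dollar.findall(line)
    for line in sInput.splitlines()
    if not r_comment.match(line)
))
print(flat)
```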

Adjust the r_dollar if your variable pattern differs (such as
reverting to your previous r'\$.' pattern if you prefer, or using
r'\$\w+' for multi-character variables).

-tkc
 
Casey

Another option (I cheated a little and turned sInput into a sequence
of lines, similar to what you would get reading a text file):

sInput = [
'; $1 test1',
' ; test2 $2',
' test3 ; $3 $3 $3',
'test4',
'$5 test5',
' $6',
' test7 $7 test7',
]

import re
re_exp = re.compile(r'(\$.)')
re_cmt = re.compile(r'\s*;')
expansions = [exp for line in sInput for exp in re_exp.findall(line)
              if not re_cmt.match(line)]
print(expansions)
['$3', '$3', '$3', '$5', '$6', '$7']
 
Steven D'Aprano

Since your data is line-based, surely the simplest, clearest and most
natural solution is to parse each line individually instead of trying to
process the entire input with a single RE?


import re

def extract_dollar_expansions(sInput):
    accumulator = []
    for line in sInput.split('\n'):
        line = line.lstrip()
        if line.startswith(';'):
            continue  # comment line: skip it
        accumulator.extend(re.findall(r"(\$.)", line))
    return accumulator


(Aside: why are you doing a case-insensitive match for a non-letter? Are
there different upper- and lower-case dollar signs?)


>>> extract_dollar_expansions(sInput)
['$3', '$3', '$3', '$5', '$6', '$7']
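Since the stated goal is counting the expansions, the same line-by-line approach can feed a collections.Counter. A sketch; the function name count_dollar_expansions is made up for this example:

```python
import re
from collections import Counter

sInput = """
; $1 test1
; test2 $2
test3 ; $3 $3 $3
test4
$5 test5
$6
test7 $7 test7
"""

def count_dollar_expansions(text):
    """Tally each '$' expansion found on non-comment lines (hypothetical helper)."""
    counts = Counter()
    for line in text.splitlines():
        if line.lstrip().startswith(';'):
            continue  # comment line: skip it
        counts.update(re.findall(r"\$.", line))
    return counts

counts = count_dollar_expansions(sInput)
print(counts)
print(sum(counts.values()))  # total number of expansions
```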
 
martinjamesevans

Firstly, a huge thanks to all for the solutions! Just what I was
looking for.


> (Aside: why are you doing a case-insensitive match for a non-letter? Are
> there different upper- and lower-case dollar signs?)

As you can probably imagine, I had simplified the problem slightly,
the language uses a couple of different introducers and also uses both
numbers and letters (but only single characters).
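For several introducers followed by a single letter or digit, a character class covers both parts. In this sketch '%' stands in for a second introducer; it is a made-up example, since the real introducers are not named here, and extract_expansions is a hypothetical helper.

```python
import re

# Assumption: '$' and '%' are the introducers ('%' is purely illustrative).
# Each introducer must be followed by exactly one ASCII letter or digit.
r_expansion = re.compile(r"[$%][0-9A-Za-z]")

def extract_expansions(text):
    """Collect expansions from every line that is not a ';' comment."""
    found = []
    for line in text.splitlines():
        if line.lstrip().startswith(';'):
            continue  # comment line: skip it
        found.extend(r_expansion.findall(line))
    return found

sample = "; $1 skipped\nfoo $a %B\n  ; %c skipped\n$9 %% $-\n"
result = extract_expansions(sample)
print(result)
```

Listing both letter cases in the class makes re.I unnecessary.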

I was going to go with a similar idea of parsing per line but was
trying to give RE another chance. I've used RE often in the past but
for some reason this one had got under my skin.

I found this to be quite an interesting little tool:
http://www.gskinner.com/RegExr/

Martin
 
