How to match literal backslashes read from a text file using regular expressions?

cricfan

I'm parsing a text file to extract word definitions. For example, the
input text file contains the following content:

di.va.gate \'di_--v*-.ga_-t\ vb
pas.sim \'pas-*m\ adv : here and there : THROUGHOUT

I am trying to obtain the words between two literal backslashes (\ .. \),
but I am not able to match them using the regexp
re.compile(r'\\[^\\]*\\').

Here is my sample script:

import re;

#slashPattern = re.compile(re.escape(r'\\[^\\]*\\'));
pattern = r'\\[^\\]*\\'
slashPattern = re.compile(pattern);

fdr = file( "parseinput",'r');
line = fdr.readline();

while (line != ""):
    if (slashPattern.match(line)):
        print line.rstrip() + " <-- matches pattern " + pattern
    else:
        print line.rstrip() + " <-- DOES not match pattern " + pattern
    line = fdr.readline();
    print;


----------
The output

C:\home\krishna\lang\python>python wsparsetest.py
python wsparsetest.py
di.va.gate \'di_--v*-.ga_-t\ vb <-- DOES not match pattern \\[^\\]*\\
pas.sim \'pas-*m\ adv : here and there : THROUGHOUT <-- DOES not match pattern \\[^\\]*\\
 
John Machin

> I'm parsing a text file to extract word definitions. For example, the
> input text file contains the following content:
>
> di.va.gate \'di_--v*-.ga_-t\ vb
> pas.sim \'pas-*m\ adv : here and there : THROUGHOUT
>
> I am trying to obtain words between two literal backslashes (\ .. \). I
> am not able to match words between two literal backslashes using the
> regexp - re.compile(r'\\[^\\]*\\').
>
> Here is my sample script:
>
> import re;

Lose the semicolons ...

> #slashPattern = re.compile(re.escape(r'\\[^\\]*\\'));
> pattern = r'\\[^\\]*\\'
> slashPattern = re.compile(pattern);
>
> fdr = file( "parseinput",'r');
> line = fdr.readline();

You should upgrade so that you have a modern Python and a modern
tutor[ial] -- then you will be writing:

for line in fdr:
    do_something_with(line)
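
As a rough illustration of that idiom in current Python 3 (where `file()` no longer exists and `open()` is used instead), the sketch below iterates over a file object directly. The helper name `read_lines` and the temporary file are assumptions made only so the example is self-contained; they are not part of the original script.

```python
import os
import tempfile


def read_lines(path):
    """Return the lines of a text file with trailing newlines removed."""
    lines = []
    # A file object is its own iterator, so no readline()/while loop is needed.
    with open(path, "r") as fdr:
        for line in fdr:
            lines.append(line.rstrip("\n"))
    return lines


# Throwaway file holding the two sample entries from the question.
sample_text = (
    "di.va.gate \\'di_--v*-.ga_-t\\ vb\n"
    "pas.sim \\'pas-*m\\ adv : here and there : THROUGHOUT\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample_text)
    tmp_path = tmp.name

lines = read_lines(tmp_path)
os.remove(tmp_path)
print(len(lines))
```

The `with` block also closes the file automatically, which the `readline()` loop in the question never does.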

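> while (line != ""):

Lose the extraneous parentheses ...

>     if (slashPattern.match(line)):

Your main problem is that you should be using the search() method, not
the match() method. match() only tries to match at the beginning of the
string, whereas search() looks for the pattern anywhere in it. Read the
section on this topic in the re docs!!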
>>> import re
>>> pat = re.compile(r'\\[^\\]*\\')
>>> pat.match(r'abcd \xyz\ pqr')
>>> pat.search(r'abcd \xyz\ pqr')
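
To make the difference concrete, here is a small Python 3 sketch that runs both methods against one of the sample lines from the question. The variable names `sample`, `anchored` and `found` are illustrative only and do not come from the original script.

```python
import re

# Same pattern as in the thread: a backslash, any run of
# non-backslash characters, then another backslash.
pat = re.compile(r'\\[^\\]*\\')

sample = r"pas.sim \'pas-*m\ adv : here and there : THROUGHOUT"

# match() only succeeds if the pattern matches starting at index 0.
anchored = pat.match(sample)

# search() scans forward and returns the first match anywhere in the string.
found = pat.search(sample)

print(anchored)        # None: the line starts with "p", not a backslash
print(found.group(0))  # \'pas-*m\
```

Since every line in the input starts with the headword rather than a backslash, match() can never succeed on this data, which is consistent with the "DOES not match" output reported in the question.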
>         print line.rstrip() + " <-- matches pattern " + pattern
>     else:
>         print line.rstrip() + " <-- DOES not match pattern " + pattern
>     line = fdr.readline();
>     print;


> ----------
> The output
>
> C:\home\krishna\lang\python>python wsparsetest.py
> python wsparsetest.py
> di.va.gate \'di_--v*-.ga_-t\ vb <-- DOES not match pattern \\[^\\]*\\
> pas.sim \'pas-*m\ adv : here and there : THROUGHOUT <-- DOES not match pattern \\[^\\]*\\
> -----------
>
> What should I be doing to match those literal backslashes?
>
> Thanks
 
George Sakkis

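This should give you an idea of how to go about it (needs Python 2.3 or
newer):


import re
# Non-greedy group: capture everything up to the next backslash.
slashPattern = re.compile(r'\\(.*?)\\')

for i,line in enumerate(file("parseinput")):
    # The trailing comma keeps the result on the same output line.
    print "line", i+1,
    # search() finds the pattern anywhere in the line, unlike match().
    match = slashPattern.search(line)
    if match:
        print "matched:", match.group(1)
    else:
        print "did not match"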

#===== output =======================

line 1 matched: 'di_--v*-.ga_-t
line 2 matched: 'pas-*m

#====================================
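
For readers on Python 3, where `file()` and the print statement are gone, a rough equivalent of the loop above might look like the sketch below. The function name `extract_between_slashes` is hypothetical, and an `io.StringIO` object stands in for the "parseinput" file so that the example runs on its own.

```python
import io
import re

# Captures whatever sits between one backslash and the next.
slash_pattern = re.compile(r'\\(.*?)\\')


def extract_between_slashes(lines):
    """Yield (line_number, captured_text or None) for each input line."""
    for number, line in enumerate(lines, start=1):
        match = slash_pattern.search(line)
        yield number, (match.group(1) if match else None)


# Stand-in for open("parseinput"): the two sample lines from the question.
sample = io.StringIO(
    "di.va.gate \\'di_--v*-.ga_-t\\ vb\n"
    "pas.sim \\'pas-*m\\ adv : here and there : THROUGHOUT\n"
)

results = list(extract_between_slashes(sample))
for number, text in results:
    if text is not None:
        print("line", number, "matched:", text)
    else:
        print("line", number, "did not match")
```

With the real file, `extract_between_slashes(open("parseinput"))` would be the corresponding call, ideally inside a `with` block so the file is closed afterwards.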


George
 
