Regular expression for file name

Miki Tebeka

Hello All,

In a configuration file there can be ID's and filename tokens.
The file names have a known suffix (.o or .mls) and I need to get a regular
expression that will catch filename but not an ID.

Currently:
ID = r"[a-zA-Z\.]\w+(?![/\\])"
FILENAME = r"([a-zA-Z]:)?[\w./\\]+\.((mls)|(o))"

However if I have the filename "Sources/kernel/rom_kernel.mls" then
"Source" is interrupted as ID and "s/kernel/rom_kernel.mls" is interrupted
as file name.
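
The split described above can be reproduced with the plain `re` module. This is only a sketch: it assumes the lexer tries ID first at the start of the token and then resumes right after whatever matched, which may not be exactly how PLY orders its rules.

```python
import re

ID = r"[a-zA-Z\.]\w+(?![/\\])"
FILENAME = r"([a-zA-Z]:)?[\w./\\]+\.((mls)|(o))"

text = "Sources/kernel/rom_kernel.mls"

# ID tried first at position 0: \w+ initially grabs "ources", the negative
# lookahead rejects the following "/", so \w+ backs off one character.
first = re.match(ID, text)
print(first.group())   # Source

# Assumption: the lexer then continues from the end of that match.
rest = re.compile(FILENAME).match(text, first.end())
print(rest.group())    # s/kernel/rom_kernel.mls
```

So the negative lookahead does not reject the token; it only makes the ID match one character shorter.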

Any way to do better?

BTW: I'm using PLY (http://systems.cs.uchicago.edu/ply/) for parsing.

Bye.
 
Bengt Richter

> Hello All,
>
> In a configuration file there can be ID's and filename tokens.
> The file names have a known suffix (.o or .mls) and I need to get a regular
> expression that will catch filename but not an ID.
>
> Currently:
> ID = r"[a-zA-Z\.]\w+(?![/\\])"
> FILENAME = r"([a-zA-Z]:)?[\w./\\]+\.((mls)|(o))"
>
> However if I have the filename "Sources/kernel/rom_kernel.mls" then
> "Source" is interrupted as ID and "s/kernel/rom_kernel.mls" is interrupted
> as file name.

ITYM s/interrupted/interpreted/ ;-)

> Any way to do better?

If you want to prioritize matching among several patterns that
share a common leading part, UIAM alternatives joined with | get
tried left to right. I haven't checked your patterns themselves,
but I think here's a possible way to give priority to the FILENAME
pattern:
>>> import re
>>> ID = r"[a-zA-Z\.]\w+(?![/\\])"
>>> FILENAME = r"([a-zA-Z]:)?[\w./\\]+\.((mls)|(o))"
>>> COMBINED = '(?P<file>%s)|(?P<id>%s)' % (FILENAME, ID)
>>> rxo = re.compile(COMBINED)
>>> filename = "Sources/kernel/rom_kernel.mls"
>>> rxo.search(filename).groupdict()
{'id': None, 'file': 'Sources/kernel/rom_kernel.mls'}

Try it with an id:
>>> rxo.search('no_slashes_in_this').groupdict()
{'id': 'no_slashes_in_this', 'file': None}

Of course you can mess with the result, e.g.,
>>> result = rxo.search('no_slashes_in_this').groupdict()
>>> result['id']
'no_slashes_in_this'
>>> result['file']
>>> result['file'] is None
True
>>> result['id'], result['file']
('no_slashes_in_this', None)
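
The same combined pattern can also classify every token on a line. This is a rough sketch only: `classify` is a hypothetical helper, the sample line is made up, and it relies on `finditer` skipping the whitespace between tokens.

```python
import re

ID = r"[a-zA-Z\.]\w+(?![/\\])"
FILENAME = r"([a-zA-Z]:)?[\w./\\]+\.((mls)|(o))"
# FILENAME comes first so it wins whenever both alternatives could match.
COMBINED = re.compile('(?P<file>%s)|(?P<id>%s)' % (FILENAME, ID))

def classify(line):
    """Return (kind, text) pairs for each token found in line."""
    tokens = []
    for m in COMBINED.finditer(line):
        # Exactly one of the two named groups is set for any match.
        kind = 'file' if m.group('file') is not None else 'id'
        tokens.append((kind, m.group()))
    return tokens

print(classify("target Sources/kernel/rom_kernel.mls boot.o"))
```

For that sample line this gives one id ('target') followed by two file tokens.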

No guarantees, but HTH

Regards,
Bengt Richter
 
Christopher T King

> In a configuration file there can be ID's and filename tokens.
> The file names have a known suffix (.o or .mls) and I need to get a regular
> expression that will catch filename but not an ID.
>
> Currently:
> ID = r"[a-zA-Z\.]\w+(?![/\\])"
> FILENAME = r"([a-zA-Z]:)?[\w./\\]+\.((mls)|(o))"
>
> However if I have the filename "Sources/kernel/rom_kernel.mls" then
> "Source" is interrupted as ID and "s/kernel/rom_kernel.mls" is interrupted
> as file name.

I'm not familiar with PLY, but my guess is that you get those results
because it tries to match ID first, and then FILENAME. The best way to
solve this is to add another constraint to your REs, namely the
delimiter at the end of the token (presumably whitespace):

ID = r"[a-zA-Z\.]\w+(?=\s)"
FILENAME = r"([a-zA-Z]:)?[\w./\\]+\.((mls)|(o))(?=\s)"

I'm not sure if PLY supports (?=...) or not, but I assume it does, since
you used its complement ((?!...)) in your original REs.
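
A quick check of the two patterns above with the plain `re` module, as a sketch only: `first_token` is a hypothetical stand-in for a lexer that tries ID before FILENAME at the current position, and nothing here is PLY-specific.

```python
import re

ID = re.compile(r"[a-zA-Z\.]\w+(?=\s)")
FILENAME = re.compile(r"([a-zA-Z]:)?[\w./\\]+\.((mls)|(o))(?=\s)")

def first_token(text):
    """Try ID, then FILENAME, both anchored at the start of text."""
    for kind, rx in (("ID", ID), ("FILENAME", FILENAME)):
        m = rx.match(text)
        if m:
            return kind, m.group()
    return None

# ID can no longer stop in the middle of the path, so FILENAME gets it.
print(first_token("Sources/kernel/rom_kernel.mls\n"))
print(first_token("kernel_id\n"))
# With no trailing whitespace the lookahead fails and nothing matches.
print(first_token("kernel_id"))
```

If the last token in the file might not be followed by whitespace, `(?=\s|$)` is one possible variation of the lookahead.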