John said:
Perhaps you might like to be somewhat more precise with your
requirements.
Sure. More on this below.
"POSIX-compliant" made me think of yuckies like [:fubar:]
in character classes
Yep. I do not need POSIX *syntax* for regular expressions but POSIX
*semantics*, at least the "leftmost-longest" part (in contrast to the
"first then longest" used in Python, Perl, .NET, etc.)
The operands of | are such that the length is not fixed and so you can't
write them in descending length order? Care to tell us some more detail
about those operands?
Basically, I'd like to use the (excellent) python module SPARK
of John Aycock to build an (extended) C lexer. To do so, I need
to specify the patterns that match my tokens as well as a priority
between them. SPARK then builds a big alternate list of patterns
that begins with the high priority patterns and ends with the low
priority patterns and runs a match.
The problem with to be very careful and to specify explicitely the
priorities to get the desired results: "<=" shall be higher than "<",
decimal stuff higher than integer, etc, when most of the time what
you really want is to match the longest pattern ...
Worse, the priority work-around does not work well when you
compare keywords and (other) identifiers. To match "fortune"
as a identifier, you would need to define identifier with a higher
priority than keyword and it is a problem: "for" would be then
match as a identifier when it is a keyword.
I can come up with possible work-arounds for the "id vs
keyword" issue, but nothing that really makes me happy ...
Therefore, I was studying the possible replacement of the
Python native regular expression engine with a "POSIX
semantics" regular expression engine that would give the
longest match and avoid me a lot of extra work ...
I hope it's clearer now
Any advice ?
Cheers
SB