look-behind in oniguruma

Phil Tomson

Apparently oniguruma supports look-behind. Is there any documentation on how
to use this feature?

For example, if I had the string "~ABC~DE" and wanted to return a list of
letters in the string which are preceded by '~' (['A','D'] in this case), how
might I use the look-behind feature in oniguruma to achieve this? Or, how
would I get a list of letters in the string which are not preceded by '~'
(['B','C','E'] in this example)?

(I know there are other ways of doing this, I'm just posing this as an example
of using look-behind).

Here's one that's a bit trickier: What if I had "~(ABC)DE" and I want the
tilde (a negation operator) to apply to each letter within the parens that it
precedes, so that I would get ['A','B','C'], but in the case where the input
string is "(ABC)DE" I would get an empty list? And then of course I would
want them to be nestable: "~(~ABC)" ('A' should not appear in the list in this
case since it's doubly negated). OK, that's probably going too far, and
maybe it's getting to the point where I should break out RACC ;-)

Phil
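
The nested-negation case is more than a single look-behind can express, but it does not need a full parser generator either. The following is a minimal sketch, not from the original post, of one possible reading of the rules above: a '~' applies to the next letter or parenthesized group, and a letter is listed when it sits under an odd number of negations. The method name `negated_letters` is hypothetical.

```ruby
# Hypothetical helper: collect the letters that end up negated an odd
# number of times. Assumes uppercase letters, '~', '(' and ')' only.
def negated_letters(input)
  result  = []
  stack   = [false]  # negation state of each enclosing group
  pending = false    # true when a '~' directly precedes the next atom

  input.each_char do |ch|
    case ch
    when '~'
      pending = !pending
    when '('
      # The group inherits the enclosing state, flipped by a pending '~'.
      stack.push(stack.last ^ pending)
      pending = false
    when ')'
      stack.pop if stack.size > 1
      pending = false
    when /[A-Z]/
      result << ch if stack.last ^ pending
      pending = false
    end
  end
  result
end

p negated_letters("~ABC~DE")   # ["A", "D"]
p negated_letters("~(ABC)DE")  # ["A", "B", "C"]
p negated_letters("(ABC)DE")   # []
p negated_letters("~(~ABC)")   # ["B", "C"]
```

A stack of booleans is enough here because the only state a group carries is whether it is negated; anything richer (operators, precedence) is where a tool like RACC would start to pay off.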
 
Andrew Johnson

Essentially, they are the same as look-aheads (zero-width assertions),
except that the look-behind expression must be a fixed-width pattern (no
indeterminate quantifiers), and no captures are allowed in a negative
look-behind.

str = "~ABC~DE"
p str.scan(/(?<=~)[A-Z]/)   # letters preceded by '~'
p str.scan(/(?<!~)[A-Z]/)   # letters not preceded by '~'

gives:

["A", "D"]
["B", "C", "E"]

regards,
andrew
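
The fixed-width rule can be probed directly. The following is a minimal sketch, not part of the original reply, assuming a Ruby whose built-in regexp engine is Oniguruma or its fork Onigmo (Ruby 1.9 and later). `compiles?` is a hypothetical helper introduced only for illustration.

```ruby
# Hypothetical helper: true if the pattern source compiles, false if the
# engine rejects it with a RegexpError.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

str = "~ABC~DE"

# A fixed-width look-behind body compiles and behaves as described above.
p compiles?('(?<=~)[A-Z]')
p str.scan(/(?<=~)[A-Z]/)
p str.scan(/(?<!~)[A-Z]/)

# Top-level alternatives of different lengths are also accepted.
p compiles?('(?<=~|~~)[A-Z]')

# An indeterminate quantifier inside the look-behind is the case the
# fixed-width rule is about; an engine that enforces it reports false here.
p compiles?('(?<=~+)[A-Z]')
```

Building the pattern with `Regexp.new` from a string matters for the probe: a rejected regexp literal would stop the whole script at parse time instead of raising a rescuable error.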
 
Phil Tomson

Thanks. That's what I was looking for. Is this essentially the same way that
it works in Perl?

Phil
 
Florian Gross

Andrew said:
Essentially, they are the same as look-aheads (zero-width assertions),
except that the look-behind expression must be a fixed-width pattern (no
indeterminate quantifiers), and no captures are allowed in a negative
look-behind.

So it is implemented as zero-width look-ahead + eating as many
characters as the content matches?

(I've thought about implementing /foo/.preceded_by('bar') as
/(?=bar).{3}foo/.)
More regards,
Florian Gross
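
To make the difference concrete, here is a minimal sketch, added for illustration and not from the original posts, comparing the emulation idea (a positive look-ahead followed by a fixed number of consumed characters) with a real look-behind. The sample string and variable names are assumptions.

```ruby
text = "barfoo bazfoo"

emulated    = /(?=bar).{3}foo/  # look-ahead, then consume three characters
look_behind = /(?<=bar)foo/     # zero-width look-behind

m1 = text.match(emulated)
m2 = text.match(look_behind)

p m1[0]        # "barfoo"  (the prefix is part of the match)
p m1.begin(0)  # 0
p m2[0]        # "foo"     (the prefix is only asserted)
p m2.begin(0)  # 3

# The difference shows up as soon as the match is replaced.
p text.gsub(emulated, "X")     # "X bazfoo"
p text.gsub(look_behind, "X")  # "barX bazfoo"
```

Both patterns find the same occurrences, but the emulation consumes the prefix, so it changes match offsets, replacement results, and which later matches can overlap.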
 
Simon Strandgaard

Oniguruma supports alternation inside lookbehind, so you can get behavior
similar to quantifiers.

AEditor's regexp engine supports variable-width lookbehind, where you
can use quantifiers inside lookbehind (with an inverted leftmost-longest
rule).

It would be good if Oniguruma had support for quantifiers inside lookbehind.

irb(main):007:0> re = NewRegexp.new('(?=.z).(?<=(?:ab){2,3}x.)')
=> +-Sequence
     +-Lookahead positive
     | +-Sequence
     |   +-Outside set=U-000A
     |   +-Inside set="z"
     +-Outside set=U-000A
     +-Lookbehind positive
       +-Sequence
         +-Repeat greedy{2,3}   # quantifier inside lookbehind!!
         | +-Group non-capturing
         |   +-Sequence
         |     +-Inside set="a"
         |     +-Inside set="b"
         +-Inside set="x"
         +-Outside set=U-000A
irb(main):008:0> 'xyz'.gsub5(re, 'Y')
=> "xyz"
irb(main):009:0> 'abxyz'.gsub5(re, 'Y')
=> "abxyz"
irb(main):010:0> 'ababxyz'.gsub5(re, 'Y')
=> "ababxYz"
irb(main):011:0> 'abababxyz'.gsub5(re, 'Y')
=> "abababxYz"
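
For comparison, the bounded quantifier in that look-behind can be unrolled into top-level alternatives, which is the workaround mentioned above. This is a minimal sketch using Ruby's built-in regexps, assuming an Oniguruma or Onigmo based Ruby; it is not AEditor's engine, and the variable names are illustrative.

```ruby
# (?:ab){2,3}x. unrolled by hand into its two possible widths.
re = /(?=.z).(?<=ababx.|abababx.)/

inputs  = %w[xyz abxyz ababxyz abababxyz]
results = inputs.map { |s| s.gsub(re, 'Y') }

p results  # ["xyz", "abxyz", "ababxYz", "abababxYz"]
```

Unrolling works only while the repeat count is bounded and small, since each count becomes its own alternative.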
 
Simon Strandgaard


(here is an example with infinite quantifiers)

irb(main):016:0> re = NewRegexp.new('(?<!(ab)+|(cd){2,}).')
=> +-Sequence
     +-Lookbehind negative
     | +-Alternation
     |   +-Repeat greedy{1,-1}
     |   | +-Group capture=1
     |   |   +-Sequence
     |   |     +-Inside set="a"
     |   |     +-Inside set="b"
     |   +-Repeat greedy{2,-1}
     |     +-Group capture=2
     |       +-Sequence
     |         +-Inside set="c"
     |         +-Inside set="d"
     +-Outside set=U-000A
irb(main):017:0> 'qwerty'.gsub5(re, 'Z')
=> "ZZZZZZ"
irb(main):018:0> 'qweabrty'.gsub5(re, 'Z')
=> "ZZZZZrZZ"
irb(main):019:0> 'cdcdqwerty'.gsub5(re, 'Z')
=> "ZZZZqZZZZZ"
irb(main):020:0> 'cdqwerty'.gsub5(re, 'Z')
=> "ZZZZZZZZ"
irb(main):021:0>
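
For this particular pattern the unbounded quantifiers are not strictly needed: a position is preceded by one or more "ab" exactly when it is preceded by a single "ab", and by two or more "cd" exactly when it is preceded by "cdcd". Under that observation, the following sketch reproduces the four results above with a fixed-width alternation in Ruby's built-in engine (assumed to be Oniguruma or Onigmo based). It drops the capture groups, which the thread notes are not allowed in a negative look-behind.

```ruby
# Shortest suffix of each repeated alternative, without captures.
re = /(?<!ab|cdcd)./

inputs  = %w[qwerty qweabrty cdcdqwerty cdqwerty]
results = inputs.map { |s| s.gsub(re, 'Z') }

p results  # ["ZZZZZZ", "ZZZZZrZZ", "ZZZZqZZZZZ", "ZZZZZZZZ"]
```

The rewrite does not generalize to look-behinds whose repeated part is followed by more pattern, such as the `(?:ab){2,3}x.` example earlier in the thread.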