Restricted capture in Regexp

benjohn · Dec 13, 2006

Is there a regexp feature that lets me require something to be present
in the input string for the regexp to match, but for that to not become
captured as part of the match?

I want this so that I can scan and gsub on a string of code and replace
variables. Matching just variables requires looking at the context
around them, but if I capture this, I replace the context too.

Eg, to scan for variables called x or y, I might use:
/(^|[^a-zA-Z])[xy]([^a-zA-Z]|$)/

but using that on "exp(x)" will match (and replace) "(x)", which I don't
want at all.

Cheers,
Benjohn
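
A quick check of the behaviour described in this post. This is only a minimal sketch: the replacement value "1" is a stand-in chosen for illustration, not something the post specifies.

```ruby
# The context characters on both sides are part of the match,
# so gsub replaces them together with the variable.
pattern = /(^|[^a-zA-Z])[xy]([^a-zA-Z]|$)/

matched  = "exp(x)"[pattern]            # the full text the regexp matched
replaced = "exp(x)".gsub(pattern, "1")  # "1" is a placeholder value

puts matched   # (x)
puts replaced  # exp1
```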

Bertram Scharpf · Dec 13, 2006

Hi Benjohn,

On Wednesday, 13 Dec 2006, 18:24:08 +0900, (e-mail address removed) wrote:

Eg, to scan for variables called x or y, I might use:
/(^|[^a-zA-Z])[xy]([^a-zA-Z]|$)/

but using that on "exp(x)" will match (and replace) "(x)", which I don't
want at all.

/\b[xy]\b/

The \b pattern (word boundary) looks to the left without consuming
anything, just as the ^ pattern does.

I would appreciate a general pattern that looks to the left,
corresponding to (?=re), which is non-consuming to the right.

Bertram
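
A small sketch of the word-boundary suggestion, with "1" as a placeholder replacement (an assumption for illustration). Note that \b treats digits and underscores as word characters, so it is not exactly equivalent to the [^a-zA-Z] context in the original pattern.

```ruby
# \b is zero-width: it asserts a word/non-word transition
# without making the neighbouring character part of the match.
pattern = /\b[xy]\b/

in_call    = "exp(x)".gsub(pattern, "1")
adjacent   = "x+y".gsub(pattern, "1")
underscore = "x_1 + x".gsub(pattern, "1")

puts in_call     # exp(1)
puts adjacent    # 1+1
puts underscore  # x_1 + 1  (the x in x_1 is followed by a word character)
```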

benjohn · Dec 13, 2006

Is there a regexp feature that lets me require something to be present
in the input string for the regexp to match, but for that to not become
captured as part of the match?

Neither yes nor no, because of how you have worded your question. See
below.

I want this so that I can scan and gsub on a string of code and replace
variables. Matching just variables requires looking at the context
around them, but if I capture this, I replace the context too.

Eg, to scan for variables called x or y, I might use:
/(^|[^a-zA-Z])[xy]([^a-zA-Z]|$)/

but using that on "exp(x)" will match (and replace) "(x)", which I don't
want at all.

There are a number of ways to accomplish this. The simplest is to put
the part you want to preserve in parentheses, and refer to it in the
replacement.

Like this:

data.sub!(%r{(^|[^a-zA-Z])([xy])([^a-zA-Z]|$)},"\\1\\2\\3")

Notice about this example that the [xy] character class is now captured
and used as part of the replacement, so its original value is preserved.

Using this approach, you preserve the parts you don't want to replace,
and replace the parts you do. In the above example, everything is
preserved, but it is just meant to show the pattern.

Hi Paul,

thanks for the reply. I know I can do this, but it means that the
substitution ("\\1\\2\\3") has to be aware of the composition of the
regular expression. The Regexp is no longer a neat little machine that
only grabs things to replace. It's now grabbing the packaging around the
thing to replace too, so you've got to be aware of this in writing the
substitution.

Cheers,
Benjohn
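
To make the trade-off in this exchange concrete, here is a rough sketch of the capture-and-reinsert approach. The placeholder "N" is an assumption for illustration. The second case shows a side effect of consuming context: the character to the right of one match is used up, so it cannot act as the left-hand context of the next match.

```ruby
pattern = /(^|[^a-zA-Z])([xy])([^a-zA-Z]|$)/

# Groups 1 and 3 hold the context; the replacement has to put them back.
single = "exp(x)".gsub(pattern, '\1N\3')

# "x+" is matched first, which consumes the "+", so "y" no longer
# has an unconsumed non-letter (or line start) in front of it.
adjacent = "x+y".gsub(pattern, '\1N\3')

puts single    # exp(N)
puts adjacent  # N+y
```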

benjohn · Dec 13, 2006

Hi Benjohn,

On Wednesday, 13 Dec 2006, 18:24:08 +0900, (e-mail address removed) wrote:

Eg, to scan for variables called x or y, I might use:
/(^|[^a-zA-Z])[xy]([^a-zA-Z]|$)/

but using that on "exp(x)" will match (and replace) "(x)", which I don't
want at all.

/\b[xy]\b/

The \b pattern (word boundary) looks to the left without consuming
anything, just as the ^ pattern does.

This seems like the best approach in this case, as it's a good enough
way to find variables. It does break down in the complex case though.

I would appreciate a general pattern that looks to the left,
corresponding to (?=re), which is non-consuming to the right.

The book I'm reading (O'Reilly pocket reference) hints at the
lookaround constructs being:

(?=...) - look ahead.
(?!...) - negated look ahead.
(?<=...) - look behind.
(?<!...) - negated look behind.

So perhaps one of those is what you want?
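
A sketch of how the four constructs listed here could be applied to the original problem. It assumes a regexp engine that supports lookbehind, which in Ruby means version 1.9 or later; the sample string is made up for illustration.

```ruby
# Assumes Ruby 1.9 or later (lookbehind support).
# Both assertions are zero-width, so only the variable itself is matched.
pattern = /(?<![a-zA-Z])[xy](?![a-zA-Z])/

names  = "exp(x) + y".scan(pattern)
output = "exp(x) + y".gsub(pattern) { |name| name.upcase }

p names      # ["x", "y"]
puts output  # exp(X) + Y
```

Because the pattern has no capture groups and consumes no context, the same regexp can be handed to scan or gsub without the caller knowing how it is built.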

benjohn · Dec 13, 2006

/ ...

Yes, that is true for all regular expressions.

Yes, but this cannot be avoided. You have two choices for examined text
that surrounds the area to be modified -- you can capture it while
examining it, and use the captured text in the replacement, or you can
use non-capturing references:

(?=non-captured text)

I think this may be what I should use. Also, the suggestion of using
word edge tokens works for the specific case.

But the two alternatives work much the same way -- they examine text
that is preserved as part of the overall regular expression. All that
changes is /how/ the text is preserved.

So, to move ahead, please post a specific example of what you need. Post
an example of the original string and the desired replacement.

Well, I have a solution for the specific case. That's not what I'm
getting at though. I'm trying to find out if regexp allow me to do
something more general. I want to do this (sorry, I don't have a ruby to
hand):

class CodeFragment
  attr_accessor :code_fragment

  # Pattern for the variables of interest; callers never inspect it.
  def variables_regexp
    /\b[xyz]\b/
  end

  def utilised_variables
    code_fragment.scan(variables_regexp).uniq.sort
  end

  def output_substitution(substitutes)
    code_fragment.gsub(variables_regexp) do |v|
      # v is the whole matched name, e.g. "x"
      substitutes[v].to_s
    end
  end
end

cf = CodeFragment.new
cf.code_fragment = "sin(x+y)"
puts cf.output_substitution({'x'=>1, 'y'=>2})

should give "sin(1+2)"

What I want is for the thing that provides the regular expression to not
need to know about the function that is using it; and for the function
that uses the regular expression to not know about the expression
provided.

regular expression. It /is/ possible to take a first step by posting an
example of original text, and replacement text. Maybe we should try
that.

Thank you for your help here.

I'm not trying to solve a single problem though, I'm trying to
understand what kinds of problems I can solve.

I want something that acts as an abstract machine for finding things in
a string (in this case variables, but the rules could be more complex).
One should be able to use this machine without knowing what it finds, or
how it finds. All I should need to know is that it finds things. I'm
trying to understand if regexps are able to do this - to provide this
separation. Perhaps they don't, which is fine. I'd just like to know if
they do or not, or if they do a bit, how much.

Thanks,
Benjohn

Simon Strandgaard · Dec 13, 2006

On 12/13/06 said:
cf = CodeFragment.new
cf.code_fragment = "sin(x+y)"
puts cf.output_substitution({'x'=>1, 'y'=>2})

should give "sin(1+2)"

[snip]

prompt> cat a.rb
s = "sin(x+y)"
h = {
  'x' => '1',
  'y' => '2',
}
h.each do |pattern, replacement|
  r = Regexp.new('\b' + Regexp.escape(pattern) + '\b')
  s.gsub!(r) { replacement }
end
p s

prompt> ruby a.rb
"sin(1+2)"

Simon Strandgaard · Dec 13, 2006

On 12/13/06 said:
I want something that acts as an abstract machine for finding things in
a string (in this case variables, but the rules could be more complex).
One should be able to use this machine without knowing what it finds, or
how it finds. All I should need to know is that it finds things. I'm
trying to understand if regexps are able to do this - to provide this
separation. Perhaps they don't, which is fine. I'd just like to know if
they do or not, or if they do a bit, how much.

In a language like Ruby, it's not possible to distinguish between
a variable name and a method name by just looking at the name.
Regexp just looks at the name.

If you want to replace a variable-name then you need to
parse the code.
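
One way to get part of the way there, offered only as a sketch: Ruby 1.9 and later ship the Ripper library, which can tokenize Ruby source. This assumes the code fragment is itself valid Ruby, and the sample source line is made up for illustration. Tokenizing alone still does not tell a variable from a method call, but it does keep string contents from being mistaken for names.

```ruby
require 'ripper'  # standard library in Ruby 1.9 and later

source = 'puts "x" + x'

# A plain regexp also finds the x inside the string literal.
regexp_hits = source.scan(/\b[xy]\b/)

# Ripper.lex returns one entry per token: [[line, column], type, text, ...]
identifiers = Ripper.lex(source)
                    .select { |token| token[1] == :on_ident }
                    .map    { |token| token[2] }

p regexp_hits   # ["x", "x"]
p identifiers   # ["puts", "x"]
```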

David Vallner · Dec 13, 2006

The book I'm reading (O'Reilly pocket reference) hints at the
lookaround constructs being:

(?=...) - look ahead.
(?!...) - negated look ahead.

The following two aren't supported in the current Ruby regexp engine;
they are supported in the one that Ruby 1.9 and later will use.

(?<=...) - look behind.
(?<!...) - negated look behind.

So perhaps one of those is what you want?

Either way, it's possible to emulate positive lookbehinds by capturing
what would be the pre-match and putting it into the replacement:

string.sub(/(some lookbehind pattern)(what you're looking for)/) {
  $1 + replacement_of($2)
}

instead of:

string.sub(/(?<=some lookbehind pattern)what you're looking for/) {
  replacement_of($~.to_s)
}

and kludge negative lookbehinds by instead enumerating all the patterns
that would match in a positive one. Real lookbehinds just make the
pattern (sometimes much) more elegant in most cases.

David Vallner
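
Applied to the x/y example from the start of the thread, the emulation might look like the sketch below. The left context is captured and re-emitted, while the right context uses a real lookahead, which the post says the current engine supports. The replacement value "1" is a placeholder chosen for illustration.

```ruby
# Left context: captured and put back (stands in for a lookbehind).
# Right context: a negative lookahead, which consumes nothing.
pattern = /(^|[^a-zA-Z])[xy](?![a-zA-Z])/

in_call  = "exp(x)".gsub(pattern) { "#{$1}1" }
adjacent = "x+y".gsub(pattern) { "#{$1}1" }

puts in_call   # exp(1)
puts adjacent  # 1+1
```

Since the right-hand character is never consumed, it remains available as the left context of the next match, so adjacent variables are both replaced.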

Bertram Scharpf · Dec 13, 2006

Hi,

On Wednesday, 13 Dec 2006, 19:28:08 +0900, (e-mail address removed) wrote:

The book I'm reading (O'Reilly pocket reference) hints at the
lookaround constructs being:

(?=...) - look ahead.
(?!...) - negated look ahead.
(?<=...) - look behind.
(?<!...) - negated look behind.

The latter two don't work in Ruby as far as I know.

irb(main):001:0> "hello" =~ /(?<=e)ll/
SyntaxError: compile error
(irb):1: undefined (?...) sequence: /(?<=e)ll/
from (irb):1

Bertram

William James · Dec 14, 2006

Is there a regexp feature that lets me require something to be present
in the input string for the regexp to match, but for that to not become
captured as part of the match?

I want this so that I can scan and gsub on a string of code and replace
variables. Matching just variables requires looking at the context
around them, but if I capture this, I replace the context too.

Eg, to scan for variables called x or y, I might use:
/(^|[^a-zA-Z])[xy]([^a-zA-Z]|$)/

but using that on "exp(x)" will match (and replace) "(x)", which I don't
want at all.

Cheers,
Benjohn

class String
  def gsub_capture( regexp, replacement )
    offset = 0
    gsub( regexp ){|s|
      # position of group 1 relative to the start of the whole match
      offset = $~.offset(1)[0] - $~.offset(0)[0]
      s[ 0, offset ] +
        replacement + s[ offset + $1.size .. -1 ] }
  end
end

puts "1. Take two <2> cups and three <3> spoons".
  gsub_capture( / <(\d)> /, "x")
# prints: 1. Take two <x> cups and three <x> spoons

Benjohn Barnes · Dec 14, 2006

Is this what you mean? Can you extrapolate this way of approaching the
problem to solve your own?

I was able to. I had not understood that scan and gsub work
differently when capturing takes place. Scan seems to have more
sensible behaviour. I would like gsub's block or second parameter to
provide an array, and for this to replace the captured parts of the
regexp, so:

"axb".gsub(/(.)x(.)/, ['A', 'B'])

would return:

"AxB"

gsub doesn't behave like this, but I imagine it would be possible to
build a gsub-like function that did.

It would probably need to inspect the regular expression given to it
with a regular expression.

Thanks everyone,
Benjohn
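
The difference between scan and gsub mentioned in this post can be seen in a short sketch, using the pattern from the start of the thread with the variable also placed in a group:

```ruby
pattern = /(^|[^a-zA-Z])([xy])([^a-zA-Z]|$)/

# With groups in the pattern, scan yields one array of captures per match.
captures = "exp(x)".scan(pattern)

# gsub's block parameter is always the whole match; the captures are
# only available through $1, $2, ... or $~.
whole_matches = []
"exp(x)".gsub(pattern) { |m| whole_matches << m; m }

p captures       # [["(", "x", ")"]]
p whole_matches  # ["(x)"]
```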

Rob Biedenharn · Dec 15, 2006

Benjohn Barnes wrote:
/...

I would like gsub's block or second parameter to
provide an array, and for this to replace the captured parts of the
regexp, so:

"axb".gsub(/(.)x(.)/, ['A', 'B'])

would return:

"AxB"

gsub doesn't behave like this, but I imagine it would be possible to
build a gsub-like function that did.

result = "axb".gsub(/(.)(x)(.)/, "A\\2B" ) # gets what you want.

It would probably need to
inspect the regular expression given to it with a regular expression.

Not really. Each of sub(), gsub() and scan() has its niche. It is more a
matter of learning how to use them.

And, now that I think about it, your example using a provided array of
replacement values can be implemented like this:

rep = ['A', 'B']

result = "axb".gsub(/(.)(x)(.)/, "#{rep[0]}\\2#{rep[1]}")

And, I am sure, in many other ways.

If you're not interested in the other groupings you can use (?:...) to
group the regexp without capturing.

rep = ['A', 'B']
result = "axb".gsub(/(?:.)(x)(?:.)/, "#{rep[0]}\\1#{rep[1]}")

Of course, these RE's don't even need to be grouped at all:

rep = ['A', 'B']
result = "axb".gsub(/.(x)./, "#{rep[0]}\\1#{rep[1]}")

And (x) is just a match for 'x', so you don't have to use a group at
all.

In general, you could take your regexp, rewrite to capture the parts
between the desired replacements, and use a replacement (or a block)
similar to what Paul introduced to get the result you desire.

-Rob

Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)
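
The array-driven replacement wished for earlier in the thread can be sketched without inspecting the regexp itself, by using the group offsets that MatchData exposes. The helper name gsub_groups is hypothetical, not an existing Ruby method, and the sketch assumes the capture groups are not nested.

```ruby
# Hypothetical helper: replaces the text of capture group i with
# replacements[i - 1] and leaves the rest of each match untouched.
# Assumes capture groups are not nested inside one another.
def gsub_groups(string, regexp, replacements)
  string.gsub(regexp) do
    m     = Regexp.last_match
    whole = m[0].dup
    base  = m.begin(0)
    # Work from the last group to the first so that earlier
    # offsets stay valid while the string changes length.
    (m.size - 1).downto(1) do |i|
      next if m[i].nil? || replacements[i - 1].nil?
      whole[(m.begin(i) - base)...(m.end(i) - base)] = replacements[i - 1].to_s
    end
    whole
  end
end

puts gsub_groups("axb", /(.)x(.)/, ['A', 'B'])  # AxB
```

With this shape the caller supplies only the values for the captured parts, and the uncaptured context is carried over automatically.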
