Finding position of a RegExp subexpression

Csaba Gabor · Apr 21, 2006

I need to come up with a function
function regExpPos (text, re, parenNum) { ... }
that will return the position within text of RegExp.$parenNum if there
is a match, and -1 otherwise.

For example:
var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/
var text = "There were some nesting parens in the test";
alert (regExpPos (text, re, 3));

should show 17

Would anyone have one of these?
Csaba Gabor from Vienna

Randy Webb · Apr 21, 2006

Csaba Gabor said the following on 4/21/2006 1:23 PM:

I need to come up with a function
function regExpPos (text, re, parenNum) { ... }
that will return the position within text of RegExp.$parenNum if there
is a match, and -1 otherwise.

There is one already. indexOf

Never tried it with RegExp's though

Dr John Stockton · Apr 21, 2006

JRS: In article <[email protected]>, dated Fri, 21 Apr 2006 10:23:41 remote, seen in news:comp.lang.javascript, Csaba Gabor said:
I need to come up with a function
function regExpPos (text, re, parenNum) { ... }
that will return the position within text of RegExp.$parenNum if there
is a match, and -1 otherwise.

For example:
var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/
var text = "There were some nesting parens in the test";
alert (regExpPos (text, re, 3));

should show 17

If you can alter the RegExp by inserting extra parentheses so that
everything is matched, then you could sum the lengths of all
lower-numbered matches.
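
A minimal sketch of the length-summing idea, assuming the pattern has already been rewritten by hand so that groups 1 to n-1 are adjacent, un-nested, and cover everything matched before group n. The helper name groupPosBySum and the rewritten pattern are illustrative only and do not come from the thread.

```javascript
// Hypothetical helper: valid only when groups 1..n-1 are un-nested,
// adjacent, and account for every character matched before group n.
function groupPosBySum(text, re, n) {
  var m = re.exec(text);                 // re must not carry the g flag
  if (!m || m[n] === undefined) return -1;
  var pos = m.index;                     // start of the whole match
  for (var i = 1; i < n; i++) {
    pos += (m[i] || "").length;          // unmatched optional groups add 0
  }
  return pos;
}

// Hand-rewritten, fully parenthesised pattern (an assumption for this sketch)
var flatRe = /(some)(.*)(n)(est)/;
var flatText = "There were some questionable nesting parens in the test";
var flatPos = groupPosBySum(flatText, flatRe, 4); // 30
```

The sketch says nothing about how to rewrite an arbitrary pattern into this shape, which is the hard part of the problem.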

Or you could then, with .replace, substitute all lower matches to "",
and see by how much the length has changed.

But I don't know whether that would always work with sufficiently
complex RegExps.

You could .replace the parameter in question with an Unreasonable String
(it is, after all, Unicode) and then do indexOf(that US).

Note: if the original string is less than 2^16 characters long, there
must be at least one "16-bit" Unicode character that it does not
contain. So to find a one-character US, start searching for each
possible character in turn (starting with the least plausible) until you
find one that is not there.

Untested.
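
The search for a one-character Unreasonable String described in the note above could be sketched as follows. The helper name findUnusedChar is hypothetical, and starting at 0xFFFF is only one reading of "least plausible".

```javascript
// Hypothetical helper: return a single UTF-16 code unit that does not
// occur in text, scanning downward from 0xFFFF.
function findUnusedChar(text) {
  for (var code = 0xFFFF; code >= 0; code--) {
    var ch = String.fromCharCode(code);
    if (text.indexOf(ch) === -1) return ch;
  }
  return null; // only reachable if every 16-bit value occurs in text
}

var marker = findUnusedChar("There were some nesting parens in the test");
```

This covers only the choice of marker. Substituting it for one specific group still requires knowing where that group sits inside the match.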

Csaba Gabor · Apr 21, 2006

Randy said:
Csaba Gabor said the following on 4/21/2006 1:23 PM:

There is one already. indexOf
Never tried it with RegExp's though

The problem with
function regExpPos (text, re, parenNum) {
if (!text.match(re)) return -1;
return text.indexOf(RegExp['$'+parenNum], RegExp.leftContext.length)
}

is that RegExp['$'+parenNum] may not be unique within text (though it
is in the example that I gave). So if I change text to
var text = "There were some questionable nesting parens in the test";
regExpPos (text, re, 3) would return 18 instead of the correct 30.
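
The failure can be reproduced with a sketch equivalent to the function above. It reads the array returned by exec instead of the RegExp.$n properties, a substitution made only for this example, and naiveRegExpPos is an illustrative name.

```javascript
function naiveRegExpPos(text, re, parenNum) {
  var m = re.exec(text);
  if (!m || m[parenNum] === undefined) return -1;
  // Searches for the captured *value* starting at the match start,
  // so an earlier identical substring inside the match wins.
  return text.indexOf(m[parenNum], m.index);
}

var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/;
var unique = "There were some nesting parens in the test";
var repeated = "There were some questionable nesting parens in the test";

var posUnique = naiveRegExpPos(unique, re, 3);     // 17, which is correct
var posRepeated = naiveRegExpPos(repeated, re, 3); // 18, the "est" in "questionable"
```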

Csaba

By the way, thanks for that ear piercing demo in the other thread.

Using text.indexOf(RegExp['$'+parenNum], pos) will find the position of
a substring within the string, but the problem is that
RegExp['$'+parenNum] may not be unique within the string.

Randy Webb · Apr 21, 2006

Csaba Gabor said the following on 4/21/2006 2:48 PM:

Randy said:

Csaba Gabor said the following on 4/21/2006 1:23 PM:
There is one already. indexOf
Never tried it with RegExp's though

The problem with
function regExpPos (text, re, parenNum) {
if (!text.match(re)) return -1;
return text.indexOf(RegExp['$'+parenNum], RegExp.leftContext.length)
}

is that RegExp['$'+parenNum] may not be unique within text (though it
is in the example that I gave). So if I change text to
var text = "There were some questionable nesting parens in the test";
regExpPos (text, re, 3) would return 18 instead of the correct 30.

My knowledge of RegExps may not be good enough to understand them, so I
may be reading it wrong, but if you want the last match, then
lastIndexOf gives it. -1 if no match.

Csaba

By the way, thanks for that ear piercing demo in the other thread.

It does a better job than coffee at 5 am

Csaba Gabor · Apr 22, 2006

Dr John Stockton said:

If you can alter the RegExp by inserting extra parentheses so that
everything is matched, then you could sum the lengths of all lower
matches.

This is, in effect, what I have done; the code is provided below. However,
it is a nontrivial process that must account for nested parentheses
(...(...()...()...)...(...()...)...), back references (\#), and
non-capturing subexpressions (?:...).

Or you could then, with .replace, substitute all lower matches to "",
and see by how much the length has changed.

But I don't know whether that would always work with sufficiently
complex RegExps.

You could .replace the parameter in question with an Unreasonable String
(it is, after all, Unicode) and then do indexOf(that US).

I appreciate the brainstorming. Back references render the remaining
above ideas unworkable, as far as I can tell. Below is a function I
coded up which does the job. It works by introducing parens ending at
the start of the specified capturing parens [those are parens that
don't start with (?:] and stretching back to the start of the
containing capturing parens. Of course the containing paren's position
must be identified, too, so you get the idea this is recursive. The
complete listing of the function in all its gory glory follows (not
extensively tested).

Csaba Gabor from Vienna

function regExpPos (text, re, parenNum) {
// returns the starting position of the parenNum-th capturing parens
// of the RegExp, re, when matching text; -1 if not successful
if (!parenNum) { // terminating case
if (!text.match(re)) return -1;
return RegExp.leftContext.length; }
var i, j, aParen, src=re.source;
if (arguments.length<4) { // initial entry - this section determines
// opening and closing positions of all capturing parens
var code, chr;
aParen = [[0, src.length]];
var mode = 0; // 0 => normal, 1 => character []
for (i=0;i<src.length;++i) {
if ((chr=src.charAt(i))=="\\") { ++i; continue; }
if (mode) { if (chr=="]") mode = 0; continue; }
if (chr=="[") { mode = 1; continue; }
if (chr=="(" && src.substr(i+1,2)!="?:") aParen.push([i, -1]);
else if (chr==")")
for (j=aParen.length;j--;)
if (aParen[j][1]<0) { aParen[j][1]=i; break; }
}
if (parenNum>=aParen.length) {
if (!text.match(re)) return -1;
return (RegExp.leftContext.length + RegExp.lastMatch.length); }
} else aParen = arguments[3];

// step 1 - find the containing parens (cp, aCP)
var aTP = aParen[parenNum]; // parenNum's start, end position
for (var cP=parenNum;cP--;)
if (aParen[cP][1]>aTP[1]) break;
var res, aP2, aCP = aParen[cP]; // containing paren's start, end pos

// step 2 - avoid introducing extra level of parens
// for when cP to parenNum is completely filled with parens
for (i=parenNum, aP2=[i];--i>cP;)
if (aParen[aP2[aP2.length-1]][0]==aParen[i][1]+1)
aP2[aP2.length] = i;
if (aParen[aP2[aP2.length-1]][0]==aCP[0]+1) {
if (!text.match(re)) return -1;
for (res=0, i=aP2.length;--i;) res += RegExp['$'+aP2[i]].length;
return res + (!cP ? RegExp.leftContext.length :
regExpPos(text, re, cP, aParen)); }

// step 3 - insert parens from start of cP to start of parenNum
//alert (aParen.join("\n"));
src = src.slice(0,i=aCP[0]) + "(" +
src.slice(i,i=aTP[0]) + ")" + src.slice(i);

// step 4 - replace back references >= parenNum
for (i=0;i<src.length;++i) {
if ((chr=src.charAt(i))=="\\") {
if (!mode && (code=src.charCodeAt(i+1))<57 && (code>=48+(cP+1)))
src = src.slice(0,i+1) + String.fromCharCode(code+1) +
src.slice(i+2);
++i;
continue; }
if (mode) { if (chr=="]") mode = 0; continue; }
if (chr=="[") { mode = 1; continue; }
}

// step 5 - do the regular expression
var rex = /x/;
rex.compile(src);
if (!text.match(rex)) return -1;
return RegExp['$'+(cP+1)].length +
(!cP ? RegExp.leftContext.length :
regExpPos(text, re, cP, aParen));
}
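
To make the paren-insertion idea concrete, here is a hand-worked sketch for group 3 of the example pattern. The two rewritten patterns were produced by hand to mirror the description above; they are not output from the function, and the variable names are illustrative.

```javascript
var text = "There were some questionable nesting parens in the test";

// Original pattern: group 3 is (est), contained in group 2.
//   /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/

// Pass 1: group 2 sits at top level, so wrap everything before it.
// The wrapper becomes group 1; its length is the offset of old group 2
// from the start of the match.
var pass1 = /(some(thing|or other)?.*)(n(est)(?:ed)?.*(parens) )/;
var m1 = pass1.exec(text);
var posGroup2 = m1.index + m1[1].length;

// Pass 2: wrap from the start of old group 2 up to the start of old group 3.
// The wrapper is now group 3 and captures just "n".
var pass2 = /some(thing|or other)?.*((n)(est)(?:ed)?.*(parens) )/;
var m2 = pass2.exec(text);
var posGroup3 = posGroup2 + m2[3].length; // 30
```

Adding capturing parentheses does not change what the pattern matches, only the group numbering, which is why back references at or above the inserted group have to be renumbered.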

Lasse Reichstein Nielsen · Apr 22, 2006

Csaba Gabor said:
I need to come up with a function
function regExpPos (text, re, parenNum) { ... }
that will return the position within text of RegExp.$parenNum if there
is a match, and -1 otherwise.

I can't see an immediate way that works with all regexps and/or
texts. You only get the value of the group match, and that can be very
un-unique in the string, and even in the match. The only index you
ever get is the index of the entire match.

/L
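
For readers on newer engines: the ES2022 `d` (hasIndices) flag, which postdates this thread, exposes per-group offsets on the match result. A minimal sketch, assuming an engine that supports the flag; groupStart is a hypothetical name.

```javascript
// Assumes an engine that implements the ES2022 `d` (hasIndices) flag.
function groupStart(text, re, n) {
  var flags = re.flags.indexOf("d") === -1 ? re.flags + "d" : re.flags;
  var m = new RegExp(re.source, flags).exec(text);
  if (!m || !m.indices[n]) return -1; // no match, or group n did not participate
  return m.indices[n][0];             // indices[n] is [start, end) for group n
}

var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/;
var start = groupStart(
  "There were some questionable nesting parens in the test", re, 3); // 30
```

On engines without the flag, constructing the RegExp throws, so the observation above still holds there.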

Dr John Stockton · Apr 22, 2006

JRS: In article <[email protected]>, dated Fri, 21 Apr
2006 15:00:08 remote, seen in Randy Webb

Randy said:
Csaba Gabor said the following on 4/21/2006 2:48 PM:

Randy said:

Csaba Gabor said the following on 4/21/2006 1:23 PM:
I need to come up with a function
function regExpPos (text, re, parenNum) { ... }
that will return the position within text of RegExp.$parenNum if there
is a match, and -1 otherwise.
There is one already. indexOf
Never tried it with RegExp's though

The problem with
function regExpPos (text, re, parenNum) {
if (!text.match(re)) return -1;
return text.indexOf(RegExp['$'+parenNum], RegExp.leftContext.length)
}

is that RegExp['$'+parenNum] may not be unique within text (though it
is in the example that I gave). So if I change text to
var text = "There were some questionable nesting parens in the test";
regExpPos (text, re, 3) would return 18 instead of the correct 30.

My knowledge of RegExps may not be good enough to understand them, so I
may be reading it wrong, but if you want the last match, then
lastIndexOf gives it. -1 if no match.

ISTM that, if he had wanted that, he would have said so. After all, the
Viennese are good at English.

Testing such as

R = ("12j3456789").match(/(\d)(\d)(\d)(\d)/)
A = R['lastIndex']

suggests that A is indeed the index at which to start the next match,
and
A = R['lastIndex'] - R[R.length-1].length

is therefore the beginning of the last match.
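
The ECMAScript specification defines index and input on the array returned by match, while lastIndex is specified as a property of the RegExp object, so the array property used above appears to be an engine extension. A sketch of the same calculation using only the standard properties, valid only when the final group ends exactly where the whole match ends:

```javascript
var R = "12j3456789".match(/(\d)(\d)(\d)(\d)/);

// End of the whole match, computed from standard properties.
var matchEnd = R.index + R[0].length;                    // 7

// Start of the final group; correct only because that group ends the match.
var lastGroupStart = matchEnd - R[R.length - 1].length;  // 6
```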

So, Csaba, you just need a RegExp that edits RegExps to have only n
matches, and a question very similar to the original is already
answered.

It looks as if RegExp.leftContext.length *may* actually answer the
modified question, but IE4 appears not to have leftContext.

Small Flanagan asserts that IE4 has neither leftContext nor lastIndex.

<FAQENTRY> The FAQ needs a good link or two, and a supporting entry, for
RegExp.
