Finding position of a RegExp subexpression


Csaba Gabor

I need to come up with a function
function regExpPos (text, re, parenNum) { ... }
that will return the position within text of RegExp.$parenNum if there
is a match, and -1 otherwise.

For example:
var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/
var text = "There were some nesting parens in the test";
alert (regExpPos (text, re, 3));

should show 17


Would anyone have one of these?
Csaba Gabor from Vienna
 

Randy Webb

Csaba Gabor said the following on 4/21/2006 1:23 PM:
I need to come up with a function
function regExpPos (text, re, parenNum) { ... }
that will return the position within text of RegExp.$parenNum if there
is a match, and -1 otherwise.

There is one already. indexOf :)
Never tried it with RegExp's though :)
 

Dr John Stockton

JRS: In article <[email protected]>
, dated Fri, 21 Apr 2006 10:23:41 remote, seen in
news:comp.lang.javascript said:
I need to come up with a function
function regExpPos (text, re, parenNum) { ... }
that will return the position within text of RegExp.$parenNum if there
is a match, and -1 otherwise.

For example:
var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/
var text = "There were some nesting parens in the test";
alert (regExpPos (text, re, 3));

should show 17

If you can alter the RegExp by inserting extra parentheses so that
everything is matched, then you could sum the lengths of all lower
matches.
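
A minimal sketch of the length-summing idea, assuming the pattern has already been rewritten by hand into a flat sequence of capturing groups with no nesting and nothing left uncaptured. The helper name posBySummingLengths and the flattened pattern are illustrative and not from the thread, and the sketch reads the match array's index property rather than RegExp.leftContext.

```javascript
// Hypothetical helper: valid only when `re` is a flat run of capturing
// groups that together cover the entire match (no nesting, no gaps).
function posBySummingLengths(text, re, groupNum) {
  var m = re.exec(text);
  if (!m) return -1;
  var pos = m.index;                       // start of the whole match
  for (var k = 1; k < groupNum; k++) {
    pos += (m[k] || "").length;            // unmatched groups count as empty
  }
  return pos;
}

// The example pattern, flattened by hand up to the group of interest.
var flatRe = /(some)(thing|or other)?(.*n)(est)/;
var text = "There were some nesting parens in the test";
var estPos = posBySummingLengths(text, flatRe, 4);
```

Whether an arbitrary pattern can always be flattened this way is exactly the open question; nested groups and back references are not handled here.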

Or you could then, with .replace, substitute all lower matches to "",
and see by how much the length has changed.

But I don't know whether that would always work with sufficiently
complex RegExps.

You could .replace the parameter in question with an Unreasonable String
(it is, after all, Unicode) and then do indexOf(that US).

Note: if the original string is less than 2^16 characters long, there
must be at least one "16-bit" Unicode character that it does not
contain. So to find a one-character US, search for each possible
character in turn (starting with the least plausible) until you
find one that is not there.
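
The search for a one-character Unreasonable String could look roughly like this. The name findUnusedChar is hypothetical, and starting from U+FFFF and counting down is only one guess at "least plausible first".

```javascript
// Hypothetical helper: returns a one-character string (one UTF-16 code
// unit) that does not occur in `text`, or "" if every unit is present.
function findUnusedChar(text) {
  for (var code = 0xFFFF; code >= 0; code--) {   // least plausible first
    var ch = String.fromCharCode(code);
    if (text.indexOf(ch) === -1) return ch;
  }
  return "";
}

var sample = "There were some nesting parens in the test";
var marker = findUnusedChar(sample);
```

This only supplies the marker; substituting it for the right parenthesised match still requires a RegExp in which everything before that match is captured.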

Untested.
 

Csaba Gabor

Randy said:
Csaba Gabor said the following on 4/21/2006 1:23 PM:

There is one already. indexOf :)
Never tried it with RegExp's though :)

The problem with
function regExpPos (text, re, parenNum) {
if (!text.match(re)) return -1;
return text.indexOf(RegExp['$'+parenNum], RegExp.leftContext.length)
}

is that RegExp['$'+parenNum] may not be unique within text (though it
is in the example that I gave). So if I change text to
var text = "There were some questionable nesting parens in the test";
regExpPos (text, re, 3) would return 18 instead of the correct 30.
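
The two cases can be reproduced with a small variant of the function above. This sketch swaps RegExp['$'+parenNum] and RegExp.leftContext for the array returned by exec and its index property; the name naiveRegExpPos is illustrative.

```javascript
// Naive version: look for the captured text starting at the match start.
function naiveRegExpPos(text, re, parenNum) {
  var m = re.exec(text);
  if (!m || m[parenNum] === undefined) return -1;
  return text.indexOf(m[parenNum], m.index);
}

var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/;
var unique = "There were some nesting parens in the test";
var repeated = "There were some questionable nesting parens in the test";

// Here the first "est" after "some" is the captured one.
var posUnique = naiveRegExpPos(unique, re, 3);
// Here indexOf stops at the "est" inside "questionable".
var posRepeated = naiveRegExpPos(repeated, re, 3);
// Where the captured "est" really starts: one past the "n" of "nesting".
var posWanted = repeated.indexOf("nesting") + 1;
```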

Csaba

By the way, thanks for that ear piercing demo in the other thread. :)



Using text.indexOf(RegExp['$'+parenNum], pos) will find the position
of the substring within text, but the problem is that
RegExp['$'+parenNum] may not be unique within text.
 

Randy Webb

Csaba Gabor said the following on 4/21/2006 2:48 PM:
Randy said:
Csaba Gabor said the following on 4/21/2006 1:23 PM:
There is one already. indexOf :)
Never tried it with RegExp's though :)

The problem with
function regExpPos (text, re, parenNum) {
if (!text.match(re)) return -1;
return text.indexOf(RegExp['$'+parenNum], RegExp.leftContext.length)
}

is that RegExp['$'+parenNum] may not be unique within text (though it
is in the example that I gave). So if I change text to
var text = "There were some questionable nesting parens in the test";
regExpPos (text, re, 3) would return 18 instead of the correct 30.

My knowledge of RegExps may not be good enough to understand them, so I
may be reading it wrong, but if you want the last match, then
lastIndexOf gives it. -1 if no match.
Csaba

By the way, thanks for that ear piercing demo in the other thread. :)

It does a better job than coffee at 5 am :)
 

Csaba Gabor

Dr said:
JRS: In article <[email protected]>
, dated Fri, 21 Apr 2006 10:23:41 remote, seen in


If you can alter the RegExp by inserting extra parentheses so that
everything is matched, then you could sum the lengths of all lower
matches.

This is, in effect, what I have done; code is provided below. However,
it is a nontrivial process that must account for nested parentheses
(...(...()...()...)...(...()...)...), back references (\#), and
non-capturing subexpressions (?:...).

Or you could then, with .replace, substitute all lower matches to "",
and see by how much the length has changed.

But I don't know whether that would always work with sufficiently
complex RegExps.

You could .replace the parameter in question with an Unreasonable String
(it is, after all, Unicode) and then do indexOf(that US).

I appreciate the brainstorming. Back references render the remaining
above ideas unworkable, as far as I can tell. Below is a function I
coded up which does the job. It works by introducing parens ending at
the start of the specified capturing parens [those are parens that
don't start with (?:] and stretching back to the start of the
containing capturing parens. Of course the containing paren's position
must be identified, too, so you get the idea this is recursive. The
complete listing of the function in all its gory glory follows (not
extensively tested).

Csaba Gabor from Vienna


function regExpPos (text, re, parenNum) {
  // returns the starting position of the parenNum-th capturing parens
  // of the RegExp, re, when matching text; -1 if not successful
  if (!parenNum) { // terminating case
    if (!text.match(re)) return -1;
    return RegExp.leftContext.length; }
  var i, j, aParen, src=re.source;
  if (arguments.length<4) { // initial entry - this section determines
    // opening and closing positions of all capturing parens
    var code, chr;
    aParen = [[0, src.length]];
    var mode = 0; // 0 => normal, 1 => character []
    for (i=0;i<src.length;++i) {
      if ((chr=src.charAt(i))=="\\") { ++i; continue; }
      if (mode) { if (chr=="]") mode = 0; continue; }
      if (chr=="[") { mode = 1; continue; }
      if (chr=="(" && src.substr(i+1,2)!="?:") aParen.push([i, -1]);
      else if (chr==")")
        for (j=aParen.length;j--;)
          if (aParen[j][1]<0) { aParen[j][1]=i; break; }
    }
    if (parenNum>=aParen.length) {
      if (!text.match(re)) return -1;
      return (RegExp.leftContext.length + RegExp.lastMatch.length); }
  } else aParen = arguments[3];

  // step 1 - find the containing parens (cp, aCP)
  var aTP = aParen[parenNum]; // parenNum's start, end position
  for (var cP=parenNum;cP--;) if (aParen[cP][1]>aTP[1]) break;
  var res, aP2, aCP = aParen[cP]; // containing paren's start, end pos

  // step 2 - avoid introducing extra level of parens
  // for when cP to parenNum is completely filled with parens
  for (i=parenNum, aP2=[i];--i>cP;)
    if (aParen[aP2[aP2.length-1]][0]==aParen[i][1]+1)
      aP2[aP2.length] = i;
  if (aParen[aP2[aP2.length-1]][0]==aCP[0]+1) {
    if (!text.match(re)) return -1;
    for (res=0, i=aP2.length;--i;) res += RegExp['$'+aP2[i]].length;
    return res + (!cP ? RegExp.leftContext.length :
      regExpPos(text, re, cP, aParen)); }

  // step 3 - insert parens from start of cP to start of parenNum
  //alert (aParen.join("\n"));
  src = src.slice(0,i=aCP[0]) + "(" +
    src.slice(i,i=aTP[0]) + ")" + src.slice(i);

  // step 4 - replace back references >= parenNum
  for (i=0;i<src.length;++i) {
    if ((chr=src.charAt(i))=="\\") {
      if (!mode && (code=src.charCodeAt(i+1))<57 && (code>=48+(cP+1)))
        src = src.slice(0,i+1) + String.fromCharCode(code+1) +
          src.slice(i+2);
      ++i;
      continue; }
    if (mode) { if (chr=="]") mode = 0; continue; }
    if (chr=="[") { mode = 1; continue; }
  }

  // step 5 - do the regular expression
  var rex = /x/;
  rex.compile(src);
  if (!text.match(rex)) return -1;
  return RegExp['$'+(cP+1)].length +
    (!cP ? RegExp.leftContext.length :
      regExpPos(text, re, cP, aParen));
}
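
For anyone experimenting with the first stage of the listing (locating the capturing parens in re.source), here is a stand-alone sketch of that scan. It is not the code above: the name findCapturingParens is hypothetical, it uses an explicit stack so that the ")" of a non-capturing group is paired with its own "(", and it assumes ES3-era syntax in which any "(?" opens a non-capturing group. As far as I can tell, the scan in the listing pairs such a ")" with the nearest unclosed capturing group instead, which may be worth checking.

```javascript
// Hypothetical helper: returns [[open, close], ...] for capture groups
// 1, 2, ... in the order their "(" appears in the pattern source.
function findCapturingParens(src) {
  var groups = [];       // groups[k] = [openIndex, closeIndex] for group k+1
  var stack = [];        // index into groups, or -1 for a non-capturing "("
  var inClass = false;   // true while inside a [...] character class
  for (var i = 0; i < src.length; i++) {
    var ch = src.charAt(i);
    if (ch === "\\") { i++; continue; }            // skip the escaped char
    if (inClass) { if (ch === "]") inClass = false; continue; }
    if (ch === "[") { inClass = true; continue; }
    if (ch === "(") {
      if (src.charAt(i + 1) === "?") stack.push(-1);   // (?: (?= (?!
      else { groups.push([i, -1]); stack.push(groups.length - 1); }
    } else if (ch === ")") {
      var k = stack.pop();
      if (k !== undefined && k >= 0) groups[k][1] = i;
    }
  }
  return groups;
}

var spans = findCapturingParens("a(b(c)(?:d))(e)");
```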
 

Lasse Reichstein Nielsen

Csaba Gabor said:
I need to come up with a function
function regExpPos (text, re, parenNum) { ... }
that will return the position within text of RegExp.$parenNum if there
is a match, and -1 otherwise.

I can't see an immediate way that works with all regexps and/or
texts. You only get the value of the group match, and that can be very
un-unique in the string, and even in the match. The only index you
ever get is the index of the entire match.
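
For readers on a current engine: ECMAScript 2022 added a "d" flag (match indices) that exposes the start and end of each group. A minimal sketch, assuming an engine that supports that flag; the name regExpPosWithIndices is illustrative and none of this was available when the thread was written.

```javascript
// Assumes ES2022 match indices (the "d" flag) are supported.
function regExpPosWithIndices(text, re, parenNum) {
  // Rebuild the pattern with "d" added, leaving the caller's RegExp alone.
  var flags = re.flags.indexOf("d") === -1 ? re.flags + "d" : re.flags;
  var m = new RegExp(re.source, flags).exec(text);
  // m.indices[n] is undefined when group n did not take part in the match.
  if (!m || !m.indices[parenNum]) return -1;
  return m.indices[parenNum][0];
}

var re = /some(thing|or other)?.*(n(est)(?:ed)?.*(parens) )/;
var text = "There were some questionable nesting parens in the test";
var pos = regExpPosWithIndices(text, re, 3);
```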

/L
 

Dr John Stockton

JRS: In article <[email protected]>, dated Fri, 21 Apr
2006 15:00:08 remote, seen in Randy Webb
Csaba Gabor said the following on 4/21/2006 2:48 PM:
Randy said:
Csaba Gabor said the following on 4/21/2006 1:23 PM:
I need to come up with a function
function regExpPos (text, re, parenNum) { ... }
that will return the position within text of RegExp.$parenNum if there
is a match, and -1 otherwise.
There is one already. indexOf :)
Never tried it with RegExp's though :)

The problem with
function regExpPos (text, re, parenNum) {
if (!text.match(re)) return -1;
return text.indexOf(RegExp['$'+parenNum], RegExp.leftContext.length)
}

is that RegExp['$'+parenNum] may not be unique within text (though it
is in the example that I gave). So if I change text to
var text = "There were some questionable nesting parens in the test";
regExpPos (text, re, 3) would return 18 instead of the correct 30.

My knowledge of RegExps may not be good enough to understand them, so I
may be reading it wrong, but if you want the last match, then
lastIndexOf gives it. -1 if no match.

ISTM that, if he had wanted that, he would have said so. After all, the
Viennese are good at English.


Testing such as

R = ("12j3456789").match(/(\d)(\d)(\d)(\d)/)
A = R['lastIndex']

suggests that A is indeed the index at which to start the next match,
and
A = R['lastIndex'] - R[R.length-1].length

is therefore the beginning of the last match.
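
A sketch of the same arithmetic using only standard properties: the match array's index, and lastIndex on the RegExp object itself, which exec updates when the g flag is set. A lastIndex property on the match array may be specific to the engine tested above, so it is not relied on here.

```javascript
var re = /(\d)(\d)(\d)(\d)/g;          // "g" so that exec updates re.lastIndex
var m = re.exec("12j3456789");

var matchStart = m.index;              // where the whole match begins
var matchEnd = re.lastIndex;           // where the next search would start
// Valid only because the last group ends exactly where the match ends.
var lastGroupStart = matchEnd - m[m.length - 1].length;
```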

So, Csaba, you just need a RegExp that edits RegExps to have only n
matches, and a question very similar to the original is already
answered.

It looks as if RegExp.leftContext.length *may* actually answer the
modified question but IE4 appears not to have leftContext.

Small Flanagan asserts that IE4 has neither leftContext nor lastIndex.


<FAQENTRY> The FAQ needs a good link or two, and a supporting entry, for
RegExp.
 
