Strange result with Regexp

howa

E.g.

var s = "12345d";
document.write("s="+s+ ", " + s.replace(/[0-9]*/g,'x'));


It shows:

s=12345d, xxdx

while I expect

xd

Any suggestions?

Thanks.
 
Lasse Reichstein Nielsen

howa said:
var s = "12345d";
document.write("s="+s+ ", " + s.replace(/[0-9]*/g,'x'));
It shows:

s=12345d, xxdx

I would have expected xdx, but your result is equally valid.
The regular expression /[0-9]*/ matches *zero* or more digits.
Change it to /[0-9]+/.
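
For comparison, here is a small sketch of both quantifiers applied to the original string. The variable names are only illustrative, and console.log stands in for document.write so it can be run outside a browser.

```javascript
var s = "12345d";

// '*' also matches the empty string, so the empty matches are replaced too.
var withStar = s.replace(/[0-9]*/g, "x");

// '+' needs at least one digit, so only the run "12345" is replaced.
var withPlus = s.replace(/[0-9]+/g, "x");

console.log(withStar); // xxdx
console.log(withPlus); // xd
```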

/L
 
pr

howa said:
E.g.

var s = "12345d";
document.write("s="+s+ ", " + s.replace(/[0-9]*/g,'x'));


It shows:

s=12345d, xxdx

while I expect

xd

Any suggestions?

As Lasse says, '*' matches zero or more. In theory, globally replacing a
zero-length string should be an infinite task. In practice
(fortunately), the regular expression engine avoids consecutive
zero-length matches. Therefore you have one 5-digit match and two
0-digit matches, one each side of the 'd'.

These examples look even odder:

"d".replace(/[0-9]*/g, "x") // xdx
"dddd".replace(/[0-9]*/g, "x") // xdxdxdxdx

To preserve your sanity :) try to consider '*' as a last resort. And
only use that 'g' flag if you mean it.
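
One way to see where those matches fall is to pass a function as the replacement and record what it is called with. This is only a sketch: listMatches is a made-up helper name, and console.log is assumed to be available.

```javascript
// Hypothetical helper: collect every match a global replace sees, with its offset.
function listMatches(str, re) {
  var found = [];
  str.replace(re, function (match, offset) {
    found.push(offset + ":'" + match + "'");
    return match; // hand back the matched text; only the recording matters here
  });
  return found;
}

console.log(listMatches("12345d", /[0-9]*/g).join(", ")); // 0:'12345', 5:'', 6:''
console.log(listMatches("dddd", /[0-9]*/g).length);       // 5
```

The five empty matches in "dddd" sit at offsets 0 to 4, which is why the result has five x's around four d's.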
 
Lee

howa said:
E.g.

var s = "12345d";
document.write("s="+s+ ", " + s.replace(/[0-9]*/g,'x'));


It shows:

s=12345d, xxdx

while I expect

xd

Any suggestions?

You don't really want to be specifying "zero or more",
or even "one or more". Simply replace *each individual*
digit with an "x", allowing the "g" flag to do the work:

replace(/[0-9]/g,"x")
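
A short sketch of what this per-digit form produces (the sample strings and variable names are only illustrative). Note that it gives one "x" per digit, which is not the same as the "xd" asked for in the original post.

```javascript
// Each digit is its own match, so each one becomes an "x".
var masked = "12345d".replace(/[0-9]/g, "x");
var mixed = "a1b22".replace(/[0-9]/g, "x");

console.log(masked); // xxxxxd
console.log(mixed);  // axbxx
```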


--
 
Alexey Kulentsov

howa said:
E.g.

var s = "12345d";
document.write("s="+s+ ", " + s.replace(/[0-9]*/g,'x'));


It shows:

s=12345d, xxdx

while I expect

xd

Any suggestions?

Remove the 'g' modifier from the regexp and you will get your xd.
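
A small sketch of the non-global form, with a second input added only for illustration. Without 'g' only the first match is replaced, and when the string does not start with a digit that first match is the empty string at offset 0.

```javascript
var re = /[0-9]*/; // no 'g' flag: only the first match is replaced

var digitsFirst = "12345d".replace(re, "x");
var letterFirst = "d12345".replace(re, "x");

console.log(digitsFirst); // xd
console.log(letterFirst); // xd12345 (the empty match at offset 0 was replaced)
```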
 
Thomas 'PointedEars' Lahn

pr said:
howa said:
var s = "12345d";
document.write("s="+s+ ", " + s.replace(/[0-9]*/g,'x'));

It shows:

s=12345d, xxdx

while I expect

xd

[...] In theory, globally replacing a zero-length string should be an
infinite task. In practice (fortunately), the regular expression engine
avoids consecutive zero-length matches. Therefore you have one 5-digit
match and two 0-digit matches, one each side of the 'd'.

Not at all. In theory, there is an ε (epsilon) production; please read
about Regular Grammars:

http://en.wikipedia.org/wiki/Formal_grammar#Regular_grammars

In practice, Regular Expressions match *non-overlapping* occurrences of the
pattern in the string which means that even with global matching no position
is visited twice by the matcher; please read ECMA-262 Ed. 3 Final, section
15.5.4.11:

http://www.ecmascript.org/docs.php

Here is what happens, in a nutshell (I used `^' to indicate the next
possible match, and `ε' for the empty word/string to be matched):

0. Input string: "12345d"
Regular Expression: /[0-9]*/g --> lastIndex=0
Replacement string: "x"

1. Find matches for the Regular Expression.

position 0 1 2 3 4 5
ε1ε2ε3ε4ε5εdε
^ ^ ^ ^ ^
(/[0-9]*/, lastIndex=0) --> ("12345", index=0, lastIndex=5)

Greedy matching, so the longest match wins.
The global flag is set, continue.

2. Find more matches for the Regular Expression.

position 0 1 2 3 4 5
ε1ε2ε3ε4ε5εdε
^
(/[0-9]*/, lastIndex=5) --> (ε, index=5, lastIndex=5)

The longest and only possible match that remains is the empty string;
next possible match after position 4.
The global flag is set, continue.

3. Find more matches for the Regular Expression.

position 0 1 2 3 4 5 6
ε1ε2ε3ε4ε5εdε
^
(/[0-9]*/, lastIndex=6) --> (ε, index=6, lastIndex=6)

The longest and only possible match that remains is the empty string;
next possible match after position 5.
The global flag is set, continue.

4. Find more matches for the Regular Expression.

position 0 1 2 3 4 5 6
ε1ε2ε3ε4ε5εdε
^
(/[0-9]*/, lastIndex=7) --> (null, lastIndex=0)

End of string, no further matches possible.

5. Found matches:

("12345", index=0, lastIndex=5),
(ε, index=5, lastIndex=5),
(ε, index=6, lastIndex=6)

6. Replace all matches with the replacement string each.

position 0 1 2 3 4 5 6
ε1ε2ε3ε4ε5εdε

Result: x xdx

7. Result: "xxdx"

You can confirm this by evaluating the return value of
"12345d".match(/[0-9]*/g) -- as defined in the Specification -- which is
["12345", "", ""], where the "" matches can be understood as those
literally matching ε, the empty word/string.
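
The steps above can be sketched as an explicit exec() loop. This is an illustration only, not how any particular engine implements replace: replaceAllManually is a hypothetical helper, and it assumes a regular expression with the 'g' flag and a plain replacement string (no $1-style patterns).

```javascript
// Hypothetical helper: emulate a global replace with an explicit exec() loop.
function replaceAllManually(str, re, replacement) {
  if (!re.global) {
    throw new Error("this sketch needs a regular expression with the 'g' flag");
  }
  var out = "";
  var last = 0; // end of the previous match within str
  var m;
  re.lastIndex = 0;
  while ((m = re.exec(str)) !== null) {
    // Copy the unmatched text since the previous match, then the replacement.
    out += str.slice(last, m.index) + replacement;
    last = m.index + m[0].length;
    if (m[0].length === 0) {
      re.lastIndex++; // step past an empty match so the loop can terminate
    }
  }
  return out + str.slice(last);
}

console.log(replaceAllManually("12345d", /[0-9]*/g, "x")); // xxdx
console.log(replaceAllManually("dddd", /[0-9]*/g, "x"));   // xdxdxdxdx
```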


HTH

PointedEars
 
pr

Thomas said:
pr said:
[...] In theory, globally replacing a zero-length string should be an
infinite task. In practice (fortunately), the regular expression engine
avoids consecutive zero-length matches. Therefore you have one 5-digit
match and two 0-digit matches, one each side of the 'd'.

Not at all. In theory, there is an ε (epsilon) production; please read
about Regular Grammars:

http://en.wikipedia.org/wiki/Formal_grammar#Regular_grammars

I didn't know about those.

In practice, Regular Expressions match *non-overlapping* occurrences of the
pattern in the string which means that even with global matching no position
is visited twice by the matcher; please read ECMA-262 Ed. 3 Final, section
15.5.4.11:

Are you going to tell me that zero-length strings can overlap? Is that
another mathematics thing?

15.5.4.10:

| If regexp.global is true: Set the regexp.lastIndex property to 0 and
| invoke RegExp.prototype.exec repeatedly until there is no match. If
| there is a match with an empty string (in other words, if the value
| of regexp.lastIndex is left unchanged), increment regexp.lastIndex
| by 1.

and 15.10.2.5

| Step 1 of the RepeatMatcher's closure d states that, once the
| minimum number of repetitions has been satisfied, any more
| expansions of Atom that match the empty string are not considered
| for further repetitions. This prevents the regular expression engine
| from falling into an infinite loop on patterns such
| as:
|
| /(a*)*/.exec("b")

Here is what happens, in a nutshell (I used `^' to indicate the next
possible match, and `ε' for the empty word/string to be matched):
[...]

Your explanation is more detailed but I don't think it says anything
mine didn't. Seems one of us misread.

You can confirm this when evaluating the return value of
"12345d".match(/[0-9]*/g) -- as defined in the Specification -- which is
["12345", "", ""] whereas the matches "" can be understood as those
literally matching ε, the empty word/string.

Exactly; 'one 5-digit match and two 0-digit matches', since the
expression matched zero or more digits. Or, to put it another way:

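(function () {
  var s = "12345d";
  var re = /[0-9]*/g, results;
  // Show each match as 'text' | index | lastIndex; Cancel stops the loop.
  while ((results = re.exec(s)) &&
         confirm(["'" + results[0] + "'", results.index,
                  re.lastIndex].join(" | ") + "\n")) {
    if (results[0].length == 0) {
      // Step past an empty match, otherwise exec() would find it again.
      re.lastIndex++;
    }
  }
})();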
 
Thomas 'PointedEars' Lahn

pr said:
Thomas said:
pr said:
[...] In theory, globally replacing a zero-length string should be an
infinite task. In practice (fortunately), the regular expression engine
avoids consecutive zero-length matches. Therefore you have one 5-digit
match and two 0-digit matches, one each side of the 'd'.
Not at all. [...]
In practice, Regular Expressions match *non-overlapping* occurrences of the
pattern in the string which means that even with global matching no position
is visited twice by the matcher; please read ECMA-262 Ed. 3 Final, section
15.5.4.11:

Are you going to tell me that zero-length strings can overlap? Is that
another mathematics thing?

I was talking about patterns in the string, not about strings. IOW,

(ab|abc)

matches only "ab" in "abcd", not also "abc", because these two patterns in
the string overlap. This is accomplished quite simply by continuing to match
at the endIndex of the previous match, and not at its index. Which is the
reason why one observes the result of "xxdx".
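
A quick check of both points, as a sketch with sample inputs chosen only for illustration. The first call shows which alternative wins (alternatives are tried left to right, so "ab" is found before "abc" is tried); the second shows a global search resuming at the end of each match, so matches never overlap.

```javascript
var first = /(ab|abc)/.exec("abcd");
console.log(first[0], first.index); // ab 0

// "aaaa" contains "aa" at offsets 0, 1 and 2, but a global search
// resumes after each match, so only offsets 0 and 2 are reported.
var pairs = "aaaa".match(/aa/g);
console.log(pairs.length); // 2
```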
15.5.4.10:

| If regexp.global is true: Set the regexp.lastIndex property to 0 and
| invoke RegExp.prototype.exec repeatedly until there is no match. If
| there is a match with an empty string (in other words, if the value
| of regexp.lastIndex is left unchanged), increment regexp.lastIndex
| by 1.

and 15.10.2.5

| Step 1 of the RepeatMatcher's closure d states that, once the
| minimum number of repetitions has been satisfied, any more
| expansions of Atom that match the empty string are not considered
| for further repetitions. This prevents the regular expression engine
| from falling into an infinite loop on patterns such
| as:
|
| /(a*)*/.exec("b")

What you said is quite different from that. It has nothing to do with
"consecutive zero-length matches". As I have shown, there are consecutive
zero-length matches that are considered.

In plain English, the above paragraph merely says that once the matcher has
tried to match the empty word (length=0), it stops and continues at the
position of the next occurrence of the pattern in the string, as I have shown.
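
The pattern quoted from 15.10.2.5 can be tried directly. A minimal sketch: the call returns instead of looping forever, and the overall match is the empty string at offset 0.

```javascript
var m = /(a*)*/.exec("b");

console.log(m !== null);  // true: the call terminates with a match
console.log(m[0] === ""); // true: the overall match is the empty string
console.log(m.index);     // 0
```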
Here is what happens, in a nutshell (I used `^' to indicate the next
possible match, and `ε' for the empty word/string to be matched):
[...]

Your explanation is more detailed but I don't think it says anything
mine didn't.

Yes, it does.

Seems one of us misread.

Yes, you did.


PointedEars
 
