Help with Regular Expression

sdragolov · Apr 6, 2008

Hi - I'm having trouble finding and replacing all occurrences. In
the string below I want to wrap tags around all occurrences of
"2008". The value occurs twice in 200802141200200802141300. However,
only the second occurrence is matched and replaced with the span tags.
I have specified the 'g' flag in my regular expression, as can be seen
below.

Thanks in advance:

Snippet of JavaScript code with regexp (value of strSearchString is
2008):
....
var strReplaceWith = "$1$2$3";
var re = new RegExp(">([A-Za-z0-9\.]*)(" + strSearchString + ")([A-Za-z0-9\.]*'?)<",'gi');
document.getElementById("temp123").innerHTML =
    strDivContents.replace(re,">" + strReplaceWith + "<");
....

String where match/replace should occur:

<img src="01.png" class="segmentactions" />  <span class="fullsegment" id="SEGM_DTM0">DTM+2:200802141200200802141300:719'</div>

Michael Wojcik · Apr 10, 2008

> Hi - I'm having trouble finding and replacing all occurrences. In
> the string below I want to wrap tags around all occurrences of
> "2008". The value occurs twice in 200802141200200802141300. However,
> only the second occurrence is matched and replaced with the span tags.

The problem is regular expression "greediness".

> I have specified the 'g' flag in my regular expression as can be seen
> below.

The "g" flag tells the matcher to continue trying to find matches in
the string after the first match has been found. (It will start from
the end of the first match, so it will not match overlapping
sequences.) The problem here is that there are no more matches after
the first, because the first match found too much.
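To see both points in isolation, here is a minimal sketch using toy strings of my own (they are not from the original post). It shows that "g" resumes after the end of the previous match, and that a greedy first match can leave nothing for "g" to find.

```javascript
// With "g", the search resumes where the previous match ended,
// so overlapping candidates are never reported.
const nonOverlapping = "aaa".replace(/aa/g, "X");
console.log(nonOverlapping); // "Xa" (only one "aa" is replaced)

// A greedy ".*" swallows everything up to the LAST "2008",
// so the whole string is consumed by a single match.
const greedyAll = "2008-2008".match(/.*2008/g);
console.log(greedyAll.length); // 1

// The lazy form stops at the first "2008", leaving a second match.
const lazyAll = "2008-2008".match(/.*?2008/g);
console.log(lazyAll.length); // 2
```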

Here's a shorter version of your example, which I just tested using
the Firefox Javascript Shell bookmarklet.[1] (You might want to give
that a try, by the way - it's a useful environment for testing
Javascript snippets in isolation, or in the context of a particular
webpage. Obviously various caveats apply: it uses the Firefox
implementation, etc.)

-----
var re =
new RegExp(">([A-Za-z0-9\.]*)(2008)([A-Za-z0-9\.]*'?)<",'gi');
var strReplaceWith="$1$2$3";
var strContents="200802141200200802141300<span";
print(strContents.replace(re,">" + strReplaceWith + "<"));
-----

("print" is a feature of Javascript Shell.) The output:

200802141200200802141300<span

In your RE, the first "[A-Za-z0-9\.]*" (the second term of the RE)
will match as many characters as it can, as long as the rest of the RE
can still be satisfied. That's called "greediness": the RE atom eats
as much input as it can.

In your input, the entire string of numbers matches that starred
character set. The only "inflexible" part of your RE, so to speak, is
the literal "2008" atom in the middle. And there are two opportunities
for the RE to match that, so it'll take the second one, as that lets
the first atom match as much as possible.
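You can check what each group captures by looking at the match array. The sketch below assumes an input where the digit run sits directly between ">" and "<" (my assumption, since I don't have the real markup), and writes the same RE as a literal rather than through the RegExp constructor.

```javascript
// Assumed input: the digits from the post, delimited by ">" and "<".
const greedyInput = ">200802141200200802141300<";

// Same pattern as in the post, as a regex literal and without "g",
// so that match() returns the capture groups.
const greedyRe = />([A-Za-z0-9.]*)(2008)([A-Za-z0-9.]*'?)</i;

const greedyGroups = greedyInput.match(greedyRe);
console.log(greedyGroups[1]); // "200802141200" (includes the first 2008)
console.log(greedyGroups[2]); // "2008"         (the second occurrence)
console.log(greedyGroups[3]); // "02141300"
```

The first group has swallowed the first "2008", which is why only the second occurrence ends up in $2.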

To fix that, you need to prevent the second term from matching the
string "2008". In strict RE syntax, that's a bit messy (as you'll see
if you draw the DFA and then convert it to a straightforward RE). But
the extended REs (influenced by egrep, Perl, and the like) that
Javascript provides offer various bits of syntactic sugar that
simplify it a bit.

We can tell that term to not be greedy, using the quantifier modifier
"?". The RE becomes:

RegExp(">([A-Za-z0-9\.]*?)(2008)([A-Za-z0-9\.]*'?)<",'gi')

and now the output is:

200802141200200802141300<span

Now it's only matching the *first* "2008". Why not the second? Because
the final term of your RE matches all the numbers *after* the 2008.
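Here is a sketch of the non-greedy version that makes the captures visible. The delimited input and the "<b>" wrapper are stand-ins of mine, not the markup from the original post.

```javascript
// Assumed input: the digits from the post, delimited by ">" and "<".
const lazyInput = ">200802141200200802141300<";

// First term made lazy with "*?"; the last term is still greedy.
const lazyRe = />([A-Za-z0-9.]*?)(2008)([A-Za-z0-9.]*'?)</gi;

// "<b>...</b>" is a hypothetical wrapper used only to show what $2 matched.
const lazyResult = lazyInput.replace(lazyRe, ">$1<b>$2</b>$3<");
console.log(lazyResult); // "><b>2008</b>02141200200802141300<"
```

The second "2008" is now inside $3, so the "g" flag still has nothing left to match.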

It might seem you could just get rid of that last term:

-----
var re =
new RegExp(">([A-Za-z0-9\.]*?)(2008)",'gi');
var strReplaceWith="$1$2";
var strContents="200802141200200802141300<span";
print(strContents.replace(re,">" + strReplaceWith + "<"));
-----

.... except that doesn't work, because there's no ">" before the second
"2008", so the RE doesn't find a second match. Also, we'd lose that
final bit where you allow for an optional single-quote character. You
haven't told us enough about your data to know how we can best address
this.
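Counting the matches shows the problem directly. As before, the ">"/"<" delimited input is my assumption about what the data looks like.

```javascript
// Assumed input: the digits from the post, delimited by ">" and "<".
const anchoredInput = ">200802141200200802141300<";

// Every match must begin with a literal ">".
const anchoredRe = />([A-Za-z0-9.]*?)(2008)/gi;

const anchoredMatches = anchoredInput.match(anchoredRe);
console.log(anchoredMatches);        // [ ">2008" ]
console.log(anchoredMatches.length); // 1: there is no ">" before the second 2008
```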

For example, the following seems to do what you say you want with your
sample input:

-----
var re =
new RegExp("(>?[A-Za-z0-9\.']*?)(2008)",'gi');
var strReplaceWith="$1$2";
var strContents="200802141200200802141300<span";
print(strContents.replace(re, strReplaceWith));
-----

which produces:

200802141200200802141300

(I moved the ">" and the "'" into the first term, and got rid of the
extraneous ">" and "<" you were tacking onto strReplaceWith in the
replace call.) But I don't know if that will work satisfactorily with
your real input, because I don't know what the constraints are.
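For what it's worth, here is that last RE run against a delimited input, with a visible wrapper around $2. Both the input and the "<b>" wrapper are hypothetical stand-ins, so treat this as a sketch rather than a drop-in fix.

```javascript
// Assumed input: the digits from the post, delimited by ">" and "<".
const finalInput = ">200802141200200802141300<";

// The ">" is optional and the class is lazy, so each match ends at
// the next "2008" and the "g" flag can find the following one.
const finalRe = /(>?[A-Za-z0-9.']*?)(2008)/gi;

// "<b>...</b>" is a hypothetical wrapper standing in for the real span tags.
const finalResult = finalInput.replace(finalRe, "$1<b>$2</b>");
console.log(finalResult);
// "><b>2008</b>02141200<b>2008</b>02141300<"
```

Note that this pattern no longer requires the text to sit between ">" and "<", so it would also wrap a "2008" inside an attribute value if your real markup has one.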

Also, note that since you use the "i" option, you don't need to
include both upper- and lowercase letters in your character classes.
And I don't know whether you're trying to match the backslash
character or to escape the "." in the character set; as written,
you're doing neither. Inside a JavaScript string literal, "\." is
simply ".", so the RegExp constructor never sees the backslash and the
class just matches a literal period. (The "." character does not need
to be escaped in a character set anyway.)
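These points are easy to check in a shell. The sketch below uses small test strings of my own; it also checks what the class built from a string literal actually matches, since a "\." in a JavaScript string literal is just ".".

```javascript
// With "i", a lowercase range already matches uppercase letters.
const upperMatches = /[a-z]/i.test("Q");
console.log(upperMatches); // true

// Inside a character class, "." is a literal period, not a wildcard.
const dotIsLiteral = /[.]/.test(".") && !/[.]/.test("x");
console.log(dotIsLiteral); // true

// In a string literal, "\." is the one-character string ".".
const sameString = "\." === ".";
console.log(sameString); // true

// So the class built via the RegExp constructor matches a period,
// and does not match a backslash.
const classFromString = new RegExp("[A-Za-z0-9\.]");
const matchesPeriod = classFromString.test(".");
const matchesBackslash = classFromString.test("\\");
console.log(matchesPeriod, matchesBackslash); // true false
```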

[1] http://www.squarefree.com/shell/
