Help with Regular Expression

sdragolov

Hi - I'm having trouble finding and replacing all occurrences. In
the string below I want to wrap <span> tags around all occurrences of
"2008". The value occurs twice in 200802141200200802141300. However,
only the second occurrence is matched and replaced with the span tags.
I have specified the 'g' flag in my regular expression, as can be seen
below.

Thanks in advance:

Snippet of JavaScript code with regexp (value of strSearchString is
2008):
....
var strReplaceWith = "$1<span class=found>$2</span>$3";
var re = new RegExp(">([A-Za-z0-9\.]*)(" + strSearchString +
    ")([A-Za-z0-9\.]*'?)<", 'gi');
document.getElementById("temp123").innerHTML =
    strDivContents.replace(re, ">" + strReplaceWith + "<");
....

String where match/replace should occur:

&nbsp; <img src="01.png" class="segmentactions" />&nbsp; <span
class="fullsegment" id="SEGM_DTM0"><span class="segment">DTM</span><span
class="ctrlCharacter">+</span>2<span
class="ctrlCharacter">:</span>200802141200200802141300<span
class="ctrlCharacter">:</span>719'</span></div>
 
Michael Wojcik

> Hi - I'm having trouble with finding and replacing all occurrences. In
> the string below I want to wrap <span> tags around all occurrences of
> "2008"; The value occurs twice in 200802141200200802141300. However,
> only the second occurrence is matched and replaced with the span tags.

The problem is regular expression "greediness".

> I have specified the 'g' flag in my regular expression as can be seen
> below.

The "g" flag tells the matcher to continue trying to find matches in
the string after the first match has been found. (It will start from
the end of the first match, so it will not match overlapping
sequences.) The problem here is that there are no more matches after
the first, because the first match found too much.
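
To see that concretely, here is a small sketch that counts the matches the original pattern finds. It assumes the character class as the RegExp constructor actually receives it, a shortened test string, and console.log for output; the variable names are mine, not from the original code.

```javascript
// Original pattern, written the way the RegExp constructor ends up seeing it.
const greedyRe = new RegExp(">([A-Za-z0-9.]*)(2008)([A-Za-z0-9.]*'?)<", "gi");
const sample = "</span>200802141200200802141300<span";

// With the "g" flag, String.prototype.match returns every full match.
const allMatches = sample.match(greedyRe);
console.log(allMatches.length); // 1
console.log(allMatches[0]);     // ">200802141200200802141300<"
```

The single match already covers everything from ">" to "<", so there is nothing left for a second match to find.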

Here's a shorter version of your example, which I just tested using
the Firefox Javascript Shell bookmarklet.[1] (You might want to give
that a try, by the way - it's a useful environment for testing
Javascript snippets in isolation, or in the context of a particular
webpage. Obviously various caveats apply: it uses the Firefox
implementation, etc.)

-----
var re =
new RegExp(">([A-Za-z0-9\.]*)(2008)([A-Za-z0-9\.]*'?)<",'gi');
var strReplaceWith="$1<span class=found>$2</span>$3";
var strContents="</span>200802141200200802141300<span";
print(strContents.replace(re,">" + strReplaceWith + "<"));
-----

("print" is a feature of Javascript Shell.) The output:

</span>200802141200<span class=found>2008</span>02141300<span

In your RE, the first "[A-Za-z0-9\.]*" (the second term of the RE)
will match as many characters as it can, as long as the rest of the RE
can still be satisfied. That's called "greediness": the RE atom eats
as much input as it can.

In your input, the entire string of numbers matches that starred
character set. The only "inflexible" part of your RE, so to speak, is
the literal "2008" atom in the middle. And there are two opportunities
for the RE to match that, so it'll take the second one, as that lets
the first atom match as much as possible.
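
A quick way to watch the greedy term at work is to look at the capture groups directly. This is only a sketch: it uses exec without the "g" flag, console.log for output, and variable names of my own choosing.

```javascript
// Same pattern as above, without "g": we only want to inspect one match.
const greedy = new RegExp(">([A-Za-z0-9.]*)(2008)([A-Za-z0-9.]*'?)<", "i");
const groups = greedy.exec("</span>200802141200200802141300<span");

console.log(groups[1]); // "200802141200"  (the greedy term swallowed the first 2008)
console.log(groups[2]); // "2008"          (the literal, matched at its last occurrence)
console.log(groups[3]); // "02141300"
```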

To fix that, you need to prevent the second term from matching the
string "2008". In strict RE syntax, that's a bit messy (as you'll see
if you draw the DFA and then convert it to a straightforward RE). But
the extended REs (influenced by egrep, Perl, and the like) that
Javascript provides offer various bits of syntactic sugar that
simplify it a bit.

We can tell that term to not be greedy, using the quantifier modifier
"?". The RE becomes:

RegExp(">([A-Za-z0-9\.]*?)(2008)([A-Za-z0-9\.]*'?)<",'gi')

and now the output is:

</span><span class=found>2008</span>02141200200802141300<span

Now it's only matching the *first* "2008". Why not the second? Because
the final term of your RE matches all the numbers *after* the 2008.
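
The same kind of inspection shows where the second "2008" went with the non-greedy version. Again a sketch with my own variable names, using exec with the "g" flag so that lastIndex shows where the next search would start.

```javascript
const lazy = new RegExp(">([A-Za-z0-9.]*?)(2008)([A-Za-z0-9.]*'?)<", "gi");
const text = "</span>200802141200200802141300<span";

const first = lazy.exec(text);
console.log(first[1]);       // ""  (the lazy term matched nothing)
console.log(first[2]);       // "2008"
console.log(first[3]);       // "02141200200802141300"  (contains the second 2008)
console.log(lazy.lastIndex); // 32  (just past the "<", so only "span" remains)

const second = lazy.exec(text);
console.log(second);         // null
```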

It might seem you could just get rid of that last term:

-----
var re =
new RegExp(">([A-Za-z0-9\.]*?)(2008)",'gi');
var strReplaceWith="$1<span class=found>$2</span>";
var strContents="</span>200802141200200802141300<span";
print(strContents.replace(re,">" + strReplaceWith + "<"));
-----

... except that doesn't work, because there's no ">" before the second
"2008", so the RE doesn't find a second match. (The replace call also
still appends a "<" after the match, where it no longer belongs.)
Also, we'd lose that final bit where you allow for an optional
single-quote character. You haven't told us enough about your data to
know how we can best address this.

For example, the following seems to do what you say you want with your
sample input:

-----
var re =
new RegExp("(>?[A-Za-z0-9\.']*?)(2008)",'gi');
var strReplaceWith="$1<span class=found>$2</span>";
var strContents="</span>200802141200200802141300<span";
print(strContents.replace(re, strReplaceWith));
-----

which produces:

</span><span class=found>2008</span>02141200<span
class=found>2008</span>02141300<span

(Note I pulled the ">" into the first term, and got rid of the
extraneous ">" and "<" you were tacking onto strReplaceWith in the
replace call.) But I don't know if that will work satisfactorily with
your real input, because I don't know what the constraints are.
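
If the real constraint is "only touch text that sits between a ">" and the next "<"", one possible alternative is to do it in two passes: find each run of text between tags, then replace every occurrence of the search string inside that run. This is a sketch under that assumption, not a tested solution for your data. The helper names escapeRegExp and highlightInText are hypothetical, and it will not cope with a ">" inside an attribute value or with text before the first tag.

```javascript
// Hypothetical helper: escape characters that are special in a RegExp,
// so the search string is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap every occurrence of `term` found in the text
// between a ">" and the next "<"; tag names and attributes are left alone.
function highlightInText(html, term) {
  const termRe = new RegExp(escapeRegExp(term), "gi");
  return html.replace(/>([^<>]+)</g, function (whole, textRun) {
    // "$&" re-inserts the matched text, preserving its original case.
    return ">" + textRun.replace(termRe, "<span class=found>$&</span>") + "<";
  });
}

const highlighted = highlightInText("</span>200802141200200802141300<span", "2008");
console.log(highlighted);
```

On the short test string this gives the same result as the RE above, and a "2008" inside an attribute value is not wrapped.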

Also, note that since you use the "i" option, you don't need to
include both upper- and lowercase letters in your character classes.
And I don't know whether you meant to match against the backslash
character, or to escape a "." character in the character set. As
written you get a plain ".": in a JavaScript string literal "\." is
just ".", so the backslash never reaches the RE and the class matches
a period, not a backslash. (The "." character does not need to be
escaped in a character set anyway.)
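
Both points are easy to check in isolation. A minimal sketch, with variable names of my own, that inspects what the RegExp constructor receives when the pattern comes from a string literal:

```javascript
// "\." inside a string literal is just "."; check what the RegExp sees.
const fromString = new RegExp("[A-Za-z0-9\.]");
console.log(fromString.source);     // "[A-Za-z0-9.]"
console.log(fromString.test("."));  // true
console.log(fromString.test("\\")); // false (a lone backslash does not match)

// With the "i" flag, one case in the class is enough.
const caseInsensitive = new RegExp("[a-z0-9.]", "i");
console.log(caseInsensitive.test("Q")); // true
```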


[1] http://www.squarefree.com/shell/
 
