regex for replacing plain text within html string...

Tim_Mac

hi,
i have a tricky problem and my regex expertise has reached its limit.
i have read other posts on this newsgroup that pull out the plain text
from a html string, but that won't work for me because i want to
preserve the html, and replace some of the plain text.

i basically want to show the user's search terms highlighted in the
page, like google does, but i want to do this server side (i have the
mechanics of intercepting the html sorted out, by overriding the
Page.Render method). i can use a simple regex pattern like (keyword)
and replace with <span class='highlight'>$1</span> but this causes
problems because the keyword may appear in markup tags or attribute
values, which the above example will also replace, screwing up the html
structure.
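To make the failure concrete, here is a minimal sketch in Python (the sample markup and the variable names are made up for illustration; the thread itself is about .NET). It applies the plain (keyword) replacement to a link whose attributes also contain the keyword:

```python
import re

# Hypothetical sample: the keyword appears in two attribute values and once in the text.
html = '<a href="/keyword.html" title="keyword">more on keyword</a>'

# Naive approach: wrap every occurrence, wherever it appears.
naive = re.sub(r'(keyword)', r"<span class='highlight'>\1</span>", html)

print(naive)
```

All three occurrences get wrapped, so the href and title attributes end up with span tags inside them and the link is broken.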

what i want to express is: match the keyword, where it is not contained
inside a html tag, i.e. between a < and > character

my most obvious attempt is too simplistic and doesn't work:
[^<]*(keyword)[^>]*

i did come up with another regex which i am almost embarrassed to show
:)
it essentially matches the keyword inside the inner text of a html tag
set. but the problem is that it misses subsequent occurrences of the
keyword in the same match.

here is the pattern:
<(?<tag>\w+)([^>]*>[^<]*)(?<innerText>KeyWord)([^<]*</\k<tag>>)
and the replace: <$3$1<span class='highlight'>$4</span>$2
it actually works, but as i mentioned it does miss multiple occurrences
inside the same tag, and requires all the text to be within an open +
close html tag.

i would be really grateful if anyone had a suggestion
thanks
tim
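
One possible way to express "the keyword, but not inside a tag" is a negative lookahead: match the keyword only when the next angle bracket after it is not a closing >. The sketch below is in Python and the function name highlight is hypothetical. The (?!...) construct is also supported by the .NET regex engine. It assumes reasonably well-formed html with no bare > characters in the text or in attribute values, and it does not treat script, style or comment content specially.

```python
import re

def highlight(html: str, keyword: str) -> str:
    # re.escape makes characters such as '.' or '+' in the search term literal.
    # (?![^<>]*>) rejects a match if, scanning forward, a '>' is reached
    # before any other angle bracket, i.e. the keyword sits inside a tag.
    pattern = re.compile(re.escape(keyword) + r'(?![^<>]*>)', re.IGNORECASE)
    # A callback is used so the original casing of each match is preserved.
    return pattern.sub(
        lambda m: f"<span class='highlight'>{m.group(0)}</span>", html)

print(highlight('<p class="keyword">a keyword and another Keyword</p>', 'keyword'))
```

Because the replace runs over the whole string, every occurrence in the text is handled, not just the first one per tag, and the text does not need to sit inside an open + close tag pair.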
 
Guest

Your best bet with this type of replacement would be to first use a regex to
pull out the text between the html tags (i.e. between > and <), then do a
standard string replace on the keyword(s) within that text.
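
A rough sketch of how this suggestion could look, written in Python rather than .NET. The names TAG and highlight_by_splitting are made up for the example, and it assumes every tag is a simple <...> run with no > inside attribute values. The string is split on tags, the plain replace is applied only to the pieces between them, and everything is joined back together:

```python
import re

# The capturing group makes re.split keep the tags in the result list.
TAG = re.compile(r'(<[^>]*>)')

def highlight_by_splitting(html: str, keyword: str) -> str:
    pieces = TAG.split(html)
    # With one capturing group the list alternates text, tag, text, tag, ...
    # so even indexes hold the text between tags and odd indexes hold the tags.
    for i in range(0, len(pieces), 2):
        pieces[i] = pieces[i].replace(
            keyword, f"<span class='highlight'>{keyword}</span>")
    return ''.join(pieces)

print(highlight_by_splitting(
    '<a title="keyword">keyword here, keyword there</a>', 'keyword'))
```

The tags are never passed to the string replace, so attribute values are left alone, and joining the pieces restores the original page structure.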
 
Tim_Mac

hi tom. thanks for the reply.
yes but the problem i mentioned is that if you get text between > and <
characters, it could contain more tags inside it, so your
String.Replace method could still replace mark-up then.
by the way, how would you pull out the text by regex, then use
String.Replace and keep the structure of the page html all together? i
don't see how you would use the two approaches together...
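
One way the two approaches might be combined is a single regex replace with a callback (in .NET the counterpart would be Regex.Replace with a MatchEvaluator). The sketch below is Python, and the names TOKEN and highlight_with_callback are hypothetical. The pattern matches either a whole tag or a run of text with no angle brackets in it, so a text run can never contain mark-up. Nested tags simply show up as separate tag and text matches. It assumes no > characters inside attribute values.

```python
import re

# Either a whole tag (group 'tag') or a run of text without angle brackets (group 'text').
TOKEN = re.compile(r'(?P<tag><[^>]*>)|(?P<text>[^<>]+)')

def highlight_with_callback(html: str, keyword: str) -> str:
    kw = re.compile(re.escape(keyword), re.IGNORECASE)

    def replace_token(m: re.Match) -> str:
        if m.group('tag') is not None:
            return m.group('tag')  # mark-up is copied through unchanged
        # Only plain text reaches the keyword replace.
        return kw.sub(
            lambda k: f"<span class='highlight'>{k.group(0)}</span>",
            m.group('text'))

    return TOKEN.sub(replace_token, html)

print(highlight_with_callback(
    '<div id="keyword">outer keyword <b>inner Keyword</b> tail</div>', 'keyword'))
```

Since the replace walks the string in order and returns each tag as-is, the page html stays in one piece and only the text runs change.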
 
