regex for replacing plain text within html string...

Tim_Mac · Jan 20, 2006

hi,
i have a tricky problem and my regex expertise has reached its limit.
i have read other posts on this newsgroup that pull out the plain text
from a html string, but that won't work for me because i want to
preserve the html, and replace some of the plain text.

i basically want to show the user's search terms highlighted in the
page, like google does, but i want to do this server side (i have the
mechanics of intercepting the html sorted out, by overriding the
Page.Render method). i can use a simple regex pattern like (keyword)
and replace with <span class='highlight'>$1</span> but this causes
problems because the keyword may appear in markup tags or attribute
values, which the above example will also replace, screwing up the html
structure.

what i want to express is: match the keyword, where it is not contained
inside a html tag, i.e. between a < and > character

my most obvious attempt is too simplistic and doesn't work:
[^<]*(keyword)[^>]*

i did come up with another regex which i am almost embarrassed to show.

it essentially matches the keyword inside the inner text of a html tag
set. but the problem is that it misses subsequent occurrences of the
keyword in the same match.

here is the pattern:
<(?<tag>\w+)([^>]*>[^<]*)(?<innerText>KeyWord)([^<]*</\k<tag>>)
and the replace: <$3$1<span class='highlight'>$4</span>$2
it actually works, but as i mentioned it does miss multiple occurrences
inside the same tag, and requires all the text to be within an open +
close html tag.

i would be really grateful if anyone had a suggestion
thanks
tim
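
One way to express "match the keyword where it is not between a < and > character" is a negative lookahead that rejects any match whose next angle bracket is a closing one. The sketch below is in Python for brevity rather than the .NET the post is working in. The function name `highlight_lookahead` is made up for illustration. It assumes reasonably clean markup: a stray > in body text or inside an attribute value, or keywords inside script/style blocks, would confuse it.

```python
import re


def highlight_lookahead(html: str, keyword: str) -> str:
    """Wrap the keyword in a highlight span, skipping matches inside tags."""
    # (?![^<>]*>) fails when the next angle bracket after the keyword is '>',
    # which means the keyword sits inside a tag such as <a title="keyword">.
    pattern = re.compile(
        "(" + re.escape(keyword) + ")(?![^<>]*>)",
        re.IGNORECASE,
    )
    return pattern.sub(r"<span class='highlight'>\1</span>", html)


if __name__ == "__main__":
    sample = '<a href="keyword.html" title="keyword">keyword and keyword</a>'
    print(highlight_lookahead(sample, "keyword"))
```

Unlike the tag-pair pattern above, this handles every occurrence in one pass and does not need the text to sit inside a matching open and close tag.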

Guest · Jan 20, 2006

Your best bet with this type of replacement would be to first use a regex to
match the text between the html tags (i.e. between > and <), then do a standard
string replace on the keyword(s) within that text only.

Tim_Mac · Jan 21, 2006

hi tom. thanks for the reply.
yes but the problem i mentioned is that if you get text between > and <
characters, it could contain more tags inside it, so your
String.Replace method could still replace mark-up then.
by the way, how would you pull out the text by regex, then use
String.Replace and keep the structure of the page html all together? i
don't see how you would use the two approaches together...
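
A possible way to combine the two approaches is to split the html on its tags, keep the tags as separate pieces, run the replacement only on the text pieces, and join everything back together in the original order. This is a minimal sketch in Python, not the .NET code the thread is about. The names `TAG` and `highlight_split` are invented for the example, and the tag pattern is deliberately naive (it assumes no > inside attribute values and does not treat script, style or comments specially).

```python
import re

# The capturing group makes split() return the tags as well as the text.
TAG = re.compile(r"(<[^>]*>)")


def highlight_split(html: str, keyword: str) -> str:
    """Highlight the keyword in text pieces only, leaving tags untouched."""
    word = re.compile(re.escape(keyword), re.IGNORECASE)
    pieces = TAG.split(html)
    # With one capturing group the result alternates text, tag, text, tag, ...
    # so even indexes hold text and odd indexes hold tags.
    for i in range(0, len(pieces), 2):
        pieces[i] = word.sub(
            lambda m: "<span class='highlight'>" + m.group(0) + "</span>",
            pieces[i],
        )
    return "".join(pieces)


if __name__ == "__main__":
    sample = '<p class="keyword">a keyword<b>keyword</b></p>'
    print(highlight_split(sample, "keyword"))
```

Because each text piece ends at the next <, it cannot contain a nested tag, so the replacement never touches markup, and joining the pieces preserves the page structure.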
