regex for replacing plain text within html string...

Discussion in 'ASP .Net' started by Tim_Mac, Jan 20, 2006.

  1. Tim_Mac

    Tim_Mac Guest

    hi,
    i have a tricky problem and my regex expertise has reached its limit.
    i have read other posts on this newsgroup that pull out the plain text
    from a html string, but that won't work for me because i want to
    preserve the html, and replace some of the plain text.

    i basically want to show the user's search terms highlighted in the
    page, like google does, but i want to do this server side (i have the
    mechanics of intercepting the html sorted out, by overriding the
    Page.Render method). i can use a simple regex pattern like (keyword)
    and replace with <span class='highlight'>$1</span> but this causes
    problems because the keyword may appear in markup tags or attribute
    values, which the above example will also replace, screwing up the html
    structure.
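
A minimal sketch of that failure, written in Python purely for illustration (the thread targets .NET; the sample markup and variable names are assumptions, not taken from the original page):

```python
import re

# Hypothetical sample markup: the keyword appears in two attribute values
# and once in the visible text.
html = '<a href="/search?q=keyword" title="keyword help">keyword tips</a>'

# Naive approach: wrap every occurrence, wherever it appears.
naive = re.sub(r"(keyword)", r"<span class='highlight'>\1</span>", html)

# All three occurrences get wrapped, so the href and title values now
# contain a <span> and the anchor tag is no longer valid markup.
print(naive)
```

Only the last of the three occurrences is visible text; the other two replacements land inside the tag itself.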

    what i want to express is: match the keyword, where it is not contained
    inside a html tag, i.e. between a < and > character

    my most obvious attempt is too simplistic and doesn't work:
    [^<]*(keyword)[^>]*
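
One common way to express "not inside a tag" is a negative lookahead: accept the keyword only if the next angle bracket after it is not a closing `>`. The attempt above cannot express this because `[^<]*` and `[^>]*` are both free to match nothing, so the keyword still matches anywhere. The sketch below is in Python for illustration; the function name `highlight` and the sample markup are assumptions. The lookahead syntax itself is also supported by the .NET regex engine.

```python
import re

def highlight(html: str, keyword: str) -> str:
    # Match the keyword only when it is NOT followed by a run of
    # non-bracket characters ending in '>', i.e. when it is not inside a tag.
    # Assumes well-formed tags and no raw '<' or '>' in text, attribute
    # values, comments, or <script>/<style> blocks.
    pattern = re.compile(re.escape(keyword) + r"(?![^<>]*>)", re.IGNORECASE)
    return pattern.sub(
        lambda m: f"<span class='highlight'>{m.group(0)}</span>", html
    )

# Hypothetical sample: keyword in a class attribute, twice in text, once in an href.
html = '<p class="keyword">first keyword, second keyword <a href="keyword.aspx">link</a></p>'
result = highlight(html, "keyword")
print(result)
```

Because each match covers only the keyword itself, repeated occurrences in the same text run are all found. This is a heuristic, not an HTML parser: a literal `>` inside an attribute value or a script block would confuse it.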

    i did come up with another regex which i am almost embarrassed to show :)
    it essentially matches the keyword inside the inner text of a html tag
    set. but the problem is that it misses subsequent occurrences of the
    keyword in the same match.

    here is the pattern:
    <(?<tag>\w+)([^>]*>[^<]*)(?<innerText>KeyWord)([^<]*</\k<tag>>)
    and the replace: <$3$1<span class='highlight'>$4</span>$2
    it actually works, but as i mentioned it does miss multiple occurrences
    inside the same tag, and requires all the text to be within an open +
    close html tag.
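
The missed occurrences can be reproduced with a Python port of the pattern, shown here as a sketch. Python writes named groups as `(?P<name>...)` and back-references as `(?P=name)`, and the two unnamed groups are given the hypothetical names `head` and `tail` so the replacement does not depend on group numbering. In .NET, unnamed groups are numbered before named ones, which is why the original replace string uses `$3` for the tag and `$4` for the keyword.

```python
import re

# Port of the pattern above; 'head' and 'tail' are illustrative names for
# the two groups that are unnamed in the original.
pattern = re.compile(
    r"<(?P<tag>\w+)(?P<head>[^>]*>[^<]*)(?P<innerText>KeyWord)(?P<tail>[^<]*</(?P=tag)>)"
)
replacement = r"<\g<tag>\g<head><span class='highlight'>\g<innerText></span>\g<tail>"

html = "<p>KeyWord one, KeyWord two</p>"
result = pattern.sub(replacement, html)

# The greedy [^<]* in 'head' swallows the first KeyWord, and the single
# match consumes the whole element, so only the last occurrence is wrapped.
print(result)
```

Each match spans the entire element from the opening tag to the closing tag, so the engine has nothing left to scan for a second occurrence within it.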

    i would be really grateful if anyone had a suggestion
    thanks
    tim
     
    Tim_Mac, Jan 20, 2006
    #1

  2. Your best bet with this type of replacement would be to first use a regex to extract the text
    between the html tags (i.e. between > and <), then do a standard string replace on the
    keyword(s) within that text.
     
    Tom Anderson, Jan 20, 2006
    #2

  3. Tim_Mac

    Tim_Mac Guest

    hi tom. thanks for the reply.
    yes but the problem i mentioned is that if you get text between > and <
    characters, it could contain more tags inside it, so your
    String.Replace method could still replace mark-up then.
    by the way, how would you pull out the text by regex, then use
    String.Replace and keep the structure of the page html all together? i
    don't see how you would use the two approaches together...
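
One way the two approaches could be combined, sketched in Python under stated assumptions: split the page on a pattern that matches whole tags, keep the tags in the result by using a capturing group, replace only in the pieces that are not tags, then join everything back together. The names `TAG` and `highlight_text_nodes` and the sample markup are hypothetical. A text piece produced this way cannot contain a tag, provided no raw `<` or `>` appears inside attribute values, comments, or script blocks.

```python
import re

# Capturing group: re.split keeps the matched tags in the result list.
TAG = re.compile(r"(<[^>]*>)")

def highlight_text_nodes(html: str, keyword: str) -> str:
    word = re.compile(re.escape(keyword), re.IGNORECASE)
    parts = TAG.split(html)
    # With one capturing group, even indexes hold text and odd indexes hold tags.
    for i in range(0, len(parts), 2):
        parts[i] = word.sub(
            lambda m: f"<span class='highlight'>{m.group(0)}</span>", parts[i]
        )
    # Tags were never modified, so joining restores the page structure.
    return "".join(parts)

# Hypothetical sample: keyword in an id attribute and twice in text.
html = '<div id="keyword"><b>keyword</b> and keyword again</div>'
result = highlight_text_nodes(html, "keyword")
print(result)
```

.NET's `Regex.Split` likewise includes captured text in its result array, so the same shape should carry over, though that port is not shown or tested here.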
     
    Tim_Mac, Jan 21, 2006
    #3
