String Manipulation Question - Can RegEx Do This?

Discussion in 'ASP .Net' started by Franklin, Feb 28, 2009.

  1. Franklin

    Franklin Guest

    Using .NET 3.5...

    My understanding is that RegEx is powerful enough to solve most of the
    world's problems...so I'm optimistic about this scenario:

    I need to automate a not-so-straightforward search and replace operation on
    strings that contain HTML markup fragments.

    I need to take a string like this:
    <td align="center"><asp:placeHolder ID="PlaceHolder75"
    runat="server"></asp:placeHolder></td>

    and make it into this:
    <td align="center" ID="PlaceHolder75"></td>

    Important points are these:

    1. The <asp:placeHolder... /> is being removed entirely, with nothing
    inserted in its place.

    2. The <td> located immediately before the <asp:placeHolder /> gets the
    "ID=" value of the [removed] <asp:placeHolder />.
    None of the <td> tags already has an ID attribute (this fact should
    simplify the operation).

    3. A given input string may have multiple <asp:placeHolder /> controls - all
    of which need to be removed, with the ID attribute of each being inserted
    into the <TD> immediately preceding the [removed] <asp:placeHolder />.

    So, from those of you with significant regex experience, can regex do this?
    Any pointers are greatly appreciated. Sample code would be awesome, as
    learning regex is a huge task that I've started, and I still have a long
    way to go.

    - F
    Franklin, Feb 28, 2009
    #1

  2. Franklin

    Franklin Guest

    "Peter Duniho" <> wrote in message
    > On Sat, 28 Feb 2009 10:00:01 -0800, Franklin wrote:
    >
    >> Using .NET 3.5...
    >>
    >> My understanding is that RegEx is powerful enough to solve most of the
    >> world's problems...so I'm optimistic about this scenario:

    >
    > As powerful as RegEx is, the world's problems are almost uniformly so
    > difficult as to preclude any programming technique from being able to
    > solve them.
    >
    >> I need to automate a not-so-straightforward search and replace operation
    >> on strings that contain HTML markup fragments.
    >>
    >> I need to take a string like this:
    >> <td align="center"><asp:placeHolder ID="PlaceHolder75"
    >> runat="server"></asp:placeHolder></td>
    >>
    >> and make it into this:
    >> <td align="center" ID="PlaceHolder75"></td>
    >>
    >> Important points are these:
    >>
    >> 1. The <asp:placeHolder... /> is being removed entirely, with nothing
    >> inserted in its place.
    >>
    >> 2. The <td> located immediately before the <asp:placeHolder /> gets the
    >> "ID=" value of the [removed] <asp:placeHolder />.
    >> None of the <td> tags already has an ID attribute (this fact should
    >> simplify the operation).
    >>
    >> 3. A given input string may have multiple <asp:placeHolder /> controls -
    >> all of which need to be removed, with the ID attribute of each being
    >> inserted into the <TD> immediately preceding the [removed]
    >> <asp:placeHolder />.

    >
    > You should be more specific about how you intend for multiple
    > "PlaceHolder" IDs to be added to the <td> element. Are these to be
    > combined into a single string for one ID attribute? If so, how is the
    > string formatted? If not, how?
    >
    >> So, from those of you with significant regex experience, can regex do
    >> this? Any pointers are greatly appreciated. Sample code would be awesome,
    >> as learning regex is a huge task that I've started, and I still have a
    >> long way to go.

    >
    > I'm not a RegEx expert, so don't have an answer off the top of my head. I
    > do know that RegEx supports grouping, repetitive patterns, and retrieving
    > match groups and using them in the replacement pattern, so I'd agree that
    > what you're trying to do could probably be done with RegEx.
    >
    > But are you sure that's the best way? You are dealing with XML structure
    > here, and it seems like it might be better to represent the solution as
    > something that deals with XML structure. For example, just use the
    > classes in System.Xml.Linq to manipulate your document tree.
    > Alternatively, you could create an XSLT transform and transform the
    > document that way (System.Xml.Xsl).
    >


    I'm dealing with xhtml fragments, so it might be difficult to do this with
    techniques that require an entire or well-formed xml document.

    Meanwhile, I'm cobbling something together with RegEx... I'll post it when
    completed (then hopefully get some good feedback on improving it).

    - F
    Franklin, Feb 28, 2009
    #2
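
On the fragment concern in the post above: one possible workaround, assuming each fragment is otherwise well-formed XHTML, is to wrap it in a throwaway root element that declares the "asp" prefix and then use System.Xml.Linq as suggested. This is only a sketch. The wrapper name "root" and the namespace URI "urn:example:asp" are made up for illustration, and the code uses the modern C# top-level program form for brevity.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Assumption: "urn:example:asp" is an invented namespace URI, used only so
// that the "asp:" prefix parses. The wrapper element name is arbitrary too.
XNamespace asp = "urn:example:asp";
string fragment =
    "<td align=\"center\"><asp:placeHolder ID=\"PlaceHolder75\" runat=\"server\"></asp:placeHolder></td>";

// Wrap the fragment so the parser sees a single well-formed document.
XElement root = XElement.Parse(
    "<root xmlns:asp=\"" + asp.NamespaceName + "\">" + fragment + "</root>");

// ToList() snapshots the query so elements can be removed while iterating.
foreach (XElement holder in root.Descendants(asp + "placeHolder").ToList())
{
    // Copy the ID onto the enclosing element (the <td> in the sample input),
    // then drop the placeholder itself.
    holder.Parent.SetAttributeValue("ID", (string)holder.Attribute("ID"));
    holder.Remove();
}

Console.WriteLine((string)root.Element("td").Attribute("ID")); // PlaceHolder75
```

Unlike the regex route, XML names are case-sensitive here, so "placeHolder" must match the markup's casing exactly.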

  3. Pavel Minaev

    Pavel Minaev Guest

    For the specific task that you've outlined, it probably can be done.
    However, the result will most likely be a hack anyway, and here's why.

    Given that it's HTML/ASP you're essentially parsing, to do it
    _properly_, you have to handle all valid cases, unless you can somehow
    guarantee that your input is _precisely_ as you've described, and not
    just its semantic equivalent - and usually it's pretty damn hard to do,
    esp. if it is external input! For example, you'd probably need to
    support single quotes alongside double ones, case-insensitivity,
    arbitrary whitespace, possibility of additional attributes alongside
    "align", possibility of character entities in attribute values (e.g.
    <td align="&#67;enter">) - and hey, while we're at it, consider also
    custom named entities and external DTDs!

    If you are parsing HTML that you did not yourself produce, then most
    likely you cannot truly guarantee any of the above (at best, you can
    convince yourself that "no-one would do things in such a weird way").
    If you are producing it yourself, then you still get a very
    non-obvious and brittle coupling - later on you add class="foo" to those
    TDs, forgetting about your regex code, and it all breaks - worse yet,
    it breaks silently, because Regex.Replace won't complain if it doesn't
    find anything to replace.

    I've done quite a bit of regex hacking on my own in the past, and some
    of it was specifically for HTML parsing, where HTML was internal
    input. It was the area of the product which generated the most bugs
    for us post-release, and, after some struggling, and regexes growing
    more and more messy and complicated and unreadable (and, as we
    inevitably kept finding out, still incorrect in some corner cases!),
    we scrapped the whole thing entirely and just wrote a proper parser.
    Pavel Minaev, Feb 28, 2009
    #3
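
The silent-failure point in the post above can be seen, and guarded against, with a small wrapper. This is a sketch only. ReplaceOrThrow is a hypothetical helper name, not something from the thread or the framework, and the pattern deliberately hard-codes the exact <td> markup to show how adding class="foo" defeats it. The code uses the modern C# top-level program form for brevity.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the thread): behaves like Regex.Replace,
// but throws when the pattern matches nothing instead of passing silently.
static string ReplaceOrThrow(string input, string pattern, string replacement)
{
    var regex = new Regex(pattern, RegexOptions.IgnoreCase);
    if (!regex.IsMatch(input))
    {
        throw new InvalidOperationException("Pattern matched nothing: " + pattern);
    }
    return regex.Replace(input, replacement);
}

// A pattern tied to the exact <td> markup, and an input where class="foo"
// was added to the <td> later.
string pattern = "<td align=\"center\"><asp:placeHolder";
string changed =
    "<td align=\"center\" class=\"foo\"><asp:placeHolder ID=\"P1\" runat=\"server\"></asp:placeHolder></td>";

// Plain Regex.Replace returns the input untouched and reports nothing.
Console.WriteLine(Regex.Replace(changed, pattern, "<td><asp:placeHolder") == changed); // True

try
{
    ReplaceOrThrow(changed, pattern, "<td><asp:placeHolder");
}
catch (InvalidOperationException ex)
{
    // The wrapper makes the mismatch visible.
    Console.WriteLine(ex.Message);
}
```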
  4. Franklin

    Franklin Guest

    <snip>

    I do have complete control over the inputs.

    The only meaningful variation between what I stated and posted in the OP and
    what I'll have to deal with in the application is that the ID value of the
    PlaceHolder controls will change/be unique. Note that there will possibly be
    multiple PlaceHolders declared within any given input.

    All I really need to do is exactly what's stated in the OP and presented in
    the sample strings in the OP. I'm only needing to search for
    "<asp:placeHolder..." tags, and take the ID of each found Placeholder and
    stick it into the preceding <TD>.

    I'm close to what I need using .NET's RegEx Replace:
    string result = Regex.Replace(
        text1,
        "><asp:placeHolder.*</asp:placeHolder>",
        new MatchEvaluator(GetRevisedTdTag),
        RegexOptions.IgnoreCase);

    The only problem I'm having is that in the above regex match, the .*
    part is causing it to match the start of the first ><asp:placeHolder
    instance and the close of the very last </asp:placeHolder> found in an input
    that has multiple PlaceHolders defined within it.

    I suspect that for somebody with substantial regex knowledge, it would be
    trivial to cause it to match each ><asp:placeHolder... individually. Can you
    help with that part?

    Thanks.
    Franklin, Feb 28, 2009
    #4
  5. J.B. Moreno

    J.B. Moreno Guest

    Franklin wrote:

    > <snip>
    >
    > I do have complete control over the inputs.
    >
    > The only meaningful variation between what I stated and posted in the OP and
    > what I'll have to deal with in the application is that the ID value of the
    > PlaceHolder controls will change/be unique. Note that there will possibly be
    > multiple PlaceHolders declared within any given input.

    -snip-
    > The only problem I'm having is that in the above regex match, the .*
    > part is causing it to match the start of the first ><asp:placeHolder
    > instance and the close of the very last </asp:placeHolder> found in an input
    > that has multiple PlaceHolders defined within it.
    >
    > I suspect that for somebody with substantial regex knowledge, it would be
    > trivial to cause it to match each ><asp:placeHolder... individually. Can you
    > help with that part?


    Don't have it handy to test the full expression at the moment, but what
    you're wanting is the non-greedy version of .* which is .*?

    Normally * takes as many characters as it can while still matching the
    expression; adding the ? causes it to stop as soon as it has found its
    match.

    For example, given the string "small world, a really, really tiny world", the
    expression "small.*world" will match the entire string, but
    "small.*?world" will match just the first two words.

    --
    J.B. Moreno
    J.B. Moreno, Feb 28, 2009
    #5
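
The "small world" example in the post above can be checked directly in .NET. A minimal sketch, written in the modern C# top-level program form for brevity; the variable names are illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

string input = "small world, a really, really tiny world";

// Greedy: .* takes as much as it can, so the match runs to the last "world".
string greedy = Regex.Match(input, "small.*world").Value;

// Lazy: .*? takes as little as it can, so the match stops at the first "world".
string lazy = Regex.Match(input, "small.*?world").Value;

Console.WriteLine(greedy); // small world, a really, really tiny world
Console.WriteLine(lazy);   // small world
```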
  6. Franklin

    Franklin Guest

    <snip>

    > Don't have it handy to test the full expression at the moment, but what
    > you're wanting is the non-greedy version of .* which is .*?



    That's exactly what I needed. Thanks!!!!


    -F
    Franklin, Feb 28, 2009
    #6
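
Putting the thread together, the whole operation might look roughly like the sketch below. It is not Franklin's actual code: his GetRevisedTdTag evaluator was never posted, so MovePlaceHolderIds and the capture-group pattern are assumptions. It also assumes the input has exactly the shape described in the thread (double-quoted ID, placeholder immediately after the <td> tag) and uses the modern C# top-level program form for brevity.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper name; the pattern is one possible choice, not the
// thread author's final expression.
static string MovePlaceHolderIds(string html)
{
    // The leading ">" is the end of the preceding <td ...> tag.
    // Group 1 captures the ID value. [^>]* cannot cross a tag boundary and
    // .*? is lazy, so each placeholder is matched on its own.
    const string pattern =
        @"><asp:placeHolder\b[^>]*?\bID=""([^""]*)""[^>]*>.*?</asp:placeHolder>";

    return Regex.Replace(
        html,
        pattern,
        m => " ID=\"" + m.Groups[1].Value + "\">",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
}

string input =
    "<td align=\"center\"><asp:placeHolder ID=\"PlaceHolder75\" runat=\"server\"></asp:placeHolder></td>" +
    "<td><asp:placeHolder ID=\"PlaceHolder76\" runat=\"server\"></asp:placeHolder></td>";

Console.WriteLine(MovePlaceHolderIds(input));
// <td align="center" ID="PlaceHolder75"></td><td ID="PlaceHolder76"></td>
```

The replacement re-emits the ">" that the match consumed, so the <td> tag is closed again right after the inserted ID attribute.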