Regular Expressions to parse HTML

Discussion in 'ASP .Net' started by Patrick, May 2, 2006.

  1. I need to parse an HTML document of the following format.

    I want to extract all the HTML from and including the first <div
    class="data"> up to and including "Data updated dd/mm/yyyy" (where dd/mm/yyyy
    will change). What kind of regular expression can I use? Note that I want
    everything in that span of the HTML, including all the tags within the div tags.


    <html>
    <head>
    <!-- Not interested in parsing data in the header-->
    </head>
    <body>
    <div class="head">not interested in this</div>
    <div class="data">Interested in data from this first data div</div>
    <div class="data">There can be <b>other tags</b> within these divs too!</div>
    <a name="data3"></a>(There can be some other stuff in between the div tags)
    Data updated dd/mm/yyyy
    <img src="notInterested.jpg">
    some other rubbish
    <div class="footer">not interested</div>
    Patrick, May 2, 2006
    #1

  2. Ken Arway (Guest)

    Patrick wrote:
    > I need to parse an HTML document of the following format.
    >
    > I want to extract all the HTML from and including the first <div
    > class="data"> up to and including "Data updated dd/mm/yyyy" (where dd/mm/yyyy
    > will change). What kind of regular expression can I use? Note that I want
    > everything in that span of the HTML, including all the tags within the div tags.


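    Treating the input HTML as one string (C# code):

    // Requires: using System.Text.RegularExpressions;
    // Singleline lets '.' match newlines; the lazy .*? stops before the first <img.
    Regex regex = new Regex(@"(<div class=""data"">.*?(?=<img))",
        RegexOptions.Singleline);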


    Sample input:
    <html>
    <head>
    <!-- Not interested in parsing data in the header-->
    </head>
    <body>
    <div class="head">not interested in this</div>
    <div class="data">Interested in data from this first data div</div>
    <div class="data">There can be <b>other tags</b> within these divs too!</div>
    <a name="data3"></a>(There can be some other stuff in between the div tags)
    Data updated dd/mm/yyyy
    <img src="notInterested.jpg">
    some other rubbish
    <div class="footer">not interested</div>

    Sample output (group 1, shown between the =» and «= markers):
    1 =»<div class="data">Interested in data from this first data div</div>
    <div class="data">There can be <b>other tags</b> within these divs too!</div>
    <a name="data3"></a>(There can be some other stuff in between the div tags)
    Data updated dd/mm/yyyy
    «=
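
    The pattern above stops at the <img> tag, which happens to follow the date
    line in the sample. If the page layout cannot be relied on, one alternative
    is to anchor the end of the match on the "Data updated" stamp itself. The
    following is only a minimal sketch under assumptions: the class and method
    names (DataSliceSketch, ExtractDataSection) are made up for illustration,
    the date is assumed to be two digits, two digits, four digits, and the
    opening tag is assumed to be written exactly as <div class="data">.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original thread.
public static class DataSliceSketch
{
    // Lazy match from the first <div class="data"> through the first
    // "Data updated dd/mm/yyyy" stamp. Singleline lets '.' cross newlines.
    // Assumption: the date always appears as 2 digits / 2 digits / 4 digits.
    private static readonly Regex DataSection = new Regex(
        @"<div\s+class=""data"">.*?Data updated \d{2}/\d{2}/\d{4}",
        RegexOptions.Singleline);

    // Returns the matched slice, or null when the page has no such section.
    public static string ExtractDataSection(string html)
    {
        Match match = DataSection.Match(html);
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        // Same shape as the sample input, with a concrete date filled in.
        string html =
            "<div class=\"head\">not interested in this</div>\n" +
            "<div class=\"data\">Interested in data from this first data div</div>\n" +
            "<div class=\"data\">There can be <b>other tags</b> within these divs too!</div>\n" +
            "<a name=\"data3\"></a>(There can be some other stuff in between the div tags)\n" +
            "Data updated 02/05/2006\n" +
            "<img src=\"notInterested.jpg\">\n" +
            "<div class=\"footer\">not interested</div>\n";

        Console.WriteLine(ExtractDataSection(html) ?? "(no match)");
    }
}
```

    Run against that input, the match begins at the first data div and ends
    with the date, so nothing from the <img> tag onward is included. As with
    any regex over HTML, this only holds while the markup stays this regular;
    an HTML parser is the safer choice if attributes or nesting vary.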

    --
    Take care,
    Ken
    (to reply directly, remove the cool car. <sigh>)
    Ken Arway, May 3, 2006
    #2

  3. macuser9214

    old thread, but i wanted the challenge.
    macuser9214, Sep 24, 2007
    #3
