Regular Expressions to parse HTML

Guest

I need to parse an HTML document of the following format.

I want to extract all the HTML from, and including, the first <div
class="data"> up to and including "Data updated dd/mm/yyyy" (where dd/mm/yyyy
will change). What kind of regular expression can I use? Note that I want
everything in that part of the HTML, including all the tags within the div tags.


<html>
<head>
<!-- Not interested in parsing data in the header-->
</head>
<body>
<div class="head">not interested in this</div>
<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
<img src="notInterested.jpg">
some other rubbish
<div class="footer">not interested</div>
 
Ken Arway

Patrick said:
I need to parse an HTML document of the following format.

I want to extract all the HTML from, and including, the first <div
class="data"> up to and including "Data updated dd/mm/yyyy" (where dd/mm/yyyy
will change). What kind of regular expression can I use? Note that I want
everything in that part of the HTML, including all the tags within the div tags.

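Treating the input HTML as one string (C# code):

// Singleline lets '.' match newlines; the lazy .*? stops at the first <img,
// and the lookahead keeps that tag out of the match.
Regex regex = new Regex(@"(<div class=""data"">.*?(?=<img))",
    RegexOptions.Singleline);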


Sample input:
<html>
<head>
<!-- Not interested in parsing data in the header-->
</head>
<body>
<div class="head">not interested in this</div>
<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
<img src="notInterested.jpg">
some other rubbish
<div class="footer">not interested</div>

Sample output:
1 =»<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
«=
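
The pattern above ends the match at the <img tag, which happens to follow the date line in the sample. If the page layout cannot be relied on, one option is to anchor on the "Data updated dd/mm/yyyy" text itself, as the question asks. A minimal sketch of that idea, written in Python rather than C# purely for illustration; the name extract_data_block, the trimmed sample page, the made-up date, and the assumption that the date is always two digits, two digits, four digits are mine, not from the thread:

```python
import re

# Lazy match from the first <div class="data"> through the date line.
# re.DOTALL lets '.' cross newlines, like RegexOptions.Singleline in .NET.
DATA_BLOCK = re.compile(
    r'<div class="data">.*?Data updated \d{2}/\d{2}/\d{4}',
    re.DOTALL,
)


def extract_data_block(html):
    """Return the fragment from the first data div through the date, or None."""
    match = DATA_BLOCK.search(html)
    return match.group(0) if match else None


# Sample page from the question, with a made-up date filled in.
SAMPLE = """<body>
<div class="head">not interested in this</div>
<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated 01/02/2003
<img src="notInterested.jpg">
some other rubbish
<div class="footer">not interested</div>
"""

if __name__ == "__main__":
    print(extract_data_block(SAMPLE))
```

Because the match is lazy and the search is leftmost-first, it starts at the first data div and stops at the first date line after it. The same pattern text should carry over to a .NET Regex with RegexOptions.Singleline, though that has not been checked here.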
 
Joined
Sep 24, 2007
Old thread, but I wanted the challenge.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $url = 'http://macuser9214.com/test.htm';

my $content = get($url);
die "Could not fetch $url\n" unless defined $content;

# Remove unwanted lines while their tags are still there to match on.
$content =~ s/<!--.*?-->//gs;             # HTML comments
$content =~ s/<div class="head".*>//g;    # the whole "head" div line
$content =~ s/<img.*>//g;                 # images
$content =~ s/.*footer.*//g;              # the footer line

# Strip every remaining tag, quote-aware, leaving only the text.
$content =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print $content;
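
The general tag-stripping substitution in the script is the least obvious line, so here is a small sketch of how that kind of pattern behaves. It is in Python for illustration only, and it assumes the intended pattern ends its quoted alternative with a backreference to the opening quote (\1); the names TAG and strip_tags are hypothetical:

```python
import re

# Alternative 1 consumes ordinary characters inside the tag.
# Alternative 2 consumes a whole quoted attribute value: an opening quote,
# then anything up to the same quote character again (\1), so a '>' inside
# quotes does not end the tag early.
TAG = re.compile(r"""<(?:[^>'"]*|(['"]).*?\1)*>""", re.DOTALL)


def strip_tags(html):
    """Remove tags and keep the text between them."""
    return TAG.sub("", html)


if __name__ == "__main__":
    tricky = '<a title="x > y">link</a> text'
    print(strip_tags(tricky))
    # A naive pattern stops at the first '>' even inside quotes.
    print(re.sub(r"<[^>]*>", "", tricky))
```

Note that this approach discards the markup, whereas the original question asks to keep the tags inside the extracted block, so it answers a slightly different problem.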
 
