Please help me: what is the easiest way to extract the text between some variable text?
From <TH class=name width=100>New name</TH> I need to extract: New name
A couple of weeks back, hymie! posted a thread entitled
'table --> pre'. He wanted to extract the content of an HTML table to
preformat it. I posted the following script and output.
Perl gives you a number of ways to do what you want, many of them
simple-minded and primitive, others pretty sophisticated. I generally
prefer the former: the more simple-minded and primitive, the better.
You probably should approach a problem like this in an incremental
fashion, by first matching the least possible amount of what you want,
and adding to it little by little until you get what you want. You
don't need to use a regular expression; index() and substr() will do
the same kind of thing.
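For the <TH> line in the question, both approaches might look something like the sketch below. The sub names extract_with_regex and extract_with_index are made up for illustration, and the code assumes the whole cell sits on one line and that no attribute value contains a '>'. The regex is the end point of the incremental approach: start with /<TH/, grow it to /<TH[^>]*>/, then add the capture and the closing tag.

```perl
use strict;
use warnings;

# Hypothetical helper: the regex version, built up step by step.
# <th\b[^>]*> matches the opening tag with any attributes,
# ([^<]*) captures the text, </th> anchors the closing tag.
sub extract_with_regex {
    my ($line) = @_;
    return $line =~ m!<th\b[^>]*>([^<]*)</th>!i ? $1 : undef;
}

# Hypothetical helper: the same result with index() and substr() only.
# Unlike the regex above, this version is case-sensitive.
sub extract_with_index {
    my ($line) = @_;
    my $open = index($line, '<TH');            # start of the opening tag
    return if $open < 0;
    my $start = index($line, '>', $open);      # end of the opening tag
    return if $start < 0;
    $start++;                                  # first character of the text
    my $end = index($line, '</TH>', $start);   # start of the closing tag
    return if $end < 0;
    return substr($line, $start, $end - $start);
}

my $line = '<TH class=name width=100>New name</TH>';
print extract_with_regex($line), "\n";   # prints "New name"
print extract_with_index($line), "\n";   # prints "New name"
```

Either one is fine for a quick one-off. If the HTML is less regular than this, a real parser is the safer choice.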
Other technologies will do the same kind of thing. I routinely do this
in vi (vim), when I want to transfer some content from one function to
another function, for instance, converting a SQL query to a hash
declaration.
CC.
SCRIPT
#! perl
use strict;
use warnings;
my $content = '';
while (<DATA>)
{
    next unless /\w/;    # skip blank lines
    chomp;
    if ($_ =~ m!<(\/?)table!)
    {
        # <table> becomes <pre>, </table> becomes </pre>
        $content .= "<$1pre>";
        next;
    }
    elsif ($_ =~ m!<\/?tr!)
    {
        # a row boundary becomes a line break
        $content .= "\n";
        next;
    }
    elsif ($_ =~ m!<t[dh]>([^<]*)<\/t[dh]>!)
    {
        # pad each cell to a 20-character column
        $content .= sprintf("%-20s", $1);
        next;
    }
    else
    {
        warn "ERROR: $_\n";
    }
}
print $content;
exit(0);
__DATA__
<table>
<tr>
<td>George</td>
<td>Washington</td>
<td>Virginia</td>
<td>1788</td>
</tr>
<tr>
<td>George</td>
<td>Washington</td>
<td>Virginia</td>
<td>1792</td>
</tr>
<tr>
<td>John</td>
<td>Adams</td>
<td>Massachusetts</td>
<td>1796</td>
</tr>
<tr>
<td>Thomas</td>
<td>Jefferson</td>
<td>Virginia</td>
<td>1800</td>
</tr>
<tr>
<td>Thomas</td>
<td>Jefferson</td>
<td>Virginia</td>
<td>1804</td>
</tr>
</table>
OUTPUT
<pre>
George Washington Virginia 1788
George Washington Virginia 1792
John Adams Massachusetts 1796
Thomas Jefferson Virginia 1800
Thomas Jefferson Virginia 1804
</pre>