Problem with body text extraction with HTML::Parser

Perl_user

Hi,

I have been using HTML::Parser to extract the textual data from an HTML
document.

I am using the following code:

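use HTML::Parser ();

my $file = shift;   # path to the HTML file (assumed to come from @ARGV)

my $p = HTML::Parser->new(api_version => 3,
    start_h => [\&a_start_handler, "self,tagname"],
    report_tags => [qw(title h1 h2 h3 h4 h5 h6)],
);
$p->parse_file($file || die) || die $!;

# Called for each reported start tag: print the tag name and
# start collecting the text inside it.
sub a_start_handler
{
    my($self, $tag) = @_;
    print "$tag\n";
    $self->handler(text => [], '@{dtext}' );
    $self->handler(start => \&text);
    $self->handler(end => \&a_end_handler, "self,tagname,text");
}

# Start tags nested inside a reported element are ignored.
sub text
{
    my($self, $tag) = @_;
    return;
}

# Called for the end tag: print the collected text, then go back to
# waiting for the next reported start tag.
sub a_end_handler
{
    my($self, $tag) = @_;
    my $text = join("", @{$self->handler("text")});
    print "$text\n";
    $self->handler("text", undef);
    $self->handler("start", \&a_start_handler);
    $self->handler("end", undef);
}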

which reports the title and headers from the page. This works, but I have a
problem getting the body text separately, as it isn't contained inside an
HTML tag that I can report.


Web-site...
------------
<html>
<head>
<title>This is the title of the webpage.</title>
</head>
<body>
<h1>First Type Header</h1>
<h2>Second Type Header</h2>
This is the main body of the text. It will be concidered as the article.
Blah Blah blah
</body>
</html>
------------

Any ideas appreciated


Output with reported tags [title h1 h2 h3 h4 h5 h6]
--
title
This is the title of the webpage.
h1
First type
h2
Second Type Header
h3
Third Type Header
--

Output with reported tag <body>
--
This is the title of the webpage. It is a mess.
First type header
Second Type Header
Third Type Header
This is the main body of the text. It will be concidered as the article.
 
Ian Stuart

Web-site...
------------
<html>
<head>
<title>This is the title of the webpage.</title>
</head>
<body>
<h1>First Type Header</h1>
<h2>Second Type Header</h2>
This is the main body of the text. It will be concidered as the article.
Blah Blah blah
</body>
</html>
------------

Well, one problem is going to be that the main body of the text is,
technically, loose text: it sits directly inside <body> rather than in an
element of its own (a <p>, say). If you try to validate that page against a
strict DTD, it will complain that "This..." needs to be contained in an
element.

Now, whether HTML::Parser pays any mind to this, I don't know...
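
For what it's worth, here is a rough sketch of one way to get at the loose text. Instead of using report_tags, it listens to every start, end and text event, remembers whether the parser is currently inside a title/h1..h6 element, and files any other text seen inside <body> as body text. The name split_text and the returned data layout are made up for illustration, and it is untested against messy real-world pages (text inside <script> or <style> in the body would leak through, for example).

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns (\@headings, $body_text), where each
# heading is [ tagname, text ] and $body_text is everything else that
# appeared between <body> and </body>.
sub split_text {
    my ($html) = @_;
    my %is_heading = map { $_ => 1 } qw(title h1 h2 h3 h4 h5 h6);
    my ($current, $in_body) = ('', 0);
    my (@headings, @body);

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            $in_body = 1 if $tag eq 'body';
            if ($is_heading{$tag}) {
                $current = $tag;               # now inside a heading
                push @headings, [ $tag, '' ];
            }
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            $in_body = 0  if $tag eq 'body';
            $current = '' if $tag eq $current; # heading closed
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            if    ($current) { $headings[-1][1] .= $text }  # heading text
            elsif ($in_body) { push @body, $text }          # loose body text
        }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;

    # Collapse runs of whitespace and trim both ends.
    my $body = join '', @body;
    $body =~ s/\s+/ /g;
    $body =~ s/^\s+|\s+$//g;
    return (\@headings, $body);
}
```

Call it with the page contents as a string, e.g. my ($headings, $body) = split_text($html); then loop over @$headings for the tag/text pairs and print $body for the rest. With the sample page above I would expect $body to hold the two loose lines joined by a single space, but I have only reasoned that through, not run it on your file.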
 
