Erroneous Text Extraction using HTML::Parser


Himanshu Garg

Hello,
I am using HTML::Parser to extract text from HTML pages from
http://bbc.co.uk/urdu/

However the encoding of the input text seems to change to some
unknown encoding in the output.

The program is given below. The HTML is in a string to keep the
example simple. The same problem appears with HTML in a file.

#################################################################
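use HTML::Parser;

# set standard output to UTF-8
binmode(STDOUT, ":utf8");

# Create parser object; each text segment is passed to text()
my $p = HTML::Parser->new( api_version => 3, text_h => [\&text, "text"] );

# parse UTF-8 encoded Urdu text (Arabic script)
$p->parse( "<html> <body>
پاکستان </body> </html>");
# signal end of document so that any buffered text is flushed
$p->eof;

# text handler: print each text segment as received
sub text
{
    my ($txt) = @_;
    print $txt;
}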
#################################################################

Also, I am unable to pinpoint the problem by looking at the
parser source code because HTML/Parser.pm doesn't seem to contain any
code that does the real parsing work.

Thank You
Himanshu.
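
The garbled output described above is consistent with double encoding. The script has no `use utf8;`, so Perl reads the string literal as raw bytes, and the `:utf8` layer on STDOUT then encodes each of those bytes a second time. The sketch below reproduces that effect with the core Encode module only. It assumes the script file is saved as UTF-8. The variable names are illustrative, and the word from the post is written as byte escapes so the example does not depend on how this file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes of the seven-character word from the post, as Perl sees
# the literal when "use utf8" is absent.
my $bytes = "\xD9\xBE\xD8\xA7\xDA\xA9\xD8\xB3\xD8\xAA\xD8\xA7\xD9\x86";

# What a :utf8 output layer does to an undecoded byte string: every byte
# is treated as a Latin-1 character and encoded again.
my $double = encode('UTF-8', $bytes);

# Decoding first turns the bytes into characters, so encoding on output
# gives back the original bytes.
my $chars   = decode('UTF-8', $bytes);
my $correct = encode('UTF-8', $chars);

printf "raw bytes: %d, characters: %d, double-encoded bytes: %d\n",
    length($bytes), length($chars), length($double);
print "round trip matches the input bytes\n" if $correct eq $bytes;
```

If this is indeed the cause, adding `use utf8;` at the top of the original script should be enough for the inline string. For HTML read from a file or fetched over the network, decoding the data to characters before passing it to `$p->parse` would have the same effect.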
 
