Himanshu Garg
Hello,
I am using HTML::Parser to extract text from HTML pages from
http://bbc.co.uk/urdu/
However the encoding of the input text seems to change to some
unknown encoding in the output.
The program is given below. The HTML is in a string to keep the
example simple. The same problem appears with HTML in a file.
#################################################################
use HTML::Parser;
# set standard output to utf8
binmode(STDOUT, ":utf8");
# Create parser object
my $p = HTML::Parser->new( api_version => 3, text_h => [\&text, "text"] );
# parse UTF-8 encoded Arabic text
$p->parse( "<html> <body>
پاکستان </body> </html>");
# text handler: prints each text segment as received from the parser
sub text
{
    my ($txt) = @_;
    print $txt;
}
#################################################################
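A possible explanation, stated as an assumption rather than a confirmed diagnosis: if the script is saved as UTF-8 and does not say `use utf8`, the string passed to `parse` holds raw bytes, and the `:utf8` layer on STDOUT then encodes each of those bytes a second time. The sketch below uses only the core Encode module (not HTML::Parser) to show that effect. The variable names are illustrative, and the bytes are built from code points so the sketch does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: without "use utf8", a string literal in a UTF-8 source file
# holds raw UTF-8 bytes. Build those bytes explicitly for the word above.
my $bytes = encode('UTF-8',
    "\x{067E}\x{0627}\x{06A9}\x{0633}\x{062A}\x{0627}\x{0646}");

# What a ":utf8" output layer does with a byte string: each byte is treated
# as a Latin-1 character and encoded again (double encoding).
my $double_encoded = encode('UTF-8', decode('ISO-8859-1', $bytes));

# Decoding the input first yields a character string, which is then
# encoded exactly once on output.
my $chars        = decode('UTF-8', $bytes);
my $encoded_once = encode('UTF-8', $chars);

printf "input bytes:    %d\n", length $bytes;
printf "double encoded: %d\n", length $double_encoded;
printf "encoded once:   %d\n", length $encoded_once;
```

If this is what happens in the program above, adding `use utf8;` at the top of the script, or decoding the input with `decode('UTF-8', ...)` before calling `$p->parse`, would be the things to try first.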
Also, I am unable to pinpoint the problem by looking at the
parser source code, because HTML/Parser.pm doesn't seem to contain any
code that does the real parsing work.
Thank you,
Himanshu.