Erroneous Text Extraction using HTML::Parser

Himanshu Garg · Jan 27, 2004

Hello,
I am using HTML::Parser to extract text from HTML pages from
http://bbc.co.uk/urdu/

However, the encoding of the input text seems to change to some
unknown encoding in the output.

The program is given below. The HTML is in a string to keep the
example simple. The same problem appears with HTML in a file.

#################################################################
use strict;
use warnings;
use HTML::Parser;

# set standard output to utf8
binmode(STDOUT, ":utf8");

# Create parser object; the text handler receives each text chunk
my $p = HTML::Parser->new( api_version => 3,
                           text_h      => [\&text, "text"] );

# parse UTF-8 encoded Arabic-script (Urdu) text
$p->parse( "<html> <body>
پاکستان </body> </html>");
# signal end of input so any buffered text is flushed
$p->eof;

sub text
{
    my ($txt) = @_;
    print $txt;
}
#################################################################

Also, I am unable to pinpoint the problem by looking at the
parser source code, because HTML/Parser.pm doesn't seem to contain any
code that does the real parsing work.

Thank You
Himanshu.
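
A possible cause, offered as a guess rather than a confirmed diagnosis:
without "use utf8", Perl treats the string literal as raw bytes,
HTML::Parser hands those bytes to the handler unchanged, and the :utf8
layer on STDOUT then encodes each byte a second time. The sketch below
assumes that is what happens and decodes the input with the core Encode
module before parsing. The variable names ($word, $bytes, $extracted)
are illustrative, and the code point escapes stand in for the Urdu word
so the example does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Parser;

# The Urdu word from the question, written as code point escapes
# (illustrative; a literal would also work under "use utf8").
my $word = "\x{67E}\x{627}\x{6A9}\x{633}\x{62A}\x{627}\x{646}";

# Assumption: the page arrives as raw UTF-8 bytes, as it would
# from a file or socket opened without an encoding layer.
my $bytes = encode('UTF-8', "<html> <body>\n$word </body> </html>");

# Collect the text chunks instead of printing them one by one.
my $extracted = '';
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $extracted .= shift }, 'text' ],
);

# Decode bytes into Perl characters BEFORE parsing, so the output
# layer encodes characters once rather than re-encoding bytes.
$p->parse( decode('UTF-8', $bytes) );
$p->eof;

binmode(STDOUT, ':encoding(UTF-8)');
print $extracted, "\n";
```

For HTML read from a file, opening the handle with the
':encoding(UTF-8)' layer should have the same effect as the explicit
decode, assuming the file really is UTF-8.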
