Erroneous Text Extraction using HTML::Parser

Discussion in 'Perl' started by Himanshu Garg, Jan 27, 2004.

  1. Hello,
    I am using HTML::Parser to extract text from HTML pages at
    http://bbc.co.uk/urdu/

    However, the encoding of the input text seems to change to some
    unknown encoding in the output.

    The program is given below. The HTML is in a string to keep the
    example simple. The same problem appears with HTML in a file.

    #################################################################
    use strict;
    use warnings;
    use HTML::Parser;

    # Set standard output to UTF-8
    binmode(STDOUT, ":utf8");

    # Create the parser object; text_h calls text() for each text segment
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ \&text, "text" ],
    );

    # Parse UTF-8 encoded Arabic-script text
    $p->parse( "<html> <body>
    پاکستان </body> </html>");
    $p->eof;    # signal end of input so any buffered text is flushed

    # Text handler: print each text segment as it is received
    sub text
    {
        my ($txt) = @_;
        print $txt;
    }
    #################################################################
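
One possible explanation for the output of the program above, offered as a sketch rather than a confirmed diagnosis: without `use utf8;` or an explicit decode, Perl treats the string literal as raw bytes, and the `:utf8` layer on STDOUT then encodes each of those bytes a second time. The snippet below reproduces that effect with the core Encode module only, leaving HTML::Parser out. The variable names are illustrative, and the byte string is the UTF-8 encoding of the word used in the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes of the word from the example, written as escapes so the
# sketch does not depend on how this source file itself is saved.
my $bytes = "\xD9\xBE\xD8\xA7\xDA\xA9\xD8\xB3\xD8\xAA\xD8\xA7\xD9\x86";

# Encoding the undecoded bytes treats every byte as one character and
# encodes it again, which is what a ':utf8' output layer does on print.
my $double_encoded = encode('UTF-8', $bytes);

# Decoding first yields a character string; encoding that string once
# gives back the original bytes.
my $chars        = decode('UTF-8', $bytes);
my $encoded_once = encode('UTF-8', $chars);

printf "raw bytes:            %d\n", length $bytes;
printf "double-encoded bytes: %d\n", length $double_encoded;
printf "decoded characters:   %d\n", length $chars;
printf "round trip intact:    %s\n", $encoded_once eq $bytes ? "yes" : "no";
```

If this is indeed the cause, decoding the input before passing it to `parse` (or adding `use utf8;` when the HTML is a literal in a UTF-8 source file, or opening a file with `<:encoding(UTF-8)`) should leave the `:utf8` layer with characters to encode rather than bytes.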

    Also, I am unable to pinpoint the problem by looking at the
    parser source code, because HTML/Parser.pm doesn't seem to contain any
    code that does the real parsing work.

    Thank You
    Himanshu.