utf8 and HTML Entities

Discussion in 'Perl Misc' started by Nick Gerber, Sep 19, 2007.

  1. Nick Gerber

    Nick Gerber Guest

    Hi

    I'm lost :-(

    I have a string encoded in utf8 that contains partly HTML entities and
    partly characters in utf-8.

    How do I translate the HTML Entities into proper utf-8?

    Thanks
    Nick Gerber, Sep 19, 2007
    #1

  2. Ben Bullock

    Ben Bullock Guest

    On Wed, 19 Sep 2007 14:59:02 +0200, Nick Gerber wrote:


    > I have a string encoded in utf8 that contains partly HTML entities and
    > partly characters in utf-8.
    >
    > How do I translate the HTML Entities into proper utf-8?


    Since this must be a commonly encountered problem, my first step would be
    to search CPAN rather than write it myself. I quickly found:

    http://search.cpan.org/~gaas/HTML-Parser-3.56/lib/HTML/Entities.pm

    Please note that I can't vouch for this software since I have not tried it.

    As far as utf8 goes you need to use the "Encode" module.
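
A minimal sketch of how the two modules could fit together, assuming the input arrives as raw UTF-8 bytes. The sample string and the variable names are made up for illustration: decode the bytes into Perl characters with Encode, expand the entities with HTML::Entities, then encode back to UTF-8 bytes for output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Hypothetical input: raw UTF-8 bytes mixing a literal "ü" (0xC3 0xBC)
# with the named entity for "ä".
my $bytes = "Z\xC3\xBCrich &auml;";

# 1. Turn the bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

# 2. Expand the entities; the result is still a character string.
my $text = decode_entities($chars);

# 3. Encode back to UTF-8 bytes for output.
my $out = encode('UTF-8', $text);

print $out, "\n";
```

Decoding before calling decode_entities matters: if the entities are expanded while the string is still raw bytes, the literal UTF-8 characters and the expanded entities can end up in different representations.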
    Ben Bullock, Sep 19, 2007
    #2

  3. Nick Gerber

    Nick Gerber Guest

    I tried HTML/Entities.pm, but it didn't do the trick for me. The problem
    is probably on my side, though: I couldn't get it to do the conversion.
    I'll try again.

    Thanks

    Nick Gerber, Sep 20, 2007
    #3
  4. Helmut Wollmersdorfer

    Nick Gerber wrote:
    > I tried HTML/Entities.pm, but it didn't do the trick for me. But, it was
    > me that could not make it to do the conversion for me. I'll try again.


    This is the approach I use; it has worked for millions of HTML (or XML) files:

    use HTML::Entities;

    my $ENCODING = 'utf8'; # or iso-8859-7, CP1250 etc.

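    # $DIR and $file must be set beforehand
    open (HTML, "<:encoding($ENCODING)", "$DIR/$file")
    or die "Can't open $DIR/$file: $!";

    # slurp the whole file; a plain <HTML> in scalar context reads one line
    my $data = do { local $/; <HTML> };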

    my $content = decode_entities($data);

    binmode(STDOUT, ":utf8");

    print "$content\n";

    It is also safe (in most cases) to use

    my $content = decode_entities(decode_entities($data));

    which also decodes doubly encoded entities such as

    &amp;amp;
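
To illustrate the "in most cases" caveat, here is a small sketch with made-up sample strings. It shows what one and two passes do to a doubly encoded ampersand, and one situation where the second pass goes further than intended.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# "&" escaped twice
my $double = '&amp;amp;';
my $once   = decode_entities($double);                   # one pass
my $twice  = decode_entities(decode_entities($double));  # two passes

# Hypothetical text that is meant to display the literal entity "&lt;",
# so it was correctly escaped once as "&amp;lt;".
my $literal  = 'Write &amp;lt; to get a less-than sign';
my $intended = decode_entities($literal);   # keeps the literal "&lt;"
my $too_far  = decode_entities($intended);  # turns it into "<"

print "$once\n$twice\n$intended\n$too_far\n";
```

The double call is therefore only appropriate when the input is known to have been escaped twice by mistake.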



    | $ perl -version
    | This is perl, v5.8.8 built for i486-linux-gnu-thread-multi

    Helmut Wollmersdorfer
    Helmut Wollmersdorfer, Sep 21, 2007
    #4
  5. Mumia W.

    Mumia W. Guest

    On 09/20/2007 08:31 PM, wrote:
    > On Wed, 19 Sep 2007 14:59:02 +0200, Nick Gerber <> wrote:
    >
    >> Hi
    >>
    >> I'm lost :-(
    >>
    >> I have a string encodet in utf8 with part HTML Entities and part
    >> characters in utf-8.
    >>
    >> How do I translate the HTML Entities into proper utf-8?
    >>
    >> Thanks

    >
    > Should be enough here to get you going:
    >
    > [ long program snipped ]


    No, that's too much.

    Mr. Gerber didn't post any code or data, and so he didn't get many
    responses because no one knew exactly what he was talking about.

    As Mr. Bullock said, HTML::Entities should do it. Here is an example:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Entities;

    binmode(STDOUT, ':utf8');
    # decode literal UTF-8 characters in the DATA section as well
    binmode(DATA, ':encoding(UTF-8)');
    local $/;
    my $data = <DATA>;

    $data = decode_entities($data);

    print $data, "\n";

    __DATA__
    膄 膅 膆
    &aacute; &eacute; &iacute; &oacute; &uacute;
    &auml; &euml; &iuml; &ouml; &uuml;
    Mumia W., Sep 21, 2007
    #5
  6. Nick Gerber

    Nick Gerber Guest

    Thanks all.

    Nick

    Nick Gerber, Sep 25, 2007
    #6