utf8 and HTML Entities

Nick Gerber

Hi

I'm lost :-(

I have a string encoded in utf8 with part HTML Entities and part
characters in utf-8.

How do I translate the HTML Entities into proper utf-8?

Thanks
 
Ben Bullock

Nick said:
I have a string encoded in utf8 with part HTML Entities and part
characters in utf-8.

How do I translate the HTML Entities into proper utf-8?

Since this must be a commonly encountered problem, my first step would be
to check CPAN to save myself the bother of writing it. I rapidly found:

http://search.cpan.org/~gaas/HTML-Parser-3.56/lib/HTML/Entities.pm

Please note that I can't vouch for this software since I have not tried it.

As far as utf8 goes, you need the "Encode" module to convert between UTF-8
bytes and Perl character strings.
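
A minimal sketch of how the two modules might be combined, assuming the input arrives as raw UTF-8 bytes (for example from a file opened without an encoding layer). The sub name entities_to_utf8 and the sample string are made up for illustration; decode, encode and decode_entities are the functions exported by Encode and HTML::Entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Hypothetical helper: takes UTF-8 encoded bytes that may contain HTML
# entities and returns UTF-8 encoded bytes with the entities resolved.
sub entities_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> Perl character string
    $chars = decode_entities($chars);      # &auml; -> U+00E4, and so on
    return encode('UTF-8', $chars);        # characters -> bytes again
}

# "\xC3\xBC" is u-umlaut as UTF-8 bytes; "&auml;" is still an entity.
my $in  = "Z\xC3\xBCrich und K&auml;se";
my $out = entities_to_utf8($in);

# STDOUT is left in byte mode here, so the UTF-8 bytes pass through as-is.
print $out, "\n";
```

Decoding to characters first matters: if decode_entities is handed undecoded bytes, the literal UTF-8 characters and the decoded entities end up in different representations inside one string, which is a plausible cause of the mixed output described in the question.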
 
Nick Gerber

I tried HTML/Entities.pm, but it didn't do the trick for me. That said, the
problem was probably me not getting it to do the conversion. I'll try again.

Thanks
 
Helmut Wollmersdorfer

Nick said:
I tried HTML/Entities.pm, but it didn't do the trick for me. That said, the
problem was probably me not getting it to do the conversion. I'll try again.

This is my approach, which has worked for millions of HTML (or XML) files:

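use HTML::Entities;

my $ENCODING = 'utf8'; # or iso-8859-7, CP1250 etc.

# $DIR and $file are set elsewhere in the program.
# The :encoding layer turns the file's bytes into Perl characters on read.
open (my $html, "<:encoding($ENCODING)", "$DIR/$file")
or die "Can't open $DIR/$file: $!";

# Read the whole file at once, not just its first line.
my $data = do { local $/; <$html> };
close($html);

# Replace entities such as &auml; with the corresponding characters.
my $content = decode_entities($data);

binmode(STDOUT, ":utf8");

print "$content\n";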

It is also safe (in most cases) to use

my $content = decode_entities(decode_entities($data));

which also decodes doubly escaped entities such as

&amp;amp;
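
To make the double-decoding point concrete, here is a small sketch with invented sample strings. It shows what one and two passes do to a doubly escaped ampersand, and one case where the second pass is probably not wanted, which may be what "in most cases" refers to.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# A doubly escaped ampersand, as produced by escaping already-escaped HTML.
my $twice = 'Tom &amp;amp; Jerry';

my $one_pass = decode_entities($twice);      # 'Tom &amp; Jerry'
my $two_pass = decode_entities($one_pass);   # 'Tom & Jerry'

print "$one_pass\n$two_pass\n";

# The caveat: a page that wants to display the literal text "&lt;" writes
# "&amp;lt;". One pass yields the intended "&lt;"; a second pass turns it
# into "<", which is no longer what the author meant.
my $literal = 'Write &amp;lt; for a less-than sign';
my $once    = decode_entities($literal);
my $again   = decode_entities($once);

print "$once\n$again\n";
```

Whether the second pass is appropriate depends on how the input was produced, so it is worth checking a sample of the data before applying it everywhere.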



| $ perl -version
| This is perl, v5.8.8 built for i486-linux-gnu-thread-multi

Helmut Wollmersdorfer
 
Mumia W.

Nick said:
Hi

I'm lost :-(

I have a string encoded in utf8 with part HTML Entities and part
characters in utf-8.

How do I translate the HTML Entities into proper utf-8?

Thanks

Should be enough here to get you going:

[ long program snipped ]

No, that's too much.

Mr. Gerber didn't post any code or data, and so he didn't get many
responses because no one knew exactly what he was talking about.

As Mr. Bullock said, HTML::Entities should do it. Here is an example:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

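binmode(STDOUT, ':utf8');
# Decode the literal UTF-8 characters in the DATA section on input;
# otherwise they are read as bytes and double-encoded on output.
binmode(DATA, ':encoding(UTF-8)');
local $/;    # slurp mode: read all of DATA in one go
my $data = <DATA>;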

$data = decode_entities($data);

print $data, "\n";

__DATA__
膄 膅 膆
&aacute; &eacute; &iacute; &oacute; &uacute;
&auml; &euml; &iuml; &ouml; &uuml;
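
The example above uses named entities only. As a complementary sketch (the sample entities are chosen for illustration), decode_entities also resolves decimal and hexadecimal numeric character references, which is how characters without a named entity usually appear in HTML:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

binmode(STDOUT, ':encoding(UTF-8)');

# Three spellings of the same character, U+00E4 (a with diaeresis):
# named, decimal and hexadecimal.
my @spellings = ('&auml;', '&#228;', '&#xE4;');

for my $entity (@spellings) {
    # Called in non-void context, so $entity itself is left untouched.
    my $char = decode_entities($entity);
    printf "%-8s -> %s (U+%04X)\n", $entity, $char, ord $char;
}

# A numeric reference to a character outside Latin-1, here U+0104.
my $ogonek = decode_entities('&#260;');
printf "%-8s -> %s (U+%04X)\n", '&#260;', $ogonek, ord $ogonek;
```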
 
Nick Gerber

Thanks all.

Nick