UTF-8 to named character entities

Crap

Hi all,

I need a conversion from UTF-8 to named character entities (ë ->
&euml;) and, after using the file for publishing purposes, I need
to convert it back to UTF-8. I tried this with HTML::Entities but I
get very strange results. I know very little about Perl and its
available modules; I just got Perl to work on my Mac. Below is a
piece of my XML file and the converted result. Can anybody please
give me some advice on how to proceed?
Thanks,
Chris

-- XML --
<a>(…). De andere soort is die welke de contacten en transacties
tussen patiënten reguleert.’<vn>
<al>
<a>EN 1130b31-1131a.</a>
</al>
</vn>

-- after conversion --
<a>(&acirc;€&brvbar;). De andere soort is
die welke de contacten en transacties tussen
pati&Atilde;&laquo;nten
reguleert.&acirc;€™<vn>
<al>
<a>EN 1130b31-1131a.</a>
</al>
</vn>
 
Anno Siegel

Crap said:
Hi all,

I need a conversion from UTF-8 to named character entities (ë ->
&euml;) and, after using the file for publishing purposes, I need
to convert it back to UTF-8. I tried this with HTML::Entities but I
get very strange results. I know very little about Perl and its
available modules; I just got Perl to work on my Mac. Below is a
piece of my XML file and the converted result. Can anybody please
give me some advice on how to proceed?
Thanks,
Chris

-- XML --
<a>(…). De andere soort is die welke de contacten en transacties
tussen patiënten reguleert.’<vn>
<al>
<a>EN 1130b31-1131a.</a>
</al>
</vn>

-- after conversion --
<a>(&acirc;€&brvbar;). De andere soort is
die welke de contacten en transacties tussen
pati&Atilde;&laquo;nten
reguleert.&acirc;€™<vn>
<al>
<a>EN 1130b31-1131a.</a>
</al>
</vn>

After what conversion? Please show the call to (presumably)
encode_entities() that produced the result you show. Also explain in
what way it isn't what you expect.

Cutting and pasting your input into a Perl script

use HTML::Entities;

my $str = <<EOS;
<a>(^E). De andere soort is die welke de contacten en transacties
tussen patiënten reguleert.^R<vn>
<al>
<a>EN 1130b31-1131a.</a>
</al>
</vn>
EOS

print encode_entities( $str);

gives me different output:

&lt;a&gt;(^E). De andere soort is die welke de contacten en transacties
tussen pati&euml;nten reguleert.^R&lt;vn&gt;
&lt;al&gt;
&lt;a&gt;EN 1130b31-1131a.&lt;/a&gt;
&lt;/al&gt;
&lt;/vn&gt;

Anno
 
Crap

Thanks for the reply, however I am still lost.
What I want is a conversion from UTF-8 to the corresponding named
character entities (ë --> &euml;). But I get really funny
characters. Below is my script, test.xml, and output.xml

command line: perl utf.pl test.xml


--utf.pl--
#!/usr/bin/perl -w

use HTML::Entities;

open OUT, ">output.xml";

while(<>){
$string=(encode_entities($_));
print OUT $string;
}

close OUT;


---test.xml---
<a>patiënten reguleert.’<vn>
<al>
<a>EN 1130b31-1131a.</a>
</al>
</vn>

---output.xml---
<a>pati&Atilde;&laquo;nten
reguleert.&acirc;€™<vn>
<al>
<a>EN 1130b31-1131a.</a>
</al>
</vn>

---expected.xml---
<a>pati&euml;nten reguleert.&rsquo;<vn>
<al>
<a>EN 1130b31-1131a.</a>
</al>
</vn>

Thanks,
Chris
 
John Bokma

Thanks for the reply, however I am still lost.
What I want is a conversion from UTF-8 to the corresponding named
character entities (ë --> &euml;). But I get really funny
characters. Below is my script, test.xml, and output.xml

Question: why do you want this? Just give your output.xml a proper
character encoding, and you can use ë.
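
A minimal sketch of that suggestion, assuming the input really is UTF-8 and carries no XML declaration of its own. The helper name write_with_utf8_declaration is hypothetical, not something from this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: copy an XML file and declare its encoding as UTF-8.
# Assumes the input is valid UTF-8 and has no XML declaration already.
sub write_with_utf8_declaration {
    my ($in_path, $out_path) = @_;

    # The :encoding(UTF-8) layers decode on read and encode on write.
    open my $in, '<:encoding(UTF-8)', $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!";

    print {$out} qq{<?xml version="1.0" encoding="UTF-8"?>\n};
    while (my $line = <$in>) {
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

Called as write_with_utf8_declaration('test.xml', 'output.xml'), this leaves ë and ’ as literal characters. That only helps if the publishing tool downstream honours the declared encoding.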
 
RedGrittyBrick

Crap said:
Thanks for the reply, however I am still lost.
What I want is a conversion from UTF-8 to the corresponding named
character entities (ë --> &euml;). But I get really funny
characters. Below is my script, test.xml, and output.xml

command line: perl utf.pl test.xml


--utf.pl--
#!/usr/bin/perl -w

use HTML::Entities;

open OUT, ">output.xml";

while(<>){
$string=(encode_entities($_));
print OUT $string;
}

close OUT;


---test.xml---
<a>patiënten reguleert.’<vn>
<al>
<a>EN 1130b31-1131a.</a>
</al>
</vn>

---output.xml---
<a>pati&Atilde;&laquo;nten
reguleert.&acirc;€™<vn>
<al>
<a>EN 1130b31-1131a.</a>
</al>
</vn>

---expected.xml---
<a>pati&euml;nten reguleert.&rsquo;<vn>
<al>
<a>EN 1130b31-1131a.</a>
</al>
</vn>

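Your problem is that encode_entities() is being handed raw UTF-8 bytes
and treats every byte as a separate Latin-1 character, so the two bytes
of ë come out as &Atilde;&laquo;. Decode the input into Perl characters
first.


#!perl
use strict;
use warnings;
use Encode;
use HTML::Entities;

open my $out, '>', 'output.xml'
    or die "Unable to open output.xml because $!";
while (<>) {
    # Decode the UTF-8 bytes into Perl characters before encoding entities.
    # (Re-encoding to iso-8859-1 here would lose characters such as ’.)
    my $string = decode("UTF-8", $_);
    # Encode only non-ASCII characters so the XML markup stays intact.
    print $out encode_entities($string, '^\n\r\t\x20-\x7e');
}
close $out or die "Unable to close output.xml because $!";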
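
The original question also asked for the way back to UTF-8. Below is a minimal sketch of both directions, not a definitive implementation. The names utf8_to_entities, entities_to_utf8 and %xml_predefined are made up for illustration. It assumes the five entities XML predefines (amp, lt, gt, quot, apos) should stay escaped, and that numeric references can be left untouched.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(encode_entities decode_entities);

# Entities that XML itself predefines; these stay escaped on the way back.
my %xml_predefined = map { $_ => 1 } qw(amp lt gt quot apos);

# UTF-8 bytes in, ASCII text with named entities out.
sub utf8_to_entities {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # bytes -> Perl characters
    # The second argument is the set of characters to encode: everything
    # except newline, CR, tab and printable ASCII, so <, > and & are kept.
    return encode_entities($chars, '^\n\r\t\x20-\x7e');
}

# Text with named entities in, UTF-8 bytes out.
sub entities_to_utf8 {
    my ($text) = @_;
    # Decode each named entity on its own, skipping the XML-predefined ones.
    # Numeric references such as &#8217; do not match and are left as-is.
    $text =~ s{(&([A-Za-z][A-Za-z0-9]*);)}
              { $xml_predefined{$2} ? $1 : decode_entities("$1") }ge;
    return encode('UTF-8', $text);          # Perl characters -> bytes
}
```

With the test.xml line from this thread, utf8_to_entities should give pati&euml;nten reguleert.&rsquo; with the tags untouched, and entities_to_utf8 should restore the original bytes. Characters that have no named entity come out as numeric references, which this sketch does not convert back.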
 
