UTF-8 to named character entities

Discussion in 'Perl Misc' started by Crap, Jun 26, 2005.

  1. Crap

    Crap Guest

    Hi all,

    I need a conversion from UTF-8 to named character entities (ë ->
    &euml;) and after using the file for publishing purposes I need
    to convert it back to UTF-8. I tried this with HTML::Entities but I
    get very strange results. I know very little about Perl and its
    available modules. I just got Perl to work on my Mac. Below is a
    piece of my XML file and the converted result. Can anybody please
    give me advice on how to proceed?
    Thanks,
    Chris

    -- XML --
    <a>(…). De andere soort is die welke de contacten en transacties
    tussen patiënten reguleert.’<vn>
    <al>
    <a>EN 1130b31-1131a.</a>
    </al>
    </vn>

    -- after conversion --
    <a>(&acirc;€&brvbar;). De andere soort is
    die welke de contacten en transacties tussen
    pati&Atilde;&laquo;nten
    reguleert.&acirc;€™<vn>
    <al>
    <a>EN 1130b31-1131a.</a>
    </al>
    </vn>
    Crap, Jun 26, 2005
    #1

  2. Anno Siegel

    Anno Siegel Guest

    Crap <-spam.invalid> wrote in comp.lang.perl.misc:
    > Hi all,
    >
    > I need a conversion from UTF-8 to named character entities (ë ->
    > &euml;) and after using the file for publishing purposes I need
    > to convert it back to UTF-8. I tried this with HTML::Entities but I
    > get very strange results. I know very little about Perl and its
    > available modules. I just got Perl to work on my Mac. Below is a
    > piece of my XML file and the converted result. Can anybody please
    > give me advice on how to proceed?
    > Thanks,
    > Chris
    >
    > -- XML --
    > <a>(…). De andere soort is die welke de contacten en transacties
    > tussen patiënten reguleert.’<vn>
    > <al>
    > <a>EN 1130b31-1131a.</a>
    > </al>
    > </vn>
    >
    > -- after conversion --
    > <a>(&acirc;€&brvbar;). De andere soort is
    > die welke de contacten en transacties tussen
    > pati&Atilde;&laquo;nten
    > reguleert.&acirc;€™<vn>
    > <al>
    > <a>EN 1130b31-1131a.</a>
    > </al>
    > </vn>


    After what conversion? Please show the call to (presumably)
    encode_entities() that produced the result you show. Also explain in
    what way it isn't what you expect.

    Cutting and pasting your input into a Perl script

    use HTML::Entities;

    my $str = <<EOS;
    <a>(^E). De andere soort is die welke de contacten en transacties
    tussen patiënten reguleert.^R<vn>
    <al>
    <a>EN 1130b31-1131a.</a>
    </al>
    </vn>
    EOS

    print encode_entities( $str);

    gives me different output:

    &lt;a&gt;(^E). De andere soort is die welke de contacten en transacties
    tussen pati&euml;nten reguleert.^R&lt;vn&gt;
    &lt;al&gt;
    &lt;a&gt;EN 1130b31-1131a.&lt;/a&gt;
    &lt;/al&gt;
    &lt;/vn&gt;
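
One possible reading of the difference, assuming the pasted script stored ë as the single Latin-1 byte 0xEB while test.xml stores it as the two UTF-8 bytes 0xC3 0xAB: encode_entities() treats every byte of an undecoded string as its own character. A minimal sketch of that effect (the variable names are illustrative only):

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# "ë" as it would appear in a Latin-1 encoded script: one byte.
my $latin1_bytes = "pati\xEBnten";

# "ë" as it appears in a UTF-8 encoded file: two bytes, 0xC3 0xAB.
my $utf8_bytes = "pati\xC3\xABnten";

# Without decoding, each byte is treated as one Latin-1 character,
# so 0xC3 becomes &Atilde; and 0xAB becomes &laquo;.
my $from_latin1 = encode_entities($latin1_bytes);
my $from_utf8   = encode_entities($utf8_bytes);

print "$from_latin1\n";   # pati&euml;nten
print "$from_utf8\n";     # pati&Atilde;&laquo;nten
```

If that assumption holds, the "strange results" are the UTF-8 bytes of each character encoded one by one.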

    Anno
    Anno Siegel, Jun 27, 2005
    #2

  3. Crap

    Crap Guest

    Thanks for the reply, however I am still lost.
    What I want is a conversion from UTF-8 to the corresponding named
    character entities (ë --> &euml;). But I get really funny
    characters. Below is my script, test.xml, and output.xml

    command line: perl utf.pl test.xml


    --utf.pl--
    #!/usr/bin/perl -w

    use HTML::Entities;

    open OUT, ">output.xml";

    while(<>){
    $string=(encode_entities($_));
    print OUT $string;
    }

    close OUT;


    ---test.xml---
    <a>patiënten reguleert.’<vn>
    <al>
    <a>EN 1130b31-1131a.</a>
    </al>
    </vn>

    ---output.xml---
    <a>pati&Atilde;&laquo;nten
    reguleert.&acirc;€™<vn>
    <al>
    <a>EN 1130b31-1131a.</a>
    </al>
    </vn>

    ---expected.xml---
    <a>pati&euml;nten reguleert.&rsquo;<vn>
    <al>
    <a>EN 1130b31-1131a.</a>
    </al>
    </vn>

    Thanks,
    Chris
    Crap, Jun 29, 2005
    #3
  4. John Bokma

    John Bokma Guest

    -spam.invalid (Crap) wrote:

    > Thanks for the reply, however I am still lost.
    > What I want is a conversion from UTF-8 to the corresponding named
    > character entities (ë --> &euml;). But I get really funny
    > characters. Below is my script, test.xml, and output.xml


    Question: why do you want this? Just give your output.xml a proper
    character encoding, and you can use ë.
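
A rough sketch of that idea, assuming the text is read as UTF-8 and written back as UTF-8 under an XML declaration that names the encoding. The helper name utf8_xml_bytes is made up for this example, and only the core Encode module is used:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: prepend an XML declaration that states the
# encoding, then serialise the characters as UTF-8 bytes.
sub utf8_xml_bytes {
    my ($text) = @_;
    my $doc = qq{<?xml version="1.0" encoding="UTF-8"?>\n} . $text;
    return encode('UTF-8', $doc);
}

# Bytes as they would be read from a UTF-8 file, decoded to characters.
my $chars = decode('UTF-8', "<a>pati\xC3\xABnten</a>\n");

# The result still contains a literal ë, stored as 0xC3 0xAB.
my $bytes = utf8_xml_bytes($chars);
print $bytes;
```

With the encoding declared, an XML parser can read the ë directly and no entities are needed.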

    --
    John
    Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
    John Bokma, Jun 29, 2005
    #4
  5. RedGrittyBrick

    RedGrittyBrick Guest

    Crap wrote:
    > Thanks for the reply, however I am still lost.
    > What I want is a conversion from UTF-8 to the corresponding named
    > character entities (ë --> &euml;). But I get really funny
    > characters. Below is my script, test.xml, and output.xml
    >
    > command line: perl utf.pl test.xml
    >
    >
    > --utf.pl--
    > #!/usr/bin/perl -w
    >
    > use HTML::Entities;
    >
    > open OUT, ">output.xml";
    >
    > while(<>){
    > $string=(encode_entities($_));
    > print OUT $string;
    > }
    >
    > close OUT;
    >
    >
    > ---test.xml---
    > <a>patiënten reguleert.’<vn>
    > <al>
    > <a>EN 1130b31-1131a.</a>
    > </al>
    > </vn>
    >
    > ---output.xml---
    > <a>pati&Atilde;&laquo;nten
    > reguleert.&acirc;€™<vn>
    > <al>
    > <a>EN 1130b31-1131a.</a>
    > </al>
    > </vn>
    >
    > ---expected.xml---
    > <a>pati&euml;nten reguleert.&rsquo;<vn>
    > <al>
    > <a>EN 1130b31-1131a.</a>
    > </al>
    > </vn>


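    Your problem is that encode_entities() works on characters, but your
    script hands it the raw UTF-8 bytes of the file, so each byte of a
    multi-byte character is encoded as a separate Latin-1 character.
    Decode the input first:


    #!perl
    use strict;
    use warnings;
    use Encode;
    use HTML::Entities;

    open my $out, '>', 'output.xml'
        or die "Unable to open output.xml because $!";
    while (<>) {
        # Turn the UTF-8 bytes into Perl characters before encoding.
        my $string = decode("utf8", $_);
        # Only encode non-ASCII characters, so the XML tags stay intact.
        print $out encode_entities($string, '^\x00-\x7F');
    }
    close $out or die "Unable to close output.xml because $!";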
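
The original question also asked for the way back to UTF-8. A minimal sketch of both directions, assuming decode_entities() from HTML::Entities is acceptable for the return trip. The sub names utf8_to_entities and entities_to_utf8 are invented for this example, and the forward direction only decodes the input and limits encoding to non-ASCII characters so that the tags survive:

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(encode_entities decode_entities);

# UTF-8 bytes in, ASCII text with named (or numeric) entities out.
sub utf8_to_entities {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);
    # The second argument lists the characters to encode:
    # here everything outside the ASCII range.
    return encode_entities($chars, '^\x00-\x7F');
}

# Text with entities in, UTF-8 bytes out.
sub entities_to_utf8 {
    my ($text) = @_;
    return encode('UTF-8', decode_entities($text));
}

# One line of test.xml as raw UTF-8 bytes (ë and ’ spelled out).
my $in   = "<a>pati\xC3\xABnten reguleert.\xE2\x80\x99<vn>\n";
my $out  = utf8_to_entities($in);
my $back = entities_to_utf8($out);

print $out;   # <a>pati&euml;nten reguleert.&rsquo;<vn>
```

Note that decode_entities() also expands &amp;, &lt; and &gt;, so a file that contains those escapes would need them protected before the return trip.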
    RedGrittyBrick, Jun 30, 2005
    #5