xml::twig - writing utf-8

miletwo

I'm trying to read an XML file and rewrite it as RSS using the following
script. The problem is that the output is not UTF-8 no matter what I do.
Any help appreciated.

***********************
#!/bin/perl -w
#use strict;
use XML::Twig;
use utf8;

use open OUT => ":utf8";
use open IN => ":utf8";

my $shownum = 10;
my $thisyear = '2006';
my $field= 'releasedate';
my $twig= new XML::Twig( keep_encoding=> 1);

open(INFILE, "directorylist.xml");
$twig->parse(\*INFILE);

my $root= $twig->root;
my @releases= $root->children;

my $output = "";

$output .= '<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/">' . "\n";
$output .= '<channel>' . "\n\n";
$output .= <<EOT;
<title>scrubbed Incorporated - Recent News</title>
<link>http://www.scrubbed.com/press/</link>
<description>Visit the scrubbed Press Center where you will find
many resources, including press releases, corporate information,
technology overviews, executive bios and photos, the scrubbed logo and
more.<br />If you are a member of the media and are not able to find
what you are looking for in the Press Center, please send an email to
corpcomm\@scrubbed.com.</description>
<language>en-us</language>

EOT

for(my $i=0; $i < $shownum; $i++){
$output .= "\t" . '<item>' . "\n";
$output .= "\t\t" . '<title>' .
$releases[$i]->first_child('headline')->text . '</title>' . "\n";
$output .= "\t\t" . '<link>http://www.scrubbed.com/press/releases/' .
$thisyear . '/' . $releases[$i]->att('name') . '.html</link>' . "\n";
$output .= "\t\t" . '<description>' .
$releases[$i]->first_child('subheader')->text . '</description>' .
"\n";
$output .= "\t\t" . '<dc:date>' .
$releases[$i]->first_child('releasedate')->text . '</dc:date>' . "\n";
$output .= "\t" . '</item>';
$output .= "\n\n";
}

$output .= "</channel>\n</rss>";
Encode::_utf8_on($output);

open(FILEWRITE,">:utf8", "press.rss");
binmode FILEWRITE, ":utf8";
print FILEWRITE $output;
 
Peter J. Holzer

> I'm trying to read an XML file and rewrite it as RSS using the following
> script. The problem is that the output is not UTF-8 no matter what I do.
> Any help appreciated.

Your script works for me. Please provide a complete example that
demonstrates the error. Your script tries to read a file named
directorylist.xml, but you didn't provide that file. I had to read your
script to find out what that file should contain, and write one myself.
Maybe there is an error in your input file.

Also you didn't provide any information about the system you are using.
I tested it with Debian Sarge (perl 5.8.4, XML::Twig 3.17).
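
One way to check what actually ended up in an output file is to read it as raw bytes and try a strict decode. This is only a minimal sketch using the core Encode module; the helper names is_valid_utf8 and file_is_utf8 are made up for illustration, and the two sample strings are hypothetical data. Note that pure ASCII also counts as valid UTF-8, so the check only says something when the file contains non-ASCII characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Return 1 if the given byte string is well-formed UTF-8, 0 otherwise.
# decode() may modify its argument when a CHECK flag is passed,
# so it is given the local copy in $bytes.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

# Slurp a file without any decoding layer and test its bytes.
sub file_is_utf8 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;
    my $bytes = <$fh>;
    close $fh;
    return is_valid_utf8($bytes);
}

# Hypothetical samples: the registered sign as UTF-8 (two bytes)
# and as Latin-1 (one byte).
my $utf8_bytes   = "OmniTRACS\xC2\xAE";
my $latin1_bytes = "OmniTRACS\xAE";

printf "UTF-8 sample:   %s\n", is_valid_utf8($utf8_bytes)   ? "valid" : "invalid";
printf "Latin-1 sample: %s\n", is_valid_utf8($latin1_bytes) ? "valid" : "invalid";
```

Calling file_is_utf8('press.rss') after a run would apply the same check to the generated feed.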

hp
 
Michel Rodriguez

> I'm trying to read an XML file and rewrite it as RSS using the following
> script. The problem is that the output is not UTF-8 no matter what I do.
> Any help appreciated.

Whaouh! You sure want to make sure you get UTF-8 on output! Except of
course that the keep_encoding option tells XML::Twig to output the same
encoding as you got in the input (which you did not show us, as
mentioned by the previous poster).

If you want to output utf-8, the best way is NOT to do anything: by
default the parser will convert anything into utf-8, and the output will
be in that encoding.

Did you try your code without the various utf8-related instructions
peppered through it? What was the result?
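
For a script that builds its output string by hand, one way to follow this advice might look like the sketch below: parse without keep_encoding, and put a single encoding layer on the output handle. The inline XML and the file name press-sketch.rss are made up for illustration; this is not the original script, just a small example under those assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input standing in for directorylist.xml;
# \xC2\xAE is the UTF-8 byte sequence for the registered sign.
my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
        . qq{<directory><file name="a"><headline>OmniTRACS\xC2\xAE</headline></file></directory>};

# No keep_encoding: the text comes back as a decoded character string.
my $twig = XML::Twig->new;
$twig->parse($xml);
my $headline = $twig->root->first_child('file')->first_child('headline')->text;

# One encoding layer on the handle turns characters into UTF-8 bytes.
# No "use utf8", no "use open", no Encode::_utf8_on are involved.
my $outfile = 'press-sketch.rss';    # hypothetical file name
open my $out, '>:encoding(UTF-8)', $outfile or die "Cannot write $outfile: $!";
print {$out} "<title>$headline</title>\n";
close $out or die "Cannot close $outfile: $!";
```

As far as I understand it, combining keep_encoding => 1 with an encoding layer on the handle is what risks encoding the same text twice, since the text then arrives as raw bytes rather than characters.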
 
miletwo

Here's directorylist.xml. I'm on Mac OS X but also tried running this on
my Solaris box and it does the same thing. I've also tried it with and
without keep_encoding, so I don't "think" that's it.

Thanks for replies.
<?xml version="1.0" encoding="UTF-8"?>
<directory>
<file name="060525_brings_custom_user">
<releasedate>05-25-2006</releasedate>
<releasetime>04:30 AM</releasetime>
<timezone>America/Los_Angeles</timezone>
<headline><![CDATA[XXSCRUBBEDXX Brings Custom User-Interface
Capabilities to U.S. Cellular's easyedgeSM with the uiOne
Solution]]></headline>
<subheader><![CDATA[]]></subheader>
<division>Corp, QIS</division>
<categories></categories>
<document></document>
<exclude></exclude>
</file>
<file name="060524_initiates_patent_infringement">
<releasedate>05-24-2006</releasedate>
<releasetime>04:30 AM</releasetime>
<timezone>America/Los_Angeles</timezone>
<headline><![CDATA[XXSCRUBBEDXX Initiates Patent Infringement
Proceedings in the UK against Nokia]]></headline>
<subheader><![CDATA[]]></subheader>
<division>Corp</division>
<categories></categories>
<document></document>
<exclude></exclude>
</file>
<file name="060518_takes_XXSCRUBBEDXX_2006">
<releasedate>05-18-2006</releasedate>
<releasetime>04:30 AM</releasetime>
<timezone>America/Los_Angeles</timezone>
<headline><![CDATA[XXSCRUBBEDXX Takes XXSCRUBBEDXX 2006 to the
Next Level with Addition of Telecom Italia and XXSCRUBBEDXX to an
Already Impressive XXSCRUBBEDXX 2006 Conference Agenda]]></headline>
<subheader><![CDATA[Premiere Players in the Industry Showcase
Advanced Data Capabilities at XXSCRUBBEDXX 2006 Conference in San Diego
May 31-June 2]]></subheader>
<division>Corp, QIS</division>
<categories></categories>
<document></document>
<exclude></exclude>
</file>
<file name="060518_averitt_selects_omnitracs">
<releasedate>05-18-2006</releasedate>
<releasetime>04:30 AM</releasetime>
<timezone>America/Los_Angeles</timezone>
<headline><![CDATA[AVERITT Selects XXSCRUBBEDXX's OmniTRACS®
and OmniExpress® Mobile Communication Systems for Entire Fleet and
Service Centers]]></headline>
<subheader><![CDATA[Leading Freight and Supply Chain Management
Provider with International Reach One of First to Implement End-to-End
Solution for Improved Fleet Communications]]></subheader>
<division>Corp, QWBS</division>
<categories></categories>
<document></document>
<exclude></exclude>
</file>
<file name="060517_clears_up_misunderstandings">
<releasedate>05-17-2006</releasedate>
<releasetime>12:36 PM</releasetime>
<timezone>America/Los_Angeles</timezone>
<headline><![CDATA[XXSCRUBBEDXX Clears Up Misunderstandings
Regarding the ITC Staff Attorney Briefing]]></headline>
<subheader><![CDATA[]]></subheader>
<division>Corp</division>
<categories></categories>
<document></document>
<exclude></exclude>
</file>
<file name="060512_hospital_democratic_republic">
<releasedate>05-12-2006</releasedate>
<releasetime>04:30 AM</releasetime>
<timezone>America/Los_Angeles</timezone>
<headline><![CDATA[Hospital in the Democratic Republic of Congo to
Be Outfitted with CDMA2000 1xEV-DO to Help Improve Healthcare in
Africa]]></headline>
<subheader><![CDATA[XXSCRUBBEDXX Pledges Donation and Technology
to the Dikembe Mutombo Foundation, First Hospital Built in the Congo in
Nearly 40 Years]]></subheader>
<division>Corp</division>
<categories></categories>
<document></document>
<exclude></exclude>
</file>
<file name="060509_british_sky_broadcasting">
<releasedate>05-09-2006</releasedate>
<releasetime>04:30 AM</releasetime>
<timezone>America/Los_Angeles</timezone>
<headline><![CDATA[XXSCRUBBEDXX and British Sky Broadcasting
Announce Intent to Conduct XXSCRUBBEDXX™ Technology Trial in United
Kingdom]]></headline>
<subheader><![CDATA[Joint Exercise Expected to be Europe's First
Technical Trial of Open, Network-Agnostic FLO Technology]]></subheader>
<division>Corp</division>
<categories></categories>
<document></document>
<exclude></exclude>
</file>
<file name="060509_application_downloads_XXSCRUBBEDXX">
<releasedate>05-09-2006</releasedate>
<releasetime>04:30 AM</releasetime>
<timezone>America/Los_Angeles</timezone>
<headline><![CDATA[Application Downloads with XXSCRUBBEDXX's
XXSCRUBBEDXX® Solution Surpass Three Million in Thailand on Hutch's
Advanced CDMA2000 1X Network]]></headline>
<subheader><![CDATA[Active Hutchison CAT Customers Have Downloaded
an Average of 10 Applications Each Since XXSCRUBBEDXX Launched, Numbers
Continue to Grow]]></subheader>
<division>Corp, QIS</division>
<categories></categories>
<document></document>
<exclude></exclude>
</file>
</directory>
 
Peter J. Holzer

> Here's directorylist.xml. I'm on Mac OS X but also tried running this on
> my Solaris box and it does the same thing. I've also tried it with and
> without keep_encoding, so I don't "think" that's it.

This file contains only 8 <file/> elements. Your script crashes with

Can't call method "first_child" on an undefined value at ./miletwo line 40.

if there are fewer than 10 children of the root element, before it even
opens the output file. So with this file, your script doesn't write
anything. How do you determine whether a non-existent file is UTF-8 or
not?
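
A guard on the loop bound would avoid that crash. A minimal sketch, with plain strings standing in for the elements that $root->children would return (the stand-in data and the variable $last are assumptions for illustration, not part of the original script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $shownum = 10;

# Stand-in for the element list; eight entries mirror the
# eight <file/> elements in the posted input.
my @releases = map { "release $_" } 1 .. 8;

# Use whichever is smaller: the requested count or the number available.
my $last = $shownum < @releases ? $shownum : scalar @releases;

# 0 .. $last - 1 is an empty range when there are no releases at all,
# so the loop body never sees an undefined element.
for my $i (0 .. $last - 1) {
    print "item $i: $releases[$i]\n";
}
```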

hp
 
