XML::Parser, UTF-8, and Japanese characters

Hemant Shah

Folks,

I am having a problem writing Japanese characters.

I am parsing an XML document that is in UTF-8; it is actually a
content.xml file from OpenOffice. It contains Japanese text along
with English text (English text and its Japanese translation).

I want to write the English and Japanese text into individual
files.

Another process will read these individual files and insert them
into a DB2 database, which is also in UTF-8.

I am having a problem writing the Japanese text to a file.

I am running Perl 5.8.3 on AIX 5.2.

Here are the code fragments from my script:


use Encode;
use encoding 'utf8', STDOUT => 'utf8', STDIN => 'utf8';
use XML::Parser;


$ContentParser = new XML::Parser(Handlers => {Start   => \&HandleContentStart,
                                              End     => \&HandleContentEnd,
                                              Default => \&DefaultContentHandler,
                                              Char    => \&HandleContentChar});

$ContentParser->parsefile("content.xml", ProtocolEncoding => 'UTF-8');



# In HandleContentChar() subroutine
open (TEMPFILE, ">:encoding(utf8)", $TmpFile) ||
die "Cannot open temporary file for write $TmpFile. $!";

# Code to print XML tags

print TEMPFILE "$JapaneseText";

# Code to print XML tags


close(TEMPFILE);


When I compare the Japanese text in the content.xml file with the text in
$TmpFile (using a hex dump), they are different.



Also, is there a way to split the Japanese text at Unicode character
boundaries? I would like to store lines of 100 (single-byte) characters or
less per line. I do not have any problem with English and Spanish text,
but Japanese characters are double-byte, so I would like to split the
line at 50 Japanese characters.


Thanks in advance.

Ben Morrow

Quoth (e-mail address removed):
> I am having a problem writing Japanese characters.
>
> I am parsing an XML document that is in UTF-8; it is actually a
> content.xml file from OpenOffice. It contains Japanese text along
> with English text (English text and its Japanese translation).
>
> I want to write the English and Japanese text into individual
> files.
>
> Another process will read these individual files and insert them
> into a DB2 database, which is also in UTF-8.
>
> I am having a problem writing the Japanese text to a file.
>
> I am running Perl 5.8.3 on AIX 5.2.

That's a good start...
> Here are the code fragments from my script:

> use Encode;
> use encoding 'utf8', STDOUT => 'utf8', STDIN => 'utf8';

I would have explicitly binmoded the FHs, for clarity, but hey...
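For instance, a minimal sketch of that suggestion (my illustration, untested):

# Set the UTF-8 encoding layer on the standard handles explicitly,
# rather than relying on the encoding pragma's import-time side effects.
binmode STDIN,  ':encoding(utf8)';
binmode STDOUT, ':encoding(utf8)';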
> use XML::Parser;


> $ContentParser = new XML::Parser(Handlers => {Start   => \&HandleContentStart,
>                                               End     => \&HandleContentEnd,
>                                               Default => \&DefaultContentHandler,
>                                               Char    => \&HandleContentChar});
>
> $ContentParser->parsefile("content.xml", ProtocolEncoding => 'UTF-8');



> # In HandleContentChar() subroutine
> open (TEMPFILE, ">:encoding(utf8)", $TmpFile) ||
>     die "Cannot open temporary file for write $TmpFile. $!";

Use lexical filehandles.
Use low-precedence operators to avoid brackets:

open my $TEMPFILE, '>:encoding(utf8)', $TmpFile or die ...;

> # Code to print XML tags
>
> print TEMPFILE "$JapaneseText";

Don't quote unnecessarily.
> # Code to print XML tags
>
> close(TEMPFILE);
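Putting those suggestions together, the file-writing part of the handler
might look something like this (a sketch only, untested, reusing the OP's
$TmpFile and $JapaneseText names):

open my $tempfile, '>:encoding(utf8)', $TmpFile
    or die "Cannot open temporary file for write $TmpFile: $!";

# ... print the opening XML tags here ...

print $tempfile $JapaneseText;

# ... print the closing XML tags here ...

close $tempfile or die "Cannot close $TmpFile: $!";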

> When I compare the Japanese text in the content.xml file with the text in
> $TmpFile (using a hex dump), they are different.

How are they different? Are they equivalent representations of the text
(I don't know if there are any non-canonical representations for
Japanese)? Can you give some examples of input and output text?
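One straightforward way to produce such examples (my suggestion, not
something from the original post) is to dump both the codepoints and the
UTF-8 bytes of the string the Char handler receives:

use Encode qw(encode);

# $JapaneseText stands for the string seen inside HandleContentChar().
printf "codepoints: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $JapaneseText;
printf "utf-8 bytes: %s\n",
    unpack 'H*', encode('UTF-8', $JapaneseText);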
> Also, is there a way to split the Japanese text at Unicode character
> boundaries? I would like to store lines of 100 (single-byte) characters or
> less per line. I do not have any problem with English and Spanish text,
> but Japanese characters are double-byte,

No, they aren't. Most Japanese characters require 3 bytes in the UTF-8
encoding, and all accented Spanish characters will require at least two.
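A quick check of that, assuming the script itself is saved as UTF-8 (my
illustration, not part of the original reply):

use utf8;                   # the string literals below are UTF-8 in the source
use Encode qw(encode);

print length(encode('UTF-8', '日')), "\n";   # 3 bytes for a typical kanji
print length(encode('UTF-8', 'é')), "\n";    # 2 bytes for an accented letter
print length(encode('UTF-8', 'a')), "\n";    # 1 byte for plain ASCII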
> so I would like to split the line at 50 Japanese characters.

What do you actually mean here? You claim not to mean 100 bytes/line,
but I suspect that might be what you actually want (if this is for some
program with a line-length limitation). Otherwise, do you mean 100
Unicode codepoints (100 complete utf8 sequences), 100 graphemes
(sequences like {LATIN SMALL LETTER A}{COMBINING ACUTE ACCENT}
which, while two Unicode codepoints, display as one character) or 100
(displayed) columns? These can be done by:

$string =~ s/(.{100})/$1\n/g; # CHARS (CODEPOINTS)

$string =~ s/(\X{100})/$1\n/g; # GRAPHEMES (COMBINING SEQUENCES)

'bytes' and 'columns' are slightly harder, and I can't see an easy way
to do them with a regex:

# BYTES

{
    my $newstring = '';
    my $width     = 0;

    for (split //, $string) {
        my $len = do { use bytes; length };   # byte length of this character
        if ($width + $len > 100) {            # it would push the line past 100 bytes,
            $newstring .= "\n";               # so start a new line first
            $width = 0;
        }
        $newstring .= $_;
        $width += $len;
    }

    $string = $newstring;
}

# COLUMNS (taking CJK full-width forms into account)

use Unicode::EastAsianWidth; # install from CPAN

{
    my $newstring = '';
    my $width     = 0;

    for (split //, $string) {
        my $cols = 0;
        /\p{IsPrint}/ and $cols = /\p{InFullwidth}/ ? 2 : 1;
        # There is a bug here: it doesn't deal correctly with
        # printing-but-not-spacing characters (like combining accents).

        if ($width + $cols > 100) {   # it would push the line past 100 columns,
            $newstring .= "\n";       # so start a new line first
            $width = 0;
        }
        $newstring .= $_;
        $width += $cols;
    }

    $string = $newstring;
}

<none of the above tested>. You will need to read the docs for
Unicode::EastAsianWidth if you use it: I don't fully understand what it
says about 'ambiguous width' characters, knowing very little about CJK
writing.
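As a rough sanity check on the column version (my own untested sketch;
display_width() is a made-up helper name, not anything provided by
Unicode::EastAsianWidth):

use Unicode::EastAsianWidth;

# Count display columns the same way the COLUMNS block above does.
sub display_width {
    my ($line) = @_;
    my $cols = 0;
    for (split //, $line) {
        /\p{IsPrint}/ and $cols += /\p{InFullwidth}/ ? 2 : 1;
    }
    return $cols;
}

# After running the COLUMNS block on $string:
for my $line (split /\n/, $string) {
    warn "line is wider than 100 columns\n" if display_width($line) > 100;
}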

Ben
 
