Converting UTF-8 characters to &#xxx;

Hemant Shah

Folks,

I need to convert UTF-8 characters into their ordinal numbers (&#xxx;).
Is there a module to do it, or do I have to write something?

How do I get started on it? I am new to Unicode encoding and I am still
trying to understand how UTF-8 characters are encoded.

Thanks.


--
Hemant Shah /"\ ASCII ribbon campaign
E-mail: (e-mail address removed) \ / ---------------------
X against HTML mail
TO REPLY, REMOVE NoJunkMail / \ and postings
FROM MY E-MAIL ADDRESS.
-----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
I haven't lost my mind, Above opinions are mine only.
it's backed up on tape somewhere. Others can have their own.
 
Ben Morrow

> I need to convert UTF-8 characters into their ordinal numbers (&#xxx;).
> Is there a module to do it, or do I have to write something?
>
> How do I get started on it? I am new to Unicode encoding and I am still
> trying to understand how UTF-8 characters are encoded.

Firstly, use Perl 5.8.

Next, read perldoc perluniintro. Basically, you don't need to worry
about how perl encodes its characters: you just make sure you mark each
data source correctly with its encoding, and perl'll handle the rest.

For finding ordinal numbers, perldoc -f ord.
For converting them to hex, perldoc -f sprintf.
For an easier way to do what you (probably) want to do, perldoc
PerlIO::encoding and perldoc Encode (the section on fallbacks).

Ben
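
The ord/sprintf route mentioned above can be sketched roughly as follows. This is only a minimal illustration, not code from the thread: the helper name to_char_refs is made up, and it assumes the input has already been decoded from UTF-8 into Perl characters and that ISO-8859-1 characters should be left alone.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): replace every character
# above U+00FF with a decimal numeric character reference and leave
# characters that fit in ISO-8859-1 untouched.
sub to_char_refs {
    my ($text) = @_;
    $text =~ s/([^\x{00}-\x{FF}])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

# UTF-8 bytes for U+65E5 U+672C, decoded into Perl characters first.
my $chars = decode('UTF-8', "\xE6\x97\xA5\xE6\x9C\xAC");

print to_char_refs($chars), "\n";    # prints &#26085;&#26412;
```

Swapping the format string for '&#x%X;' would give hexadecimal references instead of decimal ones.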
 
Hemant Shah

Ben Morrow said:
> Firstly, use Perl 5.8.

I am using Perl 5.8.

> Next, read perldoc perluniintro. Basically, you don't need to worry
> about how perl encodes its characters: you just make sure you mark each
> data source correctly with its encoding, and perl'll handle the rest.

I am not worried about how Perl stores the characters. This is to store
the characters in an ASCII format in the file.

Here is what we are trying to do. We will be translating our help/error
messages into Spanish, French, Japanese, etc.

I have written a Perl script that reads an English sentence from the
database, connects to our translation software and gets the sentence
translated (the translated text is in UTF-8 format). I want to store this
in a database or flat file in XML. This file
could contain English, Spanish, French and Japanese text, and I
want it to be in an 8-bit character set (ISO-8859-1). If I can convert
the Japanese characters into their ordinal numbers, I can store the text
in "&#xxx;" format. I would write the Perl script to convert the text
between UTF-8 and ordinal and back. Spanish and French characters can
be stored in the ISO-8859-1 character set without any problem using
the Encode module.


> For finding ordinal numbers, perldoc -f ord.
> For converting them to hex, perldoc -f sprintf.

I will take a look at the above docs.

Thanks.

> For an easier way to do what you (probably) want to do, perldoc
> PerlIO::encoding and perldoc Encode (the section on fallbacks).
>
> Ben
>
> --
> Joy and Woe are woven fine,
> A Clothing for the Soul divine William Blake
> Under every grief and pine 'Auguries of Innocence'
> Runs a joy with silken twine. (e-mail address removed)

 
Alan J. Flavell

> could contain English, Spanish, French and Japanese text, and I
> want it to be in an 8-bit character set (ISO-8859-1). If I can convert
> the Japanese characters into their ordinal numbers, I can store the text
> in "&#xxx;" format. I would write the Perl script to convert the text
> between UTF-8 and ordinal and back.

See the discussion here a few days ago. Subject was (unbelievable as
it might seem) "replace unicode characters by representation".

> Spanish and French characters can
> be stored in the ISO-8859-1 character set without any problem using
> the Encode module.

They can, indeed, but you said in the earlier part of your posting
that you want to use ASCII. Best be sure what it is that you want.

good luck

(And don't quote sigs, and other material not germane to your
followup. thanks.)
 
Hemant Shah

I looked at the thread, but I do not think it can deal with double-byte
characters.

> See the discussion here a few days ago. Subject was (unbelievable as
> it might seem) "replace unicode characters by representation".

Yes, that is what I am doing.

> They can, indeed, but you said in the earlier part of your posting
> that you want to use ASCII. Best be sure what it is that you want.
>
> good luck
>
> (And don't quote sigs, and other material not germane to your
> followup. thanks.)

I am new to this and still reading various docs, so please bear with me if
I miss obvious things. Maybe if I explain what I am trying to do,
someone will have a better solution than what I am thinking of.

We are trying to translate all of our help/error messages into other
languages, currently ES, FR and JA.

The translations come back to us in an XML file with UTF-8 encoding (Open
Office doc). I use XML::Parser to parse the file.

I need to take the translations of each sentence and store them in the same file
with #ifdef around them, and also store them in a DB2 database which is
using the ISO-8859-1 character set.

The flat file is also in XML format. Based on the specified language, our
pre-processor will extract the XML code for English and the specified language
from it.

The file is also controlled by RCS. To keep things simple in the flat file and
database, I am trying to convert everything to extended ASCII characters
(ISO-8859-1). ES and FR do not pose any problems; I am trying to figure out
how to store Japanese characters.

Example of the flat file:

#ifdef H5829
<?xml version='1.0' encoding='UTF-8'?>
<!-- **__**__**__**__**__**__**__**__**__**__**__**__**__**__**__** -->
<!-- Program: sent.1100 -->
<!-- Author: Name of the Author -->
<!-- Purpose: To describe content of sent 1100 -->
<!-- Project: H5829 -->
<!-- Version: XML 1.0 -->
<!-- Notes: -->
<!-- **_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*__** -->
<!DOCTYPE sentsource SYSTEM "sent">
<sentsource>
<comment mod = 'H5829'
author = 'myself'
date = '20020624'
type = 'doconly' >
Initial programming.
</comment>
<filekey>1100</filekey>
<xinfo type = 'EN1'>
<sentence>
A master record is not associated with this entry so the suspense
number entered will not be verified.
</sentence>
</xinfo>
#ifdef H3436
<xinfo type = 'ES1'>
<sentence>
Un registro maestro no se asocia a esta entrada así que el número del suspenso
incorporado no será
</sentence>
</xinfo>
#endif H3436
#ifdef H3906
<xinfo type = 'FR1'>
<sentence>
French translation goes here.
</sentence>
</xinfo>
#endif H3906
#ifdef H4906
<xinfo type = 'JA1'>
<sentence>
Japanese translation goes here. I am thinking of putting "&#xxx;" here.
</sentence>
</xinfo>
#endif H4906
</sentsource>
#endif H5829




Thanks for your help.
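
For the "and back" direction described above, a minimal sketch might look like this. The helper name refs_to_utf8 is hypothetical. The sketch assumes the stored text is ISO-8859-1 bytes with decimal &#NNN; references, and it deliberately expands only references above 255 so that markup-significant ones such as &#38; stay escaped.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the thread): take ISO-8859-1 bytes that
# contain decimal &#NNN; references and return the equivalent UTF-8 bytes.
sub refs_to_utf8 {
    my ($bytes) = @_;
    my $text = decode('iso-8859-1', $bytes);    # bytes -> characters
    # Expand only references that do not fit in ISO-8859-1; leave the
    # rest (for example &#38;) exactly as they were written.
    $text =~ s/&#(\d+);/$1 > 255 ? chr($1) : "&#$1;"/ge;
    return encode('UTF-8', $text);              # characters -> UTF-8 bytes
}

# "a<n-tilde>o" stored as ISO-8859-1, followed by a reference to U+65E5.
my $stored = "a\xF1o &#26085;";
my $utf8   = refs_to_utf8($stored);
```

An XML parser would normally resolve such references itself when the file is read, so a helper like this only matters when the text is handled as plain strings.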
 
Alan J. Flavell

> I looked at the thread, but I do not think it can deal with double-byte
> characters.

Perl (5.8 upwards) doesn't have "double byte characters", it has
"characters". How they are stored internally shouldn't concern you.

In other words, it's simpler than you imagine. But it can be helpful
to take a look at the complexity of what happens "under the covers" if
it helps to appreciate the simplicity of what you get on the surface.

> I need to take the translations of each sentence and store them in the same file
> with #ifdef around them, and also store them in a DB2 database which is
> using the ISO-8859-1 character set.

Uh-uh, so it really comes down to - not a Perl problem as such - but
dealing with a database that doesn't understand utf-8.

But yes, if you see any benefit in it, you _could_ retain iso-8859-1
characters as themselves, while turning non-iso-8859-1 characters into
their &#xxx; representations.

The catch here is that if you do something which implies to Perl that
you are going beyond iso-8859-1, then it will "upgrade" your data from
8-bit bytes to utf-8 characters, and so your iso-8859-1 characters
will then, internally, be two bytes wide.

Perhaps this will become clearer as you gain familiarity with the
contents of the perluniintro and perlunicode documentation - much of
which probably goes way beyond what you need, but parts of which are
critical to your purpose.

But maybe there's a module that packages this away and does the work
for you. I'm looking at this just at the character-representation
level at the moment, and responding on that basis. Maybe others (or
on a group dedicated to XML such as comp.lang.xml) can offer
more-practical insights into available solutions.

> The file is also controlled by RCS. To keep things simple in the flat file and
> database, I am trying to convert everything to extended ASCII characters
> (ISO-8859-1). ES and FR do not pose any problems; I am trying to figure out
> how to store Japanese characters.

Your plan to represent them as &#xxx; representations sounds OK to
me. Of course if you need to sort data, or process it in similar
ways, then you'll need to think carefully what you're doing.

hope this helps a bit.
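
The "upgrade" described above can be observed with a small experiment. This is only an illustrative sketch. It uses utf8::upgrade and utf8::is_utf8 to show the internal switch explicitly, whereas in real code the switch usually happens implicitly, for instance when a wide character is appended.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without turning on byte semantics

my $latin1 = "\xE9";        # e-acute, a single ISO-8859-1 character
my $copy   = $latin1;
utf8::upgrade($copy);       # force the internal UTF-8 representation

# Appending a character above U+00FF upgrades the string implicitly.
my $mixed = $latin1 . "\x{65E5}";

print length($copy), "\n";                      # 1: still one character
print bytes::length($copy), "\n";               # 2: two bytes internally
print $latin1 eq $copy ? "same\n" : "differ\n"; # same: the value is unchanged
print utf8::is_utf8($mixed) ? "upgraded\n" : "not upgraded\n";
```

The point is that the character-level view (length, eq) does not change. Only the internal storage does, which is why it should not normally matter until the data is written out without an explicit encoding.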
 
Ben Morrow

Alan J. Flavell said:
> Uh-uh, so it really comes down to - not a Perl problem as such - but
> dealing with a database that doesn't understand utf-8.
>
> But yes, if you see any benefit in it, you _could_ retain iso-8859-1
> characters as themselves, while turning non-iso-8859-1 characters into
> their &#xxx; representations.
>
> The catch here is that if you do something which implies to Perl that
> you are going beyond iso-8859-1, then it will "upgrade" your data from
> 8-bit bytes to utf-8 characters, and so your iso-8859-1 characters
> will then, internally, be two bytes wide.

The answer here is still to use Encode with FB_HTMLCREF: simply wrap all
calls to the database with subs that encode the data. You will have to
map & to &amp; or &#38; yourself.

I would say a good rule-of-thumb when dealing with 5.8 and Unicode is
'*never* read or write data to or from some external source without
running it through the Encode module'. Then you'll always know where you
stand.

Ben
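
A wrapper of the kind suggested above might be sketched like this. The sub name to_latin1_with_refs is invented for illustration and the database call itself is left out. The sketch assumes the text has already been decoded into Perl characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical wrapper (not from the thread): turn a Perl character
# string into ISO-8859-1 bytes, writing anything outside that character
# set as a decimal &#NNN; reference via the FB_HTMLCREF fallback.
sub to_latin1_with_refs {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # escape literal ampersands first
    return encode('iso-8859-1', $text, Encode::FB_HTMLCREF);
}

# UTF-8 input: "Caf<e-acute> & " followed by U+65E5 U+672C.
my $chars  = decode('UTF-8', "Caf\xC3\xA9 & \xE6\x97\xA5\xE6\x9C\xAC");
my $stored = to_latin1_with_refs($chars);
# $stored now holds the single byte 0xE9 for e-acute, "&amp;" for the
# ampersand, and "&#26085;&#26412;" for the two Japanese characters.
```

Escaping the ampersand before encoding matters: doing it afterwards would also mangle the references that the fallback has just produced.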
 
Alan J. Flavell

On Thu, 26 Feb 2004, Ben Morrow wrote:

[quoting ajf:]
> The answer here is still to use Encode with FB_HTMLCREF: simply wrap all
> calls to the database with subs that encode the data.

Looks to be excellent advice to me. Which was why I referred back to
the previous thread for details...

> You will have to map & to &amp; or &#38; yourself.

Good point.

> I would say a good rule-of-thumb when dealing with 5.8 and Unicode is
> '*never* read or write data to or from some external source without
> running it through the Encode module'.

Where "external" also includes the database that the hon Usenaut is
using, right?

> Then you'll always know where you stand.

Once the questioner is up to speed on dealing with the data internally
to Perl, sure.

all the best
 
