Converting UTF-8 characters to &#xxx;

Discussion in 'Perl Misc' started by Hemant Shah, Feb 25, 2004.

  1. Hemant Shah

    Hemant Shah Guest

    Folks,

    I need to convert UTF-8 characters into their ordinal numbers (&#xxx;).
    Is there a module to do it, or do I have to write something?

    How do I get started on it? I am new to Unicode encoding and I am still
    trying to understand how UTF-8 characters are encoded.

    Thanks.


    --
    Hemant Shah /"\ ASCII ribbon campaign
    E-mail: \ / ---------------------
    X against HTML mail
    TO REPLY, REMOVE NoJunkMail / \ and postings
    FROM MY E-MAIL ADDRESS.
    -----------------[DO NOT SEND UNSOLICITED BULK E-MAIL]------------------
    I haven't lost my mind, Above opinions are mine only.
    it's backed up on tape somewhere. Others can have their own.
    Hemant Shah, Feb 25, 2004
    #1

  2. Ben Morrow

    Ben Morrow Guest

    Hemant Shah wrote:
    > I need to convert UTF-8 characters into their ordinal numbers (&#xxx;).
    > Is there a module to do it or do I have to write something?
    >
    > How do I get started on it? I am new to Unicode encoding and I am still
    > trying to understand how UTF-8 characters are encoded.


    Firstly, use Perl 5.8.

    Next, read perldoc perluniintro. Basically, you don't need to worry
    about how perl encodes its characters: you just make sure you mark each
    data source correctly with its encoding, and perl'll handle the rest.

    For finding ordinal numbers, perldoc -f ord.
    For converting them to hex, perldoc -f sprintf.
    For an easier way to do what you (probably) want to do, perldoc
    PerlIO::encoding and perldoc Encode (the section on fallbacks).
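A minimal sketch of the ord/sprintf route, assuming the text is already a decoded Perl character string and that decimal references are wanted. The helper name to_ncr is made up for illustration; for hex references the format would be '&#x%X;' instead.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character above U+00FF with a decimal
# numeric character reference, leaving ISO-8859-1 characters untouched.
sub to_ncr {
    my ($text) = @_;
    $text =~ s/([^\x{00}-\x{FF}])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

# "\x{3042}" is HIRAGANA LETTER A; "\x{E9}" is e-acute, which fits in ISO-8859-1.
my $out = to_ncr("caf\x{E9} \x{3042}");
print $out eq "caf\x{E9} &#12354;" ? "ok\n" : "not ok\n";
```

The Encode fallbacks do the same job during encoding, so this is mainly useful for seeing what happens underneath.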

    Ben

    --
    Joy and Woe are woven fine,
    A Clothing for the Soul divine William Blake
    Under every grief and pine 'Auguries of Innocence'
    Runs a joy with silken twine.
    Ben Morrow, Feb 25, 2004
    #2

  3. Hemant Shah

    Hemant Shah Guest

    While stranded on information super highway Ben Morrow wrote:
    >
    > Hemant Shah wrote:
    >> I need to convert UTF-8 characters into their ordinal numbers (&#xxx;).
    >> Is there a module to do it or do I have to write something?
    >>
    >> How do I get started on it? I am new to Unicode encoding and I am still
    >> trying to understand how UTF-8 characters are encoded.

    >
    > Firstly, use Perl 5.8.


    I am using Perl 5.8.
    >
    > Next, read perldoc perluniintro. Basically, you don't need to worry
    > about how perl encodes its characters: you just make sure you mark each
    > data source correctly with its encoding, and perl'll handle the rest.


    I am not worried about how Perl stores the characters. The goal is to
    store the characters in an ASCII format in the file.

    Here is what we are trying to do. We will be translating our help/error
    messages into Spanish, French, Japanese, etc.

    I have written a Perl script that reads an English sentence from the
    database, connects to our translation software and gets the sentence
    translated (the translated text is in UTF-8). I want to store this
    in a database or in a flat file in XML. This file
    could contain English, Spanish, French and Japanese text, and I
    want it to be in an 8-bit character set (ISO-8859-1). If I can convert
    the Japanese characters into their ordinal numbers, I can store the text
    in "&#xxx;" format. I would write the Perl script to convert the text
    between UTF-8 and ordinal form and back. Spanish and French characters can
    be stored in the ISO-8859-1 character set without any problem using the
    Encode module.
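For the "and back" direction, a rough sketch of turning references into characters again. The helper name from_ncr is hypothetical, and it assumes the only references present are numeric ones. When the file is read with an XML parser this step is normally unnecessary, because the parser expands numeric character references itself.

```perl
use strict;
use warnings;

# Hypothetical helper: turn decimal (&#12354;) and hex (&#x3042;) references
# back into the characters they stand for.
sub from_ncr {
    my ($text) = @_;
    $text =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;   # hex form first
    $text =~ s/&#([0-9]+);/chr($1)/ge;               # then decimal form
    return $text;
}

my $back = from_ncr('a&#12354;b');
print length($back) == 3 ? "ok\n" : "not ok\n";
```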



    >
    > For finding ordinal numbers, perldoc -f ord.
    > For converting them to hex, perldoc -f sprintf.


    I will take a look at the above docs.

    Thanks.

    > For an easier way to do what you (probably) want to do, perldoc
    > PerlIO::encoding and perldoc Encode (the section on fallbacks).
    >
    > Ben
    >


    Hemant Shah, Feb 26, 2004
    #3
  4. On Thu, 26 Feb 2004, Hemant Shah wrote:

    > could contain english, spanish, french and japanese language and I
    > want it to be in 8-bit character set (ISO-8859-1). If I can convert
    > the japanese characters into the ordinal numbers I can store the text
    > in "&#xxx;" format. I would write the perl script to convert the text
    > between UTF-8 and ordinal and back.


    See the discussion here a few days ago. Subject was (unbelievable as
    it might seem) "replace unicode characters by representation".

    > Spanish and franch characters can
    > be stored in ISO-8859-1 characterset with out any problem using
    > Encode module.


    They can, indeed, but you said in the earlier part of your posting
    that you want to use ASCII. Best be sure what it is that you want.

    good luck

    (And don't quote sigs, and other material not germane to your
    followup. thanks.)
    Alan J. Flavell, Feb 26, 2004
    #4
  5. Hemant Shah

    Hemant Shah Guest

    While stranded on information super highway Alan J. Flavell wrote:
    > On Thu, 26 Feb 2004, Hemant Shah wrote:
    >
    >> could contain english, spanish, french and japanese language and I
    >> want it to be in 8-bit character set (ISO-8859-1). If I can convert
    >> the japanese characters into the ordinal numbers I can store the text
    >> in "&#xxx;" format. I would write the perl script to convert the text
    >> between UTF-8 and ordinal and back.


    I looked at the thread, but I do not think it can deal with double byte
    characters.

    >
    > See the discussion here a few days ago. Subject was (unbelievable as
    > it might seem) "replace unicode characters by representation".
    >
    >> Spanish and franch characters can
    >> be stored in ISO-8859-1 characterset with out any problem using
    >> Encode module.


    Yes, that is what I am doing.

    >
    > They can, indeed, but you said in the earlier part of your posting
    > that you want to use ASCII. Best be sure what it is that you want.
    >
    > good luck
    >
    > (And don't quote sigs, and other material not germane to your
    > followup. thanks.)


    I am new to this and still reading various docs, so please bear with me if
    I miss obvious things. Maybe if I explain what I am trying to do,
    someone will have a better solution than the one I am thinking of.

    We are trying to translate all of our help/error messages into other
    languages, currently ES, FR and JA.

    The translations come back to us in an XML file with UTF-8 encoding (an
    OpenOffice document). I use XML::Parser to parse the file.

    I need to take the translations of each sentence and store them in the same
    file with #ifdef around them, and also store them in a DB2 database that
    uses the ISO-8859-1 character set.

    The flat file is also in XML format. Based on the specified language, our
    pre-processor will extract the XML for English and the specified language
    from it.

    The file is also controlled by RCS. To keep things simple in the flat file and
    the database, I am trying to convert everything to extended ASCII characters
    (ISO-8859-1). ES and FR do not pose any problems; I am trying to figure out
    how to store Japanese characters.

    Example of the flat file:

    #ifdef H5829
    <?xml version='1.0' encoding='UTF-8'?>
    <!-- **__**__**__**__**__**__**__**__**__**__**__**__**__**__**__** -->
    <!-- Program: sent.1100 -->
    <!-- Author: Name of the Author -->
    <!-- Purpose: To describe content of sent 1100 -->
    <!-- Project: H5829 -->
    <!-- Version: XML 1.0 -->
    <!-- Notes: -->
    <!-- **_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*_*__** -->
    <!DOCTYPE sentsource SYSTEM "sent">
    <sentsource>
    <comment mod = 'H5829'
    author = 'myself'
    date = '20020624'
    type = 'doconly' >
    Initial programming.
    </comment>
    <filekey>1100</filekey>
    <xinfo type = 'EN1'>
    <sentence>
    A master record is not associated with this entry so the suspense
    number entered will not be verified.
    </sentence>
    </xinfo>
    #ifdef H3436
    <xinfo type = 'ES1'>
    <sentence>
    Un registro maestro no se asocia a esta entrada así que el número del suspenso
    incorporado no será
    </sentence>
    </xinfo>
    #endif H3436
    #ifdef H3906
    <xinfo type = 'FR1'>
    <sentence>
    French translation goes here.
    </sentence>
    </xinfo>
    #endif H3906
    #ifdef H4906
    <xinfo type = 'JA1'>
    <sentence>
    Japanese translation goes here. I am thinking of putting "&#xxx;" here.
    </sentence>
    </xinfo>
    #endif H4906
    </sentsource>
    #endif H5829




    Thanks for your help.
    Hemant Shah, Feb 26, 2004
    #5
  6. On Thu, 26 Feb 2004, Hemant Shah wrote:

    > I looked at the thread, but I do not think it can deal with double byte
    > characters.


    Perl (5.8 upwards) doesn't have "double byte characters", it has
    "characters". How they are stored internally shouldn't concern you.

    In other words, it's simpler than you imagine. But it can be helpful
    to take a look at the complexity of what happens "under the covers" if
    it helps to appreciate the simplicity of what you get on the surface.

    > I need to take the tranlsations of each sentence and store them in same file
    > with #ifdef around them, and also store them into a DB2 database which is
    > using ISO-8859-1 character set.


    Uh-uh, so it really comes down to - not a Perl problem as such - but
    dealing with a database that doesn't understand utf-8.

    But yes, if you see any benefit in it, you _could_ retain iso-8859-1
    characters as themselves, while turning non-iso-8859-1 characters into
    their &#xxx; representations.

    The catch here is that if you do something which implies to Perl that
    you are going beyond iso-8859-1, then it will "upgrade" your data from
    8-bit bytes to utf-8 characters, and so your iso-8859-1 characters
    will then, internally, be two bytes wide.
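A small sketch of that "upgrade", for anyone who wants to see it happen. It relies on utf8::is_utf8 and the bytes pragma, which expose internals and are meant for inspection only. Note that only characters above 0x7F grow to two bytes; plain ASCII stays at one.

```perl
use strict;
use warnings;

my $latin1 = "\xE9";                            # e-acute, stored as a single byte
my $before = utf8::is_utf8($latin1) ? 1 : 0;    # internal UTF-8 flag is off

# Concatenating a character above U+00FF forces the result into the
# internal UTF-8 representation.
my $mixed = $latin1 . "\x{3042}";
my $after = utf8::is_utf8($mixed) ? 1 : 0;      # flag is now on

my $chars = length($mixed);                     # still counts characters
my $bytes = do { use bytes; length($mixed) };   # internal storage: 2 + 3 bytes

print "flag before=$before after=$after chars=$chars bytes=$bytes\n";
```

The character count is the same either way, which is the sense in which the internal storage should not concern you.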

    Perhaps this will become clearer as you gain familiarity with the
    contents of the perluniintro and perlunicode documentation - much of
    which probably goes way beyond what you need, but parts of which are
    critical to your purpose.

    But maybe there's a module that packages this away and does the work
    for you. I'm looking at this just at the character-representation
    level at the moment, and responding on that basis. Maybe others (or
    on a group dedicated to XML such as comp.lang.xml) can offer
    more-practical insights into available solutions.

    > The file is also controled by RCS. To keep things simple in flat file and
    > database I am trying to convert everything to extended ASCII characters
    > (ISO-8859-1). ES and FR do not pose any problems, I am trying to figure out
    > how to store japanese characters.


    Your plan to represent them as &#xxx; representations sounds OK to
    me. Of course if you need to sort data, or process it in similar
    ways, then you'll need to think carefully what you're doing.

    hope this helps a bit.
    Alan J. Flavell, Feb 26, 2004
    #6
  7. On Thu, 26 Feb 2004, Alan J. Flavell wrote:

    > on a group dedicated to XML such as comp.lang.xml)


    Make that comp.text.xml - excuse me.
    Alan J. Flavell, Feb 26, 2004
    #7
  8. Ben Morrow

    Ben Morrow Guest

    "Alan J. Flavell" wrote:
    > Uh-uh, so it really comes down to - not a Perl problem as such - but
    > dealing with a database that doesn't understand utf-8.
    >
    > But yes, if you see any benefit in it, you _could_ retain iso-8859-1
    > characters as themselves, while turning non-iso-8859-1 characters into
    > their representations.
    >
    > The catch here is that if you do something which implies to Perl that
    > you are going beyond iso-8859-1, then it will "upgrade" your data from
    > 8-bit bytes to utf-8 characters, and so your iso-8859-1 characters
    > will then, internally, be two bytes wide.


    The answer here is still to use Encode with FB_HTMLCREF: simply wrap all
    calls to the database with subs that encode the data. You will have to
    map & to &amp; or &#38; yourself.
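Roughly like this, as a sketch. The wrapper name to_latin1_with_refs is made up, and it assumes the input is a decoded character string. Ampersands are escaped first so that the references Encode generates are not escaped a second time.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical wrapper: escape literal ampersands, then encode to ISO-8859-1,
# letting Encode substitute &#NNN; for anything the charset cannot represent.
sub to_latin1_with_refs {
    my ($text) = @_;          # work on a copy; encode() with a CHECK value
                              # may modify the string it is given
    $text =~ s/&/&amp;/g;
    return encode('iso-8859-1', $text, Encode::FB_HTMLCREF);
}

my $bytes = to_latin1_with_refs("A & \x{3042}");
print "$bytes\n";             # pure ASCII here, so safe to print as-is
```

Encode::FB_XMLCREF works the same way but emits hex references (&#xHHHH;) instead of decimal ones.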

    I would say a good rule-of-thumb when dealing with 5.8 and Unicode is
    '*never* read or write data to or from some external source without
    running it through the Encode module'. Then you'll always know where you
    stand.
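The input half of that rule can be sketched with a PerlIO encoding layer. The sample bytes are an assumption standing in for data from a file or the translation service, and an in-memory filehandle keeps the example self-contained.

```perl
use strict;
use warnings;

# Assumed input: the UTF-8 bytes for U+3042 followed by a newline.
my $raw = "\xE3\x81\x82\n";

# The :encoding(UTF-8) layer decodes bytes into characters as they are read.
open(my $in, '<:encoding(UTF-8)', \$raw) or die "cannot open: $!";
my $line = <$in>;
close($in);
chomp($line);

my $count = length($line);    # characters, not bytes
my $code  = ord($line);       # code point of the first character
printf "%d character(s), first is U+%04X\n", $count, $code;
```

Without the layer, the same read would give three separate bytes and ord would return 0xE3.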

    Ben

    --
    It will be seen that the Erwhonians are a meek and long-suffering people,
    easily led by the nose, and quick to offer up common sense at the shrine of
    logic, when a philosopher convinces them that their institutions are not based
    on the strictest morality. [Samuel Butler, paraphrased]
    Ben Morrow, Feb 26, 2004
    #8
  9. On Thu, 26 Feb 2004, Ben Morrow wrote:

    [quoting ajf:]
    > > The catch here is that if you do something which implies to Perl that
    > > you are going beyond iso-8859-1, then it will "upgrade" your data from
    > > 8-bit bytes to utf-8 characters, and so your iso-8859-1 characters
    > > will then, internally, be two bytes wide.

    >
    > The answer here is still to use Encode with FB_HTMLCREF: simply wrap all
    > calls to the database with subs that encode the data.


    Looks to be excellent advice to me. Which was why I referred back to
    the previous thread for details...

    > You will have to map & to &amp; or &#38; yourself.


    Good point.

    > I would say a good rule-of-thumb when dealing with 5.8 and Unicode is
    > '*never* read or write data to or from some external source without
    > running it through the Encode module'.


    Where "external" also includes the database that the hon Usenaut is
    using, right?

    > Then you'll always know where you stand.


    Once the questioner is up to speed on dealing with the data internally
    to Perl, sure.

    all the best
    Alan J. Flavell, Feb 26, 2004
    #9