UTF8 to Unicode conversion

Discussion in 'Perl' started by Spamtrap, Jul 30, 2004.

  1. Spamtrap

    Spamtrap Guest

    I only work in Perl occasionally, and have been searching for a
    solution for a conversion, and everything I found seems much too
    complex.

    All I need to do is take a simple text file [that had been created
    from a Perl script] and copy it, however some specific lines are in
    fact in UTF8 as printed garbagy characters and they need to be
    converted to Unicode, so that the new text file can be imported into a
    desktop program and into some Word documents.

    For the moment [if it makes it easier] I would be happy to get a
    solution for most European languages, and could skip things like
    Russian and Chinese till later

    EspaÃ±ol - for example should convert to Español
     
    Spamtrap, Jul 30, 2004
    #1

  2. Spamtrap wrote:
    > All I need to do is take a simple text file [that had been created
    > from a Perl script] and copy it, however some specific lines are in
    > fact in UTF8 as printed garbagy characters and they need to be
    > converted to Unicode,


    Sorry, but this doesn't make sense. Do you know what the U in UTF stands
    for? You already have Unicode!

    > so that the new text file can be imported into a
    > desktop program and into some Word documents.


    When you mention Word I'm guessing that you are using some version of
    Windows? Are you still running Windows 98 or so? I'm asking because any
    somewhat newer Microsoft OS as well as Word can handle Unicode (and thus
    UTF8) just fine. Actually Microsoft is one of the major proponents of
    Unicode.

    > For the moment [if it makes it easier] I would be happy to get a
    > solution for most European languages, and could skip things like
    > Russian and Chinese till later
    >
    > EspaÃ±ol - for example


    If this is really what you see when opening the file in your program,
    then it is far more likely that the program believes the file is in
    ISO-8859-1 or Windows-1252 (the "ANSI" code page). If the program assumed
    UTF16 or UTF32, the text would be displayed very differently.

    > should convert to Español


    But this is not encoded in Unicode (in whatever transfer format) but in
    ISO-8859-1 as your header clearly says:
    Content-Type: text/plain; charset=ISO-8859-1

    You seem to be quite confused; nevertheless, I suggest looking at the
    Text::Iconv module.

    jue
     
    Jürgen Exner, Jul 30, 2004
    #2

  3. Spamtrap

    Joe Smith Guest

    Spamtrap wrote:

    > All I need to do is take a simple text file [that had been created
    > from a Perl script] and copy it, however some specific lines are in
    > fact in UTF8 as printed garbagy characters and they need to be
    > converted to Unicode, so that the new text file can be imported into a
    > desktop program and into some Word documents.


    UTF8 *is* Unicode.
    Some programs that deal with UTF16 or ISO-8859-1 need to be told that
    the file is encoded in UTF8.

    > EspaÃ±ol - for example should convert to Español


    That's what happens when a file is in UTF8 but the program reading
    the file thinks it is ISO-8859-1. You'll need to either mark the file
    in some way so that programs recognize it as UTF8, or use an option
    in the program to force it to process the input as UTF8.
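
    To see where the garbage comes from, here is a minimal sketch using the
    core Encode module. The variable names are made up for illustration; the
    point is only that the same bytes decode differently depending on which
    charset the reader assumes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "Espa\x{F1}ol";                # "Español" as Perl characters
my $bytes = encode('UTF-8', $text);        # what a UTF-8 file contains
my $wrong = decode('ISO-8859-1', $bytes);  # what a Latin-1 reader displays
my $right = decode('UTF-8', $bytes);       # what a UTF-8 reader displays

binmode STDOUT, ':encoding(UTF-8)';
print "bytes in file:      ", unpack('H*', $bytes), "\n";
print "read as ISO-8859-1: $wrong\n";
print "read as UTF-8:      $right\n";
```

    The ñ is stored as the two bytes C3 B1, and a program that treats each
    byte as one ISO-8859-1 character shows them as Ã and ±.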

    -Joe
     
    Joe Smith, Jul 30, 2004
    #3
  4. Spamtrap

    Guest

    Spamtrap wrote:
    > I only work in Perl occasionally, and have been searching for a
    > solution for a conversion, and everything I found seems much too
    > complex.
    >
    > All I need to do is take a simple text file [that had been created
    > from a Perl script] and copy it, however some specific lines are in
    > fact in UTF8 as printed garbagy characters and they need to be
    > converted to Unicode,


    What do you mean by "converted to Unicode"?

    Do you perhaps mean some other specific encoding of Unicode? If so
    which one?

    > so that the new text file can be imported into a
    > desktop program and into some Word documents.


    Ah, sounds like you may be using Microsoft products. You probably
    want to convert utf8 into utf16 (can't recall if MS uses BE or LE but
    any utf16 implementation is supposed to autodetect anyhow).
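
    If UTF-16 really is what the target program wants, a minimal sketch with
    the core Encode module could look like this. The helper name
    utf8_to_utf16le is hypothetical, and little-endian with a byte order mark
    is an assumption about what Windows programs usually expect, so check it
    against the program you are importing into.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 bytes, returns UTF-16LE bytes
# preceded by a byte order mark (FF FE).
sub utf8_to_utf16le {
    my ($utf8_bytes) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8 instead of guessing.
    my $chars = decode('UTF-8', $utf8_bytes, Encode::FB_CROAK);
    return "\xFF\xFE" . encode('UTF-16LE', $chars);
}

my $utf16 = utf8_to_utf16le("Espa\xC3\xB1ol");
print unpack('H*', $utf16), "\n";   # hex dump of the UTF-16 bytes
```

    Write the result to a file opened in raw mode (for example with
    '>:raw') so that no further translation is applied to the bytes.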

    This has nothing to do with Perl, as such.

    I just tried "convert utf8 utf16" in Google and found lots of stuff.

    This newsgroup does not exist (see FAQ). Please do not start threads
    here.
     
    , Jul 30, 2004
    #4
  5. Spamtrap

    Spamtrap Guest

    Ok let me try to redefine the problem.

    I have a text file, [ in Windows 98], which by definition is in plain
    256 character ASCII. When I view it I see EspaÃ±ol - which I assumed
    was originally UTF8 - but I want to see Español [which of course
    could exist in ASCII, without even having to go to Unicode or anything
    fancy] so the encoding is using the two characters Ã± for the single
    character ñ

    The data from that text file is being imported into a database [this
    part is not Perl programming]. When I display the data, it displays
    EspaÃ±ol not Español

    Then a program will manipulate that database and create a Microsoft
    Word document [or possibly an Adobe PDF document] and I assume the
    text will continue to be incorrect. Therefore I want to use Perl to
    fix that text data before I do the other processing.

    I also have things like СубъеР- which is supposed to be Russian
    and judeÅ£ul which is Romanian.

    It is possible I might have to maintain 2 copies of the strings in the
    database tables, one as an ASCII close match for display purposes,
    [since the database will not support UNICODE directly] and one as
    actual UNICODE for passing into Word.
     
    Spamtrap, Jul 30, 2004
    #5
  6. Spamtrap wrote:
    > I have a text file, [ in Windows 98], which by definition is in plain
    > 256 character ASCII.


    Impossible. ASCII by its very definition has only 128 characters (codes 0 to 127).

    > When I view it I see EspaÃ±ol - which I assumed
    > was originally UTF8 -


    Yep, this sounds about right.

    > but I want to see Español [which of course
    > could exist in ASCII,


    No, it cannot because ASCII contains only English characters and does not
    contain any extended characters.

    > without even having to go to Unicode or anything
    > fancy]


    But UTF-8 which apparently is the current encoding of your text _is_ already
    Unicode.

    > so the encoding is using the two characters Ã± for the single
    > character ñ
    >
    > The data from that text file is being imported into a database [this
    > part is not Perl programming]. When I display the data, it displays
    > EspaÃ±ol not Español


    That simply means one of two things:
    - either the program you are using to display the data does not _know_ how
    to handle UTF-8. If this is the case, then you should use a program that
    actually understands UTF-8.
    - or the program does not realize that the file is in UTF-8 and therefore
    uses whatever default encoding is selected. In that case simply make the
    program recognize the file as UTF-8 encoded, either by changing some option
    in the program or by setting the byte order mark in the file or similar
    means.
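
    For the byte order mark route, here is a minimal sketch. The helper name
    add_utf8_bom and its two path arguments are hypothetical. It copies a
    UTF-8 file and puts the three BOM bytes EF BB BF in front unless they are
    already there. Whether your program actually honours the mark is
    something you have to test.

```perl
use strict;
use warnings;

# Hypothetical helper: copy a UTF-8 file, adding a byte order mark
# at the start if the file does not already begin with one.
sub add_utf8_bom {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:raw', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>:raw', $out_path or die "Cannot write $out_path: $!";
    local $/;                          # slurp mode: read the whole file
    my $data = <$in>;
    $data = '' unless defined $data;   # empty input file
    print {$out} "\xEF\xBB\xBF" unless $data =~ /^\xEF\xBB\xBF/;
    print {$out} $data;
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

    The file is handled as raw bytes on purpose: nothing is re-encoded, only
    the marker is added.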

    > Then a program will manipulate that database and create a Microsoft
    > Word document [or possibly an Adobe PDF document] and I assume the
    > text will continue to be incorrect. Therefore I want to use Perl to
    > fix that text data before I do the other processing.


    See Text::Iconv if you really want to convert text back and forth.

    > I also have things like СÑfбÑSеР- which is supposed to be Russian
    > and judeÅ£ul which is Romanian.


    Then you _really_ should keep your text as Unicode because Cyrillic
    characters are not part of Windows-1252 or ISO-Latin-1. Which means you
    cannot represent Russian text and Spanish text in the same file with
    either of those encodings.
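
    This is easy to check with the core Encode module. The sketch below uses
    a hypothetical helper fits_in and two sample words chosen for
    illustration (the Russian one is an assumption, not taken from your
    file). It asks whether every character of a string exists in a given
    charset.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns 1 if every character of $text can be
# encoded in $charset, otherwise 0.
sub fits_in {
    my ($text, $charset) = @_;
    my $copy = $text;   # encode() with a CHECK argument may modify its input
    return eval { encode($charset, $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $spanish = "Espa\x{F1}ol";
my $russian = "\x{421}\x{443}\x{431}\x{44A}\x{435}\x{43A}\x{442}";  # sample Cyrillic word

for my $charset ('ISO-8859-1', 'ISO-8859-5', 'UTF-8') {
    printf "%-10s Spanish: %d  Russian: %d  both: %d\n", $charset,
        fits_in($spanish, $charset), fits_in($russian, $charset),
        fits_in($spanish . $russian, $charset);
}
```

    Of the three, only UTF-8 accepts both words at once, which is the
    argument for keeping the data in Unicode.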

    > It is possible I might have to maintain 2 copies of the strings in the
    > database tables, one as an ASCII close match for display purposes,


    There are neither Cyrillic nor extended characters in ASCII.

    > [since the database will not support UNICODE directly] and one as
    > actual UNICODE for passing into Word.


    Then change the database. This is 2004, not 1984. A database that today
    cannot handle arbitrary international text is not worth its money, even if
    it's free.

    jue
     
    Jürgen Exner, Jul 31, 2004
    #6
  7. Spamtrap

    Joe Smith Guest

    Spamtrap wrote:

    > Ok let me try to redefine the problem.
    >
    > I have a text file, [ in Windows 98], which by definition is in plain
    > 256 character ASCII. When I view it I see EspaÃ±ol - which I assumed
    > was originally UTF8 - but I want to see Español [which of course
    > could exist in ASCII, without even having to go to Unicode or anything
    > fancy] so the encoding is using the two characters Ã± for the single
    > character ñ


    ASCII is only 128 characters. Character codes 128 to 255 can be
    1) ISO-8859-1 (the Latin-1 alphabet), for western European languages.
    2) Some Microsoft CP (code page). There are many.
    3) Special bit patterns used in the UTF8 encoding scheme.

    For Español, all you need is a UTF8-to-ISO8859 conversion utility.

    > The data from that text file is being imported into a database [this
    > part is not Perl programming]. When I display the data, it displays
    > EspaÃ±ol not Español


    That means that whatever program you are using to display the data
    does not understand UTF8. There are terminal emulators and command
    consoles that do understand UTF8.

    > Then a program will manipulate that database and create a Microsoft
    > Word document [or possibly an Adobe PDF document] and I assume the
    > text will continue to be incorrect. Therefore I want to use Perl to
    > fix that text data before I do the other processing.


    You could try playing around with
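    # Read as UTF-8; write without a UTF-8 layer, so characters below
    # U+0100 come out as single ISO-8859-1 bytes (with CRLF line ends).
    open IN, '<:utf8', $input_file or die "Cannot read $input_file: $!";
    open OUT, '>:crlf', $output_file or die "Cannot write $output_file: $!";
    print OUT <IN>;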
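
    If you would rather convert strings in memory than through file layers,
    the core Encode module has from_to(), which rewrites a byte string in
    place. A minimal sketch, with a made-up sample line standing in for the
    data read from your file:

```perl
use strict;
use warnings;
use Encode qw(from_to);

my $line = "Espa\xC3\xB1ol";   # UTF-8 bytes, as read from the file

# Convert $line in place from UTF-8 to ISO-8859-1.
# from_to() returns the new length in bytes, or undef on failure.
my $ok = from_to($line, 'UTF-8', 'ISO-8859-1');
die "conversion failed" unless defined $ok;

print unpack('H*', $line), "\n";   # hex dump of the converted bytes
```

    Afterwards the ñ occupies the single byte F1 instead of the pair C3 B1.
    This only helps for text that fits into ISO-8859-1, so it does not solve
    the Russian case.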

    > I also have things like СубъеР- which is supposed to be Russian
    > and judeÅ£ul which is Romanian.


    Russian characters simply cannot be displayed in ASCII or ISO-8859-1.
    ISO-8859-5 has Cyrillic, but not Western European accented characters.
    Read http://czyborra.com/charsets/iso8859.html (or Google's cache).

    > It is possible I might have to maintain 2 copies of the strings in the
    > database tables, one as an ASCII close match for display purposes,
    > [since the database will not support UNICODE directly] and one as
    > actual UNICODE for passing into Word.


    The major databases do support Unicode directly. Often it is as simple
    as exporting the database to a flat file, defining a new database
    with UTF8 enabled, and importing the data. You will have to ask the
    DBA to perform this operation.
    -Joe
     
    Joe Smith, Jul 31, 2004
    #7
