Any convenient and elegant way to do encoding conversion in C++?

Discussion in 'C++' started by Licheng Fang, Sep 23, 2006.

  1. Licheng Fang

    Licheng Fang Guest

    I want to store Chinese in Unicode internally in my program, and give
    output in UTF-8 or GBK format. After two days of searching and reading,
    I still cannot find a simple and straightforward way to do the code
    conversions. In particular, I want portability of the code across
    platforms (Windows and Linux), and I don't like having to refer the
    user of my code to some third party libraries for compiling.

    Some STL references point to the class "codecvt<>" for this task, but
    it seems that I must rely on non-standard, third-party specializations
    of this class. The STL itself doesn't implement the code conversions.
    Another option I've read about is using GNU's "iconv", which is
    implemented in C, and Glib provides a C++ wrapper of "iconv". Again,
    re-compiling my source code could be troublesome if I relied heavily on
    these libraries. Boost also seems to have some tools for code
    conversion, but considering the huge size of the Boost libraries, I would
    have to pass on that option.

    These are the only possible ways I know of so far. I have to say that
    my idea of how this task should be done is somewhat influenced by the
    Python way, which is simple and elegant:

    If 's' is a string in GBK:

    unicode_s = s.decode('gbk')

    and when I need to output in GBK I simply convert it back by

    output = unicode_s.encode('gbk')

    Or I can tell the file object what the external encoding is:

    import codecs
    f = codecs.open('somefile', 'r', 'gbk')

    I know it's not fair to expect the same things from two different
    languages. I wonder, however, how such a seemingly trivial task can be
    so infuriatingly complicated in C++.
     
    Licheng Fang, Sep 23, 2006
    #1

  2. Licheng Fang wrote:

    > I want to store Chinese in Unicode internally in my program, and give
    > output in UTF-8 or GBK format. After two days of searching and reading,
    > I still cannot find a simple and straightforward way to do the code
    > conversions. In particular, I want portability of the code across
    > platforms (Windows and Linux), and I don't like having to refer the
    > user of my code to some third party libraries for compiling.


    Then you are in trouble. The Windows APIs and the usual Windows compilers
    use a 16-bit wchar_t with UTF-16 encoding, while gcc and its libraries on
    Linux use a 32-bit wchar_t. So if you want the internals of the program to
    be the same on both platforms, and don't want to use third-party
    libraries, you must define your own wide character type and conversions to
    and from UTF-8, plus some platform-dependent code to see what characters
    are available in the fonts used.

    The UTF-8 conversions are not hard; http://www.unicode.org has plenty of
    information.
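
    As a rough sketch of what "your own type plus UTF-8 conversion" could look
    like: the code below encodes one code point as UTF-8. The names CodePoint
    and append_utf8 are made up for illustration, and it assumes a compiler
    that provides <cstdint>.

```cpp
#include <cstdint>
#include <string>

// Hypothetical fixed-width code point type: the same size on every platform,
// unlike wchar_t.
typedef std::uint32_t CodePoint;

// Append the UTF-8 encoding of one code point to `out`.
// Returns false (and appends nothing) for surrogates and values above U+10FFFF.
bool append_utf8(CodePoint cp, std::string& out)
{
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false;                  // surrogates are not valid code points
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;                      // beyond the Unicode range
    }
    return true;
}
```

    Calling append_utf8 once per element of a std::vector<CodePoint> gives the
    UTF-8 output string. Decoding is the same bit arithmetic in reverse.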

    --
    Salu2
     
    Julián Albo, Sep 23, 2006
    #2

  3. Licheng Fang

    loufoque Guest

    Licheng Fang wrote :

    > Some STL references point to the class "codecvt<>" for this task, but
    > it seems that I must rely on non-standard, third-party specializations
    > of this class. The STL itself doesn't implement the code conversions.


    Indeed, it's not in the standard library.
    That's why you need to use a third party library, like libiconv, unless
    you want to write it yourself of course.

    Basically, you just have to define mappings between one encoding and
    Unicode. This is a very boring task.
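
    To give an idea of what such a mapping looks like, here is a toy decoder
    with a two-entry table. The function names are invented for this example,
    and the table is only an excerpt; a usable one has many thousands of
    entries and would be generated from published mapping data, not typed by
    hand.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative two-entry excerpt of a GBK -> Unicode table (hypothetical helper).
const std::map<std::uint16_t, std::uint32_t>& gbk_sample_table()
{
    static const std::map<std::uint16_t, std::uint32_t> table = {
        {0xD6D0, 0x4E2D},  // 中
        {0xCEC4, 0x6587},  // 文
    };
    return table;
}

// Decode GBK bytes into Unicode code points using the sample table.
// Bytes below 0x80 are ASCII; anything else is treated as a two-byte sequence.
// Returns false on truncated input or a byte pair missing from the table.
bool decode_gbk_sample(const std::string& in, std::vector<std::uint32_t>& out)
{
    const std::map<std::uint16_t, std::uint32_t>& table = gbk_sample_table();
    for (std::size_t i = 0; i < in.size(); ) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {
            out.push_back(b0);
            ++i;
            continue;
        }
        if (i + 1 >= in.size()) {
            return false;  // lead byte without a trail byte
        }
        const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
        const std::uint16_t key = static_cast<std::uint16_t>((b0 << 8) | b1);
        const std::map<std::uint16_t, std::uint32_t>::const_iterator it = table.find(key);
        if (it == table.end()) {
            return false;  // not in this (deliberately tiny) table
        }
        out.push_back(it->second);
        i += 2;
    }
    return true;
}
```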
     
    loufoque, Sep 23, 2006
    #3
  4. Licheng Fang

    Licheng Fang Guest

    Julián Albo wrote:
    > Licheng Fang wrote:
    >
    > > I want to store Chinese in Unicode internally in my program, and give
    > > output in UTF-8 or GBK format. After two days of searching and reading,
    > > I still cannot find a simple and straightforward way to do the code
    > > conversions. In particular, I want portability of the code across
    > > platforms (Windows and Linux), and I don't like having to refer the
    > > user of my code to some third party libraries for compiling.

    >
    > Then you are in trouble. The Windows APIs and the usual Windows compilers
    > use a 16-bit wchar_t with UTF-16 encoding, while gcc and its libraries on
    > Linux use a 32-bit wchar_t. So if you want the internals of the program to
    > be the same on both platforms, and don't want to use third-party
    > libraries, you must define your own wide character type and conversions to
    > and from UTF-8, plus some platform-dependent code to see what characters
    > are available in the fonts used.
    >
    > The UTF-8 conversions are not hard; http://www.unicode.org has plenty of
    > information.


    Thanks very much.

    I know it's simple to convert Unicode to UTF-8, but the input of my
    code is mostly in GBK, which is a popular Chinese encoding. I have to
    deal with that.

    It seems I have to accept that there's no standard way to convert
    encodings in C++. Let me re-state my goals:

    1) use Unicode internally in my program, to facilitate my coding task
    2) make it as convenient as possible for the users of my code to
    compile it

    And let me forget about Windows for now, and think about how I can make
    it simple to re-compile my code on Linux. Given that there's no
    standard way to do encoding conversions, my question is:

    What is the most widely used encoding conversion approach on Linux? Is
    that the "iconv" library? Is this library included by default on most
    Linux platforms? How about the Glib wrappings of this library? Should I
    use it?
     
    Licheng Fang, Sep 23, 2006
    #4
  5. Licheng Fang

    Guest

    Licheng Fang wrote:
    > It seems I have to accept that there's no standard way to convert
    > encodings in C++. Let me re-state my goals:
    >
    > 1) use Unicode internally in my program, to facilitate my coding task
    > 2) make it as convenient as possible for the users of my code to
    > compile it
    >
    > And let me forget about Windows for now, and think about how I can make
    > it simple to re-compile my code on Linux. Given that there's no
    > standard way to do encoding conversions, my question is:
    >
    > What is the most widely used encoding conversion approach on Linux? Is
    > that the "iconv" library? Is this library included by default on most
    > Linux platforms? How about the Glib wrappings of this library? Should I
    > use it?


    iconv is a very standard way to do this. It's a single C function
    which, given proper inputs, will do everything you need. Forget
    about C++ wrappers. It's just a C function. Learn how to
    declare it correctly for a C++ program and you're done (check
    the FAQ). Hopefully, the Glib wrapper you are speaking of
    just does that.
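
    For what it's worth, a minimal sketch of such a call might look like the
    following. The wrapper name convert_encoding is made up here. It assumes a
    POSIX-style <iconv.h> where the input pointer is declared char** (some
    implementations declare it const char**, in which case the const_cast is
    not needed), and that the encoding names you pass are ones your iconv
    actually knows. In practice the conversion goes through iconv_open and
    iconv_close around the iconv call itself.

```cpp
#include <iconv.h>

#include <cerrno>
#include <stdexcept>
#include <string>

// Convert `input` from encoding `from` to encoding `to` (hypothetical wrapper).
// Throws std::runtime_error if the conversion is unsupported or the input is invalid.
std::string convert_encoding(const std::string& input, const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);  // note the argument order: target first
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed: unsupported conversion");
    }

    std::string result;
    char buffer[256];
    char* in_ptr = const_cast<char*>(input.data());
    size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = buffer;
        size_t out_left = sizeof buffer;
        const size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        const int err = errno;  // save before anything else can change it
        result.append(buffer, sizeof buffer - out_left);
        // E2BIG only means the output buffer filled up, so loop again.
        if (rc == (size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }

    // Flush any pending shift sequence (a no-op for stateless encodings).
    char* out_ptr = buffer;
    size_t out_left = sizeof buffer;
    iconv(cd, 0, 0, &out_ptr, &out_left);
    result.append(buffer, sizeof buffer - out_left);

    iconv_close(cd);
    return result;
}
```

    With that in place, something like convert_encoding(gbk_text, "GBK",
    "UTF-8") would be the one place where the conversion happens.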

    I use iconv for encoding internal strings when creating XML
    messages which are sent externally. The iconv call is
    contained in a wrapper we created to interface with an XML
    library. It is cross platform - Linux and Windows.

    Notice the word "call" above is not plural. The key point is
    that the system is designed so that it doesn't need to keep
    track of how strings are encoded. The process of creating
    something for external consumption encapsulates the
    conversion.

    You should carefully design your system to do the same,
    otherwise your code will be riddled with conversion calls
    and anything that contains data will need to keep track
    of how that data is encoded.

    Good luck.
     
    , Sep 23, 2006
    #5
  6. Licheng Fang wrote:

    > It seems I have to accept that there's no standard way to convert
    > encodings in C++. Let me re-state my goals:
    > 1) use Unicode internally in my program, to facilitate my coding task
    > 2) make it as convenient as possible for the users of my code to
    > compile it


    You have, for example, the mbtowc (multibyte to wide char) function and its
    family in the C library, which I suppose will support your encoding if you
    have a locale that uses it. You can handle the locale with the C-style
    functions in <locale.h> or the C++ <locale> ones.
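
    A sketch of that approach, using mbstowcs from the same family. The helper
    name locale_to_wide is invented for this example. It assumes the library
    accepts a null destination to count the required characters (POSIX
    specifies this; check your platform), and the result uses the platform's
    own wchar_t, whatever its size.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Decode a multibyte string using the encoding of the current C locale
// (hypothetical helper). Returns false if the input is not valid in that
// encoding. Input is read up to the first NUL byte.
bool locale_to_wide(const std::string& in, std::wstring& out)
{
    // With a null destination, mbstowcs only counts the wide characters needed.
    const std::size_t n = std::mbstowcs(0, in.c_str(), 0);
    if (n == static_cast<std::size_t>(-1)) {
        return false;
    }
    std::vector<wchar_t> buf(n + 1);  // room for the terminating L'\0'
    std::mbstowcs(&buf[0], in.c_str(), buf.size());
    out.assign(&buf[0], n);
    return true;
}
```

    Before calling it you would select a locale whose encoding matches the
    input, for instance std::setlocale(LC_CTYPE, "zh_CN.GBK"). That locale name
    is only an example; whether it exists depends on what is installed on the
    system.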

    The availability of locales and libraries on Linux is off-topic in this
    group; you can ask in a Linux-related newsgroup.

    --
    Salu2
     
    Julián Albo, Sep 23, 2006
    #6
