utf-8 and ctypes

Discussion in 'Python' started by Brendan Miller, Sep 28, 2010.

  1. I'm using Python 2.5.

    Currently I have some Python bindings written in ctypes. On the C
    side, my strings are in UTF-8. On the Python side I use
    ctypes.c_char_p to convert my strings to Python strings. However, this
    seems to break for non-ASCII characters.

    It seems that characters not in the ASCII subset of UTF-8 are
    discarded by c_char_p during the conversion, or at least they don't
    print out when I go to print the string.

    Does Python not support UTF-8 strings? Is there some other way I
    should be doing the conversion?

    Thanks,
    Brendan
    Brendan Miller, Sep 28, 2010
    #1

  2. In message <>, Brendan
    Miller wrote:

    > It seems that characters not in the ascii subset of UTF-8 are
    > discarded by c_char_p during the conversion ...


    Not a chance.

    > ... or at least they don't print out when I go to print the string.


    So it seems there’s a problem on the printing side. What happens when you
    construct a UTF-8-encoded string directly in Python and try printing it the
    same way?
    Lawrence D'Oliveiro, Sep 29, 2010
    #2
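A minimal sketch of the check suggested above, assuming Python 3 (the thread itself uses Python 2.5, where the same steps apply to `str` and `unicode`). The sample text and variable names are illustrative only. Printing `repr()` of the encoded bytes avoids relying on the terminal's encoding, which may help separate a conversion problem from a printing problem.

```python
# -*- coding: utf-8 -*-
# Illustrative sample text; any non-ASCII string would do.
text = u"日本語のテスト"

# Encode to the byte sequence that C code working in UTF-8 would hold.
encoded = text.encode("utf-8")

# repr() shows the raw bytes as escapes, independent of the console encoding.
print(repr(encoded))

# Character count versus byte count: they differ for non-ASCII text.
print(len(text), len(encoded))

# Decoding the same bytes should give back the original text.
round_tripped = encoded.decode("utf-8")
print(round_tripped == text)
```

If the byte count and the round trip look right here but printing still shows nothing, the terminal or console encoding is the more likely culprit.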

  3. 2010/9/29 Lawrence D'Oliveiro <_zealand>:
    > In message <>, Brendan
    > Miller wrote:
    >
    >> It seems that characters not in the ascii subset of UTF-8 are
    >> discarded by c_char_p during the conversion ...

    >
    > Not a chance.
    >
    >> ... or at least they don't print out when I go to print the string.

    >
    > So it seems there’s a problem on the printing side. What happens when you
    > construct a UTF-8-encoded string directly in Python and try printing it the
    > same way?


    Doing this seems to confirm something is broken in ctypes w.r.t. UTF-8...

    if I enter:
    str = "日本語のテスト"

    Then:
    print str
    日本語のテスト

    However, when I create a string buffer, pass it into my c++ code, and
    write the same UTF-8 string into it, python seems to discard pretty
    much all the text. The same code works for pure ascii strings.

    Python code:
    _std_string_size = _lib_mbxclient.std_string_size
    _std_string_size.restype = c_long
    _std_string_size.argtypes = [c_void_p]

    _std_string_copy = _lib_mbxclient.std_string_copy
    _std_string_copy.restype = None
    _std_string_copy.argtypes = [c_void_p, POINTER(c_char)]

    # This function works for ascii, but breaks on strings with UTF-8!
    def std_string_to_string(str_ptr):
        buf = create_string_buffer(_std_string_size(str_ptr))
        _std_string_copy(str_ptr, buf)
        return buf.raw

    C++ code:

    extern "C"
    long std_string_size(string* str)
    {
        return str->size();
    }

    extern "C"
    void std_string_copy(string* str, char* buf)
    {
        std::copy(str->begin(), str->end(), buf);
    }
    Brendan Miller, Sep 29, 2010
    #3
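A hedged sketch of the buffer-copy pattern above, assuming Python 3 and using `ctypes.memmove` as a stand-in for the C++ `std_string_size` / `std_string_copy` pair, which are not available here. The helper name `copy_into_buffer` is hypothetical. It suggests that `create_string_buffer` and `.raw` hand back every byte unchanged, non-ASCII ones included.

```python
# -*- coding: utf-8 -*-
import ctypes

# UTF-8 bytes playing the role of the std::string contents on the C++ side.
payload = u"日本語のテスト".encode("utf-8")

def copy_into_buffer(data):
    """Copy bytes into a ctypes buffer and return them, mimicking the
    size-then-copy pattern of std_string_to_string (hypothetical stand-in)."""
    # Allocate exactly len(data) bytes, zero-filled.
    buf = ctypes.create_string_buffer(len(data))
    # memmove copies the raw bytes, as std::copy would on the C++ side.
    ctypes.memmove(buf, data, len(data))
    # .raw returns the whole buffer as bytes, without stopping at a NUL.
    return buf.raw

copied = copy_into_buffer(payload)
print(repr(copied))
print(copied == payload)
```

Since the bytes survive the copy in this sketch, a loss of characters is more likely to come from how the result is decoded or printed than from ctypes itself.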
  4. On 29/09/2010 19:33, Brendan Miller wrote:
    > 2010/9/29 Lawrence D'Oliveiro<_zealand>:
    >> In message<>, Brendan
    >> Miller wrote:
    >>
    >>> It seems that characters not in the ascii subset of UTF-8 are
    >>> discarded by c_char_p during the conversion ...

    >>
    >> Not a chance.
    >>
    >>> ... or at least they don't print out when I go to print the string.

    >>
    >> So it seems there’s a problem on the printing side. What happens when you
    >> construct a UTF-8-encoded string directly in Python and try printing it the
    >> same way?

    >
    > Doing this seems to confirm something is broken in ctypes w.r.t. UTF-8...
    >
    > if I enter:
    > str = "日本語のテスト"
    >
    > Then:
    > print str
    > 日本語のテスト
    >
    > However, when I create a string buffer, pass it into my c++ code, and
    > write the same UTF-8 string into it, python seems to discard pretty
    > much all the text. The same code works for pure ascii strings.
    >
    > Python code:
    > _std_string_size = _lib_mbxclient.std_string_size
    > _std_string_size.restype = c_long
    > _std_string_size.argtypes = [c_void_p]
    >
    > _std_string_copy = _lib_mbxclient.std_string_copy
    > _std_string_copy.restype = None
    > _std_string_copy.argtypes = [c_void_p, POINTER(c_char)]
    >
    > # This function works for ascii, but breaks on strings with UTF-8!
    > def std_string_to_string(str_ptr):
    >     buf = create_string_buffer(_std_string_size(str_ptr))
    >     _std_string_copy(str_ptr, buf)
    >     return buf.raw
    >
    > C++ code:
    >
    > extern "C"
    > long std_string_size(string* str)
    > {
    > return str->size();
    > }
    >
    > extern "C"
    > void std_string_copy(string* str, char* buf)
    > {
    > std::copy(str->begin(), str->end(), buf);
    > }


    It might have something to do with the character encoding of your
    source files.

    Also, try printing out the character codes of the string and the size
    of the string's characters in the C++ code.
    MRAB, Sep 29, 2010
    #4
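One way to print the character codes on the Python side, sketched under the assumption that the data arrives as a byte string. The helper name `describe` is hypothetical; `bytearray` is used so the same code yields integer byte values on both Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
def describe(data):
    """Return the byte values of a byte string as two-digit hex codes."""
    return " ".join("%02x" % b for b in bytearray(data))

# A single non-ASCII character occupies several bytes in UTF-8.
sample = u"日".encode("utf-8")
print(len(sample), describe(sample))

# An ASCII character occupies exactly one byte.
print(describe(u"A".encode("utf-8")))
```

Comparing this dump with the bytes printed from the C++ side should show whether anything is actually lost in transit.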
  5. "Brendan Miller" <> wrote in message
    news:AANLkTi=...
    > 2010/9/29 Lawrence D'Oliveiro <_zealand>:
    >> In message <>, Brendan
    >> Miller wrote:
    >>
    >>> It seems that characters not in the ascii subset of UTF-8 are
    >>> discarded by c_char_p during the conversion ...

    >>
    >> Not a chance.
    >>
    >>> ... or at least they don't print out when I go to print the string.

    >>
    >> So it seems there’s a problem on the printing side. What happens when you
    >> construct a UTF-8-encoded string directly in Python and try printing it the
    >> same way?

    >
    > Doing this seems to confirm something is broken in ctypes w.r.t. UTF-8...
    >
    > if I enter:
    > str = "日本語のテスト"
    >
    > Then:
    > print str
    > 日本語のテスト
    >
    > However, when I create a string buffer, pass it into my c++ code, and
    > write the same UTF-8 string into it, python seems to discard pretty
    > much all the text. The same code works for pure ascii strings.
    >
    > Python code:
    > _std_string_size = _lib_mbxclient.std_string_size
    > _std_string_size.restype = c_long
    > _std_string_size.argtypes = [c_void_p]
    >
    > _std_string_copy = _lib_mbxclient.std_string_copy
    > _std_string_copy.restype = None
    > _std_string_copy.argtypes = [c_void_p, POINTER(c_char)]
    >
    > # This function works for ascii, but breaks on strings with UTF-8!
    > def std_string_to_string(str_ptr):
    >     buf = create_string_buffer(_std_string_size(str_ptr))
    >     _std_string_copy(str_ptr, buf)
    >     return buf.raw
    >
    > C++ code:
    >
    > extern "C"
    > long std_string_size(string* str)
    > {
    > return str->size();
    > }
    >
    > extern "C"
    > void std_string_copy(string* str, char* buf)
    > {
    > std::copy(str->begin(), str->end(), buf);
    > }


    I didn't see what OS you are using, but I fleshed out your example code and
    have a working example for Windows. Below is the code for the DLL and
    script:

    --------- x.cpp [cl /LD /EHsc /W4 x.cpp] ---------------------------
    #include <string>
    #include <algorithm>
    using namespace std;

    extern "C" __declspec(dllexport) long std_string_size(string* str)
    {
        return str->size();
    }

    extern "C" __declspec(dllexport) void std_string_copy(string* str, char* buf)
    {
        std::copy(str->begin(), str->end(), buf);
    }

    extern "C" __declspec(dllexport) void* make(const char* s)
    {
        return new string(s);
    }

    extern "C" __declspec(dllexport) void destroy(void* s)
    {
        delete (string*)s;
    }
    ---- x.py ---------------------------------------------------------
    # coding: utf8
    from ctypes import *
    _lib_mbxclient = CDLL('x')

    _std_string_size = _lib_mbxclient.std_string_size
    _std_string_size.restype = c_long
    _std_string_size.argtypes = [c_void_p]

    _std_string_copy = _lib_mbxclient.std_string_copy
    _std_string_copy.restype = None
    _std_string_copy.argtypes = [c_void_p, c_char_p]

    make = _lib_mbxclient.make
    make.restype = c_void_p
    make.argtypes = [c_char_p]

    destroy = _lib_mbxclient.destroy
    destroy.restype = None
    destroy.argtypes = [c_void_p]

    # This function works for ascii, but breaks on strings with UTF-8!
    def std_string_to_string(str_ptr):
        buf = create_string_buffer(_std_string_size(str_ptr))
        _std_string_copy(str_ptr, buf)
        return buf.raw

    s = make(u'我是美国人。'.encode('utf8'))
    print std_string_to_string(s).decode('utf8')
    destroy(s)  # free the std::string allocated by make()
    ------------------------------------------------------

    And output (in Pythonwin...US Windows console doesn't support Chinese):

    我是美国人。

    I used c_char_p instead of POINTER(c_char) and added functions to create and
    destroy a std::string for Python's use, but it is otherwise the same as your
    code.

    Hope this helps you work it out,
    -Mark
    Mark Tolonen, Sep 30, 2010
    #5
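Because some consoles cannot display every character, as noted above for the US Windows console, a possible workaround is to escape whatever the console encoding cannot represent rather than lose it. This is a sketch assuming Python 3; the helper name `console_safe` is hypothetical, and `backslashreplace` is one of the standard codec error handlers.

```python
# -*- coding: utf-8 -*-
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by backslash escapes (hypothetical helper)."""
    # Fall back to ASCII if the stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Encode with escapes for unsupported characters, then decode back to text.
    return text.encode(encoding, "backslashreplace").decode(encoding)

# Printing the escaped form should not fail even on a limited console.
print(console_safe(u"我是美国人。"))
```

Seeing escapes such as `\u6211` in the output would indicate the data is intact and only the console is unable to render it.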
  6. Brendan Miller <> writes:

    > 2010/9/29 Lawrence D'Oliveiro <_zealand>:
    >> In message <>, Brendan
    >> Miller wrote:
    >>
    >>> It seems that characters not in the ascii subset of UTF-8 are
    >>> discarded by c_char_p during the conversion ...

    >>
    >> Not a chance.
    >>
    >>> ... or at least they don't print out when I go to print the string.

    >>
    >> So it seems there’s a problem on the printing side. What happens when you
    >> construct a UTF-8-encoded string directly in Python and try printing it the
    >> same way?

    >
    > Doing this seems to confirm something is broken in ctypes w.r.t. UTF-8...
    >
    > if I enter:
    > str = "日本語ã®ãƒ†ã‚¹ãƒˆ"


    What is this? Which encoding is used by your editor to produce this
    byte-string?

    If you want to be sure you have the right encoding, you need to do this:

    - put a "coding: utf-8" declaration (or whatever encoding your editor
    actually uses) in the first or second line of the source file
    - use unicode literals. Those are the funny little strings with a "u" in
    front of them. They will be *decoded* using the declared encoding.
    - when passing this to C, explicitly *encode* with utf-8 first.

    Diez
    Diez B. Roggisch, Sep 30, 2010
    #6
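The three steps above can be sketched as follows, assuming Python 3 (where the `u` prefix is accepted but optional) and using a bare `ctypes.c_char_p` in place of a real C function, since the thread's library is not available here. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# Step 1: the coding declaration above must match what the editor saves.
import ctypes

# Step 2: a unicode literal, decoded using the declared source encoding.
text = u"日本語のテスト"

# Step 3: encode explicitly to UTF-8 before handing the data to C.
encoded = text.encode("utf-8")

# c_char_p holds a pointer to the NUL-terminated bytes a C function would see.
c_str = ctypes.c_char_p(encoded)

# Coming back from C, .value yields bytes; decode them explicitly as UTF-8.
back = c_str.value.decode("utf-8")
print(back == text)
```

Keeping the encode and decode steps explicit at the C boundary means the result does not depend on the interpreter's default encoding.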