utf-8 and ctypes

Brendan Miller · Sep 28, 2010

I'm using python 2.5.

Currently I have some python bindings written in ctypes. On the C
side, my strings are in utf-8. On the python side I use
ctypes.c_char_p to convert my strings to python strings. However, this
seems to break for non-ascii characters.

It seems that characters not in the ascii subset of UTF-8 are
discarded by c_char_p during the conversion, or at least they don't
print out when I go to print the string.

Does python not support utf-8 strings? Is there some other way I
should be doing the conversion?

Thanks,
Brendan

Lawrence D'Oliveiro · Sep 29, 2010

Brendan said:
> It seems that characters not in the ascii subset of UTF-8 are
> discarded by c_char_p during the conversion ...

Not a chance.

> ... or at least they don't print out when I go to print the string.

So it seems there's a problem on the printing side. What happens when you
construct a UTF-8-encoded string directly in Python and try printing it the
same way?

Brendan Miller · Sep 29, 2010

2010/9/29 Lawrence D'Oliveiro said:
> Not a chance.
>
> So it seems there's a problem on the printing side. What happens when you
> construct a UTF-8-encoded string directly in Python and try printing it the
> same way?

Doing this seems to confirm something is broken in ctypes w.r.t. UTF-8...

If I enter:
str = "日本語のテスト"

Then:
print str
日本語のテスト

However, when I create a string buffer, pass it into my c++ code, and
write the same UTF-8 string into it, python seems to discard pretty
much all the text. The same code works for pure ascii strings.

Python code:
_std_string_size = _lib_mbxclient.std_string_size
_std_string_size.restype = c_long
_std_string_size.argtypes = [c_void_p]

_std_string_copy = _lib_mbxclient.std_string_copy
_std_string_copy.restype = None
_std_string_copy.argtypes = [c_void_p, POINTER(c_char)]

# This function works for ascii, but breaks on strings with UTF-8!
def std_string_to_string(str_ptr):
    buf = create_string_buffer(_std_string_size(str_ptr))
    _std_string_copy(str_ptr, buf)
    return buf.raw

C++ code:

extern "C"
long std_string_size(string* str)
{
    return str->size();
}

extern "C"
void std_string_copy(string* str, char* buf)
{
    std::copy(str->begin(), str->end(), buf);
}

MRAB · Sep 29, 2010

It might have something to do with the character encoding of your
source files.

Also, try printing out the character codes of the string and the size
of the string's characters in the C++ code.
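
The same check can be made on the Python side. The following is only a sketch: it is written for Python 3 (it should also run on 2.6+), not the 2.5 used in this thread, and `describe_bytes` is a made-up helper name. It prints the length and hex codes of a UTF-8 byte string so they can be compared with what the C++ code reports.

```python
# -*- coding: utf-8 -*-

def describe_bytes(data):
    """Return (length, space-separated hex codes) for a byte string."""
    # bytearray yields integers on both Python 2.6+ and Python 3.
    codes = " ".join("%02x" % b for b in bytearray(data))
    return len(data), codes

text = u"日本語"
encoded = text.encode("utf-8")   # the bytes that would be handed to C
size, codes = describe_bytes(encoded)
print(size)
print(codes)
```

If the length or the codes differ between the two sides, the bytes are being changed before printing. If they match, the problem is more likely in how the result is displayed.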

Mark Tolonen · Sep 30, 2010

I didn't see what OS you are using, but I fleshed out your example code and
have a working example for Windows. Below is the code for the DLL and
script:

--------- x.cpp [cl /LD /EHsc /W4 x.cpp] ---------
#include <string>
#include <algorithm>
using namespace std;

extern "C" __declspec(dllexport) long std_string_size(string* str)
{
    return str->size();
}

extern "C" __declspec(dllexport) void std_string_copy(string* str, char* buf)
{
    std::copy(str->begin(), str->end(), buf);
}

extern "C" __declspec(dllexport) void* make(const char* s)
{
    return new string(s);
}

extern "C" __declspec(dllexport) void destroy(void* s)
{
    delete (string*)s;
}
---- x.py ---------------------------------------------------------
# coding: utf8
from ctypes import *
_lib_mbxclient = CDLL('x')

_std_string_size = _lib_mbxclient.std_string_size
_std_string_size.restype = c_long
_std_string_size.argtypes = [c_void_p]

_std_string_copy = _lib_mbxclient.std_string_copy
_std_string_copy.restype = None
_std_string_copy.argtypes = [c_void_p, c_char_p]

make = _lib_mbxclient.make
make.restype = c_void_p
make.argtypes = [c_char_p]

destroy = _lib_mbxclient.destroy
destroy.restype = None
destroy.argtypes = [c_void_p]

# This function works for ascii, but breaks on strings with UTF-8!
def std_string_to_string(str_ptr):
    buf = create_string_buffer(_std_string_size(str_ptr))
    _std_string_copy(str_ptr, buf)
    return buf.raw

s = make(u'我是美国人。'.encode('utf8'))
print std_string_to_string(s).decode('utf8')
------------------------------------------------------

And output (in Pythonwin...US Windows console doesn't support Chinese):

我是美国人。

I used c_char_p instead of POINTER(c_char) and added functions to create and
destroy a std::string for Python's use, but it is otherwise the same as your
code.

Hope this helps you work it out,
-Mark
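
For anyone without a compiler handy, the buffer handling can be tried without the DLL. This is only a sketch, written for Python 3 rather than the 2.5 in this thread: `copy_out` is a made-up name, and `ctypes.memmove` stands in for the `std_string_copy` call, so it only exercises the Python half of the round trip.

```python
# -*- coding: utf-8 -*-
import ctypes

def copy_out(src_bytes):
    """Copy src_bytes into a fresh ctypes buffer and return its raw bytes."""
    # A buffer of exactly len(src_bytes) bytes, with no room for a trailing NUL.
    buf = ctypes.create_string_buffer(len(src_bytes))
    # Stand-in for the C++ side writing the string into the buffer.
    ctypes.memmove(buf, src_bytes, len(src_bytes))
    # .raw returns every byte in the buffer; .value would stop at the first NUL.
    return buf.raw

original = u"我是美国人。"
round_tripped = copy_out(original.encode("utf-8")).decode("utf-8")
print(round_tripped == original)
```

Note that `buf.raw` is a byte string. It still has to be decoded as UTF-8 before it is treated as text.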

Diez B. Roggisch · Sep 30, 2010

Brendan Miller said:
> Doing this seems to confirm something is broken in ctypes w.r.t. UTF-8...
>
> if I enter:
> str = "日本語のテスト"

What is this? Which encoding is used by your editor to produce this
byte-string?

If you want to be sure you have the right encoding, you need to do this:

- put a coding: utf-8 (or actually whatever your editor uses) in the
first or second line
- use unicode literals. Those are the funny little strings with a "u" in
front of them. They will be *decoded* using the declared encoding.
- when passing this to C, explicitly *encode* with utf-8 first.

Diez
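
The three steps above could look roughly like this. It is a sketch under assumptions: it is written for Python 3 (the `u` prefix and the coding line only matter on Python 2), and `to_c_string` and `from_c_string` are made-up helper names, not part of ctypes.

```python
# -*- coding: utf-8 -*-
import ctypes

def to_c_string(text):
    """Encode a unicode string as UTF-8 and wrap it as a C char pointer."""
    return ctypes.c_char_p(text.encode("utf-8"))

def from_c_string(c_str):
    """Read the NUL-terminated bytes back and decode them as UTF-8."""
    return c_str.value.decode("utf-8")

text = u"日本語のテスト"
c_str = to_c_string(text)
print(len(text))          # number of characters on the Python side
print(len(c_str.value))   # number of UTF-8 bytes the C side would see
print(from_c_string(c_str) == text)
```

The two lengths differ because each of these characters takes three bytes in UTF-8. That is why the encode and decode steps have to be explicit at the boundary.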
