utf-8 and ctypes

Brendan Miller

I'm using Python 2.5.

Currently I have some Python bindings written in ctypes. On the C
side, my strings are in UTF-8. On the Python side I use
ctypes.c_char_p to convert my strings to Python strings. However, this
seems to break for non-ASCII characters.

It seems that characters not in the ASCII subset of UTF-8 are
discarded by c_char_p during the conversion, or at least they don't
print out when I go to print the string.

Does Python not support UTF-8 strings? Is there some other way I
should be doing the conversion?

Thanks,
Brendan
 
Lawrence D'Oliveiro

Brendan said:
It seems that characters not in the ASCII subset of UTF-8 are
discarded by c_char_p during the conversion ...

Not a chance.

Brendan said:
... or at least they don't print out when I go to print the string.

So it seems there’s a problem on the printing side. What happens when you
construct a UTF-8-encoded string directly in Python and try printing it the
same way?
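
A minimal sketch of that check, assuming no C library is involved at all: it pushes the UTF-8 bytes for 日本語のテスト (a sample string chosen here for illustration, written with escapes so the source file's encoding does not matter) through c_char_p and compares what comes back.

```python
import ctypes

# Hypothetical sample text, written with escapes so that the encoding of
# this source file cannot influence the result.
text = u'\u65e5\u672c\u8a9e\u306e\u30c6\u30b9\u30c8'
encoded = text.encode('utf-8')      # 7 characters, 3 bytes each in UTF-8

# c_char_p points at a NUL-terminated byte string; .value copies the bytes
# back out, up to the first NUL, without interpreting them as text.
pointer = ctypes.c_char_p(encoded)
round_tripped = pointer.value

# Decoding is a separate, explicit step on the Python side.
decoded = round_tripped.decode('utf-8')

print(repr(round_tripped))
print(round_tripped == encoded)
```

If the comparison holds, the bytes survive the trip through ctypes and any loss has to happen later, for example when the string is printed to a console that cannot display it.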
 
Brendan Miller

2010/9/29 Lawrence D'Oliveiro said:
Not a chance.

So it seems there’s a problem on the printing side. What happens when you
construct a UTF-8-encoded string directly in Python and try printing it the
same way?

Doing this seems to confirm something is broken in ctypes w.r.t. UTF-8...

if I enter:
str = "日本語のテスト"

Then:
print str
日本語のテスト

However, when I create a string buffer, pass it into my C++ code, and
write the same UTF-8 string into it, Python seems to discard pretty
much all the text. The same code works for pure ASCII strings.

Python code:
_std_string_size = _lib_mbxclient.std_string_size
_std_string_size.restype = c_long
_std_string_size.argtypes = [c_void_p]

_std_string_copy = _lib_mbxclient.std_string_copy
_std_string_copy.restype = None
_std_string_copy.argtypes = [c_void_p, POINTER(c_char)]

# This function works for ascii, but breaks on strings with UTF-8!
def std_string_to_string(str_ptr):
    buf = create_string_buffer(_std_string_size(str_ptr))
    _std_string_copy(str_ptr, buf)
    return buf.raw

C++ code:

extern "C"
long std_string_size(string* str)
{
    return str->size();
}

extern "C"
void std_string_copy(string* str, char* buf)
{
    std::copy(str->begin(), str->end(), buf);
}
 
MRAB

It might have something to do with the character encoding of your
source files.

Also, try printing out the character codes of the string and the size
of the string's character in the C++ code.
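
On the Python side, dumping the character codes could look roughly like the sketch below. The helper name byte_codes and the sample bytes are made up for illustration; in the real code the input would be the value returned by std_string_to_string.

```python
def byte_codes(raw):
    """Return the numeric value of every byte in a byte string."""
    # bytearray yields integers on both Python 2.6+ and Python 3;
    # on Python 2.5 use [ord(c) for c in raw] instead.
    return list(bytearray(raw))

# Hypothetical sample: the UTF-8 encoding of U+65E5 followed by ASCII "a".
sample = u'\u65e5a'.encode('utf-8')
codes = byte_codes(sample)

print(len(sample))
print([hex(c) for c in codes])
```

Any value of 0x80 or above belongs to a multi-byte UTF-8 sequence, so if such values show up here, the non-ASCII data did reach Python.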
 
Mark Tolonen

I didn't see what OS you are using, but I fleshed out your example code and
have a working example for Windows. Below is the code for the DLL and
script:

--------- x.cpp [cl /LD /EHsc /W4 x.cpp] ----------------------------------------------------
#include <string>
#include <algorithm>
using namespace std;

extern "C" __declspec(dllexport) long std_string_size(string* str)
{
    return str->size();
}

extern "C" __declspec(dllexport) void std_string_copy(string* str, char* buf)
{
    std::copy(str->begin(), str->end(), buf);
}

extern "C" __declspec(dllexport) void* make(const char* s)
{
    return new string(s);
}

extern "C" __declspec(dllexport) void destroy(void* s)
{
    delete (string*)s;
}
---- x.py ---------------------------------------------------------
# coding: utf8
from ctypes import *
_lib_mbxclient = CDLL('x')

_std_string_size = _lib_mbxclient.std_string_size
_std_string_size.restype = c_long
_std_string_size.argtypes = [c_void_p]

_std_string_copy = _lib_mbxclient.std_string_copy
_std_string_copy.restype = None
_std_string_copy.argtypes = [c_void_p, c_char_p]

make = _lib_mbxclient.make
make.restype = c_void_p
make.argtypes = [c_char_p]

destroy = _lib_mbxclient.destroy
destroy.restype = None
destroy.argtypes = [c_void_p]

# This function works for ascii, but breaks on strings with UTF-8!
def std_string_to_string(str_ptr):
    buf = create_string_buffer(_std_string_size(str_ptr))
    _std_string_copy(str_ptr, buf)
    return buf.raw

s = make(u'我是美国人。'.encode('utf8'))
print std_string_to_string(s).decode('utf8')
------------------------------------------------------

And output (in Pythonwin...US Windows console doesn't support Chinese):

我是美国人。

I used c_char_p instead of POINTER(c_char) and added functions to create and
destroy a std::string for Python's use, but it is otherwise the same as your
code.
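
For anyone without a compiler at hand, the buffer handling can be tried in isolation. This is only a sketch: ctypes.memmove stands in for the C++ std_string_copy() call, and the payload is a made-up sample rather than data from a real std::string.

```python
import ctypes

# Hypothetical payload: UTF-8 bytes for U+6211 U+662F, standing in for
# whatever the C++ std::string holds.
payload = u'\u6211\u662f'.encode('utf-8')

# Same sizing as std_string_to_string(): exactly size() bytes, so the
# buffer has no room for a trailing NUL.
buf = ctypes.create_string_buffer(len(payload))

# memmove stands in here for the C++ std_string_copy() call.
ctypes.memmove(buf, payload, len(payload))

# .raw returns every byte in the buffer, while .value would stop at the
# first NUL byte.
copied = buf.raw
text = copied.decode('utf-8')

print(repr(copied))
```

The bytes come back unchanged, and the result only becomes text at the explicit decode('utf-8') step.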

Hope this helps you work it out,
-Mark
 
Diez B. Roggisch

Brendan Miller said:
Doing this seems to confirm something is broken in ctypes w.r.t. UTF-8...

if I enter:
str = "日本語のテスト"

What is this? Which encoding is used by your editor to produce this
byte-string?

If you want to be sure you have the right encoding, you need to do this:

- put a coding: utf-8 declaration (or actually whatever encoding your
editor uses) in the first or second line
- use unicode literals. Those are the funny little strings with a "u" in
front of them. They will be *decoded* using the declared encoding.
- when passing this to C, explicitly *encode* with utf-8 first.
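
Put together, the three steps above might look like the following sketch. It assumes the file really is saved as UTF-8, and the Latin-1 decode at the end is included only to show what a mismatched encoding produces.

```python
# -*- coding: utf-8 -*-
# The declaration above must match the encoding the editor saves the file in.

# Unicode literal: decoded using the declared source encoding.
text = u'日本語のテスト'

# Explicit encode before handing the bytes to C code.
encoded = text.encode('utf-8')

# Decoding with the encoding used on the C side restores the text ...
right = encoded.decode('utf-8')

# ... while decoding with a different one (Latin-1 here, chosen only for
# illustration) yields one bogus character per byte.
wrong = encoded.decode('latin-1')

print(right == text)
print(len(text), len(encoded), len(wrong))
```

The mismatched decode does not raise an error. It silently produces a longer string of unrelated characters, which is why this kind of problem tends to show up only when the text is displayed.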

Diez
 
