Unicode problem in ucs4

abhi · Mar 19, 2009

Hi,
I have a C extension which takes a unicode or string value from
Python and converts it to unicode before doing more operations on it.
The skeleton looks like:

static PyObject *unicode_helper(PyObject *self, PyObject *args){
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }
    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);
    ............
    // perform other operations.
    .............
}

This code works fine on a Python built with the UCS2 configuration
but fails on a UCS4 build. By failing, I mean that only the first
letter ends up in the variable sample, i.e. if I pass "test" from
Python, sample contains only "t". However, PyUnicode_GetSize(sampleObj)
returns the correct value (4 in this case).

Any idea on why this is happening? Any help will be appreciated.

Regards,
Abhigyan

Martin v. Löwis · Mar 20, 2009

Any idea on why this is happening?

Can you provide a complete example? Your code looks correct, and should
just work.

How do you know the result contains only 't' (i.e. how do you know it
does not contain 'e', 's', 't')?

Regards,
Martin

abhi · Mar 20, 2009

Hi Martin,
Here is the code:
unicodeTest.c

#include <Python.h>

static PyObject *unicode_helper(PyObject *self, PyObject *args){
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }

    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);
    wprintf(L"database value after unicode conversion is : %s\n", sample);
    return Py_BuildValue("");
}

static PyMethodDef funcs[] = {
    {"unicodeTest", (PyCFunction)unicode_helper, METH_VARARGS, "test ucs2, ucs4"},
    {NULL}
};

void initunicodeTest(void){
    Py_InitModule3("unicodeTest", funcs, "");
}

When I install this unicodeTest module on a UCS2 Python, wprintf prints
whatever is passed, e.g.:

import unicodeTest
unicodeTest.unicodeTest("hello world")
database value after unicode conversion is : hello world

but it prints the following on ucs4 configured python:
database value after unicode conversion is : h

Regards,
Abhigyan

M.-A. Lemburg · Mar 20, 2009

You have to use PyUnicode_AsWideChar() to convert a Python
Unicode object to a wchar_t representation.

Please don't make any assumptions about what Py_UNICODE maps
to, and always use the Unicode API for this. It is designed
to provide a portable interface and will not do more conversion
work than necessary.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Mar 20 2009)________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

abhi · Mar 23, 2009

Hi Marc,
Thanks for the help. I tried PyUnicode_AsWideChar(), but I am
getting the same result, i.e. only the first letter.

sample code:

#include <Python.h>

static PyObject *unicode_helper(PyObject *self, PyObject *args){
    PyObject *sampleObj = NULL;
    wchar_t *sample = NULL;
    int size = 0;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }

    // use wide char function
    size = PyUnicode_AsWideChar(databaseObj, sample,
                                PyUnicode_GetSize(databaseObj));
    printf("%d chars are copied to sample\n", size);
    wprintf(L"database value after unicode conversion is : %s\n", sample);
    return Py_BuildValue("");
}

static PyMethodDef funcs[] = {
    {"unicodeTest", (PyCFunction)unicode_helper, METH_VARARGS, "test ucs2, ucs4"},
    {NULL}
};

void initunicodeTest(void){
    Py_InitModule3("unicodeTest", funcs, "");
}

This prints the following when input value is given as "test":
4 chars are copied to sample
database value after unicode conversion is : t

Any ideas?

-
Abhigyan

John Machin · Mar 23, 2009

// use wide char function
size = PyUnicode_AsWideChar(databaseObj, sample,
PyUnicode_GetSize(databaseObj));

What is databaseObj??? Copy/paste the *actual* code that you compiled
and ran.

[presuming littleendian] The ucs4 string will look like "\t\0\0\0e
\0\0\0s\0\0\0t\0\0\0" in memory. I suspect that your wprintf is
grokking only 16-bit doodads -- "\t\0" is printed and then "\0\0" is
end-of-string. Try your wprintf on sample[0], ..., sample[3] in a loop
and see what you get. Use bog-standard printf to print the hex
representation of each of the 16 bytes starting at the address sample
is pointing to.
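
The byte layout described above can be checked without Python at all. The following is a minimal sketch, assuming little-endian storage as in the post; the helper names (put_u32_le, ascii_to_ucs4_le, units_before_zero) are made up for illustration. It lays out "test" as 4-byte units and counts how many units a reader sees before the first all-zero unit, for a given assumed unit size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 4-byte code unit in little-endian order, independent of the
   byte order of the machine running this code. */
static void put_u32_le(unsigned char *out, uint32_t value)
{
    out[0] = (unsigned char)(value & 0xFF);
    out[1] = (unsigned char)((value >> 8) & 0xFF);
    out[2] = (unsigned char)((value >> 16) & 0xFF);
    out[3] = (unsigned char)((value >> 24) & 0xFF);
}

/* Lay out an ASCII string as little-endian UCS-4, including the
   terminating 4-byte zero. out must hold 4 * (strlen(text) + 1) bytes. */
static void ascii_to_ucs4_le(const char *text, unsigned char *out)
{
    size_t i, n = strlen(text);
    assert(out != NULL);
    for (i = 0; i <= n; i++)
        put_u32_le(out + 4 * i, (uint32_t)(unsigned char)text[i]);
}

/* Count the units of unit_size bytes that precede the first all-zero
   unit, i.e. the length seen by a reader that assumes that unit size. */
static size_t units_before_zero(const unsigned char *bytes, size_t nbytes,
                                size_t unit_size)
{
    size_t i, j;
    for (i = 0; i + unit_size <= nbytes; i += unit_size) {
        int all_zero = 1;
        for (j = 0; j < unit_size; j++)
            if (bytes[i + j] != 0)
                all_zero = 0;
        if (all_zero)
            return i / unit_size;
    }
    return nbytes / unit_size;
}
```

Calling ascii_to_ucs4_le("test", buf) on a 20-byte buffer and then units_before_zero(buf, 20, 4) counts all four characters, while a unit size of 2 or 1 stops after the first one, which matches the symptom reported in this thread.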

John Machin · Mar 23, 2009

Correction to my previous post: I typed \t in two places where I should
have typed t. The ucs4 string will look like
"t\0\0\0e\0\0\0s\0\0\0t\0\0\0" in memory, so "t\0" is what gets printed
before "\0\0" is taken as end-of-string.

M.-A. Lemburg · Mar 23, 2009

// use wide char function
size = PyUnicode_AsWideChar(databaseObj, sample,
PyUnicode_GetSize(databaseObj));

The third argument is the buffer size in wchar_t characters, not bytes
(see the header comment below). The buffer itself needs
sizeof(wchar_t) * PyUnicode_GetSize(databaseObj) bytes without a
trailing NUL, otherwise sizeof(wchar_t) *
(PyUnicode_GetSize(databaseObj) + 1).

You also have to allocate the buffer to store the wchar_t data in.
Passing in a NULL pointer will result in a seg fault. The function
does not allocate a buffer for you:

/* Copies the Unicode Object contents into the wchar_t buffer w. At
most size wchar_t characters are copied.

Note that the resulting wchar_t string may or may not be
0-terminated. It is the responsibility of the caller to make sure
that the wchar_t string is 0-terminated in case this is required by
the application.

Returns the number of wchar_t characters copied (excluding a
possibly trailing 0-termination character) or -1 in case of an
error. */

PyAPI_FUNC(Py_ssize_t) PyUnicode_AsWideChar(
PyUnicodeObject *unicode, /* Unicode object */
register wchar_t *w, /* wchar_t buffer */
Py_ssize_t size /* size of buffer */
);
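
The caller's two responsibilities named in the header comment (allocate the buffer, add the terminator) can be sketched in plain C. This is only an illustration under stated assumptions: copy_wide is a hypothetical stand-in that mimics the documented contract so the example compiles without the Python headers, and to_terminated_copy is an equally hypothetical wrapper showing the calling pattern. In a real extension the call to copy_wide would be the PyUnicode_AsWideChar() call.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in with the contract from the header comment:
   copies at most size wchar_t characters from src (length len) into w,
   does not add a terminator, and returns the number copied or -1 on
   error. */
static long copy_wide(const wchar_t *src, size_t len, wchar_t *w, size_t size)
{
    size_t i, n = len < size ? len : size;
    if (src == NULL || w == NULL)
        return -1;
    for (i = 0; i < n; i++)
        w[i] = src[i];
    return (long)n;
}

/* Caller side: allocate len + 1 characters, copy, then terminate.
   Returns a malloc'ed string that the caller must free, or NULL on
   failure. */
static wchar_t *to_terminated_copy(const wchar_t *src, size_t len)
{
    wchar_t *w = malloc((len + 1) * sizeof(wchar_t));
    long copied;
    if (w == NULL)
        return NULL;
    copied = copy_wide(src, len, w, len);
    if (copied < 0) {
        free(w);
        return NULL;
    }
    assert((size_t)copied <= len);  /* the spare slot is still free */
    w[copied] = L'\0';              /* termination is the caller's job */
    return w;
}
```

The allocation is sized in bytes, (len + 1) * sizeof(wchar_t), while the count handed to the copy function is in characters; keeping those two units apart is the point of the sketch.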

--
Marc-Andre Lemburg
eGenix.com


abhi · Mar 23, 2009

Thanks Marc, John,
With your help, I am at least getting somewhere. I rewrote the code
to compare the Py_UNICODE and wchar_t outputs, and they both look
exactly the same.

#include <Python.h>

static PyObject *unicode_helper(PyObject *self, PyObject *args){
    const char *name;
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;
    wchar_t *w = NULL;
    int size = 0;
    int i;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }

    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);
    printf("size of sampleObj is : %d\n", PyUnicode_GET_SIZE(sampleObj));
    w = (wchar_t *) malloc((PyUnicode_GET_SIZE(sampleObj)+1)*sizeof(wchar_t));
    size = PyUnicode_AsWideChar(sampleObj, w,
                                (PyUnicode_GET_SIZE(sampleObj)+1)*sizeof(wchar_t));
    printf("%d chars are copied to w\n", size);
    printf("size of wchar_t is : %d\n", sizeof(wchar_t));
    printf("size of Py_UNICODE is: %d\n", sizeof(Py_UNICODE));
    for(i=0; i<PyUnicode_GET_SIZE(sampleObj); i++){
        printf("sample is : %c\n", sample[i]);
        printf("w is : %c\n", w[i]);
    }
    return sampleObj;
}

static PyMethodDef funcs[] = {
    {"unicodeTest", (PyCFunction)unicode_helper, METH_VARARGS, "test ucs2, ucs4"},
    {NULL}
};

void initunicodeTest(void){
    Py_InitModule3("unicodeTest", funcs, "");
}

This gives the following output when I pass "abc" as input:

size of sampleObj is : 3
3 chars are copied to w
size of wchar_t is : 4
size of Py_UNICODE is: 4
sample is : a
w is : a
sample is : b
w is : b
sample is : c
w is : c

So, both Py_UNICODE and wchar_t are 4 bytes and since it contains 3
\0s after a char, printf or wprintf is only printing one letter.
I need to further process the data and those libraries will need the
data in UCS2 format (2 bytes), otherwise they fail. Is there any way
by which I can force wchar_t to be 2 bytes, or can I convert this UCS4
data to UCS2 explicitly?

-
Abhigyan

M.-A. Lemburg · Mar 23, 2009

Sure: just use the appropriate UTF-16 codec for this.

/* Generic codec based encoding API.

object is passed through the encoder function found for the given
encoding using the error handling method defined by errors. errors
may be NULL to use the default method defined for the codec.

Raises a LookupError in case no encoder can be found.

*/

PyAPI_FUNC(PyObject *) PyCodec_Encode(
PyObject *object,
const char *encoding,
const char *errors
);

encoding needs to be set to 'utf-16-le' for little endian, 'utf-16-be'
for big endian.
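
Which of the two codec names to pass depends on the byte order of the machine that will read the encoded bytes as 16-bit units. A small sketch of that choice, assuming the consumer runs on the same machine as the extension; native_utf16_codec is a hypothetical helper name, and only the two codec names come from the advice above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the codec name matching the byte order of the machine this
   code runs on, so the encoded bytes can be read back as native
   16-bit units. */
static const char *native_utf16_codec(void)
{
    const uint16_t probe = 0x0102;
    unsigned char first;
    memcpy(&first, &probe, 1);      /* lowest-addressed byte of probe */
    assert(first == 0x01 || first == 0x02);
    return first == 0x02 ? "utf-16-le" : "utf-16-be";
}
```

The returned string would then be used as the encoding argument of PyCodec_Encode(). Passing an explicit byte order also avoids the byte order mark that a plain "utf-16" encoder prepends.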

--
Marc-Andre Lemburg
eGenix.com


abhi · Mar 23, 2009

Thanks, but this is returning PyObject *, whereas I need value in some
variable which can be printed using wprintf() like wchar_t (having a
size of 2 bytes). If I again convert this PyObject to wchar_t or
PyUnicode, I go back to where I started.

-
Abhigyan

abhi · Mar 23, 2009

Hi Marc,
Is there any way to ensure that wchar_t is always 2 bytes
instead of 4 on a UCS4-configured Python? Googling gave me the
impression that there is some logic in PyUnicode_AsWideChar()
which takes care of the UCS4 to UCS2 conversion if the sizes of
Py_UNICODE and wchar_t are different.

-
Abhigyan

M.-A. Lemburg · Mar 23, 2009

wchar_t is defined by your compiler. There's no way to change that.

However, you can configure Python to use UCS2 (default) or UCS4 (used
on most Unix platforms), so it's easy to customize for your needs.

--
Marc-Andre Lemburg
eGenix.com


M.-A. Lemburg · Mar 23, 2009

PyCodec_Encode() will return a PyString object with the UTF-16 data.
You can use PyString_AS_STRING() to access the data stored by it.

Note that writing your own UCS2/UCS4 converter isn't all that hard
either. Just have a look at the code in unicodeobject.c for
PyUnicode_AsWideChar().
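
As an illustration of how small such a converter can be, here is a sketch of the UCS-4 to UTF-16 direction in plain C. It is not the code from unicodeobject.c; the function name ucs4_to_utf16 and its error convention are assumptions made for this example. Code points above U+FFFF do not fit in one 16-bit unit, so they are written as a surrogate pair, which a strictly UCS-2 consumer may not accept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert len UCS-4 code points to UTF-16 code units.
   Writes at most cap units into out and returns the number of units
   written, or -1 if the input contains an invalid code point or out
   is too small. */
static long ucs4_to_utf16(const uint32_t *in, size_t len,
                          uint16_t *out, size_t cap)
{
    size_t i, n = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        uint32_t cp = in[i];
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;                  /* not a Unicode scalar value */
        if (cp < 0x10000) {
            if (n + 1 > cap)
                return -1;
            out[n++] = (uint16_t)cp;    /* BMP: one 16-bit unit */
        } else {
            if (n + 2 > cap)
                return -1;
            cp -= 0x10000;              /* 20 bits split over a pair */
            out[n++] = (uint16_t)(0xD800 + (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        }
    }
    return (long)n;
}
```

For the input 'a', 'b', 'c', U+1F600 the function writes five units: the three letters unchanged, followed by the pair 0xD83D 0xDE00. The output is not NUL-terminated; a caller that needs a terminator has to reserve one extra unit and write it.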

--
Marc-Andre Lemburg
eGenix.com


Martin v. Löwis · Mar 23, 2009

So, both Py_UNICODE and wchar_t are 4 bytes and since it contains 3
\0s after a char, printf or wprintf is only printing one letter.

No. printf indeed will see a terminating character. However, wprintf
should correctly know that a wchar_t has four bytes per character,
and print it correctly. Make sure to use %ls to print wchar_t arrays;
%s would print multi-byte character strings.
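
The %ls point can be tried in isolation. A minimal sketch, assuming a C99 <wchar.h>: swprintf is used instead of wprintf so that the formatted result can be inspected, and format_wide is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Format a wide string into buf (capacity cap wide characters) using
   %ls, the conversion for wchar_t arrays in the wide printf family.
   Returns the number of wide characters written, or a negative value
   on failure. */
static int format_wide(wchar_t *buf, size_t cap, const wchar_t *value)
{
    assert(buf != NULL && value != NULL);
    return swprintf(buf, cap, L"value is : %ls\n", value);
}
```

With a 64-element buffer and the argument L"test", the whole string is formatted regardless of whether wchar_t is 2 or 4 bytes wide on the platform.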

I need to further process the data and those libraries will need the
data in UCS2 format (2 bytes), otherwise they fail.

Are you absolutely sure about that? Why does that library expect
UCS-2, when your system's wchar_t is four bytes?

In any case, do what MAL told you: use the UTF-16 codec to convert
the Unicode string to a byte string with 2 bytes per character (for
characters in the Basic Multilingual Plane). The PyObject you get from
the conversion is a byte string object; use PyString_AsStringAndSize
to get to the actual bytes.

Regards,
Martin

abhi · Mar 25, 2009

Thanks Marc and Martin, my preliminary trials are showing positive
results with this method.

-
Abhigyan
