Unicode problem in ucs4

abhi

Hi,
I have a C extension which takes a unicode or string value from
Python and converts it to unicode before doing more operations on it.
The skeleton looks like:

static PyObject *unicode_helper(PyObject *self, PyObject *args){
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }
    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);
    ............
    // perform other operations.
    .............
}

This piece of code works fine on a Python built with the UCS2
configuration but fails on a UCS4 build. By failing, I mean that only
the first letter ends up in the variable sample, i.e. if I pass "test"
from Python then sample contains only "t". However,
PyUnicode_GetSize(sampleObj) returns the correct value (4 in this case).

Any idea on why this is happening? Any help will be appreciated.

Regards,
Abhigyan
 
Martin v. Löwis

Any idea on why this is happening?

Can you provide a complete example? Your code looks correct, and should
just work.

How do you know the result contains only 't' (i.e. how do you know it
does not contain 'e', 's', 't')?

Regards,
Martin
 
abhi

Hi Martin,
Here is the code:
unicodeTest.c

#include <Python.h>

static PyObject *unicode_helper(PyObject *self, PyObject *args){
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }

    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);
    wprintf(L"database value after unicode conversion is : %s\n", sample);
    return Py_BuildValue("");
}

static PyMethodDef funcs[] = {
    {"unicodeTest", (PyCFunction)unicode_helper, METH_VARARGS, "test ucs2, ucs4"},
    {NULL}
};

void initunicodeTest(void){
    Py_InitModule3("unicodeTest", funcs, "");
}

When I install this unicodeTest module on a UCS2 Python, wprintf prints
whatever is passed, e.g.:

import unicodeTest
unicodeTest.unicodeTest("hello world")
database value after unicode conversion is : hello world

but it prints the following on ucs4 configured python:
database value after unicode conversion is : h

Regards,
Abhigyan
 
M.-A. Lemburg

You have to use PyUnicode_AsWideChar() to convert a Python
Unicode object to a wchar_t representation.

Please don't make any assumptions about what Py_UNICODE maps
to, and always use the Unicode API for this. It is designed
to provide a portable interface and will not do more conversion
work than necessary.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Mar 20 2009)
________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/
 
abhi

Hi Marc,
Thanks for the help. I tried PyUnicode_AsWideChar() but I am
getting the same result i.e. only the first letter.

sample code:

#include <Python.h>

static PyObject *unicode_helper(PyObject *self, PyObject *args){
    PyObject *sampleObj = NULL;
    wchar_t *sample = NULL;
    int size = 0;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }

    // use wide char function
    size = PyUnicode_AsWideChar(databaseObj, sample,
                                PyUnicode_GetSize(databaseObj));
    printf("%d chars are copied to sample\n", size);
    wprintf(L"database value after unicode conversion is : %s\n", sample);
    return Py_BuildValue("");
}

static PyMethodDef funcs[] = {
    {"unicodeTest", (PyCFunction)unicode_helper, METH_VARARGS, "test ucs2, ucs4"},
    {NULL}
};

void initunicodeTest(void){
    Py_InitModule3("unicodeTest", funcs, "");
}

This prints the following when input value is given as "test":
4 chars are copied to sample
database value after unicode conversion is : t

Any ideas?

-
Abhigyan
 
John Machin

What is databaseObj??? Copy/paste the *actual* code that you compiled
and ran.

[presuming littleendian] The ucs4 string will look like "\t\0\0\0e
\0\0\0s\0\0\0t\0\0\0" in memory. I suspect that your wprintf is
grokking only 16-bit doodads -- "\t\0" is printed and then "\0\0" is
end-of-string. Try your wprintf on sample[0], ..., sample[3] in a loop
and see what you get. Use bog-standard printf to print the hex
representation of each of the 16 bytes starting at the address sample
is pointing to.
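
A minimal sketch of that byte-level check, written without Python.h so it can be tried on its own. The helper names hex_dump_wide and length_as_16bit are made up for this illustration. The first dumps the raw bytes behind a wchar_t string as hex text; the second counts how many 16-bit units a reader that wrongly assumes 2-byte characters would see before it meets a 16-bit zero. On a little-endian system with a 4-byte wchar_t, the second function is expected to stop after a single unit for L"test".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Write the bytes behind the first n wide characters of s into out as
   hex text ("74 00 00 00 ..."). Returns the number of bytes dumped.
   out must hold at least 3 * n * sizeof(wchar_t) + 1 chars. */
size_t hex_dump_wide(const wchar_t *s, size_t n, char *out, size_t out_size)
{
    const unsigned char *bytes = (const unsigned char *)s;
    size_t total = n * sizeof(wchar_t);
    size_t i, pos = 0;

    assert(s != NULL && out != NULL);
    assert(out_size >= 3 * total + 1);
    for (i = 0; i < total; i++) {
        pos += (size_t)sprintf(out + pos, "%02x ", (unsigned int)bytes[i]);
    }
    out[pos] = '\0';
    return total;
}

/* Count the 16-bit units seen before the first all-zero 16-bit unit,
   i.e. the length a reader assuming 2-byte characters would report. */
size_t length_as_16bit(const wchar_t *s, size_t n)
{
    const unsigned char *bytes = (const unsigned char *)s;
    size_t total = n * sizeof(wchar_t);
    size_t i;

    assert(s != NULL);
    for (i = 0; i + 1 < total; i += 2) {
        if (bytes[i] == 0 && bytes[i + 1] == 0) {
            break;
        }
    }
    return i / 2;
}
```

In the extension, the same two helpers could be pointed at the buffer filled by the conversion call, with n set to the reported character count.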
 
John Machin

and typed \t in two places where he should have typed t :)
 
M.-A. Lemburg

The 3rd argument is the size of the buffer, and the buffer has to be
large enough for the result. Per the header comment below, the size is
counted in wchar_t characters, so the buffer needs
sizeof(wchar_t) * PyUnicode_GetSize(databaseObj) bytes without a
trailing NUL, otherwise
sizeof(wchar_t) * (PyUnicode_GetSize(databaseObj) + 1).

You also have to allocate the buffer to store the wchar_t data in.
Passing in a NULL pointer will result in a seg fault. The function
does not allocate a buffer for you:

/* Copies the Unicode Object contents into the wchar_t buffer w. At
most size wchar_t characters are copied.

Note that the resulting wchar_t string may or may not be
0-terminated. It is the responsibility of the caller to make sure
that the wchar_t string is 0-terminated in case this is required by
the application.

Returns the number of wchar_t characters copied (excluding a
possibly trailing 0-termination character) or -1 in case of an
error. */

PyAPI_FUNC(Py_ssize_t) PyUnicode_AsWideChar(
PyUnicodeObject *unicode, /* Unicode object */
register wchar_t *w, /* wchar_t buffer */
Py_ssize_t size /* size of buffer */
);
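
The caller's side of that contract can be sketched without Python.h. In the sketch below, copy_wide is a hypothetical stand-in that only mimics the documented behaviour (copy at most size wchar_t characters, no guaranteed terminator); it is not part of the Python API. copy_terminated shows the two things the caller has to do: allocate room for the characters plus one, and write the terminator itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in with the contract from the comment above:
   copies at most `size` wchar_t characters into dst and does not
   0-terminate. Returns the number of characters copied. */
static long copy_wide(const wchar_t *src, size_t src_len,
                      wchar_t *dst, size_t size)
{
    size_t n = src_len < size ? src_len : size;

    assert(dst != NULL);  /* the buffer must be supplied by the caller */
    memcpy(dst, src, n * sizeof(wchar_t));
    return (long)n;
}

/* Caller side: allocate src_len + 1 characters, copy, then terminate.
   Returns a malloc'ed string that the caller must free, or NULL. */
wchar_t *copy_terminated(const wchar_t *src, size_t src_len)
{
    wchar_t *buf = malloc((src_len + 1) * sizeof(wchar_t));
    long copied;

    if (buf == NULL) {
        return NULL;
    }
    copied = copy_wide(src, src_len, buf, src_len);
    buf[copied] = L'\0';  /* the copy routine may not have done this */
    return buf;
}
```

In the real extension the copy step would be the PyUnicode_AsWideChar() call, with its -1 error return checked before the result is used as an index.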

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Mar 23 2009)
 
abhi

Thanks Marc, John,
With your help, I am at least getting somewhere. I re-wrote the code
to compare the Py_UNICODE and wchar_t outputs and they both look exactly
the same.

#include <Python.h>

static PyObject *unicode_helper(PyObject *self, PyObject *args){
    PyObject *sampleObj = NULL;
    Py_UNICODE *sample = NULL;
    wchar_t *w = NULL;
    int size = 0;
    int i;

    if (!PyArg_ParseTuple(args, "O", &sampleObj)){
        return NULL;
    }

    // Explicitly convert it to unicode and get Py_UNICODE value
    sampleObj = PyUnicode_FromObject(sampleObj);
    sample = PyUnicode_AS_UNICODE(sampleObj);
    printf("size of sampleObj is : %d\n", (int)PyUnicode_GET_SIZE(sampleObj));
    w = (wchar_t *) malloc((PyUnicode_GET_SIZE(sampleObj)+1)*sizeof(wchar_t));
    size = PyUnicode_AsWideChar(sampleObj, w,
                                (PyUnicode_GET_SIZE(sampleObj)+1)*sizeof(wchar_t));
    printf("%d chars are copied to w\n", size);
    printf("size of wchar_t is : %d\n", (int)sizeof(wchar_t));
    printf("size of Py_UNICODE is: %d\n", (int)sizeof(Py_UNICODE));
    // print each character of both buffers, one per line
    for(i = 0; i < PyUnicode_GET_SIZE(sampleObj); i++){
        printf("sample is : %c\n", sample[i]);
        printf("w is : %c\n", w[i]);
    }
    free(w);
    return sampleObj;
}

static PyMethodDef funcs[] = {
    {"unicodeTest", (PyCFunction)unicode_helper, METH_VARARGS, "test ucs2, ucs4"},
    {NULL}
};

void initunicodeTest(void){
    Py_InitModule3("unicodeTest", funcs, "");
}

This gives the following output when I pass "abc" as input:

size of sampleObj is : 3
3 chars are copied to w
size of wchar_t is : 4
size of Py_UNICODE is: 4
sample is : a
w is : a
sample is : b
w is : b
sample is : c
w is : c

So, both Py_UNICODE and wchar_t are 4 bytes and since it contains 3
\0s after a char, printf or wprintf is only printing one letter.
I need to further process the data and those libraries will need the
data in UCS2 format (2 bytes), otherwise they fail. Is there any way
by which I can force wchar_t to be 2 bytes, or can I convert this UCS4
data to UCS2 explicitly?

-
Abhigyan
 
M.-A. Lemburg

Sure: just use the appropriate UTF-16 codec for this.

/* Generic codec based encoding API.

object is passed through the encoder function found for the given
encoding using the error handling method defined by errors. errors
may be NULL to use the default method defined for the codec.

Raises a LookupError in case no encoder can be found.

*/

PyAPI_FUNC(PyObject *) PyCodec_Encode(
PyObject *object,
const char *encoding,
const char *errors
);

encoding needs to be set to 'utf-16-le' for little endian, 'utf-16-be'
for big endian.
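
What the two codec names mean at the byte level can be sketched in plain C. The helper utf16_to_bytes is a made-up name for this illustration and is not how the Python codec is implemented; it only shows the byte order each name stands for: 'utf-16-le' stores the low byte of every 16-bit unit first, 'utf-16-be' stores the high byte first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Serialize n UTF-16 code units into out (which must hold 2 * n bytes).
   big_endian != 0 gives the 'utf-16-be' layout, 0 gives 'utf-16-le'.
   Returns the number of bytes written. */
size_t utf16_to_bytes(const uint16_t *units, size_t n, int big_endian,
                      unsigned char *out)
{
    size_t i;

    assert(units != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        unsigned char hi = (unsigned char)(units[i] >> 8);
        unsigned char lo = (unsigned char)(units[i] & 0xFF);

        out[2 * i]     = big_endian ? hi : lo;
        out[2 * i + 1] = big_endian ? lo : hi;
    }
    return 2 * n;
}
```

A library that reads the bytes back as native 16-bit values expects the layout that matches the machine it runs on, which is why the codec name has to be chosen per platform.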

--
Marc-Andre Lemburg
eGenix.com

 
abhi

Thanks, but this is returning PyObject *, whereas I need value in some
variable which can be printed using wprintf() like wchar_t (having a
size of 2 bytes). If I again convert this PyObject to wchar_t or
PyUnicode, I go back to where I started. :)

-
Abhigyan
 
abhi

Hi Marc,
Is there any way to ensure that wchar_t size would always be 2
instead of 4 in ucs4 configured python? Googling gave me the
impression that there is some logic written in PyUnicode_AsWideChar()
which can take care of ucs4 to ucs2 conversion if sizes of Py_UNICODE
and wchar_t are different.

-
Abhigyan
 
M.-A. Lemburg

Hi Marc,
Is there any way to ensure that wchar_t size would always be 2
instead of 4 in ucs4 configured python? Googling gave me the
impression that there is some logic written in PyUnicode_AsWideChar()
which can take care of ucs4 to ucs2 conversion if sizes of Py_UNICODE
and wchar_t are different.

wchar_t is defined by your compiler. There's no way to change that.

However, you can configure Python to use UCS2 (default) or UCS4 (used
on most Unix platforms), so it's easy to customize for your needs.

--
Marc-Andre Lemburg
eGenix.com

 
M.-A. Lemburg

Thanks, but this is returning PyObject *, whereas I need value in some
variable which can be printed using wprintf() like wchar_t (having a
size of 2 bytes). If I again convert this PyObject to wchar_t or
PyUnicode, I go back to where I started. :)

It will return a PyString object with the UTF-16 data. You can
use PyString_AS_STRING() to access the data stored by it.

Note that writing your own UCS2/UCS4 converter isn't all that hard
either. Just have a look at the code in unicodeobject.c for
PyUnicode_AsWideChar().
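
A hand-written converter along those lines might look like the sketch below. It is not the code from unicodeobject.c; the function name ucs4_to_utf16 and its error convention are assumptions made for this illustration. It produces UTF-16 rather than strict UCS-2: code points above U+FFFF, which UCS-2 cannot represent, become surrogate pairs. Surrogate values that already appear in the input are copied through unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert n UCS-4 code points to UTF-16 code units.
   Returns the number of 16-bit units written, or (size_t)-1 if dst is
   too small or a code point lies above U+10FFFF. */
size_t ucs4_to_utf16(const uint32_t *src, size_t n,
                     uint16_t *dst, size_t dst_cap)
{
    size_t i, out = 0;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++) {
        uint32_t cp = src[i];

        if (cp < 0x10000) {
            /* Basic Multilingual Plane: one 16-bit unit */
            if (out + 1 > dst_cap) {
                return (size_t)-1;
            }
            dst[out++] = (uint16_t)cp;
        } else if (cp <= 0x10FFFF) {
            /* Supplementary plane: high and low surrogate */
            if (out + 2 > dst_cap) {
                return (size_t)-1;
            }
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            return (size_t)-1;
        }
    }
    return out;
}
```

Sizing dst at 2 * n units is always enough, since each code point produces at most two units.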

--
Marc-Andre Lemburg
eGenix.com

 
Martin v. Löwis

So, both Py_UNICODE and wchar_t are 4 bytes and since it contains 3
\0s after a char, printf or wprintf is only printing one letter.

No. printf indeed will see a terminating character. However, wprintf
should correctly know that a wchar_t has four bytes per character,
and print it correctly. Make sure to use %ls to print wchar_t arrays;
%s would print multi-byte character strings.
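
The %ls point can be tried in isolation with the small sketch below. It uses swprintf instead of wprintf so the result lands in a buffer that can be inspected; format_wide is a made-up helper name. With %ls the argument is read as a wchar_t string, whatever the size of wchar_t. With %s, a C library that follows the C standard reads the argument as a char string, so on a little-endian system with a 4-byte wchar_t it would stop at the first zero byte, right after the first letter.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Format a wide string into dst using the %ls conversion.
   Returns the number of wide characters written, not counting the
   terminator, or a negative value on failure. */
int format_wide(wchar_t *dst, size_t cap, const wchar_t *value)
{
    assert(dst != NULL && value != NULL);
    return swprintf(dst, cap, L"value: %ls", value);
}
```

The same format string works with wprintf once the output stream is in wide-character mode.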

I need to further process the data and those libraries will need the
data in UCS2 format (2 bytes), otherwise they fail.

Are you absolutely sure about that? Why does that library expect
UCS-2 when your system's wchar_t is four bytes?

In any case, do what MAL told you: use the UTF-16 codec to convert
the Unicode string to a 2-bytes-per-char byte string (for characters in
the Basic Multilingual Plane, UTF-16 and UCS-2 are identical). The
PyObject you get from the conversion is a byte string object; use
PyString_AsStringAndSize to get to the actual bytes.

Regards,
Martin
 
abhi

Thanks Marc and Martin, my preliminary trials are showing positive
results with this method.

-
Abhigyan
 
