Unicode problem in ucs4

Discussion in 'Python' started by abhi, Mar 19, 2009.

  1. abhi

    abhi Guest

    Hi,
    I have a C extension which takes a unicode or string value from
    Python and converts it to unicode before doing more operations on it.
    The skeleton looks like:

    static PyObject *unicode_helper(PyObject *self, PyObject *args){
        PyObject *sampleObj = NULL;
        Py_UNICODE *sample = NULL;

        if (!PyArg_ParseTuple(args, "O", &sampleObj)){
            return NULL;
        }
        // Explicitly convert it to unicode and get Py_UNICODE value
        sampleObj = PyUnicode_FromObject(sampleObj);
        sample = PyUnicode_AS_UNICODE(sampleObj);
        ............
        // perform other operations.
        .............
    }

    This code works fine on a Python built with the UCS2 configuration
    but fails on a UCS4 build. By failing, I mean that only the first
    letter ends up in the variable sample, i.e. if I pass "test" from
    Python then sample contains only "t". However, PyUnicode_GetSize
    (sampleObj) returns the correct value (4 in this case).

    Any idea on why this is happening? Any help will be appreciated.

    Regards,
    Abhigyan
     
    abhi, Mar 19, 2009
    #1

  2. > Any idea on why this is happening?

    Can you provide a complete example? Your code looks correct, and should
    just work.

    How do you know the result contains only 't' (i.e. how do you know it
    does not contain 'e', 's', 't')?

    Regards,
    Martin
     
    Martin v. Löwis, Mar 20, 2009
    #2

  3. abhi

    abhi Guest

    Hi Martin,
    Here is the code:
    unicodeTest.c

    #include <Python.h>

    static PyObject *unicode_helper(PyObject *self, PyObject *args){
        PyObject *sampleObj = NULL;
        Py_UNICODE *sample = NULL;

        if (!PyArg_ParseTuple(args, "O", &sampleObj)){
            return NULL;
        }

        // Explicitly convert it to unicode and get Py_UNICODE value
        sampleObj = PyUnicode_FromObject(sampleObj);
        sample = PyUnicode_AS_UNICODE(sampleObj);
        wprintf(L"database value after unicode conversion is : %s\n",
                sample);
        return Py_BuildValue("");
    }

    static PyMethodDef funcs[] = {
        {"unicodeTest", (PyCFunction)unicode_helper, METH_VARARGS, "test ucs2, ucs4"},
        {NULL}
    };

    void initunicodeTest(void){
        Py_InitModule3("unicodeTest", funcs, "");
    }

    When I install this unicodeTest on a UCS2 Python, wprintf prints whatever
    is passed, e.g.

    import unicodeTest
    unicodeTest.unicodeTest("hello world")
    database value after unicode conversion is : hello world

    but it prints the following on a UCS4-configured Python:
    database value after unicode conversion is : h

    Regards,
    Abhigyan
     
    abhi, Mar 20, 2009
    #3
  4. On 2009-03-20 12:13, abhi wrote:
    > // Explicitly convert it to unicode and get Py_UNICODE value
    > sampleObj = PyUnicode_FromObject(sampleObj);
    > sample = PyUnicode_AS_UNICODE(sampleObj);
    > wprintf(L"database value after unicode conversion is : %s\n",
    > sample);


    You have to use PyUnicode_AsWideChar() to convert a Python
    Unicode object to a wchar_t representation.

    Please don't make any assumptions about what Py_UNICODE maps
    to and always use the Unicode API for this. It is designed
    to provide a portable interface and will not do more conversion
    work than necessary.

    --
    Marc-Andre Lemburg
    eGenix.com

    Professional Python Services directly from the Source (#1, Mar 20 2009)
    >>> Python/Zope Consulting and Support ... http://www.egenix.com/
    >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
    >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

    ________________________________________________________________________

    ::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


    eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
    Registered at Amtsgericht Duesseldorf: HRB 46611
    http://www.egenix.com/company/contact/
     
    M.-A. Lemburg, Mar 20, 2009
    #4
  5. abhi

    abhi Guest

    Hi Marc,
    Thanks for the help. I tried PyUnicode_AsWideChar() but I am
    getting the same result, i.e. only the first letter.

    sample code:

    #include <Python.h>

    static PyObject *unicode_helper(PyObject *self, PyObject *args){
        PyObject *sampleObj = NULL;
        wchar_t *sample = NULL;
        int size = 0;

        if (!PyArg_ParseTuple(args, "O", &sampleObj)){
            return NULL;
        }

        // use wide char function
        size = PyUnicode_AsWideChar(databaseObj, sample,
                                    PyUnicode_GetSize(databaseObj));
        printf("%d chars are copied to sample\n", size);
        wprintf(L"database value after unicode conversion is : %s\n",
                sample);
        return Py_BuildValue("");
    }

    static PyMethodDef funcs[] = {
        {"unicodeTest", (PyCFunction)unicode_helper, METH_VARARGS, "test ucs2, ucs4"},
        {NULL}
    };

    void initunicodeTest(void){
        Py_InitModule3("unicodeTest", funcs, "");
    }

    This prints the following when input value is given as "test":
    4 chars are copied to sample
    database value after unicode conversion is : t

    Any ideas?

    -
    Abhigyan
     
    abhi, Mar 23, 2009
    #5
  6. John Machin

    John Machin Guest

    On Mar 23, 6:18 pm, abhi <> wrote:

    [snip]
    >       size = PyUnicode_AsWideChar(databaseObj, sample,
    > PyUnicode_GetSize(databaseObj));


    What is databaseObj??? Copy/paste the *actual* code that you compiled
    and ran.

    > This prints the following when input value is given as "test":
    > 4 chars are copied to sample
    > database value after unicode conversion is : t


    [presuming littleendian] The ucs4 string will look like "\t\0\0\0e
    \0\0\0s\0\0\0t\0\0\0" in memory. I suspect that your wprintf is
    grokking only 16-bit doodads -- "\t\0" is printed and then "\0\0" is
    end-of-string. Try your wprintf on sample[0], ..., sample[3] in a loop
    and see what you get. Use bog-standard printf to print the hex
    representation of each of the 16 bytes starting at the address sample
    is pointing to.
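
The byte dump suggested here can be sketched in plain C. This is a minimal illustration, not code from the thread: hex_dump is a hypothetical helper name, and it formats into a caller-supplied buffer so the result can be printed with ordinary printf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the n bytes at p into out as two-digit hex
   values separated by single spaces, e.g. "74 00 00 00".
   Returns the number of characters written (excluding the NUL),
   or 0 if n is 0 or out is too small. */
static size_t hex_dump(const void *p, size_t n, char *out, size_t outsz)
{
    const unsigned char *bytes = (const unsigned char *)p;
    size_t used = 0;
    size_t i;

    assert(p != NULL && out != NULL);

    /* Each byte needs two digits plus either a separator or the final NUL. */
    if (n == 0 || outsz < n * 3) {
        if (outsz > 0)
            out[0] = '\0';
        return 0;
    }
    for (i = 0; i < n; i++) {
        used += (size_t)snprintf(out + used, outsz - used, "%02x%s",
                                 (unsigned)bytes[i], i + 1 < n ? " " : "");
    }
    return used;
}
```

In the extension this would presumably be called as hex_dump(sample, 16, buf, sizeof buf) and the result printed with printf("%s\n", buf). On a little-endian UCS4 build one would expect three 00 bytes after each ASCII letter.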
     
    John Machin, Mar 23, 2009
    #6
  7. John Machin

    John Machin Guest

    On Mar 23, 6:41 pm, John Machin <> had a severe
    attack of backslashitis:

    > [presuming littleendian] The ucs4 string will look like "\t\0\0\0e
    > \0\0\0s\0\0\0t\0\0\0" in memory. I suspect that your wprintf is
    > grokking only 16-bit doodads -- "\t\0" is printed and then "\0\0" is
    > end-of-string. Try your wprintf on sample[0], ..., sample[3] in a loop
    > and see what you get. Use bog-standard printf to print the hex
    > representation of each of the 16 bytes starting at the address sample
    > is pointing to.


    and typed \t in two places where he should have typed t :)
     
    John Machin, Mar 23, 2009
    #7
  8. On 2009-03-23 08:18, abhi wrote:
    > // use wide char function
    > size = PyUnicode_AsWideChar(databaseObj, sample,
    > PyUnicode_GetSize(databaseObj));


    The third argument is the buffer size in wchar_t characters (see the
    header comment below), not bytes. The buffer itself needs
    sizeof(wchar_t) * PyUnicode_GetSize(databaseObj) bytes without a
    trailing NUL, otherwise sizeof(wchar_t) *
    (PyUnicode_GetSize(databaseObj) + 1).

    You also have to allocate the buffer to store the wchar_t data in.
    Passing in a NULL pointer will result in a seg fault. The function
    does not allocate a buffer for you:

    /* Copies the Unicode Object contents into the wchar_t buffer w.  At
       most size wchar_t characters are copied.

       Note that the resulting wchar_t string may or may not be
       0-terminated.  It is the responsibility of the caller to make sure
       that the wchar_t string is 0-terminated in case this is required by
       the application.

       Returns the number of wchar_t characters copied (excluding a
       possibly trailing 0-termination character) or -1 in case of an
       error. */

    PyAPI_FUNC(Py_ssize_t) PyUnicode_AsWideChar(
        PyUnicodeObject *unicode,   /* Unicode object */
        register wchar_t *w,        /* wchar_t buffer */
        Py_ssize_t size             /* size of buffer */
        );
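
The allocate-then-terminate pattern described above can be sketched without the Python headers. In this illustration copy_wide is a hypothetical stand-in for PyUnicode_AsWideChar() that mimics only the documented contract (copy at most size wchar_t characters, no guaranteed terminator), and to_terminated_wide is likewise an invented name.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for PyUnicode_AsWideChar(): copies at most `size`
   wchar_t characters from src (which holds `len` characters) into w.
   It never writes a terminator. Returns the number of characters copied. */
static long copy_wide(const wchar_t *src, size_t len, wchar_t *w, size_t size)
{
    size_t n = len < size ? len : size;

    assert(src != NULL && w != NULL);
    wmemcpy(w, src, n);
    return (long)n;
}

/* Caller-side pattern: allocate room for len characters plus a terminator,
   copy, then terminate explicitly. Returns a malloc'd buffer that the
   caller must free(), or NULL on failure. */
static wchar_t *to_terminated_wide(const wchar_t *src, size_t len)
{
    wchar_t *w = malloc((len + 1) * sizeof(wchar_t));
    long copied;

    if (w == NULL)
        return NULL;
    copied = copy_wide(src, len, w, len);   /* size is in characters */
    if (copied < 0) {
        free(w);
        return NULL;
    }
    w[copied] = L'\0';                      /* the copy did not terminate */
    return w;
}
```

With the real API the copy_wide call would be replaced by PyUnicode_AsWideChar() on the unicode object; the allocation and explicit termination stay the same.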




    --
    Marc-Andre Lemburg
    eGenix.com

     
    M.-A. Lemburg, Mar 23, 2009
    #8
  9. abhi

    abhi Guest

    Thanks Marc, John,
    With your help, I am at least getting somewhere. I rewrote the code
    to compare the Py_UNICODE and wchar_t outputs, and they both look
    exactly the same.

    #include <Python.h>

    static PyObject *unicode_helper(PyObject *self, PyObject *args){
        const char *name;
        PyObject *sampleObj = NULL;
        Py_UNICODE *sample = NULL;
        wchar_t *w = NULL;
        int size = 0;
        int i;

        if (!PyArg_ParseTuple(args, "O", &sampleObj)){
            return NULL;
        }

        // Explicitly convert it to unicode and get Py_UNICODE value
        sampleObj = PyUnicode_FromObject(sampleObj);
        sample = PyUnicode_AS_UNICODE(sampleObj);
        printf("size of sampleObj is : %d\n", PyUnicode_GET_SIZE(sampleObj));
        w = (wchar_t *) malloc((PyUnicode_GET_SIZE(sampleObj)+1)*sizeof(wchar_t));
        size = PyUnicode_AsWideChar(sampleObj, w,
                                    (PyUnicode_GET_SIZE(sampleObj)+1)*sizeof(wchar_t));
        printf("%d chars are copied to w\n", size);
        printf("size of wchar_t is : %d\n", sizeof(wchar_t));
        printf("size of Py_UNICODE is: %d\n", sizeof(Py_UNICODE));
        for(i=0; i<PyUnicode_GET_SIZE(sampleObj); i++){
            printf("sample is : %c\n", sample[i]);
            printf("w is : %c\n", w[i]);
        }
        return sampleObj;
    }

    static PyMethodDef funcs[] = {
        {"unicodeTest", (PyCFunction)unicode_helper, METH_VARARGS, "test ucs2, ucs4"},
        {NULL}
    };

    void initunicodeTest(void){
        Py_InitModule3("unicodeTest", funcs, "");
    }

    This gives the following output when I pass "abc" as input:

    size of sampleObj is : 3
    3 chars are copied to w
    size of wchar_t is : 4
    size of Py_UNICODE is: 4
    sample is : a
    w is : a
    sample is : b
    w is : b
    sample is : c
    w is : c

    So, both Py_UNICODE and wchar_t are 4 bytes and since it contains 3
    \0s after a char, printf or wprintf is only printing one letter.
    I need to further process the data and those libraries will need the
    data in UCS2 format (2 bytes), otherwise they fail. Is there any way
    by which I can force wchar_t to be 2 bytes, or can I convert this UCS4
    data to UCS2 explicitly?
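
One more thing worth checking, assuming a C99-conforming C library: in the wprintf family, %s consumes a narrow char * argument while %ls consumes a wchar_t *. A wide buffer passed to %s is read as bytes and so stops at the first zero byte. Some C runtimes, notably Microsoft's, interpret %s in the wide functions differently, so behaviour can vary by platform. The sketch below uses swprintf so the result can be inspected; format_wide and format_narrow are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Format a wide string into out. %ls is the conversion that takes a
   wchar_t * argument. Returns the number of wide characters written,
   or a negative value on error. */
static int format_wide(wchar_t *out, size_t outlen, const wchar_t *text)
{
    assert(out != NULL && text != NULL);
    return swprintf(out, outlen, L"value: %ls", text);
}

/* Format a narrow string into out. In standard C, plain %s in a wide
   format string takes a char * argument and converts it to wide. */
static int format_narrow(wchar_t *out, size_t outlen, const char *text)
{
    assert(out != NULL && text != NULL);
    return swprintf(out, outlen, L"value: %s", text);
}
```

Under these assumptions, printing the buffer from the code above with wprintf(L"%ls\n", w), after making sure it is 0-terminated, would be the first thing to try before converting anything.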

    -
    Abhigyan
     
    abhi, Mar 23, 2009
    #9
  10. On 2009-03-23 11:50, abhi wrote:
    > So, both Py_UNICODE and wchar_t are 4 bytes and since it contains 3
    > \0s after a char, printf or wprintf is only printing one letter.
    > I need to further process the data and those libraries will need the
    > data in UCS2 format (2 bytes), otherwise they fail. Is there any way
    > by which I can force wchar_t to be 2 bytes, or can I convert this UCS4
    > data to UCS2 explicitly?


    Sure: just use the appropriate UTF-16 codec for this.

    /* Generic codec based encoding API.

       object is passed through the encoder function found for the given
       encoding using the error handling method defined by errors. errors
       may be NULL to use the default method defined for the codec.

       Raises a LookupError in case no encoder can be found.

     */

    PyAPI_FUNC(PyObject *) PyCodec_Encode(
        PyObject *object,
        const char *encoding,
        const char *errors
        );

    encoding needs to be set to 'utf-16-le' for little endian, 'utf-16-be'
    for big endian.
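
PyCodec_Encode() returns a Python object holding the encoded bytes. For comparison, the UTF-16 encoding step itself can be sketched in plain C. This is an illustration under the assumption that the input is an array of 4-byte code points (as on the UCS4 build in this thread); ucs4_to_utf16 is a hypothetical helper, not part of the Python API. Code points above U+FFFF become surrogate pairs, which a strictly UCS2 consumer may not accept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode n UCS-4 code points as UTF-16 code units in
   host byte order. Returns the number of 16-bit units written, or
   (size_t)-1 if out is too small or a code point is not encodable. */
static size_t ucs4_to_utf16(const uint32_t *in, size_t n,
                            uint16_t *out, size_t outlen)
{
    size_t used = 0;
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        uint32_t cp = in[i];

        /* Reject values beyond Unicode and lone surrogate code points. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            /* Basic Multilingual Plane: one 16-bit unit. */
            if (used + 1 > outlen)
                return (size_t)-1;
            out[used++] = (uint16_t)cp;
        } else {
            /* Supplementary plane: high and low surrogate. */
            if (used + 2 > outlen)
                return (size_t)-1;
            cp -= 0x10000;
            out[used++] = (uint16_t)(0xD800 + (cp >> 10));
            out[used++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        }
    }
    return used;
}
```

The output is not 0-terminated; a caller that needs a terminator would reserve one extra unit and append 0 after the returned count.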

    --
    Marc-Andre Lemburg
    eGenix.com

     
    M.-A. Lemburg, Mar 23, 2009
    #10
  11. abhi

    abhi Guest

    Thanks, but this returns a PyObject *, whereas I need the value in a
    variable that can be printed using wprintf(), such as a wchar_t
    array with 2-byte elements. If I convert this PyObject back to
    wchar_t or PyUnicode, I am back where I started. :)

    -
    Abhigyan
     
    abhi, Mar 23, 2009
    #11
  12. abhi

    abhi Guest

    On Mar 23, 4:57 pm, abhi <> wrote:
    > Thanks, but this returns a PyObject *, whereas I need the value in a
    > variable that can be printed using wprintf(), such as a wchar_t
    > array with 2-byte elements. If I convert this PyObject back to
    > wchar_t or PyUnicode, I am back where I started. :)
    >
    > -
    > Abhigyan


    Hi Marc,
    Is there any way to ensure that wchar_t is always 2 bytes instead
    of 4 in a UCS4-configured Python? Googling gave me the impression
    that PyUnicode_AsWideChar() contains logic that handles the UCS4 to
    UCS2 conversion when the sizes of Py_UNICODE and wchar_t differ.

    -
    Abhigyan
     
    abhi, Mar 23, 2009
    #12
  13. On 2009-03-23 14:05, abhi wrote:
    > Hi Marc,
    > Is there any way to ensure that wchar_t is always 2 bytes instead
    > of 4 in a UCS4-configured Python? Googling gave me the impression
    > that PyUnicode_AsWideChar() contains logic that handles the UCS4 to
    > UCS2 conversion when the sizes of Py_UNICODE and wchar_t differ.


    wchar_t is defined by your compiler. There's no way to change that.

    However, you can configure Python to use UCS2 (default) or UCS4 (used
    on most Unix platforms), so it's easy to customize for your needs.
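
Since the width of wchar_t is fixed by the compiler, an extension can at least detect it instead of assuming it. A minimal sketch in plain C; the helper names `wchar_width` and `wchar_holds_ucs4` are made up for illustration and are not part of Python or the C library:

```c
#include <stddef.h>
#include <wchar.h>

/* Number of bytes the compiler uses for one wchar_t
 * (typically 4 with glibc, 2 on Windows). */
size_t wchar_width(void)
{
    return sizeof(wchar_t);
}

/* Nonzero when a single wchar_t can hold every code point
 * up to U+10FFFF, i.e. no surrogate pairs are needed. */
int wchar_holds_ucs4(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

Code that hands text to a library expecting 16-bit units could branch on this check and convert only when wchar_t is wider than 2 bytes.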

    --
    Marc-Andre Lemburg
    eGenix.com

     
    M.-A. Lemburg, Mar 23, 2009
    #13
  14. On 2009-03-23 12:57, abhi wrote:
    >>> Is there any way
    >>> by which I can force wchar_t to be 2 bytes, or can I convert this UCS4
    >>> data to UCS2 explicitly?

    >> Sure: just use the appropriate UTF-16 codec for this.
    >>
    >> /* Generic codec based encoding API.
    >>
    >> object is passed through the encoder function found for the given
    >> encoding using the error handling method defined by errors. errors
    >> may be NULL to use the default method defined for the codec.
    >>
    >> Raises a LookupError in case no encoder can be found.
    >>
    >> */
    >>
    >> PyAPI_FUNC(PyObject *) PyCodec_Encode(
    >> PyObject *object,
    >> const char *encoding,
    >> const char *errors
    >> );
    >>
    >> encoding needs to be set to 'utf-16-le' for little endian, 'utf-16-be'
    >> for big endian.

    >
    > Thanks, but this returns a PyObject *, whereas I need the value in a
    > variable that can be printed using wprintf(), such as a wchar_t
    > array with 2-byte elements. If I convert this PyObject back to
    > wchar_t or PyUnicode, I am back where I started. :)


    It will return a PyString object with the UTF-16 data. You can
    use PyString_AS_STRING() to access the data stored by it.

    Note that writing your own UCS2/UCS4 converter isn't all that hard
    either. Just have a look at the code in unicodeobject.c for
    PyUnicode_AsWideChar().
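
A hand-written converter along those lines might look like the sketch below. It is plain C and does not reproduce the code in unicodeobject.c; the function name `ucs4_to_utf16` and its error convention are assumptions made for illustration. It emits UTF-16, so code points above U+FFFF become surrogate pairs (strict UCS-2 cannot represent them at all).

```c
#include <stddef.h>
#include <stdint.h>

/* Convert n UCS-4 code points to UTF-16 code units.
 * Returns the number of 16-bit units written, or (size_t)-1 if
 * `out` is too small or an input value is not a valid code point. */
size_t ucs4_to_utf16(const uint32_t *in, size_t n,
                     uint16_t *out, size_t out_cap)
{
    size_t i, j = 0;

    for (i = 0; i < n; i++) {
        uint32_t cp = in[i];

        /* Reject values outside Unicode and lone surrogates. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            /* BMP character: a single 16-bit unit. */
            if (j + 1 > out_cap)
                return (size_t)-1;
            out[j++] = (uint16_t)cp;
        } else {
            /* Supplementary character: high + low surrogate. */
            if (j + 2 > out_cap)
                return (size_t)-1;
            cp -= 0x10000;
            out[j++] = (uint16_t)(0xD800 | (cp >> 10));
            out[j++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return j;
}
```

For input such as "abc" every code point is below U+10000, so the output has exactly as many units as the input has characters.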

    --
    Marc-Andre Lemburg
    eGenix.com

     
    M.-A. Lemburg, Mar 23, 2009
    #14
  15. > So, both Py_UNICODE and wchar_t are 4 bytes and since it contains 3
    > \0s after a char, printf or wprintf is only printing one letter.


    No. printf indeed will see a terminating character. However, wprintf
    should correctly know that a wchar_t has four bytes per character,
    and print it correctly. Make sure to use %ls to print wchar_t arrays;
    %s would print multi-byte character strings.
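
To illustrate the %ls point, here is a small sketch using swprintf, which takes the same format directives as wprintf but writes into a buffer. It assumes the C99 signature of swprintf (with a size argument); the helper name `format_wide` is made up for this example.

```c
#include <stddef.h>
#include <wchar.h>

/* Format a wide string with %ls into buf (capacity cap wide chars).
 * Returns the number of wide characters written, not counting the
 * terminator, or a negative value on error. */
int format_wide(wchar_t *buf, size_t cap, const wchar_t *text)
{
    return swprintf(buf, cap, L"sample is: %ls", text);
}
```

The whole string is formatted regardless of whether wchar_t is 2 or 4 bytes wide, because the wide functions step through the array one wchar_t at a time. When printing to the terminal with wprintf(L"%ls\n", w), avoid mixing printf and wprintf on the same stream: a C stream becomes byte-oriented or wide-oriented on its first I/O operation, and using the other family afterwards is not reliable.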

    > I need to further process the data and those libraries will need the
    > data in UCS2 format (2 bytes), otherwise they fail.


    Are you absolutely sure about that? Why does that library expect
    UCS-2, when your system's wchar_t is four bytes?

    In any case, do what MAL told you: use the UTF-16 codec to convert
    the Unicode string to a byte string with two bytes per code unit
    (characters outside the BMP become surrogate pairs). The PyObject
    you get from the conversion is a byte string object; use
    PyString_AsStringAndSize to get to the actual bytes.
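
Once the bytes and their length are available, turning them into 16-bit units is straightforward. The sketch below is plain C and assumes the buffer was produced with the 'utf-16-le' encoding (for example, obtained via PyString_AsStringAndSize on the object returned by PyCodec_Encode); the helper name `utf16le_to_units` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a UTF-16-LE byte buffer into 16-bit code units.
 * Returns the number of units written, or (size_t)-1 if nbytes
 * is odd or out cannot hold nbytes / 2 units. */
size_t utf16le_to_units(const char *bytes, size_t nbytes,
                        uint16_t *out, size_t out_cap)
{
    size_t i, n = nbytes / 2;

    if (nbytes % 2 != 0 || n > out_cap)
        return (size_t)-1;

    for (i = 0; i < n; i++) {
        /* Little endian: low byte first, then high byte. */
        unsigned char lo = (unsigned char)bytes[2 * i];
        unsigned char hi = (unsigned char)bytes[2 * i + 1];
        out[i] = (uint16_t)(lo | (hi << 8));
    }
    return n;
}
```

Assembling the units byte by byte avoids alignment and host byte-order assumptions that a direct cast of the char pointer to a 16-bit pointer would introduce.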

    Regards,
    Martin
     
    Martin v. Löwis, Mar 23, 2009
    #15
  16. abhi

    abhi Guest


    Thanks Marc and Martin, my preliminary trials are showing positive
    results with this method.

    -
    Abhigyan
     
    abhi, Mar 25, 2009
    #16