can't get utf8 / unicode strings from embedded python

David M. Cotter

note everything works great if i use Ascii, but:

in my utf8-encoded script i have this:
print "frøânçïé"

in my embedded C++ i have this:

PyObject* CPython_Script::print(PyObject *args)
{
    PyObject   *resultObjP = NULL;
    const char *utf8_strZ  = NULL;

    if (PyArg_ParseTuple(args, "s", &utf8_strZ)) {
        Log(utf8_strZ, false);

        resultObjP = Py_None;
        Py_INCREF(resultObjP);
    }

    return resultObjP;
}

Now, i know that my Log() can print utf8 (has for years, very well debugged)

but what it *actually* prints is this:
print "frøânçïé"
--> fr√âˆâˆšÂ¢nçïé

another method i use looks like this:
kj_commands.menu("控件", "åŒæ­¥æ»‘帧", "全局无滑帧") or
kj_commands.menu(u"控件", u"åŒæ­¥æ»‘帧", u"全局无滑帧")

and in my C++ i have:

SuperString ScPyObject::GetAs_String()
{
    SuperString str;

    if (PyUnicode_Check(i_objP)) {
        #if 1
        // method 1
        {
            ScPyObject utf8Str(PyUnicode_AsUTF8String(i_objP));

            str = utf8Str.GetAs_String();
        }
        #elif 0
        // method 2
        {
            UTF8Char *uniZ = (UTF8Char *)PyUnicode_AS_UNICODE(i_objP);

            str.assign(&uniZ[0], &uniZ[PyUnicode_GET_DATA_SIZE(i_objP)], kCFStringEncodingUTF16);
        }
        #else
        // method 3
        {
            UTF32Vec        charVec(32768); CF_ASSERT(sizeof(UTF32Vec::value_type) == sizeof(wchar_t));
            PyUnicodeObject *uniObjP = (PyUnicodeObject *)(i_objP);
            Py_ssize_t      sizeL(PyUnicode_AsWideChar(uniObjP, (wchar_t *)&charVec[0], charVec.size()));

            charVec.resize(sizeL);
            charVec.push_back(0);
            str.Set(SuperString(&charVec[0]));
        }
        #endif
    } else {
        str.Set(uc(PyString_AsString(i_objP)));
    }

    Log(str.utf8Z());

    return str;
}


for the string, "控件", i get:
--> 控件

for the *unicode* string, u"控件", with Methods 1, 2, and 3, i get the same thing:
--> 控件

okay so what am i doing wrong???
 
Steven D'Aprano

note everything works great if i use Ascii, but:

in my utf8-encoded script i have this:


I see you are using Python 2, in which case there are probably two or
three errors being made here.

Firstly, in Python 2, the compiler assumes that the source code is
encoded in ASCII, actually ASCII plus arbitrary bytes. Since your source
code is *actually* UTF-8, the bytes in the file are:

70 72 69 6E 74 20 22 66 72 C3 B8 C3 A2 6E C3 A7 C3 AF C3 A9 22

But Python doesn't know the file is encoded in UTF-8, it thinks it is
reading ASCII plus junk, so when it reads the file it parses those bytes
into a line of code:

print "~~~~~"

where the ~~~~~ represents those 13 rubbish bytes. So that's
the first problem to fix. You can fix this by adding an encoding cookie
at the beginning of your module, in the first or second line:

# -*- coding: utf-8 -*-


The second problem is that even once you've fixed the source encoding,
you're still not dealing with a proper Unicode string. In Python 2, you
need to use u" ... " delimiters for Unicode, otherwise the results you
get are completely arbitrary and depend on the encoding of your terminal.
For example, if I set my terminal encoding to IBM-850, I get:

fr°Ônþ´Ú

from those bytes. If I set it to Central European ISO-8859-3 I get this:

frÄânçïé

Clearly not what I intended. So change the line of code to:

print u"frøânçïé"

Those two changes ought to fix the problem, but if they don't, try
setting your terminal encoding to UTF-8 as well and see if that helps.
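
To make that concrete, here is a minimal Python 2 sketch of the two fixes together. Nothing here is specific to the embedding; it is just the coding cookie plus a unicode literal, with the byte values shown being simply the UTF-8 encoding of that string:

# -*- coding: utf-8 -*-           # tell the compiler the source bytes are UTF-8
s = u"frøânçïé"                   # a real unicode object, 8 characters
print len(s)                      # 8
print repr(s.encode("utf-8"))     # 'fr\xc3\xb8\xc3\xa2n\xc3\xa7\xc3\xaf\xc3\xa9' -- the 13 bytes above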


[...]
but what it *actually* prints is this:

--> fr√âˆâˆšÂ¢nçïé

It's hard to say what *exactly* is happening here, because you don't
explain how the python print statement somehow gets into your C++ Log
code. Do I guess right that it catches stdout?

If so, then what I expect is happening is that Python has read in the
source code of

print "~~~~~"

with ~~~~~ as a bunch of junk bytes, and then your terminal is displaying
those junk bytes according to whatever encoding it happens to be using.
Since you are seeing this:

fr√âˆâˆšÂ¢nçïé

my guess is that you're using a Mac, and the encoding is set to the
MacRoman encoding. Am I close?

To summarise:

* Add an encoding cookie, to tell Python to use UTF-8 when parsing your
source file.

* Use a Unicode string u"frøânçïé".

* Consider setting your terminal to use UTF-8, otherwise it may not be
able to print all the characters you would like.

* You may need to change the way data gets into your C++ Log function. If
it expects bytes, you may need to use u"...".encode('utf-8') rather than
just u"...". But since I don't understand how data is getting into your
Log function, I can't be sure about this.


I think that is everything. Does that fix your problem?
 
David M. Cotter

I see you are using Python 2
correct
Firstly, in Python 2, the compiler assumes that the source code is encoded in ASCII
gar, i must have been looking at doc for v3, as i thought it was all assumed to be utf8
# -*- coding: utf-8 -*-
okay, did that, still no change
you need to use u" ... " delimiters for Unicode, otherwise the results you get are completely arbitrary and depend on the encoding of your terminal.
okay, well, i'm on a mac, and not using "terminal" at all. but if i were, it would be utf8
but it's still not flying :(
For example, if I set my terminal encoding to IBM-850
okay how do you even do that? this is not an interactive session, this is embedded python, within a C++ app, so there's no terminal.

but that is a good question: all the docs say "default encoding" everywhere (as in "If string is a Unicode object, this function computes the default encoding of string and operates on that"), but fail to specify just HOW i can set the default encoding. if i could just say "hey, default encoding is utf8", i think i'd be done?
So change the line of code to:
print u"frøânçïé"
okay, sure...
but i get the exact same results
Those two changes ought to fix the problem, but if they don't, try setting your terminal encoding to UTF-8 as well
well, i'm not sure what you mean by that. i don't have a terminal here.
i'm logging to a utf8 log file (when i print)

but what it *actually* prints is this:

print "frøânçïé"
--> fr√âˆâˆšÂ¢nçïé
It's hard to say what *exactly* is happening here, because you don't explain how the python print statement somehow gets into your C++ Log code. Do I guess right that it catches stdout?
yes, i'm redirecting stdout to my own custom print class, and then from that function i call into my embedded C++ print function
If so, then what I expect is happening is that Python has read in the source code of
print "~~~~~"
with ~~~~~ as a bunch of junk bytes, and then your terminal is displaying those junk bytes according to whatever encoding it happens to be using.
Since you are seeing this:
fr√âˆâˆšÂ¢nçïé
my guess is that you're using a Mac, and the encoding is set to the MacRoman encoding. Am I close?
you hit the nail on the head there, i think. using that as a hint, i took this text "fr√âˆâˆšÂ¢nçïé" and pasted it into a "macRoman" document, then *reinterpreted* it as UTF8, and voilà: "frøânçïé"

so, it seems that i AM getting my utf8 bytes, but i'm getting them converted to macRoman. huh? where is macRoman specified, and how do i change that to utf8? i think that's the missing golden ticket
 
Dave Angel

David said:
yes, i'm redirecting stdout to my own custom print class, and then from that function i call into my embedded C++ print function

I don't know much about embedding Python, but each file object has an
encoding property.

Why not examine sys.stdout.encoding ? And change it to "UTF-8" ?

print "encoding is", sys.stdout.encoding

sys.stdout.encoding = "UTF-8"
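
For illustration only (this wrapper and the byte_sink name are not from the thread, just a sketch): a replacement stream can also do the encoding itself in write(), rather than relying on an encoding attribute, handing only UTF-8 bytes to whatever byte-oriented function the C++ side exposes:

import sys

class UTF8Stdout(object):
    encoding = "UTF-8"
    def __init__(self, byte_sink):
        self.byte_sink = byte_sink          # any callable that accepts byte strings
    def write(self, stuff):
        if isinstance(stuff, unicode):      # Python 2: encode unicode before handing bytes on
            stuff = stuff.encode("UTF-8")
        self.byte_sink(stuff)

# e.g. sys.stdout = UTF8Stdout(my_embedded_module.custom_print)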
 
random832

okay, well, i'm on a mac, and not using "terminal" at all. but if i
were, it would be utf8
but it's still not flying :(
so, it seems that i AM getting my utf8 bytes, but i'm getting them
converted to macRoman. huh? where is macRoman specified, and how to i
change that to utf8? i think that's the missing golden ticket

You say you're not using terminal. What _are_ you using?
 
David M. Cotter

What _are_ you using?
i have scripts in a file, that i am invoking into my embedded python within a C++ program. there is no terminal involved. the "print" statement has been redirected (via sys.stdout) to my custom print class, which does not specify "encoding", so i tried the suggestion above to set it:

static const char *s_RedirectScript =
    "import " kEmbeddedModuleName "\n"
    "import sys\n"
    "\n"
    "class CustomPrintClass:\n"
    "    def write(self, stuff):\n"
    "        " kEmbeddedModuleName "." kCustomPrint "(stuff)\n"
    "class CustomErrClass:\n"
    "    def write(self, stuff):\n"
    "        " kEmbeddedModuleName "." kCustomErr "(stuff)\n"
    "sys.stdout = CustomPrintClass()\n"
    "sys.stderr = CustomErrClass()\n"
    "sys.stdout.encoding = 'UTF-8'\n"
    "sys.stderr.encoding = 'UTF-8'\n";


but it didn't help.

I'm still getting back a string that is a utf-8 string of characters that, if converted to "macRoman" and then interpreted as UTF8, shows the original, correct string. who is specifying macRoman, and where, and how do i tell whoever that is that i really *really* want utf8?
 
wxjmfauth

On Saturday, August 24, 2013 at 18:47:19 UTC+2, David M. Cotter wrote:
i have scripts in a file, that i am invoking into my embedded python within a C++ program. there is no terminal involved. the "print" statement has been redirected (via sys.stdout) to my custom print class, which does not specify "encoding", so i tried the suggestion above to set it:

[... the s_RedirectScript block quoted above ...]

but it didn't help.

I'm still getting back a string that is a utf-8 string of characters that, if converted to "macRoman" and then interpreted as UTF8, shows the original, correct string. who is specifying macRoman, and where, and how do i tell whoever that is that i really *really* want utf8?

--------

Always encode a "unicode" into the coding of the "system"
which will host it.

Adapting the hosting system to your "unicode" (encoded
unicode) is not a valid solution; it makes no sense.

sys.std***.encodings do nothing. They only give you
information about the coding of the hosting system.

The "system" can be anything, a db, a terminal, a gui, ...

In short, your "writer" should encode your "stuff"
for your "host" in an adequate way. It is up to you to
manage coherence. If your passive "writer" supports only one
coding, adapt "stuff"; if "stuff" lives in its own coding
(due to C++?), adapt your "writer".



Example from my interactive interpreter. It is in Python 3,
not important, basically the job is the same in Python 2.
This interpreter has the capability to support many codings,
and the coding of this host system can be changed on the
fly.

A commented session.

By default, a string (type str) is unicode. The host
accepts "unicode", so by default sys.stdout handles it directly.

Setting the host to utf-8 and printing the above string gives
"something", but encoding into utf-8 works fine.
'frøânçïé'

Setting the host to 'mac-roman' works fine too,
as long it is properly encoded!
'frøânçïé'

But
'fr√âˆâˆšÂ¢nçïé'

Ditto for cp850
'frøânçïé'

If the repertoire of characters of a coding scheme does not
contain the characters -> replace
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
File "c:\python32\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\xf8' in position 2: character maps to <undefined>
'fr?ânçïé'


Curiosities
'frøânçïé'


jmf
 
Benjamin Kaplan

i have scripts in a file, that i am invoking into my embedded python within a C++ program. there is no terminal involved. the "print" statement has been redirected (via sys.stdout) to my custom print class, which does not specify "encoding", so i tried the suggestion above to set it:

[... the s_RedirectScript block quoted above ...]


but it didn't help.

I'm still getting back a string that is a utf-8 string of characters that, if converted to "macRoman" and then interpreted as UTF8, shows the original, correct string. who is specifying macRoman, and where, and how do i tell whoever that is that i really *really* want utf8?
--

If you're running this from a C++ program, then you aren't getting
back characters. You're getting back bytes. If you treat them as
UTF-8, they'll work properly. The only thing wrong is the text editor
you're using to open the file afterwards: since you aren't specifying
an encoding, it's assuming MacRoman. You can try putting the UTF-8 BOM
(it's not really a BOM) at the front of the file: the bytes 0xEF 0xBB
0xBF are used by some editors to identify a file as UTF-8.
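
One way to check what bytes actually arrive, purely as a sketch (assuming the source file really is UTF-8 and carries the coding cookie):

# -*- coding: utf-8 -*-
data = "frøânçïé"                  # a byte string literal: with the cookie, these are the raw UTF-8 bytes
print repr(data)                   # 'fr\xc3\xb8\xc3\xa2n\xc3\xa7\xc3\xaf\xc3\xa9'
print repr(data.decode("utf-8"))   # u'fr\xf8\xe2n\xe7\xef\xe9' -- decodes cleanly, so the bytes are UTF-8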
 
random832

i have scripts in a file, that i am invoking into my embedded python
within a C++ program. there is no terminal involved. the "print"
statement has been redirected (via sys.stdout) to my custom print class,
which does not specify "encoding", so i tried the suggestion above to set
it:

That doesn't answer my real question. What does your "custom print
class" do with the text?
 
David M. Cotter

i'm sorry this is so confusing, let me try to re-state the problem in as clear a way as i can.

I have a C++ program, with very well tested unicode support. All logging is done in utf8. I have conversion routines that work flawlessly, so i can assure you there is nothing wrong with logging and unicode support in the underlying program.

I am embedding python 2.7 into the program, and extending python with routines in my C++ program.

I have a script, encoded in utf8, and *marked* as utf8 with this line:
# -*- coding: utf-8 -*-

In that script, i have inline unicode text. When I pass that text to my C++ program, the Python interpreter decides that these bytes are macRoman, and handily "converts" them to unicode. To compensate, i must "convert" these "macRoman" characters encoded as utf8, back to macRoman, then "interpret" them as utf8. In this way i can recover the original unicode.

When i return a unicode string back to python, i must do the reverse so that Python gets back what it expects.

This is not related to printing, or sys.stdout, it does happen with that too but focusing on that is a red-herring. Let's focus on just passing a string into C++ then back out.

This would all actually make sense IF my script was marked as being "macRoman" even tho i entered UTF8 Characters, but that is not the case.

Let's prove my statements. Here is the script, *interpreted* as MacRoman:
http://karaoke.kjams.com/screenshots/bugs/python_unicode/script_as_macroman.png

and here it is again *interpreted* as utf8:
http://karaoke.kjams.com/screenshots/bugs/python_unicode/script_as_utf8.png

here is the string conversion code:

SuperString ScPyObject::GetAs_String()
{
    SuperString str;    // underlying format of SuperString is unicode

    if (PyUnicode_Check(i_objP)) {
        ScPyObject utf8Str(PyUnicode_AsUTF8String(i_objP));

        str = utf8Str.GetAs_String();
    } else {
        const UTF8Char *bytes_to_interpetZ = uc(PyString_AsString(i_objP));

        // the "Set" call *interprets*, does not *convert*
        str.Set(bytes_to_interpetZ, kCFStringEncodingUTF8);

        // str is now unicode characters which *represent* macRoman characters,
        // so *convert* these to actual macRoman

        // fyi: Update_utf8 means "convert to this encoding and
        // store the resulting bytes in the variable named 'utf8'"
        str.Update_utf8(kCFStringEncodingMacRoman);

        // str is now unicode characters converted from macRoman,
        // so *reinterpret* them as UTF8

        // FYI, we're just taking the pure bytes that are stored in the utf8 variable
        // and *interpreting* them to this encoding
        bytes_to_interpetZ = str.utf8().c_str();

        str.Set(bytes_to_interpetZ, kCFStringEncodingUTF8);
    }

    return str;
}

PyObject* PyString_FromString(const SuperString& str)
{
    SuperString localStr(str);

    // localStr is the real, actual unicode string,
    // but we must *interpret* it as macRoman, then take these "macRoman" characters
    // and "convert" them to unicode for Python to "get it"
    const UTF8Char *bytes_to_interpetZ = localStr.utf8().c_str();

    // take the utf8 bytes (the actual utf8 representation of the string)
    // and say "no, these bytes are macRoman"
    localStr.Set(bytes_to_interpetZ, kCFStringEncodingMacRoman);

    // okay so now we have unicode of MacRoman characters (!?)
    // return the underlying utf8 bytes of THAT as our string
    return PyString_FromString(localStr.utf8Z());
}

And here is the results from running the script:
18: ---------------
18: Original string: frøânçïé
18: converting...
18: it worked: frøânçïé
18: ---------------
18: ---------------
18: Original string: 控件
18: converting...
18: it worked: 控件
18: ---------------

Now the thing that absolutely utterly baffles me (if i'm not baffled enough) is that i get the EXACT same results on both Mac and Windows. Why do they both insist on interpreting my script's bytes as MacRoman?
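
(For reference, the round trip described above can be reproduced in plain Python 2, independent of the C++ layer; this is only a sketch using Python's own codec names, 'mac_roman' and 'utf-8':)

utf8_bytes = 'fr\xc3\xb8\xc3\xa2n\xc3\xa7\xc3\xaf\xc3\xa9'   # the UTF-8 encoding of u"frøânçïé"

mojibake = utf8_bytes.decode('mac_roman')        # what a MacRoman-assuming loader produces: the wrong characters
print repr(mojibake.encode('utf-8'))             # the doubly-encoded bytes that end up in the log

recovered = mojibake.encode('mac_roman').decode('utf-8')   # the "compensating" step: undo the wrong decode
print recovered == u"frøânçïé"                   # True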
 
Vlastimil Brom

2013/8/25 David M. Cotter said:
i'm sorry this is so confusing, let me try to re-state the problem in as clear a way as i can.

[... full restatement, code, and results quoted from the message above ...]

Now the thing that absolutely utterly baffles me (if i'm not baffled enough) is that i get the EXACT same results on both Mac and Windows. Why do they both insist on interpreting my script's bytes as MacRoman?

Hi,
unfortunately, I don't have experience with embedding Python in C++,
but the Python (for Python 2) part seems to be missing the u prefix in
the unicode literals, like
u"frøânçïé"
Is the C++ part prepared for a Python unicode object, or does it require
a utf-8 encoded string (or the respective bytes)?
Would
oldstr.encode("utf-8")
in the call make a difference?

regards,
vbr
 
Terry Reedy

i'm sorry this is so confusing, let me try to re-state the problem in as clear a way as i can.

I have a C++ program, with very well tested unicode support. All logging is done in utf8. I have conversion routines that work flawlessly, so i can assure you there is nothing wrong with logging and unicode support in the underlying program.
I am embedding python 2.7 into the program, and extending python with routines in my C++ program.

If you want 'well-tested' (correct) unicode support from Python, use
3.3. Unicode in 2.x is somewhat buggy and definitely flakey. The first
fix was to make unicode *the* text type, in 3.0. The second was to
redesign the internals in 3.3. It is possible that 2.7 is too broken for
what you want to do.
I have a script, encoded in utf8, and *marked* as utf8 with this line:
# -*- coding: utf-8 -*-

In that script, i have inline unicode text.

The example scripts that you posted pictures of do *not* have unicode
text. They have bytestring literals with (encoded) non-ascii chars
inside them. This is not a great idea. I am not sure what bytes you end
up with. Apparently, not what you expect.

To make them 'unicode text', you must prepend the literals with 'u'.
Didn't someone say this before?
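
A quick sketch of the difference, with the utf-8 coding cookie in effect (the len values are simply byte count versus character count):

# -*- coding: utf-8 -*-
s = "控件"       # byte string: 6 UTF-8 bytes, '\xe6\x8e\xa7\xe4\xbb\xb6'
u = u"控件"      # unicode string: 2 characters, u'\u63a7\u4ef6'
print len(s), len(u)    # 6 2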
When I pass that text to my C++ program, the Python interpreter decides that these bytes are macRoman, and handily "converts" them to unicode. To compensate, i must "convert" these "macRoman" characters encoded as utf8, back to macRoman, then "interpret" them as utf8. In this way i can recover the original unicode.

When i return a unicode string back to python, i must do the reverse so that Python gets back what it expects.

This is not related to printing, or sys.stdout, it does happen with that too but focusing on that is a red-herring. Let's focus on just passing a string into C++ then back out.

This would all actually make sense IF my script was marked as being "macRoman" even tho i entered UTF8 Characters, but that is not the case.

Let's prove my statements. Here is the script, *interpreted* as MacRoman:
http://karaoke.kjams.com/screenshots/bugs/python_unicode/script_as_macroman.png

Why are you posting pictures of code, instead of the (runnable) code
itself, as you did with C code?
 
David M. Cotter

fair enough. I can provide further proof of strangeness.
here is my latest script: this is saved on disk as a UTF8 encoded file, and when viewing as UTF8, it shows the correct characters.

==================
# -*- coding: utf-8 -*-
import time, kjams, kjams_lib

def log_success(msg, successB, str):
    if successB:
        print msg + " worked: " + str
    else:
        print msg + " failed: " + str

def do_test(orig_str):
    cmd_enum = kjams.enum_cmds()

    print "---------------"
    print "Original string: " + orig_str
    print "converting..."

    oldstr = orig_str;
    newstr = kjams_lib.do_command(cmd_enum.kScriptCommand_Unicode_Test, oldstr)
    log_success("first", oldstr == newstr, newstr);

    oldstr = unicode(orig_str, "UTF-8")
    newstr = kjams_lib.do_command(cmd_enum.kScriptCommand_Unicode_Test, oldstr)
    newstr = unicode(newstr, "UTF-8")
    log_success("second", oldstr == newstr, newstr);

    oldstr = unicode(orig_str, "UTF-8")
    oldstr.encode("UTF-8")
    newstr = kjams_lib.do_command(cmd_enum.kScriptCommand_Unicode_Test, oldstr)
    newstr = unicode(newstr, "UTF-8")
    log_success("third", oldstr == newstr, newstr);

    print "---------------"

def main():
    do_test("frøânçïé")
    do_test("控件")

#-----------------------------------------------------
if __name__ == "__main__":
    main()

==================
and the latest results:

20: ---------------
20: Original string: frøânçïé
20: converting...
20: first worked: frøânçïé
20: second worked: frøânçïé
20: third worked: frøânçïé
20: ---------------
20: ---------------
20: Original string: 控件
20: converting...
20: first worked: 控件
20: second worked: 控件
20: third worked: 控件
20: ---------------

now, given the C++ source code, this should NOT work, given that i'm doing some crazy re-coding of the bytes.

so, you see, it does not matter whether i pass "unicode" strings or regular "strings", they all translate to the same, weird macroman.

for completeness, here is the C++ code that the script calls:

===================
case kScriptCommand_Unicode_Test: {
    pyArg = iterP.NextArg_OrSyntaxError();

    if (pyArg.get()) {
        SuperString str = pyArg.GetAs_String();

        resultObjP = PyString_FromString(str);
    }
    break;
}

===================
 
David M. Cotter

i got it!! OMG! so sorry for the confusion, but i learned a lot, and i can share the result:

the CORRECT code *was* what i had assumed. the Python side has always been correct (no need to put "u" in front of strings, it is known that the bytes are utf8 bytes)

it was my "run script" function which read in the file. THAT was what was "reinterpreting" the utf8 bytes as macRoman (on both platforms). correct code below:

SuperString ScPyObject::GetAs_String()
{
    SuperString str;

    if (PyUnicode_Check(i_objP)) {
        ScPyObject utf8Str(PyUnicode_AsUTF8String(i_objP));

        str = utf8Str.GetAs_String();
    } else {
        // calling "uc" on this means "assume this is utf8"
        str.Set(uc(PyString_AsString(i_objP)));
    }

    return str;
}

PyObject* PyString_FromString(const SuperString& str)
{
    return PyString_FromString(str.utf8Z());
}
 
MRAB

i got it!! OMG! so sorry for the confusion, but i learned a lot,
and i can share the result:

the CORRECT code *was* what i had assumed. the Python side has
always been correct (no need to put "u" in front of strings, it is
known that the bytes are utf8 bytes)

it was my "run script" function which read in the file. THAT was
what was "reinterpreting" the utf8 bytes as macRoman (on both
platforms). correct code below:
When working with Unicode, what you should be doing is:

1. Specifying the encoding line in the special comment.

2. Setting the encoding of the source file.

3. Using Unicode string literals in the source file.

You're doing (1) and (2), but not (3).

If you want to pass UTF-8 to the C++, then encode the Unicode
string to bytes when you pass it. Using bytestring literals and relying
on the source file being UTF-8, like you're doing, is just asking for
trouble, as you've found out! :)
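
Concretely, in the thread's own test script that would look something like this (a sketch, reusing kjams_lib.do_command and the cmd_enum from do_test in the script posted earlier, and assuming the C++ side hands back UTF-8 bytes):

oldstr = u"控件"                 # a real unicode literal
newstr = kjams_lib.do_command(cmd_enum.kScriptCommand_Unicode_Test, oldstr.encode("utf-8"))
newstr = newstr.decode("utf-8")  # back to unicode on the Python side
print oldstr == newstr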
 
David M. Cotter

i am already doing (3), and all is working perfectly. bytestring literals are fine, i'm not sure what this trouble is that you speak of.

note that i'm not using PyRun_AnyFile(), i'm loading the script myself, assumed as utf8 (which was my original problem, i had assumed it was macRoman), then calling PyRun_SimpleString(). it works flawlessly now, on both mac and windows.
 
Steven D'Aprano

i am already doing (3), and all is working perfectly. bytestring
literals are fine, i'm not sure what this trouble is that you speak of.

Neither is anyone else, because your post is completely devoid of any
context. Who are you talking to?

Wait, let me see if I can peer into my crystal ball and see if the
spirits tell me what you are talking about... I see a post... no,
repeated posts, by many people, telling you not to embed Unicode
characters in Python 2.x plain byte strings...

You know what? You obviously know so much more about Unicode and Python
than the entire Python community, you must be right. There is no possible
way that misusing byte strings in this manner could possibly go wrong.
Since byte strings literals containing Unicode data are "fine", it was
clearly a complete waste of time to introduce Unicode strings in the
first place.

Why bother using the official interface designed to work correctly with
Unicode, when you can rely on an accident of implementation that just
happens to work correctly in your environment but no guarantee it will
work correctly anywhere else? What could *possibly* go wrong by relying
on code working by accident like this?
 
David M. Cotter

I am very sorry that I have offended you to such a degree you feel it necessary to publicly eviscerate me.

Perhaps I could have worded it like this: "So far I have not seen any troubles including unicode characters in my strings, they *seem* to be fine for my use-case. What kind of trouble has been seen with this by others?"

Really, I wonder why you are so angry at me for having made a mistake? I'm going to guess that you don't have kids.
 
Steven D'Aprano

I am very sorry that I have offended you to such a degree you feel it
necessary to publicly eviscerate me.

You know David, you are right. I did over-react. And I apologise for
that. I am sorry, I was excessively confrontational. (Although I think
"eviscerate" is a bit strong.)

Putting aside my earlier sarcasm, the basic message remains the same:
Python byte strings are not designed to work with Unicode characters, and
if they do work, it is an accident, not defined behaviour.

Perhaps I could have worded it like this: "So far I have not seen any
troubles including unicode characters in my strings, they *seem* to be
fine for my use-case. What kind of trouble has been seen with this by
others?"

Exactly the same sort of trouble you were having earlier when you were
inadvertently decoding the source file as MacRoman rather than UTF-8.
Mojibake, garbage characters in your text, corrupted data.

http://en.wikipedia.org/wiki/Mojibake


The point is, you might not see these errors, because by accident all the
relevant factors conspire to give you the correct result. You might test
it on a Mac and on Windows and it all works well. You might even test it
on a dozen different machines, and it works fine on all of them. But
since you're relying on an accident of implementation, none of this is
guaranteed. And then in eighteen months time, *something* changes -- a
minor update to Python, a different version of Mac OS/X, an unusual
Registry setting in Windows, who knows what?, and all of a sudden the
factors no longer line up to give you the correct results and it all
comes tumbling down in a big stinking mess. If you are lucky you will get
a nice clear exception telling you something is broken, but more likely
you'll just get corrupted data and mojibake and you, or the poor guy who
maintains the code after you, will have no idea why. And you'll probably
come here asking for our help to solve it.

If you came back and said "I tried it with the u prefix, and it broke a
bunch of other code, and I don't have time to fix it now so I'm reverting
to the u-less byte string form" I wouldn't *like* it but I could *accept*
it as one of those sub-optimal compromises people make in Real Life. I've
done the same thing myself, we probably all have: written code we knew
was broken, but fixing it was too hard or too low a priority.

Really, I wonder why you are so angry at me for having made a mistake?
I'm going to guess that you don't have kids.

What do kids have to do with this? Are you an adult or a child? *wink*

You didn't offend me so much as frustrate me. You had multiple people
telling you the same thing, don't embed Unicode characters in a byte
string, but you choose to not just ignore them but effectively declare
that they were all wrong to give that advice, not just the people here
but essentially the entire Python development community responsible for
adding Unicode strings to the language. Can you blame me for feeling that
your reply seemed rather arrogant?

In any case, I'm glad you responded with a little more restraint than I
did, and I hope you can see my point of view and hopefully I haven't
soured you on this forum.
 
David M. Cotter

Thank you for your thoughtful and thorough response. I now understand much better what you (and apparently the others) were warning me against and I will certainly consider that moving forward.

I very much appreciate your help as I learn about python and embedding and all these crazy encoding problems.
What do kids have to do with this?
When a person has children, they quickly learn that the best way to deal with someone who seems to be not listening or having a tantrum is to show understanding and compassion, restraint and patience, as you, in the most neutral way that you can, gently but firmly guide said person back on track. You learn that if you instead express your frustration at said person, it never, ever helps the situation, and only causes more hurt to be spread around to the very people you are ostensibly attempting to help.
Are you an adult or a child?
Perhaps my comment was lost in translation, but this is rather the question that I was obliquely asking you. *wink right back*

In any case I thank you for your help, which has in fact been quite great! My demo script is working, and I know now to properly advise my script writers regarding how to properly encode strings.
 
