treating str as unicode in legacy code?

Ben

I'm left with some legacy code using plain old str, and I need to make
sure it works with unicode input/output. I have a simple plan to do
this:

- Run the code with "python -U" so all the string literals become
unicode literals.
- Add this statement

str = unicode

to all .py files so the type comparison (e.g., type('123') == str)
would work.


Did I miss anything? Does this sound like a workable plan?

Thanks!
 
Steve Holden

Ben said:
I'm left with some legacy code using plain old str, and I need to make
sure it works with unicode input/output. I have a simple plan to do
this:

- Run the code with "python -U" so all the string literals become
unicode literals.
- Add this statement

str = unicode

to all .py files so the type comparison (e.g., type('123') == str)
would work.


Did I miss anything? Does this sound like a workable plan?

Thanks!

Well, don't forget that the assignment to str *shadows* the built-in
rather than replacing it, so there may be places (imported modules being
the example that most readily springs to mind) where that replacement
won't be effective.
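
The shadowing point can be sketched roughly as follows. This is only an illustration under stated assumptions: `Text` is a hypothetical stand-in for Python 2's `unicode` (so the sketch runs on any Python version), and `mod_a` and `mod_b` are throwaway in-memory modules, not anything from Ben's code. Only the module that rebinds `str` sees the new name; the other one still resolves `str` to the built-in.

```python
import types

# Hypothetical stand-in for Python 2's `unicode`, so the sketch runs anywhere.
class Text(str):
    pass

# The same function body is loaded into two separate module namespaces.
SOURCE = "def make(x):\n    return str(x)\n"

# mod_a rebinds the name `str` in its own globals, as in the plan above.
mod_a = types.ModuleType("mod_a")
mod_a.str = Text
exec(SOURCE, mod_a.__dict__)

# mod_b plays the role of an imported module that was not edited.
mod_b = types.ModuleType("mod_b")
exec(SOURCE, mod_b.__dict__)

print(type(mod_a.make(123)).__name__)  # Text
print(type(mod_b.make(123)).__name__)  # str
```

The rebinding lives in one module's global namespace, so every module that should see it has to contain the assignment itself.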

Also, in CPython the C parts of the code may well be creating and
expecting objects of type str, and they don't go through Python's name
lookup at all, so you will have no way to change those behaviors.

This will probably account for about 95% of any strangeness you see, but
it's probably a good first step in the conversion process.

regards
Steve
 
John Machin

Ben said:
I'm left with some legacy code using plain old str, and I need to make
sure it works with unicode input/output. I have a simple plan to do
this:

- Run the code with "python -U" so all the string literals become
unicode literals.

Requiring that the code is always run with a non-default argument
doesn't seem very robust/portable to me.

Ben said:
- Add this statement

str = unicode

to all .py files so the type comparison (e.g., type('123') == str)
would work.

IMVHO (1) doing that merely changes "legacy code" to "kludged legacy
code" (2) there is no substitute for reading the code and trying to
nut out what it is doing.

Do you mean that those two things are the ONLY changes you plan to
make?

Ben said:
Did I miss anything? Does this sound like a workable plan?

Do you need to make sure it still works with ASCII input? With input
in some other encoding e.g. cp1252?

What do you mean by "unicode input"? Bear in mind that if you want to
work with Python unicode objects internally, input from a file /
socket / whatever will need to be decoded i.e. you will have to read
the code and make appropriate changes. Data stored in (say) utf_16_le
encoding is not "unicode" in the sense that you need; it still has to
be decoded.
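
A minimal sketch of the decoding step, assuming the bytes really are UTF-16-LE. The sample text is made up for illustration; in real code the bytes would come from the file or socket.

```python
# Bytes as they might arrive from a file or socket, here encoded as UTF-16-LE.
# The sample text is an assumption made up for this sketch.
raw = u"caf\xe9".encode("utf_16_le")

# The bytes are not text yet: four characters occupy eight bytes.
print(len(raw))  # 8

# Decoding with the right codec yields the text object the program works on.
text = raw.decode("utf_16_le")
print(len(text))  # 4
print(text == u"caf\xe9")  # True
```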

What do you mean by "unicode output"? You are going to need to encode
your output.

This doesn't work; the output is not "unicode" in any meaningful
sense:

### Warning: you need to hope that all builtins etc that you are
calling cope with unicode arguments as well as the above one does.

'abcde\r\n'

This doesn't work; it crashes.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 5: ordinal not in range(128)
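
The statement that produced this traceback is missing from the post as archived. The sketch below is a reconstruction of one kind of call that fails the same way, not the original: the sample string is an assumption, chosen so that the offending character sits at position 5. It also shows that picking an explicit encoding able to represent the character avoids the error.

```python
text = u"abcde\xff"  # hypothetical sample; U+00FF sits at position 5

# Encoding to ASCII fails, because U+00FF has no ASCII representation.
try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc.start)  # 5

# Choosing an encoding that covers the character makes the output well defined.
encoded = text.encode("utf-8")
print(repr(encoded))
```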

Some object methods work differently with unicode; e.g.

(1) str.translate and unicode.translate.

(2) split():
'abc\xA0def'.split()
['abc\xa0def']
u'abc\xA0def'.split()
[u'abc', u'def']
NameError: name 'isspace' is not defined
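
Both differences can be reproduced with a short sketch. It is written in Python 3 spelling so that it runs on a current interpreter: `bytes` plays the role of Python 2's `str` and `str` the role of Python 2's `unicode`, and `bytes.maketrans` corresponds to Python 2's `string.maketrans`. The sample strings and tables are assumptions for illustration.

```python
# Python 3 spelling: bytes plays the role of Python 2's str, and str the role
# of Python 2's unicode.
byte_s = b"abc\xa0def"
text_s = u"abc\xa0def"

# Byte strings split only on ASCII whitespace; text also splits on U+00A0.
print(byte_s.split())  # [b'abc\xa0def']
print(text_s.split())  # ['abc', 'def']

# translate() takes a 256-byte table for bytes, but a mapping of code points
# for text, so a table built for one type cannot be reused for the other.
byte_table = bytes.maketrans(b"a", b"A")
text_table = {ord(u"a"): u"A"}
print(b"abc".translate(byte_table))  # b'Abc'
print(u"abc".translate(text_table))  # Abc
```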
HTH,
John
 
