doctests compatibility for python 2 & python 3

Robin Becker

I have some problems making doctests for Python 2 code compatible with
Python 3. As part of our approach we are converting the code to use unicode
internally, so we allow either byte strings or unicode as inputs, but we try
to produce unicode outputs.

That makes doctests quite hard as

def func(a):
    """
    >>> func(u'aaa')
    'aaa'
    """
    return a

fails in python2 whilst

def func(a):
    """
    >>> func(u'aaa')
    u'aaa'
    """
    return a

fails in python3. The only workaround I have found is changing the tests so
they look like

    """
    >>> func(u'aaa') == u'aaa'
    True
    """

which makes the tests less useful: if a test fails I don't see the actual and
expected values, I only see "expected True, got False".

Is there an easy way to make these kinds of tests work in python 2 & 3?
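To illustrate the complaint about equality-style tests, here is a minimal sketch (run under Python 3) that captures doctest's failure report for such an example. The helper name `failure_report` and the deliberately wrong input `u'aab'` are made up for the illustration; the report shows only `True` and `False`, never the value that `func` actually returned.

```python
import doctest


def func(a):
    return a


# Hypothetical failing example written in the "compare and expect True" style.
EQUALITY_STYLE = """
>>> func(u'aab') == u'aaa'
True
"""


def failure_report(docstring):
    """Run the examples in docstring and return doctest's report as text."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(docstring, {"func": func}, "example", None, 0)
    chunks = []
    runner = doctest.DocTestRunner(verbose=False)
    # Failure messages are passed to chunks.append instead of stdout.
    runner.run(test, out=chunks.append)
    return "".join(chunks)


report = failure_report(EQUALITY_STYLE)

if __name__ == "__main__":
    print(report)
```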
 
Steven D'Aprano

> I have some problems making some doctests for python2 code compatible
> with python3. The problem is that as part of our approach we are
> converting the code to use unicode internally. So we allow either byte
> strings or unicode in inputs, but we are trying to convert to unicode
> outputs.

Alas, I think you've run into one of the weaknesses of doctest. Don't get
me wrong, I am a huge fan of doctest, but it is hard to write polyglot
string tests with it, as you have discovered.

However, you may be able to get 95% of the way by using print.

def func(a):
    """
    >>> print(func(u'aaa'))
    aaa
    """
    return a

ought to behave identically in both Python 2 and Python 3.3, provided you
only print one object at a time. This ought to work with both ASCII and
non-ASCII (at least in the BMP).
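As a rough sketch of the print-based approach, the script below runs such a doctest programmatically and reports how many examples failed. The helper `run_examples` is an assumption added for the illustration, not part of the original code; the `u''` prefix is accepted by Python 2 and by Python 3.3 or later.

```python
import doctest


def func(a):
    """Return the argument unchanged.

    >>> print(func(u'aaa'))
    aaa
    """
    return a


def run_examples():
    """Run func's doctest and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(func.__doc__, {"func": func}, "func", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted


if __name__ == "__main__":
    # print() writes the text itself, so no quotes or u prefix appear
    # in the output that doctest compares against the expected line.
    print(run_examples())
```

Because `print` shows the text rather than its repr, the expected line contains neither quotes nor a `u` prefix, which is what differs between the two Python versions.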
 
Robin Becker

> def func(a):
>     """
>     >>> print(func(u'aaa'))
>     aaa
>     """
>     return a

I think this approach seems to work if I turn the docstring into unicode

def func(a):
    u"""
    >>> print(func(u'aaa\u020b'))
    aaa\u020b
    """
    return a

def _doctest():
    import doctest
    doctest.testmod()

if __name__ == "__main__":
    _doctest()

If I leave the u off the docstring it goes wrong in python 2.7. I also tried to
put an encoding declaration on the file and use the actual UTF-8 characters, i.e.

# -*- coding: utf-8 -*-
def func(a):
    """
    >>> print(func(u'aaa\u020b'))
    aaaȋ
    """
    return a

def _doctest():
    import doctest
    doctest.testmod()

and that works in python3, but fails in python 2 with this
 
Steven D'Aprano

> I think this approach seems to work if I turn the docstring into unicode
>
> def func(a):
>     u"""
>     >>> print(func(u'aaa\u020b'))
>     aaa\u020b
>     """
>     return a

Good catch! Without the u-prefix, the \u... is not interpreted as an
escape sequence, but as a literal backslash-u.
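A small sketch of the difference, written for Python 3, where every `str` literal is unicode. A raw string stands in here for a Python 2 byte-string docstring without the `u` prefix; that substitution is an assumption made so the example runs on Python 3, and `describe` is a made-up helper.

```python
# With the escape interpreted, \u020b is a single character.
interpreted = u'aaa\u020b'

# A raw string keeps the backslash, the 'u' and the four hex digits as six
# separate characters, which mimics a Python 2 byte string without the u prefix.
uninterpreted = r'aaa\u020b'


def describe(text):
    """Return the length of text and the code point of each character."""
    return len(text), [hex(ord(ch)) for ch in text]


if __name__ == "__main__":
    print(describe(interpreted))
    print(describe(uninterpreted))
```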

> If I leave the u off the docstring it goes wrong in python 2.7. I also
> tried to put an encoding onto the file and use the actual utf8
> characters ie
>
> # -*- coding: utf-8 -*-
> def func(a):
>     """
>     >>> print(func(u'aaa\u020b'))
>     aaaȋ
>     """
>     return a

There seems to be some mojibake in your post, which confuses issues.

You refer to \u020b, which is LATIN SMALL LETTER I WITH INVERTED BREVE.
At least, that's what it ought to be. But in your post, it shows up as
the two character mojibake, ╚ followed by ï (BOX DRAWINGS DOUBLE UP AND
RIGHT followed by LATIN SMALL LETTER I WITH DIAERESIS). It appears that
your posting software somehow got confused and inserted the two
characters which you would have got using cp-437 while claiming that they
are UTF-8. (Your post is correctly labelled as UTF-8.)

I'm confident that the problem isn't with my newsreader, Pan, because it
is pretty damn good at getting encodings right, but also because your
post shows the same mojibake in the email archive:

https://mail.python.org/pipermail/python-list/2014-January/664771.html

To clarify: you tried to show \u020B as a literal. As a literal, it ought
to be the single character ȋ which is a lower case I with curved accent on
top. The UTF-8 of that character is b'\xc8\x8b', which in the cp-437 code
page is two characters ╚ ï.

py> '\u020b'.encode('utf8').decode('cp437')
'╚ï'

Hence, mojibake.
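The same round trip can be checked with the standard `unicodedata` module, which gives the character names quoted above. This is only a sketch for Python 3; the variable names and the `names` helper are made up for the illustration.

```python
import unicodedata

intended = u'\u020b'                    # the character that was meant
utf8_bytes = intended.encode('utf-8')   # its two-byte UTF-8 encoding
mojibake = utf8_bytes.decode('cp437')   # the same bytes misread as cp437


def names(text):
    """Return the Unicode name of each character in text."""
    return [unicodedata.name(ch) for ch in text]


# Reversing the wrong decode recovers the original character.
recovered = mojibake.encode('cp437').decode('utf-8')

if __name__ == "__main__":
    print(names(intended))
    print(names(mojibake))
    print(recovered == intended)
```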

> def _doctest():
>     import doctest
>     doctest.testmod()
>
> and that works in python3, but fails in python 2 with this

I cannot replicate this specific exception. I think it may be a side-
effect of you being on Windows. (I'm on Linux, and everything is UTF-8.)

The difficulty here is that it is damn near impossible to sort out which,
if any, bits are mojibake inserted by your posting software, which by
your editor, your terminal, which by Python, and which are artifacts of
the doctest system.

The usual way to debug these sorts of errors is to stick a call to repr()
just before the print.

print(repr(func(u'aaa\u020b')))
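A sketch of that debugging step. On Python 2, `repr()` of a unicode string already escapes non-ASCII characters; on Python 3, `repr()` leaves printable non-ASCII characters alone, so this example uses the built-in `ascii()` to get the same escaped view. The `show` helper is a made-up name.

```python
def func(a):
    return a


def show(value):
    """Return an ASCII-only representation that exposes every code point."""
    # ascii() is Python 3 only; on Python 2 use repr() for the same effect.
    return ascii(value)


if __name__ == "__main__":
    # The intended string: one non-ASCII character at the end.
    print(show(func(u'aaa\u020b')))
    # The mojibake version: two unrelated characters instead of one.
    print(show(func(u'aaa\u255a\xef')))
```

Because the output contains only ASCII, it survives terminals, editors and mail software unchanged, which makes it possible to tell where the wrong characters were introduced.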
 
Robin Becker

On 17/01/2014 15:27, Steven D'Aprano wrote:
...........
> There seems to be some mojibake in your post, which confuses issues.
>
> You refer to \u020b, which is LATIN SMALL LETTER I WITH INVERTED BREVE.
> At least, that's what it ought to be. But in your post, it shows up as
> the two character mojibake, ╚ followed by ï (BOX DRAWINGS DOUBLE UP AND
> RIGHT followed by LATIN SMALL LETTER I WITH DIAERESIS). It appears that
> your posting software somehow got confused and inserted the two
> characters which you would have got using cp-437 while claiming that they
> are UTF-8. (Your post is correctly labelled as UTF-8.)
>
> I'm confident that the problem isn't with my newsreader, Pan, because it
> is pretty damn good at getting encodings right, but also because your
> post shows the same mojibake in the email archive:
>
> https://mail.python.org/pipermail/python-list/2014-January/664771.html
>
> To clarify: you tried to show \u020B as a literal. As a literal, it ought
> to be the single character ȋ which is a lower case I with curved accent on
> top. The UTF-8 of that character is b'\xc8\x8b', which in the cp-437 code
> page is two characters ╚ ï.

When I edit the file in vim with utf8 encoding I do see your ȋ as the literal.
However, as you note, I'm on Windows and no amount of cajoling will get it to
work reasonably, so my printouts are broken. So on Windows

(py27) C:\code\hg-repos>python -c"print(u'aaa\u020b')"
aaaȋ

on my linux

$ python2 -c"print(u'aaa\u020b')"
aaaȋ

$ python2 tdt1.py
/usr/lib/python2.7/doctest.py:1531: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if got == want:
/usr/lib/python2.7/doctest.py:1551: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if got == want:
**********************************************************************
File "tdt1.py", line 4, in __main__.func
Failed example:
    print(func(u'aaa\u020b'))
Expected:
    aaaȋ
Got:
    aaaȋ
**********************************************************************
1 items had failures:
   1 of   1 in __main__.func
***Test Failed*** 1 failures.
robin@everest ~/tmp:
$ cat tdt1.py
# -*- coding: utf-8 -*-
def func(a):
    """
    >>> print(func(u'aaa\u020b'))
    aaaȋ
    """
    return a

def _doctest():
    import doctest
    doctest.testmod()

if __name__ == "__main__":
    _doctest()
robin@everest ~/tmp:

So the error persists with or without copying errors.

Note that on my putty terminal I don't see the character properly (I see unknown
glyph square box), but it copies OK.
 
