Python 2.1 / 2.3: xreadlines not working with codecs.open

Eric Brunel

Hi all,

I just found a problem in the xreadlines method/module when used with codecs.open: the codec specified in the open does not seem to be taken into account by xreadlines which also returns byte-strings instead of unicode strings.

For example, if a file foo.txt contains some text encoded in latin1:
>>> import codecs
>>> f = codecs.open('foo.txt', 'r', 'utf-8', 'replace')
>>> [l for l in f.xreadlines()]
['\xe9\xe0\xe7\xf9\n']

But:
>>> import codecs
>>> f = codecs.open('foo.txt', 'r', 'utf-8', 'replace')
>>> f.readlines()
[u'\ufffd\ufffd']

The characters in latin1 are correctly "dumped" with readlines, but are still in latin1 encoding in byte-strings with xreadlines.

I tested with Python 2.1 and 2.3 on Linux and Windows, with the same result (I don't have Python 2.4 installed here).

Can anybody confirm the problem? Is this a bug? I searched this newsgroup and the known Python bugs, but the problem does not seem to have been reported yet.

TIA
 
Eric Brunel

Hi all,

I just found a problem in the xreadlines method/module when used with codecs.open: the codec specified in the open does not seem to be taken into account by xreadlines which also returns byte-strings instead of unicode strings.

For example, if a file foo.txt contains some text encoded in latin1:
>>> import codecs
>>> f = codecs.open('foo.txt', 'r', 'utf-8', 'replace')
>>> [l for l in f.xreadlines()]
['\xe9\xe0\xe7\xf9\n']

But:
>>> import codecs
>>> f = codecs.open('foo.txt', 'r', 'utf-8', 'replace')
>>> f.readlines()
[u'\ufffd\ufffd']

The characters in latin1 are correctly "dumped" with readlines, but are still in latin1 encoding in byte-strings with xreadlines.

Replying to myself. One more funny thing:
>>> import codecs, xreadlines
>>> f = codecs.open('foo.txt', 'r', 'utf-8', 'replace')
>>> [l for l in xreadlines.xreadlines(f)]
[u'\ufffd\ufffd']

So f.xreadlines does not work, but xreadlines.xreadlines(f) does. And this happens in Python 2.3, but also in Python 2.1, where the implementation for f.xreadlines() calls xreadlines.xreadlines(f) (?!?). Something's escaping me here... Reading the source didn't help.

At least, it does provide a workaround...
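
Another possible workaround, sketched here as an assumption rather than something tested on 2.1 or 2.3, is to skip xreadlines altogether and iterate lazily over the reader's own readline method, which does go through the codec. To keep the sketch self-contained it uses an in-memory byte stream (io.BytesIO, which needs a more recent Python) wrapped with codecs.getreader in place of foo.txt; with a real file, f would simply be the object returned by codecs.open.

```python
import codecs
import io

# Stand-in for foo.txt: the latin1-encoded bytes from the example above.
raw = io.BytesIO(b'\xe9\xe0\xe7\xf9\n')

# Wrap the byte stream in a decoding reader, similar in spirit to what
# codecs.open() returns for a real file.
f = codecs.getreader('utf-8')(raw, errors='replace')

# iter(callable, sentinel) keeps calling f.readline() until it returns
# the empty string, so lines are decoded one at a time.
lines = [l for l in iter(f.readline, u'')]
```

Every element of lines is a unicode string here, with the undecodable bytes replaced by U+FFFD, since readline is implemented by the codec wrapper rather than inherited from the underlying file.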
 
Peter Otten

Eric said:
I just found a problem in the xreadlines method/module when used with
codecs.open: the codec specified in the open does not seem to be taken
into account by xreadlines which also returns byte-strings instead of
unicode strings.
So f.xreadlines does not work, but xreadlines.xreadlines(f) does. And this
happens in Python 2.3, but also in Python 2.1, where the implementation
for f.xreadlines() calls xreadlines.xreadlines(f) (?!?). Something's
escaping me here... Reading the source didn't help.

codecs.StreamReaderWriter seems to delegate everything it doesn't implement
itself to the underlying file instance which is ignorant of the encoding.
The culprit:

    def __getattr__(self, name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream, name)

Eric said:
At least, it does provide a workaround...

Note that the xreadlines module hasn't made it into Python 2.4.
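
The delegation described above can be reproduced in isolation. The following is only a minimal sketch: RawStream and DecodingWrapper are hypothetical stand-ins, not the real file object or codecs.StreamReaderWriter, and the wrapper decodes in readlines() only, which is enough to show the effect.

```python
# Hypothetical stand-ins for illustration; these are NOT the real codecs classes.

class RawStream(object):
    """Plays the role of the underlying byte-oriented file."""

    def __init__(self, lines):
        self._lines = lines

    def readlines(self):
        return list(self._lines)

    def xreadlines(self):
        # Knows nothing about encodings: yields the raw byte-strings.
        return iter(self._lines)


class DecodingWrapper(object):
    """Decodes in readlines() and inherits everything else via __getattr__."""

    def __init__(self, stream, encoding, errors='strict'):
        self.stream = stream
        self.encoding = encoding
        self.errors = errors

    def readlines(self):
        return [line.decode(self.encoding, self.errors)
                for line in self.stream.readlines()]

    def __getattr__(self, name):
        # Only called for attributes the wrapper does not define itself,
        # so xreadlines is looked up on the undecoded stream.
        return getattr(self.stream, name)


raw = RawStream([b'\xe9\xe0\xe7\xf9\n'])
f = DecodingWrapper(raw, 'utf-8', 'replace')

decoded = f.readlines()            # handled by the wrapper: unicode strings
delegated = list(f.xreadlines())   # falls through to RawStream: raw bytes
```

In this sketch decoded holds unicode strings while delegated holds the untouched bytes, which mirrors the behaviour Eric observed. It would also be consistent with xreadlines.xreadlines(f) working, presumably because that function is handed the wrapper itself and so ends up calling the wrapper's own reading methods.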

Peter
 
