UnicodeDecodeError? Argh! Nothing works! I'm tired and hurting and...


Alf P. Steinbach

This is the tragic story of this evening:
1. Aspirins to lessen the pain somewhat.
2. Over in [comp.programming] someone mentions paper on Quicksort.
3. I recall that X once sent me link to paper about how to foil
Quicksort, written by was it Doug McIlroy, anyway some Bell Labs guy.
Want to post that link in response to [comp.programming] article.
4. Checking in Thunderbird, no mails from X or about QS there.
5. But his mail address in address list so something funny going on!
6. Googling, yes, it seems Thunderbird has a habit of "forgetting" mails. But
they're really there after all. It's just the index that's screwed up.
7. OK, opening Thunderbird mailbox file (it's just text) in nearest editor.
8. Machine hangs, Windows says it must increase virtual memory, blah blah.
9. Making little Python script to extract individual mails from file.
10. It says UnicodeDecodeError on mail nr. something something.
11. I switch mode to binary. Didn't know if that would work with std input.
12. It's now apparently ten times faster but *still* UnicodeDecodeError!
13. I ask here!

Of course could have googled that paper, but at each step above it seemed just a
half minute more to find the link in mails, and now I decided it must be found.

And I'm hesitant to just delete index file, hoping that it'll rebuild.

Thunderbird does funny things, so best would be if Python script worked.


<code>
import os
import fileinput

def write( s ): print( s, end = "" )

msg_id = 0
f = open( "nul", "w" )
for line in fileinput.input( mode = "rb" ):
    if line.startswith( "From - " ):
        msg_id += 1;
        f.close()
        print( msg_id )
        f = open( "msg_{0:0>6}.txt".format( msg_id ), "w+" )
    else:
        f.write( line )
f.close()
</code>


<last few lines of output>
955
956
957
958
Traceback (most recent call last):
  File "C:\test\tbfix\splitmails.py", line 11, in <module>
    for line in fileinput.input( mode = "rb" ):
  File "C:\Program Files\cpython\python31\lib\fileinput.py", line 254, in __next__
    line = self.readline()
  File "C:\Program Files\cpython\python31\lib\fileinput.py", line 349, in readline
    self._buffer = self._file.readlines(self._bufsize)
  File "C:\Program Files\cpython\python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2188: character maps to <undefined>
</last few lines of output>


Cheers,

- Alf
 

Alf P. Steinbach

The following worked:


<code>
import sys
import fileinput

def write( s ): print( s, end = "" )

msg_id = 0
f = open( "nul", "w" )
input = sys.stdin.detach() # binary
while True:
    line = input.readline()
    if len( line ) == 0:
        break
    elif line.decode( "ascii", "ignore" ).startswith( "From - " ):
        msg_id += 1;
        f.close()
        print( msg_id )
        f = open( "msg_{0:0>6}.txt".format( msg_id ), "wb+" )
    else:
        f.write( line )
f.close()
</code>
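
For anyone landing here later, the same idea can be sketched without decoding at all: compare the separator as bytes. This is only a sketch under assumptions, not the script above; `split_messages` and the sample data are hypothetical names made up for illustration, and it yields messages instead of writing files.

```python
import io

def split_messages(stream, separator=b"From - "):
    """Yield each message as a list of raw byte lines read from a binary stream."""
    current = None
    for line in stream:                 # iterating a binary stream yields bytes lines
        if line.startswith(separator):  # bytes-to-bytes comparison, no decoding needed
            if current is not None:
                yield current
            current = []                # the separator line itself is dropped
        elif current is not None:
            current.append(line)
    if current is not None:
        yield current

# Hypothetical sample standing in for a Thunderbird mailbox file.
sample = io.BytesIO(
    b"From - Mon Nov 16 2009\n"
    b"Subject: one\n"
    b"\x8f raw byte that cp1252 cannot decode\n"
    b"From - Tue Nov 17 2009\n"
    b"Subject: two\n"
)
messages = list(split_messages(sample))
```

With a real file, the stream would come from something like `open(path, "rb")` or `sys.stdin.detach()`.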


Cheers,

- Alf
 

Terry Reedy

Alf said:
import os
import fileinput

def write( s ): print( s, end = "" )

I believe this is the same as
write = sys.stdout.write
though you never use it that I see.
msg_id = 0
f = open( "nul", "w" )
for line in fileinput.input( mode = "rb" ):

I presume you are expecting the line to be undecoded bytes, as with
open(f,'rb'). To be sure, add write(type(line)).
    if line.startswith( "From - " ):
        msg_id += 1;
        f.close()
        print( msg_id )
        f = open( "msg_{0:0>6}.txt".format( msg_id ), "w+" )

I do not understand why you are writing since you just wanted to look.
In any case, you open in text mode.

    else:
        f.write( line )
f.close()
</code>


<last few lines of output>
955
956
957
958
Traceback (most recent call last):
  File "C:\test\tbfix\splitmails.py", line 11, in <module>
    for line in fileinput.input( mode = "rb" ):
  File "C:\Program Files\cpython\python31\lib\fileinput.py", line 254, in __next__
    line = self.readline()
  File "C:\Program Files\cpython\python31\lib\fileinput.py", line 349, in readline
    self._buffer = self._file.readlines(self._bufsize)
  File "C:\Program Files\cpython\python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 2188: character maps to <undefined>

It goes ahead and tries to decode to str anyway. Maybe there is a bug,
though maybe the text-mode open in the loop somehow changes fileinput,
especially if you write to something it has open. So I would not report
a bug until I tried reading without writing.

tjr
 

Lie Ryan

Alf said:
And I'm hesitant to just delete index file, hoping that it'll rebuild.

It'll be rebuilt the next time you start Thunderbird:
(MozillaZine: http://kb.mozillazine.org/Disappearing_mail)
* It's possible that the ".msf" files (index files) are corrupted. To
rebuild the index of a folder, right-click it, select Properties, and
choose "Rebuild Index" from the General Information tab. You can also
close Thunderbird and manually delete them from your profile folder;
they will be rebuilt when Thunderbird starts.
 

Nobody

10. It says UnicodeDecodeError on mail nr. something something.

That's what you get for using Python 3.x ;)

If you must use 3.x, don't use the standard descriptors. If you must use
the standard descriptors in 3.x, call detach() on them to get the
underlying binary stream, i.e.

stdin = sys.stdin.detach()
stdout = sys.stdout.detach()

and use those instead.

Or set LC_ALL or LC_CTYPE to an ISO-8859-* locale (any stream of bytes can
be decoded, and any string resulting from decoding can be encoded).
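
A small sketch of why the locale trick works, using ISO-8859-1 (Latin-1) specifically, which is the one case I am sure maps all 256 byte values; other ISO-8859-* variants are not checked here. It also reproduces the cp1252 failure on byte 0x8f from the traceback. The variable names are just for illustration.

```python
# Every possible byte value, 0x00 through 0xff.
data = bytes(range(256))

# Latin-1 maps each byte to the code point with the same number,
# so decoding never fails and encoding gives the original bytes back.
text = data.decode("latin-1")
round_trip = text.encode("latin-1")

# cp1252 leaves a few byte values undefined, 0x8f among them.
cp1252_error = None
try:
    b"\x8f".decode("cp1252")
except UnicodeDecodeError as exc:
    cp1252_error = str(exc)
```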
 

Steven D'Aprano

6. Googling, yes, it seems Thunderbird has a habit of "forgetting"
mails. But they're really there after all. It's just the index that's
screwed up. [...]
And I'm hesitant to just delete index file, hoping that it'll rebuild.

Right-click on the mailbox and choose "Rebuild Index".

If you're particularly paranoid, and you probably should be, make a
backup copy of the entire mail folder first.

http://kb.mozillazine.org/Compacting_folders
http://kb.mozillazine.org/Recover_messages_from_a_corrupt_folder
http://kb.mozillazine.org/Disappearing_mail


Good grief, it's about six weeks away from 2010 and Thunderbird still
uses mbox as its default mail box format. Hello, the nineties called,
they want their mail formats back! Are the tbird developers on crack or
something? I can't believe that they're still using that crappy format.

No, I tell a lie. I can believe it far too well.
 

Chris Jones

Good grief, it's about six weeks away from 2010 and Thunderbird still
uses mbox as its default mail box format. Hello, the nineties called,
they want their mail formats back! Are the tbird developers on crack or
something? I can't believe that they're still using that crappy format.

No, I tell a lie. I can believe it far too well.

:)

I realize that's somewhat OT, but what mail box format do you recommend,
and why?

Thanks,

CJ
 

Steven D'Aprano

:)

I realize that's somewhat OT, but what mail box format do you recommend,
and why?

maildir++

http://en.wikipedia.org/wiki/Maildir

Corruption is less likely, if there is corruption you'll only lose a
single message rather than potentially everything in the mail folder[*],
at a pinch you can read the emails using a text editor or easily grep
through them, and compacting the mail folder is lightning fast, there's
no wasted space in the mail folder, and there's no need to mangle lines
starting with "From " in the body of the email.

The only major downside is that because you're dealing with potentially
thousands of smallish files, it *may* have reduced performance on some
older file systems that don't deal well with lots of files. These days,
that's not a real issue.

Oh yes, and people using Windows can't use maildir because (1) it doesn't
allow colons in names, and (2) it doesn't have atomic renames. Neither of
these are insurmountable problems: an implementation could substitute
another character for the colon, and while that would be a technical
violation of the standard, it would still work. And the lack of atomic
renames would simply mean that implementations have to be more careful
about not having two threads writing to the one mailbox at the same time.




[*] I'm assuming normal "oops there's a bug in the mail client code"
corruption rather than "I got drunk and started deleting random files and
directories" corruption.
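
For anyone who wants to try maildir on an existing mbox, here is a rough sketch using the standard library `mailbox` module. It is an illustration under assumptions, not a tested migration tool: the paths and the two sample messages are made up, and it works in a throwaway temp directory. Run anything like this against a backup copy, not the live mail folder.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

workdir = tempfile.mkdtemp()
mbox_path = os.path.join(workdir, "Inbox")             # hypothetical mbox file
maildir_path = os.path.join(workdir, "Inbox.maildir")  # hypothetical target directory

# Build a tiny mbox so the sketch is self-contained.
box = mailbox.mbox(mbox_path)
for subject in ("first", "second"):
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["Subject"] = subject
    msg.set_content("hello\n")
    box.add(msg)
box.flush()
box.close()

# Copy every message into a Maildir: one file per message.
source = mailbox.mbox(mbox_path)
target = mailbox.Maildir(maildir_path, create=True)
for message in source:
    target.add(message)
target.flush()
copied_subjects = sorted(m["Subject"] for m in target)
source.close()
target.close()
```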
 

samwyse

Oh yes, and people using Windows can't use maildir because (1) it doesn't
allow colons in names, and (2) it doesn't have atomic renames. Neither of
these are insurmountable problems: an implementation could substitute
another character for the colon, and while that would be a technical
violation of the standard, it would still work. And the lack of atomic
renames would simply mean that implementations have to be more careful
about not having two threads writing to the one mailbox at the same time.

A common workaround for the former is to URL encode the names, which
lets you stick in all sorts of odd characters.

I'm afraid I can't help with the latter, though.
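
A quick sketch of the URL-encoding idea with `urllib.parse`. The file name is a made-up example in the usual maildir shape, not taken from a real mailbox, and this says nothing about what any particular mail client actually does.

```python
from urllib.parse import quote, unquote

# Hypothetical maildir-style name; the colon is what Windows rejects.
name = "1258398000.M20046P2217.example:2,S"

# safe="" percent-encodes everything except letters, digits and "_.-~".
safe_name = quote(name, safe="")

# Decoding recovers the original name exactly.
restored = unquote(safe_name)
```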
 

Chris Jones


Outside the two pluses, maildir also goes back to the 90s - 1995, Daniel
Bernstein's original specification.
Corruption is less likely, if there is corruption you'll only lose a
single message rather than potentially everything in the mail folder[*],
at a pinch you can read the emails using a text editor or easily grep
through them, and compacting the mail folder is lightning fast, there's
no wasted space in the mail folder, and there's no need to mangle lines
starting with "From " in the body of the email.

This last aspect is very welcome.
The only major downside is that because you're dealing with potentially
thousands of smallish files, it *may* have reduced performance on some
older file systems that don't deal well with lots of files. These days,
that's not a real issue.

Oh yes, and people using Windows can't use maildir because (1) it doesn't
allow colons in names, and (2) it doesn't have atomic renames. Neither of
these are insurmountable problems: an implementation could substitute
another character for the colon, and while that would be a technical
violation of the standard, it would still work. And the lack of atomic
renames would simply mean that implementations have to be more careful
about not having two threads writing to the one mailbox at the same time.


[*] I'm assuming normal "oops there's a bug in the mail client code"
corruption rather than "I got drunk and started deleting random files and
directories" corruption.

I'm not concerned with the other aspects, but I'm reaching a point where
mutt is becoming rather sluggish with the mbox format, especially with
mail boxes that have more than about 3000 messages, and it looks like
maildir, especially with some form of header caching, might help.

Looks like running a local IMAP server would probably be more effective,
though.

Thank you for your comments.

CJ
 

Terry Reedy

Ken said:
I need to create a pipe where I have one thread (or maybe a generator)
writing data to the tail while another python object is reading from the
head. This will run in real time, so the data must be deallocated after
it is consumed.

CPython does that when last reference disappears.
Reading should block until data is written, and writing
should block when the buffer is full (i.e. until some of the data is
consumed). I assume there must be a trivial way to do this, but I don't
see it. Any ideas or examples?

I'm using python 2.6.
Use the queue module (spelled Queue in Python 2.x).
 

Dave Angel

Ken said:
I need to create a pipe where I have one thread (or maybe a generator)
writing data to the tail while another python object is reading from
the head. This will run in real time, so the data must be deallocated
after it is consumed. Reading should block until data is written, and
writing should block when the buffer is full (i.e. until some of the
data is consumed). I assume there must be a trivial way to do this,
but I don't see it. Any ideas or examples?

I'm using python 2.6.
Seems to me collections.deque is a good data structure for the purpose,
at least if both operations are in the same thread.

For multithreading, consider Queue module (or queue in Python 3.x).
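
A minimal sketch of the blocking pipe described in the question, written for Python 3 (on 2.6 the import would be `Queue` instead of `queue`). The names `producer`, `consumer`, and `pipe`, the buffer size of 4, and the `None` end marker are all choices made for this illustration.

```python
import queue
import threading

def producer(pipe, items):
    for item in items:
        pipe.put(item)       # blocks while the queue already holds maxsize items
    pipe.put(None)           # sentinel telling the consumer to stop

def consumer(pipe, received):
    while True:
        item = pipe.get()    # blocks until the producer has written something
        if item is None:
            break
        received.append(item)

pipe = queue.Queue(maxsize=4)    # bounded buffer
received = []
reader = threading.Thread(target=consumer, args=(pipe, received))
writer = threading.Thread(target=producer, args=(pipe, range(20)))
reader.start()
writer.start()
writer.join()
reader.join()
```

Once `get()` hands an item over, the queue no longer holds a reference to it, so the data can be freed as soon as the consumer drops it.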

DaveA
 

Aahz

Good grief, it's about six weeks away from 2010 and Thunderbird still
uses mbox as its default mail box format. Hello, the nineties called,
they want their mail formats back! Are the tbird developers on crack or
something? I can't believe that they're still using that crappy format.

Just to be contrary, I *like* mbox.
 

Nobody

Me too.
Why? What features or benefits of mbox do you see that make up for its
disadvantages?

Simplicity and performance.

Maildir isn't simple when you add in the filesystem or archive format
(leaving aside the fact that maildir cannot be processed using nothing but
ANSI C).

Nor is it particularly quick if you want to grep for a message in a
decade's worth of archives (even on Linux; and NTFS is *much* worse for
dealing with many small files).
 
