Handling some isolated iso-8859-1 characters

Daniel Mahoney · Jun 3, 2008

I'm working on an app that's processing Usenet messages. I'm making a
connection to my NNTP feed and grabbing the headers for the groups I'm
interested in, saving the info to disk, and doing some post-processing.
I'm finding a few bizarre characters and I'm not sure how to handle them
pythonically.

One of the lines I'm finding this problem with contains:
137050 Cleo and I have an anouncement! "Mlle. =?iso-8859-1?Q?Ana=EFs?="
<[email protected]> Sun, 21 Nov 2004 16:21:50 -0500
<[email protected]> 4478 69 Xref:
sn-us rec.pets.cats.community:137050

The interesting part is the string that reads "=?iso-8859-1?Q?Ana=EFs?=".
An HTML rendering of what this string should look like would be "Anaïs".

What I'm doing now is a brute-force substitution from the version in the
file to the HTML version. That's ugly. What's a better way to translate
that string? Or is my problem that I'm grabbing the headers from the NNTP
server incorrectly?

Justin Ezequiel · Jun 4, 2008

>>> from email.Header import decode_header
>>> decode_header("=?iso-8859-1?Q?Ana=EFs?=")
[('Ana\xefs', 'iso-8859-1')]
>>> (s, e), = decode_header("=?iso-8859-1?Q?Ana=EFs?=")
>>> s
'Ana\xefs'
>>> e
'iso-8859-1'
>>> s.decode(e)
u'Ana\xefs'
>>> import unicodedata
>>> import htmlentitydefs
>>> for c in s.decode(e):
...     print ord(c), unicodedata.name(c)
...
65 LATIN CAPITAL LETTER A
110 LATIN SMALL LETTER N
97 LATIN SMALL LETTER A
239 LATIN SMALL LETTER I WITH DIAERESIS
115 LATIN SMALL LETTER S
>>> htmlentitydefs.codepoint2name[239]
'iuml'
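The session above is Python 2. For readers on Python 3, a rough equivalent might look like the sketch below. It assumes the module names email.header and html.entities (the Python 3 spellings), and the helper to_html_entities is a hypothetical name introduced here for illustration, not something from the thread.

```python
import unicodedata
from email.header import decode_header
from html.entities import codepoint2name

# decode_header returns a list of (value, charset) pairs. For an encoded
# word, the value is bytes and the charset is a lowercase string.
(raw, charset), = decode_header("=?iso-8859-1?Q?Ana=EFs?=")
text = raw.decode(charset)

# Inspect each decoded character, as in the Python 2 session.
for ch in text:
    print(ord(ch), unicodedata.name(ch))


def to_html_entities(s):
    """Replace non-ASCII characters with named or numeric HTML entities.

    Hypothetical helper: it does not escape '&', '<' or '>'.
    """
    parts = []
    for ch in s:
        cp = ord(ch)
        if cp < 128:
            parts.append(ch)
        elif cp in codepoint2name:
            parts.append("&%s;" % codepoint2name[cp])
        else:
            parts.append("&#%d;" % cp)
    return "".join(parts)


html_name = to_html_entities(text)
print(html_name)
```

If the output page is served as UTF-8, the entity step is probably unnecessary and the decoded string can be written out directly.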

Gabriel Genellina · Jun 4, 2008

No, it's not you; those headers are encoded according to RFC 2047
<http://www.faqs.org/ftp/rfc/rfc2047.txt>.
Python already supports that format: use the email.header module,
see <http://docs.python.org/lib/module-email.header.html>
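As a minimal sketch of that suggestion in Python 3, a whole header field that mixes plain text with RFC 2047 encoded words can be decoded in one step by combining decode_header with make_header. The function name decode_rfc2047 is a hypothetical helper introduced here; the sample values are taken from the header line in the question, with the surrounding double quotes left off.

```python
from email.header import decode_header, make_header


def decode_rfc2047(value):
    """Decode a header value that may contain RFC 2047 encoded words.

    Hypothetical helper: decode_header splits the value into
    (chunk, charset) pairs, and make_header joins them back into a
    Header object whose str() is the decoded text.
    """
    return str(make_header(decode_header(value)))


# Mixed plain text and an encoded word.
sender = decode_rfc2047("Mlle. =?iso-8859-1?Q?Ana=EFs?=")
print(sender)

# A value with no encoded words passes through unchanged.
subject = decode_rfc2047("Cleo and I have an anouncement!")
print(subject)
```

Running every header field through one helper like this avoids per-character substitution tables, since the charset named in each encoded word drives the decoding.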

Daniel Mahoney · Jun 4, 2008

Excellent, that's exactly what I was looking for. Thanks!

Daniel Mahoney · Jun 4, 2008

Looks like I need to explore the unicodedata module. Thanks!

Max M · Jun 4, 2008

Daniel Mahoney wrote:

The interesting part is the string that reads "=?iso-8859-1?Q?Ana=EFs?=".
An HTML rendering of what this string should look like would be "Anaïs".

There is a mention of email headers and Unicode at the end of this article:

http://mxm-mad-science.blogspot.com/2008/03/python-unicode-lessons-from-school-of.html

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
