Handling some isolated iso-8859-1 characters


Daniel Mahoney

I'm working on an app that's processing Usenet messages. I'm making a
connection to my NNTP feed and grabbing the headers for the groups I'm
interested in, saving the info to disk, and doing some post-processing.
I'm finding a few bizarre characters and I'm not sure how to handle them
pythonically.

One of the lines I'm finding this problem with contains:
137050 Cleo and I have an anouncement! "Mlle. =?iso-8859-1?Q?Ana=EFs?="
<[email protected]> Sun, 21 Nov 2004 16:21:50 -0500
<[email protected]> 4478 69 Xref:
sn-us rec.pets.cats.community:137050

The interesting patch is the string that reads "=?iso-8859-1?Q?Ana=EFs?=".
An HTML rendering of what this string should look like would be "Ana&iuml;s".

What I'm doing now is a brute-force substitution from the version in the
file to the HTML version. That's ugly. What's a better way to translate
that string? Or is my problem that I'm grabbing the headers from the NNTP
server incorrectly?
 

Justin Ezequiel

>>> from email.Header import decode_header
>>> decode_header("=?iso-8859-1?Q?Ana=EFs?=")
[('Ana\xefs', 'iso-8859-1')]
>>> (s, e), = decode_header("=?iso-8859-1?Q?Ana=EFs?=")
>>> s
'Ana\xefs'
>>> e
'iso-8859-1'
>>> s.decode(e)
u'Ana\xefs'
>>> import unicodedata
>>> import htmlentitydefs
>>> for c in s.decode(e):
...     print ord(c), unicodedata.name(c)
...
65 LATIN CAPITAL LETTER A
110 LATIN SMALL LETTER N
97 LATIN SMALL LETTER A
239 LATIN SMALL LETTER I WITH DIAERESIS
115 LATIN SMALL LETTER S
>>> htmlentitydefs.codepoint2name[239]
'iuml'
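
For anyone reading this on Python 3: the session above uses Python 2 names (email.Header, htmlentitydefs, the print statement). A rough equivalent, as a sketch only, might look like the following. The helper names decode_word and to_html are made up for illustration, and to_html deliberately leaves ASCII characters such as & and < untouched, so it is not a full HTML escaper.

```python
from email.header import decode_header
from html.entities import codepoint2name


def decode_word(encoded):
    """Decode an RFC 2047 encoded string into a str (hypothetical helper)."""
    parts = []
    for payload, charset in decode_header(encoded):
        # Encoded parts come back as bytes; plain parts may already be str.
        if isinstance(payload, bytes):
            payload = payload.decode(charset or "ascii", errors="replace")
        parts.append(payload)
    return "".join(parts)


def to_html(text):
    """Replace non-ASCII characters with HTML entities (hypothetical helper)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)  # ASCII passes through unchanged, including & and <
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])  # named entity, e.g. &iuml;
        else:
            out.append("&#%d;" % cp)  # numeric fallback when no name exists
    return "".join(out)


name = decode_word("=?iso-8859-1?Q?Ana=EFs?=")
print(name, to_html(name))
```

This replaces the brute-force substitution table with a lookup that covers every character html.entities knows about.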
 

Gabriel Genellina

No, it's not you; those headers are formatted following RFC 2047
<http://www.faqs.org/ftp/rfc/rfc2047.txt>.
Python already has support for that format: use the email.header module,
see <http://docs.python.org/lib/module-email.header.html>
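
As a sketch of what using that module could look like on Python 3, decode_header and make_header can be combined to turn a header value that mixes plain text and encoded words into one string. The helper name decode_header_value and the sample From value are assumptions for illustration, not taken from the original feed.

```python
from email.header import decode_header, make_header


def decode_header_value(raw):
    """Return the header value with RFC 2047 encoded words decoded.

    Hypothetical helper: decode_header splits the value into
    (payload, charset) pairs and make_header joins them back together.
    """
    return str(make_header(decode_header(raw)))


# A lone encoded word, as in the question.
word = decode_header_value("=?iso-8859-1?Q?Ana=EFs?=")

# Plain text followed by an encoded word (sample value, assumed).
mixed = decode_header_value("Mlle. =?iso-8859-1?Q?Ana=EFs?=")

print(word)
print(mixed)
```

Values without any encoded words pass through unchanged, so the same call can be applied to every header line.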
 

Daniel Mahoney

Looks like I need to explore the unicodedata module. Thanks!
 
