Trouble fixing a broken ASCII string - "replace" mode in codec not working.

Robert Kern

John said:
I'm trying to clean up a bad ASCII string, one read from a
web page that is supposedly in the ASCII character set but has some
characters above 127. And I get this:

File "D:\projects\sitetruth\InfoSitePage.py", line 285, in httpfetch
sitetext = sitetext.encode('ascii','replace') # force to clean ASCII

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 29151:
ordinal not in range(128)

Why is that exception being raised when the codec was told 'replace'?

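The .encode('ascii') takes unicode strings to str strings. Since you gave it a
str string, it first tried to convert it to a unicode string using the default
codec ('ascii'), just as if you were to have done
unicode(sitetext).encode('ascii', 'replace'). That implicit decode step uses
strict error handling, so the 'replace' you passed never comes into play; it
only applies to the encode step.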

I think you want something like this:

sitetext = sitetext.decode('ascii', 'replace').encode('ascii', 'replace')
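For readers on Python 3, where a byte string is a bytes object and has no .encode() at all, the same round trip can be sketched as below. This is only an illustration: the function name force_ascii and the sample bytes are made up, and the thread itself concerns Python 2.4.

```python
# Python 3 sketch of the decode-then-encode round trip.
# force_ascii and the sample input are hypothetical, for illustration only.

def force_ascii(raw):
    """Return raw with every byte above 127 replaced by b'?'."""
    # Decode first: each byte that is not valid ASCII becomes U+FFFD.
    text = raw.decode("ascii", "replace")
    # Encode back: U+FFFD has no ASCII form, so "replace" turns it into "?".
    return text.encode("ascii", "replace")


sample = b"It\x92s supposedly ASCII"  # 0x92 is the byte named in the traceback
print(force_ascii(sample))  # b'It?s supposedly ASCII'
```

The point of the two steps is that 'replace' is honoured separately by each one: the decode handles the bad bytes, and the encode handles the replacement characters the decode produced.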

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
John Nagle

I'm trying to clean up a bad ASCII string, one read from a
web page that is supposedly in the ASCII character set but has some
characters above 127. And I get this:

File "D:\projects\sitetruth\InfoSitePage.py", line 285, in httpfetch
sitetext = sitetext.encode('ascii','replace') # force to clean ASCII

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 29151:
ordinal not in range(128)

Why is that exception being raised when the codec was told 'replace'?

(And no, just converting it to Unicode with "sitetext = unicode(sitetext)"
won't work either; that correctly raises a Unicode conversion exception.)

[Python 2.4, Win32]

John Nagle
 
Neil Cerutti

The .encode('ascii') takes unicode strings to str strings.
Since you gave it a str string, it first tried to convert it to
a unicode string using the default codec ('ascii'), just as if
you were to have done unicode(sitetext).encode('ascii',
'replace').

I think you want something like this:

sitetext = sitetext.decode('ascii', 'replace').encode('ascii', 'replace')

This is the cue for the translate method, which will be much
faster and simpler for cases like this. You can build the
translation table yourself, or use maketrans.
.... '?'*127)


You'd only want to do that once. Then to strip off the non-ascii:

sitetext.translate(asciitable)

I used a similar solution in an application I'm working on that
must use a Latin-1 byte encoding internally, but displays on
stdout in ASCII.
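For anyone on Python 3, the translate approach might look roughly like the sketch below. The names HIGH_BYTES, ASCII_TABLE and replace_non_ascii are made up for illustration, and mapping bytes 128 through 255 to '?' is an assumption about the intent, not a reconstruction of the truncated snippet above.

```python
# Python 3 sketch of a byte translation table; all names are hypothetical.

# Every byte value outside the 7-bit ASCII range.
HIGH_BYTES = bytes(range(128, 256))

# Build the 256-entry table once: each high byte maps to "?",
# and every other byte maps to itself.
ASCII_TABLE = bytes.maketrans(HIGH_BYTES, b"?" * len(HIGH_BYTES))


def replace_non_ascii(raw):
    """Return raw with every byte above 127 replaced by b'?'."""
    return raw.translate(ASCII_TABLE)


print(replace_non_ascii(b"caf\xe9 \x92"))  # b'caf? ?'
```

Since the table is built only once, each later call is a single pass over the bytes with no intermediate unicode string.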
 
