Trouble fixing a broken ASCII string - "replace" mode in codec not working.

Discussion in 'Python' started by Robert Kern, Feb 6, 2007.

  1. Robert Kern

    Robert Kern Guest

    Re: Trouble fixing a broken ASCII string - "replace" mode in codec not working.

    John Nagle wrote:
    > I'm trying to clean up a bad ASCII string, one read from a
    > web page that is supposedly in the ASCII character set but has some
    > characters above 127. And I get this:
    >
    > File "D:\projects\sitetruth\InfoSitePage.py", line 285, in httpfetch
    > sitetext = sitetext.encode('ascii','replace') # force to clean ASCII
    >
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 29151:
    > ordinal not in range(128)
    >
    > Why is that exception being raised when the codec was told 'replace'?


    The .encode('ascii') takes unicode strings to str strings. Since you gave it a
    str string, it first tried to convert it to a unicode string using the default
    codec ('ascii'), just as if you were to have done
    unicode(sitetext).encode('ascii', 'replace').
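
    To make the hidden step visible, here is a minimal sketch written for
    Python 3, where bytes objects have no .encode() and the decode has to be
    spelled out. The helper name encode_like_python2 and the sample bytes are
    made up for illustration; they are not taken from the original traceback.

```python
def encode_like_python2(raw, errors="replace"):
    """Mimic what Python 2 does for str.encode('ascii', errors).

    The hidden decode step always runs in strict mode; the `errors`
    argument only reaches the final encode step.
    """
    text = raw.decode("ascii")           # implicit step in Python 2, strict
    return text.encode("ascii", errors)  # explicit step, honours `errors`


sample = b"It\x92s broken"  # hypothetical input with one byte above 127

try:
    encode_like_python2(sample)
    outcome = "no error"
except UnicodeDecodeError as exc:
    # A *decode* error, even though the encoder was told to replace.
    outcome = "%s at position %d" % (type(exc).__name__, exc.start)

print(outcome)
```

    The 'replace' handler never gets a chance to run, because the failure
    happens one step earlier, in the decode.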

    I think you want something like this:

    sitetext = sitetext.decode('ascii', 'replace').encode('ascii', 'replace')
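
    As a rough check of that one-liner, the sketch below runs the same two
    steps on a made-up sample. It is written for Python 3 (bytes in, bytes
    out); the function name force_ascii and the sample data are assumptions
    for illustration only.

```python
def force_ascii(raw):
    """Return `raw` with every non-ASCII byte turned into b'?'."""
    # decode: each byte above 127 becomes U+FFFD, the replacement character
    text = raw.decode("ascii", "replace")
    # encode: U+FFFD has no ASCII form, so 'replace' emits '?' for it
    return text.encode("ascii", "replace")


sample = b"It\x92s broken"  # hypothetical input, not the real page
cleaned = force_ascii(sample)
print(cleaned)
```

    Both calls need the 'replace' handler: the first one to get past the bad
    bytes, the second one to get rid of the replacement characters again.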

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
     
    Robert Kern, Feb 6, 2007
    #1

  2. John Nagle

    John Nagle Guest

    I'm trying to clean up a bad ASCII string, one read from a
    web page that is supposedly in the ASCII character set but has some
    characters above 127. And I get this:

    File "D:\projects\sitetruth\InfoSitePage.py", line 285, in httpfetch
    sitetext = sitetext.encode('ascii','replace') # force to clean ASCII

    UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 29151:
    ordinal not in range(128)

    Why is that exception being raised when the codec was told 'replace'?

    (And no, just converting it to Unicode with "sitetext = unicode(sitetext)"
    won't work either; that correctly raises a Unicode conversion exception.)

    [Python 2.4, Win32]

    John Nagle
     
    John Nagle, Feb 6, 2007
    #2

  3. Neil Cerutti

    Neil Cerutti Guest

    Re: Trouble fixing a broken ASCII string - "replace" mode in codec not working.

    On 2007-02-06, Robert Kern wrote:
    > John Nagle wrote:
    >> File "D:\projects\sitetruth\InfoSitePage.py", line 285, in httpfetch
    >> sitetext = sitetext.encode('ascii','replace') # force to clean ASCII
    >>
    >> UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in
    >> position 29151: ordinal not in range(128)
    >>
    >> Why is that exception being raised when the codec was told 'replace'?

    >
    > The .encode('ascii') takes unicode strings to str strings.
    > Since you gave it a str string, it first tried to convert it to
    > a unicode string using the default codec ('ascii'), just as if
    > you were to have done unicode(sitetext).encode('ascii',
    > 'replace').
    >
    > I think you want something like this:
    >
    > sitetext = sitetext.decode('ascii', 'replace').encode('ascii', 'replace')


    This is the cue for the translate method, which will be much
    faster and simpler for cases like this. You can build the
    translation table yourself, or use maketrans.

    >>> import string
    >>> asciitable = string.maketrans(''.join(chr(a) for a in xrange(128, 256)),
    ...                               '?'*128)


    You'd only want to do that once. Then, to replace the non-ASCII bytes:

    sitetext = sitetext.translate(asciitable)

    I used a similar solution in an application I'm working on that
    must use a Latin-1 byte encoding internally, but displays on
    stdout in ASCII.
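
    For anyone on Python 3, a comparable sketch of the translate approach
    uses bytes.maketrans instead of string.maketrans. The names NON_ASCII,
    ASCII_TABLE and force_ascii_translate, and the sample bytes, are
    assumptions made for illustration.

```python
# Build the 256-entry table once: bytes 128..255 map to '?',
# every other byte maps to itself.
NON_ASCII = bytes(range(128, 256))
ASCII_TABLE = bytes.maketrans(NON_ASCII, b"?" * len(NON_ASCII))


def force_ascii_translate(raw):
    """Replace every byte above 127 with b'?' in a single pass."""
    return raw.translate(ASCII_TABLE)


sample = b"caf\xe9 \x92quoted\x92"  # hypothetical Latin-1-ish input
cleaned = force_ascii_translate(sample)
print(cleaned)
```

    This works byte for byte and never decodes anything, so it only applies
    to byte strings, not to text that is already unicode.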

    --
    Neil Cerutti
     
    Neil Cerutti, Feb 7, 2007
    #3
