usage of <string>.encode('utf-8','xmlcharrefreplace')?

J Peyret

Well, as usual I am confused by unicode encoding errors.

I have a string with problematic characters in it which I'd like to
put into a postgresql table.
That results in a postgresql error so I am trying to fix things with
<string>.encode

s = 'he Company\xef\xbf\xbds ticker'
print s
he Company�s ticker

Trying for an encode:
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
10: ordinal not in range(128)

OK, that's pretty much as expected, I know this is not valid utf-8.
But I should be able to fix this with the errors parameter of the
encode method.
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
10: ordinal not in range(128)

Same exact error I got without the errors parameter.

Did I mistype the error handler name? Nope.
<built-in function xmlcharrefreplace_errors>

Same results with 'ignore' as an error handler.
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
10: ordinal not in range(128)

And with a bogus error handler:

print s.encode('utf-8','bogus')
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
10: ordinal not in range(128)

This all looks unusually complicated for Python.
Am I missing something incredibly obvious?
How does one use the errors parameter on strings' encode method?

Also, why are the exceptions above complaining about the 'ascii' codec
if I am asking for 'utf-8' conversion?

Version and environment below. Should I try to update my python from
somewhere?

./$ python
Python 2.5.1 (r251:54863, Oct 5 2007, 13:36:32)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2

Cheers
 
Carsten Haese

Well, as usual I am confused by unicode encoding errors.

I have a string with problematic characters in it which I'd like to
put into a postgresql table.
That results in a postgresql error so I am trying to fix things with
s = 'he Company\xef\xbf\xbds ticker'
print s
he Company�s ticker

Trying for an encode:
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
10: ordinal not in range(128)

OK, that's pretty much as expected, I know this is not valid utf-8.

Actually, the string *is* valid UTF-8, but you're confused about encoding and
decoding. Encoding is the process of turning a Unicode object into a byte
string. Decoding is the process of turning a byte string into a Unicode object.

You need to decode your byte string into a Unicode object, and then encode the
result to a byte string in a different encoding. For example:
'he Company�s ticker'
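A minimal sketch of that decode-then-encode chain, assuming the bytes really are UTF-8 and that the wanted result is plain ASCII with XML character references (the thread title hints at this, but the target encoding is my assumption). The b and u prefixes keep it runnable on Python 2.6+ and Python 3.3+; on Python 2.5 the b prefix would have to be dropped.

```python
# Byte string from the original post.
s = b'he Company\xef\xbf\xbds ticker'

# Step 1: decode the bytes into a Unicode object, naming the real encoding.
u = s.decode('utf-8')

# Step 2: encode the Unicode object into the target encoding.
# The error handler only matters here, at the encode step.
ascii_bytes = u.encode('ascii', 'xmlcharrefreplace')

print(repr(u))
print(repr(ascii_bytes))
```

The three bytes \xef\xbf\xbd decode to the single code point U+FFFD, which ASCII cannot represent, so the handler writes it as a numeric character reference.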

By the way, whether this is the correct fix for your PostgreSQL error is not
clear, since you kept that error message a secret for some reason. There could
be a better solution than transcoding the string in this way, but we won't
know until you show us the actual error you're trying to fix. At the moment,
it's like showing you the best way to inflate a tire with a hammer.

Hope this helps,
 
J Peyret

OK, txs a lot. I will have to think a bit more about what you said, what I
am doing and how encode/decode fits in.

You are right, I am confused about unicode. Guilty as charged.

I've seen the decode+encode chaining invoked in some of the examples,
but not the rationale for it.
Also doesn't help that I am not sure what encoding is used in the data
file that I'm using.

I didn't set out to "hide" the original error, just wanted to simplify
my error posting, after having researched enough to see that
encode/decode was part of the solution.
Adding the db aspect to the equation doesn't really help much and I
should have left it out entirely.

FWIW:

<class 'psycopg2.ProgrammingError'>
invalid byte sequence for encoding "UTF8": 0x92
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".

column is a varchar(2000) and the "guilty characters" are those used
in my posting.

Txs again.
 
7stud

On Mon, 18 Feb 2008 21:36:17 -0800 (PST), J Peyret wrote


Well, as usual I am confused by unicode encoding errors.
I have a string with problematic characters in it which I'd like to
put into a postgresql table.
That results in a postgresql error so I am trying to fix things with
<string>.encode
s = 'he Company\xef\xbf\xbds ticker'
print s
he Company�s ticker
Trying for an encode:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
10: ordinal not in range(128)
OK, that's pretty much as expected, I know this is not valid utf-8.

Actually, the string *is* valid UTF-8, but you're confused about encoding and
decoding. Encoding is the process of turning a Unicode object into a byte
string. Decoding is the process of turning a byte string into a Unicode object.

...or to put it more simply: encode() is used to convert a unicode
string into a regular string. A unicode string looks like this:

s = u'\u0041'

but your string looks like this:

s = 'he Company\xef\xbf\xbds ticker'

Note that there is no 'u' in front of your string. Therefore, you
can't call encode() on that string.

Also, why are the exceptions above complaining about the 'ascii'
codec if I am asking for 'utf-8' conversion?

If a python function requires a unicode string and a unicode string
isn't provided, then python will implicitly try to convert the string
it was given into a unicode string. In order to convert a given
string into a unicode string, python needs to know the secret code
that was used to produce the given string. The secret code is
otherwise known as a 'codec'. When python attempts an implicit
conversion of a given string into a unicode string, python uses the
default codec, which is normally set to 'ascii'. Since your string
contains non-ascii characters, you get an error. That all happens
long before your 'utf-8' argument ever comes into play.

decode() is used to convert a regular string into a unicode string
(the opposite of encode()). Your error is a 'decode' error (rather
than an 'encode' error):

UnicodeDecodeError

because python is implicitly trying to convert the given regular
string into a unicode string with the default ascii codec, and python
is unable to do that.
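The implicit step can be spelled out by hand. This is only a sketch: py2_style_encode is a hypothetical helper written for illustration, not a Python API, and it imitates what Python 2 does when encode() is called on a regular string. It runs on Python 2.6+ and Python 3.

```python
s = b'he Company\xef\xbf\xbds ticker'

def py2_style_encode(byte_string, encoding, errors='strict'):
    # Hypothetical helper: first the implicit decode with the default
    # ascii codec, then the encode that was actually requested.
    # The errors argument is only passed to the second step.
    return byte_string.decode('ascii').encode(encoding, errors)

messages = []
for handler in ('strict', 'xmlcharrefreplace', 'ignore', 'bogus'):
    try:
        py2_style_encode(s, 'utf-8', handler)
    except UnicodeDecodeError as exc:
        # The ascii decode fails before the handler is ever consulted.
        messages.append(str(exc))

for message in messages:
    print(message)
```

Every handler, even the bogus one, produces the same message, because the failure happens in the decode step and the handler name is never looked at.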
 
7stud

To clarify a couple of points:

 A unicode string looks like this:

s = u'\u0041'

but your string looks like this:

s = 'he Company\xef\xbf\xbds ticker'

Note that there is no 'u' in front of your string.  

That means your string is a regular string.

If a python function requires a unicode string and a unicode string
isn't provided..

For example: encode().


One last point: you can't display a unicode string. The very act of
trying to print a unicode string causes it to be converted to a
regular string. If you try to display a unicode string without
explicitly encode()'ing it first, i.e. converting it to a regular
string using a specified secret code--a so called 'codec', python will
implicitly attempt to convert the unicode string to a regular string
using the default codec, which is usually set to ascii.
 
J Peyret

One last point: you can't display a unicode string. The very act of
trying to print a unicode string causes it to be converted to a
regular string. If you try to display a unicode string without
explicitly encode()'ing it first, i.e. converting it to a regular
string using a specified secret code--a so called 'codec', python will
implicitly attempt to convert the unicode string to a regular string
using the default codec, which is usually set to ascii.

Yes, the string above was obtained by printing, which got it into
ASCII format, as you picked up.
Something else to watch out for when posting unicode issues.

The solution I ended up with was

1) Find out the encoding in the data file.

In Ubuntu's gedit editor, menu 'Save As...' displays the encoding at
the bottom of the save prompt dialog.

ISO-8859-15 in my case.

2) Look up encoding corresponding to ISO-8859-15 at

http://docs.python.org/lib/standard-encodings.html

3) Applying the decode/encode recipe suggested previously, for which I
do understand the reason now.

#converting rawdescr
#from ISO-8859-15 (from the file)
#to UTF-8 (what postgresql wants)
#no error handler required.
decodeddescr = rawdescr.decode('iso8859_15').encode('utf-8')

postgresql insert is done using decodeddescr variable.

Postgresql is happy, I'm happy.
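For anyone repeating this recipe, here is a small sketch of the same transcoding on made-up sample bytes (rawdescr below is hypothetical, not taken from the data file). It also shows, as an aside, how the byte 0x92 from the psycopg2 error decodes under two different codecs; whether the file is really cp1252 rather than ISO-8859-15 is a guess that would need checking against the data.

```python
# Hypothetical sample: 0xe9 is e-acute and 0xa4 is the euro sign
# in ISO-8859-15.
rawdescr = b'caf\xe9 costs 5 \xa4'

# Same recipe as above: decode from the file's encoding, encode to UTF-8.
decodeddescr = rawdescr.decode('iso8859_15').encode('utf-8')
print(repr(decodeddescr))

# The byte 0x92 decodes without error under both codecs,
# but to different characters.
sample = b'Company\x92s'
as_latin9 = sample.decode('iso8859_15')  # U+0092, a control character
as_cp1252 = sample.decode('cp1252')      # U+2019, a curly apostrophe
print(repr(as_latin9))
print(repr(as_cp1252))
```

Neither decode raises an error, so a quick look at the decoded text is the only way to tell which codec matches the file.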
 
7stud

Yes, the string above was obtained by printing, which got it into
ASCII format, as you picked up.
Something else to watch out for when posting unicode issues.

The solution I ended up with was

1) Find out the encoding in the data file.

In Ubuntu's gedit editor, menu 'Save As...' displays the encoding at
the bottom of the save prompt dialog.

ISO-8859-15 in my case.

2) Look up encoding corresponding to ISO-8859-15 at

http://docs.python.org/lib/standard-encodings.html

3) Applying the decode/encode recipe suggested previously, for which I
do understand the reason now.

#converting rawdescr
#from ISO-8859-15 (from the file)
#to UTF-8 (what postgresql wants)
#no error handler required.
decodeddescr = rawdescr.decode('iso8859_15').encode('utf-8')

postgresql insert is done using decodeddescr variable.

Postgresql is happy, I'm happy.

Or, you can cheat. If you are reading from a file, you can set
it up so any string that you read from the file automatically gets
converted from its encoding to another encoding. You don't even have
to be aware of the fact that a regular string has to be converted into
a unicode string before it can be converted to a regular string with a
different encoding. Check out the codecs module and the EncodedFile()
function:

import codecs

s = 'he Company\xef\xbf\xbds ticker'

f = open('data2.txt', 'w')
f.write(s)
f.close()

f = open('data2.txt')
# Arguments: file, new encoding, file's encoding
f_special = codecs.EncodedFile(f, 'utf-8', 'iso8859_15')
# If your display device understands utf-8, you will see the
# troublesome character displayed.
# Are you sure that character is legitimate?
print f_special.read()

f.close()
f_special.close()
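The same idea can be tried without touching the disk. This is a sketch under assumptions: the sample bytes are made up, and io.BytesIO stands in for a file opened in binary mode. It should run on Python 2.6+ and Python 3.

```python
import codecs
import io

# Made-up ISO-8859-15 bytes standing in for the contents of a data file.
raw = io.BytesIO(b'caf\xe9')

# Arguments: file object, encoding wanted when reading, file's own encoding.
recoded = codecs.EncodedFile(raw, 'utf-8', 'iso8859_15')

# read() hands back bytes already transcoded to UTF-8.
result = recoded.read()
print(repr(result))
```

The decode and encode steps still happen; the wrapper just performs them on every read.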
 
Carsten Haese

[...]
You are right, I am confused about unicode. Guilty as charged.

You should read http://www.amk.ca/python/howto/unicode to clear up some of
your confusion.
[...]
Also doesn't help that I am not sure what encoding is used in the
data file that I'm using.

That is, incidentally, the direct cause of the error message below.
[...]
<class 'psycopg2.ProgrammingError'>
invalid byte sequence for encoding "UTF8": 0x92
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".

What this error message means is that you've given the database a byte string
in an unknown encoding, but you're pretending (by default, i.e. by not telling
the database otherwise) that the string is utf-8 encoded. The database is
encountering a byte that should never appear in a valid utf-8 encoded byte
string, so it's raising this error, because your string is meaningless as
utf-8 encoded text.

This is not surprising, since you don't know the encoding of the string. Well,
now we know it's not utf-8.

column is a varchar(2000) and the "guilty characters" are those used
in my posting.

I doubt that. The error message is complaining about a byte with the value
0x92. That byte appeared nowhere in the string you posted, so the error
message must have been caused by a different string.

Now for the solution of your problem: If you don't care what the encoding of
your byte string is and you simply want to treat it as binary data, you should
use client_encoding "latin-1" or "iso8859_1" (they're different names for the
same thing). Since latin-1 simply maps the bytes 0 to 255 to unicode code
points 0 to 255, you can store any byte string in the database, and get the
same byte string back from the database. (The same is not true for utf-8 since
not every random string of bytes is a valid utf-8 encoded string.)
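The latin-1 property is easy to check with Python's own codecs. This sketch only exercises the codecs and says nothing about how the database driver behaves. It runs on Python 2.6+ and Python 3.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(bytearray(range(256)))

# latin-1 decodes any byte string, and encoding gives the same bytes back.
as_unicode = all_bytes.decode('latin-1')
round_trip = as_unicode.encode('latin-1')

# Byte n becomes code point n.
code_points = [ord(ch) for ch in as_unicode]

# UTF-8 is stricter: a lone 0x92 byte is not a valid sequence.
try:
    b'Company\x92s'.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(round_trip == all_bytes)
print(utf8_ok)
```

The round trip preserves the bytes but not their meaning: if the data was not really latin-1, the characters stored will be the wrong ones even though the bytes come back intact.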

Hope this helps,
 
