SQLAlchemy and Unicode strings: error message

Wolfgang Meiners

Hi,

I am trying to build an application using SQLAlchemy.

In principle, I have this structure:

#==============================================

from sqlalchemy import *
from sqlalchemy.orm import *

metadata = MetaData('sqlite://')
a_table = Table('tf_lehrer', metadata,
    Column('id', Integer, primary_key=True),
    Column('Kuerzel', Text),
    Column('Name', Text))

class A_class(object):
    def __init__(self, Kuerzel, Name):
        self.Kuerzel = Kuerzel
        self.Name = Name

mapper(A_class, a_table)

A_record = A_class('BUM', 'Bäumer')

Session = sessionmaker()
session = Session()

session.add(A_record)

session.flush()

#================================================

It runs up to the line

session.flush()

where I get the following error message:

sqlalchemy.exc.ProgrammingError: (ProgrammingError) You must not use
8-bit bytestrings unless you use a text_factory that can interpret 8-bit
bytestrings (like text_factory = str). It is highly recommended that you
instead just switch your application to Unicode strings. u'INSERT INTO
tf_lehrer ("Kuerzel", "Name") VALUES (?, ?)' ('BUM', 'B\xc3\xa4umer')

But where can I switch my application to Unicode strings?

Thank you for any hints
Wolfgang
 
Daniel Kluev

metadata = MetaData('sqlite://')
a_table = Table('tf_lehrer', metadata,
   Column('id', Integer, primary_key=True),
   Column('Kuerzel', Text),
   Column('Name', Text))

Use UnicodeText instead of Text.

A_record = A_class('BUM', 'Bäumer')

If this is Python 2.x, use u'Bäumer' instead.
 
Wolfgang Meiners

On 31.05.11 13:32, Daniel Kluev wrote:
Use UnicodeText instead of Text.


If this is python2.x, use u'Bäumer' instead.

Thank you, Daniel.
So I came a little bit closer to the solution. Actually, I don't write the
strings in a Python program; I read them from a file, which is
UTF-8 encoded.

So I changed the lines

for line in open(file,'r'):
    line = line.strip()

first to

for line in open(file,'r'):
    line = unicode(line.strip())

and finally to

for line in open(file,'r'):
    line = unicode(line.strip(),'utf8')

and now I really get UTF-8 strings. It works, but I don't know why it
works. To me it looks like I am changing a UTF-8 string into a UTF-8 string.
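The difference between the second and third variant can be seen on the raw bytes alone. The following is only a sketch, written so that it should run on both Python 2.6+ and Python 3; the variable names are made up for illustration. In Python 2, unicode(x) without an encoding argument decodes with the default encoding, which behaves like the explicit 'ascii' decode shown here whenever that default is ASCII.

```python
# The bytes a UTF-8 encoded file holds for the name from the first post
# (the same bytes that show up in the error message).
raw = b'B\xc3\xa4umer'

# Decoding with the codec the file was written in gives real text:
# 7 bytes become 6 characters.
text = raw.decode('utf-8')

# Decoding as ASCII fails, because 0xc3 is not a valid ASCII byte.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(len(raw), len(text), ascii_ok)
```

So the third variant does not turn a UTF-8 string into a UTF-8 string; it turns UTF-8 encoded bytes into characters.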

By the way: when I run a Python program from Eclipse, then

print sys.getdefaultencoding()

returns utf-8

and when I run the same Python program from the command line, then

print sys.getdefaultencoding()

returns ascii

but my locale is set to
$ locale
LANG="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_CTYPE="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_ALL="de_DE.UTF-8"

I think UTF-8 is somewhat confusing in Python, at least to me.

Wolfgang
 
Wolfgang Meiners

On 31.05.11 11:55, Chris Withers wrote:
Hi Wolfgang,



You're likely to get much better help here:

http://www.sqlalchemy.org/support.html#mailinglist

When you post there, make sure you include:

- what python version you're using
- what sqlalchemy version you're using

cheers,

Chris

Thank you for pointing me to this list. I will have a look at it. At the
moment I think I am really struggling with Python and UTF-8.

Wolfgang
 
Prasad, Ramit

line = unicode(line.strip(),'utf8')
and now i get really utf8-strings. It does work but i dont know why it works. For me it looks like i change an utf8-string to an utf8-string.


I would like to point out that UTF-8 is not exactly "Unicode". From what I understand, Unicode is a standard while UTF-8 is like an implementation of that standard (called an encoding). Being able to convert to Unicode (the standard) should mean you are then able to convert to any encoding that supports the Unicode characters used.

As you can see below, a string in UTF-8 is actually not Unicode (decode converts to Unicode, encode converts away from Unicode).

<type 'unicode'>


Ramit



 
Benjamin Kaplan

and now i get really utf8-strings. It does work but i dont know why it
works. For me it looks like i change an utf8-string to an utf8-string.

There's no such thing as a UTF-8 string. You have a list of bytes
(byte string) and you have a list of characters (unicode). UTF-8 is a
function that can convert bytes into characters (and the reverse). You
may recognize that the list of bytes was encoded using UTF-8 but the
computer does not unless you explicitly tell it to. Does that help
clear it up?
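A small sketch of that point, meant to run on Python 2.6+ and Python 3 alike (the variable names are only illustrative): the same bytes give different characters depending on which encoding you tell the computer to use, and encoding is simply the reverse direction.

```python
# Seven bytes; nothing in them says which encoding produced them.
data = b'M\xc3\xbcller'

# Interpreted as UTF-8, the two bytes c3 bc form one character (u with umlaut).
as_utf8 = data.decode('utf-8')

# Interpreted as Latin-1, every byte is its own character, so the
# result has seven characters and looks garbled.
as_latin1 = data.decode('latin-1')

# Encoding goes the other way: characters back to bytes.
round_trip = as_utf8.encode('utf-8')

print(len(data), len(as_utf8), len(as_latin1), round_trip == data)
```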
 
Chris Angelico

I would like to point out that UTF-8 is not exactly "Unicode". From what I understand, Unicode is a standard while UTF-8 is like an implementation of that standard (called an encoding). Being able to convert to Unicode (the standard) should mean you are then able to convert to any encoding that supports the Unicode characters used.

Unicode defines characters; UTF-8 is one way (of many) to represent
those characters in bytes. UTF-16 and UTF-32 are other ways of
representing those characters in bytes, and internally, Python
probably uses one of them - but there is no guarantee, and you should
never need to know. Unicode strings can be stored in memory and
manipulated in various ways, but they're a high level construct on par
with lists and dictionaries - they can't be stored on disk or
transmitted to another computer without using an encoding system.

UTF-8 is an efficient way to translate Unicode text consisting
primarily of low codepoint characters into bytes. It's not so much an
implementation of Unicode as a means of converting a mythical concept
of "Unicode characters" into a concrete stream of bytes.
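To make the "one way of many" point concrete, here is a short sketch (for Python 2.6+ or Python 3.3+; the names are illustrative only) that encodes the same six characters three ways. The little-endian codec variants are used so that no byte order mark is added to the output.

```python
# Six Unicode characters; \u00fc is the u with umlaut.
text = u'M\u00fcller'

# The same characters as bytes in three different encodings.
encoded = {
    'utf-8': text.encode('utf-8'),
    'utf-16-le': text.encode('utf-16-le'),
    'utf-32-le': text.encode('utf-32-le'),
}

# Byte counts differ: UTF-8 uses 1 or 2 bytes per character here,
# UTF-16 uses 2 and UTF-32 uses 4.
sizes = dict((name, len(blob)) for name, blob in encoded.items())

# Each byte sequence decodes back to the very same text.
decoded_back = [blob.decode(name) for name, blob in encoded.items()]

print(sizes)
```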

Hope that clarifies things a little!

Chris Angelico
 
Wolfgang Meiners

I think it helped me very much to understand the problem.

So if I deal with non-ASCII strings, I have a 'list of bytes' and need an
encoding to interpret this list and transform it into a meaningful Unicode
string. Decoding does the opposite.

Whenever I 'cross the border' of my program, I have to encode the 'list
of bytes' to a Unicode string or decode the Unicode string to a 'list
of bytes' which is meaningful to the world outside.

So 'encode early, decode late' means doing it as near to the border as
possible, and to encode/decode I need a coding system, for example 'utf8'.

That means there should be an encoding/decoding possibility for every
interface I can use: files, stdin, stdout, stderr, GUI (these should be the
most important ones).

While trying to understand this, I wrote the following program. Maybe
someone can give me a hint on how to print correctly:

######################################################
#! python
# -*- coding: utf-8 -*-

class EncTest:
    def __init__(self,Name=None):
        self.Name=unicode(Name, encoding='utf8')

    def __repr__(self):
        return u'My name is %s' % self.Name

if __name__ == '__main__':

    a = EncTest('Müller')

    # this does work
    print a.__repr__()

    # throws an error if default encoding is ascii
    # but works if default encoding is utf8
    print a

    # throws an error because a is not a string
    print unicode(a, encoding='utf8')
######################################################

Wolfgang
 
Chris Angelico

Whenever i 'cross the border' of my program, i have to encode the 'list
of bytes' to an unicode string or decode the unicode string to a 'list
of bytes' which is meaningful to the world outside.

Most people use "encode" and "decode" the other way around; you encode
a string as UTF-8, and decode UTF-8 into a Unicode string. But yes,
you're correct.

So encode early, decode lately means, to do it as near to the border as
possible and to encode/decode i need a coding system, for example 'utf8'

Correct on both counts.

That means, there should be an encoding/decoding possibility to every
interface i can use: files, stdin, stdout, stderr, gui (should be the
most important ones).

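In Python 3, the file objects returned by open() have an encoding; the
default depends on your platform's locale, so pass encoding='utf8'
explicitly if that is what you need. (In Python 2, open() gives you
byte strings, which you decode yourself.) GUI work depends on your GUI
toolkit, and might well accept Unicode strings directly - check the docs.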
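Whatever the default turns out to be on a given setup, the encoding can always be passed explicitly. A minimal sketch, assuming io.open (available in Python 2.6+ and the same function as open() in Python 3); the file name lehrer.txt and its content are made up for illustration:

```python
import io
import os
import shutil
import tempfile

# Hypothetical sample data: one record with a non-ASCII character.
text = u'BUM;B\u00e4umer'

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'lehrer.txt')  # illustrative file name

# Writing: the file object encodes the characters to UTF-8 bytes.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading in text mode: the file object decodes, so every line is
# already a Unicode string and needs no unicode(..., 'utf8') call.
with io.open(path, 'r', encoding='utf-8') as f:
    lines = [line.strip() for line in f]

# Reading in binary mode shows the bytes that are really on disk.
with io.open(path, 'rb') as f:
    raw = f.read()

shutil.rmtree(tmpdir)
print(len(lines[0]), len(raw))
```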
   def __repr__(self):
       return u'My name is %s' % self.Name

This means that repr() will return a Unicode string.

   # this does work
   print a.__repr__()

   # throws an error if default encoding is ascii
   # but works if default encoding is utf8
   print a

   # throws an error because a is not a string
   print unicode(a, encoding='utf8')

The __repr__ function is supposed to return a string object, in Python
2. See http://docs.python.org/reference/datamodel.html#object.__repr__
for that and other advice on writing __repr__. The problems you're
seeing are a result of the built-in repr() function calling
a.__repr__() and then treating the return value as an ASCII str, not a
Unicode string.

This would work:
   def __repr__(self):
       return (u'My name is %s' % self.Name).encode('utf8')

Alternatively, migrate to Python 3, where the default is Unicode
strings. I tested this in Python 3.2 on Windows, but it should work on
anything in the 3.x branch:

class NoEnc:
    def __init__(self,Name=None):
        self.Name=Name
    def __repr__(self):
        return 'My name is %s' % self.Name

if __name__ == '__main__':

    a = NoEnc('Müller')

    # this will still work (print is now a function, not a statement)
    print(a.__repr__())

    # this will work in Python 3.x
    print(a)

    # 'unicode' has been renamed to 'str', but it's already unicode so
    # this makes no sense (it raises a TypeError, since a is not bytes)
    print(str(a, encoding='utf8'))

    # to convert it to UTF-8, convert it to a string with str() or
    # repr() and then print:
    print(str(a).encode('utf8'))
############################

Note that the last one will probably not do what you expect. The
Python 3 'print' function (it's not a statement any more, so you need
parentheses around its argument) wants a Unicode string, so you don't
need to encode it. When you encode a Unicode string as in the last
example, it returns a bytes string (an array of bytes), which looks
like this: b'My name is M\xc3\xbcller'. The print function wants
Unicode, though, so it takes this unexpected object and calls str() on
it, hence the odd display.
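That odd display can be reproduced in isolation. A small sketch for Python 3 only (the variable names are just for illustration): calling str() on a bytes object gives its repr, including the b prefix and the escaped bytes, and that is the text print() ends up showing.

```python
text = 'My name is M\u00fcller'

# Encoding turns the Unicode string into a bytes object.
encoded = text.encode('utf8')

# print(encoded) displays str(encoded), which for bytes is the repr.
shown = str(encoded)

# Decoding gets the original text back, which is what print() wants.
restored = encoded.decode('utf8')

print(shown)
print(restored)
```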

Hope that helps!

Chris Angelico
 
Wolfgang Meiners

On 31.05.11 23:56, Chris Angelico wrote:
Most people use "encode" and "decode" the other way around; you encode
a string as UTF-8, and decode UTF-8 into a Unicode string. But yes,
you're correct.

OK. I think I will adapt to the majority on this point.
I think I mixed up
unicodestring=unicode(bytestring,encoding='utf8')
and
bytestring=u'unicodestring'.encode('utf8')

I think I should change this to decode early, encode late.
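As a rough sketch of what "decode early, encode late" can look like in practice (for Python 2.6+ or Python 3.3+): the function name process and the semicolon-separated record format are hypothetical, chosen only to have something to work on between the two borders.

```python
def process(raw_line):
    """Hypothetical helper: bytes in, bytes out, text in between."""
    # Border in: decode the incoming bytes right away.
    text = raw_line.decode('utf-8')

    # Inside the program: work with characters only.
    kuerzel, name = text.strip().split(u';')
    result = u'%s: %s' % (kuerzel, name.upper())

    # Border out: encode as late as possible.
    return result.encode('utf-8')

out = process(b'BUM;B\xc3\xa4umer\n')
print(repr(out))
```

Because upper() runs on characters rather than bytes, the umlaut is upper-cased as a single character; on the raw bytes that would not work.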
Yes, it helped a lot. One last question: when I have a free choice and
I don't know Python 2 or Python 3 very well, which would be the
recommended choice?

Wolfgang
 
Chris Angelico

Yes, it helped a lot. One last question: when I have a free choice and
I don't know Python 2 or Python 3 very well, which would be the
recommended choice?

Generally, Python 3. Unless there's something you really need in
Python 2 (a module that isn't available in 3.x, for instance, or
you're deploying to a site that doesn't have Python 3 installed), it's
worth going with the newer one.

Chris Angelico
 
