Unicode (UTF8) in dbhash on 2.5

Discussion in 'Python' started by Yves Dorfsman, Oct 20, 2008.

  1. Yves Dorfsman, Oct 20, 2008
    #1

  2. Yves Dorfsman wrote:

    > Can you put UTF-8 characters in a dbhash in python 2.5 ?
    > It fails when I try:
    >
    > #!/bin/env python
    > # -*- coding: utf-8 -*-
    >
    > import dbhash
    >
    > db = dbhash.open('dbfile.db', 'w')
    > db[u'smiley'] = u'☺'
    > db.close()
    >
    > Do I need to change the bsd db library, or there is no way to make it work
    > with python 2.5 ?


    Please write the following program and meditate at least 30min in front of
    it:

    while True:
        print "utf-8 is not unicode"

    Once this seemingly minor detail has sunk in, you are ready to work with
    the variant below, which will work:

    #!/bin/env python
    # -*- coding: utf-8 -*-
    import dbhash
    db = dbhash.open('dbfile.db', 'w')
    db[u'smiley'.encode('utf-8')] = u'☺'.encode('utf-8')
    db.close()


    What is the difference? The dbhash module can only work with *bytestrings*.
    Bytestrings are just that - a sequence of 8-bit-values.

    u""-literals are *unicode objects*. These are an abstract sequence of
    characters, smileys or others.

    Now the real world of databases, network-connections and harddrives doesn't
    know about unicode. They only know bytes. So before you can write to them,
    you need to "encode" the unicode data to a byte-stream-representation.
    There are quite a few of these, e.g. latin1, or the aforementioned UTF-8,
    which has the property that it can render *all* unicode characters,
    potentially needing more than one byte per character.

    Which is why the code above has those encode-calls on the unicode-objects.
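
    To make the distinction concrete, here is a minimal sketch written for
    Python 3 (an assumption; the thread itself targets Python 2.5), where str
    is the unicode type and bytes is the bytestring type. It only illustrates
    the point above that UTF-8 can represent the smiley, using several bytes,
    while latin1 cannot represent it at all.

```python
# Python 3 sketch: str holds abstract characters, bytes holds 8-bit values.
smiley = "\u263a"  # WHITE SMILING FACE, the character from the question

# Encoding turns the one abstract character into a concrete byte sequence.
utf8_bytes = smiley.encode("utf-8")
print(len(smiley), len(utf8_bytes))  # one character, three bytes

# latin-1 only covers code points 0-255, so the smiley cannot be encoded.
try:
    smiley.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```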

    But beware! Once you have encoded the data, there is no way to *know* its
    encoding. So when reading the data, you will get *bytestrings*. So you need
    to "decode" them, with the proper encoding. In this case, again utf-8.

    Which brings us to the second part of the program:

    db = dbhash.open('dbfile.db')
    smiley = db[u'smiley'.encode('utf-8')].decode('utf-8')

    print smiley.encode('utf-8')


    The last encode is there to print out the smiley on a terminal - one of
    those pesky bytestream-eaters that don't know about unicode.
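
    The same write/read round trip can be sketched for Python 3. This is an
    illustration under assumptions, not the code from the thread: dbhash no
    longer exists there, so dbm.dumb stands in for it simply because it is
    always available, and the file location is a throwaway temporary
    directory.

```python
import dbm.dumb
import os
import tempfile

# dbm.dumb stands in for the thread's dbhash module (an assumption made
# so the sketch runs anywhere); the temporary path is purely illustrative.
path = os.path.join(tempfile.mkdtemp(), "dbfile")

# Writing: encode both key and value to bytes before handing them over.
db = dbm.dumb.open(path, "c")
db["smiley".encode("utf-8")] = "\u263a".encode("utf-8")
db.close()

# Reading: the database hands back bytes, which we decode ourselves.
db = dbm.dumb.open(path, "r")
raw = db["smiley".encode("utf-8")]
db.close()

smiley = raw.decode("utf-8")
```

    The database never records which encoding was used, so the reader has to
    decode with the same one the writer chose.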

    Diez
     
    Diez B. Roggisch, Oct 20, 2008
    #2

  3. Diez B. Roggisch wrote:

    > Please write the following program and meditate at least 30min in front of
    > it:


    > while True:
    >     print "utf-8 is not unicode"


    I hope you will have a better day today than yesterday !
    Now, I did this:

    while True:
        print "¡ Python knows about encoding, but only sometimes !"

    My terminal is set up in UTF-8, and... It did print correctly. I expected
    that by setting coding: utf-8, all the I/O functions would do the encoding
    for me, because if they don't then I, and everybody who writes a script, will
    need to subclass every single I/O class (ok, except for print !).


    > Bytestrings are just that - a sequence of 8-bit-values.


    It used to be that int were 8 bits, we did not stay stuck in time and int are
    now typically longer. I expect a high level language to let me set the
    encoding once, and do simple I/O operation... without having encode/decode.

    > Now the real world of databases, network-connections and harddrives doesn't
    > know about unicode. They only know bytes. So before you can write to them,
    > you need to "encode" the unicode data to a byte-stream-representation.
    > There are quite a few of these, e.g. latin1, or the aforementioned UTF-8,
    > which has the property that it can render *all* unicode characters,
    > potentially needing more than one byte per character.


    Sure if I write assembly, I'll make sure I get my bits, bytes, multi-bytes
    chars right, but that's why I use a high level language. Here's an example
    of an implementation that let you write Unicode directly to a dbhash, I
    hoped there would be something similar in python:
    http://www.oracle.com/technology/documentation/berkeley-db/db/gsg/JAVA/DBEntry.html

    > db = dbhash.open('dbfile.db')
    > smiley = db[u'smiley'.encode('utf-8')].decode('utf-8')


    > print smiley.encode('utf-8')



    > The last encode is there to print out the smiley on a terminal - one of
    > those pesky bytestream-eaters that don't know about unicode.


    What are you talking about ?
    I just copied this right from my terminal (LANG=en_CA.utf8):

    >>> print unichr(0x020ac)

    €
    >>>


    Now, I have read that python 2.6 has better support for Unicode. Does it allow
    writing to files, bsddb etc... without having to encode/decode every time ?
    This is a big enough issue for me right now that I will manually install 2.6
    if it does.

    Thanks.

    --
    Yves.
    http://www.sollers.ca/blog/2008/no_sound_PulseAudio
    http://www.sollers.ca/blog/2008/PulseAudio_pas_de_son/.fr
     
    Yves Dorfsman, Oct 21, 2008
    #3
  4. On 20 Oct, 16:04, "Diez B. Roggisch" wrote:
    >
    > What is the difference? The dbhash module can only work with *bytestrings*.
    > Bytestrings are just that - a sequence of 8-bit-values.


    Sounds like a prime candidate for some improvement work. Patches,
    anyone? ;-)

    > u""-literals are *unicode objects*. These are an abstract sequence of
    > characters, smileys or others.


    It's important to point this out, though. However...

    > Now the real world of databases, network-connections and harddrives doesn't
    > know about unicode. They only know bytes. So before you can write to them,
    > you need to "encode" the unicode data to a byte-stream-representation.


    Although this is true, what the inquirer probably expected was the
    interfaces to these things handling such details. In the case of
    filesystems, this can be awkward on, say, Linux or UNIX for various
    historical reasons. With regard to database systems, some messy
    configuration may need to be done for each database, but it would be
    nice to see the interface modules doing a bit more of the work.

    [...]

    > print smiley.encode('utf-8')
    >
    > The last encode is there to print out the smiley on a terminal - one of
    > those pesky bytestream-eaters that don't know about unicode.


    With respect to output encodings, you don't need to perform an encode
    operation if the locale is compatible, as discussed recently in
    another thread. Encoding manually to UTF-8 may avoid errors, but it
    doesn't guarantee that the output will make any sense.
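
    A small Python 3 sketch of that last point (Python 3 and the choice of
    latin-1 as the consumer's encoding are assumptions made for the example):
    bytes encoded as UTF-8 are accepted without any error by a consumer that
    assumes a different encoding, but the text that comes out is wrong.

```python
euro = "\u20ac"  # EURO SIGN, the character printed earlier in the thread

# The producer encodes manually to UTF-8, which never raises for valid text.
utf8_bytes = euro.encode("utf-8")

# A consumer that assumes latin-1 decodes the same bytes without an error,
# but ends up with three unrelated characters instead of one euro sign.
garbled = utf8_bytes.decode("latin-1")

print(len(euro), len(garbled))
print(garbled == euro)
```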

    Paul
     
    Paul Boddie, Oct 21, 2008
    #4
  5. On Tue, Oct 21, 2008 at 10:16 AM, Yves Dorfsman wrote:
    > My terminal is setup in UTF-8, and... It did print correctly. I expected
    > that by setting coding: utf-8, all the I/O functions would do the encoding
    > for me, because if they don't then I, and everybody who writes a script, will
    > need to subclass every single I/O class (ok, except for print !).


    No, you don't. You just need to use the tools provided for you in the
    standard library, like this:

    import codecs
    in_file = codecs.open('my_utf8_file.txt', 'r', 'utf8')

    Now your file full of utf8 encoded bytes will be automatically
    transformed into unicode strings as you read them in. You can do the
    same thing on the output side (obviously, using mode 'w' instead of
    'r').

    If you need to wrap things other than files, the codecs module has the
    tools to do that too.
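
    As a rough illustration of wrapping something other than a file, the
    sketch below uses codecs.getwriter and codecs.getreader around an
    in-memory byte stream. It is written for Python 3, and the BytesIO object
    is only a stand-in for whatever byte-oriented stream you actually have.

```python
import codecs
import io

raw = io.BytesIO()  # stand-in for any byte-oriented stream

# The writer encodes text to UTF-8 bytes on the way out.
writer = codecs.getwriter("utf-8")(raw)
writer.write("\u263a smiley\n")

# The reader decodes UTF-8 bytes back to text on the way in.
raw.seek(0)
reader = codecs.getreader("utf-8")(raw)
line = reader.readline()

print(raw.getvalue())
```

    The calling code only ever handles text; the encode and decode steps
    still happen, but in one place.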

    --
    Jerry
     
    Jerry Hill, Oct 21, 2008
    #5
  6. Yves Dorfsman wrote:

    > Diez B. Roggisch wrote:
    >
    >> Please write the following program and meditate at least 30min in front
    >> of it:
    >>
    >> while True:
    >>     print "utf-8 is not unicode"
    >
    > I hope you will have a better day today than yesterday !


    I had a good day yesterday. And today. Thanks for asking.

    Part of feeling good stemmed from the fact that I didn't "try to put
    UTF-8-characters into a Berkeley DB" and claim it fails, when what I
    *really* tried was putting unicode-strings into it. Unicode and UTF-8 are
    two different things, like it or not.

    > Now, I did this:
    >
    > while True:
    >     print "¡ Python knows about encoding, but only sometimes !"
    >
    > My terminal is setup in UTF-8, and... It did print correctly. I expected
    > that by setting coding: utf-8, all the I/O functions would do the encoding
    > for me, because if they don't then I, and everybody who writes a script,
    > will need to subclass every single I/O class (ok, except for print !).


    You seriously want all IO to be encoded depending on your terminal setting?
    What about the database that works in latin1? The CSV file you write to
    your vendor, expecting cp1251? And what happens if your process is not
    *started* from a terminal? Or a different user starts the script, and all
    of a sudden the exported data is messed up?

    >
    >> Bytestrings are just that - a sequence of 8-bit-values.

    >
    > It used to be that int were 8 bits, we did not stay stuck in time and int
    > are now typically longer. I expect a high level language to let me set the
    > encoding once, and do simple I/O operation... without having
    > encode/decode.


    Sorry to say so, but you must face the sad truth: IO ops *need* explicit
    encoding applied to them, otherwise errors will occur. Ask the Java guys
    why they needed to grow encoding parameters on all their toBytes/fromBytes
    functions in the IO layer.

    There is nothing that can be done about this. Which is not to say that
    Python couldn't be enhanced at some places wrt unicode-handling, see below.

    > Sure if I write assembly, I'll make sure I get my bits, bytes, multi-bytes
    > chars right, but that's why I use a high level language. Here's an example
    > of an implementation that let you write Unicode directly to a dbhash, I
    > hoped there would be something similar in python:
    > http://www.oracle.com/technology/documentation/berkeley-db/db/gsg/JAVA/DBEntry.html

    The inner workings of the DB are still only byte-aware. I agree that you
    could enhance the Berkeley DB interface in python so that it takes a
    default-encoding parameter, then transcodes all values from and to it.

    OTOH you can help yourself writing a simple wrapper that does that for you,
    untested:

    class UnicodeWrapper(object):

        def __init__(self, bdb, encoding="utf-8"):
            self.bdb = bdb
            self.encoding = encoding

        def __setitem__(self, key, value):
            # Encode unicode keys and values; bytestrings pass through as-is.
            if isinstance(key, unicode):
                key = key.encode(self.encoding)
            if isinstance(value, unicode):
                value = value.encode(self.encoding)
            self.bdb[key] = value

        def __getitem__(self, key):
            if isinstance(key, unicode):
                key = key.encode(self.encoding)
            # Returns the stored bytestring; decode it if you need unicode.
            return self.bdb[key]


    >
    >> db = dbhash.open('dbfile.db')
    >> smiley = db[u'smiley'.encode('utf-8')].decode('utf-8')

    >
    >> print smiley.encode('utf-8')

    >
    >
    >> The last encode is there to print out the smiley on a terminal - one of
    >> those pesky bytestream-eaters that don't know about unicode.

    >
    > What are you talking about ?
    > I just copied this right from my terminal (LANG=en_CA.utf8):
    >
    >>>> print unichr(0x020ac)

    > €
    >>>>


    You are right, that works of course - when running inside a terminal. It
    will fail though if the encoding can't be guessed, e.g. because the process
    is not spawned from a terminal.

    Nothing to do with the terminal though.

    Diez
     
    Diez B. Roggisch, Oct 21, 2008
    #6
  7. Paul Boddie wrote:
    > On 20 Oct, 16:04, "Diez B. Roggisch" wrote:
    >> What is the difference? The dbhash module can only work with *bytestrings*.
    >> Bytestrings are just that - a sequence of 8-bit-values.

    >
    > Sounds like a prime candidate for some improvement work. Patches,
    > anyone? ;-)


    It's not possible to "fix" this - it isn't even broken. The *db modules,
    by design, support storing of arbitrary bytes, not just character data.
    You can put images into them, or sound files, java byte code files, etc.
    So if Python would assume they have to be UTF-8 encoded character
    strings, it would severely limit the usability of these modules.

    For keys, things are slightly different from values - there is a higher
    chance that the keys are indeed intended to be character strings.
    However, in the bsddb btree format, any byte sequence that has a good
    lexical order can be used as a key, and people do use the interface
    that way (e.g. by putting an md-5 hash as the key, and the original data
    as the value).
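
    For illustration only, a Python 3 sketch of that usage pattern. The plain
    dict is a stand-in for a bytes-only dbm-style mapping and the payload is
    made up; the point is that neither the key nor the value is text, so
    there is nothing to decode.

```python
import hashlib

store = {}  # stand-in for a bytes-only dbm-style mapping

# Arbitrary binary payload (made up for the example), not character data.
data = b"\x89PNG\r\n\x1a\n" + bytes(range(16))

# The key is the raw 16-byte MD5 digest, again just bytes.
key = hashlib.md5(data).digest()
store[key] = data

print(len(key), store[key] == data)
```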

    It would be possible to put a layer on top of them which assumes that
    either keys, values, or both are characters, and that they are further
    UTF-8 encoded. However, such a package doesn't need to be part of the
    standard library.
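
    One possible shape for such a layer, as a sketch under assumptions: it is
    written for Python 3, the class name TextLayer and its text_keys and
    text_values parameters are hypothetical, and a plain dict stands in for
    the underlying bytes-only database.

```python
class TextLayer:
    """Hypothetical wrapper: text in, bytes stored, text out."""

    def __init__(self, db, encoding="utf-8", text_keys=True, text_values=True):
        self.db = db
        self.encoding = encoding
        self.text_keys = text_keys
        self.text_values = text_values

    def _key(self, key):
        # Encode text keys; byte keys (e.g. hash digests) pass through.
        if self.text_keys and isinstance(key, str):
            return key.encode(self.encoding)
        return key

    def __setitem__(self, key, value):
        if self.text_values and isinstance(value, str):
            value = value.encode(self.encoding)
        self.db[self._key(key)] = value

    def __getitem__(self, key):
        value = self.db[self._key(key)]
        # Decode only when values are declared to be text.
        if self.text_values:
            return value.decode(self.encoding)
        return value


backend = {}  # stand-in for a bytes-only dbm-style mapping
db = TextLayer(backend)
db["smiley"] = "\u263a"
stored = backend[b"smiley"]  # what the underlying store actually holds
```

    Setting text_values=False keeps values as raw bytes, which preserves the
    images-and-sound-files use case while still allowing readable keys.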

    Regards,
    Martin
     
    Martin v. Löwis, Oct 21, 2008
    #7
  8. On Oct 21, 2008, at 2:39 PM, Martin v. Löwis wrote:

    > It's not possible to "fix" this - it isn't even broken. The *db
    > modules,
    > by design, support storing of arbitrary bytes, not just character
    > data.


    Many database engines are encoding-aware, and distinguish between
    'text' columns and 'blob' columns -- the latter are arbitrary bags of
    bytes, but text columns store text, and a good database (with a
    sensibly designed database) will be aware of this and handle encoding
    and decoding of text responsibly.

    I can tell you that in REALbasic, if your database is properly
    configured to use UTF-8 encoding, the rest is all handled seamlessly
    -- you just store and retrieve text, and don't have to worry about
    encoding and decoding things all over the place.

    So the OP's request is quite valid. Python's handling of encodings is
    currently primitive compared to some other environments, and I see
    that this extends to the database modules. Fine, fair enough, it is
    what it is, but there is no harm in asking about (or even yearning
    for) a more intelligent system that does more of the grunt work for us.

    Best,
    - Joe
     
    Joe Strout, Oct 22, 2008
    #8
  9. >> Many database engines are encoding-aware, and distinguish between
    >> 'text' columns and 'blob' columns -- the latter are arbitrary bags
    >> of bytes, but text columns store text, and a good database (with a
    >> sensibly designed database) will be aware of this and handle
    >> encoding and decoding of text responsibly.


    Ok, by this definition, the dbm interface of Unix is not a good
    database. Tough luck.

    >> I can tell you that in REALbasic, if your database is properly
    >> configured to use UTF-8 encoding, the rest is all handled
    >> seamlessly -- you just store and retrieve text, and don't have to
    >> worry about encoding and decoding things all over the place.


    In Python, the database system is independent of the programming
    language. Python can deal with

    >> So the OP's request is quite valid.


    Which of the questions specifically?

    Q: Can you put UTF-8 characters in a dbhash in python 2.5 ?
    A: Sure, certainly.

    Q: Do I need to change the bsd db library,
    or there is no way to make it work with python 2.5 ?
    A: You don't need to change the bsd db library; it works out
    of the box.

    Q: What about python 2.6 ?
    A: It's the same.

    He got essentially the answers to the questions he asked.

    >> Python's handling of encodings is currently primitive compared to
    >> some other environments, and I see that this extends to the
    >> database modules.


    That's *not* a question that he had asked. He asked about UTF-8, but
    perhaps meant to ask about Unicode (in particular as his example did
    not demonstrate any problems with UTF-8 encoded strings).

    >> Fine, fair enough, it is what it is, but there is no harm in asking
    >> about (or even yearning for) a more intelligent system that does
    >> more of the grunt work for us.


    It *is* important to understand the difference between a "UTF-8
    string" and a "Unicode string". If the OP hadn't been confused
    about the two, and had fully understood the difference, he probably
    wouldn't have needed to ask.

    Regards,
    Martin
     
    Martin v. Löwis, Oct 22, 2008
    #9
  10. On 21 Oct, 22:39, "Martin v. Löwis" wrote:
    >
    > It's not possible to "fix" this - it isn't even broken. The *db modules,
    > by design, support storing of arbitrary bytes, not just character data.
    > You can put images into them, or sound files, java byte code files, etc.
    > So if Python would assume they have to be UTF-8 encoded character
    > strings, it would severely limit the usability of these modules.


    If the inquirer was aware of the Unicode/UTF-8 distinction, then he
    apparently wanted a conversion from Unicode to UTF-8 for the purpose
    of storing text in the database. I don't really see a problem with a
    module like this handling Unicode values in a reasonable fashion
    whilst letting the user supply plain/byte strings if they also want to
    do so, except perhaps for the issue of whether retrieved values should
    be Unicode or something else, how the user gets to override the
    default behaviour, and how this fits in with the existing API. Various
    DB-API modules support Unicode, so this isn't a completely new
    phenomenon, and a connection parameter for alternative encodings would
    be adequate if people wanted to use something other than UTF-8 to
    represent textual values within the database.

    Paul
     
    Paul Boddie, Oct 22, 2008
    #10