Comparing UTF-8 with UCS-2 and vice versa (newbie :-) )

Discussion in 'Python' started by Tzury, Jun 17, 2007.

  1. Tzury

    Tzury Guest

    I recently rewrote a .NET application in Python.
    The application basically receives streams over a TCP socket and
    performs operations against an existing database.
    The database is SQLite3 (encoded as UTF-8).
    The network streams are encoded as UCS-2.

    In UCS-2, 'A' = '0041', but when I check with the built-in
    functions I get unicode("A", "utf-8") = u'A' = u'\u0041'. I
    wonder what the difference is, and how I can safely encode/decode
    UCS-2 streams and match them with the UTF-8 representation.
     
    Tzury, Jun 17, 2007
    #1

  2. > I recently rewrote a .NET application in Python.
    > The application basically receives streams over a TCP socket and
    > performs operations against an existing database.
    > The database is SQLite3 (encoded as UTF-8).
    > The network streams are encoded as UCS-2.
    >
    > In UCS-2, 'A' = '0041', but when I check with the built-in
    > functions I get unicode("A", "utf-8") = u'A' = u'\u0041'. I
    > wonder what the difference is, and how I can safely encode/decode
    > UCS-2 streams and match them with the UTF-8 representation.


    In unicode("A", "utf-8"), the "utf-8" parameter describes the
    encoding of the *input* bytes, not of the output; the result is a
    Unicode object, which has no byte encoding of its own.
    So "A" in UTF-8 is '41', not '0041'. In UCS-2, the A consumes two
    bytes; in UTF-8, it consumes only one byte.

    For other characters, the byte sequences differ as well: for u'\xf6',
    the UCS-2 representation (big-endian) is '00F6', while the UTF-8 is
    'C3B6'. For u'\u20AC', the UCS-2 is '20AC', and the UTF-8 is 'E282AC'
    (i.e. three bytes).
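
The byte values quoted above can be checked directly. This is a minimal sketch, assuming Python 3 (where `str` is the Unicode type) rather than the Python 2 used in this thread; `hex_bytes` is a hypothetical helper name, and `utf-16-be` stands in for big-endian UCS-2.

```python
def hex_bytes(text, encoding):
    """Return the encoded form of text as an uppercase hex string."""
    return text.encode(encoding).hex().upper()

for ch in ("A", "\xf6", "\u20ac"):
    # utf-16-be yields the same bytes as big-endian UCS-2 for these characters
    print(repr(ch), hex_bytes(ch, "utf-16-be"), hex_bytes(ch, "utf-8"))
```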

    You should always use Unicode objects inside your program, and
    decode from or encode to UCS-2 or UTF-8 only when interfacing with
    the network/database.
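
One way to picture the "convert only at the boundaries" advice is sketched below, assuming Python 3 and a big-endian stream (the thread does not state the byte order on the wire). The function names `ucs2_to_text` and `text_to_utf8` are hypothetical.

```python
def ucs2_to_text(raw):
    """Decode big-endian UCS-2 bytes from the network into a str."""
    return raw.decode("utf-16-be")

def text_to_utf8(text):
    """Encode a str as UTF-8 bytes for storage."""
    return text.encode("utf-8")

received = b"\x20\xac"         # euro sign as it would arrive in big-endian UCS-2
text = ucs2_to_text(received)  # work with this value inside the program
stored = text_to_utf8(text)    # bytes as they would be written in UTF-8
```

Everything between the two calls works on `text` and never needs to know which encoding either side uses.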

    HTH,
    Martin
     
    "Martin v. Löwis", Jun 17, 2007
    #2

  3. Tzury

    Tzury Guest

    Thanks Martin for this guideline
     
    Tzury, Jun 17, 2007
    #3
  4. Tzury

    Tzury Guest

    Thanks, Martin, for this guideline. But say I actually get a USC-2
    string and need to compare it with a UTF-8 value in the database.
    How can I do that with Python's built-in libraries?
     
    Tzury, Jun 17, 2007
    #4
  5. John Machin

    John Machin Guest


    Use the str.decode method with the appropriate encoding. Borrowing
    Martin's last example:

    >>> '\xE2\x82\xAC'.decode('utf8')
    u'\u20ac'
    >>> '\x20\xAC'.decode('utf_16_be')
    u'\u20ac'
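
For the database comparison asked about, a rough sketch with the standard `sqlite3` module follows. It assumes Python 3, where TEXT columns come back as `str`, so only the network bytes need decoding. The table `items`, the column `name`, and the helper `lookup_name` are made up for illustration, and the wire encoding is assumed to be big-endian.

```python
import sqlite3

def lookup_name(conn, payload, wire_encoding="utf-16-be"):
    """Decode network bytes to str and look the value up in the database."""
    text = payload.decode(wire_encoding)
    row = conn.execute(
        "SELECT name FROM items WHERE name = ?", (text,)
    ).fetchone()
    return None if row is None else row[0]

# In-memory database standing in for the existing UTF-8 SQLite file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.execute("INSERT INTO items (name) VALUES (?)", ("\u20ac",))

found = lookup_name(conn, b"\x20\xac")  # euro sign received as UCS-2 bytes
missing = lookup_name(conn, b"\x00A")   # 'A' is not in the table
```

The comparison happens on decoded text, so the UTF-8 storage format never has to be matched byte for byte against the UCS-2 stream.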

    BTW TLA 'USC' AAF SBE 'UCS'
    HTH
    SJM
     
    John Machin, Jun 17, 2007
    #5
  6. Tzury

    Tzury Guest

    Yet 'utf_16_be' is not 'ucs-2'.
    How would I get UCS-2 encoding and decoding functionality in Python?
     
    Tzury, Jun 17, 2007
    #6
  7. Tzury wrote:
    > Yet,
    > 'utf_16_be' is not 'ucs-2'.


    That's not true. They are virtually identical.

    > How would I get ucs-2 encoding and decoding functionality with python?


    Use the UTF-16 codec.
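
To make "virtually identical" concrete: UTF-16 and UCS-2 produce the same bytes for every character from U+0000 to U+FFFF, and differ only for higher code points, which UTF-16 encodes as surrogate pairs and UCS-2 cannot represent. The sketch below, assuming Python 3, wraps the UTF-16 codec with that one extra check; `fits_in_ucs2` and `encode_ucs2` are hypothetical names, and the byte order has to be chosen to match the peer.

```python
def fits_in_ucs2(text):
    """True if every character is at most U+FFFF and is not a surrogate."""
    return all(
        ord(ch) <= 0xFFFF and not 0xD800 <= ord(ch) <= 0xDFFF
        for ch in text
    )

def encode_ucs2(text, byteorder="big"):
    """Encode text with the UTF-16 codec, rejecting anything UCS-2 cannot hold."""
    if not fits_in_ucs2(text):
        raise ValueError("text contains characters outside UCS-2")
    codec = "utf-16-be" if byteorder == "big" else "utf-16-le"
    return text.encode(codec)

euro_be = encode_ucs2("\u20ac")        # big-endian bytes
a_le = encode_ucs2("A", "little")      # little-endian bytes
# A code point above U+FFFF takes four bytes in UTF-16 (a surrogate pair)
pair_length = len("\U0001F600".encode("utf-16-be"))
```

Decoding needs no such check when the sender really is limited to UCS-2, since every valid UCS-2 stream is also valid UTF-16.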

    Regards,
    Martin
     
    "Martin v. Löwis", Jun 17, 2007
    #7
