Comparing UTF-8 into USC-2 and vice versa (newbie :-) )

Tzury · Jun 17, 2007

I recently rewrote a .NET application in Python.
The application basically receives streams over a TCP socket and performs
operations against an existing database.
The database is SQLite3 (encoded as UTF-8).
The network streams are encoded as UCS-2.

In UCS-2, 'A' = '0041', but when I check with the built-in
functions I get unicode("A", "utf-8") = u'A' = u'\u0041'. What is
the difference, and how can I safely encode/decode UCS-2 streams
and match them against the UTF-8 representation?

Martin v. Löwis · Jun 17, 2007

In unicode("A", "utf-8"), the "utf-8" parameter does *not* mean
that the output is in UTF-8, but the *input*.
So "A" = '41' != '0041'. In UCS-2, the A consumes two bytes; in
UTF-8, it consumes only one byte.

For other characters, the results differ: for example, for u'\xf6',
the UCS-2 representation (big-endian) is '00F6', for UTF-8, it is
'C3B6'. For u'\u20AC', the UCS-2 is '20AC', the UTF-8 is 'E282AC'
(i.e. three bytes).
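
The byte values above can be checked with a short script. This is only a sketch in Python 3 syntax (the thread itself uses Python 2, where the text type is `unicode`); the helper name `hex_bytes` is made up for illustration, and big-endian byte order is assumed, as in the examples above.

```python
# Python 3 sketch: str is the Unicode text type; encode() yields raw bytes.
samples = ["A", "\xf6", "\u20ac"]


def hex_bytes(text, codec):
    """Return the encoded form of text as an uppercase hex string."""
    return text.encode(codec).hex().upper()


for ch in samples:
    # Columns: character, UCS-2/UTF-16 big-endian bytes, UTF-8 bytes
    print(repr(ch), hex_bytes(ch, "utf_16_be"), hex_bytes(ch, "utf_8"))
```

The same character gives different byte sequences under each codec, which is why the bytes themselves should not be compared directly.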

You should use Unicode objects in your program always, and encode
to or from UCS-2 or UTF-8 only when interfacing with the
network/database.
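
A rough sketch of that advice, in Python 3 syntax and using the standard `sqlite3` module: decode the network bytes once on arrival, then work only with text. The table name `items`, the helper `decode_packet`, and the big-endian byte order are assumptions for illustration, not details from the original application. The `sqlite3` module accepts text objects as parameters and handles the database's UTF-8 encoding itself.

```python
import sqlite3


def decode_packet(payload):
    """Decode a UCS-2 payload (big-endian assumed) into a text object."""
    return payload.decode("utf_16_be")


# In-memory database standing in for the existing SQLite3 file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.execute("INSERT INTO items VALUES (?)", ("\u20ac",))

# Bytes as they might arrive from the socket: U+20AC in UCS-2 big-endian.
name = decode_packet(b"\x20\xac")

# Pass text as a bound parameter; no manual UTF-8 encoding is needed.
row = conn.execute("SELECT name FROM items WHERE name = ?", (name,)).fetchone()
print(row)
```

If the sender uses little-endian order, the codec would be `utf_16_le` instead.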

HTH,
Martin

Tzury · Jun 17, 2007

Thanks, Martin, for this guideline.

Tzury · Jun 17, 2007

Thanks, Martin, for this guideline. But say I receive a USC-2 string
and need to compare it with a UTF-8 value in the database. How can I do
that with Python's built-in libraries?

John Machin · Jun 17, 2007

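Use the str.decode method with the appropriate encoding. Borrowing
Martin's last example, decoding the UCS-2 bytes '\x20\xac' with
'utf_16_be' and the UTF-8 bytes '\xe2\x82\xac' with 'utf_8' both give:
u'\u20ac'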
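
As a sketch of the comparison asked about, in Python 3 syntax (where the method is `bytes.decode` rather than Python 2's `str.decode`): decode both sides to text and compare the text, never the raw bytes. The function name `same_text` is hypothetical, and big-endian UCS-2 is assumed.

```python
def same_text(ucs2_bytes, utf8_bytes):
    """Compare a UCS-2 (big-endian assumed) value with a UTF-8 value as text."""
    return ucs2_bytes.decode("utf_16_be") == utf8_bytes.decode("utf_8")


# U+20AC from the network (UCS-2) versus the database (UTF-8)
print(same_text(b"\x20\xac", b"\xe2\x82\xac"))
```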

BTW TLA 'USC' AAF SBE 'UCS'
HTH
SJM

Tzury · Jun 17, 2007

Yet,
'utf_16_be' is not 'ucs-2'.
How would I get ucs-2 encoding and decoding functionality with python?

Martin v. Löwis · Jun 17, 2007

Tzury said:
Yet,
'utf_16_be' is not 'ucs-2'.

That's not true. They are virtually identical.

How would I get ucs-2 encoding and decoding functionality with python?

Use the UTF-16 codec.
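
For what it is worth, the two encodings agree on every character in the Basic Multilingual Plane (code points up to U+FFFF); UTF-16 additionally represents higher code points as surrogate pairs, which UCS-2 cannot express. If strict UCS-2 output matters, a thin wrapper over the UTF-16 codec is one option. The sketch below is in Python 3 syntax; `encode_ucs2` is a made-up helper, not a standard library function, and big-endian order is assumed.

```python
def encode_ucs2(text):
    """Encode text as big-endian UCS-2, rejecting characters outside the BMP."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError("not representable in UCS-2: %r" % ch)
    # Within the BMP, UTF-16 and UCS-2 produce the same bytes.
    return text.encode("utf_16_be")


print(encode_ucs2("\u20ac").hex())             # two bytes: 20ac
print("\U0001F600".encode("utf_16_be").hex())  # surrogate pair: d83dde00
```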

Regards,
Martin
