how to find out utf or not

  • Thread starter Mohsen Pahlevanzadeh

Mohsen Pahlevanzadeh

Dear all,

Suppose I have a variable such as: myVar = 'x'

It may have been initialized with myVar = u'x' or myVar = 'x'.

So I need to determine whether the content of myVar is UTF-8 or not. How
can I do it?


--mohsen
 

Steven D'Aprano

> Dear all,
>
> Suppose I have a variable such as: myVar = 'x'
>
> It may have been initialized with myVar = u'x' or myVar = 'x'.

Can't you just look at the code and tell which it is?

> So I need to determine whether the content of myVar is UTF-8 or not. How
> can I do it?

I think you misunderstand the difference between Unicode and UTF-8. The
first thing you must understand is that Unicode does not mean UTF-8. They
are different things. Anyone who has told you that they are the same is
also mistaken.

Unicode is an abstract collection of characters, a character set.
(Technically, code points rather than characters, but don't worry about
that yet.) In Python 2, you normally create a Unicode string with either
the u"..." literal syntax, or the unicode() function. A Unicode string
might look like this:

abc§жπxyz

Each character has an ordinal value, which is the same as its Unicode
code point:

py> s = u'abc§жπxyz'
py> for char in s:
...     print char, ord(char)
...
a 97
b 98
c 99
§ 167
ж 1078
π 960
x 120
y 121
z 122


Note that ordinal values go *far* beyond 256. They go from 0 to 1114111.
So a Unicode string is a string of code points, in this example:

97 98 99 167 1078 960 120 121 122

Of course, computers don't understand "code points" any more than they
understand "sounds" or "movies" or "pictures of cats". Computers only
understand *bytes*. So how are these code points represented as bytes? By
using an encoding -- an encoding tells the computer how to represent
characters like "a", "b" and "ж" as bytes, for storage on disk or in
memory.

There are at least six different encodings for Unicode strings, and UTF-8
is only one of them. The others are two varieties each of UTF-16 and
UTF-32, and UTF-7. Given the unicode string:

u'abc§жπxyz'

it could be stored in memory as any of these sequences of hexadecimal
bytes:

610062006300A7003604C003780079007A00

00610062006300A7043603C000780079007A

610000006200000063000000A700000036040000C003000078000000790000007A000000

000000610000006200000063000000A700000436000003C000000078000000790000007A

616263C2A7D0B6CF8078797A

6162632B414B63454E6750412D78797A


and likely others as well. Which one will Python use? That depends on the
version of Python, how the interpreter was built, what operating system
you are using, and various other factors. Without knowing lots of
technical detail about your specific Python interpreter, I can't tell
which encoding it will be using internally. But I can be pretty sure that
it isn't using UTF-8.
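The six byte sequences listed above can be reproduced with the standard codecs. This is only a sketch: the helper name hex_bytes and the variable names are made up for illustration, and the code is written so that it should run on both Python 2 and Python 3. It shows what each encoding produces on request, not what the interpreter uses internally.

```python
import binascii

# The example string, written with escapes so the encoding of the source
# file does not matter: a b c § ж π x y z
text = u'abc\xa7\u0436\u03c0xyz'

# Codec names as registered in Python's standard codec registry.
codecs_to_try = ['utf-16-le', 'utf-16-be', 'utf-32-le', 'utf-32-be',
                 'utf-8', 'utf-7']


def hex_bytes(data):
    """Return the bytes as an upper-case hexadecimal string."""
    return binascii.hexlify(data).decode('ascii').upper()


encoded = {}
for name in codecs_to_try:
    # encode() turns the code points into a concrete byte sequence.
    encoded[name] = hex_bytes(text.encode(name))
    print('%-10s %s' % (name, encoded[name]))
```

The same nine code points come out as anywhere from 12 to 36 bytes, depending on the encoding chosen.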

So, you have a variable. Perhaps it has been passed to you from another
function, and you need to find out what it is. In this case, you do the
same thing you would do for any other type (int, list, dict, str, ...)
and use isinstance:

if isinstance(myVar, unicode):
    pass  # myVar is a Unicode string; handle it here
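A slightly fuller sketch of the same isinstance check, for code that may receive either kind of string. The names text_type, bytes_type and classify are hypothetical, chosen for this example; taking the types from literals is one way to make the check work on both Python 2 (unicode and str) and Python 3 (str and bytes).

```python
text_type = type(u'')    # unicode on Python 2, str on Python 3
bytes_type = type(b'')   # str on Python 2, bytes on Python 3


def classify(value):
    """Return 'text', 'bytes', or 'other' depending on the type of value."""
    if isinstance(value, text_type):
        return 'text'
    if isinstance(value, bytes_type):
        return 'bytes'
    return 'other'


print(classify(u'x'))   # text
print(classify(b'x'))   # bytes
print(classify(42))     # other
```

Note that this only reports the type of the object. It says nothing about which encoding a byte-string happens to be in.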


If myVar is a Unicode string, you don't need to care about the encoding
(UTF-8 or otherwise) until you're ready to write it to a file. Then I
strongly recommend you always use UTF-8, unless you have to interoperate
with some old, legacy system:

assert isinstance(myVar, unicode)
byte_string = myVar.encode('utf-8')


The encode call returns a byte-string encoded using UTF-8.
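When the destination is a file, one option is to let the file object do the encoding. The sketch below assumes io.open from the standard library (available on Python 2.6+ and Python 3) and writes to a temporary scratch file created only for the demonstration.

```python
import io
import os
import tempfile

text = u'abc\xa7\u0436\u03c0xyz'

# Hypothetical scratch file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

try:
    # io.open encodes on the way out, so the program itself only ever
    # handles Unicode text.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Reading the file back in binary mode shows the raw UTF-8 bytes.
    with io.open(path, 'rb') as f:
        raw = f.read()
finally:
    os.remove(path)

print(repr(raw))
```

The bytes on disk are the same ones that text.encode('utf-8') would produce.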

If myVar is a byte-string, like 'abc' without the u'' prefix, then you
have a bit of a problem. Think of it like a file without a file
extension: it could be a JPEG, a WAV, a DLL, anything. There's no real
way to be sure. You can look inside the file and try to guess, but that's
not always reliable. Without an extension, as in "myfile.jpg" or
"myfile.wav", you can't tell for sure what "myfile" is (although sometimes
you can make a good prediction: "my holiday picture" is probably a JPEG).

And so it is with byte-strings. Unless you know where they came from and
how they were prepared, you can't easily tell what encoding they used, at
least not without guessing. But if you control the source of the data,
and make sure you only use the encoding of your choice (let's say UTF-8),
then it is easy to convert the bytes into Unicode:

assert isinstance(myVar, str)
unicode_string = myVar.decode('utf-8')
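If you do not control the source, about the best you can do is attempt the decode and see whether it fails. The following is a minimal sketch; the function name try_decode_utf8 and the sample byte-strings are made up for illustration. A successful decode only shows that the bytes are valid UTF-8, not that UTF-8 was what the sender intended: pure ASCII bytes, for instance, decode identically under many encodings.

```python
def try_decode_utf8(data):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None


good = b'abc\xc2\xa7\xd0\xb6\xcf\x80xyz'   # the UTF-8 bytes shown earlier
bad = b'abc\xa7xyz'                        # Latin-1 bytes for u'abc\xa7xyz'

print(repr(try_decode_utf8(good)))
print(try_decode_utf8(bad))   # None
```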
 

Gisle Vanem

Steven D'Aprano said:
> If myVar is a Unicode string, you don't need to care about the encoding
> (UTF-8 or otherwise) until you're ready to write it to a file. Then I
> strongly recommend you always use UTF-8, unless you have to interoperate
> with some old, legacy system:
>
> assert isinstance(myVar, unicode)
> byte_string = myVar.encode('utf-8')

An excellent summary of the mysteries surrounding text encoding. Thank you.

--gv
 
