how to find out utf or not

Mohsen Pahlevanzadeh · Nov 5, 2013

Dear all,

Suppose I have a variable such as: myVar = 'x'

It may have been initialized with myVar = u'x' or myVar = 'x'.

So I need to determine whether the content of myVar is UTF-8 or not. How can I
do it?

--mohsen

Steven D'Aprano · Nov 5, 2013

> Dear all,
>
> Suppose I have a variable such as: myVar = 'x'
>
> It may have been initialized with myVar = u'x' or myVar = 'x'.

Can't you just look at the code and tell which it is?

> So I need to determine whether the content of myVar is UTF-8 or not. How can I
> do it?

I think you misunderstand the difference between Unicode and UTF-8. The
first thing you must understand is that Unicode does not mean UTF-8. They
are different things. Anyone who has told you that they are the same is
also mistaken.

Unicode is an abstract collection of characters, a character set.
(Technically, code points rather than characters, but don't worry about
that yet.) In Python 2, you normally create a Unicode string with either
the u"..." literal syntax, or the unicode() function. A Unicode string
might look like this:

abc§жπxyz

Each character has an ordinal value, which is the same as its Unicode
code point:

py> s = u'abc§жπxyz'
py> for char in s:
...     print char, ord(char)
...
a 97
b 98
c 99
§ 167
ж 1078
π 960
x 120
y 121
z 122

Note that ordinal values go *far* beyond 256. They go from 0 to 1114111.
So a Unicode string is a string of code points, in this example:

97 98 99 167 1078 960 120 121 122

Of course, computers don't understand "code points" any more than they
understand "sounds" or "movies" or "pictures of cats". Computers only
understand *bytes*. So how are these code points represented as bytes? By
using an encoding -- an encoding tells the computer how to represent
characters like "a", "b" and "ж" as bytes, for storage on disk or in
memory.

There are at least six different encodings for Unicode strings, and UTF-8
is only one of them. The others are two varieties each of UTF-16 and
UTF-32, and UTF-7. Given the unicode string:

u'abc§жπxyz'

it could be stored in memory as any of these sequences of hexadecimal
bytes:

610062006300A7003604C003780079007A00

00610062006300A7043603C000780079007A

610000006200000063000000A700000036040000C003000078000000790000007A000000

000000610000006200000063000000A700000436000003C000000078000000790000007A

616263C2A7D0B6CF8078797A

6162632B414B63454E6750412D78797A

and likely others as well. Which one will Python use? That depends on the
version of Python, how the interpreter was built, what operating system
you are using, and various other factors. Without knowing lots of
technical detail about your specific Python interpreter, I can't tell
which encoding it will be using internally. But I can be pretty sure that
it isn't using UTF-8.
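To see where those six byte sequences come from, here is a minimal sketch that encodes the same string with each codec and prints the result as hex. The codec names are the standard Python ones; the variable names are only for illustration, and the string is written with escapes so the example does not depend on the source file's own encoding. It should run on Python 2.7 and Python 3.

```python
import binascii

# The example string a b c § ж π x y z, spelled with escapes.
s = u'abc\xa7\u0436\u03c0xyz'

codecs_to_try = ['utf-16-le', 'utf-16-be', 'utf-32-le', 'utf-32-be',
                 'utf-8', 'utf-7']

encoded = {}
for name in codecs_to_try:
    raw = s.encode(name)  # text -> bytes, using the named encoding
    encoded[name] = binascii.hexlify(raw).decode('ascii').upper()
    print('%-9s %s' % (name, encoded[name]))
```

The same nine code points give six different byte sequences, which is why a byte-string on its own does not tell you which encoding produced it.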

So, you have a variable. Perhaps it has been passed to you from another
function, and you need to find out what it is. In this case, you do the
same thing you would do for any other type (int, list, dict, str, ...)
and use isinstance:

if isinstance(myVar, unicode):
    ...
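If the code has to run on more than one Python version, the same isinstance test can be written without naming unicode directly. This is only a sketch: describe is a hypothetical helper, and text_type and bytes_type are names made up for the example.

```python
text_type = type(u'')   # unicode on Python 2, str on Python 3
bytes_type = type(b'')  # str on Python 2, bytes on Python 3


def describe(value):
    """Hypothetical helper: say whether value is text or a byte-string."""
    if isinstance(value, text_type):
        return 'text'
    if isinstance(value, bytes_type):
        return 'bytes'
    raise TypeError('not a string: %r' % (value,))


print(describe(u'x'))
print(describe(b'x'))
```

Note that this answers "is it a Unicode string or a byte-string?", not "is it UTF-8?". Those are separate questions.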

If myVar is a Unicode string, you don't need to care about the encoding
(UTF-8 or otherwise) until you're ready to write it to a file. Then I
strongly recommend you always use UTF-8, unless you have to interoperate
with some old, legacy system:

assert isinstance(myVar, unicode)
byte_string = myVar.encode('utf-8')

That gives you a byte-string encoded using UTF-8.
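When the destination is a file, one way to keep the encoding step in a single place is to let io.open do it. The sketch below assumes that approach: save_text and load_text are hypothetical helper names, and the temporary path is only there to make the example self-contained. io.open exists on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def save_text(path, text):
    """Write a Unicode string to path, encoded as UTF-8."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)


def load_text(path):
    """Read path back, decoding the bytes as UTF-8."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return f.read()


path = os.path.join(tempfile.mkdtemp(), 'example.txt')
original = u'abc\xa7\u0436\u03c0xyz'
save_text(path, original)
round_tripped = load_text(path)

# What actually landed on disk is bytes, not code points.
with open(path, 'rb') as f:
    on_disk = f.read()
```

The program only ever handles Unicode strings; the bytes exist at the boundary where data is written or read.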

If myVar is a byte-string, like 'abc' without the u'' prefix, then you
have a bit of a problem. Think of it like a file without a file
extension: it could be a JPEG, a WAV, a DLL, anything. There's no real
way to be sure. You can look inside the file and try to guess, but that's
not always reliable. Without the extension "myfile.jpg", "myfile.wav",
etc. you can't tell for sure what "myfile" is (although sometimes you can
make a good prediction: "my holiday picture" is probably a JPEG).

And so it is with byte-strings. Unless you know where they came from and
how they were prepared, you can't easily tell what encoding they used, at
least not without guessing. But if you control the source of the data,
and make sure you only use the encoding of your choice (let's say UTF-8),
then it is easy to convert the bytes into Unicode:

assert isinstance(myVar, str)
unicode_string = myVar.decode('utf-8')
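If you don't control the source, the closest thing to "is this UTF-8?" is to try decoding and see whether it fails. The sketch below assumes that approach; looks_like_utf8 is a hypothetical helper name. A True result only means the bytes are valid UTF-8, not that UTF-8 is what the sender intended: plain ASCII passes, and so can data written in some other encoding.

```python
def looks_like_utf8(data):
    """Return True if the byte-string decodes as UTF-8 without error."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = u'\u0436'.encode('utf-8')    # the two bytes D0 B6
latin1_bytes = u'\xa7'.encode('latin-1')  # the single byte A7

valid = looks_like_utf8(utf8_bytes)       # a well-formed two-byte sequence
invalid = looks_like_utf8(latin1_bytes)   # a lone continuation byte
ascii_ok = looks_like_utf8(b'abc')        # ASCII is a subset of UTF-8

# Decoding with the wrong codec does not fail, it silently gives
# different characters: one letter becomes two.
wrong = utf8_bytes.decode('latin-1')
```

Here valid and ascii_ok are True and invalid is False, while wrong is a two-character string, which is how text ends up garbled without any error being raised.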

Gisle Vanem · Nov 5, 2013

Steven D'Aprano said:
> If myVar is a Unicode string, you don't need to care about the encoding
> (UTF-8 or otherwise) until you're ready to write it to a file. Then I
> strongly recommend you always use UTF-8, unless you have to interoperate
> with some old, legacy system:
>
> assert isinstance(myVar, unicode)
> byte_string = myVar.encode('utf-8')

An excellent summary of the mystics around text-encoding. Thank you.

--gv
