Best ways of managing text encodings in source/regexes?


tinkerbarbet

Hi

I've read around quite a bit about Unicode and python's support for
it, and I'm still unclear about how it all fits together in certain
scenarios. Can anyone help clarify?

* When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
the source file as UTF-8, do I still need to prefix all the strings
constructed in the source with u as in myStr = u"blah", even when
those strings contain only ASCII or ISO-8859-1 chars? (It would be a
bother for me to do this for the complete source I'm working on, where
I rarely need chars outside the ISO-8859-1 range.)

* Will python figure it out if I use different encodings in different
modules -- say a main source file which is "# -*- coding: utf-8 -*-"
and an imported module which doesn't say this (for which python will
presumably use a default encoding)? This seems inevitable given that
standard library modules such as re don't declare an encoding,
presumably because their source contains no non-ASCII chars.

* If I want to use a Unicode char in a regex -- say an en-dash, U+2013
-- in an ASCII- or ISO-8859-1-encoded source file, can I say

myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013') # en-dash

then read the source file into a unicode string with codecs.read(),
then expect re to match against the unicode string using either of
those regexes if the string contains the relevant chars? Or do I need
to make all my regex patterns unicode strings, with u""?

I've been trying to understand this for a while so any clarification
would be a great help.

Tim
 

Martin v. Löwis

* When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
the source file as UTF-8, do I still need to prefix all the strings
constructed in the source with u as in myStr = u"blah", even when
those strings contain only ASCII or ISO-8859-1 chars? (It would be a
bother for me to do this for the complete source I'm working on, where
I rarely need chars outside the ISO-8859-1 range.)

Depends on what you want to achieve. If you don't prefix your strings
with u, they will stay byte string objects, and won't become Unicode
strings. That should be fine for strings that are pure ASCII; for
ISO-8859-1 strings, it is safer to use only Unicode objects to
represent them.

In Py3k, that will change - string literals will automatically be
Unicode objects.
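
For example, a quick check in the interpreter shows the difference:

py>>> type("blah")
<type 'str'>
py>>> type(u"blah")
<type 'unicode'>
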
* Will python figure it out if I use different encodings in different
modules -- say a main source file which is "# -*- coding: utf-8 -*-"
and an imported module which doesn't say this (for which python will
presumably use a default encoding)?

Yes, it will. The encoding declaration is per-module.
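
As a sketch (the file names are just for illustration), each module is
decoded according to its own declaration -- or the ASCII default when
there is none:

# main.py -- saved as UTF-8
# -*- coding: utf-8 -*-
import helper            # helper.py has no coding line, so it must be ASCII-only
title = u"caf\u00e9"
print helper.greeting()

# helper.py -- plain ASCII, no declaration needed
def greeting():
    return u"hello"
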
* If I want to use a Unicode char in a regex -- say an en-dash, U+2013
-- in an ASCII- or ISO-8859-1-encoded source file, can I say

myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013') # en-dash

then read the source file into a unicode string with codecs.read(),
then expect re to match against the unicode string using either of
those regexes if the string contains the relevant chars? Or do I need
to make all my regex patterns unicode strings, with u""?

It will work fine if the regular expression restricts itself to ASCII,
and doesn't rely on any of the locale-specific character classes (such
as \w). If it's beyond ASCII, or does use such escapes, you better make
it a Unicode expression.

I'm not actually sure what the precise semantics are when you match
an expression compiled from a byte string against a Unicode string,
or vice versa. I believe it operates on the internal representation,
so \xf6 in a byte string expression matches with \u00f6 in a Unicode
string; it won't try to convert one into the other.
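
In any case, the safe route is to keep both the pattern and the target
as Unicode, in which case the question doesn't arise; for instance:

py>>> import re
py>>> dash = re.compile(u'\u2013')
py>>> dash.search(u'one \u2013 two') is not None
True
py>>> dash.search(u'one - two') is not None
False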

Regards,
Martin
 

tinkerbarbet

Thanks Martin, that's a very helpful response to what I was concerned
might be an overly long query.

Yes, I'd read that in Py3k the distinction between byte strings and
Unicode strings would disappear -- I look forward to that...

Tim
 

J. Clifford Dyer

Hi

Thanks for your responses, as I said on the reply I posted I thought
later it was a bit long so I'm grateful you held out!

I should have said (but see comment about length) that I'd tried
joining a unicode and a byte string in the interpreter and got that
working, but wondered whether it was safe (I'd seen the odd warning
about mixing byte strings and unicode). Anyway what I was concerned
about was what python does with source files rather than what happens
from the interpreter, since I don't know if it's possible to change
the encoding of the terminal without messing with site.py (and what
else).

Aren't both ASCII and ISO-8859-1 subsets of UTF-8? Can't you then use
chars from either of those charsets in a file saved as UTF-8 by one's
editor, with a # -*- coding: utf-8 -*- pseudo-declaration for python,
without problems? You seem to disagree.
I do disagree. Unicode is a superset of ISO-8859-1, but UTF-8 is a
specific encoding, which changes many of the binary values. UTF-8 was
designed specifically not to change the values of ascii characters.
0x61 (lower case a) in ascii is encoded with the bits 0110 0001. In
UTF-8 it is also encoded 0110 0001. However, ñ, "latin small letter n
with tilde", is unicode/iso-8859-1 character 0xf1. In ISO-8859-1, this
is represented by the bits 1111 0001.

UTF-8 gets a little tricky here. In order to be extensible beyond 8
bits, it has to insert control bits at the beginning, so this
character actually requires 2 bytes to represent instead of just one.
In order to show that UTF-8 will be using two bytes to represent the
character, the first byte begins with 110 (1110 is used when three
bytes are needed). Each successive byte begins with 10 to show that it
is not the beginning of a character. Then the code-point value is
packed into the remaining free bits, as far to the right as possible.
So in this case, the control bits are

110x xxxx 10xx xxxx.

The character value, 0xf1, or:
1111 0001

gets inserted as follows:

110x xx{11} 10{11 0001}

and the remaining free x-es get replaced by zeroes.

1100 0011 1011 0001.

Note that the python interpreter agrees:

py>>> x = u'\u00f1'
py>>> x.encode('utf-8')
'\xc3\xb1'

(Conversion from binary to hex is left as an exercise for the reader)

So while ASCII is a subset of UTF-8, ISO-8859-1 is definitely not. As
others have said many times when this issue periodically comes up:
UTF-8 is not unicode. Hopefully this will help explain exactly why.
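
The contrast is easy to see in the interpreter -- the same character
gives different bytes under the two encodings:

py>>> u'\u00f1'.encode('latin-1')
'\xf1'
py>>> u'\u00f1'.encode('utf-8')
'\xc3\xb1'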

Note that with other encodings, like UTF-16, even ascii is not a subset.

See the wikipedia article on UTF-8 for a more complete explanation and external references to official documentation (http://en.wikipedia.org/wiki/UTF-8).


The reason all this arose was that I was using ISO-8859-1/Latin-1 with
all the right declarations, but then I needed to match a few chars
outside of that range. So I didn't need to use u"" before, but now I
do in some regexes, and I was wondering if this meant that /all/ my
regexes had to be constructed from u"" strings or whether I could just
do the selected ones, either using literals (and saving the file as
UTF-8) or unicode escape sequences (and leaving the file as ASCII -- I
already use hex escape sequences without problems but that doesn't
work past the end of ISO-8859-1).

Do you know about unicode escape sequences?

py>>> u'\xf1' == u'\u00f1'
True
Thanks again for your feedback.

Best wishes
Tim

No problem. It took me a while to wrap my head around it, too.

Cheers,
Cliff
 

Kumar McMillan

myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013') # en-dash

then read the source file into a unicode string with codecs.read(),
then expect re to match against the unicode string using either of
those regexes if the string contains the relevant chars? Or do I need
to make all my regex patterns unicode strings, with u""?

It will work fine if the regular expression restricts itself to ASCII,
and doesn't rely on any of the locale-specific character classes (such
as \w). If it's beyond ASCII, or does use such escapes, you better make
it a Unicode expression.

Yes, you have to be careful when writing unicode-sensitive regular expressions:
http://effbot.org/zone/unicode-objects.htm

"You can apply the same pattern to either 8-bit (encoded) or Unicode
strings. To create a regular expression pattern that uses Unicode
character classes for \w (and \s, and \b), use the "(?u)" flag prefix,
or the re.UNICODE flag:

pattern = re.compile("(?u)pattern")
pattern = re.compile("pattern", re.UNICODE)

"
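
For example, \w only picks up non-ASCII letters when the flag is
supplied:

py>>> import re
py>>> re.findall(r'\w+', u'caf\u00e9 bar')
[u'caf', u'bar']
py>>> re.findall(r'\w+', u'caf\u00e9 bar', re.UNICODE)
[u'caf\xe9', u'bar']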
 

tinkerbarbet

OK, for those interested in this sort of thing, this is what I now
think is necessary to work with Unicode in python. Thanks to those
who gave feedback, and to Cliff in particular (but any remaining
misconceptions are my own!) Here are the results of my attempts to
come to grips with this. Comments/corrections welcome...

(Note this is not about which characters one expects to match with \w
etc. when compiling regexes with python's re.UNICODE flag. It's about
the encoding of one's source strings when building regexes, in order
to match against strings read from files of various encodings.)

I/O: READ TO/WRITE FROM UNICODE STRING OBJECTS. Always use codecs to
read from a specific encoding to a python Unicode string, and use
codecs to encode to a specific encoding when writing the processed
data. The .read() of a file opened with codecs.open() delivers a
Unicode string decoded from a specific encoding, and its .write() will
put the Unicode string into a specific encoding (be that an encoding
of Unicode such as UTF-8).
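
A short sketch of that round trip (the file names and the choice of
UTF-8 are just examples):

import codecs

infile = codecs.open('input.txt', 'r', encoding='utf-8')
text = infile.read()       # a unicode object, already decoded
infile.close()

outfile = codecs.open('output.txt', 'w', encoding='utf-8')
outfile.write(text)        # encoded back to UTF-8 on the way out
outfile.close()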

SOURCE: Save the source as UTF-8 in your editor, tell python with "# -
*- coding: utf-8 -*-", and construct all strings with u'' (or ur''
instead of r''). Then, when you're concatenating strings constructed
in your source with strings read with codecs, you needn't worry about
conversion issues. (When concatenating byte strings from your source
with Unicode strings, python will, without an explicit decode, assume
the byte string is ASCII which is a subset of Unicode (ISO-8859-1
isn't).)
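
The interpreter shows why the implicit-ASCII assumption only goes so
far:

py>>> u'foo' + 'bar'       # the byte string is pure ASCII, so this is fine
u'foobar'
py>>> u'foo' + '\xe9'      # non-ASCII byte string: the implicit decode fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)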

Even if you save the source as UTF-8, tell python with "# -*- coding:
utf-8 -*-", and say myString = "blah", myString is a byte string. To
construct a Unicode string you must say myString = u"blah" or
myString = unicode("blah"), even if your source is UTF-8.

Typing 'u' when constructing all strings isn't too arduous, and is
less effort than passing selected non-ASCII source strings to
unicode() and
needing to remember where to do it. (You could easily slip a non-
ASCII char into a byte string in your code because most editors and
default system encodings will allow this.) Doing everything in Unicode
simplifies life.

Since the source is now UTF-8, and given Unicode support in the
editor, it doesn't matter whether you use Unicode escape sequences or
literal Unicode characters when constructing strings, since
u'ñ' == u'\u00f1' evaluates to True.

REGEXES: I'm a bit less certain about regexes, but this is how I think
it's going to work: Now that my regexes are constructed from Unicode
strings, and those regexes will be compiled to match against Unicode
strings read with codecs, any potential problems with encoding
conversion disappear. If I put an en-dash into a regex built using
u'', and I happen to have read the file in the ASCII encoding which
doesn't support en-dashes, the regex simply won't match because the
pattern doesn't exist in the /Unicode/ string served up by codecs.
There's no actual problem with my string encoding handling, it just
means I'm looking for the wrong chars in a Unicode string read from a
file not saved in a Unicode encoding.
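
A short sketch of that case (the file name is hypothetical):

import re, codecs

dash = re.compile(u'\u2013')   # en-dash pattern, built from a Unicode string
text = codecs.open('notes.txt', encoding='ascii').read()
print dash.search(text)        # None: no encoding error, the char simply isn't there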

Tim
 

tvn

Please see the correction from Cliff pasted here after this excerpt.
Tim
the byte string is ASCII which is a subset of Unicode (ISO-8859-1
isn't).)

The one comment I'd make is that ASCII and ISO-8859-1 are both subsets
of Unicode (which relates to the abstract code-points), but ASCII is
also a subset of UTF-8, on the bytestream level, while ISO-8859-1 is
not a subset of UTF-8, nor, as far as I can tell, any other Unicode
*encoding*.

Thus a file encoded in ascii *is* in fact a utf-8 file. There is no
way to distinguish the two. But an ISO-8859-1 file is not the same
(on the bytestream level) as a file with identical content in UTF-8
or any other unicode encoding.
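
A quick interpreter check of that byte-level point:

py>>> 'plain ascii text'.decode('utf-8')   # ASCII bytes are already valid UTF-8
u'plain ascii text'
py>>> '\xf1'.decode('utf-8')               # the ISO-8859-1 byte for n-tilde raises UnicodeDecodeError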
 
