Unicode strings and ascii regular expressions

Fuzzyman · Jan 30, 2006

Hello all,

Can someone confirm that regular expressions compiled from ascii
strings will always (and safely) yield unicode values when matched
against unicode strings?

I've tested it and it works - but can someone confirm that this is
consistent and safe? (No lurking encode errors - I assume only a
decode is done, in which case is it safe on a system whose default
encoding is not ascii-compatible? OTOH it would seem to me that
that would break *everything*.)

# Python 2; assumes this source is saved as cp1252
import re
r = re.compile('(.*)=(.*)')
# yields a unicode string that can't be encoded as ascii
s = '£££=£££'.decode('cp1252')
c = r.match(s)
# yields two unicode strings: (u'\xa3\xa3\xa3', u'\xa3\xa3\xa3')
c.groups()
# which encode safely
print c.groups()[0].encode('cp1252')

£££

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Fredrik Lundh · Jan 30, 2006

Fuzzyman said:
Can someone confirm that regular expressions compiled from ascii
strings will always (and safely) yield unicode values when matched
against unicode strings?

I've tested it and it works - but can someone confirm that this is
consistent and safe? (No lurking encode errors - I assume only a
decode is done, in which case is it safe on a system whose default
encoding is not ascii-compatible? OTOH it would seem to me that
that would break *everything*.)

# Python 2; assumes this source is saved as cp1252
import re
r = re.compile('(.*)=(.*)')
# yields a unicode string that can't be encoded as ascii
s = '£££=£££'.decode('cp1252')
c = r.match(s)
# yields two unicode strings: (u'\xa3\xa3\xa3', u'\xa3\xa3\xa3')
c.groups()
# which encode safely
print c.groups()[0].encode('cp1252')

£££

ascii patterns work just fine on unicode strings. the engine doesn't care
what string type you use for the pattern, and it always returns slices of
the target string, so you get back what you pass in.
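
A minimal sketch of the point above, written for Python 3 rather than the Python 2 used in this thread (that switch is an assumption, not something from the original posts). In Python 3 every str is already Unicode, and the pattern and target must both be str or both be bytes, so the closest equivalent is a pattern containing only ASCII characters matched against non-ASCII text. The names pattern, target, left and right are illustrative only.

```python
import re

# Python 3 sketch: str is Unicode, so no .decode() step is needed.
pattern = re.compile(r'(.*)=(.*)')      # pattern uses only ASCII characters
target = '\xa3\xa3\xa3=\xa3\xa3\xa3'    # three pound signs on each side of '='

match = pattern.match(target)
left, right = match.groups()

# Each group is a slice of the target, so it has the target's type
# and exactly the characters found at the matched positions.
assert type(left) is type(target)
assert left == target[match.start(1):match.end(1)]
assert right == target[match.start(2):match.end(2)]

# Encoding only succeeds with a codec that can represent the characters.
encoded = left.encode('cp1252')
try:
    left.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The slice comparison is the useful check here: because the groups come straight from the target, any encoding concern belongs to the target string itself, not to the pattern.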

</F>

Fuzzyman · Jan 31, 2006

Fredrik said:
Fuzzyman said:

Can someone confirm that regular expressions compiled from ascii
strings will always (and safely) yield unicode values when matched
against unicode strings?

[snip..]

ascii patterns work just fine on unicode strings. the engine doesn't care
what string type you use for the pattern, and it always returns slices of
the target string, so you get back what you pass in.

Thanks - that's what I hoped.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
