Unicode strings and ascii regular expressions

Fuzzyman

Hello all,

Can someone confirm that regular expressions compiled from ASCII
strings will always (and safely) yield Unicode values when matched
against Unicode strings?

I've tested it and it works, but can someone confirm that this is
consistent and safe? (No lurking encode errors. I assume only a
decode is done, in which case is it safe on a system that has a
non-ASCII-compatible default encoding? OTOH it would seem to me that
that would break *everything*.)

import re
r = re.compile('(.*)=(.*)')
# yields a unicode string that can't be encoded as ascii
s = '£££=£££'.decode('cp1252')
c = r.match(s)
c.groups()  # yields two unicode strings (u'\xa3\xa3\xa3', u'\xa3\xa3\xa3')
print c.groups()[0].encode('cp1252')  # which encode safely
£££


All the best,


Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
 
Fredrik Lundh

Fuzzyman said:
Can someone confirm that regular expressions compiled from ASCII
strings will always (and safely) yield Unicode values when matched
against Unicode strings?

[snip..]

ascii patterns work just fine on unicode strings. the engine doesn't care
what string type you use for the pattern, and it always returns slices of
the target string, so you get back what you pass in.

</F>
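
For anyone reading this on Python 3, here is a minimal sketch of the same point, not taken from the thread itself. It assumes Python 3, where a plain string literal is already Unicode (str), and the variable names are only illustrative. It checks that the groups are slices of the target string, so no encoding or decoding happens during matching. One difference from the Python 2 behaviour described above: Python 3 refuses to mix a bytes pattern with a str target.

```python
import re

# ASCII-only pattern; in Python 3 this literal is already a str (Unicode).
pattern = re.compile('(.*)=(.*)')

# Target containing non-ASCII characters (pound signs, U+00A3).
target = '\u00a3\u00a3\u00a3=\u00a3\u00a3\u00a3'

match = pattern.match(target)
left, right = match.groups()

# The groups are slices of the target, so they share its type and content;
# nothing was encoded or decoded along the way.
assert type(left) is type(target)
assert left == target[match.start(1):match.end(1)]
assert right == target[match.start(2):match.end(2)]

# Encoding back to cp1252 is an explicit, separate step.
encoded = left.encode('cp1252')

# Python 3 only: a bytes pattern cannot be applied to a str target.
try:
    re.compile(b'(.*)=(.*)').match(target)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Here encoded holds the three cp1252 bytes for the pound signs, and mixed_ok ends up False because the bytes pattern is rejected.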
 
Fuzzyman

Fredrik said:
Fuzzyman said:
Can someone confirm that regular expressions compiled from ASCII
strings will always (and safely) yield Unicode values when matched
against Unicode strings?
[snip..]

ascii patterns work just fine on unicode strings. the engine doesn't care
what string type you use for the pattern, and it always returns slices of
the target string, so you get back what you pass in.

Thanks - that's what I hoped. :)

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
 
