unicode regex example: trouble

marek

I'm trying to get this example to print a MatchObject reference, but it fails (prints None).
Does anybody know where I'm going wrong?

# -*- coding: cp1251 -*-

import re

# pattern in Ukrainian ('привіт')
p = '\377\376?\004@\0048\0042\004V\004B\004'

# data (pattern is in the middle of the string)
d = '\377\376t\000e\000s\000t\000?\004@\0048\0042\004V\004B\004t\000t\000'

re_test = re.compile(p, re.UNICODE)

print re_test.search(d, re.UNICODE)
 
Peter Otten

marek said:
trying this example to make print MatchObject reference. Fails (prints
None). Does anybody know where I am wrong?

# -*- coding: cp1251 -*-

import re

# pattern in Ukrainian ('привіт')
p = '\377\376?\004@\0048\0042\004V\004B\004'

# data (pattern is in the middle of the string)
d = '\377\376t\000e\000s\000t\000?\004@\0048\0042\004V\004B\004t\000t\000'

re_test = re.compile(p, re.UNICODE)

print re_test.search(d, re.UNICODE)

What you have here are byte strings full of funny 8-bit characters, not unicode:
ÿþ?@82VB ÿþtest?@82VBtt

I guess the encoding is utf-16, therefore:
ÿþ?@82VB

Works as expected :)
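
A minimal sketch of that decode-then-search approach, for anyone following along. It is written for Python 3, which is an assumption on my part since the thread itself uses Python 2, and the names p_bytes and d_bytes are only illustrative. The byte values are copied from the question; everything else is one possible way to do it, not necessarily what was run above.

```python
import re

# Raw bytes from the question: UTF-16 (little-endian) text with a leading BOM.
p_bytes = b'\377\376?\004@\0048\0042\004V\004B\004'
d_bytes = b'\377\376t\000e\000s\000t\000?\004@\0048\0042\004V\004B\004t\000t\000'

# Decode to real text first; the "utf-16" codec consumes the BOM.
p = p_bytes.decode('utf-16')
d = d_bytes.decode('utf-16')

# re.escape() keeps any regex metacharacters in the pattern text literal.
re_test = re.compile(re.escape(p), re.UNICODE)

# Flags belong in re.compile(); the second positional argument of a
# compiled pattern's search() is a start position, not a flag.
m = re_test.search(d)
print(m.group(0), m.span())  # привіт (4, 10)
```

Two things go wrong in the undecoded version: the BOM at the start of p never occurs in the middle of d, and the '?' byte is treated as a regex quantifier. On top of that, passing re.UNICODE to search() makes the search begin at an offset past the end of the data.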

Here's what the docs say about the unicode flag:

UNICODE
Make \w, \W, \b, and \B dependent on the Unicode character properties
database. New in version 2.0.

You may or may not need that when you refine your regexp.
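
To see what the flag changes, here is a small hedged sketch, again in Python 3 as an assumption. There, str patterns are Unicode-aware by default and re.ASCII switches that off, which roughly mimics what Python 2 did when re.UNICODE was left out. The sample text is made up for illustration.

```python
import re

text = 'test привіт'

# Unicode-aware matching: \w also covers Cyrillic letters.
unicode_words = re.findall(r'\w+', text, re.UNICODE)

# ASCII-only matching: \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(r'\w+', text, re.ASCII)

print(unicode_words)  # ['test', 'привіт']
print(ascii_words)    # ['test']
```

For a literal pattern like the one in the question, which has no \w, \W, \b or \B in it, the flag makes no difference.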

Peter
 
