Regular expressions and Unicode

Jeffrey Barish

I have a regular expression that I use to extract the surname:

surname = r'(?u).+ (\w+)'

However, when I apply it to this Unicode string, I get only the first 3
letters of the surname:

name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

import re
surname_re = re.compile(surname)
m = surname_re.search(name)
m.groups()
('Dvo\xc5',)

I suppose that there is an encoding problem, but I don't understand Unicode
well enough to know what to do to digest properly the Unicode characters in
the surname.
 
Peter Otten

Jeffrey said:
I have a regular expression that I use to extract the surname:

surname = r'(?u).+ (\w+)'

However, when I apply it to this Unicode string, I get only the first 3
letters of the surname:

name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

That's a byte string. You can either modify the literal

name = u'Anton\xedn Dvo\u0159\xe1k'

or decode it with the proper encoding

name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
name = name.decode("utf-8")
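
For readers on Python 3, the two options above can be sketched roughly as follows. This is only an illustration, not code from the thread: it assumes Python 3, where plain string literals are already Unicode and byte strings need a b prefix, and the variable names are made up for the example.

```python
import re

# Option 1: write the name as a text (Unicode) literal.
# In Python 3 a plain string literal is already Unicode, so no u prefix is needed.
name_literal = 'Anton\xedn Dvo\u0159\xe1k'

# Option 2: start from the UTF-8 encoded bytes and decode them to text.
name_bytes = b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
name_decoded = name_bytes.decode('utf-8')

# Same pattern as in the question. The (?u) flag is left out because
# Unicode matching is already the default for str patterns in Python 3.
surname_re = re.compile(r'.+ (\w+)')

surname_from_literal = surname_re.search(name_literal).group(1)
surname_from_decoded = surname_re.search(name_decoded).group(1)
print(surname_from_literal, surname_from_decoded)
```

Both routes should end up with the same text string, so the pattern sees one character per letter instead of one or two bytes per letter.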

Jeffrey said:
surname_re = re.compile(surname)
m = surname_re.search(name)
m.groups()
('Dvo\xc5',)

I suppose that there is an encoding problem, but I don't understand
Unicode well enough to know what to do to digest properly the Unicode
characters in the surname.

name = 'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'
re.compile(r"(?u).+ (\w+)").search(name.decode("utf-8")).groups()
(u'Dvo\u0159\xe1k',)
print _[0]
Dvořák

Peter
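
A possible explanation for why the original match stopped at 'Dvo\xc5': as far as I can tell, with (?u) on a byte string each byte is classified as if it were a character in the range 0 to 255. Under that assumption \xc5 counts as a letter ('Å') while \x99 is a control character, so \w+ stops between the two bytes of 'ř'. The sketch below imitates that behaviour in Python 3 by decoding the bytes as Latin-1, which maps every byte to the code point with the same number. The names are hypothetical and this is a reconstruction of the symptom, not a description of the re internals.

```python
import re

raw = b'Anton\xc3\xadn Dvo\xc5\x99\xc3\xa1k'

# Treat every byte as its own character: Latin-1 maps byte N to code point N.
# This mimics matching against the undecoded byte string (an assumption).
as_single_bytes = raw.decode('latin-1')

pattern = re.compile(r'.+ (\w+)')

# \w+ accepts 'D', 'v', 'o' and '\xc5' (a letter), then stops at '\x99',
# which is a control character and therefore not a word character.
truncated = pattern.search(as_single_bytes).group(1)

# Decoding with the real encoding gives one character per letter instead.
complete = pattern.search(raw.decode('utf-8')).group(1)

print(repr(truncated), repr(complete))
```

The first result has the same shape as the ('Dvo\xc5',) reported in the question, and the second is the full surname.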
 
