Python and Cyrillic characters in regular expression

phasma

Hi, I'm trying to extract all alphabetic characters from a string.

reg = re.compile('(?u)([\w\s]+)', re.UNICODE)
buf = reg.match(string)

But it doesn't work. If the string starts with a Cyrillic character,
everything works fine. But if the string starts with a Latin character,
match returns only the Latin characters.

Please help.
 
MRAB

phasma said:
Hi, I'm trying to extract all alphabetic characters from a string.

reg = re.compile('(?u)([\w\s]+)', re.UNICODE)

You don't need both (?u) and re.UNICODE: they mean the same thing.

This will actually match word characters (letters, digits and the
underscore) and whitespace, not just letters.
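
A small sketch of both points, for anyone who wants to check them. It is written for Python 3, which is an assumption on my part (the thread itself uses Python 2 syntax); in Python 3, str patterns are Unicode-aware by default, so neither spelling of the flag is strictly needed there. The sample string is made up for illustration.

```python
import re

# Inline flag and compile-time flag: two spellings of the same option.
inline = re.compile(r'(?u)([\w\s]+)')
explicit = re.compile(r'([\w\s]+)', re.UNICODE)
same_flags = inline.flags == explicit.flags

# \w covers letters, digits and the underscore; \s covers whitespace.
# The sample below is a hypothetical string, not one from the thread.
sample = "Привет_2 мир"
matched = inline.match(sample).group(1)

print(same_flags)
print(matched == sample)
```

Both compiled patterns report the same flags, and the whole sample is matched even though it contains a digit and an underscore.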
phasma said:
buf = reg.match(string)

But it doesn't work. If the string starts with a Cyrillic character,
everything works fine. But if the string starts with a Latin character,
match returns only the Latin characters.

I'm encoding the Unicode results as UTF-8 in order to print them, but
I'm not having a problem with it otherwise:

Program
=======
# -*- coding: utf-8 -*-
import re
reg = re.compile('(?u)([\w\s]+)')

found = reg.match(u"ya я")
print found.group(1).encode("utf-8")

found = reg.match(u"я ya")
print found.group(1).encode("utf-8")

Output
======
ya я
я ya
 
Fredrik Lundh

phasma said:
Hi, I'm trying to extract all alphabetic characters from a string.

reg = re.compile('(?u)([\w\s]+)', re.UNICODE)
buf = reg.match(string)

But it doesn't work. If the string starts with a Cyrillic character,
everything works fine. But if the string starts with a Latin character,
match returns only the Latin characters.

can you provide a few sample strings that show this behaviour?

</F>
 
phasma

string = u"Привет"
(u'\u041f\u0440\u0438\u0432\u0435\u0442',)

string = u"Hi.Привет"
(u'Hi',)
 
MRAB

string = u"Привет"

All the characters are letters.
(u'\u041f\u0440\u0438\u0432\u0435\u0442',)

string = u"Hi.Привет"

The third character isn't a letter and isn't whitespace.
 
Fredrik Lundh

phasma said:
string = u"Привет"
(u'\u041f\u0440\u0438\u0432\u0435\u0442',)

string = u"Hi.Привет"
(u'Hi',)

the [\w\s] pattern you used matches letters, numbers, underscore, and
whitespace. "." doesn't fall into that category, so the "match" method
stops when it gets to that character.
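
To see that behaviour in isolation, here is a minimal sketch. It assumes Python 3 (the thread uses Python 2), where str patterns are Unicode-aware without any flag; the third input, with a leading ".", is a made-up case added for illustration.

```python
import re

# Python 3 assumption: no (?u) needed for str patterns.
reg = re.compile(r'([\w\s]+)')

# All six characters are letters, so the whole string is matched.
cyrillic_first = reg.match("Привет").group(1)

# match() starts at position 0 and the run ends at ".", so only "Hi" is captured.
latin_first = reg.match("Hi.Привет").group(1)

# Hypothetical extra case: a leading "." means there is nothing to match at position 0.
dot_first = reg.match(".Привет")

print(cyrillic_first, latin_first, dot_first)
```

So the script of the first character is not what matters; the match simply ends at the first character outside [\w\s].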

maybe you could use re.sub or re.findall?
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'
>>> # find runs of alphanumeric characters
>>> re.findall("(?u)\w+", string)
[u'Hi', u'\u041f\u0440\u0438\u0432\u0435\u0442']
>>> "".join(re.findall("(?u)\w+", string))
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

(the "sub" example expects you to specify what characters you want to
skip, while "findall" expects you to specify what you want to keep.)
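
For comparison, a sketch of the two approaches side by side, again assuming Python 3. The exact re.sub call is not shown above, so the \W+ pattern here is an assumption, as are the variable names. The last line is a possible variant that keeps letters only, since \w also lets digits and the underscore through.

```python
import re

string = "Hi.Привет"

# sub: say what to skip (assumed pattern: every run of non-word characters).
dropped = re.sub(r"\W+", "", string)

# findall: say what to keep, then join the runs back together.
runs = re.findall(r"\w+", string)
kept = "".join(runs)

# Letters only: \w minus digits and the underscore (hypothetical input).
letters_only = "".join(re.findall(r"[^\W\d_]+", "Hi_42.Привет"))

print(dropped, runs, kept, letters_only)
```

Both routes give the same string here; which one reads better depends on whether the set to drop or the set to keep is easier to describe.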

</F>
 
