Identifying unicode punctuation characters with Python regex

Shiao

Hello,
I'm trying to build a regex in Python to identify punctuation
characters in all languages. Some regex implementations support an
extended syntax, \p{P}, that does just that. As far as I know, Python's re
module doesn't. Any idea of a possible alternative?

Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.

Thanks in advance for any suggestions.

John
 
Martin v. Löwis

You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".

Regards,
Martin
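
A minimal sketch of the approach Martin describes, assuming Python 3 and a scan of every code point up to sys.maxunicode. The names punctuation_class and PUNCT_RE are illustrative and do not come from the thread; the characters are passed through re.escape so that none of them is misread as character-class syntax.

```python
import re
import sys
import unicodedata


def punctuation_class():
    """Return a regex character class matching every Unicode punctuation character."""
    chars = [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        # General categories Pc, Pd, Ps, Pe, Pi, Pf and Po all start with "P".
        if unicodedata.category(chr(cp)).startswith("P")
    ]
    # re.escape keeps characters such as \ ] and - literal inside the class.
    return "[" + re.escape("".join(chars)) + "]"


PUNCT_RE = re.compile(punctuation_class())

found = PUNCT_RE.findall("Hello, world! 我是美國人。")  # [',', '!', '。']
```

Building the class once at import time and reusing the compiled pattern avoids rescanning the code point range on every call.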
 
Shiao

Thanks, Martin. I'll do this.
 
Shiao

You can always build your own pattern. Something like (Python 3.0rc2):

>>> import unicodedata
>>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r=re.compile('['+Po+']')
>>> x='我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']

-Mark

This was an interesting problem. You need to escape \ and ] to find all the
punctuation correctly, and it turns out those characters are adjacent in
the Unicode character set, so ] was coincidentally escaped in my first
attempt.

IDLE 3.0rc2
>>> import re
>>> import unicodedata as u
>>> A=''.join(chr(i) for i in range(65536))
>>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('['+P+']',A)) # ] was naturally escaped
490
>>> set(P)-set(re.findall('['+P+']',A)) # so only missing \
{'\\'}
>>> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.
>>> len(re.findall('['+P+']',A))
491

-Mark

Mark,
Many thanks. I feel almost ashamed I got away with it so easily :)
 
jhermann

P=P.replace('\\','\\\\').replace(']','\\]')   # escape both of them.

re.escape() does this without your code making any assumptions about the
regex implementation.
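
As a sketch of this suggestion, assuming Python 3: re.escape can be applied to the whole string of punctuation characters before it is wrapped in brackets, and the result checked the same way as in Mark's session. The variable names pattern and missing are illustrative.

```python
import re
import unicodedata

# Every character in the Basic Multilingual Plane, as in Mark's session.
A = "".join(chr(i) for i in range(65536))
P = "".join(c for c in A if unicodedata.category(c).startswith("P"))

# re.escape puts a backslash in front of \ and ], along with the other
# characters it considers special, so no manual replace() calls are needed.
pattern = re.compile("[" + re.escape(P) + "]")

# Characters in P that the pattern fails to match; expected to be empty.
missing = set(P) - set(pattern.findall(A))
```

The exact number of punctuation characters depends on the Unicode version bundled with the interpreter, so the check compares sets rather than relying on a fixed count.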
 
