Using re to find unicode ranges

Eric Abrahamsen

Is it possible to use the re module to find runs of characters within
a certain Unicode range?

I'm writing a Markdown extension to go over text and wrap blocks of
consecutive Chinese characters in <span class="char"></span> tags for
nice styling in an HTML page. The available hooks appear to be a pre-
processor (which is a "for line in lines" situation) or an inline
pattern (which uses regular expressions). The regular expression
solution would be much simpler and faster, but something tells me
there's no way to use a regex to find character ranges... Chinese
characters appear to fall between 19968 and 40959 using ord(), and I
suppose I can go that route if necessary, but I think it would be ugly.

Any hints or suggestions would be appreciated!

Eric
 
Paul McGuire

Eric -

This sounds similar to what zhpy
(http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy) does to extract
Chinese words from code, to generate executable English Python. You might
give that a look.

-- Paul
 
Eric Abrahamsen

# coding: utf-8
# Python 2 syntax (ur'' literals and the print statement)
import re
sample = u'My name is 马克. I am 美国人.'
for n in re.findall(ur'[\u4e00-\u9fff]+', sample):
    print n

Of course! And obvious, once you point it out. Thanks for the help.
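
For anyone landing on this thread later, here is a minimal Python 3 sketch of the same idea, extended to do the wrapping described in the original question. The names CJK_RUN and wrap_cjk are made up for illustration and are not part of re or Markdown. It assumes that "Chinese characters" means only the U+4E00 to U+9FFF block (19968 to 40959 in ord() terms); ideographs in the extension blocks and CJK punctuation fall outside that range and would not be matched.

```python
import re

# U+4E00..U+9FFF is the range quoted in the question as 19968..40959.
# CJK_RUN is an illustrative name, not something defined by the re module.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')


def wrap_cjk(text, css_class='char'):
    """Wrap each run of consecutive characters in the range in a <span>."""
    return CJK_RUN.sub(
        lambda m: '<span class="%s">%s</span>' % (css_class, m.group(0)),
        text,
    )


sample = 'My name is 马克. I am 美国人.'
print(CJK_RUN.findall(sample))  # the individual runs
print(wrap_cjk(sample))         # the same text with each run wrapped
```

The same pattern string could presumably be handed to the Markdown inline-pattern hook mentioned above, though how that hook expects its pattern is not shown here.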


This sounds similar to what zhpy
(http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy) does to extract
Chinese words from code, to generate executable English Python. You might
give that a look.
--Mark

Mark - not quite what I'm after here, but pretty interesting
nonetheless...

E
 
