Using re to find unicode ranges

Eric Abrahamsen · Sep 29, 2008

Is it possible to use the re module to find runs of characters within
a certain Unicode range?

I'm writing a Markdown extension to go over text and wrap blocks of
consecutive Chinese characters in tags for
nice styling in an HTML page. The available hooks appear to be a
pre-processor (which is a "for line in lines" situation) or an inline
pattern (which uses regular expressions). The regular expression
solution would be much simpler and faster, but something tells me
there's no way to use a regex to find character ranges... Chinese
characters appear to fall between 19968 and 40959 using ord(), and I
suppose I can go that route if necessary, but I think it would be ugly.
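
For comparison, the ord() route mentioned above could look roughly like the sketch below. It assumes Python 3 and the 19968 to 40959 range quoted in the post (U+4E00 to U+9FFF); the names is_cjk and chinese_runs are made up for illustration and do not come from any library.

```python
from itertools import groupby

# Range quoted in the post: 19968..40959, i.e. U+4E00..U+9FFF.
CJK_FIRST = 0x4E00
CJK_LAST = 0x9FFF


def is_cjk(ch):
    """Return True if the character's code point falls inside the range."""
    return CJK_FIRST <= ord(ch) <= CJK_LAST


def chinese_runs(text):
    """Collect each run of consecutive in-range characters (hypothetical helper)."""
    return ["".join(group) for in_range, group in groupby(text, key=is_cjk) if in_range]
```

It works, but every character is visited in Python code, which is the "ugly" part a regular expression avoids.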

Any hints or suggestions would be appreciated!

Eric

Paul McGuire · Sep 29, 2008

Eric -

This sounds similar to what zhpy
(http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy) does to extract
Chinese words from code, to generate executable English Python. You might
give that a look.

-- Paul

Eric Abrahamsen · Sep 30, 2008

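# coding: utf-8
import re
# Python 2 syntax: the ur'' prefix still interprets \u escapes.
sample = u'My name is 马克. I am 美国人.'
# Each match is one run of consecutive characters in U+4E00..U+9FFF.
for n in re.findall(ur'[\u4e00-\u9fff]+', sample):
    print n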
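
On Python 3 the ur'' prefix is gone and every str is Unicode, so the same character class can be written as a plain raw string (the re module has understood \uXXXX escapes in str patterns since Python 3.3). The sketch below also shows how re.sub could do the wrapping the original question asks about. The wrap_runs helper, the span tag and the "zh" class are placeholders chosen for illustration; the thread does not say which tag is actually used.

```python
import re

# Same range as in the snippet above: U+4E00..U+9FFF.
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')

sample = 'My name is 马克. I am 美国人.'

# Each match is one run of consecutive characters in the range.
runs = CJK_RUN.findall(sample)


def wrap_runs(text, tag='span', css_class='zh'):
    """Wrap every matched run in an HTML tag (tag and class are hypothetical)."""
    def repl(match):
        return '<%s class="%s">%s</%s>' % (tag, css_class, match.group(0), tag)
    return CJK_RUN.sub(repl, text)


wrapped = wrap_runs(sample)
```

Inside a Markdown inline pattern, the same character class would presumably serve as the pattern itself, with the wrapping done by whatever the extension hook returns.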

Of course! And obvious, once you point it out. Thanks for the help.

This sounds similar to what zhpy
(http://pyparsing.wikispaces.com/WhosUsingPyparsing#Zhpy) does to extract
Chinese words from code, to generate executable English Python. You might
give that a look.
--Mark

Mark - not quite what I'm after here, but pretty interesting
nonetheless...

E
