unicode "em space" in regex

Xah Lee

how to represent the unicode "em space" in regex?

e.g. I want to do something like this:

fracture=re.split(r'\342371*\|\342371*',myline,re.U)

Xah
∑ http://xahlee.org/
 
Klaus Alexander Seistrup

Xah Lee :
how to represent the unicode "em space" in regex?

e.g. I want to do something like this:

fracture=re.split(r'\342371*\|\342371*',myline,re.U)

I'm not sure what you're trying to do, but would it help you to use
its name:

>>> EM_SPACE = u'\N{EM SPACE}'
>>> fracture = myline.split(EM_SPACE)

?

Cheers,
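
For readers on Python 3, where every str is already Unicode and the u prefix is optional, the named-escape approach above could look roughly like this. The sample value of myline is made up for illustration; it is not from the original post.

```python
# Python 3 sketch; `myline` is a hypothetical sample value.
EM_SPACE = '\N{EM SPACE}'           # the same character as '\u2003'

myline = 'alpha\u2003beta\u2003gamma'
fracture = myline.split(EM_SPACE)   # plain str.split, no regex involved

print(fracture)
```

This only covers splitting on the em space itself. The pattern in the question also involves a "|" separator, which needs a regex or a second pass.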
 
Guest

Xah said:
how to represent the unicode "em space" in regex?

You will have to pass a Unicode literal as the regular expression,
e.g.

fracture=re.split(u'\u2003*\\|\u2003*',myline,re.U)

Notice that, in raw Unicode literals, you can still use \u to
escape characters, e.g.

fracture=re.split(ur'\u2003*\|\u2003*',myline,re.U)

Regards,
Martin
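
A rough Python 3 equivalent of the two calls above, offered as a sketch rather than a drop-in replacement. Two things differ in current Python: the ur'' prefix no longer exists, but the re module itself understands \uXXXX in str patterns, so a plain raw string works; and the third positional parameter of re.split is maxsplit, so flags are better passed by keyword (flags=...). The sample value of myline is an assumption.

```python
import re

# Hypothetical sample: fields separated by "|" with optional em-space padding.
myline = 'one\u2003|\u2003two\u2003\u2003|three'

# Raw string: the re module, not the string literal, interprets \u2003.
# \| is a literal vertical bar.
pattern = r'\u2003*\|\u2003*'

# No flags are needed for this pattern. If they were, they would go in
# by keyword, because the third positional argument is maxsplit.
fracture = re.split(pattern, myline)

print(fracture)
```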
 
Xah Lee

Thanks. Is it true that any Unicode character can also be used literally
inside a regex?

e.g.
re.search(ur' +',mystring,re.U)

I tested this case and apparently I can. But is it true that any
Unicode char can be embedded in a regex literally? (Does this apply to
the esoteric ones, such as other non-printing chars and combining
forms?)

----
Related...:

The official python doc:
http://python.org/doc/2.4.1/lib/module-re.html
says:

"Regular expression pattern strings may not contain null bytes, but can
specify the null byte using the \number notation."

What is meant by null bytes here? Unprintable chars?? and the "\number"
is meant to be decimal? and in what encoding?

Xah
∑ http://xahlee.org/
 
Fredrik Lundh

Xah said:
"Regular expression pattern strings may not contain null bytes, but can
specify the null byte using the \number notation."

What is meant by null bytes here? Unprintable chars??

no, null bytes. "\0". "\x00". ord(byte) == 0. chr(0).

and the "\number" is meant to be decimal?

octal. this is explained on the "Regular Expression Syntax" page.

and in what encoding?

null byte encoding? you're confused.

</F>
 
Reinhold Birkenfeld

Xah said:
"Regular expression pattern strings may not contain null bytes, but can
specify the null byte using the \number notation."

What is meant by null bytes here? Unprintable chars?? and the "\number"
is meant to be decimal? and in what encoding?

The null byte is a byte with the integer value 0. Difficult, isn't it.

The \number notation is, as you could read in http://docs.python.org/ref/strings.html,
octal.

Reinhold
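
To make the two answers above concrete, here is a small Python 3 sketch. It checks that the spellings of the null character listed above are the same thing, and that an octal \number escape inside a pattern matches it. The sample string data is hypothetical.

```python
import re

# The spellings mentioned above all denote the character with value 0.
same = ('\0' == '\x00' == chr(0)) and ord('\0') == 0

data = 'abc\x00def'   # hypothetical sample containing one null character

# r'\000' reaches the re module as a backslash plus three octal digits,
# which re reads as the character with that octal value, here 0.
match = re.search(r'\000', data)

print(same, match.start())
```

Note that \number in a pattern is read as octal only when it starts with 0 or is three octal digits long; otherwise re treats it as a group backreference.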
 
Guest

Xah said:
Thanks. Is it true that any Unicode character can also be used literally
inside a regex?

e.g.
re.search(ur' +',mystring,re.U)

I tested this case and apparently i can.

Yes. In fact, whether you write u"\u2003" or u" " (with a literal
em space between the quotes) doesn't matter to re.search. Either way
you get a Unicode object with U+2003 in it, which is processed by SRE.

But is it true that any
Unicode char can be embedded in a regex literally? (Does this apply to
the esoteric ones, such as other non-printing chars and combining
forms?)

Yes. To SRE, only the Unicode ordinal values matter. To determine
whether something matches, it needs to have the same ordinal value
in the string that you have in the expression. No interpretation
of the character is performed, except for the few characters that
have markup meaning in regular expressions (e.g. $, \, [, etc.).

Regards,
Martin
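
The "only ordinal values matter" point can be illustrated with a short Python 3 sketch. The sample strings are assumptions chosen for the demonstration: a precomposed "é" and an "e" followed by a combining accent render alike but have different ordinals, so one does not match the other.

```python
import re

composed = '\u00e9'       # "é" as a single code point
decomposed = 'e\u0301'    # "e" followed by COMBINING ACUTE ACCENT

# Same rendered glyph, different ordinals, so the pattern does not match.
no_match = re.search(composed, decomposed)

# A combining character can sit in a pattern like any other character.
combining = re.search('\u0301', decomposed)

# Characters with regex meaning are the exception; re.escape neutralizes them.
dollar = re.search(re.escape('$'), 'cost: $5')

print(no_match, combining.start(), dollar.start())
```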
 
