Extracting "true" words

candide

Back again with my study of regular expressions ;) There is a
special character class for alphanumeric extraction, \w (BTW, what
does the letter 'w' refer to?). But it doesn't allow extracting true
words; by "true" I mean words composed only of _alphabetic_ letters
(no digits or underscores).


So I was wondering what pattern extracts (or matches) _true_ words?
Of course, I'm not restricting myself to the ASCII universe, so the
pattern [a-zA-Z]+ doesn't meet my needs.
 
Chris Rebert

> Back again with my study of regular expressions ;) There is a special
> character class for alphanumeric extraction, \w (BTW, what does the
> letter 'w' refer to?).

"Word" presumably/intuitively; hence the non-standard "[:word:]"
POSIX-like character class alias for \w in some environments.

> But it doesn't allow extracting true words; by "true" I mean words
> composed only of _alphabetic_ letters (no digits or underscores).

Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
And what of hyphenated terms (e.g. "re-lock")?

> So I was wondering what pattern extracts (or matches) _true_ words?
> Of course, I'm not restricting myself to the ASCII universe, so the
> pattern [a-zA-Z]+ doesn't meet my needs.

AFAICT, there doesn't appear to be a nice way to do this in Python
using the std lib `re` module, but I'm not a regex guru.
POSIX character classes are unsupported, which rules out "[:alpha:]".
\w can be made Unicode/locale-sensitive, but includes digits and the
underscore, as you've already pointed out.
\p (Unicode property/block testing), which would allow for
"\p{Alphabetic}" or similar, is likewise unsupported.

Cheers,
Chris
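
For reference, a workaround that is not mentioned in this thread but is commonly used with the standard `re` module is a negated class built from \w itself. The sketch below assumes Python 3, where str patterns are Unicode-aware by default; the name `extract_words` is only illustrative. It is an approximation of "letters only": \d covers decimal digits, so a few numeric characters that count as alphanumeric but not decimal (superscript digits, for instance) can still slip through.

```python
import re

# [^\W\d_] reads as: not a non-word character, not a digit, not an underscore.
# What remains of \w is (approximately) the set of letters.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def extract_words(text):
    """Return maximal runs of letter-like characters found in text."""
    return LETTER_RUN.findall(text)


print(extract_words("Élodie a 2 chats_noirs et un vélo"))
```

Note that the underscore in "chats_noirs" acts as a separator here, so the two halves come back as separate words.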
 
MRAB

> Back again with my study of regular expressions ;) There is a
> special character class for alphanumeric extraction, \w (BTW, what
> does the letter 'w' refer to?). But it doesn't allow extracting true
> words; by "true" I mean words composed only of _alphabetic_ letters
> (no digits or underscores).

The 'w' refers to a 'word' character, although in a regex it covers
letters, digits and the underscore character '_' because of its use in
computer languages (basically, the characters of an identifier or name).

> So I was wondering what pattern extracts (or matches) _true_ words?
> Of course, I'm not restricting myself to the ASCII universe, so the
> pattern [a-zA-Z]+ doesn't meet my needs.

Using the re module, you would have to create a character class out of
all the possible letters, something like this:

# Python 2 (unichr); covers only code points below 0x10000.
letter_class = u"[" + u"".join(
    unichr(c) for c in range(0x10000) if unichr(c).isalpha()) + u"]"
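
A Python 3 adaptation of the same idea might look like the sketch below. It is an assumption-laden port rather than MRAB's own code: `chr` replaces `unichr`, and the names `build_letter_class` and `LETTER_RUN` are made up for illustration. No escaping is applied because none of the characters that are special inside a character class (`]`, `\`, `^`, `-`) pass `str.isalpha()`.

```python
import re
import sys


def build_letter_class(limit=sys.maxunicode + 1):
    """Build a regex character class listing every alphabetic code point below limit."""
    letters = "".join(chr(c) for c in range(limit) if chr(c).isalpha())
    return "[" + letters + "]"


# Same range as the snippet above: the Basic Multilingual Plane only.
LETTER_RUN = re.compile(build_letter_class(0x10000) + "+")

print(LETTER_RUN.findall("café 42 naïve_x"))
```

The resulting class contains tens of thousands of characters, so it is worth compiling once and reusing the compiled pattern.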

Alternatively, you could try the new regex implementation here:

http://pypi.python.org/pypi/regex

which adds support for Unicode properties, and do something like this:

import regex
words = regex.findall(ur"\p{Letter}+", unicode_text)
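
Under Python 3 the `ur""` prefix no longer exists, so the call would need a plain raw string. The following is a minimal sketch, assuming the third-party `regex` package is installed; the wrapper name `extract_words` and the guarded import are illustrative choices, not part of the package.

```python
try:
    import regex  # third-party package, installed separately from PyPI
except ImportError:  # keep this sketch importable when the package is absent
    regex = None


def extract_words(text):
    """Return maximal runs of Unicode letters using the \\p{Letter} property."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is required")
    return regex.findall(r"\p{Letter}+", text)
```

With the package available, `extract_words("café 42 naïve_x")` should give the three letter runs and skip the digits and the underscore.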
 
John Nagle

>> Back again with my study of regular expressions ;) There is a special
>> character class for alphanumeric extraction, \w (BTW, what does the
>> letter 'w' refer to?).
>
> "Word" presumably/intuitively; hence the non-standard "[:word:]"
> POSIX-like character class alias for \w in some environments.
>
>> But it doesn't allow extracting true words; by "true" I mean words
>> composed only of _alphabetic_ letters (no digits or underscores).
>
> Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
> And what of hyphenated terms (e.g. "re-lock")?

It's an interesting parsing problem to find word breaks in mixed
language text. It's quite common to find English and Japanese text
mixed. (See "http://www.dokidoki6.com/00_index1.html". Caution,
excessively cute.) Each ideograph is a "word", of course.

Parse this into words:

★12/25/2009★
6%DOKIDOKI VISUAL FILE vol.4を公開しました。
アルバムの上部で再生操作、下部でサムネイルがご覧いただけます。

John Nagle
 
candide

On 02/04/2011 01:10, Chris Rebert wrote:
> "Word" presumably/intuitively; hence the non-standard "[:word:]"
> POSIX-like character class alias for \w in some environments.

OK

> Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?

Yes, CJK ideographs don't belong to the locale I'm working with ;)

> And what of hyphenated terms (e.g. "re-lock")?

I'm only interested in ASCII letters and ASCII letters with diacritics.


Thanks for your response.
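
The narrower requirement stated above (ASCII letters, bare or with diacritics) can be approximated with the standard `unicodedata` module. This is one possible interpretation, sketched for Python 3 with illustrative function names: a character is accepted if its canonical decomposition is an ASCII letter followed only by combining marks. Letters with no such decomposition (for example œ, æ, ø, ß) are rejected under this rule, which may or may not be what is wanted.

```python
import string
import unicodedata
from itertools import groupby

ASCII_LETTERS = frozenset(string.ascii_letters)


def is_accented_ascii_letter(ch):
    """True for an ASCII letter, bare or carrying combining diacritics."""
    # NFD splits e.g. "é" into "e" followed by U+0301 (combining acute accent).
    base, *marks = unicodedata.normalize("NFD", ch)
    return base in ASCII_LETTERS and all(
        unicodedata.category(m) == "Mn" for m in marks)


def extract_words(text):
    """Return maximal runs of characters accepted by the predicate above."""
    # Compose first so that "e" + combining accent is seen as the single "é".
    composed = unicodedata.normalize("NFC", text)
    return ["".join(run)
            for accepted, run in groupby(composed, key=is_accented_ascii_letter)
            if accepted]


print(extract_words("Déjà vu: 2 cafés, Ελληνικά, naïve_x"))
```

Greek letters, digits and the underscore all act as separators here, so only the Latin-based words are returned.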
 