Extracting "true" words

Discussion in 'Python' started by candide, Apr 1, 2011.

  1. candide

    candide Guest

    Back again with my study of regular expressions ;) There is a
    special sequence for matching alphanumeric characters, \w (BTW,
    what does the letter 'w' stand for?). But it can't be used to
    extract true words; by "true" I mean words composed only of
    _alphabetic_ letters (no digits or underscores).


    So I was wondering what pattern would extract (or match) _true_
    words? Of course, I'm not restricting myself to the ASCII universe, so
    the pattern [a-zA-Z]+ doesn't meet my needs.
    candide, Apr 1, 2011
    #1

  2. Chris Rebert

    Chris Rebert Guest

    On Fri, Apr 1, 2011 at 1:55 PM, candide wrote:
    > Back again with my study of regular expressions ;) There is a special
    > sequence for matching alphanumeric characters, \w (BTW, what does the
    > letter 'w' stand for?).


    "Word" presumably/intuitively; hence the non-standard "[:word:]"
    POSIX-like character class alias for \w in some environments.

    > But it can't be used to extract true words; by "true" I mean words
    > composed only of _alphabetic_ letters (no digits or underscores).


    Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
    And what of hyphenated terms (e.g. "re-lock")?

    > So I was wondering what pattern would extract (or match) _true_ words?
    > Of course, I'm not restricting myself to the ASCII universe, so the
    > pattern [a-zA-Z]+ doesn't meet my needs.


    AFAICT, there doesn't appear to be a nice way to do this in Python
    using the std lib `re` module, but I'm not a regex guru.
    POSIX character classes are unsupported, which rules out "[:alpha:]".
    \w can be made Unicode/locale-sensitive, but includes digits and the
    underscore, as you've already pointed out.
    \p (Unicode property/block testing), which would allow for
    "\p{Alphabetic}" or similar, is likewise unsupported.
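
    One workaround with the standard `re` module, not suggested by the
    posters here, is a negated class: `[^\W\d_]` matches any character
    that is a word character but neither a decimal digit nor an
    underscore. A minimal sketch, assuming Python 3 (where `str`
    patterns are Unicode-aware by default); the name `extract_words` is
    only illustrative. Since \w also covers a few numeric characters
    that \d does not (superscript digits, for instance), treat this as
    an approximation of "letters only":

```python
import re

# [^\W\d_] reads as "not (non-word, decimal digit, or underscore)",
# i.e. a word character that is neither a decimal digit nor "_".
WORD_RE = re.compile(r"[^\W\d_]+")


def extract_words(text):
    """Return the runs of letter-like characters found in text."""
    return WORD_RE.findall(text)


print(extract_words("D\u00e9j\u00e0 vu, h\u00e9llo_world 123 abc42"))
```

    Note that the underscore and the digits act as separators here, so
    "héllo_world" yields two words and "abc42" yields "abc".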

    Cheers,
    Chris
    --
    http://blog.rebertia.com
    Chris Rebert, Apr 2, 2011
    #2

  3. MRAB

    MRAB Guest

    On 01/04/2011 21:55, candide wrote:
    > Back again with my study of regular expressions ;) There is a
    > special sequence for matching alphanumeric characters, \w (BTW,
    > what does the letter 'w' stand for?). But it can't be used to
    > extract true words; by "true" I mean words composed only of
    > _alphabetic_ letters (no digits or underscores).
    >

    The 'w' refers to a 'word' character, although in regex it refers to
    letters, digits and the underscore character '_' due to its use in
    computer languages (basically, the characters of an identifier or name).
    >
    > So I was wondering what pattern would extract (or match) _true_
    > words? Of course, I'm not restricting myself to the ASCII universe, so
    > the pattern [a-zA-Z]+ doesn't meet my needs.
    >

    Using the re module, you would have to create a character class out of
    all the possible letters, something like this:

    letter_class = u"[" + u"".join(unichr(c) for c in range(0x10000) if
    unichr(c).isalpha()) + u"]"
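
    The same character-class idea, sketched for Python 3 (where `chr`
    replaces `unichr` and the `u` prefix is unnecessary). The names
    `build_letter_class` and `LETTER_RE` are illustrative, and the
    default limit of 0x10000 is an assumption carried over from the
    snippet above, so letters outside the Basic Multilingual Plane are
    not covered:

```python
import re


def build_letter_class(limit=0x10000):
    """Build a regex character class listing every alphabetic code point
    below limit, as reported by str.isalpha()."""
    letters = "".join(chr(c) for c in range(limit) if chr(c).isalpha())
    # Alphabetic characters never include regex metacharacters such as
    # "]", "\\", "^" or "-", so no escaping is needed inside the class.
    return "[" + letters + "]"


# Building and compiling this large class takes a moment,
# so do it once and reuse the compiled pattern.
LETTER_RE = re.compile(build_letter_class() + "+")

print(LETTER_RE.findall("na\u00efve caf\u00e9 x1_y"))
```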

    Alternatively, you could try the new regex implementation here:

    http://pypi.python.org/pypi/regex

    which adds support for Unicode properties, and do something like this:

    words = regex.findall(ur"\p{Letter}+", unicode_text)
    MRAB, Apr 2, 2011
    #3
  4. John Nagle

    John Nagle Guest

    On 4/1/2011 4:10 PM, Chris Rebert wrote:
    > On Fri, Apr 1, 2011 at 1:55 PM, candide wrote:
    >> Back again with my study of regular expressions ;) There is a special
    >> sequence for matching alphanumeric characters, \w (BTW, what does the
    >> letter 'w' stand for?).

    >
    > "Word" presumably/intuitively; hence the non-standard "[:word:]"
    > POSIX-like character class alias for \w in some environments.
    >
    >> But it can't be used to extract true words; by "true" I mean words
    >> composed only of _alphabetic_ letters (no digits or underscores).

    >
    > Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
    > And what of hyphenated terms (e.g. "re-lock")?


    It's an interesting parsing problem to find word breaks in mixed
    language text. It's quite common to find English and Japanese text
    mixed. (See "http://www.dokidoki6.com/00_index1.html". Caution,
    excessively cute.) Each ideograph is a "word", of course.

    Parse this into words:

    ★12/25/2009★
    6%DOKIDOKI VISUAL FILE vol.4を公開しました。
    アルバムの上部で再生操作、下部でサムネイルがご覧いただけます。

    John Nagle
    John Nagle, Apr 2, 2011
    #4
  5. candide

    candide Guest

    On 02/04/2011 01:10, Chris Rebert wrote:

    > "Word" presumably/intuitively; hence the non-standard "[:word:]"
    > POSIX-like character class alias for \w in some environments.


    OK


    > Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?


    Yes, CJK ideographs don't belong to the locale I'm working with ;)


    > And what of hyphenated terms (e.g. "re-lock")?



    I'm interested only in ASCII letters and ASCII letters with diacritics.
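
    That narrower requirement can be approximated with the standard
    `unicodedata` module: a character qualifies if its canonical
    decomposition (NFD) is an ASCII letter followed only by combining
    marks. This is a rough sketch for Python 3; the helper names are
    illustrative, and letters with no such decomposition (for example
    "ø", "ß" or "æ") are treated as non-letters under this assumption:

```python
import string
import unicodedata
from itertools import groupby


def is_ascii_letter_with_diacritics(ch):
    """True for an ASCII letter, optionally carrying combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0], decomposed[1:]
    # combining() returns a nonzero class for combining marks.
    return base in string.ascii_letters and all(
        unicodedata.combining(m) for m in marks
    )


def extract_words(text):
    """Return maximal runs of qualifying characters in text."""
    # Compose first, so "e" + COMBINING ACUTE ACCENT is seen as one "é".
    text = unicodedata.normalize("NFC", text)
    return [
        "".join(run)
        for is_letter, run in groupby(text, key=is_ascii_letter_with_diacritics)
        if is_letter
    ]


print(extract_words("Caf\u00e9 d\u00e9j\u00e0-vu 42 \u65e5\u672c x_y"))
```

    With this rule the hyphen, digits, underscore and CJK ideographs
    all act as word separators.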


    Thanks for your response.
    candide, Apr 2, 2011
    #5
