Regular expression for matching IPA characters in Unicode?

Discussion in 'Python' started by Mickel Grönroos, Oct 11, 2004.

  1. Hi Pythoneers,

    Which is the best way of checking that a given unicode string only
    contains IPA characters, e.g. characters in the range \u0250-\u02AF?
    I guess a regular expression would do it, just can't figure out how to
    implement that expression.

    Code snippets are most welcome.

    Best regards,

    Mickel Grönroos

    --
    Mickel Grönroos, application specialist, linguistics, CSC
    PL 405 (Tekniikantie 15 a D), 02101 Espoo, Finland,
    CSC is the Finnish IT center for science, www.csc.fi
     
    Mickel Grönroos, Oct 11, 2004
    #1

  2. Mickel Grönroos wrote:

    > Which is the best way of checking that a given unicode string only
    > contains IPA characters, e.g. characters in the range \u0250-\u02AF?


    Well, I'll give you an example that only includes characters in the
    range [\u0250, \u02AF], but those are just the IPA *extensions*. You also
    need to include basic Latin and Greek characters from other blocks.

    See: http://www.unicode.org/charts/PDF/U0250.pdf

    And why do you want to do this anyway?

    This example uses all(), a recipe from the itertools documentation, which
    tells you whether a predicate is true for every item in an iterable. The
    predicate here is whether the item is contained in IPA_CHARS, which you
    can expand...

    =====

    import itertools
    from sets import Set # set() is a built-in in 2.4

    IPA_CHARS = Set(map(unichr, xrange(0x250, 0x2b0)))

    def all(seq, pred=bool):
        # http://www.python.org/doc/current/lib/itertools-example.html
        "Returns True if pred(x) is True for every element in the iterable"
        return False not in itertools.imap(pred, seq)

    def is_ipa(iterable):
        return all(iterable, IPA_CHARS.__contains__)

    print is_ipa(u"aeiou") # this is valid IPA, but not in the extensions block
    print is_ipa(u"\u0260\u02af") # valid IPA in the extensions block

    ====output===

    False
    True
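
For readers on Python 3, the same idea might be sketched as follows. Here all() is a built-in, chr() and range() stand in for unichr() and xrange(), and str is already Unicode. The name EXTRA_IPA_CHARS and the few characters in it are illustrative assumptions only, not a complete inventory of IPA letters outside the extensions block.

```python
# IPA Extensions block, U+0250 through U+02AF inclusive.
IPA_EXTENSIONS = frozenset(map(chr, range(0x250, 0x2B0)))

# Hypothetical, deliberately incomplete sample of IPA letters that live
# in other blocks (basic Latin, Latin-1 Supplement, Latin Extended-A, Greek).
EXTRA_IPA_CHARS = frozenset("abcdefghijklmnopqrstuvwxyz" "æçðøħŋœβθχ")


def is_ipa(text, allowed=IPA_EXTENSIONS):
    """Return True if every character of text is in the allowed set."""
    return all(ch in allowed for ch in text)


print(is_ipa("aeiou"))          # False: not in the extensions block
print(is_ipa("\u0260\u02af"))   # True: both are in the extensions block
print(is_ipa("aeiou", IPA_EXTENSIONS | EXTRA_IPA_CHARS))  # True
```

Note that an empty string passes this check, since all() is True for an empty iterable; add an explicit length test if that is not wanted.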
    --
    Michael Hoffman
     
    Michael Hoffman, Oct 11, 2004
    #2

  3. Mickel Grönroos wrote:
    > Which is the best way of checking that a given unicode string only
    > contains IPA characters, e.g. characters in the range \u0250-\u02AF?


    The regular expression for that is [\u0250-\u02AF]. You can either make
    the regular expression a Unicode string itself, or you can make it a
    normal (byte) string, and put the backslash-u-number sequence into it
    (e.g. with double-backslash quotation).
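
A minimal sketch of how that character class could be used to test a whole string, written for Python 3 (where str is Unicode and re.fullmatch is available) and not for the Python 2 of this thread. The class on its own matches a single character, so the sketch adds a quantifier and matches against the entire string. The names IPA_EXTENSIONS_RE and is_ipa_extensions_only are illustrative assumptions.

```python
import re

# One or more characters from the IPA Extensions block, and nothing else.
IPA_EXTENSIONS_RE = re.compile(r"[\u0250-\u02AF]+")


def is_ipa_extensions_only(text):
    """Return True if text is non-empty and lies wholly in U+0250..U+02AF."""
    return IPA_EXTENSIONS_RE.fullmatch(text) is not None


print(is_ipa_extensions_only("\u0260\u02af"))  # True
print(is_ipa_extensions_only("aeiou"))         # False
print(is_ipa_extensions_only("\u0260x"))       # False
```

Using "+" rejects the empty string; "*" would accept it. Further ranges or single characters can be added inside the brackets to cover IPA letters from other blocks.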

    Regards,
    Martin
     
    Martin v. Löwis, Oct 12, 2004
    #3
