Pattern matching French accented characters

Discussion in 'Ruby' started by Thomas Luedeke, Mar 1, 2011.

  1. I am writing a French conjugation testing script, and a significant
    problem I have run into is how to pattern match the accented characters
    used in the French language. For example, é, à, è, î, ï, etc.

    I've tried a number of approaches, but can't seem to make it work.
    After some research on the Internet, it may require a UTF-8 approach,
    but I am not familiar with it.

    As an example, assume I want to directly pattern match the French verb
    haïr, and distinguish it from other verbs ending in -ir. How would I do
    this?

    Thanks in advance.

    TPL

    --
    Posted via http://www.ruby-forum.com/.
    Thomas Luedeke, Mar 1, 2011
    #1

  2. Thomas Luedeke

    7stud -- Guest

    If you are not familiar with unicode, and you want to match utf-8
    characters, then you better start reading some unicode tutorials. If
    you are already familiar with unicode in general, then in ruby you can
    set the $KCODE variable to 'U' for UTF-8, and then you can require the
    jcode standard library, which will change the way regexes work--they
    will match characters rather than single bytes.

    See here:

    http://blog.grayproductions.net/articles/the_kcode_variable_and_jcode_library

    --
    Posted via http://www.ruby-forum.com/.
    7stud --, Mar 1, 2011
    #2

  3. Thomas Luedeke

    7stud -- Guest

    7stud -- wrote in post #984785:
    > If you are not familiar with unicode, and you want to match utf-8
    > characters, then you better start reading some unicode tutorials. If
    > you are already familiar with unicode in general, then in ruby you can
    > set the $KCODE variable to 'U' for UTF-8, and then you can require the
    > jcode standard library, which will change the way regexes work--they
    > will match characters rather than single bytes.
    >


    Uhhmm...you don't need to require 'jcode' to make regexes match
    characters rather than bytes--just set $KCODE = 'U' (or 'UTF-8'). The
    jcode library just gives you some methods like jsize to get the
    character length rather than the byte length, which is what String#size
    returns.

    As an alternative, you can set the /u flag for a regex to make it match
    characters rather than bytes.
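
    For readers on Ruby 1.9 or later, here is a minimal sketch of the match
    asked about at the top of the thread. It assumes a UTF-8 source file and
    strings that hold ï as a single precomposed character (U+00EF); the verb
    list is made up for illustration. On those versions strings carry their
    own encoding, so $KCODE is no longer needed, and the /u flag simply marks
    the pattern as UTF-8.

```ruby
# encoding: utf-8

# Hypothetical sample data, not from the original post.
verbs = %w[haïr finir choisir partir]

# Match the exact verb; \A and \z anchor to the start and end of the string.
hair_pattern = /\Ahaïr\z/u

# Match any verb that ends in ï + r.
trema_ir = /ïr\z/u

# Match verbs that end in a plain i + r. The ï in "haïr" is a different
# character from i, so "haïr" is not selected here.
plain_ir = /ir\z/u

exact   = verbs.grep(hair_pattern)  # => ["haïr"]
trema   = verbs.grep(trema_ir)      # => ["haïr"]
regular = verbs.grep(plain_ir)      # => ["finir", "choisir", "partir"]

puts "exact:   #{exact.inspect}"
puts "trema:   #{trema.inspect}"
puts "regular: #{regular.inspect}"
```

    If the input might store ï as i followed by a combining diaeresis
    (U+0308), the patterns above would miss it; on Ruby 2.2 or later,
    calling String#unicode_normalize(:nfc) on the input first is one way
    to handle that.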

    --
    Posted via http://www.ruby-forum.com/.
    7stud --, Mar 1, 2011
    #3
  4. Thomas Luedeke

    7stud -- Guest

    7stud -- wrote in post #984789:
    > 7stud -- wrote in post #984785:
    >> If you are not familiar with unicode, and you want to match utf-8
    >> characters, then you better start reading some unicode tutorials.


    Here is a short one, 'unicode in three rules':

    1) Unicode assigns an integer to every letter in every alphabet in the
    world. Currently, there are something like 100,000 letters.

    2) Now the question becomes: what is the best way to store those unicode
    integers (which represent characters) on a computer? The way in which
    you decide to store a unicode integer on a computer is called an
    "encoding".

    For instance, you could use 4 bytes to store each unicode integer. In
    that system, a series of unicode integers is very easy for ruby to
    parse: every 4 bytes represents one unicode integer (which in turn
    represents one character). If ruby blindly reads 4 byte chunks, then
    each 4 byte chunk will be one unicode integer.

    But you don't need 4 bytes to store, say, the unicode integer 60 because
    three of those bytes would be empty. In fact, for all unicode integers
    under 256 (which correspond to the letters in the Western alphabet),
    three out of the four bytes would always be empty. Enter the UTF-8
    encoding.

    3) The UTF-8 encoding uses a variable number of bytes to store unicode
    integers on your computer. For unicode integers below 128, UTF-8 uses
    1 byte, and for larger unicode integers it uses 2, 3, or 4 bytes. But
    then how does ruby know how many bytes it should read for each unicode
    integer? Well, UTF-8 has a tricky way of signaling this: the first byte
    of each character indicates how many bytes belong to it. As long as you
    tell ruby that it is reading unicode integers stored in the UTF-8
    format, ruby will be able to sort out where one unicode integer ends
    and the next one begins--even though each unicode
    in
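
    The three rules can be checked directly in Ruby. This is a small sketch
    that assumes Ruby 1.9 or later (not all of these methods exist in 1.8)
    and a UTF-8 source file; it uses the verb from the original question.

```ruby
# encoding: utf-8

word = "haïr"

# Rule 1: every character has a unicode integer (code point).
code_points = word.codepoints.to_a                # => [104, 97, 239, 114]

# Rule 2: a fixed-width encoding spends 4 bytes on every character.
# "h" becomes [0, 0, 0, 104]: three of the four bytes are empty.
utf32_bytes = word.encode("UTF-32BE").bytes.to_a

# Rule 3: UTF-8 uses 1 byte each for h, a, r and 2 bytes for ï.
utf8_bytes = word.bytes.to_a                      # => [104, 97, 195, 175, 114]

puts "characters:   #{word.length}"       # 4
puts "UTF-8 bytes:  #{word.bytesize}"     # 5
puts "UTF-32 bytes: #{utf32_bytes.size}"  # 16
```

    The byte 195 is 11000011 in binary; its leading 110 marks the start of
    a 2-byte sequence, and 175 (10101111) is the continuation byte. That is
    how a UTF-8 reader knows that 195 and 175 together make up ï.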

    --
    Posted via http://www.ruby-forum.com/.
    7stud --, Mar 2, 2011
    #4
  5. Thomas Luedeke, Mar 2, 2011
    #5