regexp to match CJK characters

Discussion in 'Ruby' started by Cafe Babe, Oct 28, 2006.

  1. Cafe Babe

    Cafe Babe Guest

    How can I write a regexp to match CJK characters?
    Thanks in advance:)

    Cafe Babe, Oct 28, 2006
    #1

  2. Paul Lutus wrote:
    > Cafe Babe wrote:
    >
    >> How can I write a regexp to match CJK characters?
    >> Thanks in advance:)
    >
    > print "Yes!" if varname =~ /^CJK$/
    >
    > If this is not what you wanted, you will simply have to write a longer
    > post.
    >

    CJK = (I think) Chinese, Japanese, Korean. "CJK characters" usually
    refers to the encodings you use for those - Big5, JIS, Unicode, etc.

    David Vallner


     
    David Vallner, Oct 28, 2006
    #2

  3. Cafe Babe

    Cafe Babe Guest

    David Vallner wrote:
    > CJK = (I think) Chinese, Japanese, Korean. "CJK characters" usually
    > refers to the encodings you use for those - Big5, JIS, Unicode, etc.
    >
    > David Vallner


    Yes, so how can I write the regexp? Thanks a lot.


    --
    Posted via http://www.ruby-forum.com/.
     
    Cafe Babe, Oct 28, 2006
    #3
  4. Cafe Babe wrote:
    | David Vallner wrote:
    |> CJK = (I think) Chinese, Japanese, Korean. "CJK characters" usually
    |> refers to the encodings you use for those - Big5, JIS, Unicode, etc.
    | Yes, so how can I write the regexp? Thanks a lot.

    Which encoding?

    Jupp
     
    Josef 'Jupp' Schugt, Oct 28, 2006
    #4
  5. Cafe Babe

    Cafe Babe Guest

    Josef 'Jupp' Schugt wrote:
    > Cafe Babe wrote:
    > | David Vallner wrote:
    > |> CJK = (I think) Chinese, Japanese, Korean. "CJK characters" usually
    > |> refers to the encodings you use for those - Big5, JIS, Unicode, etc.
    > | Yes, so how can I write the regexp? Thanks a lot.
    >
    > Which encoding?
    >
    > Jupp


    UTF-8

    and

    $KCODE='u'
    require_dependency 'jcode',

    thanks


     
    Cafe Babe, Oct 29, 2006
    #5
  6. Dido Sevilla

    Dido Sevilla Guest

    On 10/29/06, Cafe Babe wrote:
    > UTF-8
    >
    > and
    >
    > $KCODE='u'
    > require_dependency 'jcode',


    You may need to use the Oniguruma patch. I believe this is necessary
    to give regular expressions support for character sets other than
    plain ASCII.

    http://www.geocities.jp/kosako3/oniguruma/

    If you're using Gentoo, all you need to do is remerge Ruby with the
    cjk use flag turned on. For other systems, you may need to download
    and apply the patch manually. See the Oniguruma site for more details.
    If you're using a 1.9 Ruby, Oniguruma is already built-in.
     
    Dido Sevilla, Oct 29, 2006
    #6
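
As a rough sketch of what the built-in Oniguruma engine allows on Ruby 1.9 and later: Unicode script properties such as \p{Han}, \p{Hiragana}, \p{Katakana} and \p{Hangul} can stand in for explicit code-point ranges. The constant CJK_CHAR and the helper cjk? are names made up for this illustration, not something from the thread, and treating "CJK" as exactly these four scripts is an assumption.

```ruby
# encoding: utf-8
# Assumes Ruby 1.9 or later and UTF-8 encoded strings.

# Hypothetical constant: matches one character from the Han, Hiragana,
# Katakana or Hangul scripts.
CJK_CHAR = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

# Hypothetical helper: true if the string contains at least one such character.
def cjk?(str)
  !!(str =~ CJK_CHAR)
end

puts cjk?("Ruby 言語")    # contains Han characters
puts cjk?("ひらがな")     # Hiragana only
puts cjk?("한국어")       # Hangul only
puts cjk?("plain ASCII")  # no CJK characters
```

Script properties follow the Unicode data shipped with the Ruby build, so characters added to these scripts in later Unicode versions may only match on newer Rubies.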
  7. Hi,

    In message "Re: regexp to match CJK characters"
    on Mon, 30 Oct 2006 00:26:49 +0900, "Dido Sevilla" writes:

    |You may need to use the Oniguruma patch. I believe this is necessary
    |to give regular expressions support for character sets other than
    |plain ASCII.

    The regular expression engine that comes with 1.8 does support UTF-8.

    matz.
     
    Yukihiro Matsumoto, Oct 30, 2006
    #7
  8. > The regular expression engine that comes with 1.8 does support UTF-8.

    Does this mean, though, that you must match on an escaped character
    (\u1234) or on a 'real' character?

    Kev
     
    Kevin Jackson, Oct 30, 2006
    #8
  9. Hi,

    In message "Re: regexp to match CJK characters"
    on Mon, 30 Oct 2006 12:33:08 +0900, "Kevin Jackson" writes:

    |> The regular expression engine that comes with 1.8 does support UTF-8.
    |
    |Does this mean, though, that you must match on an escaped character
    |(\u1234) or on a 'real' character?

    You don't have to escape, if you specify -Ku or $KCODE='u'.

    matz.
     
    Yukihiro Matsumoto, Oct 30, 2006
    #9
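
To illustrate the literal-versus-escaped question, here is a minimal sketch for Ruby 1.9 or later, where a UTF-8 source file takes the place of -Ku and $KCODE. The constant names HAN_LITERAL and HAN_ESCAPED are invented for this example, and the range U+4E00 to U+9FA5 is only an assumption about what "CJK" should cover: it is a commonly used slice of the CJK Unified Ideographs block and leaves out kana and Hangul.

```ruby
# encoding: utf-8
# Assumes Ruby 1.9 or later with a UTF-8 source file.

# The same character range written two ways (both names are hypothetical).
HAN_LITERAL = /[一-龥]/            # 'real' characters typed into the source
HAN_ESCAPED = /[\u4e00-\u9fa5]/    # \u escapes for the same code points

sample = "Ruby は 楽しい"

# Both patterns pick out the same Han character and skip the kana.
puts sample.scan(HAN_LITERAL).inspect
puts sample.scan(HAN_ESCAPED).inspect
```

The escaped form survives editors and terminals that mangle non-ASCII text; the literal form is easier to read when the source file is reliably UTF-8.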
