problem with regex

Discussion in 'Perl Misc' started by Paul Johnston, Feb 6, 2004.

  1. Hi
    I have a file encoded in Unicode (UTF-8) on a Redhat 9 system, and I am
    using Perl 5.8.0.
    It contains mixed Estonian and English, like below:

    <ee> Kaks vana sõpra </ee>
    <en> Two old friends </en>
    <ee> Tere Piret ! </ee>
    <en> Hello Piret ! </en>
    <ee> Tere Tõnu ! </ee>
    <en> Hello Tõnu ! </en>

    I need to do some processing, but the expression
    (/õ/) will not match the õ in any line.
    The Perl script and the file I wish to process were both created using
    the same editor (kedit), so I assume they are encoded using the same
    scheme.
    Any ideas why I cannot, for example, extract all lines which contain
    this symbol "õ"?
    TIA
    Paul
    Paul Johnston, Feb 6, 2004
    #1

  2. Paul Lalli

    Without having seen your code, my guess would be that your locale is not
    set up correctly. See perldoc perllocale and perldoc locale.

    Paul Lalli
    Paul Lalli, Feb 6, 2004
    #2

  3. Ben Morrow

    Paul Lalli wrote:
    > Without having seen your code, my guess would be that your locale is not
    > correctly set up. See perldoc perllocale and perldoc locale


    NO! Don't mix locales and Unicode with 5.8. It doesn't work.

    If you wish to use UTF-8 literals in your source, you have to put 'use
    utf8;' at the top.

    Ben

    --
    Joy and Woe are woven fine,
    A Clothing for the Soul divine
    Under every grief and pine
    Runs a joy with silken twine.
        William Blake, 'Auguries of Innocence'
    Ben Morrow, Feb 6, 2004
    #3
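
A minimal sketch of the 'use utf8;' advice above, not the original poster's script (which was never shown). It assumes the script itself is saved as UTF-8. The sample lines are taken from the first post; the in-memory filehandle and the variable names are stand-ins chosen for illustration, and the explicit ':encoding(UTF-8)' layer is an addition that makes the input decoding visible rather than leaving it to defaults.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                 # source is UTF-8, so the õ in /õ/ is one character
use Encode qw(encode);

# Stand-in for the file on disk: the sample lines from the first post,
# encoded to UTF-8 bytes so they can be read through an in-memory handle.
my @sample = (
    '<ee> Kaks vana sõpra </ee>',
    '<en> Two old friends </en>',
    '<ee> Tere Piret ! </ee>',
    '<en> Hello Piret ! </en>',
    '<ee> Tere Tõnu ! </ee>',
    '<en> Hello Tõnu ! </en>',
);
my $file_bytes = encode('UTF-8', join("\n", @sample) . "\n");

# Decode the input explicitly, so both the data and the pattern are characters.
open my $fh, '<:encoding(UTF-8)', \$file_bytes
    or die "Cannot open in-memory file: $!";
my @matches = grep { /õ/ } <$fh>;
close $fh;
chomp @matches;

# Encode on the way out so the terminal receives UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @matches;
```

For a real file, the reference to $file_bytes would be replaced by the file's path; the rest stays the same.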
  4. Just as a follow-up: I have discovered that the script works, i.e. matches õ, on
    Solaris 5.8 with Perl version 5.005.
    However, adding
    use utf8; to the script on the Redhat machine also works, so my
    problems have been solved (for now at least :) )
    Many thanks
    Paul
    Paul Johnston, Feb 9, 2004
    #4
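
One plausible reading of the results reported above, which the thread itself does not confirm: on the Perl 5.005 machine both the file and the pattern were plain bytes, while on the Redhat machine the input was being decoded to characters and the pattern literal was still bytes until 'use utf8;' was added. The sketch below only illustrates that mismatch; the variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

my $char_needle = "õ";                            # one character, U+00F5
my $byte_needle = encode('UTF-8', $char_needle);  # two bytes, 0xC3 0xB5

my $char_line = "<ee> Tere Tõnu ! </ee>";         # decoded text
my $byte_line = encode('UTF-8', $char_line);      # raw bytes as stored on disk

# A match needs both sides in the same form: bytes with bytes,
# or characters with characters.
my %result = (
    'byte line, byte pattern' => ($byte_line =~ /\Q$byte_needle\E/ ? 1 : 0),
    'char line, char pattern' => ($char_line =~ /\Q$char_needle\E/ ? 1 : 0),
    'char line, byte pattern' => ($char_line =~ /\Q$byte_needle\E/ ? 1 : 0),
    'byte line, char pattern' => ($byte_line =~ /\Q$char_needle\E/ ? 1 : 0),
);

printf "%-24s %s\n", $_, ($result{$_} ? 'match' : 'no match')
    for sort keys %result;
```

The two mixed cases fail because the two-byte sequence and the single character are different strings as far as the regex engine is concerned.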
