Converting a Perl regexp to a Ruby one?

Discussion in 'Ruby' started by Une bévue, Mar 23, 2006.

  1. Une bévue

    Une bévue Guest

    I have a Perl regexp:

    $field =~
    m/^(
    [\x09\x0A\x0D\x20-\x7E] # ASCII
    | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    )*$/x;

    It detects whether $field consists of UTF-8 characters or not, and I'd
    like to convert it into a Ruby regexp.

    How can I do that?

    --
    une bévue
    Une bévue, Mar 23, 2006
    #1

  2. James Edward Gray II

    On Mar 23, 2006, at 6:43 AM, Une bévue wrote:

    > i've a perl regexp :
    >
    > $field =~
    > m/^(
    > [\x09\x0A\x0D\x20-\x7E] # ASCII
    > | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    > | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    > | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    > | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    > | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    > | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    > | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    > )*$/x;
    >
    > able to detect if $field is of UTF-8 chars or not and i'd like to
    > convert it into a ruby regexp.
    >
    > How to do that ?


    The expression looks fine to me. Did you try using it?

    James Edward Gray II
    James Edward Gray II, Mar 23, 2006
    #2

  3. Une bévue

    Une bévue Guest

    James Edward Gray II <> wrote:

    >
    > The expression looks fine to me. Did you try using it?


    Yes, but it doesn't give the correct result. Here is my code:

    field='&é§è!çàîûtybvn¤'
    utf8rgx=Regexp.new('m/^(
    [\x09\x0A\x0D\x20-\x7E] # ASCII
    | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    )*$/x')

    The test:

    flag=(field === utf8rgx)
    p "flag = #{flag}"

    The result is:
    "flag = false"

    I'm sure my encoding is UTF-8...

    Maybe I misunderstand "===" ?

    Because when I try:

    truc = 'toto'
    rgx=Regexp.new('^toto$')
    flag=(truc === rgx)
    p "flag = #{flag}"

    I get:
    # => "flag = false" (seems NOT OK to me)

    flag=(truc =~ rgx)
    p "flag = #{flag}"
    # => "flag = 0" (seems OK to me)

    --
    une bévue
    Une bévue, Mar 23, 2006
    #3
  4. Ross Bamford

    Ross Bamford Guest

    On Thu, 2006-03-23 at 23:38 +0900, Une bévue wrote:
    > James Edward Gray II <> wrote:
    >
    > >
    > > The expression looks fine to me. Did you try using it?

    >
    > yes, without the correct result, here is my code :
    >
    > field='&é§è!çàîûtybvn€'
    > utf8rgx=Regexp.new('m/^(
    > [\x09\x0A\x0D\x20-\x7E] # ASCII
    > | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    > | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    > | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    > | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    > | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    > | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    > | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    > )*$/x')
    >
    > the test :
    >
    > flag=(field === utf8rgx)
    > p "flag = #{flag}"
    >


    You'll need to switch those around, as I showed in my response to your
    other thread. flag will then be true, but unfortunately I think too
    often:

    utf8rgx === "onlyascii"
    # => true

    I think to do that kind of test you'd have to remove the first line
    (matching ASCII chars) and not anchor the regexp with ^ and $.
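
    A minimal sketch of why the operands need switching, assuming only core
    Ruby: === is dispatched on its left-hand operand, so String#=== tests
    equality while Regexp#=== tests for a match. The variable names follow
    the earlier post.

```ruby
truc = 'toto'
rgx  = Regexp.new('^toto$')

# String#=== is plain equality: a String never equals a Regexp.
string_first = (truc === rgx)

# Regexp#=== returns true or false depending on whether the pattern matches.
regexp_first = (rgx === truc)

# =~ is defined on both classes and returns the match offset (or nil).
offset = (truc =~ rgx)

p string_first  # false
p regexp_first  # true
p offset        # 0
```

    This is also why =~ appeared to work in the earlier test while === did
    not: only === changes meaning with the class of its receiver.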

    Incidentally, I believe that the regexp above is best translated to Ruby
    like this:

    utf8rgx = /^(.)*$/u

    You should also look into $KCODE (specifically $KCODE = 'u').

    (Caveat to the above: I'm not much of an encoding expert at all).

    --
    Ross Bamford -
    Ross Bamford, Mar 23, 2006
    #4
  5. James Edward Gray II

    On Mar 23, 2006, at 8:38 AM, Une bévue wrote:

    > utf8rgx=Regexp.new('m/^(
    > [\x09\x0A\x0D\x20-\x7E] # ASCII
    > | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    > | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    > | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    > | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    > | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    > | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    > | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    > )*$/x')


    Try changing this to:

    utf8rgx = / ... /x

    Hope that helps.
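
    A small sketch of what goes wrong with Regexp.new('m/.../x'), using a
    shortened stand-in pattern (^toto$) rather than the full UTF-8
    expression: the string handed to Regexp.new is taken as the pattern
    itself, so Perl's m/ prefix and /x suffix become literal characters to
    match instead of delimiters and a flag.

```ruby
# Perl-style delimiters inside the string are just more pattern text.
perlish = Regexp.new('m/^toto$/x')
p perlish.source          # "m/^toto$/x"
p perlish.match?('toto')  # false

# A Ruby literal carries the x flag after the closing slash...
literal = /^ toto $/x
p literal.match?('toto')  # true

# ...and Regexp.new takes flags as a second argument.
built = Regexp.new('^ toto $', Regexp::EXTENDED)
p built.match?('toto')    # true
```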

    James Edward Gray II
    James Edward Gray II, Mar 23, 2006
    #5
  6. Une bévue

    Une bévue Guest

    Ross Bamford <> wrote:

    > You'll need to switch those around, as I showed in my response to your
    > other thread. flag will then be true, but unfortunately I think too
    > often:
    >
    > utf8rgx === "onlyascii"
    > # => true
    >
    > I think to do that kind of test you'd have to remove the first line
    > (matching ASCII chars) and not anchor the regexp with ^ and $.
    >
    > Incidentally, I believe that the regexp above is best translated to Ruby
    > like this:
    >
    > utf8rgx = /^(.)*$/u
    >
    > You should also look into $KCODE (specifically $KCODE = 'u').
    >
    > (Caveat to the above: I'm not much of an encoding expert at all).


    OK, thanks for everything. Maybe it would be better to strip out all of
    the HTML tags and keep only part of what's in the <body/>...
    --
    une bévue
    Une bévue, Mar 23, 2006
    #6
  7. Une bévue

    Une bévue Guest

    James Edward Gray II <> wrote:

    > Try changing this to:
    >
    > utf8rgx = / ... /x
    >
    > Hope that helps.


    OK, thanks, I see what you mean!
    --
    une bévue
    Une bévue, Mar 23, 2006
    #7
  8. Une bévue

    Une bévue Guest

    James Edward Gray II <> wrote:

    > > utf8rgx=Regexp.new('m/^(
    > > [\x09\x0A\x0D\x20-\x7E] # ASCII
    > > | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    > > | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    > > | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    > > | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    > > | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    > > | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    > > | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    > > )*$/x')

    >
    > Try changing this to:
    >
    > utf8rgx = / ... /x


    The above regexp doesn't work as expected with Ruby. I've compared the
    output for the same files with Perl and Ruby: Ruby always says "yes, it
    is UTF-8", where Perl says NO for an ISO-8859-1 encoded file... (even
    after removing the first line, the first ^ and the last $).

    So, for the time being, I'll call the Perl script from Ruby on the
    command line...
    --
    une bévue
    Une bévue, Mar 23, 2006
    #8
  9. ts

    ts Guest

    >>>>> "U" == Une bévue <> writes:

    U> the above regexp doesn't work as expected with ruby, i've compared the
    U> output for the same files with perl and ruby, ruby says always "yes it
    U> is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
    U> after wipping out the first line the first ^and the last $)

    moulon% cat b.rb
    field='&é§è!çàîûtybvn¤'
    utf8rgx=Regexp.new('^(
    [\x09\x0A\x0D\x20-\x7E] # ASCII
    | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    )*$', Regexp::EXTENDED)

    p utf8rgx =~ field
    moulon%

    moulon% file b.rb
    b.rb: ISO-8859 text
    moulon%

    moulon% ruby b.rb
    nil
    moulon%


    Guy Decoux
    ts, Mar 23, 2006
    #9
  10. Une bévue

    Une bévue Guest

    ts <> wrote:

    > p utf8rgx =~ field
    > moulon%
    >
    > moulon% file b.rb
    > b.rb: ISO-8859 text
    > moulon%
    >
    > moulon% ruby b.rb
    > nil
    > moulon%


    I don't understand your post )))

    My .rb file is UTF-8 encoded; at best, this script gives me the reverse
    of the answer I want )))

    Otherwise I always get true...
    --
    une bévue
    Une bévue, Mar 23, 2006
    #10
  11. ts

    ts Guest

    >>>>> "U" == Une bévue <> writes:

    U> i don't understand your post )))

    U> ts <> wrote:

    >> moulon% file b.rb
    >> b.rb: ISO-8859 text
    >> moulon%


    my file is ISO-8859 encoded

    >> moulon% ruby b.rb
    >> nil
    >> moulon%


    and ruby says NO

    U> output for the same files with perl and ruby, ruby says always "yes it
    ^^^^^^^
    U> is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
    ^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^

    Guy Decoux
    ts, Mar 23, 2006
    #11
  12. Une bévue

    Une bévue Guest

    ts <> wrote:

    >
    > my file is ISO-8859 encoded


    OK, I've made a "biso.rb" file, ISO encoded, and the result is OK:

    > ruby biso.rb

    nil
    "false"

    with :
    field='&éèàçôîûêâöïü'
    utf8rgx=Regexp.new('^(
    [\x09\x0A\x0D\x20-\x7E] # ASCII
    | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    )*$', Regexp::EXTENDED)
    p utf8rgx =~ field
    p (utf8rgx === field).to_s

    >
    > >> moulon% ruby b.rb
    > >> nil
    > >> moulon%

    >
    > and ruby say NO
    >
    > U> output for the same files with perl and ruby, ruby says always "yes it
    > ^^^^^^^
    > U> is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
    > ^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^


    BUT, in "butf.rb" (a UTF-8 encoded file) I do:
    field='&é§è!çàîûtybvn¤'
    utf8rgx=Regexp.new('^(
    [\x09\x0A\x0D\x20-\x7E] # ASCII
    | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    )*$', Regexp::EXTENDED)

    p utf8rgx =~ field
    p (utf8rgx === field).to_s

    str=""
    File.open("tut_exceptions.html").each { |l| str << l}

    p utf8rgx =~ str
    p (utf8rgx === str).to_s


    and get :
    > ruby butf.rb

    0
    "true"
    0
    "true"


    This file comes from:
    <http://www.rubycentral.com/book/tut_exceptions.html>

    with the following meta tag:
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"
    Note that Firefox agrees with "iso-8859-1", and so does one of my text
    editors.

    So it is seen as a UTF-8 file but isn't. Maybe this is due to the HTML
    tags, so I stripped them out, saving tut_exceptions.html as
    tut_exceptions.txt with no tags left, not even a single < or >, and
    retried on that file:

    ruby butf.rb
    0
    "true"
    0
    "true"


    I've only changed:
    File.open("tut_exceptions.html").each { |l| str << l}

    to:
    File.open("tut_exceptions.txt").each { |l| str << l}
    --------------------------^^^

    However:
    > file tut_exceptions.txt

    tut_exceptions.txt: UTF-8 Unicode English text

    Maybe this isn't a good example, because most of the characters are
    US-ASCII anyway, the file being written in English.

    On:
    <http://www.linux-france.org/>
    which declares:
    <meta http-equiv="Content-type" content="text/html;
    charset=iso-8859-15"/>

    and Firefox also agrees with that, the regexp gives me:

    > ruby butf.rb

    0
    "true"
    0
    "true"

    ....
    --
    une bévue
    Une bévue, Mar 23, 2006
    #12
  13. Dominik Bathon

    Hi,

    On Thu, 23 Mar 2006 19:13:51 +0100, "Une bévue" <> wrote:

    > utf8rgx=Regexp.new('^(
    > [\x09\x0A\x0D\x20-\x7E] # ASCII
    > | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    > | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    > | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    > | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    > | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    > | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    > | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    > )*$', Regexp::EXTENDED)


    As I understand it, utf8rgx matches any string that is UTF-8, which
    includes pure ASCII strings (see first line).
    So it should match http://www.rubycentral.com/book/tut_exceptions.html.

    First, here is a working version:

    $ cat utf8tst.rb
    utf8rgx = /\A(
    [\x09\x0A\x0D\x20-\x7E] # ASCII
    | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
    )*\z/x

    p utf8rgx === ARGF.read
    $ curl -s http://www.linux-france.org/ | ruby utf8tst.rb
    false
    $ curl -s http://www.rubycentral.com/book/tut_exceptions.html | ruby utf8tst.rb
    true


    Your problem was that in Perl ^ and $ by default only match beginning
    and end of string, but in Ruby they also match beginning and end of
    line. So if a string contains, for example, a single empty line, it
    always matches:

    irb(main):001:0> a = "xxx\n\nyyyy"
    => "xxx\n\nyyyy"
    irb(main):002:0> a =~ /^(w)*$/
    => 4

    So for beginning and end of string in Ruby you need \A and \z:

    irb(main):003:0> a =~ /\A(w)*\z/
    => nil
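
    The same effect can be reproduced without any encoding involved. This
    is a sketch with a deliberately simple stand-in pattern ([a-z]*,
    lowercase letters only) and a made-up three-line string; the empty
    middle line is enough for the ^...$ form to report a match even though
    the string contains no lowercase letters at all.

```ruby
text = "ABC\n\nDEF"   # no lowercase letters anywhere

# ^ and $ match at every line boundary, so the empty line at offset 4 matches.
line_anchored = (text =~ /^[a-z]*$/)

# \A and \z match only at the very start and very end of the string.
string_anchored = (text =~ /\A[a-z]*\z/)

p line_anchored    # 4
p string_anchored  # nil
```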

    Hope that helps,
    Dominik
    Dominik Bathon, Mar 23, 2006
    #13
  14. Une bévue

    Une bévue Guest

    Dominik Bathon <> wrote:

    > Hope that helps,


    Fine, thanks a lot, it works. You explained very well why the Ruby
    version works on strings like string="&éçàôûîêäë" BUT NOT on files,
    because of the \n... Here is a script that compares the Perl output with
    the Ruby one:
    def isFileUtf8Encoded(fileName)
      utf8rgx = /\A(
        [\x09\x0A\x0D\x20-\x7E] # ASCII
        | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
      )*\z/x
      str=""
      File.open("#{fileName}").each { |l| str << l}
      return (utf8rgx === str)
    end

    p isFileUtf8Encoded("lutte-ouvriere.html") # => false
    p isFileUtf8Encoded("l_harmatan.html") # => false
    p isFileUtf8Encoded("tut_exceptions.html") # => false
    p isFileUtf8Encoded("butf.rb") # => true
    p isFileUtf8Encoded("biso.rb") # => false

    p `perl IsUTF-8.pl "lutte-ouvriere.html"` # => "0"
    p `perl IsUTF-8.pl "l_harmatan.html"` # => "0"
    p `perl IsUTF-8.pl "tut_exceptions.html"` # => "0"
    p `perl IsUTF-8.pl "butf.rb"` # => "1"
    p `perl IsUTF-8.pl "biso.rb"` # => "0"

    p $KCODE # => "UTF8"

    The Perl script (called from the Ruby one) is:

    #!/usr/bin/perl

    sub isFileUtf8Encoded
    {
        my ($fn) = @_;
        $string='';
        open (F, $fn) || die "Unable to open file $fn : $!";
        while ($line = <F>) {
            $string.=$line;
        }
        close F;
        $flag = ($string =~
        m/^(
        [\x09\x0A\x0D\x20-\x7E] # ASCII
        | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
        )*$/x);
        if( $flag != 1 )
        {
            return 0;
        }
        return $flag;
    }
    print isFileUtf8Encoded($ARGV[0])
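
    For readers on Ruby 1.9 or later, which postdates this thread, a hedged
    sketch of the same check: there a regexp literal with \x80-\xFF escapes
    typically needs the n flag, and the input has to be treated as raw
    bytes. The constant and method names below (UTF8_BYTES, utf8_bytes?) are
    made up for illustration; String#b and String#valid_encoding? are core
    methods, and the latter is looser than the thread's pattern because it
    also accepts control characters such as NUL.

```ruby
# Byte classes copied from the thread's pattern; n = raw bytes, x = extended.
UTF8_BYTES = /\A(?:
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*\z/nx

# Hypothetical helper: true when the bytes of data fit the pattern above.
def utf8_bytes?(data)
  UTF8_BYTES.match?(data.b)   # String#b returns a binary (ASCII-8BIT) copy
end

utf8_text   = "caf\u00E9 \u20AC\n"   # e-acute and euro sign, encoded as UTF-8
latin1_text = "caf\xE9\n".b          # the same e-acute as one ISO-8859-1 byte
with_nul    = "a\x00b"               # valid UTF-8, but NUL is not in the pattern

p utf8_bytes?(utf8_text)     # true
p utf8_bytes?(latin1_text)   # false
p utf8_bytes?(with_nul)      # false
p with_nul.valid_encoding?   # true
p latin1_text.dup.force_encoding(Encoding::UTF_8).valid_encoding?  # false
```

    To check a file as in the script above, File.binread(name) would supply
    the bytes without any line-by-line concatenation.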


    --
    une bévue
    Une bévue, Mar 23, 2006
    #14
