Converting a Perl regexp to a Ruby one?

Une bévue

I have a Perl regexp:

$field =~
m/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x;

It detects whether $field consists of valid UTF-8 characters or not, and
I'd like to convert it into a Ruby regexp.

How can I do that?
 
James Edward Gray II

The expression looks fine to me. Did you try using it?

James Edward Gray II
 
Une bévue

James Edward Gray II said:
The expression looks fine to me. Did you try using it?

Yes, but without the correct result. Here is my code:

field='&é§è!çàîûtybvn¤'
utf8rgx=Regexp.new('m/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x')

The test:

flag=(field === utf8rgx)
p "flag = #{flag}"

The result being:
"flag = false"

I'm sure my encoding is UTF-8...

Maybe I've misunderstood "==="?

Because when trying:

truc = 'toto'
rgx=Regexp.new('^toto$')
flag=(truc === rgx)
p "flag = #{flag}"

I got:
# => "flag = false" (seems NOT OK to me)

flag=(truc =~ rgx)
p "flag = #{flag}"
# => "flag = 0" (seems OK to me)
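
The difference between the two tests above comes down to which object receives `===`. A minimal sketch, reusing `truc` and `rgx` from the post (the result variable names are only illustrative): `String#===` is plain equality, whereas `Regexp#===` performs a match, so the regexp has to be on the left-hand side.

```ruby
truc = 'toto'
rgx  = Regexp.new('^toto$')

# String#=== is plain string equality; a String never equals a Regexp.
string_on_left = (truc === rgx)

# Regexp#=== returns true when the regexp matches its argument.
regexp_on_left = (rgx === truc)

# =~ returns the index of the first match (or nil), whatever the operand order.
match_index = (truc =~ rgx)

p string_on_left  # false
p regexp_on_left  # true
p match_index     # 0
```

This is also why `case`/`when` works with regexps: `when rgx` calls `rgx === value`.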
 
Ross Bamford

Une bévue said:
the test :

flag=(field === utf8rgx)
p "flag = #{flag}"

You'll need to switch those around, as I showed in my response to your
other thread. flag will then be true, but unfortunately I think too
often:

utf8rgx === "onlyascii"
# => true

I think to do that kind of test you'd have to remove the first line
(matching ASCII chars) and not anchor the regexp with ^ and $.

Incidentally, I believe that the regexp above is best translated to Ruby
like this:

utf8rgx = /^(.)*$/u

You should also look into $KCODE (specifically $KCODE = 'u').

(Caveat to the above: I'm not much of an encoding expert at all).

--
Ross Bamford
 
James Edward Gray II

utf8rgx=Regexp.new('m/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x')

Try changing this to:

utf8rgx = / ... /x

Hope that helps.

James Edward Gray II
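
A small sketch of why the literal form matters, using a shortened stand-in pattern (`toto`) instead of the full UTF-8 expression. `Regexp.new` takes only the pattern text, so a Perl-style `m/` prefix and `/x` suffix pasted into the string become literal characters to match, and the extended flag is never set. The variable names are illustrative, and `match?` assumes Ruby 2.4 or later.

```ruby
# Perl delimiters and flags pasted into the string are treated as pattern text.
pasted          = Regexp.new('m/^toto$/x')
pasted_matches  = pasted.match?('toto')
pasted_extended = (pasted.options & Regexp::EXTENDED) != 0

# Two equivalent ways to get an extended-syntax regexp in Ruby.
literal     = /^ toto $/x
from_string = Regexp.new('^ toto $', Regexp::EXTENDED)

p pasted_matches              # false: the pattern starts with a literal "m/"
p pasted_extended             # false: the trailing "/x" did not set any flag
p literal.match?('toto')      # true
p from_string.match?('toto')  # true
```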
 
Une bévue

OK, thanks for everything. Maybe it would be better to strip out all of the
HTML tags and keep only part of what's in the <body/>...
 
Une bévue

James Edward Gray II said:
Try changing this to:

utf8rgx = / ... /x

The above regexp doesn't work as expected with Ruby. I've compared the
output for the same files with Perl and Ruby: Ruby always says "yes, it
is UTF-8", where Perl says NO for an ISO-8859-1 encoded file... (even
after wiping out the first line, the first ^ and the last $).

So, for the time being, I'll call the Perl script from Ruby on the
command line...
 
ts

"U" == Une bévue writes:

U> the above regexp doesn't work as expected with ruby, i've compared the
U> output for the same files with perl and ruby, ruby says always "yes it
U> is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
U> after wipping out the first line the first ^and the last $)

moulon% cat b.rb
field='&é§è!çàîûtybvn¤'
utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)

p utf8rgx =~ field
moulon%

moulon% file b.rb
b.rb: ISO-8859 text
moulon%

moulon% ruby b.rb
nil
moulon%


Guy Decoux
 
Une bévue

I don't understand your post )))

My .rb file is UTF-8 encoded; at best, the answer I get from this
script is the reverse of what is wanted )))

Otherwise I always get true...
 
ts

U> i don't understand your post )))


My file is ISO-8859 encoded

and Ruby says NO

U> output for the same files with perl and ruby, ruby says always "yes it
                                                                  ^^^^^^^
U> is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
   ^^^^^^^^                              ^^^^^^^^^^^^^^^^^^^^^^^

Guy Decoux
 
Une bévue

ts said:
my file is ISO-8859 encoded

OK, I made a "biso.rb", ISO encoded, and the result is OK:
ruby biso.rb
nil
"false"

with:
field='&éèàçôîûêâöïü'
utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)
p utf8rgx =~ field
p (utf8rgx === field).to_s
and ruby say NO

U> output for the same files with perl and ruby, ruby says always "yes it
                                                                  ^^^^^^^
U> is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
   ^^^^^^^^                              ^^^^^^^^^^^^^^^^^^^^^^^

BUT, in "butf.rb" (a UTF-8 encoded file), I do:
field='&é§è!çàîûtybvn¤'
utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)

p utf8rgx =~ field
p (utf8rgx === field).to_s

str=""
File.open("tut_exceptions.html").each { |l| str << l}

p utf8rgx =~ str
p (utf8rgx === str).to_s


and get :
ruby butf.rb
0
"true"
0
"true"


This file comes from:
<http://www.rubycentral.com/book/tut_exceptions.html>

with the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"
Note that Firefox agrees with the "iso-8859-1", and so does my text
editor.

So it is seen as a UTF-8 file but isn't. Maybe this is due to the HTML
tags, so I wiped them out, saving tut_exceptions.html as
tut_exceptions.txt without any tags, nor even a single < or >, and
retried on that file:

ruby butf.rb
0
"true"
0
"true"


I've only changed:
File.open("tut_exceptions.html").each { |l| str << l}

to:
File.open("tut_exceptions.txt").each { |l| str << l}
--------------------------^^^

However:
file tut_exceptions.txt
tut_exceptions.txt: UTF-8 Unicode English text

Maybe this isn't a good example, because most of the characters are
US-ASCII anyway, the file being written in English.

On:
<http://www.linux-france.org/>
which declares:
<meta http-equiv="Content-type" content="text/html;
charset=iso-8859-15"/>

and Firefox also agrees with that, with the regexp I get:
ruby butf.rb
0
"true"
0
"true"

....
 
Dominik Bathon

Hi,

As I understand it, utf8rgx matches any string that is UTF-8, which includes
pure ASCII strings (see the first line of the regexp).
So it should match http://www.rubycentral.com/book/tut_exceptions.html.

First, here is a working version:

$ cat utf8tst.rb
utf8rgx = /\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x

p utf8rgx === ARGF.read
$ curl -s http://www.linux-france.org/ | ruby utf8tst.rb
false
$ curl -s http://www.rubycentral.com/book/tut_exceptions.html | ruby utf8tst.rb
true


Your problem was that in Perl (without the /m modifier) ^ and $ only match
at the beginning and end of the string, but in Ruby they also match at the
beginning and end of each line. So if a string contains, for example, a
single empty line, it always matches:

irb(main):001:0> a = "xxx\n\nyyyy"
=> "xxx\n\nyyyy"
irb(main):002:0> a =~ /^(w)*$/
=> 4

So for beginning and end of string in ruby you need \A and \z:

irb(main):003:0> a =~ /\A(w)*\z/
=> nil
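
To tie this back to the UTF-8 check, here is a sketch comparing both anchorings on a two-paragraph string. It rests on assumptions that are not in the thread: it targets Ruby 2.4 or later, adds the `/n` flag so the `\x` byte escapes are accepted in a UTF-8 source file, and uses `String#b` to match on raw bytes. The constant and variable names are illustrative.

```ruby
# The thread's pattern anchored per line (^ and $), as in the first attempts.
LINE_ANCHORED = /^(
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$/nx

# The same pattern anchored to the whole string (\A and \z).
STRING_ANCHORED = /\A(
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*\z/nx

# "café", a blank line, then "bar": once as ISO-8859-1 bytes, once as UTF-8 bytes.
latin1_bytes = "caf\xE9\n\nbar".b
utf8_bytes   = "caf\xC3\xA9\n\nbar".b

p LINE_ANCHORED.match?(latin1_bytes)    # true: a later line matches on its own
p STRING_ANCHORED.match?(latin1_bytes)  # false: the lone \xE9 byte is rejected
p STRING_ANCHORED.match?(utf8_bytes)    # true
```

With line anchors, any multi-line input that contains at least one clean line reports a match, which is consistent with the "always true" results on whole files seen earlier in the thread.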

Hope that helps,
Dominik
 
Une bévue

Dominik Bathon said:
Hope that helps,

Fine, thanks a lot, it works. You explained very well why the Ruby version
worked on a string like string="&éçàôûîêäë" BUT NOT on files, because of
the \n... Here is a script that compares the Perl output with the Ruby one:
def isFileUtf8Encoded(fileName)
utf8rgx = /\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x
# Read the whole file at once; the block form closes the file handle.
str = File.open(fileName) { |f| f.read }
return (utf8rgx === str)
end

p isFileUtf8Encoded("lutte-ouvriere.html") # => false
p isFileUtf8Encoded("l_harmatan.html") # => false
p isFileUtf8Encoded("tut_exceptions.html") # => false
p isFileUtf8Encoded("butf.rb") # => true
p isFileUtf8Encoded("biso.rb") # => false

p `perl IsUTF-8.pl "lutte-ouvriere.html"` # => "0"
p `perl IsUTF-8.pl "l_harmatan.html"` # => "0"
p `perl IsUTF-8.pl "tut_exceptions.html"` # => "0"
p `perl IsUTF-8.pl "butf.rb"` # => "1"
p `perl IsUTF-8.pl "biso.rb"` # => "0"

p $KCODE # => "UTF8"

The Perl script (called from the Ruby one) is:

#!/usr/bin/perl

sub isFileUtf8Encoded
{
my ($fn) = @_;
$string='';
open (F, $fn) || die "Unable to open file $fn : $!";
while ($line = <F>) {
$string.=$line;
}
close F;
$flag = ($string =~
m/^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$/x);
if( $flag != 1 )
{
return 0;
}
return $flag;
}
print isFileUtf8Encoded($ARGV[0])
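
For readers on Ruby 1.9 or later (the thread itself targets Ruby 1.8 with `$KCODE`), a different technique is available: tag the raw bytes as UTF-8 and ask Ruby whether they are well-formed. This is a sketch of that alternative, not the regexp approach discussed above. The method names `utf8_bytes?` and `file_utf8_encoded?` are hypothetical, and `String#b` assumes Ruby 2.0 or later. Unlike the regexp, `valid_encoding?` also accepts control characters other than tab, LF and CR.

```ruby
# Returns true when the given bytes form well-formed UTF-8.
def utf8_bytes?(bytes)
  # Work on a copy so the caller's string keeps its own encoding tag.
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Convenience wrapper: read the file as raw bytes, then check them.
def file_utf8_encoded?(file_name)
  utf8_bytes?(File.binread(file_name))
end

p utf8_bytes?("caf\xC3\xA9".b)  # true:  "é" encoded as UTF-8
p utf8_bytes?("caf\xE9".b)      # false: "é" encoded as ISO-8859-1
```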
 
