ts said:
my file is ISO-8859 encoded
OK, I've made a file "biso.rb", ISO-8859 encoded, and the result is as expected:
nil
"false"
with :
field='&éèàçôîûêâöïü'
utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)
p utf8rgx =~ field
p (utf8rgx === field).to_s
and Ruby says NO
U> output for the same files with perl and ruby, ruby says always "yes it
                                                           ^^^^^^^
U> is UTF-8", where perl says NO over an ISO-8859-1 encoded file... (even
                         ^^^^^^^^        ^^^^^^^^^^^^^^^^^^^^^^^
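A small sketch of why the ISO-8859 encoded field fails the pattern, assuming Ruby 1.9 or later (the post does not state a version, and `encode`, `force_encoding` and `valid_encoding?` do not exist in 1.8). It only looks at a single "é"; the variable names are illustrative.

```ruby
# "é" in a UTF-8 string, and the same character converted to ISO-8859-1.
utf8   = "\u00E9"
latin1 = utf8.encode("ISO-8859-1")

p utf8.bytes    # => [195, 169]  0xC3 0xA9, fits [\xC2-\xDF][\x80-\xBF]
p latin1.bytes  # => [233]       0xE9, a 3-byte lead byte with no continuation bytes

# Reinterpreting the ISO-8859-1 byte as UTF-8 yields an invalid string.
p latin1.dup.force_encoding("UTF-8").valid_encoding?  # => false
p utf8.valid_encoding?                                 # => true
```

In the ISO-8859 file, the byte after "é" is another accented letter (0xE8 for "è"), which is not in the continuation range \x80-\xBF, so no branch of the pattern can consume it.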
BUT, in "butf.rb" (a UTF-8 encoded file) I do:
field='&é§è!çàîûtybvn¤'
utf8rgx=Regexp.new('^(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$', Regexp::EXTENDED)
p utf8rgx =~ field
p (utf8rgx === field).to_s
str=""
File.open("tut_exceptions.html").each { |l| str << l}
p utf8rgx =~ str
p (utf8rgx === str).to_s
and get :
0
"true"
0
"true"
This file comes from:
<http://www.rubycentral.com/book/tut_exceptions.html>
with the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"
Note that Firefox agrees with "iso-8859-1", and so does my text editor.
So the file is seen as UTF-8 although it isn't. Maybe this is due to the
HTML tags, so I wiped them out, saving tut_exceptions.html as
tut_exceptions.txt with no tags left, not even a single < or >, and
retried on that file:
ruby butf.rb
0
"true"
0
"true"
I've only changed:
File.open("tut_exceptions.html").each { |l| str << l}
to :
File.open("tut_exceptions.txt").each { |l| str << l}
--------------------------^^^
however :
tut_exceptions.txt: UTF-8 Unicode English text
Maybe this isn't a good example, because most of the characters are
US-ASCII anyway, the file being written in English.
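A possible explanation for these "true" results, offered as an assumption since nothing in the thread confirms it: in Ruby, ^ and $ anchor to each line, not to the whole string, so on a multi-line string the pattern can match at index 0 as soon as the first line is clean. \A and \z anchor to the whole string instead. The sketch below keeps only the first two branches of the pattern and uses made-up sample bytes; the constant names are illustrative, and the /n flag (raw-byte matching) assumes Ruby 1.9 or later.

```ruby
# First two branches of the pattern, anchored per line (^ and $).
LINE_ANCHORED   = /^(?:[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF])*$/n
# Same branches, anchored to the whole string (\A and \z).
STRING_ANCHORED = /\A(?:[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF])*\z/n

# Sample bytes: an ASCII first line, then an ISO-8859-1 "é" (0xE9) on line two.
text = "<html>\ncaf\xE9\n".b

p LINE_ANCHORED =~ text    # => 0, the first line alone satisfies ^...$
p STRING_ANCHORED =~ text  # => nil, the 0xE9 byte breaks the whole-string match
```

If the first line of the tested files is plain ASCII, the 0 printed above would be the same 0 seen in the results.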
Then, on:
<http://www.linux-france.org/>
which declares:
<meta http-equiv="Content-type" content="text/html; charset=iso-8859-15"/>
(and Firefox agrees with that too), the regexp gives:
0
"true"
0
"true"
....
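For checking a whole file, here is a hedged alternative that swaps the regexp for Ruby's built-in `String#valid_encoding?`, assuming Ruby 2.1 or later. It is not the approach used in this thread, and it accepts some control characters that the pattern above rejects. `utf8_file?` is a hypothetical helper name, and the two temporary files merely stand in for the pages tested above.

```ruby
require "tempfile"

# True when the raw bytes of the file at `path` form valid UTF-8.
# (utf8_file? is a hypothetical helper, not something from the thread.)
def utf8_file?(path)
  File.binread(path).force_encoding("UTF-8").valid_encoding?
end

# Made-up sample contents: the same word in ISO-8859-1 and in UTF-8.
samples = {
  "latin1" => "caf\xE9\n".b,    # 0xE9 is "é" in ISO-8859-1
  "utf8"   => "caf\u00E9\n".b   # 0xC3 0xA9 is "é" in UTF-8
}

# Write each sample to a temporary file and test it.
results = samples.map do |name, bytes|
  Tempfile.create(name) do |f|
    f.binmode
    f.write(bytes)
    f.flush
    [name, utf8_file?(f.path)]
  end
end.to_h

p results
```

Reading with `File.binread` keeps the bytes untouched, so the check does not depend on the encoding of the script file itself.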