Detect file encoding utf-8

Rebhan, Gilbert

Hi,

I want to check the file encoding of files in a directory.
So far I have tried:

# found in an older thread in comp.lang.ruby
class String
  def utf8?
    unpack('U*') rescue return false
    true
  end
end
# found in an older thread in comp.lang.ruby

utf = Array.new
others = Array.new
Dir["Y:/test/**/*.xml"].each do |path|
  open(path) { |f|
    (f.read.utf8?) ? utf << path : others << path
  }
end

I also tried the chardet library (it ships without Ruby documentation),
like this:

require 'UniversalDetector'

utf = Array.new
others = Array.new
Dir["Y:/test/**/*.xml"].each do |path|
  open(path) { |f|
    UniversalDetector.chardet(f.read) =~ /utf-8/ ?
      utf << path : others << path
  }
end
puts utf.join(",")
puts others.join(",")


Are there better or simpler ways?

Regards, Gilbert
 
Richard Conroy

You could use regular expressions to search your source string for
byte sequences that are not legal in UTF-8.

Basically, you assume the file is UTF-8 and then reject it if it contains
illegal or unknown sequences.
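
A minimal sketch of that assume-then-reject idea, for illustration only. It swaps the hand-written regular expression for String#valid_encoding?, which is built into Ruby 1.9 and later, so it is not the regex approach itself. The helper names valid_utf8?, utf8_file? and partition_by_utf8 are hypothetical and do not come from the thread or from any library.

```ruby
# Hypothetical helpers (names invented for this sketch).
# Assumes Ruby 1.9+ where strings carry an encoding.

# True if the raw bytes form well-formed UTF-8.
def valid_utf8?(bytes)
  # dup so the caller's string keeps its original encoding tag
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Read the file as raw bytes (no transcoding), then validate.
def utf8_file?(path)
  valid_utf8?(File.binread(path))
end

# Split the files matching a glob pattern into [utf8_paths, other_paths].
def partition_by_utf8(pattern)
  Dir.glob(pattern).partition { |path| utf8_file?(path) }
end
```

With these in place, something like utf, others = partition_by_utf8("Y:/test/**/*.xml") would fill both lists in one pass. Note that a passing check only means the bytes are legal UTF-8: pure ASCII files pass as well, and a file in another encoding can pass by coincidence.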
 
