Detect file encoding utf-8

Discussion in 'Ruby' started by Rebhan, Gilbert, Aug 29, 2007.

  1. Hi,

    I want to check the file encoding of files in a directory.
    So far I have tried:

    # found in an older thread in comp.lang.ruby
    class String
      # unpack('U*') raises ArgumentError on malformed UTF-8
      def utf8?
        unpack('U*')
        true
      rescue ArgumentError
        false
      end
    end

    utf = Array.new
    others = Array.new
    Dir["Y:/test/**/*.xml"].each do |path|
      open(path) { |f|
        (f.read.utf8?) ? utf << path : others << path
      }
    end

    I also tried the chardet library (no Ruby documentation included),
    like this:

    require 'UniversalDetector'

    utf = Array.new
    others = Array.new
    Dir["Y:/test/**/*.xml"].each do |path|
      open(path) { |f|
        UniversalDetector.chardet(f.read) =~ /utf-8/ ?
          utf << path : others << path
      }
    end
    puts utf.join(",")
    puts others.join(",")


    Are there better or simpler ways?

    Regards, Gilbert
    Rebhan, Gilbert, Aug 29, 2007
    #1

  2. You could use regular expressions to search your source string for
    byte sequences that are not legal in UTF-8.

    Basically, you assume the file is UTF-8 and then reject it if it contains
    illegal or unknown sequences.
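
    A minimal sketch of that idea, not a tested library: the pattern below
    accepts only byte sequences that are well-formed UTF-8, so anything else
    is rejected. The names UTF8_WELL_FORMED and utf8_bytes? are made up for
    this example, and the code assumes a newer Ruby than the one current when
    this thread was written (String#b needs 2.0, Regexp#match? needs 2.4).

```ruby
# Hypothetical helper names; the pattern works on raw bytes (note the /n flag).
UTF8_WELL_FORMED = /\A(?:
    [\x00-\x7F]                          # ASCII
  | [\xC2-\xDF][\x80-\xBF]               # 2-byte sequences
  | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, excluding overlong forms
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte, general case
  | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, excluding overlong forms
  | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte, general case
  | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, up to U+10FFFF
)*\z/xn

# True when every byte of +data+ belongs to a well-formed UTF-8 sequence.
# String#b returns a binary (ASCII-8BIT) copy, so the original is untouched.
def utf8_bytes?(data)
  UTF8_WELL_FORMED.match?(data.b)
end

puts utf8_bytes?("plain ascii")   # true
puts utf8_bytes?("h\xC3\xA9llo")  # true  (two-byte sequence for e-acute)
puts utf8_bytes?("h\xE9llo")      # false (a lone Latin-1 byte)
```

    Note that a pure-ASCII file also passes, since ASCII is a subset of
    UTF-8. On Ruby 1.9 and later, the built-in String#valid_encoding? on a
    string tagged as UTF-8 performs the same kind of check without a
    hand-written pattern.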

    Richard Conroy, Aug 29, 2007
    #2

  3. Xavier Noria, Aug 29, 2007
    #3
  4. Xavier Noria wrote:
    > On Aug 29, 2007, at 2:14 PM, Rebhan, Gilbert wrote:
    >
    >> I want to check the file encoding of files in a directory.

    >
    > Have you tried charguess?
    >
    > http://raa.ruby-lang.org/project/charguess


    No. How do I install it?

    The tarfile contains only:

    charguess.c
    extconf.rb
    MANIFEST
    sample.rb

    Regards, Gilbert
    Gilbert Rebhan, Aug 29, 2007
    #4