Finding non-ASCII characters in a set of files

bg_ie

Hi,

I'm updating my program to Python 2.5, but I keep running into
encoding problems. I have no encodings defined at the start of any of
my scripts. What I'd like to do is scan a directory and list all the
files in it that contain a non-ASCII character. How would I go about
doing this?

Thanks,

Barry.
 
Peter Bengtsson

bg_ie said:
Hi,

I'm updating my program to Python 2.5, but I keep running into
encoding problems. I have no encodings defined at the start of any of
my scripts. What I'd like to do is scan a directory and list all the
files in it that contain a non-ASCII character. How would I go about
doing this?

How about something like this:
content = open('file.py').read()
try:
    content.encode('ascii')
except UnicodeDecodeError:
    print "file.py contains non-ascii characters"
 
John Machin

Peter Bengtsson said:
How about something like this:
content = open('file.py').read()
try:
    content.encode('ascii')
except UnicodeDecodeError:
    print "file.py contains non-ascii characters"
 
Larry Bates

Peter said:
How about something like this:
content = open('file.py').read()
try:
    content.encode('ascii')
except UnicodeDecodeError:
    print "file.py contains non-ascii characters"
The next problem will be that non-text files will contain non-ASCII
characters (bytes). The other 'issue' is that OP didn't say how large
the files were, so .read() might be a problem.

-Larry
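
The file-size concern can be sidestepped by reading in fixed-size chunks. The following is only a minimal sketch, not anything posted in the thread: it assumes a current Python 3 interpreter (the thread itself targets Python 2.5), and the function name `has_non_ascii` and the default chunk size are made up for illustration. It does nothing to tell text files from binary ones.

```python
def has_non_ascii(path, chunk_size=64 * 1024):
    """Return True if the file at `path` contains any byte above 0x7F."""
    # Binary mode: we look at raw bytes and never decode anything.
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # End of file reached without meeting a high byte.
                return False
            # Iterating over bytes yields ints in Python 3,
            # so max() gives the highest byte value in the chunk.
            if max(chunk) > 0x7F:
                return True
```

Only one chunk is held in memory at a time, and the scan stops at the first chunk that contains a non-ASCII byte.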
 
John Machin

Sorry, I fell face down on the Send button :)

To check all .py files in the current directory, modify Peter's code
like this:

import glob
for filename in glob.glob('*.py'):
    content = open(filename).read()

Maybe that UnicodeDecodeError should be ...Encode...,
and change the print statement to cater for filename being a variable.

If you have hundreds of .py files in the same directory, you'd better
modify the code further to explicitly close each file.

HTH,
John
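
Putting those modifications together might look roughly like the sketch below. It is an illustration rather than John's or Peter's code: it assumes Python 3, the name `files_with_non_ascii` and its parameters are hypothetical, and it decodes raw bytes instead of calling `.encode()` on a string.

```python
import glob
import os


def files_with_non_ascii(directory='.', pattern='*.py'):
    """Return sorted paths in `directory` matching `pattern` that are not pure ASCII."""
    flagged = []
    for filename in sorted(glob.glob(os.path.join(directory, pattern))):
        # The with-block closes each file as soon as it has been read,
        # so open handles do not pile up in a large directory.
        with open(filename, 'rb') as f:
            data = f.read()
        try:
            data.decode('ascii')
        except UnicodeDecodeError:
            flagged.append(filename)
    return flagged
```

Calling `files_with_non_ascii()` with no arguments checks the `.py` files in the current directory; the caller can then print each returned name.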
 
John Machin

Larry Bates said:
The next problem will be that non-text files will contain non-ASCII
characters (bytes). The other 'issue' is that OP didn't say how large
the files were, so .read() might be a problem.

-Larry

The way I read it, the OP's problem is to determine in one big hit
which Python source files need a

# coding: whatever

line up front to stop Python 2.5 complaining ... I hope none of
them is so big as to choke .read().

Cheers,
John
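
If that reading is right, the check could be narrowed to files that contain non-ASCII bytes and have no encoding declaration yet. The sketch below is an assumption-laden illustration, not code from the thread: it targets Python 3, the name `needs_coding_line` is hypothetical, and the pattern follows the form PEP 263 describes for a declaration in a comment on line 1 or 2.

```python
import re

# A coding declaration is a comment on the first or second line that
# matches this pattern (form taken from PEP 263).
CODING_RE = re.compile(br'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

UTF8_BOM = b'\xef\xbb\xbf'


def needs_coding_line(path):
    """Return True if `path` has non-ASCII bytes but no encoding declaration."""
    with open(path, 'rb') as f:
        data = f.read()
    # PEP 263 treats a UTF-8 byte order mark as a declaration of utf-8.
    if data.startswith(UTF8_BOM):
        return False
    # Only the first two lines may carry the declaration.
    for line in data.splitlines()[:2]:
        if CODING_RE.match(line):
            return False
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return True
    return False
```

A file that is pure ASCII returns False, since it needs no declaration at all.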
 
Tim Arnold

Peter Bengtsson said:
How about something like this:
content = open('file.py').read()
try:
    content.encode('ascii')
except UnicodeDecodeError:
    print "file.py contains non-ascii characters"
Here's what I do (I need to know the line number).

import os,sys,codecs
def checkfile(filename):
    f = codecs.open(filename,encoding='ascii')

    lines = open(filename).readlines()
    print 'Total lines: %d' % len(lines)
    for i in range(0,len(lines)):
        try:
            l = f.readline()
        except:
            num = i+1
            print 'problem: line %d' % num

    f.close()
 
Marc 'BlackJack' Rintsch

Tim Arnold said:
Here's what I do (I need to know the line number).

import os,sys,codecs
def checkfile(filename):
    f = codecs.open(filename,encoding='ascii')

    lines = open(filename).readlines()
    print 'Total lines: %d' % len(lines)
    for i in range(0,len(lines)):
        try:
            l = f.readline()
        except:
            num = i+1
            print 'problem: line %d' % num

    f.close()

I see a `NameError` here. Where does `i` come from? And there's no need
to read the file twice. Untested:

import os, sys, codecs

def checkfile(filename):
    f = codecs.open(filename,encoding='ascii')

    try:
        for num, line in enumerate(f):
            pass
    except UnicodeError:
        print 'problem: line %d' % num

    f.close()

Ciao,
Marc 'BlackJack' Rintsch
 
Tim Arnold

Marc 'BlackJack' Rintsch said:
.... Untested:

import os, sys, codecs

def checkfile(filename):
    f = codecs.open(filename,encoding='ascii')

    try:
        for num, line in enumerate(f):
            pass
    except UnicodeError:
        print 'problem: line %d' % num

    f.close()

Ciao,
Marc 'BlackJack' Rintsch

Thanks Marc,
That looks much cleaner. I didn't know the 'num' from the enumerate would
persist so the except block could report it.

thanks again,
--Tim
 
Tim Arnold

Marc 'BlackJack' Rintsch said:
I see a `NameError` here. Where does `i` come from? And there's no need
to read the file twice. Untested:

import os, sys, codecs

def checkfile(filename):
    f = codecs.open(filename,encoding='ascii')

    try:
        for num, line in enumerate(f):
            pass
    except UnicodeError:
        print 'problem: line %d' % num

    f.close()

Ciao,
Marc 'BlackJack' Rintsch

Well, I take it back... that code doesn't work, or at least it doesn't for
my test case.
But thanks anyway, I'm sticking to my original code. The 'i' came from
`for i in range`.
--Tim
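
For reporting line numbers, one option that does not depend on how a codec reader buffers its input is to read raw bytes line by line and test each line on its own. This is a sketch under assumptions, not a fix endorsed by anyone in the thread: it assumes Python 3, and the name `non_ascii_lines` is made up for illustration.

```python
def non_ascii_lines(path):
    """Return the 1-based numbers of the lines in `path` holding non-ASCII bytes."""
    bad = []
    # Binary mode: lines are split on b'\n' and nothing is decoded implicitly.
    with open(path, 'rb') as f:
        # enumerate(f, 1) numbers the lines starting from 1.
        for num, line in enumerate(f, 1):
            try:
                line.decode('ascii')
            except UnicodeDecodeError:
                bad.append(num)
    return bad
```

Because each line is decoded inside its own try block, a bad line does not stop the scan, and every offending line is reported rather than only the first.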
 
Martin v. Löwis

Tim said:
That looks much cleaner. I didn't know the 'num' from the enumerate would
persist so the except block could report it.

It's indeed guaranteed that the for loop index variables will keep the
value they had when the loop stopped (either through regular
termination, break, or an exception) (unlike list comprehensions, where
the variable also stays, but only as a side effect of the implementation
strategy).

Regards,
Martin
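
One detail is worth illustrating: when the exception is raised while fetching the next item (rather than inside the loop body), the loop variable still holds the index of the last item that was delivered, and it is not bound at all if the very first fetch fails. The sketch below assumes Python 3; the generators and the function name are hypothetical and exist only for the demonstration.

```python
def fails_on_third():
    """Yield two items, then fail while producing the third."""
    yield 'a'
    yield 'b'
    raise ValueError('third item could not be produced')


def fails_immediately():
    """Fail before producing any item."""
    raise ValueError('no item could be produced')
    yield  # unreachable, but makes this function a generator


def index_when_iterator_fails(iterable):
    """Return the last loop index seen before the iterator raised ValueError."""
    num = None  # pre-bind, so a failure on the first item is not a NameError
    try:
        for num, item in enumerate(iterable):
            pass
    except ValueError:
        pass
    return num
```

Here `index_when_iterator_fails(fails_on_third())` returns 1, the zero-based index of the last item delivered before the failure, while `index_when_iterator_fails(fails_immediately())` returns None.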
 
Scott David Daniels

bg_ie said:
I'm updating my program to Python 2.5, but I keep running into
encoding problems. I have no encodings defined at the start of any of
my scripts. What I'd like to do is scan a directory and list all the
files in it that contain a non-ASCII character. How would I go about
doing this?


def non_ascii(files):
    for file_name in files:
        f = open(file_name, 'rb')
        # max() over the file's bytes gives the highest byte present;
        # "or ' '" keeps max() from failing on an empty file.
        if '~' < max(f.read() or ' '):
            yield file_name
        f.close()

if __name__ == '__main__':
    import os.path
    import glob
    import sys
    for dirname in sys.path[1:] or ['.']:
        for name in non_ascii(glob.glob(os.path.join(dirname, '*.py')) +
                              glob.glob(os.path.join(dirname, '*.pyw'))):
            print name


--Scott David Daniels
 
Toby A Inkster

bg_ie said:
What I'd like to do is scan a directory and list all the
files in it that contain a non ascii character.

Not quite sure what your intention is. If you're planning a one-time scan
of a directory for non-ASCII characters in files, so that you can manually
fix those files up, then this Perl one-liner will do the trick. At the
command line, type:

perl -ne 'print "$ARGV:$.\n" if /[\x80-\xFF]/;' *

This will print out a list of files that contain non-ASCII characters, and
the line numbers which those characters appear on. Note this also
operates on binary files like images, etc, so you may want to be more
specific with the wildcard. e.g.:

perl -ne 'print "$ARGV:$.\n" if /[\x80-\xFF]/;' *.py *.txt *.*htm*

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!
 
