Finding non-ASCII characters in a set of files

bg_ie · Feb 23, 2007

Hi,

I'm updating my program to Python 2.5, but I keep running into
encoding problems. I have no encodings defined at the start of any of
my scripts. What I'd like to do is scan a directory and list all the
files in it that contain a non-ASCII character. How would I go about
doing this?

Thanks,

Barry.

Peter Bengtsson · Feb 23, 2007

bg_ie said:
I'm updating my program to Python 2.5, but I keep running into
encoding problems. I have no encodings defined at the start of any of
my scripts. What I'd like to do is scan a directory and list all the
files in it that contain a non-ASCII character. How would I go about
doing this?

How about something like this:
content = open('file.py').read()
try:
    content.encode('ascii')
except UnicodeDecodeError:
    print "file.py contains non-ascii characters"

John Machin · Feb 23, 2007

Peter Bengtsson said:
How about something like this:
content = open('file.py').read()
try:
    content.encode('ascii')
except UnicodeDecodeError:
    print "file.py contains non-ascii characters"

Larry Bates · Feb 23, 2007

Peter said:
How about something like this:
content = open('file.py').read()
try:
    content.encode('ascii')
except UnicodeDecodeError:
    print "file.py contains non-ascii characters"

The next problem will be that non-text files will contain non-ASCII
characters (bytes). The other 'issue' is that OP didn't say how large
the files were, so .read() might be a problem.
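
A minimal sketch of one way around the .read() concern, assuming Python 3 rather than the Python 2.5 discussed in this thread. The function name has_non_ascii and the chunk_size parameter are made up for illustration. Because ASCII is a single-byte encoding, the file can be checked one chunk at a time without worrying about a character being split across a chunk boundary.

```python
def has_non_ascii(path, chunk_size=64 * 1024):
    """Return True if the file at `path` contains any byte outside ASCII."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # End of file reached without finding a non-ASCII byte.
                return False
            try:
                chunk.decode("ascii")
            except UnicodeDecodeError:
                # At least one byte in this chunk is >= 0x80.
                return True
```

This does nothing about the non-text files themselves; restricting the scan to source files (for example with a '*.py' pattern) is the simpler answer there.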

-Larry

John Machin · Feb 23, 2007

Sorry, I fell face down on the Send button

To check all .py files in the current directory, modify Peter's code
like this:

import glob
for filename in glob.glob('*.py'):
    content = open(filename).read()

Maybe that UnicodeDecodeError should be ...Encode...,
and change the print statement to cater for filename being variable.

If you have hundreds of .py files in the same directory, you'd better
modify the code further to explicitly close each file.
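
Put together, the suggestions above might look like the following sketch. It assumes Python 3, where the file is read as bytes and bytes.decode('ascii') raises UnicodeDecodeError for any byte above 0x7F. The function name files_with_non_ascii and its parameters are illustrative, not something from the thread.

```python
import glob
import os


def files_with_non_ascii(directory=".", pattern="*.py"):
    """Return the sorted paths matching `pattern` that are not pure ASCII."""
    flagged = []
    for filename in sorted(glob.glob(os.path.join(directory, pattern))):
        # The with-block closes each file as soon as it has been read,
        # so a directory with hundreds of files does not exhaust handles.
        with open(filename, "rb") as f:
            content = f.read()
        try:
            content.decode("ascii")
        except UnicodeDecodeError:
            flagged.append(filename)
    return flagged
```

Calling files_with_non_ascii() with no arguments checks the .py files in the current directory; printing each returned name gives the list the OP asked for.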

HTH,
John

John Machin · Feb 23, 2007

Larry Bates said:
The next problem will be that non-text files will contain non-ASCII
characters (bytes). The other 'issue' is that OP didn't say how large
the files were, so .read() might be a problem.

The way I read it, the OP's problem is to determine in one big hit
which Python source files need a

# coding: whatever

line up the front to stop Python 2.5 complaining ... I hope none of
them are so big as to choke .read()
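
Read that way, the check can be narrowed to "has non-ASCII bytes and no coding line yet". The sketch below assumes Python 3 and uses the declaration pattern given in PEP 263, which is only honoured on the first or second line of a file. The names CODING_RE and needs_coding_line are invented for this example, and a UTF-8 byte order mark is not treated specially here.

```python
import re

# Encoding-declaration pattern as given in PEP 263, applied to raw bytes.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def needs_coding_line(path):
    """True if `path` has non-ASCII bytes and no declaration in lines 1-2."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("ascii")
        return False  # pure ASCII, so no declaration is needed
    except UnicodeDecodeError:
        pass
    # Only the first two lines may carry the declaration.
    first_two = data.splitlines()[:2]
    return not any(CODING_RE.match(line) for line in first_two)
```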

Cheers,
John

Tim Arnold · Feb 23, 2007

Peter Bengtsson said:
How about something like this:
content = open('file.py').read()
try:
    content.encode('ascii')
except UnicodeDecodeError:
    print "file.py contains non-ascii characters"

Here's what I do (I need to know the line number).

import os,sys,codecs
def checkfile(filename):
    f = codecs.open(filename,encoding='ascii')

    lines = open(filename).readlines()
    print 'Total lines: %d' % len(lines)
    for i in range(0,len(lines)):
        try:
            l = f.readline()
        except:
            num = i+1
            print 'problem: line %d' % num

    f.close()

Marc 'BlackJack' Rintsch · Feb 23, 2007

Tim Arnold said:
Here's what I do (I need to know the line number).

import os,sys,codecs
def checkfile(filename):
    f = codecs.open(filename,encoding='ascii')

    lines = open(filename).readlines()
    print 'Total lines: %d' % len(lines)
    for i in range(0,len(lines)):
        try:
            l = f.readline()
        except:
            num = i+1
            print 'problem: line %d' % num

    f.close()

I see a `NameError` here. Where does `i` come from? And there's no need
to read the file twice. Untested:

import os, sys, codecs

def checkfile(filename):
    f = codecs.open(filename,encoding='ascii')

    try:
        for num, line in enumerate(f):
            pass
    except UnicodeError:
        print 'problem: line %d' % num

    f.close()

Ciao,
Marc 'BlackJack' Rintsch

Tim Arnold · Feb 23, 2007

Marc 'BlackJack' Rintsch said:
.... Untested:

import os, sys, codecs

def checkfile(filename):
    f = codecs.open(filename,encoding='ascii')

    try:
        for num, line in enumerate(f):
            pass
    except UnicodeError:
        print 'problem: line %d' % num

    f.close()

Thanks Marc,
That looks much cleaner. I didn't know the 'num' from the enumerate would
persist so the except block could report it.

thanks again,
--Tim

Tim Arnold · Feb 23, 2007

Marc 'BlackJack' Rintsch said:
I see a `NameError` here. Where does `i` come from? And there's no need
to read the file twice. Untested:

import os, sys, codecs

def checkfile(filename):
    f = codecs.open(filename,encoding='ascii')

    try:
        for num, line in enumerate(f):
            pass
    except UnicodeError:
        print 'problem: line %d' % num

    f.close()

Well, I take it back... that code doesn't work, or at least it doesn't for
my test case.
But thanks anyway; I'm sticking to my original code. The 'i' came from
"for i in range".
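
Two things in the enumerate version would give a wrong or missing line number, whatever else went wrong in the test case: num is zero-based and holds the index of the last line that decoded cleanly, so it trails the failing line, and it is never assigned at all if the very first line fails. A sketch that avoids both, assuming Python 3 and a made-up function name non_ascii_lines, reads the file in binary mode and decodes one line at a time, so the count cannot be thrown off by any read-ahead in a decoding reader. It also reports every offending line rather than stopping at the first.

```python
def non_ascii_lines(path):
    """Return the 1-based numbers of the lines in `path` that are not ASCII."""
    bad = []
    with open(path, "rb") as f:
        # start=1 makes the counter match the line numbers an editor shows.
        for lineno, raw_line in enumerate(f, start=1):
            try:
                raw_line.decode("ascii")
            except UnicodeDecodeError:
                bad.append(lineno)
    return bad
```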
--Tim

Martin v. Löwis · Feb 23, 2007

Tim said:
That looks much cleaner. I didn't know the 'num' from the enumerate would
persist so the except block could report it.

It's indeed guaranteed that the for loop index variables will keep the
value they had when the loop stopped (either through regular
termination, break, or an exception) (unlike list comprehensions, where
the variable also stays, but only as a side effect of the implementation
strategy).
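
A small illustration of that guarantee, written for Python 3 with a hypothetical function name. Note that in Python 3 the variables of a list comprehension no longer stay around afterwards, so the parenthetical above applies to Python 2 only.

```python
def index_when_loop_stopped(values):
    """Return the loop index at the moment iteration ended, however it ended."""
    index = None  # stays None only if the loop body never runs
    try:
        for index, value in enumerate(values):
            if value < 0:
                raise ValueError("negative value")
    except ValueError:
        # The exception left the loop early; index keeps its last value.
        pass
    return index
```

With [1, 2, -3, 4] the function returns 2, the index at which the exception was raised; with an empty list the loop variable is never bound, which is why index is set to None beforehand.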

Regards,
Martin

Scott David Daniels · Feb 24, 2007

bg_ie said:
I'm updating my program to Python 2.5, but I keep running into
encoding problems. I have no encodings defined at the start of any of
my scripts. What I'd like to do is scan a directory and list all the
files in it that contain a non-ASCII character. How would I go about
doing this?

def non_ascii(files):
    for file_name in files:
        f = open(file_name, 'rb')
        data = f.read()
        f.close()
        # max() over the data finds the highest byte; "or ' '" covers empty files
        if '~' < max(data or ' '):
            yield file_name

if __name__ == '__main__':
    import os.path
    import glob
    import sys
    for dirname in sys.path[1:] or ['.']:
        for name in non_ascii(glob.glob(os.path.join(dirname, '*.py')) +
                              glob.glob(os.path.join(dirname, '*.pyw'))):
            print name

--Scott David Daniels

Toby A Inkster · Feb 24, 2007

bg_ie said:
What I'd like to do is scan a directory and list all the
files in it that contain a non ascii character.

Not quite sure what your intention is. If you're planning a one-time scan
of a directory for non-ASCII characters in files, so that you can manually
fix those files up, then this Perl one-liner will do the trick. At the
command line, type:

perl -ne 'print "$ARGV:$.\n" if /[\x80-\xFF]/; close ARGV if eof;' *

This will print out a list of files that contain non-ASCII characters, and
the line numbers on which those characters appear. (The "close ARGV if eof"
resets the line counter $. at the end of each file; without it the numbers
keep counting up across files.) Note that this also operates on binary files
like images, so you may want to be more specific with the wildcard, e.g.:

perl -ne 'print "$ARGV:$.\n" if /[\x80-\xFF]/; close ARGV if eof;' *.py *.txt *.*htm*

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!
