read from file with mixed encodings in Python3

Jaroslav Dobrek · Nov 7, 2011

Hello,

in Python3, I often have this problem: I want to do something with
every line of a file. Like Python3, I presuppose that every line is
encoded in utf-8. If this isn't the case, I would like Python3 to do
something specific (like skipping the line, writing the line to
standard error, ...)

Like so:

    try:
        ...
    except UnicodeDecodeError:
        ...

Yet, there is no place for this construction. If I simply do:

    for line in f:
        print(line)

this will raise a UnicodeDecodeError if some line is not valid utf-8,
and I can't tell Python3 to handle the error and carry on.

This will not work:

    for line in f:
        try:
            print(line)
        except UnicodeDecodeError:
            ...

because the UnicodeDecodeError is raised in the "for line in f" part,
outside the try block.

How can I catch such exceptions?

Note that recoding the file before opening it is not an option,
because often files contain many different strings in many different
encodings.

Jaroslav

Dave Angel · Nov 7, 2011

A file with mixed encodings isn't a text file. So open it with 'rb'
mode, and use read() on it. Find your own line-endings, since a given
'\n' byte may or may not be a line-ending.

Once you've got something that looks like a line, explicitly decode it
using utf-8. Some invalid lines will give an exception and some will
not. But perhaps you've got some other gimmick to tell the encoding for
each line.
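
The steps above can be sketched roughly as follows. This is only a sketch, assuming every encoding in the file is ASCII-compatible, so that a 0x0A or 0x0D byte always marks a line ending (that is not true for UTF-16, for example). The names iter_utf8_lines and on_error are made up for illustration.

```python
import sys


def iter_utf8_lines(path, on_error=None):
    """Yield (lineno, text) for every line that decodes as UTF-8.

    Lines that fail to decode are passed to on_error(lineno, raw, exc)
    and otherwise skipped. Assumption: line endings are plain \n, \r
    or \r\n bytes, which only holds for ASCII-compatible encodings.
    """
    with open(path, "rb") as f:      # binary mode: no implicit decoding
        data = f.read()
    # bytes.splitlines() splits on \n, \r and \r\n and drops the endings
    for lineno, raw in enumerate(data.splitlines(), start=1):
        try:
            yield lineno, raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            if on_error is not None:
                on_error(lineno, raw, exc)


def report_to_stderr(lineno, raw, exc):
    # One possible handler: show the offending bytes on standard error
    print(f"line {lineno}: not UTF-8: {raw!r}", file=sys.stderr)
```

It could be used as "for lineno, text in iter_utf8_lines(path, report_to_stderr): print(text)". Because the decoding happens inside the loop body rather than in the file iterator, the try/except has a place to go.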

Peter Otten · Nov 7, 2011

I don't see those files often, but I think they are all seriously broken.
There's no way to recover the information from files with unknown mixed
encodings. However, here's an approach that may sometimes work:

    # f is assumed to be opened in binary mode ("rb"), so each line is bytes
    for line in f:
        try:
            line = "UTF-8 " + line.decode("utf-8")
        except UnicodeDecodeError:
            line = "Latin-1 " + line.decode("latin-1")
        print(line, end="")

    UTF-8 äöü
    Latin-1 äöü
    UTF-8 äöü
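
The same idea can be stretched to an ordered list of candidate encodings. The helper below is a hypothetical sketch, not part of any library; decode_with_fallbacks and its encodings parameter are names invented here. Keep in mind that latin-1 decodes any byte sequence without error, so it acts as a catch-all and its label is only a guess.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "latin-1")):
    """Return (encoding_name, text) for the first encoding that decodes raw.

    The order matters: strict encodings such as UTF-8 should come first,
    because latin-1 accepts every possible byte and never raises.
    """
    for enc in encodings:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of {encodings!r} could decode {raw!r}")
```

With a file opened in "rb" mode, each line can then be handled as "enc, text = decode_with_fallbacks(line)".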
