Handling text lines from files with some (few) strange chars

Paulo da Silva · Jun 6, 2010

I need to read text files and process each line using string
comparisons and regexps.

I have a python2 program that uses <file object>.readline to read each
line as a string. Then, processing it was a trivial job.

With python3 I get error messages like:
  File "./pp1.py", line 93, in RL
    line=inf.readline()
  File "/usr/lib64/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4963-4965: invalid data

How do I handle this?

If I use <file object>.read on a file opened as binary, I get a <bytes>
object. Then how do I handle it? Regexps, comparisons with strings, ...?
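
On the binary route: a minimal sketch, assuming the data stays as bytes until text is actually needed. The `re` module accepts bytes patterns, and bytes compare against bytes literals rather than str. The sample data, the pattern and the iso-8859-15 choice are made up for illustration.

```python
import re

# Hypothetical sample: the byte 0xE9 makes this invalid as UTF-8.
raw = b"name: Jos\xe9\nname: Ana\n"

# A bytes pattern (rb"...") matches directly against a bytes object.
pattern = re.compile(rb"^name: (.+)$", re.MULTILINE)
names = pattern.findall(raw)

# Comparisons need bytes on both sides; b"name:" is a bytes literal.
first_line = raw.splitlines()[0]
is_name_line = first_line.startswith(b"name:")

# Decode only where text is needed; the encoding here is an assumption.
decoded = [n.decode("iso-8859-15") for n in names]
```

Mixing str and bytes in a pattern or a comparison does not work in Python 3, so every literal has to carry the b prefix until the decode step.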

Thanks for any help.

Chris Rebert · Jun 6, 2010

I need to read text files and process each line using string
comparisons and regexps.

I have a python2 program that uses <file object>.readline to read each
line as a string. Then, processing it was a trivial job.

With python3 I get error messages like:
  File "./pp1.py", line 93, in RL
    line=inf.readline()
  File "/usr/lib64/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4963-4965: invalid data

How do I handle this?

Specify the encoding of the text when opening the file using the
`encoding` parameter. For Windows-1252 for example:

your_file = open("path/to/file.ext", 'r', encoding='cp1252')
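
A slightly fuller sketch of the same idea, assuming a throwaway sample file written just for the demonstration. It also shows the `errors` parameter of `open()`, which is one way to keep reading when a few bytes do not fit the chosen encoding. The file contents are made up.

```python
import os
import tempfile

# Hypothetical sample file: cp1252 bytes that are not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("caf\u00e9 \u20ac5\n".encode("cp1252"))
    path = tmp.name

# Known encoding: decode strictly, so a wrong guess fails loudly.
with open(path, "r", encoding="cp1252") as f:
    strict_lines = f.readlines()

# Wrong or unknown encoding: errors="replace" substitutes U+FFFD for
# each undecodable byte instead of raising UnicodeDecodeError.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    lossy_lines = f.readlines()

os.remove(path)
```

The second form loses information, so it probably only makes sense when the odd characters do not matter for the comparisons being done.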

Cheers,
Chris

python · Jun 6, 2010

Chris,

Specify the encoding of the text when opening the file using the `encoding` parameter. For Windows-1252 for example:

your_file = open("path/to/file.ext", 'r', encoding='cp1252')

This looks similar to the codecs module's functionality. Do you know if
the codecs module is still required in Python 3.x?
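
For plain file reading the built-in `open()` takes `encoding` directly in Python 3, so `codecs` is not needed for that; the module is still there for lower-level work on bytes. A small sketch, assuming a made-up temporary file, showing that both routes give the same text:

```python
import codecs
import os
import tempfile

# Hypothetical sample file encoded as cp1252.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("na\u00efve\n".encode("cp1252"))
    path = tmp.name

# Route 1: let open() do the decoding.
with open(path, "r", encoding="cp1252") as f:
    via_open = f.read()

# Route 2: read raw bytes and decode them explicitly with codecs.
with open(path, "rb") as f:
    via_codecs = codecs.decode(f.read(), "cp1252")

os.remove(path)
```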

Thank you,
Malcolm

Paulo da Silva · Jun 6, 2010

On 06-06-2010 00:41, Chris Rebert wrote:

Specify the encoding of the text when opening the file using the
`encoding` parameter. For Windows-1252 for example:

your_file = open("path/to/file.ext", 'r', encoding='cp1252')

OK! This fixes my current problem. I used encoding="iso-8859-15". This
is how my text files are encoded.
But what about a more general case where the encoding of the text file
is unknown? Is there anything like "autodetect"?

MRAB · Jun 6, 2010

Paulo said:
On 06-06-2010 00:41, Chris Rebert wrote:

OK! This fixes my current problem. I used encoding="iso-8859-15". This
is how my text files are encoded.
But what about a more general case where the encoding of the text file
is unknown? Is there anything like "autodetect"?
An encoding like 'cp1252' uses 1 byte/character, but so does 'cp1250'.
How could you tell which was the correct encoding?

Well, if the file contained words in a certain language and some of the
characters were wrong, then you'd know that the encoding was wrong. This
does imply, though, that you'd need to know what the language should
look like!

You could try different encodings, and for each one try to identify what
could be words, then look them up in dictionaries for various languages
to see whether they are real words...
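
The "try different encodings" part, without the dictionary lookup, can be sketched roughly as below. The function name and the candidate list are assumptions for illustration. Note the limitation already described: a 1 byte/character encoding such as iso-8859-15 accepts any byte sequence, so it can only serve as the last resort and a clean decode does not prove it was the right choice.

```python
def decode_with_fallback(data, candidates=("utf-8", "iso-8859-15")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: strict encodings such as utf-8 should come first,
    because permissive single-byte encodings never raise.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


# Valid UTF-8 is accepted by the first candidate.
text_a, enc_a = decode_with_fallback("caf\u00e9".encode("utf-8"))

# A lone 0xE9 byte is invalid UTF-8, so the fallback is used.
text_b, enc_b = decode_with_fallback(b"caf\xe9")
```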

John Machin · Jun 6, 2010

An encoding like 'cp1252' uses 1 byte/character, but so does 'cp1250'.
How could you tell which was the correct encoding?

Well, if the file contained words in a certain language and some of the
characters were wrong, then you'd know that the encoding was wrong. This
does imply, though, that you'd need to know what the language should
look like!

You could try different encodings, and for each one try to identify what
could be words, then look them up in dictionaries for various languages
to see whether they are real words...

This has been automated (semi-successfully, with caveats) by the
chardet package ... see http://chardet.feedparser.org/
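
A minimal sketch of how chardet is typically called, assuming the third-party package is installed (the import is guarded so the snippet still runs without it). `guess_encoding` is a hypothetical wrapper name; `chardet.detect()` takes bytes and returns a dict that includes an "encoding" guess and a "confidence" score.

```python
try:
    import chardet  # third-party package; may not be installed
except ImportError:
    chardet = None


def guess_encoding(data):
    """Return chardet's best guess for the encoding of data, or None.

    Hypothetical helper: returns None when chardet is unavailable or
    when it cannot make a guess.
    """
    if chardet is None:
        return None
    return chardet.detect(data)["encoding"]


# Made-up sample text; short inputs give the detector little to work with.
sample = "Ol\u00e1, isto \u00e9 um texto de exemplo.".encode("iso-8859-15")
guess = guess_encoding(sample)
```

The result is a statistical guess, so it is probably worth checking the confidence value before trusting it on short or mostly-ASCII files.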

Paulo da Silva · Jun 6, 2010

On 06-06-2010 04:05, John Machin wrote:

....

....

This has been automated (semi-successfully, with caveats) by the
chardet package ... see http://chardet.feedparser.org/

This seems nice!
Thanks
