Character encoding & the copyright symbol

R

Robert Dailey

Hello,

I'm loading a file via open() in Python 3.1 and I'm getting the
following error when I try to print the contents of the file that I
obtained through a call to read():

UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
position 1650: character maps to <undefined>

The file is defined as ASCII and the copyright symbol shows up just
fine in Notepad++. However, Python will not print this symbol. How can
I get this to work? And no, I won't replace it with "(c)". Thanks!
 
P

Philip Semanchuk

Hello,

I'm loading a file via open() in Python 3.1 and I'm getting the
following error when I try to print the contents of the file that I
obtained through a call to read():

UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
position 1650: character maps to <undefined>

The file is defined as ASCII and the copyright symbol shows up just
fine in Notepad++. However, Python will not print this symbol. How can
I get this to work? And no, I won't replace it with "(c)". Thanks!

If the file is defined as ASCII and it contains 0xa9, then the file
was written incorrectly or you were told the wrong encoding. There is
no such character in ASCII which runs from 0x00 - 0x7f.

The copyright symbol == 0xa9 if the encoding is ISO-8859-1 or
windows-1252, and since you're on Windows the latter is a likely bet.

http://en.wikipedia.org/wiki/Ascii
http://en.wikipedia.org/wiki/Iso-8859-1
http://en.wikipedia.org/wiki/Windows-1252


Bottom line is that your file is not in ASCII. Try specifying
windows-1252 as the encoding. Without seeing your code I can't tell
you where you need to specify the encoding, but the Python docs should
help you out.


HTH
Philip
 
R

Richard Brodie

UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
position 1650: character maps to <undefined>

The file is defined as ASCII.

That's the problem: ASCII is a seven bit code. What you have is
actually ISO-8859-1 (or possibly Windows-1252).

The different ISO-8859-n variants assign various characters to
to '\xa9'. Rather than being Western-European centric and assuming
ISO-8859-1 by default, Python throws an error when you stray
outside of strict ASCII.
 
R

Robert Dailey

That's the problem: ASCII is a seven bit code. What you have is
actually ISO-8859-1 (or possibly Windows-1252).

The different ISO-8859-n variants assign various characters to
to '\xa9'. Rather than being Western-European centric and assuming
ISO-8859-1 by default, Python throws an error when you stray
outside of strict ASCII.

Thanks for the help guys. Sorry I left out code, I wasn't sure at the
time if it would be helpful. Below is my code:


#========================================================
def GetFileContentsAsString( file ):
f = open( file, mode='r', encoding='cp1252' )
contents = f.read()
f.close()
return contents

#========================================================
def ReplaceVersion( file, version, regExps ):
#match = regExps[0].search( 'FILEVERSION 1,45332,2100,32,' )
#print( match.group() )
text = GetFileContentsAsString( file )
print( text )


As you can see, I am trying to load the file with encoding 'cp1252'
which, according to the python 3.1 docs, translates to windows-1252. I
also tried 'latin_1', which translates to ISO-8859-1, but this did not
work either. Am I doing something else wrong?
 
A

Albert Hopkins

Hello,

I'm loading a file via open() in Python 3.1 and I'm getting the
following error when I try to print the contents of the file that I
obtained through a call to read():

UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
position 1650: character maps to <undefined>

The file is defined as ASCII and the copyright symbol shows up just
fine in Notepad++. However, Python will not print this symbol. How can
I get this to work? And no, I won't replace it with "(c)". Thanks!

It's not actually ASCII but Windows-1252 extended ASCII-like. So with
that information you can do either of 2 things: You can open it in text
mode and specify the encoding:

or you can open it in binary mode and decode it later:

Or you may be able to set the default encoding to windows-1252 but I
don't know how to do that (in Windows).

p.s.

Next time it might be helpful to paste a code snippet else we have to
make assumptions about what you are actually doing.
 
R

Richard Brodie

As you can see, I am trying to load the file with encoding 'cp1252'
which, according to the python 3.1 docs, translates to windows-1252. I
also tried 'latin_1', which translates to ISO-8859-1, but this did not
work either. Am I doing something else wrong?

Probably it's just the debugging print that has a problem, and if you
opened an output file with an encoding specified it would be fine.
When you get a UnicodeEncodingError, it's conversion _from_
Unicode that has failed.
 
P

Philip Semanchuk

...



That's the problem: ASCII is a seven bit code. What you have is
actually ISO-8859-1 (or possibly Windows-1252).

The different ISO-8859-n variants assign various characters to
to '\xa9'. Rather than being Western-European centric and assuming
ISO-8859-1 by default, Python throws an error when you stray
outside of strict ASCII.

Thanks for the help guys. Sorry I left out code, I wasn't sure at the
time if it would be helpful. Below is my code:


#========================================================
def GetFileContentsAsString( file ):
f = open( file, mode='r', encoding='cp1252' )
contents = f.read()
f.close()
return contents

#========================================================
def ReplaceVersion( file, version, regExps ):
#match = regExps[0].search( 'FILEVERSION 1,45332,2100,32,' )
#print( match.group() )
text = GetFileContentsAsString( file )
print( text )


As you can see, I am trying to load the file with encoding 'cp1252'
which, according to the python 3.1 docs, translates to windows-1252. I
also tried 'latin_1', which translates to ISO-8859-1, but this did not
work either. Am I doing something else wrong?


Are you getting the error when you read the file or when you
print(text)?

As a side note, you should probably use something other than "file"
for the parameter name in GetFileContentsAsString() since file() is a
Python function.
 
N

Nobody

I'm loading a file via open() in Python 3.1 and I'm getting the
following error when I try to print the contents of the file that I
obtained through a call to read():

UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
position 1650: character maps to <undefined>

The file is defined as ASCII and the copyright symbol shows up just
fine in Notepad++. However, Python will not print this symbol. How can
I get this to work? And no, I won't replace it with "(c)". Thanks!

1. As others have said, your file *isn't* ASCII, but that isn't the
problem.

2. The problem is that the encoding which your standard output
stream uses doesn't have the copyright symbol. You need to use something
like:

sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding = 'iso-8859-1')
sys.stderr = io.TextIOWrapper(sys.stderr.detach(), encoding = 'iso-8859-1')

to fix the encoding of the stdout and stderr streams.
 
M

Martin v. Löwis

As a side note, you should probably use something other than "file" for
the parameter name in GetFileContentsAsString() since file() is a Python
function.

Python 3.1.1a0 (py3k:74094, Jul 19 2009, 13:39:42)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
py> file
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'file' is not defined

Regards,
Martin
 
P

Philip Semanchuk

As a side note, you should probably use something other than "file"
for
the parameter name in GetFileContentsAsString() since file() is a
Python
function.

Python 3.1.1a0 (py3k:74094, Jul 19 2009, 13:39:42)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
py> file
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'file' is not defined


Whooops, didn't know about that change from 2.x to 3.x. Thanks.
 
B

Benjamin Kaplan

That's the problem: ASCII is a seven bit code. What you have is
actually ISO-8859-1 (or possibly Windows-1252).

The different ISO-8859-n variants assign various characters to
to '\xa9'. Rather than being Western-European centric and assuming
ISO-8859-1 by default, Python throws an error when you stray
outside of strict ASCII.

Thanks for the help guys. Sorry I left out code, I wasn't sure at the
time if it would be helpful. Below is my code:


#========================================================
def GetFileContentsAsString( file ):
  f = open( file, mode='r', encoding='cp1252' )
  contents = f.read()
  f.close()
  return contents

#========================================================
def ReplaceVersion( file, version, regExps ):
  #match = regExps[0].search( 'FILEVERSION 1,45332,2100,32,' )
  #print( match.group() )
  text = GetFileContentsAsString( file )
  print( text )


As you can see, I am trying to load the file with encoding 'cp1252'
which, according to the python 3.1 docs, translates to windows-1252. I
also tried 'latin_1', which translates to ISO-8859-1, but this did not
work either. Am I doing something else wrong?

This is why we need code and full tracebacks. There's a good chance
that your error is on the print(text) line. That's because sys.stdout
is probably a byte stream without an encoding defined. When you try to
print your unicode string, Python has to convert it to a stream of
bytes. Python refuses to guess on the console encoding and just falls
back to ascii, the conversion fails, and you get your error. Try using
print( text.encode( 'cp1252' ) ) instead.
 
D

Dave Angel

Robert said:
Hello,

I'm loading a file via open() in Python 3.1 and I'm getting the
following error when I try to print the contents of the file that I
obtained through a call to read():

UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
position 1650: character maps to <undefined>

The file is defined as ASCII and the copyright symbol shows up just
fine in Notepad++. However, Python will not print this symbol. How can
I get this to work? And no, I won't replace it with "(c)". Thanks!
I see others have alerted you to changes needed in stdout, which is
ASCII coded by default.

But I wanted to comment on the (c) remark. If you're in the US, that's
the wrong abbreviation for copyright. The only recognized abbreviation
is (copr).

DaveA
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top