the unicode saga continues...

Ethan Furman · Nov 14, 2009

So I've added unicode support to my dbf package, but I also have some
rather large programs that aren't ready to make the switch over yet. So
as a workaround I added a (rather lame) option to convert the
unicode-ified data that was decoded from the dbf table back into an
encoded format.

Here's the fun part: in figuring out what the option should be for use
with my system, I tried some tests...

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
('en_US', 'cp1252')

My confusion lies in my apparent codepage (cp1252), and the discrepancy
with character u'\xed' which is absolutely an i with an accent; yet when
I encode with cp1252 and print it, I get an o with a line.

Can anybody clue me in to what's going on here?

~Ethan~

Ulrich Eckhardt · Nov 14, 2009

Ethan said:
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
('en_US', 'cp1252')

My confusion lies in my apparent codepage (cp1252), and the discrepancy
with character u'\xed' which is absolutely an i with an accent; yet when
I encode with cp1252 and print it, I get an o with a line.

^^^^^^^^^^^^^^^^^^^^^^
For the record: I read a small Greek letter phi in your posting, not an o
with a line. If I encode according to my default locale (UTF-8), I get the
letter i with the accent. If I encode with codepage 1252, I get a marker for
an invalid character on my terminal. This is using Debian though, not MS
Windows.

Try printing the repr() of that. The point is that internally, you have the
code point U+00ED (u'\xed'). You then encode it in various code pages,
which yields byte strings representing it ('\xa1', '\xa1' and '\xed'),
useful for storing on disk when the file uses that code page, or for
other forms of I/O.
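
A minimal sketch of that point, written for Python 3 rather than the 2.5 interpreter used in the thread. The choice of codecs is an assumption for illustration: cp437 and cp1252 are the two code pages named in this thread, and UTF-8 is the locale mentioned above.

```python
# One code point, U+00ED (LATIN SMALL LETTER I WITH ACUTE).
text = "\xed"

# Encode the same code point with several codecs and keep the raw bytes.
encoded = {}
for codec in ("cp437", "cp1252", "utf-8"):
    encoded[codec] = text.encode(codec)
    # repr() of a bytes object shows escapes, so the terminal cannot distort it.
    print(codec, repr(encoded[codec]))
```

The same character ends up as a different byte sequence under each codec, which is why the bytes only make sense to a reader that assumes the same code page.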

Now, with a Unicode string, print knows what to do: it encodes the string
according to the default locale and sends the resulting bytes to stdout.
With a byte string, I think it forwards the content directly to stdout.

Note:
* If you want to verify your code, use 'print repr(..)' instead.
* I could imagine that your locale is simply not set up correctly.
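
As a rough illustration of the first note, here is a Python 3 sketch (an assumption on my part, since the thread uses Python 2.5). In Python 3, ascii() is the closest match to Python 2's repr(): it escapes every non-ASCII character, so the result does not depend on how the terminal is configured.

```python
import sys

text = "\xed"                  # the Unicode string
data = text.encode("cp1252")   # the encoded byte string

# Escaped forms are plain ASCII and look the same on any terminal.
print(ascii(text))
print(ascii(data))

# The encoding print() uses for text on this stdout (may be None when redirected).
print(sys.stdout.encoding)
```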

Uli

Martin v. Löwis · Nov 14, 2009

Can anybody clue me in to what's going on here?

It's as Mark says: the console encoding on your system is cp437, while
the ANSI code page is cp1252.

Windows has *two* default code pages at any point in time: the
OEM code page, and the ANSI code page. Both depend on the
Windows release (Western, Japanese, etc.), and both can be set by the
administrator. The OEM code page is primarily used for the console
(and then also as the encoding on the FAT filesystem); the ANSI
code page is used in all other places (that don't use Unicode APIs).

In addition, the console code page may deviate from the OEM code
page, if you run chcp.exe.
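
The effect of that mismatch can be simulated without a Windows console. The following Python 3 sketch uses a hypothetical helper, shown_on_console, which is not part of any library: it encodes text with one code page and then decodes the bytes the way a console using another code page would interpret them.

```python
def shown_on_console(text, data_encoding, console_encoding):
    """Return what a console using console_encoding would display
    for text that was encoded with data_encoding."""
    raw = text.encode(data_encoding)
    return raw.decode(console_encoding, errors="replace")

# Bytes encoded for the ANSI code page, shown on an OEM (cp437) console.
mismatch = shown_on_console("\xed", "cp1252", "cp437")

# Bytes encoded for the console's own code page.
match = shown_on_console("\xed", "cp437", "cp437")

print(ascii(mismatch))
print(ascii(match))
```

Under these assumptions the mismatched case comes out as U+03C6, the small Greek letter phi reported earlier in the thread, while encoding with the console's own code page round-trips to the accented i.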

Regards,
Martin
