the unicode saga continues...

Ethan Furman

So I've added unicode support to my dbf package, but I also have some
rather large programs that aren't ready to make the switch over yet. So
as a workaround I added a (rather lame) option to convert the
unicode-ified data that was decoded from the dbf table back into an
encoded format.

Here's the fun part: in figuring out what the option should be for use
with my system, I tried some tests...

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
('en_US', 'cp1252')

My confusion lies in my apparent codepage (cp1252), and the discrepancy
with character u'\xed' which is absolutely an i with an accent; yet when
I encode with cp1252 and print it, I get an o with a line.

Can anybody clue me in to what's going on here?

~Ethan~
 
Ulrich Eckhardt

Ethan said:
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
('en_US', 'cp1252')

My confusion lies in my apparent codepage (cp1252), and the discrepancy
with character u'\xed' which is absolutely an i with an accent; yet when
I encode with cp1252 and print it, I get an o with a line.
^^^^^^^^^^^^^^^^^^^^^^
For the record: I read a small Greek letter phi in your posting, not an o
with a line. If I encode according to my default locale (UTF-8), I get the
letter i with the accent. If I encode with codepage 1252, I get a marker for
an invalid character on my terminal. This is using Debian though, not MS
Windows.

Try printing the repr() of that. The point is that internally, you have the
code point U+00ED (u'\xed'). You then encode it in various codepages, which
yields strings of bytes representing it ('\xa1', '\xa1' and '\xed'). Those
bytes are useful for storing on disk when the file uses said codepage, or
for other forms of IO.
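
To make the difference between the code point and its encoded bytes concrete, here is a minimal sketch. The three codec names (cp437, cp850, cp1252) are an assumption on my part: they are not named in this post, but they reproduce the byte values quoted above. It is written for Python 3, where encoded data shows up as b'...'.

```python
# LATIN SMALL LETTER I WITH ACUTE, the character discussed in the thread.
ch = u'\xed'

# Encode the same code point with each (assumed) code page and keep the bytes.
encoded = {}
for codec in ('cp437', 'cp850', 'cp1252'):
    encoded[codec] = ch.encode(codec)
    # repr() shows the raw byte values instead of whatever glyph the
    # terminal would pick for them.
    print(codec, repr(encoded[codec]))
```

The code point never changes; only the byte chosen to represent it does, which is why looking at repr() is more reliable than looking at the printed glyph.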

Now, with a Unicode string, the output (print) knows what to do: it encodes
it according to the default locale and sends the resulting bytes to stdout.
With a byte string, I think it forwards the content to stdout unchanged.

Note:
* If you want to verify your code, rather use 'print repr(..)'.
* I could imagine that your locale is simply not set up correctly.
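
One way to check the second point is to compare what the locale reports with the encoding Python has attached to standard output. This is only a sketch: the variable names are mine, and the values printed depend on the platform, the Python version and whether output is redirected, so no particular result should be expected.

```python
import locale
import sys

# Encoding reported by the locale settings (typically the ANSI code page
# on Windows).
preferred = locale.getpreferredencoding()

# Encoding Python associated with standard output; this can be None when
# it could not be determined (for example with some redirected streams).
stdout_encoding = getattr(sys.stdout, 'encoding', None)

print('locale preferred encoding:', preferred)
print('sys.stdout.encoding:', stdout_encoding)
```

If the two values differ, byte strings encoded for one of them may be displayed using the other.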

Uli
 
Martin v. Löwis

Can anybody clue me in to what's going on here?

It's as Mark says: the console encoding is cp437 on your system, not
cp1252.

Windows has *two* default code pages at any point in time: the
OEM code page, and the ANSI code page. Both depend on the
Windows release (Western, Japanese, etc.), and can be set by the
administrator. The OEM code page is primarily used for the console
(and then also as the encoding on the FAT filesystem); the ANSI
code page is used in all other places (that don't use Unicode APIs).

In addition, the console code page may deviate from the OEM code
page, if you run chcp.exe.
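
The mismatch can be simulated without a Windows console. The sketch below assumes, as in this thread, that the ANSI code page is cp1252 and the console (OEM) code page is cp437; decoding with cp437 stands in for what the console does when it picks a glyph for each byte. It is written for Python 3 and the variable names are only illustrative.

```python
ch = u'\xed'  # LATIN SMALL LETTER I WITH ACUTE

# Bytes produced with the ANSI code page that the locale reports.
ansi_bytes = ch.encode('cp1252')

# What a cp437 console makes of those same bytes:
# U+03C6, GREEK SMALL LETTER PHI.
shown_on_console = ansi_bytes.decode('cp437')

# Encoding for the console's own code page gives back the intended letter.
round_trip = ch.encode('cp437').decode('cp437')

print(repr(ansi_bytes), ascii(shown_on_console), ascii(round_trip))
```

The byte 0xED is the accented i in cp1252 but the small phi in cp437, which matches the phi reported earlier in the thread.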

Regards,
Martin
 
