A question on Encoding and Decoding.

Discussion in 'Python' started by kath, Nov 13, 2006.

  1. kath

    kath Guest

    Hi all,

    Platform: winxp
    Version: Python 2.3

    I have a task of reading files in a folder and creating an one excel
    file with sheets, one sheet per file, with sheet named as filename. I
    am facing problem in handling special characters. I am using XLRD and
    XLW package to read/write from/to file. But facing problem in handling
    special characters. I am getting encode error.

    UnicodeDecodeError 'ascii' codec can't encode character u'\xdf' in
    position 19: ordinal not in range(128)
    row: 76

    the cell value at rowx = 76, colx = 0 is
    'Activest-Aktien-Großbritannien'

    I used Latin-1 encoding, but after the file is created I get an error
    'Unable to read the file'.

    When I get the exception I want to format the string so that I can use
    it to write to a file and also query database.

    Can anybody guide how to solve this problem. what encoding I should
    use, and after wring to file I should the same special character.

    Also python IDLE is able to output the same character corretly when I
    say print and why not I?


    Any suggestions would be greatly appreciated.

    thanks.
    regards,
    kath.
     
    kath, Nov 13, 2006
    #1
    1. Advertising

  2. kath

    John Machin Guest

    kath wrote:
    > Hi all,
    >
    > Platform: winxp
    > Version: Python 2.3
    >
    > I have a task of reading files in a folder and creating an one excel
    > file with sheets, one sheet per file, with sheet named as filename. I
    > am facing problem in handling special characters. I am using XLRD and
    > XLW package to read/write from/to file.


    The name of the first package is "xlrd", *NOT* "XLRD".

    By "XLW"; do you mean python2xlw? If so, why don't you upgrade to
    Python 2.4 or 2.5 so that you can use pyExcelerator (which is the most
    modern and least unmaintained of the pure-Python Excel-writing
    gadgets)? If you can't upgrade, consider using pyXLWriter instead.

    Note that python2xlw is *NOT* intended to be a general-purpose Excel
    file writer. From the top of its XLW.py:
    '''Simple XLW file maker.
    Can create data sheets and Scatter Charts from the data sheets.
    Data sheet is defined as a list of rows.
    The first row contains labels, subsequent rows are numeric.
    Charts are defined by adding XY series.
    Series are defined as zero-based indeces to the dataSheet, XColumn, and
    YColumn.
    The entire column will be plotted.'''

    Are you sure that the description "The first row contains labels,
    subsequent rows are numeric." fits your data?

    At this stage, sentient beings who are looking for a general-purpose
    Excel writer and who are not skeptics would probably turn away. Kinda
    makes you believe it won't handle *any* kind of string (either str or
    Unicode) at rowx == 76. Game over. "77 Sunset Row" :)

    It turns out that what happens (quite independently of row number) is:
    Ints and reals are spat out as Excel NUMBER records.
    For *any* other Python type, it attempts (without benefit of try/accept
    or any other prophylactic) to write str(value)[:254] to an Excel LABEL
    record.
    Apart from silent truncation of data, this will cause:
    * unicode data that contains non-ASCII characters to produce a
    UnicodeDecodeError
    * a long item e.g. 1234567890L to appear as an Excel text type with
    value"1234567890", not an Excel number type wirth value 1234567890.0

    Oh, and if you forgot to call the XLW.XLW.addChart method, it kindly
    remedies your deficiency by supplying one. What if you consciously
    don't want any ferschlugginer charts?

    Unfortunately, outside the Redmond Pale, knowledge of charts in .XLW
    files is scant. Given such a file:
    * Gnumeric 1.6.something (Windows version) crashes
    * OpenOffice.org's Calc 2.0.something (Windows version) thinks silently
    for a second or two, then leaves you with an empty worksheet, and no
    charts, and no message.
    * xlrd grumbles that it was expecting a worksheet, and gives up -- I'll
    fix this; it'll expect a chart as a possibility, and ignore it (as with
    XLS files).

    > But facing problem in handling
    > special characters. I am getting encode error.


    Have you read any of these:
    (a) The notes on Unicode in the xlrd documentation?
    (b) The Unicode howto (http://www.amk.ca/python/howto/unicode)
    ?

    >
    > UnicodeDecodeError 'ascii' codec can't encode character u'\xdf' in
    > position 19: ordinal not in range(128)
    > row: 76
    >
    > the cell value at rowx = 76, colx = 0 is
    > 'Activest-Aktien-Großbritannien'


    :)
    Gross Brits? Must be talking about the "Barmy Army" :)
    :)

    >
    > I used Latin-1 encoding,


    If, as it seems, you think you've found out what the problem is, fixed
    it, and continued on, why bother telling us?

    > but after the file is created I get an error
    > 'Unable to read the file'.


    *WHAT* program is producing this error message? Under what
    circumstances?

    >
    > When I get the exception I want to format the string so that I can use
    > it to write to a file and also query database.


    Never mind what you want; what you *need* is to encode your data with
    the appropriate encoding, so that you *don't* get the exception. Then
    it should be a good old legacy str instance, suitable for writing
    anywhere within reason. You can query your database with str and/or
    unicode data, depending on how it is stored and whether the database
    interface converts on the fly -- read the database-specific docs and/or
    ask a specific question.

    >
    > Can anybody guide how to solve this problem. what encoding I should
    > use, and after wring to file I should the same special character.


    My parser barfed on that last clause:
    ParseFailure: After "wring", expected "neck"
    :)

    The encoding that you should use is one that encompasses all your data.
    If latin1 doesn't hack it, switch to pyExcelerator (as already
    recommended); it will write the latest (only 9 years old) version of
    Excel files which are recorded in utf_16_le.

    What makes you think that the encoding problem (which you appear to
    have fixed (at least temporarily)) is anything at all to do with the
    mystery program saying 'Unable to read the file"? Have you tried a test
    input file which has *only* ASCII text in it?

    >
    > Also python IDLE is able to output the same character corretly when I
    > say print and why not I?


    I have to go out now, so I'll leave something for someone else to
    answer.

    >
    >
    > Any suggestions would be greatly appreciated.
    >


    If it all becomes too hard, try sending me (1) a two-sheet input test
    file for your app (2) the corresponding error-message-causing output
    file and (3) your code, and I'll have a look at it.

    Cheers,
    John
     
    John Machin, Nov 14, 2006
    #2
    1. Advertising

  3. "kath" wrote:

    > Also python IDLE is able to output the same character corretly when I
    > say print and why not I?


    probably because IDLE's interactive window is Unicode-aware, but your terminal
    is not.

    </F>
     
    Fredrik Lundh, Nov 14, 2006
    #3
  4. kath

    Guest

    Hello,
    I think this remark is more to the point. In my experience, the general
    problem is that python operates with the default encoding "ascii" as in
    sys.getdefaultencoding(). It is possible to set the defaultencoding in
    sitecustomize.py, with sys.setdefaultencoding('latin1'). I have placed
    sitecustomize.py in Lib/site-packages. It is not possible to set the
    encoding once python has started. Setting the encoding only works if
    you can bind yourself to this one encoding and is therefore no general
    fix.
    The only reasonable way to work is to get your strings into unicode
    (and sometimes back out again).
    If for instance you type:
    s = "äÄöÖüÜß" and then try
    us = unicode(s) you will get a traceback identical to yours.
    However:
    us = unicode(s,'latin1')
    will work. If however to try:
    print us
    you will get another traceback.

    try:
    >>> print "%r" % us

    u'\x84\x8e\x94\x99\x81\x9a\xe1'
    Even if it is not pretty, at least the program won't crash.
    You can get a better result with:
    >>> print us.encode('latin1')

    äÄöÖüÜß

    The whole topic can get tedious at times, but be assured,
    PyExcelerator, for writing, and xlrd for reading Excel files do work,
    with unicode, both in the content and in the sheetnames. PyExcelerator
    may be able to read Excel Sheets but not nearly as well as xlrd, which
    works wonderfully.

    regards,
    Richard Sharp


    Fredrik Lundh wrote:
    > "kath" wrote:
    >
    > > Also python IDLE is able to output the same character corretly when I
    > > say print and why not I?

    >
    > probably because IDLE's interactive window is Unicode-aware, but your terminal
    > is not.
    >
    > </F>
     
    , Nov 15, 2006
    #4
  5. wrote:
    > Hello,
    > I think this remark is more to the point. In my experience, the general
    > problem is that python operates with the default encoding "ascii" as in
    > sys.getdefaultencoding(). It is possible to set the defaultencoding in
    > sitecustomize.py, with sys.setdefaultencoding('latin1'). I have placed
    > sitecustomize.py in Lib/site-packages. It is not possible to set the
    > encoding once python has started. Setting the encoding only works if
    > you can bind yourself to this one encoding and is therefore no general
    > fix.
    > The only reasonable way to work is to get your strings into unicode
    > (and sometimes back out again).
    > If for instance you type:
    > s = "äÄöÖüÜß" and then try
    > us = unicode(s) you will get a traceback identical to yours.


    I missed the beginning of this thread, but why not write

    s = u"äÄöÖüÜß"

    Is there ever a reason _not_ to exclusively use the unicode stringtype
    throughout your Python program? (of course you may need to encode/decode
    when interfacing with the world outside your program)

    /johan
     
    Johan von Boisman, Nov 17, 2006
    #5
  6. kath

    John Machin Guest

    On Nov 17, 9:18 pm, Johan von Boisman <> wrote:
    > I missed the beginning of this thread, but why not write
    >
    > s = u"äÄöÖüÜß"
    >
    > Is there ever a reason _not_ to exclusively use the unicode stringtype
    > throughout your Python program? (of course you may need to encode/decode
    > when interfacing with the world outside your program)
    >


    Please consider going back and finding the beginning of the thread.

    Cheers,
    John
     
    John Machin, Nov 17, 2006
    #6
  7. Johan von Boisman wrote:

    > Is there ever a reason _not_ to exclusively use the unicode stringtype
    > throughout your Python program?


    speed and memory use. in 2.5, the unicode datatype is almost often as
    fast as the string type at the algorithm level, but it's still limited
    by memory bandwidth for certain operations. it simply takes a bit more
    time to copy two or four times as much data.

    but requiring four times as much memory use can be really hurting, in
    some applications, e.g.

    http://effbot.org/zone/celementtree.htm#benchmarks

    a good compromise is to use 8-bit strings for ASCII-only strings, and
    Unicode strings for everything else. ASCII strings mix well with
    Unicode, and you can use (almost) all string methods and other string
    operations on both kind of strings, without problems.

    </F>
     
    Fredrik Lundh, Nov 17, 2006
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. James Antill
    Replies:
    2
    Views:
    397
    Randy Howard
    Jul 21, 2003
  2. Hallvard B Furuseth
    Replies:
    6
    Views:
    441
    Peter Nilsson
    Jul 22, 2003
  3. Tak-Shing Chan
    Replies:
    1
    Views:
    343
    =?ISO-8859-1?Q?Bj=F8rn_Augestad?=
    Jul 22, 2003
  4. Tzury Bar Yochay
    Replies:
    3
    Views:
    1,008
    Rob Williscroft
    Mar 24, 2008
  5. Napalmski
    Replies:
    2
    Views:
    414
    Napalmski
    May 5, 2009
Loading...

Share This Page