Getting data through with the proper encoding, end to end


Ksenia Marasanova

Hi,

I have a little problem with encoding. I was hoping someone could
help me solve it.

There is some data in a database (PG) that must be inserted
into an Excel sheet and emailed. Nothing special; everything works,
except that non-ASCII characters are not displayed properly.
The data is stored as XML in a text field. When I use psql it's
displayed correctly in the terminal. When I run my script and print the data
with a "print" statement, it's still good. Then I use pyXLWriter to write the
sheet, and the Python email package to email it... and the resulting sheet
is not good:

Г is displayed instead of ü (for example)

The most important question that I'd like to ask, is what is the
proper way to debug it?! How can I determine where it goes wrong
(maybe by calculating character code or something?) I know very little
about how encoding works, I use it, but I don't really understand
it... any dummy-proof pointers are also appreciated :)

Thanks...
 

John Machin

Ksenia said:
Hi,

I have a little problem with encoding. I was hoping someone could
help me solve it.

There is some data in a database (PG) that must be inserted
into an Excel sheet and emailed. Nothing special; everything works,
except that non-ASCII characters are not displayed properly.
The data is stored as XML in a text field.

This sentence doesn't make much sense. Explain.
When I use psql it's displayed correctly in the terminal. When I run my
script and print the data with a "print" statement, it's still good.

Instead of "print data", do "print repr(data)" and show us what you
get. What *you* see on the screen is not much use for diagnosis; it's
the values of the bytes in the file that matter.
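For example, here is the kind of thing repr() reveals (shown in today's Python 3 terms; the sample byte string is made up):

```python
# Made-up sample: the Latin-1 bytes for "café"
data = b'caf\xe9'
print(repr(data))                    # b'caf\xe9' -- the actual byte values
print(repr(data.decode('latin-1')))  # 'café' -- the bytes interpreted as Latin-1
print(repr(data.decode('cp1251')))   # 'cafй' -- the same bytes misread as Cyrillic
```

The repr() output is the same no matter what font or terminal encoding the viewer uses, which is what makes it useful for diagnosis.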
Then I use pyXLWriter to write the
sheet,

Open the spreadsheet with Microsoft Excel, copy-and-paste some data to
a Notepad window, save the Notepad file as Unicode type named (say)
"junk.u16" then at the Python interactive prompt do this:

file("junk.u16", "rb").read().decode("utf16")

and show us what you get.
and Python email package to email it... and the resulting sheet
is not good:

E-mailed how? To whom? [I.e. what country / what cultural background /
on what machine / what operating system / viewed using what software]
Г is displayed instead of ü (for example)

You are saying (in effect) U+0413 (Cyrillic upper case letter GHE) is
displayed instead of U+00FC (Latin small letter U with diaeresis).

OK, we'd already guessed your background from your name :)

However, what you see isn't necessarily what you've got. How do you
know it's not U+0393 (Greek capital letter GAMMA) or something else
that looks the same? Could even be from a line-drawing set (top left
corner of a box). What you need to do is find out the ordinal of the
character being displayed.

This type of problem arises when a character is written in one encoding
and viewed using another. I've had a quick look through various
likely-suspect 8-bit character sets (e.g. Latin1, KOI-8, cp1251,
cp1252, various DOS (OEM) code-pages) and I couldn't see a pair of
encodings that would reproduce anything like your "umlauted-u becomes
gamma-or-similar" problem. Please supply more than 1 example.
 

Ksenia Marasanova

There is some data in a database (PG) that must be inserted
This sentence doesn't make much sense. Explain.

Sorry, I meant: I use a field of type 'text' in a Postgres table to
store my data. The data is an XML string.
Instead of "print data", do "print repr(data)" and show us what you
get. What *you* see on the screen is not much use for diagnosis; it's
the values of the bytes in the file that matter.

Thanks for this valuable tip. I'll take the letter "é" as an example.

"print repr(data)" shows this:
u'\xe9'

Open the spreadsheet with Microsoft Excel, copy-and-paste some data to
a Notepad window, save the Notepad file as Unicode type named (say)
"junk.u16" then at the Python interactive prompt do this:

file("junk.u16", "rb").read().decode("utf16")

and show us what you get.

(I am on a Mac, so I used TextEdit to create a UTF-16 encoded file, right?)
The result from Python is:

u'\u0439'

In the Excel sheet it is shown as: й

(Russian again?!)
and Python email package to email it... and the resulting sheet
is not good:

E-mailed how? To whom? [I.e. what country / what cultural background /
on what machine / what operating system / viewed using what software]

Emailed with Python; please see the code at the end of the message.
The receiving system is OS X with language priority: Dutch, English,
German, Russian and Hebrew. Viewer: MS Office 2004.

You are saying (in effect) U+0413 (Cyrillic upper case letter GHE) is
displayed instead of U+00FC (Latin small letter U with diaeresis).

OK, we'd already guessed your background from your name :)

:)
However, what you see isn't necessarily what you've got. How do you
know it's not U+0393 (Greek capital letter GAMMA) or something else
that looks the same? Could even be from a line-drawing set (top left
corner of a box). What you need to do is find out the ordinal of the
character being displayed.

This type of problem arises when a character is written in one encoding
and viewed using another. I've had a quick look through various
likely-suspect 8-bit character sets (e.g. Latin1, KOI-8, cp1251,
cp1252, various DOS (OEM) code-pages) and I couldn't see a pair of
encodings that would reproduce anything like your "umlauted-u becomes
gamma-or-similar" problem. Please supply more than 1 example.

Thank you very much for your help and explanation! The "é" letter is
all I could find so far; see the examples above.




The following code fragment is used for creating the Excel sheet and sending the email:
# ##################
# Create Excel sheet
#
f = StringIO()
workbook = xl.Writer(f)
worksheet = workbook.add_worksheet()
bold = workbook.add_format(bold=1)
border = workbook.add_format(border=1)

worksheet.write_row('A1', ['First Name', 'Last Name'], border)
i = 1
for row in result:
    print repr(row['firstname'])
    datarow = [row.get('firstname'), row.get('surname')]
    i += 1
    worksheet.write_row('A%s' % i, datarow)
workbook.close()
####################
# Create email message

# Create the container (outer) email message.
msg = MIMEMultipart()
# Attach Excel sheet
xls = MIMEBase('application', 'vnd.ms-excel')
xls.set_payload(f.getvalue())
Encoders.encode_base64(xls)
xls.add_header('Content-Disposition', 'attachment',
               filename='some file name %s-%s-%s.xls' % (today.day, today.month, today.year))
msg.attach(xls)

msg['Subject'] = subject
msg['From'] = fromaddr
msg['To'] = toaddr
# Guarantees the message ends in a newline
msg.epilogue = ''

# Send message
s = smtplib.SMTP(smtp_host)
s.sendmail(fromaddr, to_list, msg.as_string())
 

John Machin

Ksenia said:
Sorry, I meant: I use a field of type 'text' in a Postgres table to
store my data. The data is an XML string.


Thanks for this valuable tip. I'll take the letter "é" as an example.

"print repr(data)" shows this:
u'\xe9'



(I am on a Mac, so I used TextEdit to create a UTF-16 encoded file, right?)
The result from Python is:

u'\u0439'

So Unicode U+00E9 has become U+0439? Magic! I suspect that there is a
conversion step or two in there that you haven't mentioned. Are you
talking about the spreadsheet after it is created by your script on the
machine that created it [which is what I asked], or are you talking
about the spreadsheet on the recipient's machine, or are you talking
about the spreadsheet after the recipient has e-mailed it back to you,
hopefully untouched?
In the Excel sheet it is shown as: й

(Russian again?!)

This is probably indicative that the Latin-1 e-acute (0xE9) is being
converted to Unicode U+0439 by something that thinks it is actually in
an 8-bit Cyrillic encoding (0xE9 is "Cyrillic small letter short I" in
some 8-bit encodings) but the u-umlaut becoming GHE example doesn't fit
this story.
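A quick check (in today's Python 3 terms) confirms that cp1251 is one 8-bit Cyrillic encoding that produces exactly this symptom for e-acute:

```python
# Latin-1 e-acute is the single byte 0xE9
raw = '\xe9'.encode('latin-1')
assert raw == b'\xe9'

# Misread as cp1251, that byte becomes U+0439,
# "Cyrillic small letter short I" -- the й seen in the sheet
assert raw.decode('cp1251') == '\u0439'
```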

Please do a test where you put several different accented Latin letters
in the one field in your database. No, put ALL the non-ASCII characters
that you expect to be transmitted unchanged into test fields -- this
will make you think about what language(s)/locale(s) your database is
designed for and what language(s)/locale(s) your e-mail targets use.
Having this test data will be useful in the future for verifying that
your system works. Repeat all the above steps. Tell us what you see in
Excel on your machine and on the recipient's machine.
and Python email package to email it... and the resulting sheet
is not good:

E-mailed how? To whom? [I.e. what country / what cultural background /
on what machine / what operating system / viewed using what
software]

Emailed with Python; please see the code at the end of the message.
The receiving system is OS X with language priority: Dutch, English,
German, Russian and Hebrew. Viewer: MS Office 2004.

Sending system is ...?
You are saying (in effect) U+0413 (Cyrillic upper case letter GHE) is
displayed instead of U+00FC (Latin small letter U with diaeresis).

OK, we'd already guessed your background from your name :)
:)

However, what you see isn't necessarily what you've got. How do you
know it's not U+0393 (Greek capital letter GAMMA) or something else
that looks the same? Could even be from a line-drawing set (top left
corner of a box). What you need to do is find out the ordinal of the
character being displayed.

This type of problem arises when a character is written in one encoding
and viewed using another. I've had a quick look through various
likely-suspect 8-bit character sets (e.g. Latin1, KOI-8, cp1251,
cp1252, various DOS (OEM) code-pages) and I couldn't see a pair of
encodings that would reproduce anything like your "umlauted-u becomes
gamma-or-similar" problem. Please supply more than 1 example.

Thank you very much for your help and explanation! The "é" letter is
all I could find so far; see the examples above.




The following code fragment is used for creating the Excel sheet and sending the email:
# ##################
# Create Excel sheet
#
f = StringIO()
workbook = xl.Writer(f)
worksheet = workbook.add_worksheet()
bold = workbook.add_format(bold=1)
border = workbook.add_format(border=1)

worksheet.write_row('A1', ['First Name', 'Last Name'], border)
i = 1
for row in result:
    print repr(row['firstname'])
    datarow = [row.get('firstname'), row.get('surname')]

You print the repr() of row['firstname'] but pass row.get('firstname')
to the XLS writer -- do they have the same value? For believability,
print what you pass!!
    i += 1
    worksheet.write_row('A%s' % i, datarow)

Check the documentation for the XLS writer module to see if it is doing
an implicit conversion here.
workbook.close()
####################
# Create email message

# Create the container (outer) email message.
msg = MIMEMultipart()
# Attach Excel sheet
xls = MIMEBase('application', 'vnd.ms-excel')
xls.set_payload(f.getvalue())
Encoders.encode_base64(xls)
xls.add_header('Content-Disposition', 'attachment',
               filename='some file name %s-%s-%s.xls' % (today.day, today.month, today.year))
msg.attach(xls)

msg['Subject'] = subject
msg['From'] = fromaddr
msg['To'] = toaddr
# Guarantees the message ends in a newline
msg.epilogue = ''

# Send message
s = smtplib.SMTP(smtp_host)
s.sendmail(fromaddr, to_list, msg.as_string())
 

John Machin

Ksenia said:
Sorry, I meant: I use field of the type 'text' in a Postgres table to
store my data. The data is a XML string.


Thanks for this valuable tip. I take letter "é" as an example.

"print repr(data)" shows this:
u'\xe9'

That doesn't look like an "XML string" to me. Show the WHOLE contents
of the field.

Have you read the docs of the Perl module of which pyXLWriter is a
docs-free port? Right down at the end it mutters something about XML
parsers returning UTF-8, which will jam up the works if fed into an Excel
spreadsheet ...
 

Serge Orlov

John said:
That doesn't look like an "XML string" to me. Show the WHOLE contents
of the field.

Have you read the docs of the Perl module of which pyXLWriter is a
docs-free port? Right down at the end it mutters something about XML
parsers returning UTF-8, which will jam up the works if fed into an
Excel spreadsheet ...

Looking at the following function in pyXLWriter
def _asc2ucs(s):
    """Convert ascii string to unicode."""
    return "\x00".join(s) + "\x00"

I can guess several things:
a) pyXLWriter author is an ascii guy :)
b) unicode strings are not supported by pyXLWriter
c) excel keeps unicode text in utf-16le

Ksenia, try encoding unicode strings in utf-16le before passing them to
pyXLWriter. If that doesn't work, it means pyXLWriter requires
changes to support unicode strings.
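For what it's worth, here is a sketch (in today's Python 3 terms) showing why that helper only happens to work for ASCII: interleaving NULs coincides with UTF-16LE for ASCII input, and for nothing else.

```python
def asc2ucs(s):
    # Re-implementation sketch of pyXLWriter's _asc2ucs:
    # insert a NUL after every character
    return '\x00'.join(s) + '\x00'

# For pure-ASCII input the result matches real UTF-16LE byte for byte
assert asc2ucs('abc').encode('latin-1') == 'abc'.encode('utf-16le')

# Real UTF-16LE encodes u-umlaut as the byte pair FC 00; naive
# NUL-interleaving of already-encoded bytes would not produce this
assert '\xfc'.encode('utf-16le') == b'\xfc\x00'
```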

Serge.
 

John Machin

Serge said:
Looking at the following function in pyXLWriter
def _asc2ucs(s):
    """Convert ascii string to unicode."""
    return "\x00".join(s) + "\x00"

I can guess several things:
a) pyXLWriter author is an ascii guy :)

Shrewd guess :)
b) unicode strings are not supported by pyXLWriter

But that _asc2ucs() is used ONLY in write_url* methods ... so there's
hope yet.
c) excel keeps unicode text in utf-16le

Uh-huh. MS-everything is LE.
 

Ksenia Marasanova

John, Serge, thanks for your help!

utf-16le encoding didn't help. However, I had to solve it yesterday,
so I used the csv module to create a CSV file and then import it into Excel.
Excel still had trouble with accented characters, but that is another
story: it seems that Office 2004 Excel (for Mac, but I assume the PC
version is no better) cannot import UTF-8 encoded text files. Encoding
the CSV file as Latin-1 finally did work.
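A minimal sketch of that workaround in today's Python 3 terms (the file name and rows are made up):

```python
import csv
import io

# Made-up rows containing accented Latin-1 characters
rows = [['First Name', 'Last Name'],
        ['Ren\xe9', 'M\xfcller']]

# Excel of that era opens a Latin-1 CSV where a UTF-8 one comes out garbled
with io.open('export.csv', 'w', encoding='latin-1', newline='') as f:
    csv.writer(f).writerows(rows)
```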

Now back to the Excel story: I also think there is something
wrong with pyXLWriter or the way I use it. The CSV file generation was
okay, so I think there is nothing wrong with my data or XML parser.

I will resume in a few days with pyXLWriter and will post the
results here, but anyway, many thanks for your time and explanation!
 

Serge Orlov

Ksenia said:
John, Serge, thanks for your help!

utf-16le encoding didn't help. However, I had to solve it yesterday,
so I used the csv module to create a CSV file and then import it into Excel.
Excel still had trouble with accented characters, but that is another
story: it seems that Office 2004 Excel (for Mac, but I assume the PC
version is no better) cannot import UTF-8 encoded text files.

Right, I tried on Windows XP: a UTF-8 CSV file is imported as garbage.
However, a CSV file saved in UTF-16 encoding is imported correctly.
Encoding the CSV file as Latin-1 finally did work.

Now back to the Excel story: I also think there is something
wrong with pyXLWriter or the way I use it. The CSV file generation was
okay, so I think there is nothing wrong with my data or XML parser.

I will resume in a few days with pyXLWriter and will post the
results here, but anyway, many thanks for your time and explanation!

I believe Microsoft Office went through a byte-string-to-Unicode
transition between 1995 and 1997. I still remember times
when you could receive a Microsoft Office file and couldn't view it.
I suspect pyXLWriter writes strings in that old format, so the utf-16le
trick didn't work. You could try contacting the pyXLWriter author
to ask about Unicode support.

Serge.
 

John Machin

Ksenia said:
John, Serge, thanks for your help!

Thank *you* for having interesting problems :)
utf-16le encoding didn't help. However, I had to solve it yesterday,
so I used the csv module to create a CSV file and then import it into Excel.
Excel still had trouble with accented characters, but that is another
story: it seems that Office 2004 Excel (for Mac, but I assume the PC
version is no better) cannot import UTF-8 encoded text files. Encoding
the CSV file as Latin-1 finally did work.

Yes, Excel appears not to understand UTF-8. It interprets CSV files
according to the current locale / codepage / whatever -- the "old bad
way" that Unicode is meant to save us from.

An alternative, if you need to represent more than one codepage, or
want a "new good way" of doing it: Excel allows "Save As" to "Unicode
Text" format. It uses Unicode tab u'\t' as delimiter. It quotes tabs,
quotes quotes by doubling them, and [weirdly] also quotes cells which
have only a comma [maybe locale-dependent] in them. It quite happily
opens such files without data loss. You should be able to make such
files easily with Python.

Here's a dump of such a file created by Excel 2002 on Windows -- pls
pardon any Cyrillic spelling errors :)
u'\u041c\u0430\u0440\u0430\u0441\u0430\u043d\u043e\u0432\u0430\t\u041a\u0441\u0435\u043d\u044f\r\n"comma, comma, comma"\t\r\n"quote ""Hello UniWorld"" unquote"\t\r\n"tab\ttab"\t\r\n'
Марасанова	Ксеня

"comma, comma, comma"

"quote ""Hello UniWorld"" unquote"

"tab tab"

To make such a file, you would need a quoter function something like
this; you would apply it to each field:

def quoter(s, quote=u'"', alt_delim=u','):
    if quote in s:
        return quote + s.replace(quote, quote + quote) + quote
    if alt_delim in s or u'\t' in s:
        return quote + s + quote
    return s

Then you would do u'\t'.join(fields), add on u'\r\n' [or whatever is
needed in your environment], .encode('utf16'), and .write() to your 'wb'
file.
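Putting the whole recipe together, a sketch in today's Python 3 terms (the quoter defaults, the file name, and the sample rows are all illustrative; 'utf-16' produces the BOM-prefixed form Excel expects):

```python
def quoter(s, quote='"', alt_delim=','):
    # Double embedded quotes and wrap the field in quotes
    if quote in s:
        return quote + s.replace(quote, quote + quote) + quote
    # Also quote fields containing the alternative delimiter or a tab
    if alt_delim in s or '\t' in s:
        return quote + s + quote
    return s

# Made-up sample rows exercising each quoting rule
rows = [['First Name', 'Last Name'],
        ['comma, comma', 'quote "Hi" unquote']]
lines = ['\t'.join(quoter(field) for field in row) + '\r\n' for row in rows]
data = ''.join(lines).encode('utf-16')  # BOM-prefixed UTF-16

with open('junk.txt', 'wb') as f:
    f.write(data)
```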
Now back to the Excel story, I also think that there is something
wrong with pyExcelWriter or the way I use it. CSV file generation was
okay, so I think there is nothing wrong with my data, or XML parser.

I will resume in a few days with pyExcelWriter and will post the
results here, but anyway, many thanks for your time and explanation!

I've been reading the source and looking at the Excel file specs
[available from openoffice.org if you're very short of reading
material!]. Apparently pyXLWriter doesn't handle Unicode at all.
Although Unicode came in with Excel 97 (the BIFF8 file format),
pyXLWriter appears to support only Excel 5(?) (the BIFF5 file format). As
Serge suggested, appeal to the porter to appeal to the author of the
Perl module it's ported from; but don't hold your breath in the
meantime.

Cheers,
John
 
