> I have an excel file. When I select cells, copy from excel, and then use
> win32clipboard to get the contents of the clipboard, I have a 131071
> character string.
How exactly are you using win32clipboard, and what exact result are you
getting? Start with a smaller selection of cells, say, two cells, so the
amount of data is manageable.
I doubt very much you are getting 131071 *characters*. Perhaps you are
getting that many *bytes*. Or perhaps you are getting HTML-formatted
text. Without seeing what is in the clipboard, who can tell? Remember
that the clipboard can contain multiple versions of the same data.
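If you want to see everything that is on the clipboard, something like
this should work (an untested sketch, assuming you have the pywin32
package, which is what provides win32clipboard):

import pywintypes
import win32clipboard

win32clipboard.OpenClipboard()
try:
    fmt = 0
    while True:
        # EnumClipboardFormats(0) returns the first available format;
        # passing the previous format returns the next one, and a result
        # of 0 means there are no more formats.
        fmt = win32clipboard.EnumClipboardFormats(fmt)
        if fmt == 0:
            break
        try:
            # Only registered (application-defined) formats have names;
            # standard formats like CF_UNICODETEXT raise an error here.
            name = win32clipboard.GetClipboardFormatName(fmt)
        except pywintypes.error:
            name = '(standard format)'
        print(fmt, name)
finally:
    win32clipboard.CloseClipboard()

That should at least tell you whether you are looking at plain text
(CF_UNICODETEXT), HTML, or something else entirely.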
> When I save the file as a text file, and use the python 3.3 open command
> to read its contents, I only have 80684 characters.
How exactly are you using the open function? The result you get may
differ drastically depending on what you do.
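For instance (a made-up demonstration, since you haven't shown your
code), the same file can report very different sizes depending on the
mode and encoding you open it with:

py> with open('example.txt', 'w', encoding='utf-16') as f:
.... f.write('hi')
....
2
py> with open('example.txt', 'rb') as f:  # binary mode: you get bytes
.... len(f.read())
....
6
py> with open('example.txt', 'r', encoding='utf-16') as f:  # text mode
.... len(f.read())
....
2

Two characters, but six bytes: a two-byte byte order mark plus two bytes
per character.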
> Excel (and other programs too) appends a bunch of b'\x00' (or similar)
> characters.
Append to what? The .xls file? The text file? What tool are you using to
see this?
> Is there a pythonic way to strip these out?
It's hard to tell from the limited description, but my guess is that you
are misinterpreting what you are seeing. Part of the evidence for this is
that you are conflating bytes and characters, e.g. above where you refer
to the NUL *byte* b'\x00' as a character. Bytes are not characters. Scrub
that out of your brain. They never were, not even back in the old ASCII
days (may they soon be forgotten), despite what many people think.
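To make the distinction concrete: the character '\x00' and the byte
b'\x00' are not even equal in Python 3.

py> '\x00' == b'\x00'
False
py> 'h'.encode('utf-16be')  # one character becomes two bytes
b'\x00h'
py> b'\x00h'.decode('utf-16be')  # and two bytes become one character
'h'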
My guess is that when you save the spreadsheet as text, Excel (either by
default, or because you have ticked a checkbox) saves it using the UTF-16
encoding. That's not a bad thing (ideally it would use UTF-8, but UTF-16
is a reasonable choice).
But when you open the file in Python, Python defaults to UTF-8 (or
whatever your platform's preferred encoding is), which means you get a
bunch of extraneous NUL characters when opening the file in text mode, or
b'\x00' null bytes in binary mode. For example:
py> with open('/tmp/rubbish', 'w', encoding='utf-16be') as f:
.... f.write('hello world blah blah blah\n')
....
27
py> with open('/tmp/rubbish', 'r') as f: # default encoding is UTF-8
.... f.read()
....
'\x00h\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00 \x00b\x00l
\x00a\x00h\x00 \x00b\x00l\x00a\x00h\x00 \x00b\x00l\x00a\x00h\x00\n'
If you look carefully, you will see that every character appears to be
preceded by the NUL control character, \x00. But that's because you've
used the wrong encoding to decode the bytes in the file back to
characters. The right way to do this is:
py> with open('/tmp/rubbish', 'r', encoding='utf-16be') as f:
.... f.read()
....
'hello world blah blah blah\n'
Which encoding should you use? Unfortunately, there is no clean way for
text files to record which encoding to use inside the file itself, so
it's often impossible to know for sure. This is why everyone should move
towards using UTF-8 everywhere. But I digress.
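The closest thing to an exception is the byte order mark (BOM) that some
encoders write at the start of a file. Sniffing for it is only a
heuristic, since a file without a BOM proves nothing, but a rough sketch
might look like this:

import codecs

def sniff_encoding(filename, default='utf-8'):
    # Read the first few bytes and look for a known byte order mark.
    # No BOM proves nothing; we just fall back to the default.
    with open(filename, 'rb') as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'  # strips the BOM when decoding
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'  # this codec detects and consumes the BOM
    return default

Then open(filename, 'r', encoding=sniff_encoding(filename)) at least gets
the BOM cases right.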
The fact that you show the NULs as *bytes* b'\x00' rather than characters
'\x00' suggests that you might be reading the file in binary mode.
Presumably you did this because you got an error when trying to read it
in text mode. If you got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0:
invalid start byte
then you should go back to text mode, but use 'utf-16' as the codec:
open(filename, 'r', encoding='utf-16')
Otherwise, if the NULs appear before the letters, as in my example, use
'utf-16be'. If they appear after the letters, use 'utf-16le'. The
existence of three oh-so-very-slightly different versions of UTF-16 is
why it is a sub-optimal encoding, and it is oh-so-bloody-typical that
Microsoft screwed things up for everyone by picking it as their default.
If they had sensibly used UTF-8, you wouldn't be having this problem.
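You can see the two byte orders side by side:

py> b'\x00h\x00i'.decode('utf-16be')  # NULs before the letters
'hi'
py> b'h\x00i\x00'.decode('utf-16le')  # NULs after the letters
'hi'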
Of course, it is possible I've misdiagnosed your problem. I've had to
guess a lot because you didn't show us what you actually did to get the
results you say you got.