Stripping characters from windows clipboard with win32clipboard from excel

S

stephen.boulet

I have an excel file. When I select cells, copy from excel, and then use win32clipboard to get the contents of the clipboard, I have a 131071 character string.

When I save the file as a text file, and use the python 3.3 open command to read its contents, I only have 80684 characters.

Excel (and other programs too) appends a bunch of b'\x00' (or similar) characters.

Is there a pythonic way to strip these out?
 
S

Steven D'Aprano

I have an excel file. When I select cells, copy from excel, and then use
win32clipboard to get the contents of the clipboard, I have a 131071
character string.

How exactly are you using win32clipboard, and what exact result are you
getting? Start with a smaller selection of cells, say, two cells, so the
amount of data is manageable.

I doubt very much you are getting 131071 *characters*. Perhaps you are
getting that many *bytes*. Or perhaps you are getting HTML-formatted
text. Without seeing what is in the clipboard, who can tell? Remember
that the clipboard can contain multiple versions of the same data.

When I save the file as a text file, and use the python 3.3 open command
to read its contents, I only have 80684 characters.

How exactly are you using the open function? The result you get may
differ drastically depending on what you do.

Excel (and other programs too) appends a bunch of b'\x00' (or similar)
characters.

Append to what? The .xls file? The text file? What tool are you using to
see this?


Is there a pythonic way to strip these out?

It's hard to tell from the limited description, but my guess is that you
are misinterpreting what you are seeing. Part of the evidence for this is
that you are conflating bytes and characters, e.g. above where you refer
to the NUL *byte* b'\x00' as a character. Bytes are not characters. Scrub
that out of your brain. They never were, not even back in the old ASCII
days (may they soon be forgotten), despite what many people think.

My guess is that when you save the spreadsheet as text, either Excel by
default, or possibly because you have ticked a checkbox, have set it to
save using the UTF-16 encoding. That's a good thing (not ideal, ideally
they ought to use UTF-8, but UTF-16 is not a bad choice).

But when you open the file in Python, Python defaults to UTF-8, which
means you get an bunch of extraneous NULs when opening the file in text
mode, or b'\x00' null bytes in binary mode. For example:


py> with open('/tmp/rubbish', 'w', encoding='utf-16be') as f:
.... f.write('hello world blah blah blah\n')
....
27
py> with open('/tmp/rubbish', 'r') as f: # default encoding is UTF-8
.... f.read()
....
'\x00h\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00 \x00b\x00l
\x00a\x00h\x00 \x00b\x00l\x00a\x00h\x00 \x00b\x00l\x00a\x00h\x00\n'

If you look carefully, you will see that every character appears to be
preceded by the NUL control character, \x00. But that's because you've
used the wrong encoding to decode the bytes in the file back to
characters. The right way to do this is:

py> with open('/tmp/rubbish', 'r', encoding='utf-16be') as f:
.... f.read()
....
'hello world blah blah blah\n'


Which encoding should you use? Unfortunately, there is no clean way for
text files to record which encoding to use inside the file itself, so
it's often impossible to know for sure. This is why everyone should move
towards using UTF-8 everywhere. But I digress.

The fact that you show the NULs as *bytes* b'\x00' rather than characters
'\x00' suggests that you might be reading the file in binary mode.
Presumably you did this because you got an error when trying to read it
in text mode. If you got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0:
invalid start byte

then you should go back to text mode, but use 'utf-16' as the codec:

open(filename, 'w', encoding='utf-16')


Otherwise, if the NULs appear before the letters, as in my example, use
'utf-16be'. If they appear after the letters, use 'utf-16le'. The
existence of three oh-so-very-slightly different versions of UTF-16 is
why it is a sub-optimal encoding, and it is oh-so-bloody-typical that
Microsoft screwed it up for everyone by picking it as their default
implementation. If they had sensibly used UTF-8, you wouldn't be having
this problem.


Of course, it is possible I've misdiagnosed your problem. I've had to
guess a lot because you didn't show us what you actually did to get the
results you say you got.
 
S

stephen.boulet

Hi Steven. Here is my code:

import win32clipboard, win32con

def getclipboard():
win32clipboard.OpenClipboard()
s = win32clipboard.GetClipboardData(win32con.CF_TEXT)
win32clipboard.CloseClipboard()
return s

I use this helper function to grab the text on the clipboard and do useful things with it.

Sorry about the description; I have stuff to learn about strings and python 3.3.

To get the string length, I just do len(s). There were 10 columns and 700+ rows of data, so len(s) returned 80684 from an excel spreadsheet saved as a text file.

From the clipboard contents copied from the spreadsheet, the characters s[:80684] were the visible cell contents, and s[80684:] all started with "b'\x0" and lack any useful info for what I'm trying to accomplish.
 
M

MRAB

Hi Steven. Here is my code:

import win32clipboard, win32con

def getclipboard():
win32clipboard.OpenClipboard()
s = win32clipboard.GetClipboardData(win32con.CF_TEXT)
win32clipboard.CloseClipboard()
return s

I use this helper function to grab the text on the clipboard and do useful things with it.

Sorry about the description; I have stuff to learn about strings and python 3.3.

To get the string length, I just do len(s). There were 10 columns and 700+ rows of data, so len(s) returned 80684 from an excel spreadsheet saved as a text file.
From the clipboard contents copied from the spreadsheet, the characters s[:80684] were the visible cell contents, and s[80684:] all started with "b'\x0" and lack any useful info for what I'm trying to accomplish.
From my own experiments (although not with Excel) it appears that
GetClipboardData returns bytes (a bytestring), not (Unicode) strings.
 
S

stephen.boulet

Stephen Boulet:


From the clipboard contents copied from the spreadsheet, the characters s[:80684] were the visible cell contents, and s[80684:] all started with "b'\x0" and lack any useful info for what I'm trying to accomplish.



Looks like Excel is rounding up its clipboard allocation to the next

64K. There used to be good reasons for trying to leave some extra room

on the clipboard and avoid reallocating the block but I thought that was

over a long time ago.



To strip NULs off the end of the string use s.rstrip('\0')

Hm, that gives me a "Type str doesn't support the buffer API" message.
 
N

Neil Cerutti

Stephen Boulet:


From the clipboard contents copied from the spreadsheet, the characters s[:80684] were the visible cell contents, and s[80684:] all started with "b'\x0" and lack any useful info for what I'm trying to accomplish.



Looks like Excel is rounding up its clipboard allocation to the next

64K. There used to be good reasons for trying to leave some extra room

on the clipboard and avoid reallocating the block but I thought that was

over a long time ago.



To strip NULs off the end of the string use s.rstrip('\0')

Hm, that gives me a "Type str doesn't support the buffer API"
message.

Type mismatch. Try:

s.rstrip(b"\0")
 
S

stephen.boulet

Stephen Boulet:
From the clipboard contents copied from the spreadsheet, the characters s[:80684] were the visible cell contents, and s[80684:] all started with "b'\x0" and lack any useful info for what I'm trying to accomplish.
Looks like Excel is rounding up its clipboard allocation to the next
64K. There used to be good reasons for trying to leave some extra room
on the clipboard and avoid reallocating the block but I thought that was
over a long time ago.
To strip NULs off the end of the string use s.rstrip('\0')



Hm, that gives me a "Type str doesn't support the buffer API" message.

Aha, I need to use str(s, encoding='utf8').rstrip('\0').
 
R

random832

Aha, I need to use str(s, encoding='utf8').rstrip('\0').

It's not a solution to your problem, but why aren't you using
CF_UNICODETEXT, particularly if you're using python 3? And if you're
not, utf8 is the incorrect encoding, you should be using encoding='mbcs'
to interact with the CF_TEXT clipboard.

Anyway, to match behavior found in other applications when pasting from
the clipboard, I would suggest using:

if s.contains('\0'): s = s[:s.index('\0')]

Which will also remove non-null bytes after the first null (but if the
clipboard contains these, it won't be pasted into e.g. notepad).
 
S

stephen.boulet

I have an excel file. When I select cells, copy from excel, and then use win32clipboard to get the contents of the clipboard, I have a 131071 character string.



When I save the file as a text file, and use the python 3.3 open command to read its contents, I only have 80684 characters.



Excel (and other programs too) appends a bunch of b'\x00' (or similar) characters.



Is there a pythonic way to strip these out?

Odd behavior from excel. It seems as though the clipboard contents will vary according to the clipboard size. For example, in my 13 column row I select 2023 rows, copy to the clipboard, and

def getclipboard():
win32clipboard.OpenClipboard()
s = win32clipboard.GetClipboardData(win32con.CF_TEXT)
win32clipboard.CloseClipboard()
return s

s = getclipboard()

len(s) gives me 16298.

If I select one additional row, len(s) balloons to 32767.

In the first case, s[-10:] is b'0.032573\r\n'.

In the second case, s[-10:] is b'\x008\x008\x000\x005\x00\t'

Does anyone know what is going on with that?
 
R

random832

Anyway, to match behavior found in other applications when pasting from
the clipboard, I would suggest using:

if s.contains('\0'): s = s[:s.index('\0')]

Which will also remove non-null bytes after the first null (but if the
clipboard contains these, it won't be pasted into e.g. notepad).

I did some testing and confirmed you MUST do this - if you copy multiple
things from Excel, it may reuse this buffer and not clear the end of it.
 
S

stephen.boulet

Thanks to everyone for their help. Using everyone's suggestions, this seems to work:

import win32clipboard, win32con

def getclipboard():
win32clipboard.OpenClipboard()
s = win32clipboard.GetClipboardData(win32con.CF_UNICODETEXT)
win32clipboard.CloseClipboard()
if '\0' in s:
s = s[:s.index('\0')]
return s
 
M

MRAB

Thanks to everyone for their help. Using everyone's suggestions, this seems to work:

import win32clipboard, win32con

def getclipboard():
win32clipboard.OpenClipboard()
s = win32clipboard.GetClipboardData(win32con.CF_UNICODETEXT)
win32clipboard.CloseClipboard()
if '\0' in s:
s = s[:s.index('\0')]
return s
That'll look for a null character and, if it finds one, then look for
it again. Of course, that probably isn't an issue in practice.

However, an alternative way of doing it is to use the .partition method:

return s.partition('\0')[0]
 
D

Dave Angel

Thanks to everyone for their help. Using everyone's suggestions, this seems to work:

import win32clipboard, win32con

def getclipboard():
win32clipboard.OpenClipboard()
s = win32clipboard.GetClipboardData(win32con.CF_UNICODETEXT)
win32clipboard.CloseClipboard()
if '\0' in s:
s = s[:s.index('\0')]
return s
That'll look for a null character and, if it finds one, then look for
it again. Of course, that probably isn't an issue in practice.

However, an alternative way of doing it is to use the .partition method:

return s.partition('\0')[0]

So is the bug in Excel, in Windows, or in the Python library? Somebody
is falling down on the job; if Windows defines the string as ending at
the first null, then the Python interface should use that when defining
the text defined with CF_UNICODETEXT.

Or maybe it's an example of ill-defined Windows specs.
 
D

Dave Angel

Dave Angel:


Everything is performing correctly. win32clipboard is low-level
direct access to the Win32 clipboard API. A higher level API which is
more easily used from Python could be defined on top of this if anyone
was motivated.

Neil

Clearly you miss the point. If the clipboard API is defined to return a
null-terminated string, then the problem is in the Python library which
doesn't do a strlen() (or the wide-character equivalent; I forget its
name) on the results.

But there's a big if there. Somebody is either ill specified or poorly
implemented here.
 
R

random832

So is the bug in Excel, in Windows, or in the Python library? Somebody
is falling down on the job; if Windows defines the string as ending at
the first null, then the Python interface should use that when defining
the text defined with CF_UNICODETEXT.

Or maybe it's an example of ill-defined Windows specs.

I think it's a matter of ill-defined windows specs - the clipboard data
has an exact size, and in principle that size is always used, for
example, for numerous binary formats. But the text clipboard has often
been treated as a C string by a wide variety of different applications,
so one could argue that has become a de facto standard.
 
N

Neil Cerutti

Clearly you miss the point. If the clipboard API is defined to
return a null-terminated string, then the problem is in the
Python library which doesn't do a strlen() (or the
wide-character equivalent; I forget its name) on the results.

Python can't really know if you're pasting text or a screenshot
or what.
 
D

Dave Angel

Python can't really know if you're pasting text or a screenshot
or what.

CF_UNICODETEXT gives it a pretty good clue.

But random832 has already mentioned that it's an ill-specified Windows
interface; different programs deal with it differently. No way that a
the Python library should deal with that.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,009
Latest member
GidgetGamb

Latest Threads

Top