xlrd and cPickle.dump


patrick.waldo

Hi all,

Sorry for the repeat; I needed to reform my question and had some
problems... silly me.

The xlrd documentation says:
"Pickleable. Default is true. In Python 2.4 or earlier, setting to
false will cause use of array.array objects which save some memory but
can't be pickled. In Python 2.5, array.arrays are used
unconditionally. Note: if you have large files that you need to read
multiple times, it can be much faster to cPickle.dump() the xlrd.Book
object once, and use cPickle.load() multiple times."

I'm using Python 2.4 and I have an extremely large excel file that I
need to work with. The documentation leads me to believe that cPickle
will be a more efficient option, but I am having trouble pickling the
excel file. So far, I have this:

import cPickle,xlrd
import pyExcelerator
from pyExcelerator import *

data_path = """C:\test.xls"""
pickle_path = """C:\pickle.xls"""

book = xlrd.open_workbook(data_path)
Data_sheet = book.sheet_by_index(0)

wb=pyExcelerator.Workbook()
proc = wb.add_sheet("proc")

#Neither of these work
#1) pyExcelerator try
#cPickle.dump(book,wb.save(pickle_path))
#2) Normal pickle try
#pickle_file = open(pickle_path, 'w')
#cPickle.dump(book, pickle_file)
#file.close()

Any ideas would be helpful. Otherwise, I won't pickle the excel file
and deal with the lag time.

Patrick
 

John Machin

Hi all,

Sorry for the repeat; I needed to reform my question and had some
problems... silly me.

Indeed. Is omitting the traceback part of the "reformation"?
The xlrd documentation says:
"Pickleable. Default is true. In Python 2.4 or earlier, setting to
false will cause use of array.array objects which save some memory but
can't be pickled. In Python 2.5, array.arrays are used
unconditionally. Note: if you have large files that you need to read
multiple times, it can be much faster to cPickle.dump() the xlrd.Book
object once, and use cPickle.load() multiple times."

I'm using Python 2.4 and I have an extremely large excel file that I
need to work with.

How many megabytes is "extremely large"? How many seconds does it take
to open it with xlrd.open_workbook?
The documentation leads me to believe that cPickle
will be a more efficient option, but I am having trouble pickling the
excel file. So far, I have this:

import cPickle,xlrd
import pyExcelerator
from pyExcelerator import *

You only need one of the above imports at the best of times, and for
what you are attempting to do, you don't need pyExcelerator at all.
data_path = """C:\test.xls"""

It is extremely unlikely that you have a file whose basename begins with
a TAB ('\t') character. Please post the code that you actually ran.
pickle_path = """C:\pickle.xls"""

book = xlrd.open_workbook(data_path)
Data_sheet = book.sheet_by_index(0)

wb=pyExcelerator.Workbook()
proc = wb.add_sheet("proc")

#Neither of these work
#1) pyExcelerator try
#cPickle.dump(book,wb.save(pickle_path))
#2) Normal pickle try
#pickle_file = open(pickle_path, 'w')
#cPickle.dump(book, pickle_file)
#file.close()

and the last bit of the pre-reformation traceback was
"""
File "C:\Python24\lib\copy_reg.py", line 69, in _reduce_ex
raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle file objects
"""

I can reproduce that behaviour with Python 2.2, also with 2.1 (different
error message, same meaning). However it works OK with Python 2.3.5,
2.4.3, and 2.5.2. Precisely which version of Python 2.4 are you using?
Are you in the habit of copying library files like copy_reg.py from one
version to another?

The second argument of cPickle.dump is an open file.
wb.save(pickle_path) will write an empty/default spreadsheet file to the
given path (this is utterly pointless) and then return None. So once you
get over the first problem, you will have the second: None is not an
open file. The whole pyExcelerator carry-on is quite irrelevant to your
problem.

Please post the minimal pyExcelerator-free script that demonstrates your
problem. Ensure that it includes the following line:
import sys; print sys.version; print xlrd.__VERSION__
Also post the output and the traceback (in full).

Cheers,
John
 

patrick.waldo

How many megabytes is "extremely large"? How many seconds does it take
to open it with xlrd.open_workbook?

The document is 15 MB and has 50,000+ rows (for test purposes I will use a
smaller sample), but my computer hangs (ie it takes a long time) when
I try to do simple manipulations and the documentation leads me to
believe cPickle will be more efficient. If this is not true, then I
don't have a problem (ie I just have to wait), but I still would like
to figure out how to pickle an xlrd object anyways.
You only need one of the above imports at the best of times, and for
what you are attempting to do, you don't need pyExcelerator at all.

Using pyExcelerator was a guess, because the traditional way didn't
work and I thought it may be because it's an Excel file. Secondly, I
import it twice because sometimes, and I don't know why, PythonWin
does not import pyExcelerator the first time. This has only been true
with pyExcelerator.
It is extremely unlikely that you have a file whose basename begins with
a TAB ('\t') character. Please post the code that you actually ran.

you're right, I had just quickly erased my documents and settings
folder to make it smaller for an example.
Please post the minimal pyExcelerator-free script that demonstrates your
problem. Ensure that it includes the following line:
import sys; print sys.version; print xlrd.__VERSION__
Also post the output and the traceback (in full).

As to copy_reg.py, I downloaded Activestate Python 2.4 and that was
it, so I have had no other version on my computer.

Here's the code:

import cPickle,xlrd, sys

print sys.version
print xlrd.__VERSION__

data_path = """C:\\test\\test.xls"""
pickle_path = """C:\\test\\pickle.pickle"""

book = xlrd.open_workbook(data_path)
Data_sheet = book.sheet_by_index(0)

pickle_file = open(pickle_path, 'w')
cPickle.dump(book, pickle_file)
pickle_file.close()

Here's the output:

2.4.3 (#69, Apr 11 2006, 15:32:42) [MSC v.1310 32 bit (Intel)]
0.6.1
Traceback (most recent call last):
File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework
\scriptutils.py", line 310, in RunScript
exec codeObject in __main__.__dict__
File "C:\text analysis\pickle_test2.py", line 13, in ?
cPickle.dump(book, pickle_file)
File "C:\Python24\lib\copy_reg.py", line 69, in _reduce_ex
raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle module objects


Thanks for the advice!
 

John Machin

The document is 15 MB and has 50,000+ rows (for test purposes I will use a
smaller sample),

15 MB is not large. E.g. 120 MB is large.
but my computer hangs (ie it takes a long time) when
I try to do simple manipulations

What do you describe as "simple manipulations"? Please describe your
computer, including how much memory it has.
and the documentation leads me to
believe cPickle will be more efficient.

The doc says that *IF* you need to read the XLS file multiple times,
you can use xlrd.open_workbook() once and save a pickled copy. Then
you can cPickle.load() from the pickled file multiple times;
typically this is about 10 times as fast.

BUT this is just a faster way of getting an xlrd.Book object ... not
relevant to any subsequent "simple manipulations". You should not be
manipulating an xlrd.Book object; treat it as read-only.
If this is not true, then I
don't have a problem (ie I just have to wait), but I still would like
to figure out how to pickle an xlrd object anyways.


Using pyExcelerator was a guess, because the traditional way didn't
work and I thought it may be because it's an Excel file.

Don't guess. If the docs say it wants a file to write to, don't give
it None. Especially don't use a function/method call with silly side
effects.
Secondly, I
import it twice because sometimes, and I don't know why, PythonWin
does not import pyExcelerator the first time. This has only been true
with pyExcelerator.

This seems exceedingly unlikely to me. Try asking on the pywin32
mailing list.
you're right, I had just quickly erased my documents and settings
folder to make it smaller for an example.

I'm far too astonished by that (and your habit of putting any old
bunch of files in your root directory) to make any coherent comment.
As to copy_reg.py, I downloaded Activestate Python 2.4 and that was
it, so I have had no other version on my computer.

Here's the code:

import cPickle,xlrd, sys

print sys.version
print xlrd.__VERSION__

data_path = """C:\\test\\test.xls"""
pickle_path = """C:\\test\\pickle.pickle"""

book = xlrd.open_workbook(data_path)
Data_sheet = book.sheet_by_index(0)

pickle_file = open(pickle_path, 'w')

Use 'wb' ...
cPickle.dump(book, pickle_file)

... use cPickle.dump(book, pickle_file, -1) # latest protocol

*I WAS WRONG* (happens sometimes) when I said I couldn't reproduce
your problem with Python 2.4; my test/demo/timing code was using -1 by
default.

Note: as well as not complaining about contained file objects,
protocol -1 can be 10 times the speed of protocol 0 and produce a
pickle file half the size.
pickle_file.close()

Here's the output:

2.4.3 (#69, Apr 11 2006, 15:32:42) [MSC v.1310 32 bit (Intel)]
0.6.1
Traceback (most recent call last):
File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework
\scriptutils.py", line 310, in RunScript
exec codeObject in __main__.__dict__
File "C:\text analysis\pickle_test2.py", line 13, in ?
cPickle.dump(book, pickle_file)
File "C:\Python24\lib\copy_reg.py", line 69, in _reduce_ex
raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle module objects

Thanks for the advice!

Here's some more: try to *understand* the "hanging" / "simple
manipulations" problem before you start implementing solutions.
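For concreteness, the dump-once / load-many pattern under discussion can be sketched as below. The stand-in book object and file name are hypothetical (the real code would pickle the result of xlrd.open_workbook()); the module is spelled pickle here, which on Python 2.x is cPickle:

```python
import os
import pickle  # on Python 2.x: import cPickle as pickle
import tempfile

# Hypothetical stand-in for an xlrd.Book object; in the real script this
# would be: book = xlrd.open_workbook(data_path)
book = {"proc": [("book", "non-fiction", "biography")]}

pickle_path = os.path.join(tempfile.gettempdir(), "book.pickle")

# Dump ONCE, in binary mode ('wb'), with the highest protocol (-1).
with open(pickle_path, "wb") as f:
    pickle.dump(book, f, -1)

# Load MANY times; each load skips re-parsing the XLS file entirely.
with open(pickle_path, "rb") as f:
    book_again = pickle.load(f)

assert book_again == book
```

Protocol -1 (the latest binary protocol) is what makes the speed and size difference mentioned above; protocol 0 is the slow, text-based default.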

Also, any good reason for sticking with Python 2.4?

HTH,
John
 

patrick.waldo

Still no luck:

Traceback (most recent call last):
File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework
\scriptutils.py", line 310, in RunScript
exec codeObject in __main__.__dict__
File "C:\text analysis\pickle_test2.py", line 13, in ?
cPickle.dump(Data_sheet, pickle_file, -1)
PicklingError: Can't pickle <type 'module'>: attribute lookup
__builtin__.module failed

My code remains the same, except I added 'wb' and the -1 following
your suggestions:

import cPickle,xlrd, sys

print sys.version
print xlrd.__VERSION__

data_path = """C:\\test\\test.xls"""
pickle_path = """C:\\test\\pickle.pickle"""

book = xlrd.open_workbook(data_path)
Data_sheet = book.sheet_by_index(0)

pickle_file = open(pickle_path, 'wb')
cPickle.dump(Data_sheet, pickle_file, -1)
pickle_file.close()

To begin with (I forgot to mention this before) I get this error:
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero

I'm not sure what this means.
What do you describe as "simple manipulations"? Please describe your
computer, including how much memory it has.

I have a 1.8 GHz HP dv6000 with 2 GB of RAM, which should be speedy
enough for my programming projects. However, when I try to print out
the rows in the excel file, my computer gets very slow and choppy,
which makes experimenting slow and frustrating. Maybe cPickle won't
solve this problem at all! For this first part, I am trying to make
ID numbers for the different permutations of categories, topics, and
sub_topics. So I will have [book,non-fiction,biography],
[book,non-fiction,history-general], [book,fiction,literature], etc.
so I want the combination of
[book,non-fiction,biography] = 1
[book,non-fiction,history-general] = 2
[book,fiction,literature] = 3
etc...

My code does this, except sort returns None, which is strange. I just
want an alphabetical sort of the first option, which sort should do
automatically. When I do a test like
>>> nest_list = [['bbc', 'cds'], ['jim', 'ex'], ['abc', 'sd']]
>>> nest_list.sort()
>>> nest_list
[['abc', 'sd'], ['bbc', 'cds'], ['jim', 'ex']]
It works fine, but not for my rows.

Here's the code (unpickled/unsorted):
import xlrd, pyExcelerator

path_file = "C:\\text_analysis\\test.xls"
book = xlrd.open_workbook(path_file)
ProcFT_QC = book.sheet_by_index(0)
log_path = "C:\\text_analysis\\ID_Log.log"
logfile = open(log_path,'wb')

set_rows = []
rows = []
db = {}
n=0
while n < ProcFT_QC.nrows:
    rows.append(ProcFT_QC.row_values(n, 6, 9))
    n += 1
print rows.sort() #Outputs None
ID = 1
for row in rows:
    if row not in set_rows:
        set_rows.append(row)
        db[ID] = row
        entry = str(ID) + '|' + str(row).strip('u[]') + '\r\n'
        logfile.write(entry)
        ID += 1
logfile.close()
Also, any good reason for sticking with Python 2.4?

Trying to learn Zope/Plone too, so I'm sticking with Python 2.4.


Thanks again
 

patrick.waldo

FWIW, it works here on 2.5.1 without errors or warnings. Output is:
2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
0.6.1

I guess it's a version issue then...

I forgot about sorted! Yes, that would make sense!

Thanks for the input.


 

John Machin

FWIW, it works here on 2.5.1 without errors or warnings. Output is:
2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
0.6.1

I guess it's a version issue then...

I say again: Don't guess.
I forgot about sorted! Yes, that would make sense!

Thanks for the input.

I didn't notice that the exception had changed from the original:
"TypeError: can't pickle file objects" (with protocol=0)
to:
"TypeError: can't pickle module objects" (pickling an xlrd.Book
object with protocol=-1)
and now to:
"PicklingError: Can't pickle <type 'module'>: attribute lookup
__builtin__.module failed" (pickling an xlrd.Sheet object with protocol -1)

I'm wondering if this is some unfortunate side effect of running the
script in the pywin IDE ("exec codeObject in __main__.__dict__"). Can
you reproduce the problem by running the script in the Command Prompt
window? What version of pywin32 are you using?


"WARNING" != "error". If that's the only message you get, ignore it; it
means that your XLS file was created by the perl XLS-writing package or
a copier thereof.

Just printing the rows is VERY UNLIKELY to cause this. Demonstrate this
to yourself by using xlrd's supplied runxlrd script:

command_prompt> c:\python24\scripts\runxlrd.py show yourfile.xls


Maybe cPickle won't solve this problem at all!

99.9% chance, not "maybe".
For this first part, I am trying to make
ID numbers for the different permutations of categories, topics, and
sub_topics. So I will have [book,non-fiction,biography],
[book,non-fiction,history-general], [book,fiction,literature], etc.
so I want the combination of
[book,non-fiction,biography] = 1
[book,non-fiction,history-general] = 2
[book,fiction,literature] = 3
etc...

My code does this, except sort returns None, which is strange.

list.sort() returns None by definition; it sorts the list object's
contents in situ.
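To make the in-situ behaviour concrete (standard Python semantics, using the poster's own test data):

```python
nest_list = [['bbc', 'cds'], ['jim', 'ex'], ['abc', 'sd']]

result = nest_list.sort()    # sorts the list in place ...
assert result is None        # ... and returns None, by design
assert nest_list == [['abc', 'sd'], ['bbc', 'cds'], ['jim', 'ex']]

# sorted() (available since Python 2.4) returns a NEW sorted list instead
original = [['jim', 'ex'], ['abc', 'sd']]
assert sorted(original) == [['abc', 'sd'], ['jim', 'ex']]
assert original == [['jim', 'ex'], ['abc', 'sd']]  # original is unchanged
```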
I just
want an alphabetical sort of the first option, which sort should do
automatically. When I do a test like>>>nest_list = [['bbc', 'cds'], ['jim', 'ex'],['abc', 'sd']]
nest_list.sort()
[['abc', 'sd'], ['bbc', 'cds'], ['jim', 'ex']]
It works fine, but not for my rows.

Why are you sorting?
Here's the code (unpickled/unsorted):
import xlrd, pyExcelerator

path_file = "C:\\text_analysis\\test.xls"
book = xlrd.open_workbook(path_file)
ProcFT_QC = book.sheet_by_index(0)
log_path = "C:\\text_analysis\\ID_Log.log"
logfile = open(log_path,'wb')

set_rows = []

The test x in y where y is a sequence needs to compare with half of the
existing items on average. You are doing that test N times. If the
number of unique rows is U, it will do about N*U/4 comparisons. You
said N is about 50,000.

The changes below make y a set; consequentially x needs to be a tuple
instead of a list.

set_rows = set()
rows = []
db = {}
n = 0
while n < ProcFT_QC.nrows:
    rows.append(ProcFT_QC.row_values(n, 6, 9))

    rows.append(tuple(ProcFT_QC.row_values(n, 6, 9)))

    n += 1
print rows.sort() #Outputs None
ID = 1
for row in rows:
    if row not in set_rows:
        set_rows.append(row)
        set_rows.add(row)
        db[ID] = row
        entry = str(ID) + '|' + str(row).strip('u[]') + '\r\n'

Presuming your data is actually ASCII, you could save time and memory by
converting it once as you extract it from the spreadsheet.

entry = str(ID) + '|' + str(row).strip('u()') + '\r\n'
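Pulling the tuple-and-set changes together, here is a minimal self-contained sketch of the dedup-and-number loop. The raw_rows list is a hypothetical stand-in for ProcFT_QC.row_values(n, 6, 9), since the original spreadsheet isn't available:

```python
# Hypothetical rows standing in for ProcFT_QC.row_values(n, 6, 9)
raw_rows = [
    ["book", "non-fiction", "biography"],
    ["book", "fiction", "literature"],
    ["book", "non-fiction", "biography"],       # duplicate
    ["book", "non-fiction", "history-general"],
]

# Tuples are hashable, so they can be set members; lists cannot.
rows = [tuple(r) for r in raw_rows]
rows.sort()  # sorts in place; the None return value is not used

seen = set()
db = {}
entries = []
ID = 1
for row in rows:
    if row not in seen:      # O(1) set lookup, not a scan of a list
        seen.add(row)
        db[ID] = row
        entries.append(str(ID) + '|' + str(row).strip('u()'))
        ID += 1

assert len(db) == 3          # three unique category combinations
```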

HTH,
John
 
