1. The OP requested Excel files, by construction those must be
generated with a licensed copy of Excel. I did the actual processing
on both linux and windoze platforms.
My point was that however the original XLS files were created or
acquired, the first step in your solution involves converting the XLS
file to "XML Spreadsheet" format, which requires a copy of Excel on a
Windows box. Many people start with an XLS file, no Excel and no
Windows box, no COM, and users who can't be relied on to open a file
and save it in the right format with the right name and extension.
BTW, did you consider opening the XLS files with OpenOffice.org's Calc
and saving it in their default ods format (chunks of XML in a zip
file)?
3. The largest file I used was about 228M,
Is that the XLS file or the XML file?
containing 36,393 hotel
properties from Commission Junction. Each entry had 113 cells. The
parsing overhead was minimal (on a per-entry basis) -- that's why I
choose to use a pull-parser.
4. I extracted primarily cell contents, though I did some very limited
format handling (looking for non-text fields and such).
I don't understand "looking for non-text fields" as "format handling".
To my way of thinking, knowing the data-type that Excel has assigned
to a cell is close to essential for effective use of the contents. And
you don't have to look very far: the ss:Type attribute tells you
whether a cell's content is String, Number, DateTime, Boolean, or
Error.
I looked, briefly, at xlrd. I found and scanned a few alternatives,
though I don't remember what the others were. I needed something I
could incorporate into my own application framework, and I knew I
didn't need most of the formatting information. I'm not in any way
criticizing xlrd, it's simply that, based on its API summary, it seems
focused on problems I didn't have to solve.
Not so focused at all. The default behaviour of xlrd is to ignore
formatting info as much as possible. AFAICT this is the mode used by
most users.
I knew that I needed only
a small subset of the xlrd behavior, and I concluded (perhaps
incorrectly) that it would be easier to roll my own parser than find,
extract, and then port (to my own framework) the corresponding parts
of xlrd.
Possibly incorrectly. If approached at the time, I would have said:
(a) if desperate to DIY:
(a1) ignore any code for old Excel versions (self.biff_version < 80)
(a2) ignore any code for extracting formatting info
(self.formatting_info)
(a3) find the Sheet.put_cell* methods in sheet.py e.g.
def put_cell(self, rowx, colx, ctype, value, xf_index):
ignore the xf_index arg and subvert them to your own needs instead of
filling up a big rectangular arena with data
(b) if not really so desperate, talk to me about implementing an
option in xlrd where callers can specify a callback to be used instead
of the Sheet.put_cell* methods
(c) What is all this "port to my own framework" caper anyway? If you
need to extract data from a database, do you rummage in their code
libraries and port the relevant bits to your own framework?
I needed to extract the content of each row, cell by cell, and build
data objects (in my framework) with the content of various cells. I
also needed to build an "exception file" containing malformed entries
that I could re-open with Excel after my code finished, so that the
bogus entries could be manually corrected. What I mean by "malformed"
entry is, for example, an address field that fails to correctly
geocode or comment fields with confused utf8/unicode contents.
Building objects in your own framework and checking data integrity is
something that happens *after* you've got the basics for a cell (row
id, column id, data type, data value). You have dodgy postal
addresses? You're not alone, and it's not relevant to how you parse
the spreadsheet or even whether the data source was a spreadsheet or a
database query or a box of punched cards.
My
focus was on data content, as opposed to presentation. I needed to
crack the cells into things like "string", "boolean", "float", and so
on.
"Crack"? It's not the Enigma code. It's not even rot13. Spreadsheet
XML tells you the type (String, Boolean, ...). xlrd tells you the type
(XL_CELL_TEXT, XL_CELL_BOOLEAN, ...).
Most importantly, I needed to do this one entry at a time -- I did
*not* want to load the entire spreadsheet at once.
I'm not saying that this couldn't be done with xlrd; only that I chose
to roll my own and had minimal difficulty doing so.
I hope this helps!
I think so. Thanks. You didn't directly address the "steep learning
curve" question, but you explained enough of where you came from.
It's a matter of "horses for courses". I guess some people might
regard (xml.dom, xml.dom.minidom, DIY "pullparser") as having a
slightly non-horizontal learning curve
Cheers,
John