Parsing Excel spreadsheets

A

andyhume

Hi,

Can anybody recommend an approach for loading and parsing Excel
spreadsheets in Python. Any well known/recommended libraries for this?

The only thing I found in a brief search was http://www.lexicon.net/sjmachin/xlrd.htm,
but I'd rather get some more input before going with something I don't
know.

Thanks,
Andy.
 
R

r

Hi,

Can anybody recommend an approach for loading and parsing Excel
spreadsheets in Python. Any well known/recommended libraries for this?

The only thing I found in a brief search washttp://www.lexicon.net/sjmachin/xlrd.htm,
but I'd rather get some more input before going with something I don't
know.

Thanks,
Andy.

xlrd(read only)
also see pyExcelerator
 
S

Steve Holden

Tino said:
which is now replaced by xlwt :)
I've had very good results with xlrd, having used it to extract data
from spreadsheets for storage into a relational database. It's also
cross-platform, which means you don't need to run under Windows.

I don't know anything at all about xlwt.

regards
Steve
 
M

Mike Driscoll

Hi,

Can anybody recommend an approach for loading and parsing Excel
spreadsheets in Python. Any well known/recommended libraries for this?

The only thing I found in a brief search washttp://www.lexicon.net/sjmachin/xlrd.htm,
but I'd rather get some more input before going with something I don't
know.

Thanks,
Andy.

If you need complete control of Excel, then you'll probably have to
use COM (see Mark Hammond's PyWin32 package). Otherwise, the
suggestions by the others will work.

Mike
 
J

John Machin

If you need complete control of Excel, then you'll probably have to
use COM (see Mark Hammond's PyWin32 package). Otherwise, the
suggestions by the others will work.

For avoidance of doubt, the OP should presume that you mean the *nett*
suggestions by the others i.e. after Tino struck out pyExcelerator.
 
B

brooklineTom

Hi,

Can anybody recommend an approach for loading and parsing Excel
spreadsheets in Python. Any well known/recommended libraries for this?

The only thing I found in a brief search was http://www.lexicon.net/sjmachin/xlrd.htm,
but I'd rather get some more input before going with something I don't
know.

Thanks,
Andy.

I save the spreadsheets (in Excel) in xml format. I started with the
standard xml tools (xml.dom and xml.dom.minidom). I built a
pullparser, and then just crack them. The MS format is tedious and
overly complex (like all MS stuff), but straightforward. Once I've
cracked them into their component parts (headers, rows, cells, etc),
then I walk through them doing whatever I want.

I found this material to be no worse than doing similar crud with
xhtml. I know there are various python packages around that do it, but
I found the learning curve of those packages to be steeper than just
grokking the spreadsheet structure itself.

In spite of all the hair, the underlying MS structure really does have
everything you'll need. My suggestion is just go for it, it isn't all
that hard.
 
J

John Machin

I save the spreadsheets (in Excel) in xml format.

Which means that you need to be on a Windows box with a licensed copy
of Excel. I presume you talking about using Excel 2003 and saving as
"XML Spreadsheet (*.xml)". Do you save the files manually, or using a
COM script? What is the largest xls file that you've saved as xml, how
big was the xml file, and how long did it take to parse the xml file?
Do you extract formatting information or just cell contents?
I started with the
standard xml tools (xml.dom and xml.dom.minidom). I built a
pullparser, and then just crack them. The MS format is tedious and
overly complex (like all MS stuff), but straightforward.

What do you think of the xml spat out by Excel 2007's (default) xlsx
format?
Once I've
cracked them into their component parts (headers, rows, cells, etc),
then I walk through them doing whatever I want.

I found this material to be no worse than doing similar crud with
xhtml. I know there are various python packages around that do it, but
I found the learning curve of those packages to be steeper than just
grokking the spreadsheet structure itself.

I'm curious to know which are the "various python packages" with the
so steep learning curves, and what the steep bits were.

Cheers,
John
 
J

Joshua Moore

Hi,

Can anybody recommend an approach for loading and parsing Excel
spreadsheets in Python. Any well known/recommended libraries for this?

The only thing I found in a brief search washttp://www.lexicon.net/sjmachin/xlrd.htm,
but I'd rather get some more input before going with something I don't
know.

Thanks,
Andy.

Is saving the spreadsheet as a comma separated file an option for you?
If so, it's not hard to throw something together for cvs files, but
there are even modules for that.

Josh
 
B

brooklineTom

Which means that you need to be on a Windows box with a licensed copy
of Excel. I presume you talking about using Excel 2003 and saving as
"XML Spreadsheet (*.xml)". Do you save the files manually, or using a
COM script? What is the largest xls file that you've saved as xml, how
big was the xml file, and how long did it take to parse the xml file?
Do you extract formatting information or just cell contents?

1. The OP requested Excel files, by construction those must be
generated with a licensed copy of Excel. I did the actual processing
on both linux and windoze platforms.
2. Yes, I used Excel 2003. I haven't looked at later versions.
3. The largest file I used was about 228M, containing 36,393 hotel
properties from Commission Junction. Each entry had 113 cells. The
parsing overhead was minimal (on a per-entry basis) -- that's why I
choose to use a pull-parser.
4. I extracted primarily cell contents, though I did some very limited
format handling (looking for non-text fields and such).
What do you think of the xml spat out by Excel 2007's (default) xlsx
format?

I haven't looked at this.
I'm curious to know which are the "various python packages" with the
so steep learning curves, and what the steep bits were.

I looked, briefly, at xlrd. I found and scanned a few alternatives,
though I don't remember what the others were. I needed something I
could incorporate into my own application framework, and I knew I
didn't need most of the formatting information. I'm not in any way
criticizing xlrd, it's simply that, based on its API summary, it seems
focused on problems I didn't have to solve. I knew that I needed only
a small subset of the xlrd behavior, and I concluded (perhaps
incorrectly) that it would be easier to roll my own parser than find,
extract, and then port (to my own framework) the corresponding parts
of xlrd.

I needed to extract the content of each row, cell by cell, and build
data objects (in my framework) with the content of various cells. I
also needed to build an "exception file" containing malformed entries
that I could re-open with Excel after my code finished, so that the
bogus entries could be manually corrected. What I mean by "malformed"
entry is, for example, an address field that fails to correctly
geocode or comment fields with confused utf8/unicode contents. My
focus was on data content, as opposed to presentation. I needed to
crack the cells into things like "string", "boolean", "float", and so
on.

Most importantly, I needed to do this one entry at a time -- I did
*not* want to load the entire spreadsheet at once.

I'm not saying that this couldn't be done with xlrd; only that I chose
to roll my own and had minimal difficulty doing so.

I hope this helps!
 
S

Steve Holden

brooklineTom said:
1. The OP requested Excel files, by construction those must be
generated with a licensed copy of Excel. I did the actual processing
on both linux and windoze platforms.

Well, even if Andy meant "Excel files" rather than "Excel-formatted
files" there are many ways to come by these without having a licensed
copy of Excel. FTP and email attachment come to mind most readily.

How then to convert those to XML without Excel?

[...]> I looked, briefly, at xlrd. I found and scanned a few alternatives,
though I don't remember what the others were. I needed something I
could incorporate into my own application framework, and I knew I
didn't need most of the formatting information. I'm not in any way
criticizing xlrd, it's simply that, based on its API summary, it seems
focused on problems I didn't have to solve. I knew that I needed only
a small subset of the xlrd behavior, and I concluded (perhaps
incorrectly) that it would be easier to roll my own parser than find,
extract, and then port (to my own framework) the corresponding parts
of xlrd.
My own case was similar, in that I only needed the value data. The
approach I took was to install xlrd and use it. Job done.
I needed to extract the content of each row, cell by cell, and build
data objects (in my framework) with the content of various cells. I
also needed to build an "exception file" containing malformed entries
that I could re-open with Excel after my code finished, so that the
bogus entries could be manually corrected. What I mean by "malformed"
entry is, for example, an address field that fails to correctly
geocode or comment fields with confused utf8/unicode contents. My
focus was on data content, as opposed to presentation. I needed to
crack the cells into things like "string", "boolean", "float", and so
on.

Most importantly, I needed to do this one entry at a time -- I did
*not* want to load the entire spreadsheet at once.
My data files weren't that large (IIRC the largest spreadsheet was about
6MB), so I was quite happy to load the whole thing in memory, iterate
over it and then write the results to the database as they were extracted.
I'm not saying that this couldn't be done with xlrd; only that I chose
to roll my own and had minimal difficulty doing so.
Given the constraints of your problem it seems like an intelligent approach.
I hope this helps!

I'm sure it will.

regards
Steve
 
S

Steve Holden

brooklineTom said:
1. The OP requested Excel files, by construction those must be
generated with a licensed copy of Excel. I did the actual processing
on both linux and windoze platforms.

Well, even if Andy meant "Excel files" rather than "Excel-formatted
files" there are many ways to come by these without having a licensed
copy of Excel. FTP and email attachment come to mind most readily.

How then to convert those to XML without Excel?

[...]> I looked, briefly, at xlrd. I found and scanned a few alternatives,
though I don't remember what the others were. I needed something I
could incorporate into my own application framework, and I knew I
didn't need most of the formatting information. I'm not in any way
criticizing xlrd, it's simply that, based on its API summary, it seems
focused on problems I didn't have to solve. I knew that I needed only
a small subset of the xlrd behavior, and I concluded (perhaps
incorrectly) that it would be easier to roll my own parser than find,
extract, and then port (to my own framework) the corresponding parts
of xlrd.
My own case was similar, in that I only needed the value data. The
approach I took was to install xlrd and use it. Job done.
I needed to extract the content of each row, cell by cell, and build
data objects (in my framework) with the content of various cells. I
also needed to build an "exception file" containing malformed entries
that I could re-open with Excel after my code finished, so that the
bogus entries could be manually corrected. What I mean by "malformed"
entry is, for example, an address field that fails to correctly
geocode or comment fields with confused utf8/unicode contents. My
focus was on data content, as opposed to presentation. I needed to
crack the cells into things like "string", "boolean", "float", and so
on.

Most importantly, I needed to do this one entry at a time -- I did
*not* want to load the entire spreadsheet at once.
My data files weren't that large (IIRC the largest spreadsheet was about
6MB), so I was quite happy to load the whole thing in memory, iterate
over it and then write the results to the database as they were extracted.
I'm not saying that this couldn't be done with xlrd; only that I chose
to roll my own and had minimal difficulty doing so.
Given the constraints of your problem it seems like an intelligent approach.
I hope this helps!

I'm sure it will.

regards
Steve
 
J

John Machin

1. The OP requested Excel files, by construction those must be
generated with a licensed copy of Excel. I did the actual processing
on both linux and windoze platforms.

My point was that however the original XLS files were created or
acquired, the first step in your solution involves converting the XLS
file to "XML Spreadsheet" format, which requires a copy of Excel on a
Windows box. Many people start with an XLS file, no Excel and no
Windows box, no COM, and users who can't be relied on to open a file
and save it in the right format with the right name and extension.

BTW, did you consider opening the XLS files with OpenOffice.org's Calc
and saving it in their default ods format (chunks of XML in a zip
file)?

3. The largest file I used was about 228M,

Is that the XLS file or the XML file?
containing 36,393 hotel
properties from Commission Junction. Each entry had 113 cells. The
parsing overhead was minimal (on a per-entry basis) -- that's why I
choose to use a pull-parser.
4. I extracted primarily cell contents, though I did some very limited
format handling (looking for non-text fields and such).

I don't understand "looking for non-text fields" as "format handling".
To my way of thinking, knowing the data-type that Excel has assigned
to a cell is close to essential for effective use of the contents. And
you don't have to look very far: the ss:Type attribute tells you
whether a cell's content is String, Number, DateTime, Boolean, or
Error.
I looked, briefly, at xlrd. I found and scanned a few alternatives,
though I don't remember what the others were. I needed something I
could incorporate into my own application framework, and I knew I
didn't need most of the formatting information. I'm not in any way
criticizing xlrd, it's simply that, based on its API summary, it seems
focused on problems I didn't have to solve.

Not so focused at all. The default behaviour of xlrd is to ignore
formatting info as much as possible. AFAICT this is the mode used by
most users.
I knew that I needed only
a small subset of the xlrd behavior, and I concluded (perhaps
incorrectly) that it would be easier to roll my own parser than find,
extract, and then port (to my own framework) the corresponding parts
of xlrd.

Possibly incorrectly. If approached at the time, I would have said:
(a) if desperate to DIY:
(a1) ignore any code for old Excel versions (self.biff_version < 80)
(a2) ignore any code for extracting formatting info
(self.formatting_info)
(a3) find the Sheet.put_cell* methods in sheet.py e.g.

def put_cell(self, rowx, colx, ctype, value, xf_index):

ignore the xf_index arg and subvert them to your own needs instead of
filling up a big rectangular arena with data

(b) if not really so desperate, talk to me about implementing an
option in xlrd where callers can specify a callback to be used instead
of the Sheet.put_cell* methods

(c) What is all this "port to my own framework" caper anyway? If you
need to extract data from a database, do you rummage in their code
libraries and port the relevant bits to your own framework?
I needed to extract the content of each row, cell by cell, and build
data objects (in my framework) with the content of various cells. I
also needed to build an "exception file" containing malformed entries
that I could re-open with Excel after my code finished, so that the
bogus entries could be manually corrected. What I mean by "malformed"
entry is, for example, an address field that fails to correctly
geocode or comment fields with confused utf8/unicode contents.

Building objects in your own framework and checking data integrity is
something that happens *after* you've got the basics for a cell (row
id, column id, data type, data value). You have dodgy postal
addresses? You're not alone, and it's not relevant to how you parse
the spreadsheet or even whether the data source was a spreadsheet or a
database query or a box of punched cards.
My
focus was on data content, as opposed to presentation. I needed to
crack the cells into things like "string", "boolean", "float", and so
on.

"Crack"? It's not the Enigma code. It's not even rot13. Spreadsheet
XML tells you the type (String, Boolean, ...). xlrd tells you the type
(XL_CELL_TEXT, XL_CELL_BOOLEAN, ...).
Most importantly, I needed to do this one entry at a time -- I did
*not* want to load the entire spreadsheet at once.

I'm not saying that this couldn't be done with xlrd; only that I chose
to roll my own and had minimal difficulty doing so.

I hope this helps!

I think so. Thanks. You didn't directly address the "steep learning
curve" question, but you explained enough of where you came from.

It's a matter of "horses for courses". I guess some people might
regard (xml.dom, xml.dom.minidom, DIY "pullparser") as having a
slightly non-horizontal learning curve :)

Cheers,
John
 
B

brooklineTom

<snip>

My point was that however the original XLS files were created or
acquired, the first step in your solution involves converting the XLS
file to "XML Spreadsheet" format, which requires a copy of Excel on a
Windows box. Many people start with an XLS file, no Excel and no
Windows box, no COM, and users who can't be relied on to open a file
and save it in the right format with the right name and extension.

True enough. I develop on a WinXP box and have Excel. I just used it.
BTW, did you consider opening the XLS files with OpenOffice.org's Calc
and saving it in their default ods format (chunks of XML in a zip
file)?

No. As I said, I have Excel.
Is that the XLS file or the XML file?

That's the xml file size. I just *love* file-bloat, don't you? :)
<snip>


Possibly incorrectly. If approached at the time, I would have said:
(a) if desperate to DIY:
(a1) ignore any code for old Excel versions (self.biff_version < 80)
(a2) ignore any code for extracting formatting info
(self.formatting_info)
(a3) find the Sheet.put_cell* methods in sheet.py e.g.

def put_cell(self, rowx, colx, ctype, value, xf_index):

ignore the xf_index arg and subvert them to your own needs instead of
filling up a big rectangular arena with data

(b) if not really so desperate, talk to me about implementing an
option in xlrd where callers can specify a callback to be used instead
of the Sheet.put_cell* methods

(c) What is all this "port to my own framework" caper anyway? If you
need to extract data from a database, do you rummage in their code
libraries and port the relevant bits to your own framework?

I'm sure xlrd is fine, I wasn't any way suggesting that any
"improvements" are needed. I was just explaining what I did.
"Crack"? It's not the Enigma code. It's not even rot13. Spreadsheet
XML tells you the type (String, Boolean, ...). xlrd tells you the type
(XL_CELL_TEXT, XL_CELL_BOOLEAN, ...).

Sure, once you've sorted through the excess MS-specific stuff, dealt
with the runs of empty cells, and so on.

I needed to write an exception file, containing the original
spreadsheet entries that failed, so that our curators could find and
fix (or delete) them. Thus, I already had to do row-by-row handling of
the input, I had to catch and handle exceptions, I had to know and
remember enough about the original SS format/layout to replicate it in
the exception file, and so on.
I think so. Thanks. You didn't directly address the "steep learning
curve" question, but you explained enough of where you came from.

It's a matter of "horses for courses". I guess some people might
regard (xml.dom, xml.dom.minidom, DIY "pullparser") as having a
slightly non-horizontal learning curve :)

*LOL*

Yes. Sadly, I had already climbed the dom/minidom/pullparser curve for
all the other stuff I have to do. In the space I work in, xml handling
is pervasive enough that I already have all that stuff "in my hands"
anyway.

I wonder if the OP ever read any of this.... :)

Thx,
Tom
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,767
Messages
2,569,572
Members
45,045
Latest member
DRCM

Latest Threads

Top