reverse engineering Excel spreadsheet

D

Duncan Smith

Hello,
I am currently implementing (mainly in Python) 'models' that come
to me as Excel spreadsheets, with little additional information. I am
expected to use these models in a web application. Some contain many
worksheets and various macros.

What I'd like to do is extract the data and business logic so that I can
figure out exactly what these models actually do and code it up. An
obvious (I think) idea is to generate an acyclic graph of the cell
dependencies so that I can identify which cells contain only data (no
parents) and those that depend on other cells. If I could also extract
the relationships (functions), then I could feasibly produce something
in pure Python that would mirror the functionality of the original
spreadsheet (using e.g. Matplotlib for plots and more reliable RNGs /
statistical functions).

The final application will be running on a Linux server, but I can use a
Windows box (i.e. win32all) for processing the spreadsheets (hopefully
not manually). Any advice on the feasibility of this, and how I might
achieve it would be appreciated.

I assume there are plenty of people who have a better knowledge of e.g.
COM than I do. I suppose an alternative would be to convert to Open
Office and use PyUNO, but I have no experience with PyUNO and am not
sure how much more reliable the statistical functions of Open Office
are. At the end of the day, the business logic will not generally be
complex, it's extracting it from the spreadsheet that's awkward. Any
advice appreciated. TIA. Cheers.

Duncan
 
I

irstas

Hello,
I am currently implementing (mainly in Python) 'models' that come
to me as Excel spreadsheets, with little additional information. I am
expected to use these models in a web application. Some contain many
worksheets and various macros.

What I'd like to do is extract the data and business logic so that I can
figure out exactly what these models actually do and code it up. An
obvious (I think) idea is to generate an acyclic graph of the cell
dependencies so that I can identify which cells contain only data (no
parents) and those that depend on other cells. If I could also extract
the relationships (functions), then I could feasibly produce something
in pure Python that would mirror the functionality of the original
spreadsheet (using e.g. Matplotlib for plots and more reliable RNGs /
statistical functions).

The final application will be running on a Linux server, but I can use a
Windows box (i.e. win32all) for processing the spreadsheets (hopefully
not manually). Any advice on the feasibility of this, and how I might
achieve it would be appreciated.

I assume there are plenty of people who have a better knowledge of e.g.
COM than I do. I suppose an alternative would be to convert to Open
Office and use PyUNO, but I have no experience with PyUNO and am not
sure how much more reliable the statistical functions of Open Office
are. At the end of the day, the business logic will not generally be
complex, it's extracting it from the spreadsheet that's awkward. Any
advice appreciated. TIA. Cheers.

Duncan

I'm not sure I understood what kind of information you want to
get out of the Excel sheet, sorry. But I hope this'll get you started
(at least it has a few nice tokens that might help you in googling):

import win32com.client

class Excel:
def __init__(self, filename):
self.closed = True
self.xlApp =
win32com.client.dynamic.Dispatch('Excel.Application')
self.xlBook = self.xlApp.Workbooks.Open(filename)
self.closed = False

def sheet(self, sheetName):
return self.xlBook.Worksheets(sheetName)

def __del__(self):
if not self.closed:
self.close()

def close(self):
self.xlBook.Close(SaveChanges=1)
self.xlApp.Quit()
self.closed = True

excel = Excel('file.xls')
sheet = excel.sheet(1)
print sheet.Cells(6, 3)


I used it a few years ago to read and populate spreadsheet cells.
 
L

Laurent Pointal

Duncan said:
Hello,
I am currently implementing (mainly in Python) 'models' that come
to me as Excel spreadsheets, with little additional information. I am
expected to use these models in a web application. Some contain many
worksheets and various macros.

What I'd like to do is extract the data and business logic so that I can
figure out exactly what these models actually do and code it up. An
obvious (I think) idea is to generate an acyclic graph of the cell
dependencies so that I can identify which cells contain only data (no
parents) and those that depend on other cells. If I could also extract
the relationships (functions), then I could feasibly produce something
in pure Python that would mirror the functionality of the original
spreadsheet (using e.g. Matplotlib for plots and more reliable RNGs /
statistical functions).

The final application will be running on a Linux server, but I can use a
Windows box (i.e. win32all) for processing the spreadsheets (hopefully
not manually). Any advice on the feasibility of this, and how I might
achieve it would be appreciated.

I assume there are plenty of people who have a better knowledge of e.g.
COM than I do. I suppose an alternative would be to convert to Open
Office and use PyUNO, but I have no experience with PyUNO and am not
sure how much more reliable the statistical functions of Open Office
are. At the end of the day, the business logic will not generally be
complex, it's extracting it from the spreadsheet that's awkward. Any
advice appreciated. TIA. Cheers.

Duncan

As I remember, there is a documentation about Excel documents in xlrd
package. And with that, you dont need to use Excel via COM to find data in
the document.
http://www.lexicon.net/sjmachin/xlrd.htm

May also look at pyExcelerator
http://sourceforge.net/projects/pyexcelerator/
 
J

John Machin

Hello,
I am currently implementing (mainly in Python) 'models' that come
to me as Excel spreadsheets, with little additional information. I am
expected to use these models in a web application. Some contain many
worksheets and various macros.

What I'd like to do is extract the data and business logic so that I can
figure out exactly what these models actually do and code it up. An
obvious (I think) idea is to generate an acyclic graph of the cell
dependencies so that I can identify which cells contain only data (no
parents) and those that depend on other cells. If I could also extract
the relationships (functions), then I could feasibly produce something
in pure Python that would mirror the functionality of the original
spreadsheet (using e.g. Matplotlib for plots and more reliable RNGs /
statistical functions).

There are two things you have to consider that can create those
dependencies (1) formulas (2) macros. If it were only formulas, you
wouldn't need to generate an acyclic graph. A no-parents cell will be
defined by a type-specific record (e.g. NUMBER). Dependent cells are
defined by a FORMULA etc record. Similar story with the XML format in
Office 2007.

I'm not aware of any way to use Python to access the content of
macros. It may be possible using COM ...

I'm a bit puzzled by your notion of creating a dependency graph
*without* first extracting the "relationships (functions)" [which you
could do only by parsing the formulas and macros].

"pure Python that would mirror the functionality of the original
spreadsheet": mammoth job; speed?

Before you do that, have a look at (1) xlrd docs (2) xlrd source (3)
Openoffice.org docs of XLS format (4) MS docs (now an ECMA standard)
of Office 2007 XML-based files.
The final application will be running on a Linux server, but I can use a
Windows box (i.e. win32all) for processing the spreadsheets (hopefully
not manually). Any advice on the feasibility of this, and how I might
achieve it would be appreciated.

I assume there are plenty of people who have a better knowledge of e.g.
COM than I do. I suppose an alternative would be to convert to Open
Office and use PyUNO, but I have no experience with PyUNO and am not
sure how much more reliable the statistical functions of Open Office
are. At the end of the day, the business logic will not generally be
complex, it's extracting it from the spreadsheet that's awkward. Any
advice appreciated. TIA. Cheers.

Duncan

PyUNO: Google around a bit. My experience was: tried it, got lost in
the jungle of its documentation.

HTH,
John
 
D

Duncan Smith

John said:
There are two things you have to consider that can create those
dependencies (1) formulas (2) macros. If it were only formulas, you
wouldn't need to generate an acyclic graph. A no-parents cell will be
defined by a type-specific record (e.g. NUMBER). Dependent cells are
defined by a FORMULA etc record. Similar story with the XML format in
Office 2007.

Well yes, partitioning the cells according to format would be start.
Now you have me wondering about the new XML format. Maybe that would be
easier to deal with?
I'm not aware of any way to use Python to access the content of
macros. It may be possible using COM ...

I suppose I can do that manually. Most of them seem to do with
presentation rather than the business logic.
I'm a bit puzzled by your notion of creating a dependency graph
*without* first extracting the "relationships (functions)" [which you
could do only by parsing the formulas and macros].

Incomplete parsing :). I thought I might be able to pull out the
dependencies relatively easily, then go back to the spreadsheet to
examine the formulae.
"pure Python that would mirror the functionality of the original
spreadsheet": mammoth job; speed?

Basically to try and separate the data, business logic and presentation
for coding up in Python. Particularly, I don't want to rely on Excel's
statistical functions. But it's also possible that other similar models
will be making there way to me, and I'd rather have a few simple classes
to handle the business logic, and have model parameters etc. tucked away
in text files.
Before you do that, have a look at (1) xlrd docs (2) xlrd source (3)
Openoffice.org docs of XLS format (4) MS docs (now an ECMA standard)
of Office 2007 XML-based files.

Yes, will do.
PyUNO: Google around a bit. My experience was: tried it, got lost in
the jungle of its documentation.

Yep. Me too. I might have another go at it. Cheers (to all who replied).

Duncan
 
P

Paddy

Hello,
I am currently implementing (mainly in Python) 'models' that come
to me as Excel spreadsheets, with little additional information. I am
expected to use these models in a web application. Some contain many
worksheets and various macros.

What I'd like to do is extract the data and business logic so that I can
figure out exactly what these models actually do and code it up. An
obvious (I think) idea is to generate an acyclic graph of the cell
dependencies so that I can identify which cells contain only data (no
parents) and those that depend on other cells. If I could also extract
the relationships (functions), then I could feasibly produce something
in pure Python that would mirror the functionality of the original
spreadsheet (using e.g. Matplotlib for plots and more reliable RNGs /
statistical functions).

The final application will be running on a Linux server, but I can use a
Windows box (i.e. win32all) for processing the spreadsheets (hopefully
not manually). Any advice on the feasibility of this, and how I might
achieve it would be appreciated.

I assume there are plenty of people who have a better knowledge of e.g.
COM than I do. I suppose an alternative would be to convert to Open
Office and use PyUNO, but I have no experience with PyUNO and am not
sure how much more reliable the statistical functions of Open Office
are. At the end of the day, the business logic will not generally be
complex, it's extracting it from the spreadsheet that's awkward. Any
advice appreciated. TIA. Cheers.

Duncan

Hi Duncan,
OOffice can save sheets in Sylk format which gives you a simple
textual format for cells, including the equations. Can't think
of any easier way with the macros other than hard slog!

P.S. It is well to remember that the UK Tax department have a
very low opinion of the quality of spreadsheets so if you find
oddities remember to query them.

- Paddy.
 
B

Bart Willems

John said:
I'm a bit puzzled by your notion of creating a dependency graph
*without* first extracting the "relationships (functions)" [which you
could do only by parsing the formulas and macros].

Not really. The range object in the Excel object model has a Dependents
attribute, although for creating a graph you might want to rely on
DirectDependents for obvious reasons. There's no reason to parse
You can use that for each cell to get a list of the cells that have
their formulas depending on the cell you are interrogating. There's also
a Precedents and DirectPrecedents property if you want to walk the trees
in the other direction. :)

More important, get some *serious* books about Excel VBA programming (I
can recommend Walkenbach's Power Programming). Excel has a an extremely
complex object model that can do almost anything for you - writing code
against Excel without knowledge of that data will mean that you'll be
writing 100 lines of Python code (for instance, to parse formulas) to do
something that could have been done in 5 lines of Visual Basic code (or
Python for that matter, but I can make a better point by using VBA :) ).

Best Regards,
Bart
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,062
Latest member
OrderKetozenseACV

Latest Threads

Top