parsing an Excel formula with the re module

V

vsoler

Hello,

I am acessing an Excel file by means of Win 32 COM technology.
For a given cell, I am able to read its formula. I want to make a map
of how cells reference one another, how different sheets reference one
another, how workbooks reference one another, etc.

Hence, I need to parse Excel formulas. Can I do it by means only of re
(regular expressions)?

I know that for simple formulas such as "=3*A7+5" it is indeed
possible. What about complex for formulas that include functions,
sheet names and possibly other *.xls files?

For example "=Book1!A5+8" should be parsed into ["=","Book1", "!",
"A5","+","8"]

Can anybody help? Any suggestions?

Vicente Soler
 
M

MRAB

vsoler said:
Hello,

I am acessing an Excel file by means of Win 32 COM technology.
For a given cell, I am able to read its formula. I want to make a map
of how cells reference one another, how different sheets reference one
another, how workbooks reference one another, etc.

Hence, I need to parse Excel formulas. Can I do it by means only of re
(regular expressions)?

I know that for simple formulas such as "=3*A7+5" it is indeed
possible. What about complex for formulas that include functions,
sheet names and possibly other *.xls files?

For example "=Book1!A5+8" should be parsed into ["=","Book1", "!",
"A5","+","8"]

Can anybody help? Any suggestions?
Do you mean "how" or do you really mean "whether", ie, get a list of the
other cells that are referred to by a certain cell, for example,
"=3*A7+5" should give ["A7"] and "=Book1!A5+8" should give ["Book1!A5"]?
 
V

vsoler

vsoler said:
I am acessing an Excel file by means of Win 32 COM technology.
For a given cell, I am able to read its formula. I want to make a map
of how cells reference one another, how different sheets reference one
another, how workbooks reference one another, etc.
Hence, I need to parse Excel formulas. Can I do it by means only of re
(regular expressions)?
I know that for simple formulas such as "=3*A7+5" it is indeed
possible. What about complex for formulas that include functions,
sheet names and possibly other *.xls files?
For example    "=Book1!A5+8" should be parsed into ["=","Book1", "!",
"A5","+","8"]
Can anybody help? Any suggestions?

Do you mean "how" or do you really mean "whether", ie, get a list of the
other cells that are referred to by a certain cell, for example,
"=3*A7+5" should give ["A7"] and "=Book1!A5+8" should give ["Book1!A5"]?

I'd like to know how to do it, should it be possible.

Vicente
 
J

John Posner

Hello,

I am acessing an Excel file by means of Win 32 COM technology.
For a given cell, I am able to read its formula. I want to make a map
of how cells reference one another, how different sheets reference one
another, how workbooks reference one another, etc.

Hence, I need to parse Excel formulas. Can I do it by means only of re
(regular expressions)?

I know that for simple formulas such as "=3*A7+5" it is indeed
possible. What about complex for formulas that include functions,
sheet names and possibly other *.xls files?

For example "=Book1!A5+8" should be parsed into ["=","Book1", "!",
"A5","+","8"]

Can anybody help? Any suggestions?

It seems like you want to recreate data structures that Excel, itself,
must maintain in order to recalculate cells in the correct order. As long
as you're using COM, you might be able to tap into those data structures.
My 15-year-old (!) "Using Excel Visual Basic for Applications" book wasn't
any help. :-( After a short Google session, I came up with one possible
lead: http://www.decisionmodels.com/

Good luck!
John
 
M

Mensanator

vsoler said:
I am acessing an Excel file by means of Win 32 COM technology.
For a given cell, I am able to read its formula. I want to make a map
of how cells reference one another, how different sheets reference one
another, how workbooks reference one another, etc.
Hence, I need to parse Excel formulas. Can I do it by means only of re
(regular expressions)?
I know that for simple formulas such as "=3*A7+5" it is indeed
possible. What about complex for formulas that include functions,
sheet names and possibly other *.xls files?
For example    "=Book1!A5+8" should be parsed into ["=","Book1", "!",
"A5","+","8"]
Can anybody help? Any suggestions?

Do you mean "how" or do you really mean "whether", ie, get a list of the
other cells that are referred to by a certain cell, for example,
"=3*A7+5" should give ["A7"] and "=Book1!A5+8" should give ["Book1!A5]

Ok, although "Book1" would be the default name of a workbook, with
default
worksheets labeled "Sheet1". "Sheet2", etc.

If I had a worksheet named "Sheety" that wanted to reference a cell on
"Sheetx"
OF THE SAME WORKBOOK, it would be =Sheet2!A7. If the reference was to
a completely
different workbook (say Book1 with worksheets labeled "Sheet1",
"Sheet2") then
the cell might have =[Book1]Sheet1!A7.

And don't forget the $'s! You may see =[Book1]Sheet1!$A$7.
 
M

MRAB

vsoler said:
vsoler said:
Hello,
I am acessing an Excel file by means of Win 32 COM technology.
For a given cell, I am able to read its formula. I want to make a map
of how cells reference one another, how different sheets reference one
another, how workbooks reference one another, etc.
Hence, I need to parse Excel formulas. Can I do it by means only of re
(regular expressions)?
I know that for simple formulas such as "=3*A7+5" it is indeed
possible. What about complex for formulas that include functions,
sheet names and possibly other *.xls files?
For example "=Book1!A5+8" should be parsed into ["=","Book1", "!",
"A5","+","8"]
Can anybody help? Any suggestions?
Do you mean "how" or do you really mean "whether", ie, get a list of the
other cells that are referred to by a certain cell, for example,
"=3*A7+5" should give ["A7"] and "=Book1!A5+8" should give ["Book1!A5"]?

I'd like to know how to do it, should it be possible.
Something like this should work:

references = re.findall(r"\b((?:\w+!)?[A-Za-z]+\d+)\b", formula)
 
V

vsoler

vsoler said:
Hello,
I am acessing an Excel file by means of Win 32 COM technology.
For a given cell, I am able to read its formula. I want to make a map
of how cells reference one another, how different sheets reference one
another, how workbooks reference one another, etc.
Hence, I need to parse Excel formulas. Can I do it by means only of re
(regular expressions)?
I know that for simple formulas such as "=3*A7+5" it is indeed
possible. What about complex for formulas that include functions,
sheet names and possibly other *.xls files?
For example    "=Book1!A5+8" should be parsed into ["=","Book1", "!",
"A5","+","8"]
Can anybody help? Any suggestions?
Do you mean "how" or do you really mean "whether", ie, get a list of the
other cells that are referred to by a certain cell, for example,
"=3*A7+5" should give ["A7"] and "=Book1!A5+8" should give ["Book1!A5]

Ok, although "Book1" would be the default name of a workbook, with
default
worksheets labeled "Sheet1". "Sheet2", etc.

If I had a worksheet named "Sheety" that wanted to reference a cell on
"Sheetx"
OF THE SAME WORKBOOK, it would be =Sheet2!A7. If the reference was to
a completely
different workbook (say Book1 with worksheets labeled "Sheet1",
"Sheet2") then
the cell might have =[Book1]Sheet1!A7.

And don't forget the $'s! You may see =[Book1]Sheet1!$A$7.

Yes, Mensanator, but... what re should I use? I'm looking for the re
statement. No doubt you can help!

Thank you.
 
T

Tim Chase

vsoler said:
Hence, I need to parse Excel formulas. Can I do it by means only of re
(regular expressions)?

I know that for simple formulas such as "=3*A7+5" it is indeed
possible. What about complex for formulas that include functions,
sheet names and possibly other *.xls files?

Where things start getting ugly is when you have nested function
calls, such as

=if(Sum(A1:A25)>42,Min(B1:B25), if(Sum(C1:C25)>3.14,
(Min(C1:C25)+3)*18,Max(B1:B25)))

Regular expressions don't do well with nested parens (especially
arbitrarily-nesting-depth such as are possible), so I'd suggest
going for a full-blown parsing solution like pyparsing.

If you have fair control over what can be contained in the
formulas and you know they won't contain nested parens/functions,
you might be able to formulate some sort of "kinda, sorta, maybe
parses some forms of formulas" regexp.

-tkc
 
V

vsoler

vsoler wrote:
Hello,
I am acessing an Excel file by means of Win 32 COM technology.
For a given cell, I am able to read its formula. I want to make a map
of how cells reference one another, how different sheets reference one
another, how workbooks reference one another, etc.
Hence, I need to parse Excel formulas. Can I do it by means only of re
(regular expressions)?
I know that for simple formulas such as "=3*A7+5" it is indeed
possible. What about complex for formulas that include functions,
sheet names and possibly other *.xls files?
For example    "=Book1!A5+8" should be parsed into ["=","Book1", "!",
"A5","+","8"]
Can anybody help? Any suggestions?
Do you mean "how" or do you really mean "whether", ie, get a list of the
other cells that are referred to by a certain cell, for example,
"=3*A7+5" should give ["A7"] and "=Book1!A5+8" should give ["Book1!A5]
Ok, although "Book1" would be the default name of a workbook, with
default
worksheets labeled "Sheet1". "Sheet2", etc.
If I had a worksheet named "Sheety" that wanted to reference a cell on
"Sheetx"
OF THE SAME WORKBOOK, it would be =Sheet2!A7. If the reference was to
a completely
different workbook (say Book1 with worksheets labeled "Sheet1",
"Sheet2") then
the cell might have =[Book1]Sheet1!A7.
And don't forget the $'s! You may see =[Book1]Sheet1!$A$7.

Yes, Mensanator, but...  what re should I use? I'm looking for the re
statement. No doubt you can help!

Thank you.


Let me give you an example:
import re
re.split("([^0-9])", "123+456*/")
[’123’, ’+’, ’456’, ’*’, ’’, ’/’, ’’]

I find it excellent that one single statement is able to do a lexical
analysis of an expression!

If the expression contains variables, such as A12 or B9, I can try
another re expression. Which one should I use?

And if my expression contains parenthesis? And the sin() function?

Vicente Soler
 
M

MRAB

Mensanator said:
vsoler said:
Hello,
I am acessing an Excel file by means of Win 32 COM technology.
For a given cell, I am able to read its formula. I want to make a map
of how cells reference one another, how different sheets reference one
another, how workbooks reference one another, etc.
Hence, I need to parse Excel formulas. Can I do it by means only of re
(regular expressions)?
I know that for simple formulas such as "=3*A7+5" it is indeed
possible. What about complex for formulas that include functions,
sheet names and possibly other *.xls files?
For example "=Book1!A5+8" should be parsed into ["=","Book1", "!",
"A5","+","8"]
Can anybody help? Any suggestions?
Do you mean "how" or do you really mean "whether", ie, get a list of the
other cells that are referred to by a certain cell, for example,
"=3*A7+5" should give ["A7"] and "=Book1!A5+8" should give ["Book1!A5]

Ok, although "Book1" would be the default name of a workbook, with
default
worksheets labeled "Sheet1". "Sheet2", etc.

If I had a worksheet named "Sheety" that wanted to reference a cell on
"Sheetx"
OF THE SAME WORKBOOK, it would be =Sheet2!A7. If the reference was to
a completely
different workbook (say Book1 with worksheets labeled "Sheet1",
"Sheet2") then
the cell might have =[Book1]Sheet1!A7.

And don't forget the $'s! You may see =[Book1]Sheet1!$A$7.

I forgot about the dollars! In that case, the regex is:

references = re.findall(r"\b((?:\w+!)?\$?[A-Za-z]+\$?\d+)\b", formula)
 
S

Steve Holden

Tim said:
Where things start getting ugly is when you have nested function calls,
such as

=if(Sum(A1:A25)>42,Min(B1:B25), if(Sum(C1:C25)>3.14,
(Min(C1:C25)+3)*18,Max(B1:B25)))

Regular expressions don't do well with nested parens (especially
arbitrarily-nesting-depth such as are possible), so I'd suggest going
for a full-blown parsing solution like pyparsing.

If you have fair control over what can be contained in the formulas and
you know they won't contain nested parens/functions, you might be able
to formulate some sort of "kinda, sorta, maybe parses some forms of
formulas" regexp.
And don't forget about named ranges, which can reference cells without
using anything but a plain identifier ...

regards
Steve
 
J

John Machin

vsoler wrote:
Hello,
I am acessing an Excel file by means of Win 32 COM technology.
For a given cell, I am able to read its formula. I want to make a map
of how cells reference one another, how different sheets reference one
another, how workbooks reference one another, etc.
Hence, I need to parse Excel formulas. Can I do it by means only of re
(regular expressions)?
I know that for simple formulas such as "=3*A7+5" it is indeed
possible. What about complex for formulas that include functions,
sheet names and possibly other *.xls files?
For example    "=Book1!A5+8" should be parsed into ["=","Book1", "!",
"A5","+","8"]
Can anybody help? Any suggestions?
Do you mean "how" or do you really mean "whether", ie, get a list of the
other cells that are referred to by a certain cell, for example,
"=3*A7+5" should give ["A7"] and "=Book1!A5+8" should give ["Book1!A5]
Ok, although "Book1" would be the default name of a workbook, with
default
worksheets labeled "Sheet1". "Sheet2", etc.
If I had a worksheet named "Sheety" that wanted to reference a cell on
"Sheetx"
OF THE SAME WORKBOOK, it would be =Sheet2!A7. If the reference was to
a completely
different workbook (say Book1 with worksheets labeled "Sheet1",
"Sheet2") then
the cell might have =[Book1]Sheet1!A7.
And don't forget the $'s! You may see =[Book1]Sheet1!$A$7.
Yes, Mensanator, but...  what re should I use? I'm looking for the re
statement. No doubt you can help!
Thank you.

Let me give you an example:
import re
re.split("([^0-9])", "123+456*/")

[’123’, ’+’, ’456’, ’*’, ’’, ’/’, ’’]

I find it excellent that one single statement is able to do a lexical
analysis of an expression!

That is NOT lexical analysis.
If the expression contains variables, such as A12 or B9, I can try
another re expression. Which one should I use?

And if my expression contains parenthesis?   And the sin() function?

You need a proper lexical analysis, followed by a parser. What you
are trying to do can NOT be accomplished in any generality with a
single regex. The Excel formula syntax has several tricky bits. E.g.
IIRC whether TAX09 is a (macro) name or a cell reference depends on
what version of Excel you are targetting but if it appears like TAX09!
A1:B2 then it's a sheet name.

The xlwt package (of which I am the maintainer) has a lexer and parser
for a largish subset of the syntax ... see http://pypi.python.org/pypi/xlwt
 
J

John Machin

xlrd, no?

A facility in xlrd to decompile Excel formula bytecode into a text
formula is currently *under discussion*.

The OP was planning to dig the formula text out using COM then parse the
formula text looking for cell references and appeared to have a rather
simplistic view of the ease of parsing Excel formula text -- that's why
I pointed him at those facilities (existing, released, proven in the
field) in xlwt.
 
P

Paul McGuire

Where things start getting ugly is when you have nested function
calls, such as

   =if(Sum(A1:A25)>42,Min(B1:B25), if(Sum(C1:C25)>3.14,
(Min(C1:C25)+3)*18,Max(B1:B25)))

Regular expressions don't do well with nested parens (especially
arbitrarily-nesting-depth such as are possible), so I'd suggest
going for a full-blown parsing solution like pyparsing.

If you have fair control over what can be contained in the
formulas and you know they won't contain nested parens/functions,
you might be able to formulate some sort of "kinda, sorta, maybe
parses some forms of formulas" regexp.

-tkc

This might give the OP a running start:

from pyparsing import (CaselessKeyword, Suppress, Word, alphas,
alphanums, nums, Optional, Group, oneOf, Forward, Regex,
operatorPrecedence, opAssoc, dblQuotedString)

test1 = "=3*A7+5"
test2 = "=3*Sheet1!$A$7+5"
test3 = "=if(Sum(A1:A25)>42,Min(B1:B25), " \
"if(Sum(C1:C25)>3.14, (Min(C1:C25)+3)*18,Max(B1:B25)))"

EQ,EXCL,LPAR,RPAR,COLON,COMMA,DOLLAR = map(Suppress, '=!():,$')
sheetRef = Word(alphas, alphanums)
colRef = Optional(DOLLAR) + Word(alphas,max=2)
rowRef = Optional(DOLLAR) + Word(nums)
cellRef = Group(Optional(sheetRef + EXCL)("sheet") + colRef("col") +
rowRef("row"))

cellRange = (Group(cellRef("start") + COLON + cellRef("end"))
("range")
| cellRef )

expr = Forward()

COMPARISON_OP = oneOf("< = > >= <= != <>")
condExpr = expr + COMPARISON_OP + expr

ifFunc = (CaselessKeyword("if") +
LPAR +
Group(condExpr)("condition") +
COMMA + expr("if_true") +
COMMA + expr("if_false") + RPAR)
statFunc = lambda name : CaselessKeyword(name) + LPAR + cellRange +
RPAR
sumFunc = statFunc("sum")
minFunc = statFunc("min")
maxFunc = statFunc("max")
aveFunc = statFunc("ave")
funcCall = ifFunc | sumFunc | minFunc | maxFunc | aveFunc

multOp = oneOf("* /")
addOp = oneOf("+ -")
numericLiteral = Regex(r"\-?\d+(\.\d+)?")
operand = numericLiteral | funcCall | cellRange | cellRef
arithExpr = operatorPrecedence(operand,
[
(multOp, 2, opAssoc.LEFT),
(addOp, 2, opAssoc.LEFT),
])

textOperand = dblQuotedString | cellRef
textExpr = operatorPrecedence(textOperand,
[
('&', 2, opAssoc.LEFT),
])
expr << (arithExpr | textExpr)

import pprint
for test in (test1,test2, test3):
print test
pprint.pprint( (EQ + expr).parseString(test).asList() )
print


Prints:

=3*A7+5
[[['3', '*', ['A', '7']], '+', '5']]

=3*Sheet1!$A$7+5
[[['3', '*', ['Sheet1', 'A', '7']], '+', '5']]

=if(Sum(A1:A25)>42,Min(B1:B25), if(Sum(C1:C25)>3.14, (Min(C1:C25)+3)
*18,Max(B1:B25)))
['if',
['sum', [['A', '1'], ['A', '25']], '>', '42'],
'min',
[['B', '1'], ['B', '25']],
'if',
['sum', [['C', '1'], ['C', '25']], '>', '3.14'],
[['min', [['C', '1'], ['C', '25']], '+', '3'], '*', '18'],
'max',
[['B', '1'], ['B', '25']]]


-- Paul
 
C

Chris Withers

John said:
The OP was planning to dig the formula text out using COM then parse the
formula text looking for cell references and appeared to have a rather
simplistic view of the ease of parsing Excel formula text -- that's why
I pointed him at those facilities (existing, released, proven in the
field) in xlwt.

Which bits of xlwt are you referring to? (at a guess, the stuff that
turns a piece of text into the correct formulae gubbinz when you write a
formula to a cell with xlwt?)

cheers,

Chris
 
J

John Machin

This might give the OP a running start:

Unfortunately "this" will blow up after only a few paces; see
below ...
from pyparsing import (CaselessKeyword, Suppress, Word, alphas,
    alphanums, nums, Optional, Group, oneOf, Forward, Regex,
    operatorPrecedence, opAssoc, dblQuotedString)

test1 = "=3*A7+5"
test2 = "=3*Sheet1!$A$7+5"

test2a ="=3*'Sheet 1'!$A$7+5"
test2b ="=3*'O''Reilly''s sheet'!$A$7+5"

test3 = "=if(Sum(A1:A25)>42,Min(B1:B25), " \
     "if(Sum(C1:C25)>3.14, (Min(C1:C25)+3)*18,Max(B1:B25)))"

Many functions can take a variable number of args and they are not
restricted to cell references e.g.

test3a = "=sum(a1:a25,10,min(b1,c2,d3))"

The arg separator is comma or semicolon depending on the locale ... a
parser should accept either.

EQ,EXCL,LPAR,RPAR,COLON,COMMA,DOLLAR = map(Suppress, '=!():,$')
sheetRef = Word(alphas, alphanums)
colRef = Optional(DOLLAR) + Word(alphas,max=2)
rowRef = Optional(DOLLAR) + Word(nums)
cellRef = Group(Optional(sheetRef + EXCL)("sheet") + colRef("col") +
                    rowRef("row"))

cellRange = (Group(cellRef("start") + COLON + cellRef("end"))
("range")
                | cellRef )

expr = Forward()

COMPARISON_OP = oneOf("< = > >= <= != <>")
condExpr = expr + COMPARISON_OP + expr

ifFunc = (CaselessKeyword("if") +
          LPAR +
          Group(condExpr)("condition") +

that should be any expression; at run-time it expects a boolean (TRUE
or FALSE) or a number (0 means false, non-0 means true). Text causes a
#VALUE! error. Trying to subdivide expressions into conditional /
numeric /text just won't work.

          COMMA + expr("if_true") +
          COMMA + expr("if_false") + RPAR)
statFunc = lambda name : CaselessKeyword(name) + LPAR + cellRange +
RPAR
sumFunc = statFunc("sum")
minFunc = statFunc("min")
maxFunc = statFunc("max")
aveFunc = statFunc("ave")
funcCall = ifFunc | sumFunc | minFunc | maxFunc | aveFunc

multOp = oneOf("* /")
addOp = oneOf("+ -")

needs power op "^"
numericLiteral = Regex(r"\-?\d+(\.\d+)?")

Sorry, that "-" in there is a unary minus operator. What about 1e23 ?
operand = numericLiteral | funcCall | cellRange | cellRef
arithExpr = operatorPrecedence(operand,
    [
    (multOp, 2, opAssoc.LEFT),
    (addOp, 2, opAssoc.LEFT),
    ])

textOperand = dblQuotedString | cellRef
textExpr = operatorPrecedence(textOperand,
    [
    ('&', 2, opAssoc.LEFT),
    ])

Excel evaluates excessively permissively, and the punters are
definitely not known for self-restraint. The above just won't work:
2.3 & 4.5 produces text "2.34.5", while "2.3" + "4.5" produces number
6.8.
 
J

John Machin

Did you build those parsing rules just by common sense, or following some  
actual specification?

Leave your common sense with the barkeep when you enter the Excel
saloon; it is likely to be a hindrance. The specification is what
Excel does.
 
P

Paul McGuire

I never represented that this parser would handle any and all Excel
formulas! But I should hope the basic structure of a pyparsing
solution might help the OP add some of the other features you cited,
if necessary. It's actually pretty common to take an incremental
approach in making such a parser, and so here are some of the changes
that you would need to make based on the deficiencies you pointed out:

functions can have a variable number of arguments, of any kind of
expression
- statFunc = lambda name : CaselessKeyword(name) + LPAR + delimitedList
(expr) + RPAR

sheet name could also be a quoted string
- sheetRef = Word(alphas, alphanums) | QuotedString("'",escQuote="''")

add boolean literal support
- boolLiteral = oneOf("TRUE FALSE")
- operand = numericLiteral | funcCall | boolLiteral | cellRange |
cellRef

These small changes are enough to extend the parser to successfully
handle the test2a, 2b, and 3a cases. (I'll add this to the pyparsing
wiki examples, as it looks like it is a good start on a familiar but
complex expression.)

-- Paul
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,015
Latest member
AmbrosePal

Latest Threads

Top