string parsing / regexp question

Ryan Krauss · Nov 28, 2007

I need to parse the following string:

$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=\pmatrix{\left({{{\it m_2}\,s^2
}\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
}\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
\right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$

The first thing I need to do is extract the arguments to \pmatrix{ }
on both the left and right hand sides of the equal sign, so that the
first argument is extracted as

{\it x_2}\cr 0\cr 1\cr

and the second is

\left({{{\it m_2}\,s^2
}\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
}\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
\right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr

The trick is that there are extra curly braces inside the \pmatrix{ }
strings and I don't know how to write a regexp that would count the
number of open and close curly braces and make sure they match, so
that it can find the correct ending curly brace.

Any suggestions?

I would prefer a regexp solution, but am open to other approaches.

Thanks,

Ryan

Paul McGuire · Nov 28, 2007

As Tim Grove points out, writing a grammar for this expression is
really pretty simple, especially using the latest version of
pyparsing, which includes a new helper method, nestedExpr. Here is
the whole program to parse your example:

from pyparsing import *

data = r"""$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=
\pmatrix{\left({{{\it m_2}\,s^2
}\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
}\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
\right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$"""

# \pmatrix followed by a brace-delimited argument, on each side of "="
PMATRIX = Literal(r"\pmatrix")
nestedBraces = nestedExpr("{", "}")
grammar = "$$" + PMATRIX + nestedBraces + "=" + \
          PMATRIX + nestedBraces + \
          "$$"
res = grammar.parseString(data)
print(res)

This prints the following:

['$$', '\\pmatrix', [['\\it', 'x_2'], '\\cr', '0\\cr', '1\\cr'], '=',
'\\pmatrix', ['\\left(', [[['\\it', 'm_2'], '\\,s^2'], '\\over',
['k']], '+1\\right)\\,', ['\\it', 'x_1'], '-', [['F'], '\\over',
['k']], '\\cr', '-', [[['\\it', 'm_2'], '\\,s^2\\,F'], '\\over',
['k']], '-F+\\left(', ['\\it', 'm_2'], '\\,s^2\\,\\left(', [[['\\it',
'm_2'], '\\,s^2'], '\\over', ['k']], '+1', '\\right)+', ['\\it',
'm_2'], '\\,s^2\\right)\\,', ['\\it', 'x_1'], '\\cr', '1\\cr'], '$$']

Okay, maybe this looks a bit messy. But believe it or not, the
returned results give you access to each grammar element as:

['$$', '\\pmatrix', [nested arg list], '=', '\\pmatrix',
[nested arg list], '$$']

Not only has the parser handled the {} nesting levels, but it has
structured the returned tokens according to that nesting. (The '{}'s
are gone now, since their delimiting function has been replaced by the
nesting hierarchy in the results.)
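
To make the idea of "nesting hierarchy in the results" concrete, here is a rough pure-Python sketch of how brace nesting can be turned into nested lists. This is not pyparsing's implementation; nest_braces is a hypothetical helper written only for illustration, and it assumes that tokens are separated by whitespace and braces and that braces are never escaped.

```python
def nest_braces(text):
    """Return the brace structure of text as nested lists (illustrative only)."""
    root = []
    stack = [root]          # stack[-1] is the list currently being filled
    token = ""

    def flush():
        # Store the token collected so far, if any, in the current list.
        nonlocal token
        if token:
            stack[-1].append(token)
            token = ""

    for ch in text:
        if ch == "{":
            flush()
            child = []
            stack[-1].append(child)   # a new nesting level starts here
            stack.append(child)
        elif ch == "}":
            flush()
            if len(stack) == 1:
                raise ValueError("unmatched closing brace")
            stack.pop()               # back to the enclosing level
        elif ch.isspace():
            flush()
        else:
            token += ch
    flush()
    if len(stack) != 1:
        raise ValueError("unmatched opening brace")
    return root


lhs = nest_braces(r"{{\it x_2}\cr 0\cr 1\cr }")
print(lhs[0])
```

For the left-hand argument, lhs[0] has the same shape as the first nested list in the output above. A real grammar is still the better choice once the input gets more varied.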

You could use tuple assignment to get at the individual fields:

dummy, dummy, lhs_args, dummy, dummy, rhs_args, dummy = res

Or you could access the fields in res using list indexing:

lhs_args, rhs_args = res[2], res[5]

But both of these methods will break if you decide to extend the
grammar with additional or optional fields.

A safer approach is to give the grammar elements results names, as in
this slightly modified version of grammar:

grammar = "$$" + PMATRIX + nestedBraces("lhs_args") + "=" + \
          PMATRIX + nestedBraces("rhs_args") + \
          "$$"

Now you can access the parsed fields as if the results were a dict
with keys "lhs_args" and "rhs_args", or as an object with attributes
named "lhs_args" and "rhs_args":

res = grammar.parseString(data)
print(res["lhs_args"])   # dict-style access
print(res["rhs_args"])
print(res.lhs_args)      # attribute-style access
print(res.rhs_args)

Note that the default behavior of nestedExpr is to give back a nested
list of the elements according to how the original text was nested
within braces.

If you just want the original text, add a parse action to nestedBraces
to do this for you (keepOriginalText is another pyparsing builtin).
The parse action is executed at parse time, so no post-processing is
needed after the parsed results are returned:

nestedBraces.setParseAction(keepOriginalText)
grammar = "$$" + PMATRIX + nestedBraces("lhs_args") + "=" + \
          PMATRIX + nestedBraces("rhs_args") + \
          "$$"

res = grammar.parseString(data)
print(res)
print(res.lhs_args)
print(res.rhs_args)

Now this program returns the original text for the nested brace
expressions:

['$$', '\\pmatrix', '{{\\it x_2}\\cr 0\\cr 1\\cr }', '=', '\\pmatrix',
'{\\left({{{\\it m_2}\\,s^2 \n }\\over{k}}+1\\right)\\,{\\it x_1}-{{F}\
\over{k}}\\cr -{{{\\it m_2}\\,s^2\\,F \n }\\over{k}}-F+\\left({\\it
m_2}\\,s^2\\,\\left({{{\\it m_2}\\,s^2}\\over{k}}+1 \n \\right)+{\\it
m_2}\\,s^2\\right)\\,{\\it x_1}\\cr 1\\cr }', '$$']
['{{\\it x_2}\\cr 0\\cr 1\\cr }']
['{\\left({{{\\it m_2}\\,s^2 \n }\\over{k}}+1\\right)\\,{\\it x_1}-{{F}
\\over{k}}\\cr -{{{\\it m_2}\\,s^2\\,F \n }\\over{k}}-F+\\left({\\it
m_2}\\,s^2\\,\\left({{{\\it m_2}\\,s^2}\\over{k}}+1 \n \\right)+{\\it
m_2}\\,s^2\\right)\\,{\\it x_1}\\cr 1\\cr }']
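
For comparison, the brace counting asked about in the original question can also be done by hand, without pyparsing. Python's standard re module has no recursive pattern, so a plain loop that tracks the nesting depth is used here. This is only a sketch: brace_args is a hypothetical helper name, and it assumes that braces are never escaped (no \{ or \}) and that the opening brace follows \pmatrix directly.

```python
def brace_args(text, command=r"\pmatrix"):
    """Return the raw text inside the braces that follow each `command`."""
    args = []
    pos = 0
    while True:
        start = text.find(command + "{", pos)
        if start == -1:
            return args
        begin = start + len(command) + 1   # first character after the opening brace
        depth = 1
        i = begin
        while i < len(text) and depth:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth:
            raise ValueError("unbalanced braces after " + command)
        args.append(text[begin:i - 1])     # drop the matching closing brace
        pos = i


data = r"""$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=\pmatrix{\left({{{\it m_2}\,s^2
}\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
}\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
\right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$"""

lhs, rhs = brace_args(data)
print(lhs)
print(rhs)
```

Unlike the pyparsing output above, the strings returned here leave out the outer pair of braces, which is the form requested in the original question.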

You can find more info on pyparsing at http://pyparsing.wikispaces.com.

Cheers!
-- Paul

Paul McGuire · Nov 28, 2007

As Tim Grove points out, ...

s/Grove/Chase/

Sorry, Tim!

-- Paul

Tim Chase · Nov 28, 2007

Paul said:
s/Grove/Chase/

Sorry, Tim!

No problem...it's not like there aren't enough Tims on the list
as it is.

-tkc

Ryan Krauss · Nov 28, 2007

Interesting. Thanks Paul and Tim. This looks very promising.

Ryan

Ryan Krauss · Nov 28, 2007

I can't seem to access pyparsing on wikispaces. Is there something
wrong with the website right now?
