Regex and chemistry

Jonas Nilsson

I'd like to parse equilibria and linear dependencies of concentrations as
follows:

_Equilibria_
Input:
Three parts:
1. Reactants
2. Products
3. Equilibrium constant.

1,2: Reactants and products are separated by /\s+[^\s]*>[^\s]*\s+/.
3: The equilibrium constant is in parentheses, with an optional value.

Individual items in reactants and products are separated by /\s+\+\s+/.
Each item has a coefficient (decimal, integer or fraction such as 1/2;
1 if omitted) and an identifier.

Output:
Equilibrium constant with optional value, reactants with coefficients
multiplied by -1 and products with coefficients unchanged, like this:

2A + 3B <=> C (K=1.02e-2)
parses to: ['K=1.02e-2', [-2, 'A'], [-3, 'B'], [1, 'C']]

(Kox) H2 + 1/2O2 => H2O
parses to: ['Kox', [-1, 'H2'], [-0.5, 'O2'], [1, 'H2O']]

EP+ <-> E + P+ (Kdiss=1.06e-7)
parses to: ['Kdiss=1.06e-7', [-1,'EP+'], [1, 'E'], [1, 'P+']]
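
To illustrate the item rule above (coefficient, then identifier), here is a minimal sketch of a stricter item pattern than a plain [\d\.]* character class. The helper name parse_item and the pattern $NUM are made up for this example, and it assumes that an identifier never starts with a digit, '.' or '/'.

```perl
use strict;
use warnings;

# Unsigned integer or decimal: "2", "0.5", "2.", ".5" (but not "1.2.3").
my $NUM = qr/\d+(?:\.\d*)?|\.\d+/;

# Hypothetical helper (not part of the code in this post): splits one item
# such as "1/2O2" into (coefficient, identifier).
# Assumption: an identifier never starts with a digit, '.' or '/'.
sub parse_item {
    my ($item) = @_;
    my ($num, $den, $ident) =
        $item =~ m{^\s*(?:($NUM)(?:/($NUM))?)?\s*([^\s\d./].*?)\s*\z}
        or return;                          # no match: signal failure
    return if defined $den && $den == 0;    # refuse to divide by zero
    my $coeff = !defined $num ? 1           # no coefficient given
              : defined $den  ? $num / $den # fraction
              :                 $num + 0;   # plain number
    return ($coeff, $ident);
}

my ($coeff, $ident) = parse_item('1/2O2');
print "$coeff $ident\n";    # prints "0.5 O2"
```

Because the denominator is only allowed after a numerator, an item like "/2A" is rejected instead of silently getting a coefficient of 1.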

_Linear_dependencies_
Input:
Two or three parts separated by /=/.
If there are three parts, one of them is a number (any notation).
Of the remaining two parts, one is an identifier and the other is a linear
combination (coefficients in integer or decimal notation) of concentrations.
A concentration is written as [_ident_].

Output:
Identifier with optional value, and concentrations with coefficients.

CAtot = 2*[ CA2 ] + [CAKE]=1e-6
Parses to: ['CAtot=1e-6', ['2', 'CA2'], ['1', 'CAKE']]

charge=0=[Na+] - 2 * [SO4 2-]
Parses to: ['charge=0', ['1', 'Na+'], ['-2', 'SO4 2-']]

[A] + [B] + 0.5*[C] = tot
Parses to: ['tot', ['1', 'A'], ['1', 'B'], ['0.5', 'C']]
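
The linear-combination side could also be tokenized term by term with an anchored global match. This is only a sketch: the name parse_terms is made up for this example, it returns numeric coefficients, and it treats any leftover text as an error by returning an empty list.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the code in this post): turns a linear
# combination such as "2*[ CA2 ] + [CAKE]" into ([2,'CA2'], [1,'CAKE']).
# Returns an empty list on any syntax error.
sub parse_terms {
    my ($expr) = @_;
    my @terms;
    # \G continues where the previous match stopped; /c keeps that position
    # when a match fails, so the leftover check below can use it.
    while ($expr =~ /\G\s*([+-])?\s*(?:(\d+(?:\.\d*)?|\.\d+)\s*\*\s*)?\[\s*([^\]]+?)\s*\]/gc) {
        my ($sign, $num, $ident) = ($1, $2, $3);
        return if @terms && !defined $sign;   # only the first term may omit its sign
        my $coeff = defined $num ? $num + 0 : 1;
        $coeff = -$coeff if defined $sign && $sign eq '-';
        push @terms, [$coeff, $ident];
    }
    # Nothing but whitespace may remain after the last term.
    return unless @terms && $expr =~ /\G\s*\z/gc;
    return @terms;
}

print "@$_\n" for parse_terms('[Na+] - 2 * [SO4 2-]');
# prints "1 Na+" and then "-2 SO4 2-"
```

Anchoring every match with \G means stray text between terms (for example a doubled '+') cannot be skipped over; the match stops there and the leftover check fails.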

Below is some code that croaks on some errors in the input and parses the
strings. Would you please be so kind as to comment on it and propose some
improvements? The croaks should give some meaningful hints to the user, but
that is left out for now...
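
As an illustration of the kind of hint meant here, a minimal sketch for the arrow-splitting step only. The helper name split_reaction and the message texts are made up for this example; the split pattern is the reactant/product separator described above.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical helper (not part of the code in this post): splits a reaction
# on its arrow and croaks with a hint when the arrow is missing or repeated.
sub split_reaction {
    my ($line) = @_;
    my @sides = split /\s+\S*>\S*\s+/, $line;
    croak "No reaction arrow (e.g. '=>' or '<=>') surrounded by spaces in '$line'"
        if @sides < 2;
    croak "More than one reaction arrow in '$line'"
        if @sides > 2;
    return @sides;
}

# The caller can trap the error and show the hint to the user.
my @sides = eval { split_reaction('2A + 3B = C') };
print "Error: $@" if $@;
```

croak reports the error from the caller's point of view, so the message points at the line that passed in the bad input rather than at the helper itself.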

_CODE_

use strict;
use Carp;

my @test=(
"2A + 3B <=> C (K=1.02e-2)",
" (Kox) H2 + 1/2O2 => H2O",
"EP+ <-> E + P+ (Kdiss=1.06e-7)"
);
for (@test) {
    my $equi=ParseEqui($_);
    print "$_\nparses to: ";
    Dumpit($equi);
}
print "-" x 80,"\n";

my @test2=(
"CAtot = 2*[CA2] + [CAKE]=1e-6",
" charge=0=[Na+] - 2 * [SO4 2-]",
"[A] + [B] + 0.5*[C] = tot "
);
for (@test2) {
    my $tot=ParseTot($_);
    print "$_\nParses to: ";   # Dumpit prints the opening bracket itself
    Dumpit($tot);
}


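sub ParseEqui {
    $_=shift;
    # Split into reactants and products on the arrow (any token containing '>').
    croak unless (my @bits=split /\s+[^\s]*>[^\s]*\s+/)==2;
    # Remove the parenthesised equilibrium constant from whichever side holds it.
    $bits[0]=~s/(^|\s)\(([^\)]+)\)(\s|$)// ||
    $bits[1]=~s/(^|\s)\(([^\)]+)\)(\s|$)// or croak;
    my $equi=[$2];
    for my $lr (0,1) { #left or right?
        for (split /\s+\+\s+/,$bits[$lr]) {
            # $1 = numerator, $3 = optional denominator, $4 = identifier
            m/^\s*([\d\.]*)(\/([\d\.]*)|)\s*(.+?)\s*$/ or croak;
            my $coeff=$1?$2?$1/$3:$1:1;
            push @{$equi},[$lr?$coeff:-$coeff,$4];
        }
    }
    return $equi;
}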

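sub ParseTot {
    $_=shift;
    # Accept exactly two or three '='-separated parts (int(2/2)==int(3/2)==1).
    croak unless int 0.5*(my @bits=split /\s*=\s*/)==1;
    my $num;
    if (@bits==3) {
        # Find the part that is a plain number and pull it out.
        my $i=0;
        while ($bits[$i]!~/^\s*([\d\.]+([eE](\+|-|)\d+)?)\s*$/) {
            $i++;
            croak if $i==3;
        }
        $num=splice @bits,$i,1;
    }
    # Put the identifier first and the linear combination second.
    @bits=reverse @bits if $bits[0]=~/\[/ && $bits[1]!~/\[/;
    $bits[0]=~s/^\s+|\s+$//g;
    my $tot=[defined $num?"$bits[0]=$num":$bits[0]];
    # Prefix a '+' so that every term carries an explicit sign.
    $bits[1]="+ $bits[1]";
    push @{$tot},[0+($2?$1.$3:$1.1),$4] while
        ($bits[1]=~s/\s*([+-])\s*(([\d\.]+)\s*\*\s*)?\[\s*([^\]]+?)\s*\]\s*//);
    # Anything left over is a syntax error.
    croak if $bits[1]=~/[^\s]/;
    return $tot;
}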

sub Dumpit {
    my $in=shift;
    print "[";
    for (@{$in}) {
        if (ref($_)) {
            print "['",join("',\t'",@{$_}),"'],\t";
        } else {
            print "'$_',\t";
        }
    }
    print "]\n\n";
}
 
