Parsing a chemical formula

Discussion in 'Perl Misc' started by Luotao Fu, Feb 25, 2005.

  1. Luotao Fu

    Luotao Fu Guest

    Hi all,
    This is my first post to this group, so sorry for any possible stupidity :)
    I have been writing a Perl program for some days now. The program contains a
    small routine which should parse a chemical formula and return the name and
    proportion of each atom in the material as an array (or a hash). My code
    looks like this:

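    my @atoms;
    # split keeps the capital letters; drop the empty leading field
    my @literals = grep { length } split /([A-Z])/, $molecule;

    for (my $i = 0; $i <= $#literals; $i++) {
        my @atom;
        print "Literal: ", $literals[$i], "\n";
        push(@atom, $literals[$i]);
        # a following item without a capital letter belongs to this atom
        if ($i < $#literals && $literals[$i+1] !~ /[A-Z]/) {
            push(@atom, $literals[$i+1]);
            $i++;
        }
        push(@atoms, join("", @atom));
    }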

    $molecule contains the formula (e.g. H2O, FeCl3 or CaCl). The first
    letter of every element is written in upper case. As you can see, I first
    split $molecule on upper-case letters, which means FeCl3 turns into
    {F,e,C,l3}. Then I scan the split list, which is stored in the array
    @literals, for capital letters; every capital letter is pushed onto a
    temporary array. If the following item in the array is not written in
    upper case, which means that the name of the atom contains more than one
    letter, it is also pushed onto the same temporary array, which is later
    joined and put in the output array. The final result for the formula H2O
    should be {H2,O}, for FeCl3 {Fe,Cl3}, and so on.

    This works so far, but I'm far from satisfied with this solution. There
    must be better ways to solve it, with a smarter regexp and so on. But I'm
    not very familiar with regexps in Perl, so I can't think of a better
    solution.

    Does anyone have an idea how I can write this routine more elegantly?

    Thanks a lot
    Cheers
    Luotao Fu
     
    Luotao Fu, Feb 25, 2005
    #1

  2. Luotao Fu wrote:
    > Hi all,
    > This is my first post to this group, so sorry for any possible stupidity :)
    > I have been writing a Perl program for some days now. The program contains a
    > small routine which should parse a chemical formula and return the name and
    > proportion of each atom

    <snip>

    >
    > This works so far, but I'm far from satisfied with this solution. There
    > must be better ways to solve it, with a smarter regexp and so on. But I'm
    > not very familiar with regexps in Perl, so I can't think of a better
    > solution.


    You need to check out

    man perlre

    But you may want to check out Chemistry::FormulaPattern
    (search.cpan.org) if this is a real-life problem rather than just a
    programming exercise.

    Try this to get you started (it is pretty rough and I am sure I have
    missed edge cases):

    bob 881 $ cat testformula.pl
    #!/usr/local/bin/perl

    use strict;
    use warnings;

    use Data::Dumper;

    my $formula = shift;

    my @elements = ();

    # repeatedly strip the leftmost element symbol (plus optional count)
    while ($formula =~ s/([A-Z][a-z]?[0-9]*)//) {
        push @elements, $1;
    }
    print Dumper \@elements;

    bob 882 $ ./testformula.pl H2SO4
    $VAR1 = [
    'H2',
    'S',
    'O4'
    ];
     
    Mark Clements, Feb 25, 2005
    #2

  3. Luotao Fu

    GreenLeaf Guest

    Abigail wrote:

    > I wouldn't use split, just parse what you want to keep. What you want is
    > very simple: exactly one capital letter, followed by zero or more lower
    > case letters, followed by zero or more numbers. Written as a regex, this
    > is:


    to OP:

    If this is an exercise that should reflect the real-world scenario, you
    might want to use the rule that an element symbol is always exactly one
    capital letter followed by _zero or one lowercase letter_, with the
    exception of elements that start with Uu. I'm assuming here that yours
    is a program for learning, since you admitted to writing it 'since days'
    :). Taking these facts into account will make your regex more robust.
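
    For flat formulas without parentheses, that rule could be sketched roughly as follows. The helper name count_atoms is made up for illustration, and the Uu exception is deliberately left out of this sketch:

```perl
use strict;
use warnings;

# Hypothetical helper: count the atoms in a flat formula (no parentheses).
sub count_atoms {
    my ($formula) = @_;
    my %count;
    # one capital letter, at most one lowercase letter, then an optional number
    while ($formula =~ /([A-Z][a-z]?)(\d*)/g) {
        # a missing number means a single atom
        $count{$1} += (length($2) ? $2 : 1);
    }
    return %count;
}

my %atoms = count_atoms('FeCl3');
print "$_ => $atoms{$_}\n" for sort keys %atoms;
```

    For FeCl3 this prints "Cl => 3" and "Fe => 1". Characters that do not fit the pattern are silently skipped, so this assumes the input is already a valid formula.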

    You might also want to consider radicals (such as hydroxyl, -OH),
    because they are sure to lead to incorrect results if you just ignore
    parentheses: for instance Fe(OH)3. You can do this by first capturing
    each parenthesized group together with the number that follows it, then
    running the same simple rules that you used for the no-parentheses case
    on the token within each set of parentheses. Something along the lines of

    my @atoms = /((?:\(.+\)|Uu.|[A-Z][a-z]?)\d*)/g;

    would work here.
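
    To see which tokens the first step produces, here is a small sketch. The helper name tokenize is made up, and the sketch swaps the greedy .+ of the regex above for [^()]+ so that two separate groups in one formula are not swallowed by a single match; nested parentheses are still not handled:

```perl
use strict;
use warnings;

# Hypothetical helper: split a formula into top-level tokens.
# A token is either a parenthesized group or an element symbol,
# each followed by an optional number.
sub tokenize {
    my ($formula) = @_;
    return $formula =~ /((?:\([^()]+\)|Uu.|[A-Z][a-z]?)\d*)/g;
}

for my $formula ('Fe(OH)3', 'Fe2(SO4)3', 'H2O') {
    print "$formula: ", join(' | ', tokenize($formula)), "\n";
}
```

    Fe(OH)3 comes out as "Fe | (OH)3" and Fe2(SO4)3 as "Fe2 | (SO4)3"; each parenthesized token can then be unwrapped and run through the simple rules again.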

    Since Abigail's post clearly gave you almost everything you need to
    know, it would be quite straightforward to implement these simple
    changes. Good luck! :)

    Hope this helps,
    sat
     
    GreenLeaf, Feb 25, 2005
    #3
  4. Luotao Fu

    Luotao Fu Guest

    Hi,
    GreenLeaf wrote:
    > Abigail wrote:
    >
    >> I wouldn't use split, just parse what you want to keep. What you want is
    >> very simple: exactly one capital letter, followed by zero or more lower
    >> case letters, followed by zero or more numbers. Written as a regex, this
    >> is:

    >


    @Abigail:
    Neat idea! Now the famous question to myself: if this is so simple, why
    didn't I come up with it myself? ;-) It works like a charm, thanks a lot.

    > to OP:
    >
    > If this is an exercise that should reflect the real-world scenario, you
    > might want to use the rule that an element symbol is always exactly one
    > capital letter followed by _zero or one lowercase letter_, with the
    > exception of elements that start with Uu. I'm assuming here that yours
    > is a program for learning, since you admitted to writing it 'since days'
    > :). Taking these facts into account will make your regex more robust.


    ;-) Actually it's not an exercise; the Perl script should format database
    files for my C program, which deals with CT scanners. On the other hand,
    I'm indeed learning Perl through writing this. I could also have written
    it in C, but I chose Perl to refresh my memory of regexps.

    >
    > You might also want to consider radicals (such as hydroxyl, -OH),
    > because they are sure to lead to incorrect results if you just ignore
    > parentheses: for instance Fe(OH)3. You can do this by first capturing
    > each parenthesized group together with the number that follows it, then
    > running the same simple rules that you used for the no-parentheses case
    > on the token within each set of parentheses. Something along the lines of
    >
    > my @atoms = /((?:\(.+\)|Uu.|[A-Z][a-z]?)\d*)/g;
    >
    > would work here.
    >


    Thanks for the advice, I didn't think about this one. However, it might
    not be a serious problem for me. We have limited the input to substances
    containing only the first 100 elements of the periodic table. More
    importantly, I define the format rules of the input files. I'll note
    in the README that such formats are forbidden :).

    > Since Abigail's post clearly gave you almost everything you need to
    > know, it would be quite straightforward to implement these simple
    > changes. Good luck! :)
    >
    > Hope this helps,


    Thanks a lot
    > sat


    Cheers
    Luotao Fu
     
    Luotao Fu, Feb 25, 2005
    #4
  5. Luotao Fu

    Ted Zlatanov Guest

    On 25 Feb 2005, -hannover.de wrote:

    > I have been writing a Perl program for some days now. The program contains a
    > small routine which should parse a chemical formula and return the name and
    > proportion of each atom in the material as an array (or a hash)

    ....
    > $molecule contains the formula (e.g. H2O, FeCl3 or CaCl). The first
    > letter of every element is written in upper case. As you can see, I
    > first split $molecule on upper-case letters, which means FeCl3 turns
    > into {F,e,C,l3}. Then I scan the split list, which is stored in the
    > array @literals, for capital letters; every capital letter is pushed
    > onto a temporary array. If the following item in the array is not
    > written in upper case, which means that the name of the atom contains
    > more than one letter, it is also pushed onto the same temporary array,
    > which is later joined and put in the output array. The final result
    > for the formula H2O should be {H2,O}, for FeCl3 {Fe,Cl3}, and so on.


    I think you are not doing this correctly.

    You are not parsing random letters, you are parsing chemical
    elements' names in sequence. So don't just say "split on a letter."
    Build a dictionary of element names (it's a finite list, although you
    can anticipate new elements may need to be added at the end).
    Something like this:

    my %elements = ( H  => { number => 1, extra => 'data you need' },
                     He => { number => 2, ...},
                     ...
                   );

    Then, build your regular expression to match elements from your
    %elements hash.
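
    A rough sketch of that idea, assuming a small illustrative table (a real one would list every element) and a made-up helper name parse_formula. The symbols are sorted longest first so that "He" is tried before "H":

```perl
use strict;
use warnings;

# Illustrative subset only; extend with whatever data you need per element.
my %elements = (
    H  => { number => 1 },
    He => { number => 2 },
    O  => { number => 8 },
    Cl => { number => 17 },
    Fe => { number => 26 },
);

# Alternation of all known symbols, longest first.
my $symbols = join '|',
    sort { length($b) <=> length($a) || $a cmp $b } keys %elements;
my $atom_re = qr/($symbols)(\d*)/;

# Hypothetical helper: return a list of { symbol, number, count } records,
# or die if some part of the formula is not a known element.
sub parse_formula {
    my ($formula) = @_;
    my @atoms;
    pos($formula) = 0;
    # \G anchors each match where the previous one ended; /c keeps pos on failure
    while ($formula =~ /\G$atom_re/gc) {
        push @atoms, {
            symbol => $1,
            number => $elements{$1}{number},
            count  => (length($2) ? $2 : 1),
        };
    }
    die "cannot parse '$formula' at offset " . pos($formula) . "\n"
        if pos($formula) != length($formula);
    return @atoms;
}

print "$_->{symbol} (Z=$_->{number}) x $_->{count}\n" for parse_formula('FeCl3');
```

    Unlike a plain [A-Z][a-z]? pattern, this rejects symbols that are not in the table instead of passing them through.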

    This may be a LOT easier with the Parse::RecDescent module, which I
    think is the right tool for this task. It can parse formulas like the
    ones you describe, as long as you write a suitable grammar (you can
    write a rule that will match elements from the %elements hash). It
    will generate a suitable parse tree for you, which will be a lot more
    functional than your {H2,O} format. For more information and help,
    mail the recdescent list after you've read the
    Parse::RecDescent documentation :)

    Ted
     
    Ted Zlatanov, Feb 25, 2005
    #5
  6. Luotao Fu

    John Bokma Guest

    John Bokma, Feb 25, 2005
    #6
  7. Luotao Fu

    John Bokma Guest

    Ted Zlatanov wrote:

    > On 25 Feb 2005, -hannover.de wrote:
    >
    >> I have been writing a Perl program for some days now. The program contains a
    >> small routine which should parse a chemical formula and return the name and
    >> proportion of each atom in the material as an array (or a hash)

    > ...
    >> $molecule contains the formula (e.g. H2O, FeCl3 or CaCl). The first
    >> letter of every element is written in upper case. As you can see, I
    >> first split $molecule on upper-case letters, which means FeCl3 turns
    >> into {F,e,C,l3}. Then I scan the split list, which is stored in the
    >> array @literals, for capital letters; every capital letter is pushed
    >> onto a temporary array. If the following item in the array is not
    >> written in upper case, which means that the name of the atom contains
    >> more than one letter, it is also pushed onto the same temporary array,
    >> which is later joined and put in the output array. The final result
    >> for the formula H2O should be {H2,O}, for FeCl3 {Fe,Cl3}, and so on.

    >
    > I think you are not doing this correctly.
    >
    > You are not parsing random letters, you are parsing chemical
    > elements' names in sequence. So don't just say "split on a letter."
    > Build a dictionary of element names (it's a finite list, although you
    > can anticipate new elements may need to be added at the end).
    > Something like this:
    >
    > my %elements = ( H  => { number => 1, extra => 'data you need' },
    >                  He => { number => 2, ...},
    >                  ...
    >                );
    >
    > Then, build your regular expression to match elements from your
    > %elements hash.


    If you can assume that only valid formulae are given to the program,
    [A-Z][a-z]?\d* sounds sufficient to me.

    If you really want to check validity, you can capture the [A-Z][a-z]?
    part and look it up in a hash. Moreover, if some letters are not
    possible (for example x), you could remove them from the character class
    (making the program harder to read, I guess).
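
    The hash lookup could look roughly like this. The %known table is only an illustrative subset and the helper name unknown_symbols is made up:

```perl
use strict;
use warnings;

# Illustrative subset; assumed to be extended to all supported elements.
my %known = map { $_ => 1 } qw(H C N O Na S Cl Ca Fe);

# Hypothetical helper: return the symbols in $formula that are not in %known.
sub unknown_symbols {
    my ($formula) = @_;
    my @bad;
    # capture only the symbol part; the count is matched but not needed here
    while ($formula =~ /([A-Z][a-z]?)\d*/g) {
        push @bad, $1 unless $known{$1};
    }
    return @bad;
}

for my $formula ('H2SO4', 'FeXy3') {
    my @bad = unknown_symbols($formula);
    print "$formula: ", (@bad ? "unknown @bad" : "ok"), "\n";
}
```

    This keeps the regex itself simple and readable, and the list of valid symbols lives in one place.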

    > will generate a suitable parse tree for you, which will be a lot more
    > functional than your {H2,O} format.


    One lesson I learned the hard way: never make your program more
    functional than the requirements. I.e. if you need cat, don't write
    OpenOffice :-D.

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
     
    John Bokma, Feb 25, 2005
    #7
  8. Luotao Fu

    GreenLeaf Guest

    Luotao Fu wrote:
    > GreenLeaf wrote:
    >>You might also want to consider radicals (such as hydroxyl, -OH),
    >>because they are sure to lead to incorrect results if you just ignore
    >>parentheses: for instance Fe(OH)3.

    > Thanks for the advice, I didn't think about this one. However, it might
    > not be a serious problem for me. We have limited the input to substances
    > containing only the first 100 elements of the periodic table. More
    > importantly, I define the format rules of the input files.


    Last time I checked, Fe, O and H were all below 100. :o)

    However, since this is a real program as you said, it _may be_ better to
    handle the parentheses, because if you do not, somebody else will have
    to reformat Fe(OH)3 as FeO3H3 - or you will be limiting the usefulness of
    your program. Be nice and do them a favor, since it does not need _too
    much_ additional work on your side. A couple more lines make it
    able to handle stuff like Fe2(SO4)3 - as you can see in the sub
    processToken() ;)

    I agree with John's idea though: _no need to bother_ if you will _never_
    get such formulae in the first place, and KISS.


    use strict;
    use warnings;

    while (<DATA>){
        my @atoms = /((?:\(.+\)|[A-Z][a-z]?)\d*)/g;
        my %total; # total count of each element
        foreach (@atoms) {
            my %stuff = processToken($_);
            while (my ($element, $count) = each %stuff){
                $total{$element} += $count;
            }
        }
        # here, you have all elements with their respective counts.
        while (my ($element, $count) = each %total){
            print "$element$count\n";
        }
    }

    sub processToken {
        my $token = shift;
        if ($token =~ /\(/){ # we have groups
            my ($elempart, $numpart) = $token =~ m/\((\w+)\)(\d*)/;
            my %grpcounts = processToken($elempart);
            $grpcounts{$_} *= ($numpart ? $numpart : 1)
                foreach (keys %grpcounts);
            return %grpcounts;
        } else {
            my @atoms = split /(?=[A-Z][a-z]?[0-9]*)/ => $token;
            my %atomcounts;
            foreach (@atoms){
                my ($element, $count) = /([A-Za-z]+)(\d*)/;
                $atomcounts{$element} += $count ? $count : 1;
            }
            return %atomcounts;
        }
    }

    __DATA__
    H2O
    FeCl3
    NaOH
    Fe(OH)3
    Fe2(SO4)3
     
    GreenLeaf, Feb 26, 2005
    #8