matching against a zillion patterns

Discussion in 'Ruby' started by George George, Oct 15, 2009.

  1. i have some script in which i would like to match a string against
    'many' regular expressions patterns.

    def group(string)
    if string=~ /pattern1 |patter2|pattern3|pattern(N)/ #where N =>100
    group =1
    else
    ....
    end
    end

    My worry is the amount of patterns that i have (exceeding 400) and the
    efficiency and sanity of such an approach.What would you advice?


    Thank you.
    --
    Posted via http://www.ruby-forum.com/.
     
    George George, Oct 15, 2009
    #1
    1. Advertising

  2. George George

    steve Guest

    George George wrote:
    > i have some script in which i would like to match a string against
    > 'many' regular expressions patterns.
    >
    > def group(string)
    > if string=~ /pattern1 |patter2|pattern3|pattern(N)/ #where N =>100
    > group =1
    > else
    > ....
    > end
    > end
    >
    > My worry is the amount of patterns that i have (exceeding 400) and the
    > efficiency and sanity of such an approach.What would you advice?
    >
    >
    > Thank you.


    Are the matches randomly distributed across the patterns? If not,
    perhaps you could arrange for the most common patterns in the first
    search, the next most common in a subsequent search and so on.

    Also recall that compiled regexps can be assigned to variables or
    constants, including arrays.

    Regards

    Steve.
     
    steve, Oct 15, 2009
    #2
    1. Advertising

  3. 2009/10/15 steve <>:
    > George George wrote:
    >>
    >> i have some script in which i would like to match a string against
    >> 'many' regular expressions patterns.
    >>
    >> def group(string)
    >> =A0if string=3D~ /pattern1 |patter2|pattern3|pattern(N)/ =A0#where N =3D=

    >100
    >> =A0 group =3D1
    >> else
    >> =A0....
    >> =A0end
    >> end
    >>
    >> My worry is the amount of patterns that i have (exceeding 400) and the
    >> efficiency and sanity of such an approach.What would you advice?
    >>
    >>
    >> Thank you.

    >
    > Are the matches randomly distributed across the patterns? =A0If not, perh=

    aps
    > you could arrange for the most common patterns in the first search, the n=

    ext
    > most common in a subsequent search and so on.
    >
    > Also recall that compiled regexps can be assigned to variables or constan=

    ts,
    > including arrays.


    Yep, this could be used to do something like

    PATTERNS =3D {
    1 =3D> [/p1|p2|p3/, /p4|p5|p6/],
    2 =3D> [/p7|p8/],
    }

    def group(s)
    PATTERNS.each do |gr, pats|
    return gr if pats.any? {|rx| rx =3D~ s} # short circuit!
    end
    nil # or exception
    end

    Obviously you want the most frequent patterns in the beginning of
    arrays and the least frequent at the end.

    I'm not sure though about the sanity aspect of this. George, can you
    disclose more details about the nature of your matching? Maybe
    there's a better and / or more efficient solution.

    Kind regards

    robert

    --=20
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Oct 15, 2009
    #3
  4. On Thu, Oct 15, 2009 at 8:24 AM, George George
    <> wrote:
    > i have some script in which i would like to match a string against
    > 'many' regular expressions patterns.
    >
    > def group(string)
    > =A0if string=3D~ /pattern1 |patter2|pattern3|pattern(N)/ =A0#where N =3D>=

    100
    > =A0 group =3D1
    > else
    > =A0....
    > =A0end
    > end
    >
    > My worry is the amount of patterns that i have (exceeding 400) and the
    > efficiency and sanity of such an approach.What would you advice?


    Your mega-pattern will be quite slow if many strings doesn't match any
    pattern, and even slower if many strings matches some patterns
    partially, since the regexp engine would end up backtracking a lot.

    Are you sure you can't construct a more general pattern and test for
    values of the match data after? I find it hard to imagine 400 useful
    patterns without any similar structure. For example,

    GROUPS =3D {
    "somegroup" =3D> 1,
    "othergroup" =3D> 2
    }
    if string =3D~ /^(\w+)\s+(\d+)$/
    group =3D GROUPS[$1.downcase]
    end

    Another strategy is divide and conquer. See if you can group your
    patterns into groups that are similar and construct a more general
    regexp which use can use as a initial filter to determine which of the
    actual patterns you need to test against. E.g.

    if string =3D~ /superpattern1/
    if string =3D~ /pattern1|pattern2|pattern3/
    group =3D 1
    end
    end
    if string =3D~ /superpattern2/
    if string =3D~ /pattern4|pattern5|pattern6/
    ...
     
    Lars Christensen, Oct 15, 2009
    #4
  5. George George wrote:
    > My worry is the amount of patterns that i have (exceeding 400) and the
    > efficiency and sanity of such an approach.What would you advice?


    I would advise you measure it, with your real regexp and your real data.

    I would expect that the regexp engine (written in C) would be much
    faster than something you could cobble together yourself in Ruby. Even
    400 times "match C1, skip match C1, match C2, skip, match C1, ..." in C
    will be very fast.

    Ideally, see if you can find a true regular expression engine that
    compiles to a DFA (deterministic finite state automaton). This will
    never backtrack, looking at each incoming character no more than once.
    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Oct 15, 2009
    #5
  6. On Oct 14, 2009, at 11:24 PM, George George wrote:

    > i have some script in which i would like to match a string against
    > 'many' regular expressions patterns.
    >
    > def group(string)
    > if string=~ /pattern1 |patter2|pattern3|pattern(N)/ #where N =>100
    > group =1
    > else
    > ....
    > end
    > end
    >
    > My worry is the amount of patterns that i have (exceeding 400) and the
    > efficiency and sanity of such an approach.What would you advice?


    Regarding efficiency and sanity, I'd have to echo the previous
    responders that it depends a lot on what exactly you're trying to do.
    The one application that comes to mind where you'd have a large number
    of patterns to match is a parser. If you can redefine your problem so
    that it looks more like a parser, then that would open a whole world
    of potential solutions.

    Just for fun, taking the simple case where your patterns are actually
    simple strings, one strategy would be to construct a tree from the
    patterns. I'll "borrow" the dataset from stocknames.info:

    > cat ./check_guest_list

    #!/usr/bin/env ruby1.9

    require "hpricot"
    require "open-uri"

    stock_names = open("http://stocknames.info") { |page| Hpricot(page) }
    GUEST_LIST = stock_names.search("//li").map(&:innerHTML)

    tree = {}
    GUEST_LIST.each do |name|
    cur_leaf = tree
    name.each_char do |char|
    cur_leaf = cur_leaf[char] ||= {}
    end
    cur_leaf[:greeting] = "puts 'Hello, #{name}, welcome to the party!'"
    end
    print "Name please: "
    guest = $stdin.gets.chomp

    cur_leaf = tree
    guest.each_char do |char|
    cur_leaf = cur_leaf[char]
    break unless cur_leaf
    end

    if cur_leaf
    eval cur_leaf[:greeting]
    else
    puts "Sorry, you're not on the list."
    end

    > ./check_guest_list

    Name please: Josh Ballanco
    Sorry, you're not on the list.

    > ./check_guest_list

    Name please: Ed Vanders
    Hello, Ed Vanders, welcome to the party!


    Hope that helps some!

    Cheers,

    Josh
     
    Joshua Ballanco, Oct 15, 2009
    #6
  7. Take a look at the ragel language and parser generator.

    You can generate your parser as a ruby c plugin.
    Ragel compiles to an DFA, deterministic finite automata, it does not
    look on a character twice.

    I got great success on a similar problem...

    http://www.complang.org/ragel/



    On Thu, Oct 15, 2009 at 06:20:24PM +0900, Lars Christensen wrote:
    > On Thu, Oct 15, 2009 at 8:24 AM, George George
    > <> wrote:
    > > i have some script in which i would like to match a string against
    > > 'many' regular expressions patterns.
    > >
    > > def group(string)
    > >  if string=~ /pattern1 |patter2|pattern3|pattern(N)/  #where N =>100
    > >   group =1
    > > else
    > >  ....
    > >  end
    > > end
    > >
    > > My worry is the amount of patterns that i have (exceeding 400) and the
    > > efficiency and sanity of such an approach.What would you advice?

    >
    > Your mega-pattern will be quite slow if many strings doesn't match any
    > pattern, and even slower if many strings matches some patterns
    > partially, since the regexp engine would end up backtracking a lot.
    >
    > Are you sure you can't construct a more general pattern and test for
    > values of the match data after? I find it hard to imagine 400 useful
    > patterns without any similar structure. For example,
    >
    > GROUPS = {
    > "somegroup" => 1,
    > "othergroup" => 2
    > }
    > if string =~ /^(\w+)\s+(\d+)$/
    > group = GROUPS[$1.downcase]
    > end
    >
    > Another strategy is divide and conquer. See if you can group your
    > patterns into groups that are similar and construct a more general
    > regexp which use can use as a initial filter to determine which of the
    > actual patterns you need to test against. E.g.
    >
    > if string =~ /superpattern1/
    > if string =~ /pattern1|pattern2|pattern3/
    > group = 1
    > end
    > end
    > if string =~ /superpattern2/
    > if string =~ /pattern4|pattern5|pattern6/
     
    Markus Schirp, Oct 15, 2009
    #7
  8. Hi,

    Am Donnerstag, 15. Okt 2009, 15:24:35 +0900 schrieb George George:
    > i have some script in which i would like to match a string against
    > 'many' regular expressions patterns.
    >
    > def group(string)
    > if string=~ /pattern1 |patter2|pattern3|pattern(N)/ #where N =>100
    > group =1
    > else
    > ....
    > end
    > end


    How about this:

    p = [ /a/, /b/, /c/]
    p.find { |q| "x" =~ q } #=> nil
    p.find { |q| "a" =~ q } #=> /a/
    p.find { |q| "c" =~ q } #=> /c/

    Bertram


    --
    Bertram Scharpf
    Stuttgart, Deutschland/Germany
    http://www.bertram-scharpf.de
     
    Bertram Scharpf, Oct 15, 2009
    #8

  9. > I'm not sure though about the sanity aspect of this. George, can you
    > disclose more details about the nature of your matching? Maybe
    > there's a better and / or more efficient solution.
    >
    > Kind regards
    >
    > robert


    Thank you so much for all the replies and suggestions!

    Robert, let me disclose some more information about the nature of the
    problem. Actually I have Protein sequences (strings of variable length
    composed of a 20 letter alphabet) for example "CAARGNDLYSKNIG" can be
    considered as a protein sequence. basically it's just a string.

    In my case each of the strings that i have may fall into 2 distinct
    classes or groups depending on whether they match a collection of about
    400 distict patterns in the first instance or another 200 or so patterns
    in the second instance. I do not have information on which is the most
    common pattern.

    To know which of the two groups that each of my sequences fall into, i
    resulted in to writing the code that i had presented ealier but realised
    that it may be inferior, inefficient or buggy and would like to improve
    it.

    I hope this helps to define the problem.

    In a nutshell; Given a set of strings A, each of which may belong to a
    group and where a group is characterised by a huge set of patterns(to be
    matched against), create a method that classifies each of the strings
    depending on which pattern the string contains.


    Thank you.


    --
    Posted via http://www.ruby-forum.com/.
     
    George George, Oct 15, 2009
    #9
  10. 2009/10/15 George George <>:
    >
    >> I'm not sure though about the sanity aspect of this. =A0George, can you
    >> disclose more details about the nature of your matching? =A0Maybe
    >> there's a better and / or more efficient solution.
    >>
    >> Kind regards
    >>
    >> robert

    >
    > Thank you so much for all the replies and suggestions!
    >
    > =A0Robert, let me disclose some more information about the nature of the
    > problem. Actually I have Protein =A0sequences (strings of variable length
    > composed of a 20 letter alphabet) for example "CAARGNDLYSKNIG" can be
    > considered as a protein sequence. basically it's just a string.
    >
    > In my case each of the strings that i have may fall into 2 distinct
    > classes or groups depending on whether they match a collection of =A0abou=

    t
    > 400 distict patterns in the first instance or another 200 or so patterns
    > in the second instance. I do not have information on which is the most
    > common pattern.
    >
    > =A0To know which of the two groups that each of my sequences fall into, i
    > resulted in to writing the code that i had presented ealier but realised
    > that it may be inferior, inefficient or buggy and would like to improve
    > it.
    >
    > I =A0hope this helps to define the problem.
    >
    > In a nutshell; Given a set of strings A, each of which may belong to =A0a
    > group and where a group is characterised by a huge set of patterns(to be
    > matched against), create a method that classifies each of the strings
    > depending on which pattern the string contains.


    Looks like this is tricky to become right. There seem to be some
    people around that do bioinformatics, maybe some of them do have a
    solution already.

    Things you could do off the top of my head: since it seems your
    patterns are only strings you could optimize the pattern by building a
    trie from it and then creating a RX from that. I've done this before
    (see below). Advantage is that your regular expression gets smaller
    and matching becomes more efficient (because of eliminated
    backtracking for NDA based RX engines).

    http://en.wikipedia.org/wiki/Trie

    There might be other options using one of the string matching
    algorithms around (boyer moore and the like).

    Kind regards

    robert


    def treeify(strings)
    root =3D TextTreeNode.new
    strings.each {|s| root.put s}
    root.to_s
    end

    TextTreeNode =3D Struct.new :children, :char, :term do
    def initialize(chr =3D nil)
    self.char =3D chr
    self.children =3D Hash.new {|h,k| h[k] =3D self.class.new(k)}
    end

    def put(s)
    node =3D self
    s.each_byte do |char|
    node =3D node.children[char]
    end
    node.term =3D true
    self
    end

    def to_s
    dump("")
    end

    def dump(out)
    out << Regexp.escape(char.chr) if char
    ch =3D children.values

    unless ch.empty?
    first =3D ch.shift
    bracket =3D term || !ch.empty?

    out << "(?:" if bracket
    first.dump(out)

    ch.each do |v|
    out << "|"
    v.dump(out)
    end

    out << ")" if bracket
    out << "?" if term
    end

    out
    end

    end

    puts treeify %w{foo fort bar baz battery}



    --=20
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Oct 15, 2009
    #10
  11. George George

    Mark Thomas Guest


    >  Robert, let me disclose some more information about the nature of the
    > problem. Actually I have Protein  sequences (strings of variable length
    > composed of a 20 letter alphabet) for example "CAARGNDLYSKNIG" can be
    > considered as a protein sequence. basically it's just a string.


    Have you looked at the BioRuby project? They may have classes that can
    help.
    http://bioruby.org/
     
    Mark Thomas, Oct 16, 2009
    #11
  12. Robert Klemme wrote:
    > 2009/10/15 George George <>:
    >>
    >>
    >> depending on which pattern the string contains.

    > Looks like this is tricky to become right. There seem to be some
    > people around that do bioinformatics, maybe some of them do have a
    > solution already.
    >
    > Things you could do off the top of my head: since it seems your
    > patterns are only strings you could optimize the pattern by building a
    > trie from it and then creating a RX from that. I've done this before
    > (see below). Advantage is that your regular expression gets smaller
    > and matching becomes more efficient (because of eliminated
    > backtracking for NDA based RX engines).


    Thank you so much Robert! I have tried your approach and compressed the
    patterns to a single regular expression. (removed backtracking). Seems
    like a parser solution approach may also help. e.g. the one that Josh
    had illustrated

    Thomas, I have looked at bioruby and I regulary use it and relatively
    know it quite well, but we dont have a solution that i am aware of. I
    might repost this on the bioruby list, together with a reference to this
    thread.
    Thank you so much.

    --
    Posted via http://www.ruby-forum.com/.
     
    George George, Oct 16, 2009
    #12

  13. > Robert, let me disclose some more information about the nature of the
    > problem. Actually I have Protein sequences (strings of variable length
    > composed of a 20 letter alphabet) for example "CAARGNDLYSKNIG" can be
    > considered as a protein sequence. basically it's just a string.


    I happen to be a molecular biologist myself, so this of course gets my
    attention ;)

    I assume there is a particular reason why you are looking for sequence
    features using regexp? Because it seems somewhat inefficient - which of
    course got you here in the first place. There is no way that you can use
    distance measures of sequences to cluster them, for example (like Blast
    + MCL)? If you are looking to group your seqs, that is. Also, databases
    like ProDom do basically just that - looking for particular sequence
    features in protein sequences. Sure, they focus on domains, but
    depending of the nature of your regexps, their tools may be applicable
    regardless. Then there is the new CS-BLAST, which uses scoring matrices
    - which may perhaps be derivable from your regexps, dunno. Well I
    suppose I could offer more help if the idea behind this
    fishing-experiment was a little clearer (i.e. what is it that you want
    to find out).

    In any case, I suppose depending on your search space, ruby may really
    not be the ideal approach. Hope it works out in the end!
    --
    Posted via http://www.ruby-forum.com/.
     
    Marc Hoeppner, Oct 16, 2009
    #13
  14. Marc Hoeppner wrote:

    > I assume there is a particular reason why you are looking for sequence
    > features using regexp? Because it seems somewhat inefficient - which of
    > course got you here in the first place. There is no way that you can use
    > distance measures of sequences to cluster them, for example (like Blast
    > + MCL)?


    First i cannot use any of the local alignment programs to do the
    clustering since my sequences are known to undergo lots of
    recombination, secondly i question the idea of BLAST BLOSUM matrices
    that have been derived from unbiased genomes. Am working with a pathogen
    that is famous for its high A/T content in its genome.

    Then why not use adjusted matrices? Yes we can, but then The high
    recombinations in the sequences renders that approach unusable. MCL is
    good for grouping proteins but also depends on pairwise local alignments
    and may be applicable for looking at unsupervised clustering or
    grouping. Secondly there is the issue of a relevant BLAST e-cut off as
    well as Inflation parameters to use with MCL.(Actually I have tried that
    approach already with diff' I values) and we are analysing the results.

    >Also, databases
    > like ProDom do basically just that - looking for particular sequence
    > features in protein sequences. Sure, they focus on domains, but
    > depending of the nature of your regexps, their tools may be applicable
    > regardless.


    Their tools may be applicable but we already know the motifs that we are
    searching for and they are not in ProDom or swissprot domains.

    >Then there is the new CS-BLAST, which uses scoring matrices
    > - which may perhaps be derivable from your regexps, dunno.


    Again BLAST and any alignment method for highly recombining sequences is
    not of much use. Instead we are using alignment free approaches to infer
    relationships in the sequences.

    The exact patterns that we have, are known(from our experimental
    evidence) to be associated with pathogenicity of a particular type and
    that is why am looking for exact matches. Else i could just have aligned
    the patterns and generated a HMM profile which i can again use to search
    in a generic way for the groups. Actually that is the next step. But
    first, I have opted to go for simple approaches first.

    >ruby may really not be the ideal approach

    True Ruby may not be suited for some jobs and applications but again, so
    is C++, java,php etc. The most important thing i think is the approach
    and not the choice of programming language. Secondly i don't like the
    idea of black boxes that i don't understand. :)

    Please feel free to contact me at georgkam address hosted at google's
    email domain. We don't want to hijack this excellent mailing list that
    is meant for Ruby with molecular biology discussions. I have re-posted
    this thread to bioruby and we can discuss it there.
    For now Robert's suggestion seems to have generated something that i
    wanted, a non redundant motif that is free from backtracking.

    Thank you.
    --
    Posted via http://www.ruby-forum.com/.
     
    George George, Oct 16, 2009
    #14
  15. 2009/10/16 George George <>:
    > Robert Klemme wrote:
    >> 2009/10/15 George George <>:
    >>>
    >>> depending on which pattern the string contains.

    >> Looks like this is tricky to become right. =A0There seem to be some
    >> people around that do bioinformatics, maybe some of them do have a
    >> solution already.
    >>
    >> Things you could do off the top of my head: since it seems your
    >> patterns are only strings you could optimize the pattern by building a
    >> trie from it and then creating a RX from that. =A0I've done this before
    >> (see below). =A0Advantage is that your regular expression gets smaller
    >> and matching becomes more efficient (because of eliminated
    >> backtracking for NDA based RX engines).

    >
    > Thank you so much Robert! I have tried your approach and compressed the
    > patterns to a single regular expression. (removed backtracking).


    And what did it do to performance? How many patterns do you need now
    or were you able to squeeze everything into one?

    Kind regards

    robert

    --=20
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Oct 16, 2009
    #15
  16. Jan Arts approach to the problem. forwarded from the bioruby list;
    Hey George,

    So if I understand correctly you've got a huge number of aminoacid
    sequences (how many?) and about 400 regular expressions. And for each of
    the aminoacid sequences: if they match just one of the regular
    expressions they are put in box A and if they match none of the regexps,
    they go into box B. Correct?

    It just happens that something very similar was the subject of Jim
    Tisdall's (from Beginning Perl for Bioinformatics fame) talk at the
    bioinformatics course we're teaching at the moment :)

    First thing: avoid loops. You don't want to take loop over all regexps
    for each AA sequences, or the other way around.

    Are all regexps of the same length? Would be nice if they are, but not
    critical. My approach would be to go over the data just once. So suppose
    the regexps all are of the same length.

    A. Prepare your data:
    a. "Decode" the regexps into literal strings: e.g. /A[BC]D/ become
    "ABD" and "ACD".
    b. Create a hash that contains all those things as keys.
    c. Concatenate all AA sequences together, joined with a non-AA, let's
    say a semicolon ";". E.g. CAARGNDLYSKNIG;GGARGNDLYSKNIG;KKARGNDLYSKNIG

    B. Do the actual search
    a. If the length of the strings to match (what used to be the regexps,
    and are now the keys in the hash) is 5: take the first 5 characters of
    your concatenated AA string and check if that substring exists as a key
    in the hash. If so: you know that the AA sequence between the
    surrounding ";" characters should go in box A.
    b. Advance 1 position: take AAs 2 to 6.
    c. Go back to a.

    You might have to tweak this approach to exactly fit your requirements,
    but if your code used to take a very long time, this might speed things
    up immensely.

    (George: can you forward this to the ruby mailing list it was discussed
    on initially? Cheers)

    Good luck,
    jan.
    --
    Posted via http://www.ruby-forum.com/.
     
    George George, Oct 16, 2009
    #16
  17. On 10/16/2009 08:23 AM, George George wrote:

    > Please feel free to contact me at georgkam address hosted at google's
    > email domain. We don't want to hijack this excellent mailing list that
    > is meant for Ruby with molecular biology discussions. I have re-posted
    > this thread to bioruby and we can discuss it there.


    I don't mind having the discussion on this list - the topic is really
    interesting and provides some insights into classes of problems we are
    rarely confronted with.

    > For now Robert's suggestion seems to have generated something that i
    > wanted, a non redundant motif that is free from backtracking.


    Sounds good! I'm glad the approach proves useful.

    Cheers

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Oct 16, 2009
    #17
  18. * Lars Christensen <> (2009-10-15) schrieb:

    > Your mega-pattern will be quite slow if many strings doesn't match any
    > pattern, and even slower if many strings matches some patterns
    > partially, since the regexp engine would end up backtracking a lot.


    Why would it do try to backtrack here?

    mfg, simon .... l
     
    Simon Krahnke, Oct 18, 2009
    #18
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. thorsten
    Replies:
    1
    Views:
    483
  2. crichmon
    Replies:
    4
    Views:
    504
    Mabden
    Jul 7, 2004
  3. TimR
    Replies:
    2
    Views:
    446
    sticky
    Jan 10, 2007
  4. tomasz
    Replies:
    7
    Views:
    355
    Jonathan Gardner
    Dec 18, 2007
  5. Martin

    matching patterns after regex?

    Martin, Aug 12, 2009, in forum: Python
    Replies:
    8
    Views:
    390
    Martin
    Aug 13, 2009
Loading...

Share This Page