Speed issues iterating over chars

Discussion in 'Ruby' started by Martin Hansen, Aug 30, 2010.

  1. Hello all,


    This snip of code needs more speed. Ideas? What is the state of Ruby and
    Inline C?


    Cheers,

    Martin



    class Hash
    BASE_SOLEXA = 64

    # Soft masks sequence residues where the corresponding quality score
    # is below a given cutoff.
    def mask!(cutoff)
    if self.has_key? :SEQ and self.has_key? :SCORES
    seq = self[:SEQ]
    scores = self[:SCORES]
    i = 0

    scores.each_char do |score|
    seq = seq.downcase if score.ord - BASE_SOLEXA <= cutoff
    i += 1
    end

    self[:SEQ] = seq
    end

    self
    end
    end
     
    Martin Hansen, Aug 30, 2010
    #1
    1. Advertisements



  2. What about using String#gsub!?

    # untested
    seq.gsub! /(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
    ch}.join("|")})+/ do |m|
    m.downcase!
    end

    You could speed up calculation of the regexp by placing this in the
    default block of a Hash, e.g.

    RXS =3D Hash.new {|h,cutoff| h[cutoff] =3D
    /(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
    ch}.join("|")})+/

    and then

    seq.gsub! RXS[cutoff] do |m|
    m.downcase!
    end

    Kind regards

    robert

    --=20
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Aug 30, 2010
    #2
    1. Advertisements

  3. You could speed up calculation of the regexp by placing this in the

    This could be a good idea, but I am not entirely sure if it will work -
    at least the current suggestion gives an empty RXS hash. And I am unsure
    about what this do? map{|ch|"\\u00%02x" % ch}

    Thinking about this problem I wonder if it would be an idea to create a
    mask with transliterate and then use a bitwise operation to downcase.
    IIRC you change the case with a bitwise | operation with ' '. It doesn't
    look like Ruby have any methods bitwise operations on strings.


    Cheers,


    Martin
     
    Martin Hansen, Aug 30, 2010
    #3
  4. Just try it out in IRB. This creates a regexp with unicode names.
    You could however use Fixnum#chr instead, e.g.

    #{ (0..(BASE_SOLEXA + cutoff)).map {|ch| Regexp.escape(ch.chr)}.join("|") }

    This gives shorter and more readable strings.
    AFAIK there are none. But you can do

    irb(main):011:0> s =3D "aaa"
    =3D> "aaa"
    irb(main):012:0> s[0] =3D (s[0].ord | 0x0F).chr
    =3D> "o"
    irb(main):013:0> s
    =3D> "oaa"

    Kind regards

    robert

    --=20
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Aug 30, 2010
    #4
  5. OK, so for the right range that is equal to:

    regex = Regexp.union((-5 .. cutoff).collect { |n| (n + BASE_SOLEXA).chr
    } )

    Which returns:

    (?-mix:;|<|=|>|\?|@|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T)

    But I dont get how the substitution will scan one string (scores) an
    make changes to the other (seq):

    seq = 'TTGGTCGCTCGCTCCGCGACCTCAGATCAGACGTGGGCGAT'
    scores = '@ABCDEFGHIJK?MNOPQRSTUVWhgfedcba`_^]\[ZYX'

    ?

    That is actually strange. In Perl you can do this:

    perl -le 'print "ABCDEFG" | " \0\0 "'
    => abCDefg

    Which I believe is very efficient. And the mask with " " and "\0" can be
    constructed with transliterate efficiently too.

    Cheers,


    Martin
     
    Martin Hansen, Aug 30, 2010
    #5
  6. It doesn't. Somehow I must have missed an important detail... :)

    You could do this though

    seq.gsub! /./ do |m|
    scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
    end

    Not too nice though. You could however do some preparation, e.g.
    store scores as an Array of Fixnum instead of using #ord.
    Kind regards

    robert


    --=20
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Aug 30, 2010
    #6
  7. You could do this though
    Yes, converting scores to arrays is bad since the scores are parsed from
    files as strings (millions of them). And I am unsure if substr
    substitutions are very efficient ...

    We need to trick this into the regex engine somehow.

    How about transforming scores to a mask string like this: 000111 where 1
    indicates that the corresponding sequence char should be lowercased
    (that can be done with tr). Then we plug this onto the sequence string:

    seq = "ATCGAT000111"

    And then we construct a regex with a forward looking identifier that
    reads the mask and manipulates the ATCG chars?


    Martin
     
    Martin Hansen, Aug 30, 2010
    #7
  8. Here's what I'd probably do.

    Create a custom class (and not use a Hash) for this, e.g.

    Score =3D Struct.new :seq, :score

    Create another structure for caching scores and a bit representation
    for downcase dependent on cutoff valiue, e.g.

    ScoreCache =3D Struct.new :score do
    def mask(cutoff)
    cache[cutoff]
    end

    def mask_sequence(cutoff, seq)
    mask(cutoff).each_bit do |idx|
    seq[idx] =3D seq[idx].downcase!
    end

    seq
    end

    private
    def cache
    @cache ||=3D Hash.new do |h,cutoff|
    c =3D 0

    self.score.each_with_index |ch,idx|
    c |=3D (1 << idx) if ch.ord - BASE_SOLEXA < cutoff
    end

    h[cutoff] =3D c
    end
    end
    end

    # Store score string -> ScoreCache
    global_score_cache =3D Hash.new do |h,score|
    h[score] =3D ScoreCache.new score
    end

    class Integer
    def each_bit
    raise "Currently only positive implemented" if self < 0

    if block_given?
    idx =3D 0
    x =3D self

    while x !=3D 0
    yield idx if x[0] =3D=3D 1
    idx +=3D 1
    x >>=3D 1
    end

    self
    else
    Enumerator.new self, :each_bit
    end
    end
    end


    And then use it and profile.

    Kind regards

    robert

    --=20
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Sep 1, 2010
    #8
  9. Thanks Robert,

    The gsub solution seems to be reasonably efficient.
    But my original proposed naive loop is twice as fast:


    I dont really know how gsub and tr compares to the Perl equivalents
    speed wise - in Perl tr is precompiling a lookup table that is evil fast
    and the regex engine is also primed at compile time and runs extremely
    fast.

    I suspect that you need some C extension to go faster than this, but I
    don't really want to spend the time on that. I was exploring Inline C
    but that appears very fragile - I cannot even get the example from the
    cookbook up and running under Ruby 191/192.

    Looking at your last proposal I see three iterations where two are
    running narrow if loops. I have not testet it, but it looks suspicious.

    It is a shame that Ruby does not have build in bitwise operators for the
    String class allowing C speed masking. I have tried to get on the core
    mailing list to post this as a feature suggestion, but the ML form is
    stuffed.


    Cheers,


    Martin
     
    Martin Hansen, Sep 1, 2010
    #9


  10. Regexp is precompiled but I suspect that tr works at runtime only.
    For definitive answer you'll have to look at the source.
    Well, but the caching should avoid that too many loops are executed.
    I do not know however, how often you reuse values. If you need this
    in several processes you could save the current state in a file via
    Marshal which is quite fast.

    Kind regards

    robert

    --=20
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Sep 1, 2010
    #10
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.