Speed issues iterating over chars

M

Martin Hansen

Hello all,


This snip of code needs more speed. Ideas? What is the state of Ruby and
Inline C?


Cheers,

Martin



class Hash
BASE_SOLEXA = 64

# Soft masks sequence residues where the corresponding quality score
# is below a given cutoff.
def mask!(cutoff)
if self.has_key? :SEQ and self.has_key? :SCORES
seq = self[:SEQ]
scores = self[:SCORES]
i = 0

scores.each_char do |score|
seq = seq.downcase if score.ord - BASE_SOLEXA <= cutoff
i += 1
end

self[:SEQ] = seq
end

self
end
end
 
R

Robert Klemme

2010/8/30 Martin Hansen said:
Hello all,


This snip of code needs more speed. Ideas? What is the state of Ruby and
Inline C?


Cheers,

Martin



class Hash
=A0BASE_SOLEXA =3D 64

=A0# Soft masks sequence residues where the corresponding quality score
=A0# is below a given cutoff.
=A0def mask!(cutoff)
=A0 =A0if self.has_key? :SEQ and self.has_key? :SCORES
=A0 =A0 =A0seq =A0 =A0=3D self[:SEQ]
=A0 =A0 =A0scores =3D self[:SCORES]
=A0 =A0 =A0i =A0 =A0 =A0=3D 0

=A0 =A0 =A0scores.each_char do |score|
=A0 =A0 =A0 =A0seq =3D seq.downcase if score.ord - BASE_SOLEXA <=3D= cutoff
=A0 =A0 =A0 =A0i +=3D 1
=A0 =A0 =A0end

=A0 =A0 =A0self[:SEQ] =3D seq
=A0 =A0end

=A0 =A0self
=A0end
end


What about using String#gsub!?

# untested
seq.gsub! /(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
ch}.join("|")})+/ do |m|
m.downcase!
end

You could speed up calculation of the regexp by placing this in the
default block of a Hash, e.g.

RXS =3D Hash.new {|h,cutoff| h[cutoff] =3D
/(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
ch}.join("|")})+/

and then

seq.gsub! RXS[cutoff] do |m|
m.downcase!
end

Kind regards

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
M

Martin Hansen

You could speed up calculation of the regexp by placing this in the
default block of a Hash, e.g.

RXS = Hash.new {|h,cutoff| h[cutoff] =
/(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
ch}.join("|")})+/

and then

seq.gsub! RXS[cutoff] do |m|
m.downcase!
end


This could be a good idea, but I am not entirely sure if it will work -
at least the current suggestion gives an empty RXS hash. And I am unsure
about what this do? map{|ch|"\\u00%02x" % ch}

Thinking about this problem I wonder if it would be an idea to create a
mask with transliterate and then use a bitwise operation to downcase.
IIRC you change the case with a bitwise | operation with ' '. It doesn't
look like Ruby have any methods bitwise operations on strings.


Cheers,


Martin
 
R

Robert Klemme

2010/8/30 Martin Hansen said:
You could speed up calculation of the regexp by placing this in the
default block of a Hash, e.g.

RXS =3D Hash.new {|h,cutoff| h[cutoff] =3D
/(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
ch}.join("|")})+/

and then

seq.gsub! RXS[cutoff] do |m|
=A0 m.downcase!
end


This could be a good idea, but I am not entirely sure if it will work -
at least the current suggestion gives an empty RXS hash. And I am unsure
about what this do? map{|ch|"\\u00%02x" % ch}

Just try it out in IRB. This creates a regexp with unicode names.
You could however use Fixnum#chr instead, e.g.

#{ (0..(BASE_SOLEXA + cutoff)).map {|ch| Regexp.escape(ch.chr)}.join("|") }

This gives shorter and more readable strings.
Thinking about this problem I wonder if it would be an idea to create a
mask with transliterate and then use a bitwise operation to downcase.
IIRC you change the case with a bitwise | operation with ' '. It doesn't
look like Ruby have any methods bitwise operations on strings.

AFAIK there are none. But you can do

irb(main):011:0> s =3D "aaa"
=3D> "aaa"
irb(main):012:0> s[0] =3D (s[0].ord | 0x0F).chr
=3D> "o"
irb(main):013:0> s
=3D> "oaa"

Kind regards

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
M

Martin Hansen

#{ (0..(BASE_SOLEXA + cutoff)).map {|ch|
Regexp.escape(ch.chr)}.join("|") }

This gives shorter and more readable strings.

OK, so for the right range that is equal to:

regex = Regexp.union((-5 .. cutoff).collect { |n| (n + BASE_SOLEXA).chr
} )

Which returns:

(?-mix:;|<|=|>|\?|@|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T)

But I dont get how the substitution will scan one string (scores) an
make changes to the other (seq):

seq = 'TTGGTCGCTCGCTCCGCGACCTCAGATCAGACGTGGGCGAT'
scores = '@ABCDEFGHIJK?MNOPQRSTUVWhgfedcba`_^]\[ZYX'

?

AFAIK there are none.

That is actually strange. In Perl you can do this:

perl -le 'print "ABCDEFG" | " \0\0 "'
=> abCDefg

Which I believe is very efficient. And the mask with " " and "\0" can be
constructed with transliterate efficiently too.

Cheers,


Martin
 
R

Robert Klemme

2010/8/30 Martin Hansen said:
#{ (0..(BASE_SOLEXA + cutoff)).map {|ch|
Regexp.escape(ch.chr)}.join("|") }

This gives shorter and more readable strings.

OK, so for the right range that is equal to:

regex =3D Regexp.union((-5 .. cutoff).collect { |n| (n + BASE_SOLEXA).chr
} )

Which returns:

(?-mix:;|<|=3D|>|\?|@|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T)

But I dont get how the substitution will scan one string (scores) an
make changes to the other (seq):

seq =A0 =A0=3D 'TTGGTCGCTCGCTCCGCGACCTCAGATCAGACGTGGGCGAT'
scores =3D '@ABCDEFGHIJK?MNOPQRSTUVWhgfedcba`_^]\[ZYX'

?

It doesn't. Somehow I must have missed an important detail... :)

You could do this though

seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

Not too nice though. You could however do some preparation, e.g.
store scores as an Array of Fixnum instead of using #ord.
That is actually strange. In Perl you can do this:

perl -le 'print "ABCDEFG" | " =A0\0\0 =A0 =A0"'
=3D> abCDefg

Which I believe is very efficient. And the mask with " " and "\0" can be
constructed with transliterate efficiently too.

Kind regards

robert


--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
M

Martin Hansen

You could do this though
seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

Not too nice though. You could however do some preparation, e.g.
store scores as an Array of Fixnum instead of using #ord.

Yes, converting scores to arrays is bad since the scores are parsed from
files as strings (millions of them). And I am unsure if substr
substitutions are very efficient ...

We need to trick this into the regex engine somehow.

How about transforming scores to a mask string like this: 000111 where 1
indicates that the corresponding sequence char should be lowercased
(that can be done with tr). Then we plug this onto the sequence string:

seq = "ATCGAT000111"

And then we construct a regex with a forward looking identifier that
reads the mask and manipulates the ATCG chars?


Martin
 
R

Robert Klemme

2010/8/30 Martin Hansen said:
You could do this though

seq.gsub! /./ do |m|
=A0 scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

Not too nice though. =A0You could however do some preparation, e.g.
store scores as an Array of Fixnum instead of using #ord.

Yes, converting scores to arrays is bad since the scores are parsed from
files as strings (millions of them). And I am unsure if substr
substitutions are very efficient ...

We need to trick this into the regex engine somehow.

How about transforming scores to a mask string like this: 000111 where 1
indicates that the corresponding sequence char should be lowercased
(that can be done with tr). Then we plug this onto the sequence string:

seq =3D "ATCGAT000111"

And then we construct a regex with a forward looking identifier that
reads the mask and manipulates the ATCG chars?

Here's what I'd probably do.

Create a custom class (and not use a Hash) for this, e.g.

Score =3D Struct.new :seq, :score

Create another structure for caching scores and a bit representation
for downcase dependent on cutoff valiue, e.g.

ScoreCache =3D Struct.new :score do
def mask(cutoff)
cache[cutoff]
end

def mask_sequence(cutoff, seq)
mask(cutoff).each_bit do |idx|
seq[idx] =3D seq[idx].downcase!
end

seq
end

private
def cache
@cache ||=3D Hash.new do |h,cutoff|
c =3D 0

self.score.each_with_index |ch,idx|
c |=3D (1 << idx) if ch.ord - BASE_SOLEXA < cutoff
end

h[cutoff] =3D c
end
end
end

# Store score string -> ScoreCache
global_score_cache =3D Hash.new do |h,score|
h[score] =3D ScoreCache.new score
end

class Integer
def each_bit
raise "Currently only positive implemented" if self < 0

if block_given?
idx =3D 0
x =3D self

while x !=3D 0
yield idx if x[0] =3D=3D 1
idx +=3D 1
x >>=3D 1
end

self
else
Enumerator.new self, :each_bit
end
end
end


And then use it and profile.

Kind regards

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
M

Martin Hansen

Thanks Robert,

The gsub solution seems to be reasonably efficient.
seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

But my original proposed naive loop is twice as fast:
scores.each_char do |score|
seq = seq.downcase if score.ord - BASE_SOLEXA <= cutoff
i += 1
end


I dont really know how gsub and tr compares to the Perl equivalents
speed wise - in Perl tr is precompiling a lookup table that is evil fast
and the regex engine is also primed at compile time and runs extremely
fast.

I suspect that you need some C extension to go faster than this, but I
don't really want to spend the time on that. I was exploring Inline C
but that appears very fragile - I cannot even get the example from the
cookbook up and running under Ruby 191/192.

Looking at your last proposal I see three iterations where two are
running narrow if loops. I have not testet it, but it looks suspicious.

It is a shame that Ruby does not have build in bitwise operators for the
String class allowing C speed masking. I have tried to get on the core
mailing list to post this as a feature suggestion, but the ML form is
stuffed.


Cheers,


Martin
 
R

Robert Klemme

Thanks Robert,

The gsub solution seems to be reasonably efficient.
=A0seq.gsub! /./ do |m|
=A0 scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

But my original proposed naive loop is twice as fast:
scores.each_char do |score|
=A0 seq =3D seq.downcase if score.ord - BASE_SOLEXA <=3D cutoff
=A0 i +=3D 1
end


I dont really know how gsub and tr compares to the Perl equivalents
speed wise - in Perl tr is precompiling a lookup table that is evil fast
and the regex engine is also primed at compile time and runs extremely
fast.


Regexp is precompiled but I suspect that tr works at runtime only.
For definitive answer you'll have to look at the source.
I suspect that you need some C extension to go faster than this, but I
don't really want to spend the time on that. I was exploring Inline C
but that appears very fragile - I cannot even get the example from the
cookbook up and running under Ruby 191/192.

Looking at your last proposal I see three iterations where two are
running narrow if loops. I have not testet it, but it looks suspicious.

Well, but the caching should avoid that too many loops are executed.
I do not know however, how often you reuse values. If you need this
in several processes you could save the current state in a file via
Marshal which is quite fast.

Kind regards

robert

--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,482
Members
44,900
Latest member
Nell636132

Latest Threads

Top