yet another text parser...

Discussion in 'Ruby' started by Marc Hoeppner, Jul 18, 2007.

  1. Hi,

    I have yet another question about how to write a specific text parser in
    Ruby.
    So, without further ado, this is what the source file looks like:

    Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface
    antigen precursor [Plasmodium falciparum 3D7]
    (1085 letters)

    Database: KOG
    112,920 sequences; 47,500,486 total letters

    Searching..................................................done



                                                         Score     E
    Sequences producing significant alignments:          (bits)  Value

    At2g21510                                              96    3e-19
    At4g39150                                              95    1e-18
    At1g76700

    and so on...

    What I want to do is the following:
    Read the source file and, if a line starts with "Query=", strip
    everything from the line except the expression "gi|xxxxx". That part was
    no problem with gsub. Now comes the tricky part (or maybe not so tricky).
    Go on from there until you find a line starting with "Sequences", skip
    that line and the one after it, and puts the third line together with
    the "gi|xxxxx".
    So for the above example the output would look like this:

    gi|23510597 At2g21510

    Now, ideally I wouldn't have to include this skip-lines part, but I can't
    find a regexp that lets me reliably identify the first line of the
    results block (not all possible results start with At...).

    How I tried to do it:

    def stripname line
      s = line.gsub(/Query=/, '')
      u = s.gsub(/\|emb.*/, '')
    end

    count = 0 # initializing variables
    t = nil
    v = nil

    ARGF.each do |l|

      puts l unless count.zero?
      count = [0, count-1].max

      if l.match(/^Query=/)
        t = stripname l
      elsif l.match(/^Sequences/)
        l = $1
        count = 2
        puts "#{t}#{l}"
      else
      end
    end

    But the output looks terrible:
    gi|23510597

    At2g21510 96 3e-19
    gi|23510599

    At5g14980 58 3e-08
    gi|23510600

    And no matter what I try, I can't get the gi|xxxx and the corresponding
    "best hit" onto the same line. I tried it with hashes, but frankly I don't
    know enough about those yet.
    So if anyone has a helpful comment or solution, I would be extremely
    grateful!

    Cheers,

    Marc

    --
    Posted via http://www.ruby-forum.com/.
    Marc Hoeppner, Jul 18, 2007
    #1

  2. I'd throw it all into one big ugly regex:
    s.match(/Query= (.+?\|.+?)\|.+?\(bits\)\s+Value\s+(.+?)\s+/m).to_a[1..2].join(' ')
    => "gi|23510597 At2g21510"
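
A minimal sketch of how the one-regex idea might be applied to a whole report with `String#scan`, assuming the file holds several `Query=` records and every record has a hit table. `SAMPLE`, `HIT_PATTERN` and `best_hits` are illustrative names that do not appear in the thread, and the accession and description of the second record are placeholders. A record without a hit table would let the lazy `.+?` run on into the next record, so this is a sketch rather than a robust parser.

```ruby
# Hypothetical two-record sample modelled on the excerpt in the first post.
# The second record's accession and description are placeholders.
SAMPLE = <<~REPORT
  Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface
  antigen precursor [Plasmodium falciparum 3D7]
  (1085 letters)

                                                   Score     E
  Sequences producing significant alignments:      (bits)  Value

  At2g21510                                          96    3e-19
  At4g39150                                          95    1e-18

  Query= gi|23510599|emb|PLACEHOLDER| placeholder description
  (100 letters)

                                                   Score     E
  Sequences producing significant alignments:      (bits)  Value

  At5g14980                                          58    3e-08
REPORT

# The regex from the post above; /m lets `.` match newlines as well.
HIT_PATTERN = /Query= (.+?\|.+?)\|.+?\(bits\)\s+Value\s+(.+?)\s+/m

# Returns one "gi|... first_hit" string per record found in text.
def best_hits(text)
  text.scan(HIT_PATTERN).map { |gi, hit| "#{gi} #{hit}" }
end

puts best_hits(SAMPLE)
# gi|23510597 At2g21510
# gi|23510599 At5g14980
```

Reading the whole file into one string (for example with `ARGF.read`) is what makes this approach possible; for very large reports a line-by-line loop may be the better fit.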

    Andreas Schwarz, Jul 18, 2007
    #2

  3. Andreas Schwarz wrote:
    > I'd throw it all into one big ugly regex:
    > s.match(/Query= (.+?\|.+?)\|.+?\(bits\)\s+Value\s+(.+?)\s+/m).to_a[1..2].join(' ')
    > => "gi|23510597 At2g21510"


    Thanks for the suggestion! However, if someone has a suggestion
    regarding the following code and how to fix it, I'd be happy. It's
    almost working and I just need to understand why it is behaving a bit
    oddly. So here is the code.

    def stripname line
      s = line.gsub(/Query=/, '')
      u = s.gsub(/\|emb.*/, '')
    end


    count = 0
    gene = nil
    store = Array.new

    ARGF.each do |l|

      store.push(l) unless count.zero?
      count = [0, count-1].max

      if l.match(/^Query=/)
        gene = stripname l

      elsif l.match(/^Sequences/)
        count = 2
        puts "#{gene.strip} #{store.last.to_s.strip}"
      else

      end
    end



    Problem:

    The code reads: if a line is found that starts with "Query=", call the
    method stripname on it and store the result in the variable "gene". Go
    further, and if you find a line that starts with "Sequences", start the
    "count" countdown described above. Now this is the problem right now.
    Since I wasn't able to figure out how to get the formatting right, I
    decided to stick to the skip-line approach and, instead of printing the
    lines, to store them in an array. From there I simply read the last entry.

    BUT: instead of printing every stored hit next to the corresponding
    "gene", it shifts the whole thing by one line, so that each "gene" is
    associated with the "best hit" of the previous match to "Query=".

    gi|23510597
    gi|23510599 At2g21510
    gi|23510600 At5g14980

    Now, I could solve that easily with a capable text editor, but I think
    there must be an easy solution to this...right?
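
One possible reading of the shift, assuming each hit table looks like the excerpt in the first post: the `puts` runs while the loop is still on the `Sequences` line, before the two following lines have been pushed, so `store.last` still holds the hit of the previous record (and nothing at all for the first one). The sketch below keeps the countdown but emits the pair when the countdown reaches zero. `pair_hits` and the `StringIO` input are illustrative stand-ins for the `ARGF` loop, and taking only the first column with `split.first` is an added assumption about the desired output.

```ruby
require 'stringio'

# Hypothetical helper: same countdown idea as the posted loop, but the pair is
# emitted when the countdown expires, i.e. on the first hit line itself.
def pair_hits(io)
  pairs = []
  gene  = nil
  count = 0
  io.each_line do |l|
    if count > 0
      count -= 1
      # count reaches 0 on the second line after "Sequences": the first hit.
      pairs << "#{gene} #{l.split.first}" if count.zero?
    elsif l =~ /^Query=\s*(gi\|[^|]+)/
      gene = $1
    elsif l =~ /^Sequences/
      count = 2
    end
  end
  pairs
end

input = StringIO.new(<<~REPORT)
  Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface
  antigen precursor [Plasmodium falciparum 3D7]

  Sequences producing significant alignments:      (bits)  Value

  At2g21510                                          96    3e-19
  At4g39150                                          95    1e-18
REPORT

puts pair_hits(input)
# gi|23510597 At2g21510
```

Collecting the pairs in an array and printing at the end is only one option; printing inside the loop at the same point would work just as well.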

    Cheers,
    Marc

    Marc Hoeppner, Jul 18, 2007
    #3
  4. On 7/18/07, Marc Hoeppner wrote:
    > And no matter what I try, I cant get the gi|xxxx and the corresponding
    > "best hit" in the same line.

    It is a terrible thing that happens to me all the time: one tends to
    forget these \n's.
    Well, fortunately we have #chomp, but maybe you want to use #strip,
    which removes trailing (and leading) whitespace, \n included.
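
A small sketch of the difference, using `stripname` as posted in the thread (only indented and given parentheses). The `raw` line is taken from the sample file; the variable names are illustrative.

```ruby
# stripname as posted in the thread, indented for readability.
def stripname(line)
  s = line.gsub(/Query=/, '')
  s.gsub(/\|emb.*/, '')
end

# A line as ARGF.each yields it, trailing newline included.
raw  = "Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface\n"
gene = stripname(raw)

p gene        # " gi|23510597\n"  `.` does not match "\n", so the newline survives
p gene.chomp  # " gi|23510597"    only the trailing line terminator is removed
p gene.strip  # "gi|23510597"     leading and trailing whitespace is removed
```

The surviving newline is why the interpolated output broke across lines; calling `strip` on both pieces before joining them avoids that.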

    HTH
    Robert
    --
    I always knew that one day Smalltalk would replace Java.
    I just didn't know it would be called Ruby
    -- Kent Beck
    Robert Dober, Jul 18, 2007
    #4
  5. On 7/18/07, Marc Hoeppner wrote:
    > BUT: instead of printing every stored hit to the corresponding "gene",
    > it shifts the whole thing 1 line. So that each "gene" is associated with
    > the "best hit" of the previous match to "Query=".

    Are you pushing before or after you use the last element of the array?
    But you should go back to your original idea, which works just fine,
    now that you have discovered #strip, before my post :)

    Now this is a Ruby ML, right, so maybe you would accept that I Rubyish
    the code a little bit ;)

    gi = nil
    ARGF.each do |line|
      case line
      when /Query=\s*(gi\|.*?)\|/
        gi = $1
      when /Sequence/
        puts gi.strip << " " << (1..2).map{ ARGF.readline }.last.strip
      end
    end
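
The same read-ahead idea can be sketched as a method that takes any IO, which makes it easy to try without a file on the command line. This is a variation under stated assumptions, not the code as posted: `first_hits` is a hypothetical name, `io.gets` replaces `ARGF.readline` (the latter raises `EOFError` at the end of input, while `gets` returns `nil`), and `split.first` keeps only the first column of the hit line instead of the whole stripped line.

```ruby
require 'stringio'

# Hypothetical wrapper around the case/when loop above.
def first_hits(io)
  gi = nil
  results = []
  io.each_line do |line|
    case line
    when /Query=\s*(gi\|.*?)\|/
      gi = $1
    when /^Sequences/
      # Read ahead two lines: the blank separator, then the first hit.
      hit = Array.new(2) { io.gets }.last
      results << "#{gi} #{hit.split.first}" if gi && hit
    end
  end
  results
end

report = StringIO.new(<<~REPORT)
  Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface
  antigen precursor [Plasmodium falciparum 3D7]

  Sequences producing significant alignments:      (bits)  Value

  At2g21510                                          96    3e-19
  At4g39150                                          95    1e-18
REPORT

puts first_hits(report)
# gi|23510597 At2g21510
```

Passing `ARGF` instead of a `StringIO` should work the same way, since `ARGF` also responds to `each_line` and `gets`.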

    HTH
    Robert
    Robert Dober, Jul 18, 2007
    #5

  6. >
    > Now this is a Ruby ML, right, so maybe you would accept that I Rubyish
    > the code a little bit ;)
    >
    > gi = nil
    > ARGF.each do |line|
    >   case line
    >   when /Query=\s*(gi\|.*?)\|/
    >     gi = $1
    >   when /Sequence/
    >     puts gi.strip << " " << (1..2).map{ ARGF.readline }.last.strip
    >   end
    > end
    >


    Very nice, thank you!

    Marc Hoeppner, Jul 18, 2007
    #6