BioRuby & Google Summer of Code 2011

Discussion in 'Ruby' started by Ra, Mar 25, 2011.

  1. Ra

    Ra Guest

    [Note: parts of this message were removed to make it a legal post.]

    Dear All,
    our project, BioRuby, is looking for students to participate in
    GSoC 2011, thanks to OBF and NESCent.
    Please feel free to forward this message to your university mailing
    list, lab or local Ruby group.

    Use our mailing list to discuss ideas, and feel free to contact the
    mentors or any other member of the development team.

    March 18-27: Would-be student participants discuss application ideas
    with mentoring organizations.
    March 28: Student application period opens.
    April 8, 19:00 UTC: Student application deadline.

    Our proposals

    Proposal 2011

    Support Next Generation Sequencing (NGS) in BioRuby

    Processing and analyzing NGS data is challenging for a variety of
    reasons, in particular because the data sets are usually very large
    and contain a vast amount of information and a large amount of
    unknown data. Furthermore, there are many different approaches to
    NGS analysis, and several software tools need to be integrated to
    produce reliable results. Since this topic is so important for the
    BioRuby community, we started a sub-project, bioruby-ngs, for
    analyzing NGS data. The project is at an early stage of development,
    but notable results have been obtained quickly. Many topics still
    need to be addressed, in particular:
      - data and results reporting
      - workflow management
      - a DSL for describing experimental designs
      - YALIMS (Yet Another LIMS), a simple web-based LIMS for processing
        raw datasets, with reporting and monitoring
    Due to the open nature of the project, the student will choose which
    feature he/she wants to develop and focus on. The student will learn
    the basic concepts of NGS data analysis and will work closely with a
    mentor to produce a working library that will be integrated into the
    BioRuby NGS project.
    Difficulty and needed skills
    Medium to Hard depending on the topic selected.
    The project requires:
      - Bash programming and knowledge of the Linux environment
      - Ruby on Rails 3.x
    Mentors: Raoul J.P. Bonnal, Francesco Strozzi
    Project overview and updates
    Source code
    BioRuby wrapper for command-line applications

    The main reason for this project is the need to support different
    stand-alone applications that are critical for Next Generation
    Sequencing analyses. Binding directly to existing C/C++ source code
    or rewriting all the applications is impractical and a waste of
    resources. A quick solution is to use the stand-alone applications
    directly, integrating them into the BioRuby API. Some work has
    already been done in the BioRuby NGS project with this wrapper, but
    better support for I/O-intensive processes is required. Following
    this design pattern, it will also be possible to improve support for
    other bioinformatics suites, such as EMBOSS, whose BioRuby support
    is outdated at the time of this proposal.
    The student will become familiar with advanced metaprogramming
    concepts in Ruby and will contribute to the definition of a DSL for
    this wrapping library. He/she will also build a parser to
    automatically define additional wrappers for the EMBOSS suite,
    starting from the ACD configuration files.
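
    To give applicants a rough idea of what such a wrapping DSL might
    look like, here is a minimal sketch. It is not the API of the
    existing bioruby-ngs wrapper: the class CommandWrapper, the
    program/option declarations and the "fakealign" tool are all
    hypothetical names made up for illustration. It only builds the
    command line; actually running the tool is left out.

```ruby
require "shellwords"

# Hypothetical mini-DSL for describing a command-line tool.
# None of these names come from BioRuby or bioruby-ngs.
class CommandWrapper
  class << self
    # Declare (or read back) the executable this wrapper drives.
    def program(name = nil)
      @program = name if name
      @program
    end

    # Declare a command-line option and generate a reader/writer pair
    # for it via metaprogramming.
    def option(name, flag:, type: :value)
      options[name] = { flag: flag, type: type }
      define_method(name) { @values[name] }
      define_method("#{name}=") { |value| @values[name] = value }
    end

    def options
      @options ||= {}
    end
  end

  def initialize(values = {})
    @values = {}
    values.each { |name, value| public_send("#{name}=", value) }
  end

  # Build the argument vector without running anything.
  def to_argv(*inputs)
    argv = [self.class.program]
    self.class.options.each do |name, spec|
      value = @values[name]
      next if value.nil? || value == false
      argv << spec[:flag]
      argv << value.to_s unless spec[:type] == :switch
    end
    argv + inputs.map(&:to_s)
  end

  # Shell-escaped command line, useful for logging.
  def to_command(*inputs)
    Shellwords.join(to_argv(*inputs))
  end
end

# Example wrapper for an imaginary aligner called "fakealign".
class FakeAlign < CommandWrapper
  program "fakealign"
  option :threads, flag: "-p"
  option :quiet,   flag: "--quiet", type: :switch
end

aligner = FakeAlign.new(threads: 4, quiet: true)
puts aligner.to_command("reads.fastq")
```

    A parser for ACD files could, under the same assumptions, emit one
    option declaration per ACD parameter instead of having them written
    by hand.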
    Difficulty and needed skills
    Medium. Good Ruby knowledge and experience with meta-programming are
    required to achieve the goals.
    The project requires:
      - Ruby 1.9
      - Ruby metaprogramming
    Mentors: Raoul J.P. Bonnal, Francesco Strozzi
    Source code, wrapper branch
    Represent bio-objects and related information with images

    Most of the time, after a bioinformatics analysis, the resulting data
    need to be re-processed into graphical form, since we, as human
    beings, are more comfortable accessing results and data visually
    than browsing a huge table of interconnected information. Very often
    it is also difficult to extract the real biological meaning from a
    raw dataset. The main idea of this proposal is to define and attach
    graphical functions to BioRuby objects and, consequently, to the
    results computed by a generic process or pipeline. With this
    solution, it would be possible not only to explore them more
    naturally but also to export and integrate the information into a
    web environment, for sharing the knowledge and the results. For
    example, different objects storing alignment results could share
    the same interface and display their data in a common way. The same
    is true for other kinds of objects or computational procedures.
    The student and the mentor will together define a minimum set of
    features that need to be shared by the BioRuby objects and that
    could be visualized. Then the student will create a library/module
    to implement these graphical features within the BioRuby project.
    He/she will gain experience with Rubyvis as the graphical API and
    with Ruby on Rails for web visualization.
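
    As a rough illustration of the "same interface" idea, the sketch
    below uses a plain Ruby mixin that writes SVG by hand. It does not
    use Rubyvis, and the names Drawable, AlignmentReport, GeneModel and
    the features method are hypothetical, not existing BioRuby classes.
    Labels are inserted without XML escaping to keep the example short.

```ruby
# Hypothetical mixin: any object that answers #features with
# [label, start, stop] triples gains a common #to_svg method.
module Drawable
  ROW_HEIGHT = 20

  def to_svg(scale: 1.0)
    rows = features.each_with_index.map do |(label, start, stop), row|
      x = (start * scale).round
      width = ((stop - start) * scale).round
      y = row * ROW_HEIGHT
      %(<rect x="#{x}" y="#{y}" width="#{width}" height="#{ROW_HEIGHT - 4}">) +
        %(<title>#{label}</title></rect>)
    end
    height = features.size * ROW_HEIGHT
    %(<svg xmlns="http://www.w3.org/2000/svg" height="#{height}">#{rows.join}</svg>)
  end
end

# Two unrelated result classes sharing the same graphical interface.
AlignmentHit = Struct.new(:name, :from, :to)

class AlignmentReport
  include Drawable

  def initialize(hits)
    @hits = hits
  end

  # One feature per alignment hit.
  def features
    @hits.map { |hit| [hit.name, hit.from, hit.to] }
  end
end

class GeneModel
  include Drawable

  def initialize(exons)
    @exons = exons
  end

  # One feature per exon, given as [from, to] pairs.
  def features
    @exons.each_with_index.map { |(from, to), i| ["exon #{i + 1}", from, to] }
  end
end

report = AlignmentReport.new([AlignmentHit.new("hit1", 10, 60)])
puts report.to_svg
```

    Each class only has to say what its features are; how they are drawn
    lives in one place and could later be swapped for a real graphical
    API.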
    Difficulty and needed skills
    Medium/Hard. The student will need to define a graphical API and
    integrate the new code with the existing BioRuby modules. High level
    coding skills will be required to create a clean API with a clear
    The project requires:
      - Very good knowledge of Ruby (1.9) and design patterns
      - Basic concepts of graphics/visualization
      - Basic knowledge of Ruby on Rails
    Mentors: Raoul J.P. Bonnal, Christian Zmasek

    Modular annotation knowledge base for BioRuby

    Handling data sets from gene expression analysis or real-time PCR
    platforms requires accessing the corresponding gene annotations
    several times during the measurements. This kind of information is
    normally stored in remote databases that provide the required
    knowledge and data. Problems arise when the available databases do
    not support a specific version of the data of interest or when huge
    queries need to be submitted. A BioRuby knowledge base, designed to
    be modular and expandable over time, could solve these problems. A
    good compromise between performance and portability could be
    achieved by using embedded databases and accessing the data through
    a clean API.
    The student and the mentor will decide which platforms should be
    supported, based on their popularity. Then the student will retrieve
    the essential annotations and design a simple database schema to
    hold all the relevant non-redundant information. The schema will be
    flexible enough to allow linking the dataset to external databases
    or resources for subsequent analyses. After this phase of discovery
    and design, the student will build the database using SQLite
    and will write a Ruby library to access the data using ORM
    Difficulty and needed skills
    Medium. The student will need to define the core data to be included
    in the database and how this information will be organized and
    accessed by the end user. The Ruby library will be created using the
    ActiveRecord paradigm, but good coding skills will be required to
    design an efficient API with clear documentation.
    The project requires:
      - Basic SQL
      - Good knowledge of Ruby
      - Experience in querying biological databases
      - Experience with annotation data
    Mentors: Raoul J.P. Bonnal, Francesco Strozzi


    BioRuby forester

    Forester is a collection of software libraries, mostly written in
    Java, for comparative genomics and evolutionary biology research. A
    prominent example of a tool based on forester is the phylogenetic tree
    explorer Archaeopteryx. Most of forester's use-cases are associated
    with the use of evolutionary trees as tools for establishing
    (functional) relations between genes or proteins (for example protein
    function prediction with RIO) and comparing genome based features
    between different species. Therefore, it implements objects
    representing evolutionary trees overlaid with biological data from
    other sources (e.g. protein domain architectures), as well as
    algorithms operating on these, such as the automated inference of
    ancestral taxonomies on gene trees, which has proven useful in the
    functional interpretation of large gene trees.
    Most of these methods are currently accessible only via the command
    line or through the Archaeopteryx GUI, and are therefore difficult
    or impossible to use from other programs or toolkits (such as
    BioRuby). Although forester is mostly written in Java, it also
    contains components in Ruby ("evoruby"). These implement operations
    on multiple sequence alignments (MSAs) that are crucial for
    developing workflows for automated, large-scale phylogenetic
    inference, including I/O and efficient MSA manipulation (such as
    deleting all columns with a gap proportion larger than a given
    threshold, or removing short and/or redundant sequences).
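
    For readers unfamiliar with this kind of MSA manipulation, the
    gap-column filter mentioned above can be sketched in a few lines of
    plain Ruby. This is not evoruby's implementation: the method name
    remove_gappy_columns and its parameters are made up for this
    example, and the alignment is assumed to be an array of
    equal-length strings.

```ruby
# Drop every alignment column whose gap fraction exceeds max_gap_fraction.
# sequences: equal-length strings; gap characters default to "-" and ".".
def remove_gappy_columns(sequences, max_gap_fraction, gap_chars: "-.")
  return [] if sequences.empty?

  length = sequences.first.length
  unless sequences.all? { |seq| seq.length == length }
    raise ArgumentError, "sequences must have equal length"
  end

  # Indices of the columns that are kept.
  kept = (0...length).select do |col|
    gaps = sequences.count { |seq| gap_chars.include?(seq[col]) }
    gaps.to_f / sequences.size <= max_gap_fraction
  end

  sequences.map { |seq| kept.map { |col| seq[col] }.join }
end

msa = ["AC-GT", "A--GT", "AT-G-"]
p remove_gappy_columns(msa, 0.5)
# prints ["ACGT", "A-GT", "ATG-"]
```

    Only the third column is all gaps, so it is the only one removed at
    a threshold of 0.5.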
    The goal would be to develop a framework for accessing forester's
    central algorithms and applications from within BioRuby. It is
    expected that this project will be implemented in the form of a
    BioRuby plugin in order to avoid creating additional dependencies
    for the main BioRuby distribution. Full two-way access between the
    Java and Ruby languages can be accomplished by using JRuby as the
    underlying
    Depending on the level of experience and skills of a student, a
    project proposal could also include either or both of the following
    additional goals.
      - BioRuby and the "evoruby" components of forester partially
        overlap in functionality. You could incorporate MSA management
        functionality present in "evoruby" but missing in BioRuby into
        the BioRuby distribution. This would not only make that
        functionality immediately accessible to all BioRuby users, but
        would also allow a larger community of developers to participate
        in the maintenance and future development of these components.
      - Display gene conversions. This would entail developing a parser
        for GENECONV output and using the newly developed
        BioRuby-forester link to display gene conversions directly
        within Archaeopteryx.
    The student needs to learn two disparate toolkits, BioRuby and forester.
    The project involves two programming languages, Ruby and Java.
    Need to understand the BioRuby plugin system.
    Involved toolkits or projects
    BioRuby plugin system
    Degree of difficulty and needed skills
    Expected difficulty: Medium. Proficiency in at least one of the two
    involved programming languages, Ruby and Java, is necessary.
    Experience/interest in molecular evolution or comparative genomics is
    required, and experience with BioRuby or forester will help.
    Mentors: Christian Zmasek, Pjotr Prins, Raoul J.P. Bonnal


    Ra, Mar 25, 2011