Science and Technology Production

Book of Abstracts - Another Tool for Genomic Comprehension (ATGC): an ontology driven database and web interface applied to Sunflower Microarray Project

Congress

Authorship:

Bernardo J. Clavijo ; Paula Fernandez ; Gonzalez Sergio ; Rivarola M ; Heinz R ; FARBER, MARISA DIANA ; Norma Paniego

Date:

2012

Publishing House and Editing Place:

International Society for Computational Biology

Summary

Microarray technology opened an era of high-throughput transcriptomic analysis approximately ten years ago, beginning with 8,000 printed genes by Affymetrix in Arabidopsis thaliana and later scaling up to 45,000 printed genes in rice and 90,000 in Brassica. Next-generation sequencing (NGS) technologies are now enabling an even deeper understanding of genomics and transcriptomics in different species. For the foreseeable future, however, both technologies will coexist, each focusing on different tasks: they can complement each other with valuable biological information, and dedicated oligonucleotide arrays can be designed to support functional studies of a specific pathway or developmental stage. One obvious application of microarray technology is transcriptional profiling in species that have neither a sequenced genome of their own nor a reference genome from a closely related species. For some of these species, commercial microarrays based on custom designs are available (Agilent, Affymetrix, Nimblegen, etc.). Sunflower fits into this framework: although a genome sequencing initiative is in progress, no reference genome is available. In this case, functional information is limited to EST databases, which for cultivated sunflower are rather extensive. More than 133,000 ESTs are publicly available (http://ncbi.nlm.nih.gov/dbEST/dbEST_summary.html), covering libraries prepared from several lines and cultivars, in addition to a recently published assembly of ca. 6 Gb of next-generation sequence data produced for SNP discovery.
However, EST libraries tend to be significantly contaminated with vector sequences and chimeras, and their sequence quality is relatively low because the library sequencing strategy prioritizes obtaining a large number of single-pass sequences. A standardized set of bioinformatics routines is therefore necessary to clean and decontaminate public raw sequences. Currently, the shortage of candidate genes underlying agronomically important traits is one of the main drawbacks in sunflower molecular breeding. In this context, functional tools that allow concerted transcriptional studies, such as high-density oligonucleotide microarrays, strongly support the discovery and characterization of novel genes. Oligonucleotide-based chips not only allow the analysis of a whole transcriptome but are also considered more accurate than cDNA-based chips because they involve fewer manipulation steps. The possibility of implementing this technology on any custom array system, such as Agilent, Nimblegen, and others, has the potential to create a very useful tool for gene discovery in non-model crops. In addition, the longer probe format is a major advantage of Agilent oligonucleotide microarrays over other technologies: longer probes are more stable in the presence of sequence mismatches and are consequently more suitable for the analysis of highly polymorphic regions. In our lab, public and proprietary H. annuus L. EST datasets were used to create a comprehensive sunflower unigene collection. These datasets comprise 34 cDNA libraries from different cultivars and from various tissues and anatomical parts of plants grown under different physiological conditions. In this work, we present the development of a comprehensive Sunflower Unigene Resource of H. annuus L.
(SUR v 1.0), its functional annotation, and the design and validation of a custom sunflower oligonucleotide-based microarray for the identification of concerted transcriptional responses associated with biotic and abiotic stresses. This development is an initiative of the Sunflower Argentinean Consortium, working in collaboration with the Institute Principe Felipe, Valencia, Spain, within the framework of a public research project. To design and customize this microarray, 133,682 public ESTs were clustered and assembled into 12,924 contigs and 28,089 singletons using CAP3, with parameters set according to the most relevant recently published microarray designs (p=95, f=45, h=25, o=80). After cleaning and removal of low-quality and short (<100 bp) sequences, the dataset was reduced to 132,479 reads. Additional processed ESTs or gene sequences of special interest for relevant traits were also added to the initial dataset. The final assembly resulted in 41,013 putative transcripts. This analysis showed no bias among ESTs originating from the sunflower cDNA libraries deposited in GenBank, which supports the microarray's design and its potential functional coverage. Finally, a set of 678 consensus contigs (or super-contigs) was generated from unigenes that showed high BLAST sequence similarity but did not cluster together in the CAP3 assembly. These super-contigs may reflect variants stemming from sequencing errors, gene duplication, or allelic variation. They were included in the microarray design, which resulted in 40,169 probes. GO term mapping was carried out by running Blast2GO against a local GO database (2011-08 update). Annotation was completed by running a local installation of InterProScan v4.7 followed by InterPro2GO (database version 31.0, release February 2011).
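The length filter and the CAP3 settings described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: it assumes that the quoted values p, f, h and o correspond to the CAP3 command-line options -p (overlap percent identity cutoff), -f (maximum gap length in an overlap), -h (maximum overhang percent length) and -o (overlap length cutoff). The file name and helper names are hypothetical, and vector and chimera cleaning is not shown.

```python
from typing import Dict, Iterable, List, Tuple

MIN_LENGTH = 100  # sequences shorter than this many bp are discarded
# Values quoted in the abstract; mapping them to CAP3 flags is an assumption.
CAP3_PARAMS: Dict[str, int] = {"p": 95, "f": 45, "h": 25, "o": 80}


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (identifier, sequence) pairs from FASTA-formatted lines."""
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def drop_short(records: List[Tuple[str, str]],
               min_length: int = MIN_LENGTH) -> List[Tuple[str, str]]:
    """Keep only sequences of at least min_length bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


def cap3_command(fasta_path: str,
                 params: Dict[str, int] = CAP3_PARAMS) -> List[str]:
    """Build (but do not run) a CAP3 command line for the given read file."""
    command = ["cap3", fasta_path]
    for flag, value in params.items():
        command += [f"-{flag}", str(value)]
    return command


if __name__ == "__main__":
    demo = [">est_1 hypothetical read", "ACGT" * 30, ">est_2", "ACGT"]
    kept = drop_short(parse_fasta(demo))
    print([name for name, _ in kept])
    print(" ".join(cap3_command("sunflower_ests.fasta")))
```

The command is returned as a list so that it could be passed to subprocess.run without shell quoting issues once CAP3 is installed.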
For sequences with BLASTX hits, we considered the whole sequence and used the reading frame of the hit; for anonymous sequences, we considered 6-frame translations. In this work, we present ATGC (Another Tool for Genomic Comprehension), a database to store, visualize, analyze and share this information, including the probes associated with each unigene represented in the microarray. This database is available at http://bioinformatica.inta.gov.ar/ATGC/, currently with access restricted by user name and password. ATGC is based on Chado (Generic Model Organism Database, http://gmod.org), an ontology-driven relational database schema implemented in PostgreSQL, and a web interface based on web2py. One of the main goals of ATGC is to facilitate the exploration and visualization of the data. The main development effort went into exploiting GO annotation to analyze the annotated genes, allowing users to move through the GO-DAG structure. This approach lets users navigate between different classes of genes available in different projects. Strong emphasis was placed on listing each gene only once in each GO category. The GO term Feature Search page shows every feature directly annotated with the searched term, adds every feature indirectly annotated through a more specific term, and states, for every feature displayed, which term inherits the searched name or ID. This routine has dramatically expanded the possibilities for interpretation and exploration centered on GO annotation. To facilitate access to information, the Feature Detail Page includes all information related to the feature, including links to related data. Among these data are the oligonucleotide microarray probe sequence and a list of probes that may cross-hybridize with the same unigene. We are currently working on minor debugging and on an easy installer for ATGC for different Unix distributions and even the Windows operating system.
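The behaviour of the GO term Feature Search described above (direct hits, indirect hits through more specific terms, and each feature listed only once) can be illustrated with a small in-memory sketch. The abstract does not describe ATGC's actual queries against Chado, so everything here is an assumption: the term identifiers are toy labels rather than real GO IDs, the feature names are hypothetical, and plain dictionaries stand in for the database.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple

# Toy DAG: each term maps to its parent terms (a term may have several).
PARENTS: Dict[str, Set[str]] = {
    "TERM:stress": {"TERM:response"},
    "TERM:abiotic": {"TERM:response"},
    "TERM:drought": {"TERM:stress", "TERM:abiotic"},
    "TERM:defense": {"TERM:stress"},
}

# Hypothetical features and the terms they are directly annotated with.
ANNOTATIONS: Dict[str, Set[str]] = {
    "unigene_1": {"TERM:stress"},
    "unigene_2": {"TERM:drought"},
    "unigene_3": {"TERM:drought", "TERM:defense"},
    "unigene_4": {"TERM:abiotic"},
}


def descendants(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term below `term` in the DAG (breadth-first walk)."""
    children: Dict[str, Set[str]] = defaultdict(set)
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].add(child)
    seen: Set[str] = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(term)  # guards against malformed (cyclic) input
    return seen


def search_features(term: str,
                    parents: Dict[str, Set[str]],
                    annotations: Dict[str, Set[str]]
                    ) -> Dict[str, Tuple[str, List[str]]]:
    """Map each matching feature, once, to ('direct' | 'indirect', terms)."""
    below = descendants(term, parents)
    hits: Dict[str, Tuple[str, List[str]]] = {}
    for feature, terms in annotations.items():
        if term in terms:
            hits[feature] = ("direct", [term])
        else:
            via = sorted(terms & below)
            if via:
                hits[feature] = ("indirect", via)
    return hits


if __name__ == "__main__":
    found = search_features("TERM:stress", PARENTS, ANNOTATIONS)
    for feature, (kind, via) in sorted(found.items()):
        print(feature, kind, ", ".join(via))
```

Keying the result by feature is what keeps each gene to a single row per searched category, even when it reaches that category through several more specific terms.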
The sunflower project is opening new possibilities: updated ontologies are being tested to add more information, and we plan to integrate a genome browser through DMAP to enable genomic querying (see the DMAP poster in this meeting). Finally, we plan to optimize the collection management features, allowing users to create and manipulate lists of features by different criteria, and to connect the database to complementary platforms for data processing and analysis such as Galaxy and DMAP, providing a means of following an accurate protocol for data manipulation and storage.

Information provided by the agent in SIGEVA

Key Words

bioinformatics ; sunflower ; microarray ; ontology