Cyprinid Genomics
Upon completion of my PhD at the South African National Bioinformatics Institute, University of the Western Cape, in 2001, I spent six years in Singapore, where my research could broadly be described as the study of teleost genome evolution. During this period, I had productive collaborations with Profs Byrappa Venkatesh, Sydney Brenner, Laszlo Orban and Vladimir Bajic.
I have worked on various aspects of vertebrate evolution, which is characterized in part by an event 450 million years ago when ray-finned fishes diverged from the lobe-finned fishes, the lineage that gave rise to tetrapods. Teleosts comprise 99% of ray-finned fishes, and much of my work in Singapore focused on Fugu, zebrafish and carp.
Gene/genome expansion
In 1970 Ohno postulated that the origins of vertebrate innovations and new gene functions are the result of gene or genome duplications. Over the years many investigators have demonstrated tissue specificity for individual members of multi-gene families. These studies have also highlighted genes that are present in multiple copies in tetrapods but not in fish, and vice versa. Studies using the assembled human genome clearly demonstrated that mammalian genomes have undergone two rounds of genome duplication relative to invertebrates or to genomes that resemble the ancestral vertebrate state (e.g., Ciona intestinalis). The assembled Fugu genome, at that time, afforded us the opportunity to investigate the extent of the gene-specific expansions in fish that had been postulated by single-gene studies.
As part of the bioinformatics team that annotated the Fugu genome, I had the opportunity to carry out a large-scale comparison between the Fugu and human genomes (Aparicio et al., Science (2002); Christoffels et al., Molecular Biology and Evolution (2004)) in the laboratory of Prof Venkatesh. I developed a pipeline and accompanying software to generate protein families, produce phylogenetic trees to trace the origins of a gene family, and assess the evolutionary rates between lineages. The software accounted for accelerated substitutions and allowed us to discover a large number of gene duplications whose age coincided with the tetrapod-fish split. Furthermore, these duplications were localized to contiguous DNA segments, an arrangement that was statistically significant when compared to a randomized genome. Collectively, the data provided strong support for the hypothesis that fish genomes had undergone an additional round of whole-genome duplication independent of other vertebrates. Subsequent work by others has verified our analysis.
The presence of duplicated genes organized in clusters on different chromosomes is very well documented for Hox genes. These developmental genes are arranged in vertebrate genomes as arrays of at most 13 genes, present on four chromosomes. The arrangement of these genes also coincides with their spatial and temporal expression during development. During my postdoctoral fellowship we set out to sequence the entire complement of Hox genes in the coelacanth. Once thought to be extinct, the coelacanth was discovered off the east coast of South Africa in 1938. The importance of the coelacanth discovery hinges on the fact that this species represents the closest link to the ancestors of the tetrapods (four-limbed animals). The DNA segments containing the coelacanth Hox genes were sequenced in Prof Venkatesh’s laboratory, and I carried out the phylogenetic analysis to confirm the correct gene assignment (orthology) between Hox genes of different species (Koh et al., PNAS (2004)).
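The comparison of duplicate-gene arrangements against a randomized genome can be illustrated with a small permutation test. The sketch below is not the software described above; it is a minimal example that assumes each gene has a (chromosome, index) position and that duplicate pairs have already been identified. The function names, the window size and the choice of statistic (the number of duplicate pairs that sit next to another duplicate pair on both chromosomes) are hypothetical choices made for illustration.

```python
import random
from itertools import combinations


def count_colocalized(positions, pairs, window):
    """Count pairs of duplicate pairs whose members are neighbours on both sides.

    positions: dict mapping gene id -> (chromosome, index along chromosome)
    pairs: list of (gene_a, gene_b) duplicate pairs
    window: maximum index distance for two genes to count as neighbours
    """
    def near(g, h):
        (chrom_g, idx_g), (chrom_h, idx_h) = positions[g], positions[h]
        return chrom_g == chrom_h and abs(idx_g - idx_h) <= window

    total = 0
    for (a1, b1), (a2, b2) in combinations(pairs, 2):
        # Two duplicate pairs support the same duplicated segment when their
        # members are close together in both copies of the segment.
        if (near(a1, a2) and near(b1, b2)) or (near(a1, b2) and near(b1, a2)):
            total += 1
    return total


def permutation_p_value(positions, pairs, window=5, n_perm=1000, seed=0):
    """Empirical p-value for the observed clustering versus shuffled gene order."""
    rng = random.Random(seed)
    observed = count_colocalized(positions, pairs, window)
    genes = list(positions)
    slots = [positions[g] for g in genes]
    hits = 0
    for _ in range(n_perm):
        shuffled = slots[:]
        rng.shuffle(shuffled)  # randomized genome: genes are reassigned to slots
        randomized = dict(zip(genes, shuffled))
        if count_colocalized(randomized, pairs, window) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

A small p-value indicates that the duplicate pairs cluster into shared segments more often than expected when gene order is random. Shuffling all genes across all slots is the simplest null model; a real analysis might instead preserve chromosome sizes or local gene density.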
Gene duplication detection (GDD) software
The large-scale bioinformatics analyses that I established in the above studies were implemented in a web-based application when I started my own laboratory at Temasek LifeSciences upon completion of my postdoctoral fellowship. The web application has been generalized so that the steps needed to identify gene duplicates are modular, allowing end-users to configure their own analysis pipeline. The web application was implemented on a computer cluster, thereby allowing biologists to carry out large-scale analyses without the need for high-end computational skills. The advantage of this system over others is the coupling of an annotation pipeline to a visualization tool that can be queried. The latter component awaits completion before the manuscript can be submitted.
Genome annotation
Genome sequencing and assembly are followed by a detailed description of gene content in every genome project. The methods used to identify genes in a newly assembled genome are very often ab initio predictions followed by sequence-based similarity screening. In essence, known genes or proteins are used to find matching regions in the newly assembled genome. It is not difficult to imagine that two distantly related species will differ in their orthologous genes, and that these nucleotide differences will hamper the identification of similar gene sequences. For the Fugu genome annotation, we used vertebrate and invertebrate genes and proteins to identify regions of the Fugu genome that showed similarity at the DNA and protein level. Given the 400+ million years of sequence divergence, it was no surprise that this approach yielded incomplete genes. Nonetheless, we implemented this strategy using the well-established ENSEMBL pipeline. Recently, my laboratory in Singapore, in collaboration with Prof Laszlo Orban, has shown that significantly better gene identification can be obtained by using sequences from closely related species (Christoffels et al., BMC Bioinformatics (2006)). For example, zebrafish and carp are separated by 50 million years. We mapped carp transcripts to the zebrafish genome and identified zebrafish loci that had previously been supported only by computer predictions. In other words, sequences from organisms distantly related to zebrafish, such as human or mouse, showed no similarity to these computer-predicted genes, whereas a carp transcript showed sufficient similarity to the predicted loci to provide preliminary support for the gene prediction. Subsequent PCR amplification confirmed the presence of a zebrafish transcript corresponding to the initial gene prediction.
Mapping carp transcripts to the zebrafish genome also identified examples of alternative splice forms, as well as gene loci in regions that were otherwise considered gene-poor. These data highlight the importance of understanding the nucleotide composition of an organism in order to develop tools that can delineate the exact promoter or genic features relative to other organisms.
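The idea of using transcripts from a close relative to support computer-predicted loci can be sketched as a simple interval-overlap check. This is an illustration only and not the annotation pipeline used in the study: it assumes that the transcripts have already been aligned to the genome and reduced to (name, chromosome, start, end) records with 1-based inclusive coordinates, and the names Interval, overlap_length and transcript_supported_loci are hypothetical.

```python
from collections import namedtuple

# 1-based, inclusive genomic interval (an assumed convention for this sketch).
Interval = namedtuple("Interval", ["name", "chrom", "start", "end"])


def overlap_length(a, b):
    """Number of bases shared by two intervals (0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def transcript_supported_loci(predicted_loci, transcript_hits, min_overlap=1):
    """Map each predicted locus to the aligned transcripts that overlap it.

    predicted_loci: loci supported only by computer predictions
    transcript_hits: genomic footprints of aligned cross-species transcripts
    min_overlap: minimum number of shared bases required to count as support
    """
    support = {}
    for locus in predicted_loci:
        names = [
            hit.name
            for hit in transcript_hits
            if overlap_length(locus, hit) >= min_overlap
        ]
        if names:
            support[locus.name] = names
    return support
```

Loci returned by such a check would only be candidates; as described above, experimental confirmation such as PCR amplification is still needed to establish that a transcript exists.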
Sex Differentiation in Cyprinids
Our work on the annotation of fish genomes sparked a collaboration, now three years old, with Prof Laszlo Orban on identifying genes related to sex differentiation (Sreenivasan et al., PLoS ONE (2008)). The mRNA data were, and continue to be, produced in my collaborator’s laboratory. We have been developing ways of managing these data through a series of analyses. In particular, our data processing provides the material needed to print custom-made microarray slides. The underlying software, EST management and analysis protocols (EMAP), is described in a manuscript in preparation, pending modifications to bring the software to production standard. As I have relocated to South Africa, I will be completing the EMAP software at the University of the Western Cape. Even though EMAP has been designed in the context of fish genomics, its underlying principles of database management and routine screening for domains and genes of interest are common to many genome projects.
