Current Research

The origin of orphan genes
Orphan genes are genes for which a homology search finds no homologs in other organisms. In other words, they are species-specific or lineage-specific genes. Orphan genes are found in every genome analyzed, so they may play a role in evolution. We are developing a method to identify the origin of orphan genes.
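
The detection step can be illustrated with a minimal sketch. It assumes the homology search results are already available as (query, subject, E-value) tuples; the function name, the input layout, and the E-value cutoff are illustrative assumptions, not part of our method.

```python
def find_orphan_genes(hits, gene_species, max_evalue=1e-5):
    """Return genes that have no significant homolog in any other species.

    hits: iterable of (query_gene, subject_gene, evalue) tuples, for example
          parsed from the tabular output of a homology search tool.
    gene_species: dict mapping every gene ID to its species name.
    max_evalue: hypothetical significance cutoff (an assumption).
    """
    has_foreign_homolog = set()
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue  # hit is not significant
        if gene_species[query] != gene_species[subject]:
            # A cross-species hit gives both genes a homolog elsewhere.
            has_foreign_homolog.add(query)
            has_foreign_homolog.add(subject)
    return sorted(g for g in gene_species if g not in has_foreign_homolog)


gene_species = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
hits = [
    ("a1", "b1", 1e-30),  # significant hit across species
    ("a2", "a1", 1e-10),  # within-species hit only (in-paralog)
    ("b2", "a2", 0.5),    # not significant
]
print(find_orphan_genes(hits, gene_species))  # ['a2', 'b2']
```

In practice the result depends strongly on the cutoff and on which genomes are included in the search, so a gene reported here is only an orphan candidate.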

Convergent evolution of proteins
Convergent evolution of body shape has often occurred as organisms adapted to their environments. However, convergent evolution at the amino acid level is rarely reported. We are therefore trying to detect novel convergent proteins using a large amount of protein data, and to estimate the influence of convergent genes on the evolution of species.

Influence of gene-gain and gene-loss events on the evolution of a lineage
We are developing software to detect gene-gain and gene-loss events on a lineage and to estimate the functions of the genes involved. Using this software, we hope to understand how gene gain and gene loss contribute to environmental adaptation during evolution.

Inferring phylogenetic trees using a large number of genes
Recently, high-throughput sequencers have made a large amount of gene sequence data available. We therefore used simulation tests to examine the robustness of phylogenetic tree construction from a large number of genes.


Previous Research

Development of a tool (Ortholog-Finder) for making ortholog data sets (2008-2016)
Orthologs are widely used for phylogenetic analysis of species; however, identifying genuine orthologs among distantly related species is challenging, because genes obtained through horizontal gene transfer (HGT) and out-paralogs derived from gene duplication before speciation are often present among the predicted orthologs. We developed a program, "Ortholog-Finder", that obtains ortholog data sets for phylogenetic analysis from all open reading frame data of the species. The program includes five processes for minimizing the effects of HGT and out-paralogs in phylogeny construction: 1) HGT filtering: Genes derived from HGT are detected and deleted from the initial sequence data set by examining their base compositions. 2) Out-paralog filtering: Out-paralogs are detected and deleted from the data set based on sequence similarity. 3) Classification of phylogenetic trees: Phylogenetic trees generated for ortholog candidates are classified as monophyletic or polyphyletic trees. 4) Tree splitting: Polyphyletic trees are bisected to obtain monophyletic trees and remove HGT genes and out-paralogs. 5) Threshold changing: Out-paralogs are further excluded from the data set based on the difference in the similarity scores of genuine orthologs and out-paralogs. Using simulation data, we examined how out-paralogs and HGTs affected the species phylogenetic trees constructed from ortholog data sets obtained by Ortholog-Finder, and we determined the effects of confounding factors. We then used Ortholog-Finder in phylogeny construction for 12 Gram-positive bacteria from two phyla and validated each node of the constructed tree by comparison with individually constructed ortholog trees.
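
As an illustration of process 1 (HGT filtering by base composition), the sketch below flags genes whose GC content deviates strongly from the genome-wide mean. This is only one simple way to examine base composition; the z-score criterion, the cutoff value, and the function names are assumptions for illustration and are not taken from Ortholog-Finder itself.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a nucleotide sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def flag_hgt_candidates(genes, z_cutoff=2.0):
    """Return IDs of genes with atypical GC content for their genome.

    genes: dict mapping gene ID -> nucleotide sequence (one genome).
    z_cutoff: hypothetical threshold on the absolute z-score (an assumption).
    """
    gc = {gene_id: gc_content(seq) for gene_id, seq in genes.items()}
    mu = mean(gc.values())
    sd = pstdev(gc.values())
    if sd == 0:
        return []  # all genes have identical composition
    return sorted(g for g, value in gc.items() if abs(value - mu) / sd > z_cutoff)


# Nine genes with typical composition and one GC-rich gene.
genes = {f"native{i}": "ATGC" * 10 for i in range(9)}
genes["foreign"] = "GGGGCCCC" * 5
print(flag_hgt_candidates(genes))  # ['foreign']
```

A composition-based filter of this kind can miss transfers between genomes of similar composition, which is one reason the later tree-based processes are needed.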


Development of a tool for generating a phylogenetic tree with horizontal gene transfer (2010-2011)
Horizontal gene transfer (HGT) is a common event in prokaryotic evolution. Therefore, it is very important to consider HGT in the study of the molecular evolution of prokaryotes. This is also true for computer simulations of their molecular phylogeny, because HGT is known to be a serious disturbing factor in estimating the correct phylogeny. To the best of our knowledge, no existing computer program has generated a phylogenetic tree with HGT from an original phylogenetic tree. We developed a program called HGT-Gen that generates a phylogenetic tree with HGT on the basis of an original phylogenetic tree of a protein or gene. HGT-Gen moves an operational taxonomic unit or a clade from one place to another in a given phylogenetic tree. We have also devised an algorithm to compute the average length between any pair of branches in the tree. The algorithm defines and computes a relative evolutionary time to normalize evolutionary time across lineages, so that it can generate an HGT between a donor lineage and an acceptor lineage at the same evolutionary time. HGT-Gen is used together with a sequence-generating program to evaluate the influence of HGT on the molecular phylogeny of prokaryotes in computer simulation studies.
HGT-Gen
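
The topological part of this operation, moving an OTU or a clade next to a donor lineage, can be sketched as follows. The sketch represents a tree as nested tuples of unique leaf names and ignores branch lengths, so it does not implement the relative evolutionary time described above. The function names and the tree representation are assumptions for illustration and are not HGT-Gen's actual code.

```python
def contains(tree, target):
    """True if `target` (a leaf name or a clade) occurs inside `tree`."""
    if tree == target:
        return True
    if isinstance(tree, str):
        return False
    return any(contains(child, target) for child in tree)


def prune(tree, target):
    """Remove the subtree `target` and collapse nodes left with one child."""
    if tree == target:
        return None
    if isinstance(tree, str):
        return tree
    kept = [p for p in (prune(child, target) for child in tree) if p is not None]
    return kept[0] if len(kept) == 1 else tuple(kept)


def graft(tree, donor, moved):
    """Attach `moved` as the sister of the subtree `donor`."""
    if tree == donor:
        return (tree, moved)
    if isinstance(tree, str):
        return tree
    return tuple(graft(child, donor, moved) for child in tree)


def simulate_hgt(tree, donor, acceptor):
    """Gene tree after the acceptor lineage receives the gene from the donor."""
    if tree == acceptor or not (contains(tree, donor) and contains(tree, acceptor)):
        raise ValueError("donor and acceptor must be proper parts of the tree")
    if contains(donor, acceptor) or contains(acceptor, donor):
        raise ValueError("donor and acceptor must not be nested")
    return graft(prune(tree, acceptor), donor, acceptor)


tree = ((("A", "B"), "C"), ("D", "E"))
print(simulate_hgt(tree, donor="D", acceptor="A"))
# (('B', 'C'), (('D', 'A'), 'E'))
```

In the resulting gene tree, the acceptor's gene groups with the donor rather than with its true relatives, which is the signal that disturbs species phylogeny estimation.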

Characterization of mouse sex-specific DMRs (Differentially Methylated Regions) (2007-)
DNA methylation is a key mechanism of genomic imprinting. A sex-specific Differentially Methylated Region (DMR) is a genomic region in which the cytosines of CpG sites are methylated on either the paternal or the maternal allele. So far, no sequence homology, common motif, or pattern shared by DMRs has been found. Some sequence characteristics of DMRs have been reported, but none of them is shared by all DMRs. This project is currently suspended.

Phylogenetic construction of all bacterial phyla by using a large number of orthologs (2004-2007)
Constructing the correct phylogenetic tree remains a key issue in evolution. Generally speaking, a phylogenetic tree, once correctly constructed, serves as a contour map for biology. However, many lingering problems exist with the construction of the correct tree. One of them is that trees constructed from single genes are often inconsistent with one another. Several methods that use multiple genes have been proposed to address this problem. Unfortunately, none of those methods is perfect, in that they tend to yield inaccurate relationships, particularly for distantly related species, because of disturbing factors such as horizontal gene transfer, gene loss in out-paralogs, shrinkage of long branches, and/or unusual base compositions (Fitch 2000, Delsuc et al. 2005, Snel et al. 2005). Nonetheless, two of them are worth mentioning because they have been used more frequently than the others. One is the consensus tree (supertree), which is assembled from the consistent parts of individual trees, each constructed from a different data source (e.g., Bininda-Emonds et al. 2002, Daubin et al. 2002). The other is the alignment-concatenated tree, which is obtained from the concatenated multiple alignment of amino acid or nucleotide sequences (e.g., Brown et al. 2001). Both types still lose reliability because of, at the least, gene loss in out-paralogs and horizontally transferred genes (HTGs). Therefore, we first examined and refined the extant ortholog databases of bacterial genomes to exclude as many HTGs and out-paralogs as possible. We then constructed a concatenated tree of bacterial phyla by using the refined database. Furthermore, we developed a method for evaluating the nodes of a constructed tree and applied it to our concatenated tree. Our method is conceptually and methodologically different from the bootstrap test (Felsenstein 1985). Our results showed that many branching points have low confidence even when their bootstrap values are high.
In our bacterial phylogeny construction and evaluation, we focused particularly on the phylogenetic position of thermophilic eubacteria, which is directly related to the earliest eubacterial cluster. From the results of this analysis, we concluded that thermophilic eubacteria are the oldest lineage in eubacteria.
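
The concatenation step behind an alignment-concatenated tree can be sketched as below. The sketch assumes each ortholog group has already been aligned and that a species missing from a group is padded with gap characters; the function name and the padding choice are assumptions for illustration, not a description of our pipeline.

```python
def concatenate_alignments(alignments, species):
    """Join per-gene alignments into one supermatrix row per species.

    alignments: list of dicts mapping species -> aligned sequence; all
                sequences within one dict must have the same length.
    species: ordered list of species to include in the supermatrix.
    """
    rows = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for sp in species:
            # Pad species that lack this gene with gaps (an assumption).
            rows[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(parts) for sp, parts in rows.items()}


alignments = [
    {"A": "MK-L", "B": "MKAL", "C": "MRAL"},
    {"A": "GG", "C": "GA"},  # species B lacks this gene
]
for sp, row in concatenate_alignments(alignments, ["A", "B", "C"]).items():
    print(sp, row)
# A MK-LGG
# B MKAL--
# C MRALGA
```

The concatenated rows can then be passed to any tree-building program. An HTG or out-paralog left in a single gene alignment contaminates the whole supermatrix, which is why the ortholog data are refined first.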

A phylogenetic tree based on gene content suggests that eukaryotes originated from the symbiosis of Pyrococcus into gamma-proteobacteria (2002-2004)
We previously suggested, by "Homology Hit Analysis" using all the genes of many genomes, that eukaryotes originated from the symbiosis of archaea into eubacteria. In those analyses, the species participating in the symbiosis were not clarified. Here, we carried out a phylogenetic tree analysis of a eukaryote (yeast), 13 archaea and 49 eubacteria to determine their evolutionary positions, because this makes it possible to identify a symbiont even when it lies at an ancestral position (e.g., the ancestor of A and B is a symbiont). In addition, the effect of gene duplication after speciation was not addressed in the previous analyses. To avoid that effect, we grouped duplicated genes as paralogs. Furthermore, we separated eukaryotic paralogs into three groups by homology to archaea, eubacteria (other than alpha-proteobacteria) and alpha-proteobacteria, and treated each group as an individual organism. We then compared the sequences of the three eukaryotic groups and the 62 prokaryotes for which genome sequence data were available, and detected reciprocal best-hit pairs as orthologs. The evolutionary distance between two species was defined as 1/N, where N is the number of orthologs between them. Using this distance, we constructed a phylogenetic tree of the 62 prokaryotes and the three eukaryotic groups by the neighbor-joining method. The prokaryotic part of this tree is similar in shape to the phylogenetic tree of 16S rRNA. The branches of "archaea-related paralogs", "eubacteria (other than alpha-proteobacteria)-related paralogs" and "alpha-proteobacteria-related paralogs" diverged from the lineages of Pyrococcus, gamma-proteobacteria and alpha-proteobacteria, respectively. Most of the bootstrap values at the divergence points are larger than 90%.
It is also clear that "archaea-related paralogs", "eubacteria (other than alpha-proteobacteria)-related paralogs" and "alpha-proteobacteria-related paralogs" mainly have functions related to genetic information, operation and mitochondria, respectively. These results suggest that eukaryotes originated from the symbiosis of Pyrococcus into gamma-proteobacteria.
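
The 1/N distance described above can be sketched as follows. The sketch assumes the best hit of every gene against every other species has already been computed; the data layout and function names are illustrative assumptions, and assigning an infinite distance to pairs with no orthologs is a choice made here rather than something stated in the study.

```python
from itertools import combinations


def reciprocal_best_hits(best_xy, best_yx):
    """Pairs (x_gene, y_gene) that are each other's best hit."""
    return {(x, y) for x, y in best_xy.items() if best_yx.get(y) == x}


def gene_content_distances(best_hits, species):
    """Distance 1/N for every species pair, N = number of orthologs.

    best_hits: dict mapping (X, Y) -> {gene in X: its best-hit gene in Y},
               given for both directions of every pair.
    Returns a dict mapping (X, Y) -> distance; pairs without any ortholog
    get float('inf') (an assumption).
    """
    distances = {}
    for x, y in combinations(species, 2):
        n = len(reciprocal_best_hits(best_hits[(x, y)], best_hits[(y, x)]))
        distances[(x, y)] = 1.0 / n if n else float("inf")
    return distances


best_hits = {
    ("A", "B"): {"a1": "b1", "a2": "b2", "a3": "b2"},
    ("B", "A"): {"b1": "a1", "b2": "a2"},
    ("A", "C"): {"a1": "c1"},
    ("C", "A"): {"c1": "a1"},
    ("B", "C"): {"b1": "c1"},
    ("C", "B"): {"c1": "b2"},
}
print(gene_content_distances(best_hits, ["A", "B", "C"]))
# {('A', 'B'): 0.5, ('A', 'C'): 1.0, ('B', 'C'): inf}
```

The resulting pairwise distances form the matrix that a neighbor-joining implementation takes as input. Because the distance depends only on the number of shared orthologs, it reflects gene content rather than sequence divergence.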

Homology-hit analysis reveals the origin of eukaryotic cell nuclei by symbiosis of Archaea in Bacteria (1998-2002)
The origin of eukaryotes has been one of the biggest issues in evolutionary biology, because the phylogenies of individual eukaryotic genes are not consistent. Whereas most studies used data from individual genes for phylogenetic analysis, we used as many homologous genes as possible between a eukaryote (yeast) and 15 prokaryotes whose genome data were available. At that time, this 16-species genome data set was larger than any previously used in a genomic analysis. We classified the eukaryotic genes into 43 functional categories, which makes it possible to discuss the origin of the genes for each function. Genes related to mitochondria were removed from these categories to avoid the influence of mitochondrial genes. With this data set, we determined the origin of each functional gene category of eukaryotes by a newly developed method, "homology-hit analysis". The results show that the categories related to genetic information (DNA replication, transcription, translation, and so on) derived from archaea, and those related to operation (cellular operational proteins, including metabolism) derived from eubacteria. Some earlier reports had shown that certain genes derived from eubacteria, and the researchers attributed such genes to mitochondria, which are descendants of alpha-proteobacteria. However, our research shows that operational genes derived from eubacteria even in a data set from which mitochondria-related genes had been removed. Since proteins for genetic information are mostly located in nuclei and those for operation are mostly located in the cytoplasm, this suggests that eukaryotic cell nuclei are derived from archaea and the cytoplasm is derived from eubacteria.
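
The idea of assigning an origin to each functional category can be illustrated with a simplified sketch. It assumes that each yeast gene has already been labelled with the domain of its best prokaryotic hit and that a category is assigned to the domain with more hits. This majority rule, the input layout, and the function name are assumptions for illustration; the actual homology-hit analysis is not specified in this summary and may differ.

```python
from collections import Counter, defaultdict


def category_origins(gene_category, best_hit_domain):
    """Summarize, per functional category, where the genes' best hits fall.

    gene_category: dict mapping yeast gene -> functional category.
    best_hit_domain: dict mapping yeast gene -> "archaea" or "eubacteria";
                     genes without a prokaryotic hit are simply absent.
    Returns dict mapping category -> (origin, archaeal hits, eubacterial hits).
    """
    counts = defaultdict(Counter)
    for gene, category in gene_category.items():
        domain = best_hit_domain.get(gene)
        if domain is not None:
            counts[category][domain] += 1

    origins = {}
    for category, counter in counts.items():
        archaeal, eubacterial = counter["archaea"], counter["eubacteria"]
        if archaeal > eubacterial:
            origin = "archaea"
        elif eubacterial > archaeal:
            origin = "eubacteria"
        else:
            origin = "undetermined"
        origins[category] = (origin, archaeal, eubacterial)
    return origins


# Toy data with hypothetical gene IDs.
gene_category = {
    "g1": "translation", "g2": "translation", "g3": "translation",
    "g4": "metabolism", "g5": "metabolism", "g6": "metabolism",
}
best_hit_domain = {
    "g1": "archaea", "g2": "archaea", "g3": "eubacteria",
    "g4": "eubacteria", "g5": "eubacteria",  # g6 has no prokaryotic hit
}
print(category_origins(gene_category, best_hit_domain))
# {'translation': ('archaea', 2, 1), 'metabolism': ('eubacteria', 0, 2)}
```

A real analysis would also need a statistical criterion for deciding when the difference between the two counts is meaningful.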