Ortholog-Finder: A tool for constructing an ortholog dataset

Ortholog-Finder is a program for obtaining ortholog datasets from large collections of sequence data for use in phylogenetic analysis. The program uses novel processes to minimize genes derived from horizontal gene transfer (HGT) and out-paralogs in ortholog datasets. Ortholog-Finder can construct a concatenated tree from the generated ortholog dataset by the neighbor-joining (NJ) method. The program does not support the maximum likelihood method or the Bayesian method because the calculation times required for choosing the optimal substitution model and constructing phylogenetic trees are extremely long. For users who wish to use these methods, Ortholog-Finder provides concatenated sequence data of orthologs in the PHYLIP format, which can be used in other tree-building programs.

System requirements:
Ortholog-Finder runs on Linux and has been tested on Ubuntu 12.04 LTS and CentOS 6.5. The computation time is approximately 12 h for data on 12 gram-positive bacteria on an ordinary PC (Core i5 2.8 GHz). If the HGT-filter option is OFF, the computation time is around 1 h.
With a virtual machine such as VirtualBox, Ortholog-Finder can also be used on Windows or Mac OS X. The virtual machine is downloadable.

Software requirements:
BLAST+ (Altschul et al. 1990), FastTree (Price et al. 2010), MAFFT (Katoh et al. 2002), Gblocks (Castresana et al. 2000), BioPerl (Stajich et al. 2002), EMBOSS (Rice et al. 2000), mcl (Van Dongen 2000), and OrthoMCL (Li et al. 2003). The Java Runtime Environment must also be installed.

Download:
Ortholog-Finder is downloadable. The latest version is 4.07. [Updated Sep. 13, 2016: FastTree is used to infer phylogenetic trees instead of clustalw (NJ method).]

Installation:
  • Installing Ortholog-Finder and the required programs on Ubuntu.
  • Installing Ortholog-Finder and the required programs on CentOS.
  • A virtual machine for VirtualBox is downloadable. Ortholog-Finder and the other required programs are preinstalled in the virtual machine, so users can easily try Ortholog-Finder with it.

Input files:
Input files in the GenBank format (*.gbk) or multi-FASTA format (amino acid sequences, *.faa) are required for each species selected. If the input files are in GenBank format, Ortholog-Finder can apply HGT filtering to infer HGT genes and remove them from the ortholog dataset. If the input files are in multi-FASTA format, HGT filtering is not available. Each species must be assigned to one of 2 taxonomic groups so that out-paralogs and HGT events can be detected. All files must be placed in the "Group1" or "Group2" directory. File names should follow the form "genus_species_xxx.xxx". Example GenBank and FASTA files are available on the NCBI website.
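
The layout rules above can be checked before a run with a few lines of Python. This is only a sketch and not part of Ortholog-Finder: the function name check_input_layout and the regular expression are hypothetical, and the expression reflects one reading of "genus_species_xxx.xxx" (genus, species, a free label, then a .gbk or .faa extension).

```python
import re
from pathlib import Path

# Assumption: "genus_species_xxx.xxx" is read as <genus>_<species>_<label>.<ext>,
# with .gbk (GenBank) or .faa (amino acid multi-FASTA) as the extension.
NAME_PATTERN = re.compile(r"^[A-Za-z]+_[A-Za-z]+_[^.]+\.(gbk|faa)$")
GROUP_DIRS = ("Group1", "Group2")


def check_input_layout(root):
    """Return a list of human-readable problems found under `root`."""
    problems = []
    root = Path(root)
    for group in GROUP_DIRS:
        group_dir = root / group
        if not group_dir.is_dir():
            problems.append(f"missing directory: {group}")
            continue
        files = sorted(p for p in group_dir.iterdir() if p.is_file())
        if not files:
            problems.append(f"no input files in {group}")
        for path in files:
            # Flag anything that does not look like genus_species_xxx.gbk/.faa.
            if not NAME_PATTERN.match(path.name):
                problems.append(f"unexpected file name: {group}/{path.name}")
    return problems
```

An empty list means both directories exist and every file name matches the assumed pattern. The check says nothing about whether the file contents are valid GenBank or FASTA.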

Output files:
Ortholog sequence data can be output in two formats. One is a multi-FASTA format in which the sequence data of each ortholog are saved in a separate file. The other is a multi-FASTA-like format that includes all ortholog sequence data in one file. In addition, 2 alignment-concatenated trees are constructed using the NJ method: one shows bootstrap values, and the other shows ortholog-support percentages. For users who want to construct a concatenated phylogenetic tree by the maximum likelihood method or the Bayesian method, Ortholog-Finder provides concatenated sequence data of orthologs in the PHYLIP format, which can be used in other tree-building programs. If HGT filtering is enabled, the inferred HGT sequences and the remaining sequences are saved separately, and these data can be used for other analyses.
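
For a quick look at a per-ortholog multi-FASTA file, a small reader such as the following may be enough. It is a generic sketch that assumes standard FASTA conventions (a ">" header line followed by sequence lines). The header layout used by Ortholog-Finder is not described here, so the example headers below are invented, and the combined "multi-FASTA-like" file is not handled.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of multi-FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Invented example records; real headers may look different.
example = [">Bacillus_subtilis_gene1", "MKV", "LAA", "",
           ">Listeria_monocytogenes_gene7", "MKI"]
for name, seq in read_fasta(example):
    print(name, len(seq))
```

To read a real file, pass an open file handle instead of the example list.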

Parameters:
  • Linux:
    Ortholog-Finder is available for Ubuntu 32bit, Ubuntu 64bit, CentOS 32bit, and CentOS 64bit. Choose one under the "Linux" item.
  • HGT Filtering:
    ON = enable, OFF = disable
    ON (Use HGT-filter only) = Ortholog data are not generated. This mode is used only for detecting HGT data.
  • HGT Filtering P-value:
    The threshold of the chi-square test for inferring HGT is adjustable.
  • E-value start:
    The BLAST threshold at the first cycle is adjustable.
  • E-value end:
    The BLAST threshold at the last cycle is adjustable.
  • Tree split:
    ON = enable, OFF = disable
  • Minimum number of species belonging to Group 1:
    The threshold for the number of species belonging to Group 1 is adjustable. If an ortholog candidate contains fewer species than the threshold, its ortholog data are discarded.
  • Minimum number of species belonging to Group 2:
    The threshold for the number of species belonging to Group 2 is adjustable.
  • Maximum number of paralogs in an ortholog candidate for tree splitting:
    The threshold for the number of paralogs in an ortholog candidate is adjustable. If the number of paralogs exceeds the threshold, tree splitting is skipped.
  • Parallel job number:
    The number of CPU cores to use is adjustable. HGT filtering and BLAST are run in parallel.
  • Output file:
    The ortholog sequence data are output in a multi-FASTA-like format that includes all ortholog sequence data in one file. If users choose "Orthologs are saved in each file," the ortholog sequence data are also output in multi-FASTA format, with a separate file for each ortholog.
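
The species thresholds and the paralog limit described above can be illustrated with a short Python sketch. It is not Ortholog-Finder's own code. The data layout (a list of (group, species) pairs, one per sequence) and all function names are hypothetical. Applying the Group 2 threshold the same way as the Group 1 threshold is an assumption, as is counting paralogs as the sequences beyond the first from each species.

```python
def passes_species_thresholds(members, min_group1, min_group2):
    """members: list of (group, species) pairs, one per sequence; group is 1 or 2."""
    group1 = {species for group, species in members if group == 1}
    group2 = {species for group, species in members if group == 2}
    # Assumption: Group 2 is filtered by the same rule as Group 1.
    return len(group1) >= min_group1 and len(group2) >= min_group2


def count_paralogs(members):
    """Assumption: every sequence beyond the first per species is a paralog."""
    return len(members) - len(set(members))


def should_split_tree(members, tree_split_on, max_paralogs):
    """Splitting is skipped when the option is OFF or the limit is exceeded."""
    if not tree_split_on:
        return False
    return count_paralogs(members) <= max_paralogs


# Hypothetical candidate: two Group 1 species, one Group 2 species with 2 copies.
candidate = [(1, "A_sp"), (1, "B_sp"), (2, "C_sp"), (2, "C_sp")]
print(passes_species_thresholds(candidate, min_group1=2, min_group2=1))
print(should_split_tree(candidate, tree_split_on=True, max_paralogs=1))
```

Counting distinct species rather than sequences keeps a candidate with many copies from a single species from passing the species thresholds.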

Getting started (Ortholog-Finder with single linkage):
Run start-s.sh to open the window. Set the parameters. Click "Go" to run the program.

Getting started (Ortholog-Finder with OrthoMCL):
Run start-m.sh to open the window. Set the parameters. Click "Go" to run the program.

[Screenshot: Ortholog-Finder window]

Publication:
Genome Biol Evol. 8(2):446-457 2016.