Identification of highly conserved elements (HCEs) in a set of genomes

The iHCE software package implements the method described in [1]. It identifies HCEs in a set of relatively well-assembled complete genomes. The programs have been evaluated on nuclear genomes of the superphylum Alveolata [1], as well as on mitochondrial genomes of ciliates (the phylum Ciliophora) [2] and of monocotyledonous plants. The package consists of three programs intended for MPI-enabled supercomputers, corresponding to the three stages of the method set forth in [1].

  1. The PairHits program finds all pairs of approximately matching words in two sequences from different genomes, thus generating edges of the source graph.
  2. The BldGraph program compacts the source graph, converting it into the initial multipartite graph (each part corresponds to a genome).
  3. The FinDense program transforms the initial graph into the final one and identifies m-dense subgraphs (clusters) in the latter graph. Each cluster is a connected component composed of vertices (i.e., similar words) that belong to at least m parts and are connected by edges of the greatest total weight.
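
The cluster criterion in stage 3 can be illustrated with a minimal sketch. It is not the FinDense implementation: it only finds connected components and keeps those whose vertices span at least m parts, and it does not reproduce the transformation of the initial graph into the final one or the edge-weight criterion described in [1]. The names Vertex, Edge, DisjointSet and componentsSpanningParts are hypothetical and chosen for illustration only.

```cpp
#include <cstddef>
#include <map>
#include <numeric>
#include <set>
#include <utility>
#include <vector>

// Hypothetical vertex: a word tagged with the index of its genome (part).
struct Vertex {
    int part;
};

// Hypothetical undirected edge between two vertex indices.
using Edge = std::pair<std::size_t, std::size_t>;

// Disjoint-set (union-find) structure with path halving.
class DisjointSet {
public:
    explicit DisjointSet(std::size_t n) : parent_(n) {
        std::iota(parent_.begin(), parent_.end(), std::size_t{0});
    }
    std::size_t find(std::size_t x) {
        while (parent_[x] != x) {
            parent_[x] = parent_[parent_[x]];  // shorten the path as we go
            x = parent_[x];
        }
        return x;
    }
    void unite(std::size_t a, std::size_t b) { parent_[find(a)] = find(b); }

private:
    std::vector<std::size_t> parent_;
};

// Returns the connected components whose vertices belong to at least m
// distinct parts. Edge weights are deliberately ignored in this sketch.
std::vector<std::vector<std::size_t>> componentsSpanningParts(
    const std::vector<Vertex>& vertices,
    const std::vector<Edge>& edges,
    std::size_t m) {
    DisjointSet ds(vertices.size());
    for (const Edge& e : edges) ds.unite(e.first, e.second);

    // Group vertex indices by the representative of their component.
    std::map<std::size_t, std::vector<std::size_t>> members;
    for (std::size_t v = 0; v < vertices.size(); ++v) {
        members[ds.find(v)].push_back(v);
    }

    std::vector<std::vector<std::size_t>> result;
    for (const auto& entry : members) {
        std::set<int> parts;
        for (std::size_t v : entry.second) parts.insert(vertices[v].part);
        if (parts.size() >= m) result.push_back(entry.second);
    }
    return result;
}
```

With five vertices in parts 0, 1, 2, 0, 1 and edges 0-1, 1-2, 3-4, calling componentsSpanningParts with m = 3 keeps only the component made of vertices 0, 1 and 2, because the other component covers just two parts.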

These programs are designed to process very large data sets and run only on 64-bit CPUs and operating systems. The stages of the algorithm differ in computational complexity and scalability, which is why the package is split into separate programs. To reduce file sizes and speed up computation, the programs often use specific data formats with almost no validity checks. The user is fully responsible for correctly composing and interpreting files in these formats. For example, the user might store the source data in a database in any desired format and develop a database application or script that exports source files in the required format. This is the approach we took, but it is not described here in detail.

All programs are written in C++ and have a command-line interface for specifying the most important parameters. Settings given on the command line have the highest priority. All adjustable parameters can be set in the configuration file, which is required; the programs take from it every parameter that is not overridden on the command line. If a parameter is specified neither on the command line nor in the configuration file, the program uses a default value, although not every parameter has one. A template configuration file is provided in the downloadable examples below. Running a program with the argument -? or --help displays short help on the command-line options.
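
The precedence rule described above (command line first, then configuration file, then built-in default, if any) can be sketched as follows. This is an illustration under stated assumptions, not code from the package: the Settings type, the resolveParameter function and the parameter names used in the example are hypothetical, and the actual option names and configuration file syntax are those given in the template configuration file and in the --help output.

```cpp
#include <initializer_list>
#include <map>
#include <optional>
#include <string>

// Hypothetical representation of one settings source: name -> value.
using Settings = std::map<std::string, std::string>;

// Looks a parameter up on the command line first, then in the configuration
// file, then among built-in defaults. Returns std::nullopt when the parameter
// is set nowhere and has no default.
std::optional<std::string> resolveParameter(const std::string& name,
                                            const Settings& commandLine,
                                            const Settings& configFile,
                                            const Settings& defaults) {
    for (const Settings* source : {&commandLine, &configFile, &defaults}) {
        auto it = source->find(name);
        if (it != source->end()) return it->second;
    }
    return std::nullopt;
}
```

For instance, if a hypothetical parameter word_length is set to 16 on the command line and to 12 in the configuration file, resolveParameter returns 16. A parameter that appears in none of the three sources yields an empty result, which corresponds to the case where no default exists and the user must supply a value.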

Windows 64-bit executables (variants with and without MPI) and the source code for Linux can be downloaded through the links below. The source code is compatible with most implementations of MPI v.1.2 and above and is provided under the GNU General Public License (GPL) v.3.
The MPI-enabled executables require MPICH2 64-bit v.1.4.1p1 (the last version developed for Windows) to be installed on the system. The user can download the complete release from the developer's site or use the link below to get just the 64-bit installer.
An alternative variant, supported by a separate set of executables, requires Microsoft MPI v.7.1 64-bit. The redistributable setup file is available through the link below.

Downloadable files

Windows 64-bit executable   Variant without MPI        Variant with MPICH2 1.4.1p1   Variant with Microsoft MPI 7.1
PairHits                    pairhits64nompi-1.12.zip   pairhits64-1.12.zip           pairhits64ms-1.12.zip
BldGraph                    bldgraph64nompi-2.16.zip   bldgraph64-2.16.zip           bldgraph64ms-2.16.zip
FinDense                    findense64nompi-1.6.zip    findense64-1.6.zip            findense64ms-1.6.zip

Test example for Windows: ihce-wintest-4.34.zip
MPICH2 1.4.1p1 installable file for Windows 64-bit: mpich2-1.4.1p1-win-x86-64.msi
Microsoft MPI 7.1 redistributable for Windows 64-bit: MSMpiSetup.exe
iHCE v.4.34 source code and test example for Linux (GNU GPL v.3): ihce-src-4.34.tgz

References

  1. L.I. Rubanov, A.V. Seliverstov, O.A. Zverkov, V.A. Lyubetsky. A method for identification of highly conserved elements and evolutionary analysis of superphylum Alveolata. BMC Bioinformatics, 2016, Vol. 17, Art. 385. DOI: 10.1186/s12859-016-1257-5
  2. R.A. Gershgorin, K.Yu. Gorbunov, O.A. Zverkov, L.I. Rubanov, A.V. Seliverstov, V.A. Lyubetsky. Highly conserved elements and chromosome structure evolution in mitochondrial genomes in ciliates. Life, 2017, Vol. 7, Iss. 1, Art. 9. DOI: 10.3390/life7010009