`lossgainRSL`: a program for prediction of gene losses and gains between several groups of species

The program lossgainRSL allows the user, for a given reference species, to identify its genes that are present in and/or absent from several groups of species in accordance with a given logical function (predicate). The elementary condition is “gene X is present in the group of species S”, and the predicate is a Boolean function composed of such conditions using the logical connectives AND, OR, NOT. A group of species does not need to be a taxonomic group; it can be formed on the basis of arbitrary traits. In particular, a species may belong to multiple groups, and S may consist of a single species. For a gene X to be present in the group S, X must be present in at least p species from S (p is a parameter of the group). The presence of gene X from the reference species in some other species means that (1) that species includes a gene X', which is an ortholog or close homolog of X, and (2) there are several distinct genes (“witnesses”) near the gene X that have respective orthologs located near X' in that species. Thus, synteny is taken into account. The groups of species, the predicate, the number of witnesses and other synteny details, the range of nearness, and the extent of homology are all program parameters, which makes lossgainRSL applicable to a wide range of studies.
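
The way elementary conditions combine into a predicate can be illustrated with a minimal Python sketch. It is not the program's implementation: the `presence` table, the gene names, the group contents and the thresholds are all hypothetical, and the synteny check that decides presence in a single species is assumed to have been done already.

```python
from typing import Dict, Set

# Hypothetical data: for each gene of the reference species, the set of
# species in which it was judged present (ortholog plus enough witnesses).
presence: Dict[str, Set[str]] = {
    "geneA": {"human", "rat", "pig"},
    "geneB": {"rat", "pig", "rhesus"},
    "geneC": {"capuchin"},
}

def present_in_group(gene: str, group: Set[str], p: int) -> bool:
    """Elementary condition: the gene is present in at least p species of the group."""
    return len(presence.get(gene, set()) & group) >= p

# Two illustrative groups; a group may consist of a single species.
HUMAN = {"human"}
OTHERS = {"capuchin", "rhesus", "pig", "rat"}

def predicate(gene: str) -> bool:
    """NOT present in HUMAN (p=1) AND present in OTHERS (p=2)."""
    return (not present_in_group(gene, HUMAN, 1)) and present_in_group(gene, OTHERS, 2)

# Genes of the reference species that satisfy the predicate.
selected = sorted(g for g in presence if predicate(g))
print(selected)
```

Here only `geneB` satisfies the predicate: `geneA` is present in human, and `geneC` is found in just one species of the second group.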

The program is written in C++ as a command line utility for Windows/Linux. The command syntax is

lossgainRSL [-x|--x] [-qN] [-gM] [-n] [%name=value...] [config] [outfile]

and is case insensitive, except for file names in Linux. Running the program with any of the arguments ?, -? or -h displays short help on the command line. All lossgainRSL arguments are optional and have the following functions:

-x: Option -x specifies that the program will output only the genes identified in the reference species along with their orthologs in other species, even if entire synteny blocks (i.e., including witnesses) were requested in the configuration file. Conversely, if --x is specified, the program will output the genes of any species together with the witnesses in that species, irrespective of the configuration setting.
-qN: Shortens the program log written to the console. The default, N=0, gives the most detailed log. If -q3 is specified, the program outputs only the final result, producing the shortest log.
-gM: Controls the frequency of progress output during the search for candidate genes. The value of M is the step at which serial numbers of genes are displayed for each genomic sequence under analysis (chromosome, scaffold, contig, etc.). A value of zero disables this output. A step of 10 is used by default.
-n: Prevents the program from using MPI. If lossgainRSL fails because MPI is not installed in the OS, first try specifying this option. If the error persists, use the program version compiled without MPI support (see downloadable files below).
%name=value: The configuration file may refer to numerical parameters by name (in the form %name, where name is an identifier); the effective values of such parameters are then specified on the command line by arguments of this form. A separate argument is required for each named parameter; the order is arbitrary. Values may be integers or fractional numbers, written in fixed or scientific notation. To use the command in Windows scripts, the percent character should be doubled.
config: The name of the configuration file, including an optional path. The default is the file config.ini in the current directory.
outfile: The name of the output file, including an optional path. The default is the file result.txt in the current directory.
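
As an illustration of the %name=value convention, the following Python sketch collects such arguments from a command line into a dictionary. It is only a sketch under stated assumptions: the parameter names `minScore` and `witnesses` are hypothetical, and lower-casing the names is an assumption based on the case insensitivity of the command line rather than a documented rule.

```python
from typing import Dict, List

def parse_named_parameters(argv: List[str]) -> Dict[str, float]:
    """Collect %name=value arguments into a dict keyed by lower-cased name."""
    params: Dict[str, float] = {}
    for arg in argv:
        if not arg.startswith("%") or "=" not in arg:
            continue  # an option or a file name, not a named parameter
        name, _, value = arg[1:].partition("=")
        # float() accepts integers, fixed-point and scientific notation.
        params[name.lower()] = float(value)
    return params

# Hypothetical command line: options, two named parameters, then file names.
example_args = ["-q3", "%minScore=1e-5", "%witnesses=3", "config.ini", "result.txt"]
named = parse_named_parameters(example_args)
print(named)
```

The order of the named arguments does not matter, because each one is stored under its own key.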

The program writes its log to the standard output stream stdout (the console by default), which can be redirected to a file if desired. If requested in the configuration file, debug data are written to the standard error stream stderr (also the console by default); it can be redirected to a file with the 2>logfile parameter, where logfile is the file name.

If MPI support is provided by the OS, the program can run as multiple processes. The following command line should be used:

mpirun -np P lossgainRSL [-x|--x] [-qN] [-gM] [%name=value...] [config] [outfile]

where mpirun is the name of the MPI launcher program (mpiexec in Windows with Microsoft MPI installed) and P is the number of logical CPUs (e.g. cores) on which to run lossgainRSL. The other arguments are the same as in single-CPU mode.

The configuration file controls all aspects of the program's operation and is required for its use. This text file consists of five sections; each one starts with a line containing the section name in brackets. Lines beginning with a semicolon (;) or a double slash (//) are ignored, as is the part of any line from a double slash onward. Within a line, fields are separated by tab character(s); line wrapping is prohibited. Configuration examples with short comments are provided in the test example; a detailed description is available in the user's manual.
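
The lexical rules above (bracketed section headers, two comment styles, tab-separated fields) can be sketched in Python as follows. This is an illustrative reader, not the program's parser: the section names `SectionA` and `SectionB` and the field values are placeholders, since the real section names are given only in the manual, and treating a run of tabs as one separator is an assumption drawn from the wording "tab character(s)".

```python
from typing import Dict, List, Optional

def parse_config(text: str) -> Dict[str, List[List[str]]]:
    """Split configuration text into sections of tab-separated field lists."""
    sections: Dict[str, List[List[str]]] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        if raw.lstrip().startswith(";"):
            continue  # whole-line comment introduced by a semicolon
        # Drop everything from a double slash onward (covers whole-line // too).
        line = raw.split("//", 1)[0].rstrip()
        if not line.strip():
            continue
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            current = stripped[1:-1]
            sections[current] = []
        elif current is not None:
            # Consecutive tabs are treated as a single separator.
            sections[current].append([f for f in line.split("\t") if f])
    return sections

sample = (
    "; whole-line comment\n"
    "[SectionA]\n"
    "Mouse\t\t1\t// trailing comment\n"
    "// another comment\n"
    "[SectionB]\n"
    "Rat\t2\n"
)
config = parse_config(sample)
print(config)
```

One consequence of the comment rule is that a field value cannot itself contain a double slash, which this sketch would also cut off.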

The current version of lossgainRSL is primarily oriented toward the species and genomic data included in the Ensembl database, but the program manual also describes other possibilities. Specifically, the program addspecies (downloadable below) can help to add a few species from GenBank. The program lossgainRSL uses three types of input data:

Table of genes: It contains IDs of proteins and genes, gene coordinates, names and annotations. A separate text file in TSV (tab-separated values) format is required for each species involved in the task under consideration. The file name must coincide with the name of the species in the configuration file; an optional path and extension (common to all tables of genes) should be specified there too. This table can be formed through a direct SQL query to the Ensembl database or interactively using the BioMart data-mining tool (see the program manual for details). A portion of a table of genes imported to Excel is shown in the Rat_genes sheet of the file dataformat.xlsx.
Table of orthologs: A separate TSV-formatted file is required for the reference species and for each substitute species in a 3-species condition (see the program manual for details). Each line of the file contains two IDs of orthologous genes, the first belonging to the reference/substitute species and the second to another species involved in the task. Heading or empty lines are ignored. The file name must coincide with the name of the reference/substitute species in the configuration file; an optional path and extension (common to all tables of orthologs) should be specified there too. The table can be formed by merging the sub-tables for each pair (reference_species, other_species) obtained from the Ensembl database interactively or through an SQL query. The opening portion of a table of orthologs imported to Excel is shown in the Mouse_orthologs sheet of the file dataformat.xlsx.
Table of paralogs: This table is necessary if the selection of orthologous genes is to be performed with paralogs taken into account. A separate TSV-formatted file is required for each species where paralogs are to be considered. The file is similar to the table of orthologs, but each line contains the IDs of two paralogous genes from the same species. The file name must coincide with the name of the reference/substitute species in the configuration file; an optional path and extension (common to all tables of paralogs) should be specified there too. This table can be formed from the Ensembl database interactively or through an SQL query. The opening portion of a table of paralogs imported to Excel is shown in the Mouse_paralogs sheet of the file dataformat.xlsx.
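
A table of orthologs or paralogs in the two-column layout described above could be read as in the following Python sketch. It is a hedged illustration only: the gene IDs are invented, and because the text does not say how the program recognises a heading line, the caller states explicitly whether one is present (the `has_header` flag is an assumption of this sketch).

```python
from collections import defaultdict
from typing import Dict, Set

def read_gene_pairs(text: str, has_header: bool = False) -> Dict[str, Set[str]]:
    """Map each gene ID in the first column to the set of IDs paired with it."""
    pairs: Dict[str, Set[str]] = defaultdict(set)
    lines = text.splitlines()
    if has_header:
        lines = lines[1:]  # skip the heading line
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) < 2:
            continue  # empty or incomplete line
        pairs[fields[0]].add(fields[1])
    return dict(pairs)

# Hypothetical content: a heading, three pairs and an empty line.
sample_table = "ref_gene\tother_gene\nG1\tH1\n\nG1\tR1\nG2\tH2\n"
orthologs = read_gene_pairs(sample_table, has_header=True)
print(orthologs)
```

A set is used for the second column because one gene of the reference species may have orthologs in several other species within the same merged table.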

As an alternative to the collection of tables of orthologs and paralogs, lossgainRSL can use three other types of information on homologous genes (see the program manual for details):

Matrix of the protein likeness: A separate TSV-formatted file is required for each species involved in the task under consideration. The file name must coincide with the species name in the configuration file; an optional path and extension (common to all such matrices) should be specified there too. Each line of the file must contain four fields: two IDs of homologous proteins (the first from the given species, the second from any species), the raw score, and the E-value corresponding to the optimal alignment of these proteins. This file can be formed with BLASTP, provided that the amino acid sequences for all protein IDs have been prefetched from Ensembl or GenBank. The opening portion of such a matrix imported to Excel is shown in the Xenopus_scores sheet of the file dataformat.xlsx.
Table of protein clusters: A single TSV-formatted file is used, produced by the protein clustering program. This table contains one protein/gene per line, and the lines are grouped into clusters of orthologous genes; the clusters are separated by a blank line. Each gene occurs at most once, represented by the splicing variant that yields the protein with the greatest likeness. Clusters consisting of only one gene (singletons) are omitted. Except for the protein/gene IDs, other fields are ignored; all necessary data on genes are obtained from the independent tables of genes (see above). The opening portion of this table imported to Excel is shown in the Gene_clusters sheet of the file dataformat.xlsx.
Table of orthogroups: A single TSV-formatted file is used, formed from the OrthoDB database. This table is similar in content and format to the union of the above-described tables of genes. It contains one protein/gene per line, and the lines are grouped into groups of orthologous genes; the orthogroups are separated by a blank line. Each protein/gene line must include a field with a species name corresponding to one of those in the configuration file (unknown species are ignored). Separate tables of genes are not required in this case because all necessary data are contained in the table of orthogroups. The opening portion of this table imported to Excel is shown in the Orthogroups sheet of the file dataformat.xlsx.
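
The blank-line grouping shared by the table of protein clusters and the table of orthogroups can be sketched in Python as below. This is an assumption-laden illustration: the IDs are invented, and the position of the ID column (`id_column`, the first field here) is hypothetical, since the actual column layout is shown only in dataformat.xlsx.

```python
from typing import List

def read_clusters(text: str, id_column: int = 0) -> List[List[str]]:
    """Group IDs into clusters, starting a new cluster after each blank line."""
    clusters: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        if not line.strip():
            if current:  # a blank line closes the cluster collected so far
                clusters.append(current)
                current = []
            continue
        current.append(line.split("\t")[id_column])
    if current:  # the last cluster may not be followed by a blank line
        clusters.append(current)
    return clusters

# Hypothetical table: two clusters, with extra fields that are ignored.
sample_clusters = "P1\tinfo\nP2\tinfo\n\nP3\tinfo\nP4\tinfo\nP5\tinfo\n"
clusters = read_clusters(sample_clusters)
print(clusters)
```

Consecutive blank lines are tolerated, because an empty cluster is never appended.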

The program lossgainRSL writes its results to a TSV-formatted file suitable for direct import into an Excel spreadsheet. The file contains all found genes of the reference species that satisfy the configured predicate and parameters. The output can optionally contain the entire synteny block for each gene (including witnesses) as well as homologous genes in other species. An example output file from the test example, imported to Excel, is given in the Result sheet of the file dataformat.xlsx; it lists the mouse genes on the Y chromosome that are absent from the Y chromosome in human but present in at least two species among capuchin, rhesus, pig and rat.

Downloadable files:

User's manual for lossgainRSL: English v6.31
Input/output data formats (imported to Excel): dataformat.xlsx
lossgainRSL binary executables for Windows: v6.31
Test example for Windows with six species: example.zip
lossgainRSL source code and test example for Linux (under GNU GPL): v6.31
Auxiliary utilities and scripts: gbinput v6.31
