lossgainRSL: a program for prediction of gene losses and gains between several groups of species
The program lossgainRSL allows the user, for a given reference species, 
to identify its genes that are present in and/or absent from several groups of species in 
accordance with a given logical function (predicate). The elementary condition is “gene 
X is present in the group of species S”, and the predicate is a Boolean 
function composed of such conditions using logical connectives AND, OR, NOT. A group of species 
does not need to be a taxonomic group; it can be formed on the basis of arbitrary traits. 
In particular, a species may belong to multiple groups and S may consist of a single 
species. For a gene X to be present in the group S, X must be present 
in at least p species from S (p is 
a parameter of the group). Gene X from the reference species is considered present in some 
other species if (1) that species contains a gene X', which is an ortholog or 
close homolog of X, and (2) several distinct genes (“witnesses”) 
near X have respective orthologs located near X' in that 
species. Thus, synteny is taken into account. The groups of species, the predicate, the number 
of witnesses and other synteny details, the range of nearness, and the extent of homology are all 
program parameters, which makes lossgainRSL applicable to a wide range of studies.
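As a rough illustration of the selection logic described above (not the program's actual implementation), the following Python sketch evaluates a predicate over groups of species. The function names, the gene identifiers and the representation of gene neighborhoods as plain sets are assumptions made for this example. The witness test is deliberately simplified: it only counts distinct witnesses in the reference species and ignores the other synteny details that the program exposes as parameters.

```python
from typing import Callable, Dict, List, Set


def has_enough_witnesses(ref_neighbors: Set[str],
                         other_neighbors: Set[str],
                         orthologs: Dict[str, Set[str]],
                         min_witnesses: int) -> bool:
    """Simplified synteny test (assumed form, not the program's exact rule).

    ref_neighbors   -- genes near X in the reference species
    other_neighbors -- genes near X' in the other species
    orthologs       -- reference gene -> its orthologs in the other species
    """
    witnesses = {w for w in ref_neighbors
                 if orthologs.get(w, set()) & other_neighbors}
    return len(witnesses) >= min_witnesses


def present_in_group(species_with_gene: Set[str], group: Set[str], p: int) -> bool:
    """Elementary condition: the gene is present in at least p species of the group."""
    return len(species_with_gene & group) >= p


def select_genes(presence: Dict[str, Set[str]],
                 predicate: Callable[[Set[str]], bool]) -> List[str]:
    """Return the reference genes whose presence pattern satisfies the predicate."""
    return sorted(gene for gene, species in presence.items() if predicate(species))


# Hypothetical presence data: reference gene -> species in which it is present.
presence = {
    "geneA": {"rat", "pig"},
    "geneB": {"human", "rat", "pig"},
    "geneC": {"rhesus"},
}


def predicate(species: Set[str]) -> bool:
    """NOT present in human AND present in at least 2 of the four other species."""
    return (not present_in_group(species, {"human"}, 1)
            and present_in_group(species, {"capuchin", "rhesus", "pig", "rat"}, 2))


selected = select_genes(presence, predicate)
print(selected)
```

A single-species group is simply a set with one element and p = 1, as in the human condition of this predicate.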
The program is written in C++ as a command line utility for Windows/Linux. The command syntax is
lossgainRSL [-x|--x] [-qN] [-gM] [-n] 
[%name=value...] [config] [outfile]
and is case insensitive, except for file names in Linux. Running the program with any of the 
arguments ?, -? or -h displays short help on the command line. 
All lossgainRSL arguments are optional and have the following 
functions:
- -x
- 
    Option -x specifies that the program outputs only the genes identified in the reference species along with their orthologs in other species, even if entire synteny blocks (i.e., including witnesses) were requested in the configuration file. Conversely, if --x is specified, the program outputs the genes of any species together with the witnesses in that species, irrespective of the configuration setting.
- -qN
- 
    This option shortens the log that the program writes to the console. The default, N=0, gives 
    the most detailed log. If -q3 is specified, the program outputs only the final result, producing the shortest log.
- -gM
- Controls how often progress is reported during the search for candidate genes. The value of M is the step at which serial numbers of genes are displayed for each genomic sequence under analysis (chromosome, scaffold, contig, etc.). A value of zero disables this output. The default step is 10.
- -n
- 
    Prevents the program from using MPI. If lossgainRSL fails because MPI has not been installed in the OS, it is recommended to try this option first. If the error persists, use the program version compiled without MPI support (see downloadable files below).
- %name=value
- 
    The program allows the user to specify numerical parameters in the configuration file 
    (in the form of %name, where name is an identifier); the effective values of such parameters are then specified on the command line with this argument. A separate argument is required for each named parameter; the order is arbitrary. Numerical values may be integers or fractions, written in fixed or scientific notation. To use the command in Windows scripts, the percent character should be doubled.
- config
- 
    The name of the configuration file, including an optional path. The default is the file 
    config.ini in the current directory.
- outfile
- 
    The name of the output file, including an optional path. The default is the file result.txt in the current directory.
The program writes its log to the standard stream stdout (console by default), 
which can be redirected to a file if desired. If requested in the configuration file, debug 
data are written to the standard stream stderr (also console by default); 
they can be redirected to a file with 2>logfile, where 
logfile is the file name.
If MPI support is provided by the OS, the program can run on multiple processors. The following command line should be used:
mpirun -np P lossgainRSL [-x|--x] [-qN] 
[-gM] [%name=value...] [config] [outfile]
where mpirun is the name of the MPI launcher program (mpiexec 
in Windows with Microsoft MPI installed) and P is the number of logical CPUs 
(e.g. cores) on which to run lossgainRSL. The other arguments are the same as in single-CPU mode.
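The two command forms above can be assembled programmatically, for example when many runs with different named parameters are scripted. The Python sketch below only builds the argument list and does not launch anything. The function names and the parameter names minscore and evalue are hypothetical; only the flags themselves are taken from the syntax described on this page. It also assumes that the first positional argument is always read as the configuration file, so the documented default config.ini is written out explicitly when only an output file is given.

```python
from typing import Dict, List, Optional, Union

Number = Union[int, float]


def build_command(config: Optional[str] = None,
                  outfile: Optional[str] = None,
                  genes_only: Optional[bool] = None,
                  quiet: Optional[int] = None,
                  step: Optional[int] = None,
                  no_mpi: bool = False,
                  params: Optional[Dict[str, Number]] = None,
                  mpi_procs: Optional[int] = None,
                  launcher: str = "mpirun") -> List[str]:
    """Assemble a lossgainRSL argument list following the documented syntax."""
    if no_mpi and mpi_procs is not None:
        raise ValueError("-n disables MPI and cannot be combined with an MPI launch")
    cmd: List[str] = []
    if mpi_procs is not None:
        cmd += [launcher, "-np", str(mpi_procs)]   # multiprocessing form
    cmd.append("lossgainRSL")
    if genes_only is not None:
        cmd.append("-x" if genes_only else "--x")
    if quiet is not None:
        cmd.append(f"-q{quiet}")
    if step is not None:
        cmd.append(f"-g{step}")
    if no_mpi:
        cmd.append("-n")
    for name, value in (params or {}).items():
        cmd.append(f"%{name}={value}")             # one argument per named parameter
    if outfile is not None and config is None:
        config = "config.ini"                      # documented default, made explicit
    if config is not None:
        cmd.append(config)
    if outfile is not None:
        cmd.append(outfile)
    return cmd


def for_windows_script(cmd: List[str]) -> str:
    """Join the arguments for a Windows script, doubling the percent character."""
    return " ".join(cmd).replace("%", "%%")


single = build_command(config="my.ini", outfile="out.txt", quiet=3, step=0,
                       params={"minscore": 50, "evalue": 1e-5})
parallel = build_command(mpi_procs=4)
print(single)
print(parallel)
print(for_windows_script(single))
```

The doubling of percent characters is needed only when the command line is written into a Windows script; an argument list passed directly to a process launcher needs no such escaping.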
The configuration file controls all aspects of the program's operation and is required for its use. This text 
file consists of five sections, each starting with a line that contains the section name in brackets. 
Lines beginning with a semicolon (;) or a double slash (//) are 
ignored, as is the part of any line from a double slash onward. Within a line, fields are 
separated by tab character(s); line wrapping is prohibited. Configuration examples with short 
comments are provided in the test example; a detailed 
description is available in the user's manual.
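The comment and field rules just described can be sketched in Python as follows. This is a minimal reader written for illustration, not the program's own parser. The section names species and groups and the records in the sample text are hypothetical placeholders; the real section names and record layouts are defined in the user's manual. The sketch also assumes that a semicolon counts as a comment marker only in the first column and that lines before the first section header are skipped.

```python
from typing import Dict, List, Optional


def read_sections(text: str) -> Dict[str, List[List[str]]]:
    """Split configuration text into sections of tab-separated records."""
    sections: Dict[str, List[List[str]]] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        if raw.startswith(";"):
            continue                              # whole-line comment
        line = raw.split("//", 1)[0].rstrip()     # drop everything from // onward
        if not line.strip():
            continue                              # empty or comment-only line
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]                  # section header, e.g. [name]
            sections[current] = []
        elif current is not None:
            # One or more tab characters separate fields.
            fields = [field for field in line.split("\t") if field]
            sections[current].append(fields)
    return sections


# Hypothetical configuration text (section names and records are placeholders).
sample = (
    "; comment line\n"
    "[species]\n"
    "mouse\t\treference // trailing comment\n"
    "// another comment line\n"
    "\n"
    "[groups]\n"
    "primates\t2\thuman\trhesus\n"
)

parsed = read_sections(sample)
print(parsed)
```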
The current version of lossgainRSL is primarily oriented toward the species and genomic 
data included in the Ensembl 
database, but the program manual also describes other possibilities. Specifically, the program 
addspecies (downloadable below) can help to add a few species from GenBank. 
The program lossgainRSL uses three types of input data:
- Table of genes
- 
    It contains IDs of proteins and genes, gene coordinates, names and annotations. A separate 
    text file in TSV (tab-separated values) format is required for each species involved in 
    the task under consideration. The file name must coincide with the name of the species in the 
    configuration file; an optional path and extension (common for all tables of genes) should be 
    specified there too. This table can be formed through a direct SQL query to the Ensembl 
    database or interactively using the 
    BioMart 
    data-mining tool (see the program manual for details). 
    A portion of a table of genes imported to Excel is shown in the Rat_genes sheet of the file 
    dataformat.xlsx.
- Table of orthologs
- 
    A separate TSV-formatted file is required for the reference species and each substitute species 
    in a 3-species condition (see the program manual for 
    details). Each line of the file contains two IDs of orthologous genes: the first belongs 
    to the reference/substitute species, and the second to another species involved in the task. 
    Header lines and empty lines are ignored. The file name must coincide with the name of the 
    reference/substitute species in the configuration file; an optional path and extension (common 
    for all tables of orthologs) should be specified there too. The table can be formed by merging 
    the sub-tables for each pair (reference_species, other_species) obtained from the 
    Ensembl database interactively or through an SQL query. The opening portion of a table of 
    orthologs imported to Excel is shown in the Mouse_orthologs sheet of the file 
    dataformat.xlsx.
- Table of paralogs
- 
    This table is necessary if the selection of orthologous genes is to be performed with paralogs 
    taken into account. A separate TSV-formatted file is required for each species where paralogs 
    are to be considered. The file is similar to the table of orthologs, but each line contains 
    IDs of two paralogous genes from the same species. The file name must coincide with the name 
    of the reference/substitute species in the configuration file; an optional path and extension (common 
    for all tables of paralogs) should be specified there too. This table can be formed from the 
    Ensembl database interactively or through an SQL query. The opening portion of a table of 
    paralogs imported to Excel is shown in the Mouse_paralogs sheet of the file 
    dataformat.xlsx.
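The tables of orthologs and paralogs share the same two-column layout, so a single reader can serve both. The Python sketch below is an illustration under stated assumptions rather than the program's own loader: the function name read_pairs, the header_lines argument (how the program recognizes header lines is not described here, so the caller states how many to skip) and the gene IDs in the sample are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def read_pairs(lines: Iterable[str], header_lines: int = 0) -> Dict[str, Set[str]]:
    """Map each gene in the first column to the set of genes paired with it."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for number, line in enumerate(lines):
        fields = line.rstrip("\r\n").split("\t")
        if number < header_lines:
            continue                      # skip the stated number of header lines
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue                      # skip empty or incomplete lines
        table[fields[0]].add(fields[1])
    return dict(table)


# Hypothetical table of orthologs for a reference species (placeholder IDs).
sample_lines = [
    "gene_id\tortholog_id\n",
    "MOUSE_G1\tRAT_G7\n",
    "\n",
    "MOUSE_G1\tHUMAN_G3\n",
    "MOUSE_G2\tRAT_G9\n",
]

orthologs = read_pairs(sample_lines, header_lines=1)
print(orthologs)
```

With a real file, the same function can be applied to an open file object, for example `read_pairs(open(path, encoding="utf-8"), header_lines=1)`, where path is the table for one reference or substitute species.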
As an alternative to the collection of tables of orthologs and paralogs, lossgainRSL 
can use three other types of information on homologous genes (see the program manual 
for details):
- Matrix of the protein likeness
- 
    A separate TSV-formatted file is required for each species involved in the task under 
    consideration. The file name must coincide with the species name in the configuration file; 
    optional path and extension (common for all such matrices) should be specified there too. 
    Each line of the file must contain four fields: two IDs of homologous proteins (the first 
    from the given species, the second from any species), followed by the raw score and the E-value 
    of the optimal alignment of these proteins. This file can be produced with BLASTP, provided 
    that amino acid sequences for all protein IDs have been fetched beforehand from Ensembl or GenBank. 
    The opening portion of such a matrix imported to Excel is shown in the Xenopus_scores sheet of 
    the file dataformat.xlsx.
- Table of protein clusters
- 
    A single TSV-formatted file is used, produced by the 
    protein clustering program. This table 
    contains one protein/gene per line, and the lines are grouped into clusters of orthologous 
    genes; clusters are separated by a blank line. Each gene occurs at most once, 
    represented by the splice variant that yields the protein with the greatest likeness. 
    Clusters consisting of only one gene (singletons) are omitted. Apart from the protein/gene IDs, 
    all fields are ignored; all necessary data on genes are obtained from the independent tables 
    of genes (see above). The opening portion of this table imported to Excel is shown in the 
    Gene_clusters sheet of the file 
    dataformat.xlsx.
- Table of orthogroups
- 
    A single TSV-formatted file is used, formed from the 
    OrthoDB database. This table is similar 
    in content and format to the union of the above-described tables of genes. It contains one 
    protein/gene per line, and the lines are grouped into groups of orthologous genes; the 
    orthogroups are separated by a blank line. Each protein/gene line must include a field with a 
    species name matching one in the configuration file (lines for unknown species are ignored). 
    Separate tables of genes are not required in this case because all necessary data are contained 
    in the table of orthogroups. The opening portion of this table imported to Excel is shown 
    in the Orthogroups sheet of the file 
    dataformat.xlsx.
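The three alternative formats above reduce to two parsing patterns: four-field score lines and blank-line-separated blocks of records. The Python sketch below illustrates both under stated assumptions. The function names, the E-value cutoff used as a homology filter, the position of the ID in the first column and all IDs in the samples are hypothetical; the actual column layouts are given in dataformat.xlsx and the program manual.

```python
from typing import Iterable, List, Tuple

Hit = Tuple[str, str, float, float]


def read_scores(lines: Iterable[str], max_evalue: float) -> List[Hit]:
    """Keep (protein, homolog, raw score, E-value) records with E-value <= max_evalue."""
    hits: List[Hit] = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            continue                          # not a four-field record
        try:
            score, evalue = float(fields[2]), float(fields[3])
        except ValueError:
            continue                          # e.g. a header line
        if evalue <= max_evalue:
            hits.append((fields[0], fields[1], score, evalue))
    return hits


def read_blocks(lines: Iterable[str], id_column: int = 0) -> List[List[str]]:
    """Group record IDs into clusters or orthogroups separated by blank lines."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        if not line.strip():
            if current:                       # a blank line closes the current block
                blocks.append(current)
                current = []
            continue
        current.append(line.rstrip("\r\n").split("\t")[id_column])
    if current:
        blocks.append(current)                # last block without a trailing blank line
    return blocks


# Hypothetical sample data (placeholder IDs and values).
score_lines = ["P1\tQ1\t250\t1e-60\n", "P1\tQ2\t40\t0.5\n"]
cluster_lines = ["G1\tinfo\n", "G2\tinfo\n", "\n", "\n", "G3\tinfo\n", "G4\tinfo\n"]

hits = read_scores(score_lines, max_evalue=1e-5)
clusters = read_blocks(cluster_lines)
print(hits)
print(clusters)
```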
The program lossgainRSL writes its results to a TSV-formatted file suitable for direct 
import into an Excel spreadsheet. The file contains all genes of the reference species found to 
satisfy the configured predicate and parameters. The output can optionally contain the entire 
synteny block for each gene (including witnesses) as well as homologous genes in other species. 
An example output file from the test example, imported to Excel, is given in the Result sheet of the file 
dataformat.xlsx; it lists the mouse genes on the Y chromosome that are absent from the Y chromosome 
in human but present in at least two species among capuchin, rhesus, pig and rat.
Downloadable files:
- User's manual for lossgainRSL: English v6.31
- Input/output data formats (imported to Excel): 
  dataformat.xlsx
- lossgainRSL binary executables for Windows: v6.31
- Test example for Windows with six species: example.zip
- lossgainRSL source code and test example for Linux (under GNU GPL): v6.31
- Auxiliary utilities and scripts: 
  gbinput v6.31