lossgainRSL: a program for prediction of gene losses and gains between several groups of species
The program lossgainRSL allows the user, for a given reference species,
to identify its genes that are present in and/or absent from several groups of species in
accordance with a given logical function (predicate). The elementary condition is “gene
X is present in the group of species S”, and the predicate is a Boolean
function composed of such conditions using logical connectives AND, OR, NOT. A group of species
does not need to be a taxonomic group; it can be formed on the basis of arbitrary traits.
In particular, a species may belong to multiple groups and S may consist of a single
species. For a gene X to be present in the group S, it is required that
X be present in at least p species from S (p is
a parameter of the group). The presence of gene X from the reference species in some
other species means that (1) such species includes gene X', which is an ortholog or
close homolog of X, and (2) there are several distinct genes (“witnesses”)
near the gene X, which have respective orthologs located near X' in that
species. Thus, synteny is taken into account. The groups of species, the predicate, the number
of witnesses and other synteny details, the range of nearness, and the extent of homology are all
program parameters, which makes lossgainRSL applicable to a wide range of studies.
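The way elementary conditions combine into a predicate can be illustrated with a minimal Python sketch. It is not the program's implementation: the group names, member species, thresholds and the example predicate are all hypothetical, and the set of species in which gene X is present is assumed to be already known (in lossgainRSL that decision involves orthology and synteny).

```python
# Hypothetical groups: name -> (member species, threshold p).
# A species may belong to several groups; a group may hold one species.
GROUPS = {
    "primates": ({"human", "rhesus", "capuchin"}, 2),
    "others": ({"rat", "pig"}, 1),
}


def in_group(present_in, members, p):
    """Elementary condition: gene X is present in at least p species of the group."""
    return len(present_in & members) >= p


def predicate(present_in, groups=GROUPS):
    """Hypothetical predicate: present in 'others' AND NOT present in 'primates'."""
    return (in_group(present_in, *groups["others"])
            and not in_group(present_in, *groups["primates"]))


# Species in which gene X was judged present (assumed input for this sketch).
print(predicate({"rat", "pig", "rhesus"}))    # only one primate, below p=2
print(predicate({"human", "rhesus", "rat"}))  # two primates reach p=2
```

Any Boolean combination of such conditions with AND, OR and NOT can be written in the same way.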
The program is written in C++ as a command line utility for Windows/Linux. The command syntax is

    lossgainRSL [-x|--x] [-qN] [-gM] [-n] [%name=value...] [config] [outfile]

and is case insensitive, except for file names in Linux. Short help on the command line
is displayed by running the program with any of the arguments ?, -? or -h. All lossgainRSL
arguments are optional and have the following functions:
-x, --x: Option -x specifies that the program outputs only the genes identified in the reference species along with their orthologs in other species, even if entire synteny blocks (i.e., including witnesses) were requested in the configuration file. Conversely, if --x is specified, the program outputs the genes of any species together with the witnesses in that species, irrespective of the configuration setting.
-qN: Shortens the program log written to the console. The default, N=0, gives the most detailed log. If -q3 is specified, the program outputs only the final result, producing the shortest log.
-gM: Controls the frequency of output during the search for candidate genes. The value of M is the step at which serial numbers of genes are displayed for each genomic sequence under analysis (chromosome, scaffold, contig, etc.). A value of zero cancels this output. A step of 10 is used by default.
-n: Prevents the program from using MPI. If lossgainRSL fails because MPI has not been installed in the OS, first try specifying this option. If the error persists, use the program version compiled without MPI support (see downloadable files below).
%name=value: The program allows the user to specify numerical parameters in the configuration file (in the form %name, where name is an identifier); the effective values of such parameters are specified on the command line by this argument. A separate argument is required for each named parameter; the order is arbitrary. Numerical values may be integral or fractional, written in fixed or scientific notation. To use the command in Windows scripts, the percent character should be doubled.
config: The name of the configuration file, including an optional path. The default is the file config.ini in the current directory.
outfile: The name of the output file, including an optional path. The default is the file result.txt in the current directory.
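The shape of the %name=value arguments can be illustrated with a short Python sketch. It only shows how such arguments might be collected and how they would be written in a Windows script; the parameter names minscore and evalue are hypothetical, and lower-casing the names is an assumption based on the command being case insensitive.

```python
def parse_named_parameters(argv):
    """Collect %name=value arguments into a dict mapping name -> float."""
    params = {}
    for arg in argv:
        if arg.startswith("%") and "=" in arg:
            name, _, value = arg[1:].partition("=")
            # float() accepts integral, fixed and scientific forms.
            params[name.lower()] = float(value)  # lower-casing is an assumption
    return params


def for_windows_script(arg):
    """Double the percent character, as required inside Windows scripts."""
    return arg.replace("%", "%%")


# Hypothetical parameter names, used only for illustration.
args = ["-q3", "%minscore=40", "%evalue=1e-5", "config.ini"]
print(parse_named_parameters(args))
print(for_windows_script("%evalue=1e-5"))
```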
The program outputs its log to the standard stream stdout (console by default),
which can be redirected to a file if desired. If requested in the configuration file, debug
data are written to the standard stream stderr (also console by default);
it can be redirected to a file with the 2>logfile parameter, where logfile is the file name.
If MPI support is provided by the OS, the program allows multiprocessing. The following command line should be used:

    mpirun -np P lossgainRSL [-x|--x] [-qN] [-gM] [%name=value...] [config] [outfile]

where mpirun is the name of the MPI launcher program (mpiexec in Windows with Microsoft MPI installed) and P is the number of logical CPUs (e.g. cores) on which to run lossgainRSL. Other arguments are the same as in single-CPU mode.
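For scripted runs, the two command forms can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of lossgainRSL; it only arranges the arguments in the order given by the syntax lines above, and the file names are the documented defaults.

```python
def build_command(config="config.ini", outfile="result.txt",
                  processes=1, quiet=0, named=None, launcher="mpirun"):
    """Return the argument list for a single-CPU or an MPI run."""
    # Prefix with the MPI launcher only when several processes are requested.
    cmd = [launcher, "-np", str(processes)] if processes > 1 else []
    cmd += ["lossgainRSL", f"-q{quiet}"]
    # One %name=value argument per named parameter; the order is arbitrary.
    cmd += [f"%{name}={value}" for name, value in (named or {}).items()]
    cmd += [config, outfile]
    return cmd


print(" ".join(build_command()))
print(" ".join(build_command(processes=4, quiet=3)))
```

The resulting list can be passed to subprocess.run; on Windows with Microsoft MPI, launcher would be "mpiexec".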
The configuration file controls all work of the program and is required for its use. This text
file consists of five sections; each one starts with a line containing the section name in brackets.
Lines beginning with a semicolon (;) or a double slash (//) are ignored, as is the part of any
line from a double slash onward. Within a line, fields are
separated by tab character(s); line wrapping is prohibited. Configuration examples with short
comments are provided in the test example; the detailed
description is available in the user's manual.
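The lexical rules just described (bracketed section names, comment lines, trailing // comments, tab-separated fields) can be sketched in Python as follows. This is an illustration of those rules only, not the program's parser; the section names and field contents in the sample are hypothetical, and the actual five sections are defined in the user's manual.

```python
def parse_config(text):
    """Split configuration text into {section name: [list of fields per line]}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        if raw.startswith((";", "//")):
            continue                           # whole-line comment
        line = raw.split("//", 1)[0].rstrip()  # drop a trailing // comment
        if not line.strip():
            continue                           # nothing left on this line
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]               # start of a new section
            sections[current] = []
        elif current is not None:
            # Runs of tab characters act as a single separator.
            sections[current].append([f for f in line.split("\t") if f])
    return sections


# Hypothetical section names and contents, for illustration only.
sample = (
    "; a comment line\n"
    "[species]\n"
    "mouse\t\tMus musculus // reference species\n"
    "// another comment line\n"
    "[groups]\n"
    "primates\t2\n"
)
print(parse_config(sample))
```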
The current version of lossgainRSL is primarily oriented toward species and genomic
data included in the Ensembl database, but the program manual also describes other possibilities.
Specifically, the program addspecies (downloadable below) can help to add a few species from GenBank.
The program lossgainRSL uses three types of input data:
- Table of genes: It contains IDs of proteins and genes, gene coordinates, names and annotations. A separate
text file in TSV (tab-separated values) format is required for each species involved in
the task under consideration. The file name must coincide with the name of the species in the
configuration file; an optional path and extension (common for all tables of genes) should be
specified there too. This table can be formed through a direct SQL query to the Ensembl
database or interactively using the BioMart data-mining tool (see the program manual for details).
A portion of the table of genes imported to Excel is exemplified in the Rat_genes sheet of the file
dataformat.xlsx.
- Table of orthologs: A separate TSV-formatted file is required for the reference species and each substitute species
in a 3-species condition (see the program manual for
details). Each line of the file contains two IDs of orthologous genes, the first one belonging
to the reference/substitute species and the second to another species involved in the task.
Heading or empty lines are ignored. The file name must coincide with the name of the
reference/substitute species in the configuration file; an optional path and extension (common
for all tables of orthologs) should be specified there too. The table can be formed by merging
the sub-tables for each pair (reference_species, other_species) obtained from the
Ensembl database interactively or through an SQL query. An opening portion of the table of
orthologs imported to Excel is exemplified in the Mouse_orthologs sheet of the file
dataformat.xlsx.
- Table of paralogs: This table is necessary if the selection of orthologous genes is to be performed with paralogs
taken into account. A separate TSV-formatted file is required for each species where paralogs
are to be considered. The file is similar to the table of orthologs, but each line contains
IDs of two paralogous genes from the same species. The file name must coincide with the name
of the reference/substitute species in the configuration file; an optional path and extension (common
for all tables of paralogs) should be specified there too. This table can be formed from the
Ensembl database interactively or through an SQL query. An opening portion of the table of
paralogs imported to Excel is exemplified in the Mouse_paralogs sheet of the file
dataformat.xlsx.
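The two-column layout of the tables of orthologs and paralogs can be read with a few lines of Python. The sketch below is an illustration, not the program's reader: the gene IDs are made up, and because the text does not say how heading lines are recognized, the has_header flag (skip the first non-empty line) is an assumption.

```python
from collections import defaultdict


def load_pairs(lines, has_header=False):
    """Map each gene ID in column 1 to the set of gene IDs paired with it."""
    pairs = defaultdict(set)
    rows = (line.rstrip("\r\n") for line in lines)
    rows = (row for row in rows if row.strip())   # empty lines are ignored
    for index, row in enumerate(rows):
        if has_header and index == 0:
            continue                              # assumed heading line
        fields = row.split("\t")
        if len(fields) >= 2:
            pairs[fields[0]].add(fields[1])
    return pairs


# Made-up IDs; a real table would hold Ensembl gene IDs.
table = [
    "Gene ID\tOrtholog ID\n",
    "\n",
    "MOUSE_G1\tRAT_G1\n",
    "MOUSE_G1\tHUMAN_G7\n",
    "MOUSE_G2\tRAT_G5\n",
]
orthologs = load_pairs(table, has_header=True)
print(sorted(orthologs["MOUSE_G1"]))
```

The same function applies to a table of paralogs, where both IDs on a line belong to the same species.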
As an alternative to the collection of tables of orthologs and paralogs, lossgainRSL
allows for using three other types of information on homologous genes (see the program manual
for details):
- Matrix of the protein likeness: A separate TSV-formatted file is required for each species involved in the task under
consideration. The file name must coincide with the species name in the configuration file;
an optional path and extension (common for all such matrices) should be specified there too.
Each line of the file must contain four fields: two IDs of homologous proteins (the first one
from the given species, the second one from any species), the raw score, and the E-value corresponding
to the optimal alignment of these proteins. This file can be formed with BLASTP, provided
that amino acid sequences for all protein IDs have been prefetched from Ensembl or GenBank.
An opening portion of the matrix imported to Excel is exemplified in the Xenopus_scores sheet of
the file dataformat.xlsx.
- Table of protein clusters: A single TSV-formatted file is used that is formed by the
protein clustering program. This table
contains one protein/gene per line, but these lines are grouped in clusters of orthologous
genes; the clusters are separated by a blank line. Each gene occurs at most once,
represented by the splice variant that yields the protein with the greatest likeness.
Clusters consisting of only one gene (singletons) are omitted. Except for protein/gene IDs,
other fields are ignored; all necessary data on genes are obtained from independent tables
of genes (see above). An opening portion of this table imported to Excel is exemplified in the
Gene_clusters sheet of the file dataformat.xlsx.
- Table of orthogroups: A single TSV-formatted file is used that is formed from the
OrthoDB database. This table is similar
in contents and format to the union of the above-described tables of genes. It contains one
protein/gene per line, but these lines are grouped in groups of orthologous genes; the
orthogroups are separated by a blank line. Each protein/gene line must include a field with a
species name corresponding to those in the configuration file (unknown species are ignored).
Separate tables of genes are not required in this case because all necessary data are contained
in the table of orthogroups. An opening portion of this table imported to Excel is exemplified
in the Orthogroups sheet of the file dataformat.xlsx.
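The four-field layout of the matrix of protein likeness (two protein IDs, raw score, E-value) can be read as in the following Python sketch. The protein IDs are made up, and filtering by a maximum E-value is only an assumed illustration of an "extent of homology" setting; how lossgainRSL actually applies the scores is described in the manual.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str        # protein ID from the given species
    subject: str      # protein ID from any species
    raw_score: float
    e_value: float


def read_hits(lines, max_e_value):
    """Parse four-field lines and keep hits with E-value <= max_e_value."""
    hits = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4:
            continue                  # empty or incomplete line
        try:
            hit = Hit(fields[0], fields[1], float(fields[2]), float(fields[3]))
        except ValueError:
            continue                  # non-numeric score, e.g. a heading line
        if hit.e_value <= max_e_value:
            hits.append(hit)
    return hits


# Made-up IDs and scores, for illustration only.
matrix = [
    "XEN_P1\tMOUSE_P9\t250\t1e-60\n",
    "XEN_P1\tRAT_P3\t35\t0.5\n",
]
print(read_hits(matrix, max_e_value=1e-5))
```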
The program lossgainRSL writes its output to a TSV-formatted file suitable for direct
import into an Excel spreadsheet. The file contains all found genes of the reference species that
satisfy the configured predicate and parameters. The output can optionally contain the entire
synteny block for each gene (including witnesses) as well as homologous genes in other species.
An example of an output file from the test example, imported to Excel, is given in the Result
sheet of the file dataformat.xlsx: it lists the mouse genes on the Y chromosome that are absent
from the Y chromosome in human but present in at least two species among capuchin, rhesus, pig and rat.
Downloadable files:
- User's manual for lossgainRSL: English v6.31
- Input/output data formats (imported to Excel): dataformat.xlsx
- lossgainRSL binary executables for Windows: v6.31
- Test example for Windows with six species: example.zip
- lossgainRSL source code and test example for Linux (under GNU GPL): v6.31
- Auxiliary utilities and scripts: gbinput v6.31