FinDense program

The program is part of the iHCE software package, which identifies highly conserved elements (HCEs) in a set of genomes. It transforms the initial graph built by BldGraph into the final graph and identifies m-dense subgraphs (clusters) in the latter. Each cluster consists of vertices, together with the words assigned to them, that belong to at least m parts of the graph and are connected by edges with the largest total weight. Each cluster (with the possible exception of one giant cluster) is considered a predicted HCE. The program is intended for parallel computing under MPI on a high-performance cluster running a 64-bit Windows or Linux system.

FinDense is a command line utility written in C++. The command line syntax:

findense [options] [config_file]

(Here, findense is the program name in Linux; findense64 is a variant for Windows 64-bit with MPICH2, findense64ms — with Microsoft MPI, and findense64nompi — without MPI.)

The program options can be specified in any of three formats: /x[ value], -x[ value], --x…xx[=value]; some options take no value. Most options are used to change the “usual” values of the parameters, which are set in the program configuration file or by default (a value given in the command line has the highest priority).
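The three option formats can be illustrated with a minimal Python sketch. It is not the FinDense parser; the function name split_options and the FLAGS set are hypothetical, and the sketch assumes that only the help and nompi options take no value, as the option list in this document suggests.

```python
# Options documented as taking no value (an assumption drawn from this manual).
FLAGS = {"?", "help", "n", "nompi"}

def split_options(argv):
    """Split arguments into (options, positional) using the three
    documented syntaxes: /x value, -x value, --name=value."""
    options, positional = {}, []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith("--"):
            # Long form: the value, if any, follows the equality sign.
            name, sep, value = arg[2:].partition("=")
            options[name] = value if sep else True
        elif len(arg) == 2 and arg[0] in "/-":
            # Short form: the value, if any, is the next argument.
            name = arg[1]
            if name in FLAGS:
                options[name] = True
            elif i + 1 < len(argv):
                i += 1
                options[name] = argv[i]
            else:
                raise ValueError(f"option {arg} requires a value")
        else:
            positional.append(arg)
        i += 1
    return options, positional
```

For example, split_options(["-m", "4", "/w", "10", "--diversity=0", "my.ini"]) yields the options m, w and diversity with string values and leaves my.ini as a positional argument.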

The recognized command line options and their default values

-?
--help
Displays a brief help on command line options.
-a value
--annotate=value
Specifies how long the intersection of a word with an annotated DNA region must be for the annotation to cover the word (and appear in the detailed cluster file). Thus, if the value is 1, the word must share at least one position with the annotated region; if the value is -1, the word may be merely adjacent to the region, and so on. A value of zero means that annotations are not used at all. The default value is 8.
-c filename
--config=filename
The configuration file name, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the config.ini file in the working directory.
-d value
--diversity=value
Sets the diversity threshold for identified clusters. The parameter does not affect the FinDense algorithm; it only controls the selection of final results. For each HCE, an empirical “diversity” estimate is calculated, and if it is less than the threshold, the cluster is skipped. The diversity is calculated by adding up to 3 kinds of summands for each word of the cluster: 10000 for a new genome of the highest taxon (e.g., for each new genus), 100 for a new genome of a lower taxon (e.g., a new species of the same genus), and 1 for a new genome of the lowest taxon (e.g., a new strain of the same species). Specific diversity levels are defined by the user, who sets the taxonomic code of each organism in the [species] section of the configuration file. The taxonomic code is a string of at most 3 characters corresponding to the three levels; the characters are compared to determine the novelty of words from those genomes when calculating the HCE diversity. A value of zero means that the program results will include all identified clusters irrespective of their diversity. The default value of 20000 means that all clusters consisting of words from genomes of the same “genus” (i.e., genomes that have the same first letter in the taxonomic code) are skipped.
-e filename
--err=filename
The name of the error log file, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The # character in the file name will be replaced by the CPU number, provided that separate logging of each MPI process is set in the configuration file. The default file name is fd_err#.log. The user can completely disable error logging through the configuration file.
-f name
--fastapath=name
The common prefix of the input file names such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of the names is specified in the [species] section of the configuration file. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the fasta\ subdirectory of the working directory.
-g name
--gffpath=name
The common prefix of the files containing genome annotations in GFF v.3 format; usually this is a directory path. The variable part of the names is specified in the [species] section of the configuration file. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the gff\ subdirectory of the working directory. A separate annotation file is not required if the genome is given in the GenBank format, which normally includes annotations. The annotation files are optional.
-h filename
--hub=filename
The name of the initial graph primary file, which may include a directory path. The initial graph is built by the BldGraph program, it consists of the primary file (normally with .hub extension) and multiple secondary files with .star extension, which occur in the same directory (the default is graph\ in the working directory). If the name contains spaces, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name must be specified in the configuration file; there is no default name.
-i filename
--clstat=filename
The name of the cluster statistics file, which may include a directory path. The file is described among the output data below. If the name contains # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the -m option). If the name contains spaces, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name must be specified in the configuration file. The default value is cluster\clstat_#.txt.
-j filename
--hcecount=filename
The name of the file to write the HCE statistics over genome pairs; the name may include a directory path. The file is described among the output data below. If the name contains # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the -m option). If the name contains spaces, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name must be specified in the configuration file. The default value is cluster\hcecount_#.txt.
-k filename
--cluster=filename
The name of the detailed cluster file, which may include a directory path. The file is described among the output data below. If the name contains # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the -m option). If the name contains spaces, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name must be specified in the configuration file. The default value is cluster\cluster_#.txt.
-m value
--minpart=value
The minimum allowable number of graph parts (i.e., different genomes) in a cluster. The default value is 3, so the cluster must contain words from at least three different genomes. The zero value means that clusters must contain words from all input genomes. To disable this control use the value of 1.
-n
--nompi
If specified, the multiprocessor variant of the program is forced to operate in uniprocessor mode even in an MPI environment. This option may not help on some systems without MPI, where the program crashes; the uniprocessor variant of the program should be used in such cases.
-o filename
--log=filename
The name of the program log file, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The # character in the file name will be replaced by the CPU number, provided that separate logging of each MPI process is set in the configuration file. The default file name is findense#.log.
-p value
--portion=value
The number of records in the portion used for data exchange between parallel processes within the program. A larger portion gives higher performance but requires more buffer memory and increases the risk of deadlocks. If a deadlock occurs, we suggest decreasing the value from the default of 2000 to, e.g., 100.
-q string
--seq=string
This option controls the output format of the genome sequence as an HCE word in the detailed cluster file (ref. the -k option). The string is case-sensitive and contains up to 3 different letters from the set {U, I, X, A, C} in any order without spaces. The default value is xU. The meaning of the letters:
U/u
At each vertex of the final graph, a region corresponding to the union of initial words merged at this vertex is shown in uppercase (U) or lowercase (u) letters.
I/i
At each vertex of the final graph, a region corresponding to the intersection of initial words merged at this vertex is shown in uppercase (I) or lowercase (i) letters.
If both union and intersection are specified in the string, the intersection region is shown as per the I/i specification, and its complement within the union region as per the U/u specification.
X/x
In addition to the requested region, N preceding and N following letters are shown in upper (X) or lower (x) case, unless the beginning or the end of the sequence is reached. The value of N is set by the -x option or the extra parameter in the configuration file. The default value is 5.
A, C
Reserved for future development.
-r filename
--result=filename
The name of the result file, which may include a directory path. The file is described among the output data below. If the name contains # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the -m option). If the name contains spaces, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name must be specified in the configuration file. The default value is cluster\result_#.txt.
-s filename
--star=filename
The name of the initial graph secondary file, which may include a directory path (see also the -h option). The name must contain the * character, which means organism identifier. If the name contains spaces, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name can be specified in the configuration file, otherwise the default value graph\*.star will be used.
-t value
--stamp=value
Specifies the number of minutes after which the program reports its progress to the log file. This parameter helps to monitor the work of the program and predict its completion time. The default value of 0 makes the program report only on completion of the predefined stages.
-w value
--weight=value
Specifies minimum allowable edge weight (edges with lower weights are deleted). This option allows a user to thin out the graph by deleting low-weight edges. Sometimes it helps to eliminate a giant cluster or decrease its size. The default value of 0 means that all edges are kept.
-x value
--extra=value
Specifies the number of extra preceding or following letters added to each word of the cluster. See also the -q option. The default value is 5.
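The diversity estimate used by the -d option can be sketched in Python as follows. The sketch rests on one reading of the description above, which is an assumption: each word adds a single summand, chosen by the highest taxonomic level at which its code has not been seen before in the cluster, and a word whose full code was already seen adds nothing. The function name diversity is hypothetical, and the actual FinDense calculation may differ in detail.

```python
def diversity(tax_codes):
    """Estimate cluster diversity from the taxonomic codes of its words.

    tax_codes holds one code (up to 3 characters) per cluster word.
    Assumption: one summand per word, by the highest level that is new.
    """
    weights = (10000, 100, 1)        # genus, species, strain levels
    seen = [set(), set(), set()]     # code prefixes of length 1, 2, 3
    total = 0
    for code in tax_codes:
        code = code.ljust(3)[:3]     # pad short codes so all 3 levels exist
        for level in range(3):
            if code[:level + 1] not in seen[level]:
                total += weights[level]
                break
        for level in range(3):
            seen[level].add(code[:level + 1])
    return total
```

Under this reading, a cluster whose words all carry codes starting with the same letter stays below the default threshold of 20000 as long as it contains fewer than about a hundred distinct species, while a cluster spanning two “genera” reaches it.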

The first command line argument that is neither one of the above options nor an option value is treated as the name of the configuration file (as with the -c option).
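As an illustration of the syntax above, the following Python sketch assembles a FinDense command line from long-form options. The helper name build_command is hypothetical, and launching the program through mpiexec -n is an assumption about the local MPI installation rather than something this manual prescribes.

```python
def build_command(program="findense", config=None, processes=None, **options):
    """Build an argument list using the documented --name=value syntax."""
    cmd = []
    if processes:
        # Assumption: the MPI launcher is available as "mpiexec".
        cmd += ["mpiexec", "-n", str(processes)]
    cmd.append(program)
    for name, value in options.items():
        # Options without a value are passed as True.
        cmd.append(f"--{name}" if value is True else f"--{name}={value}")
    if config:
        # The configuration file is given as the trailing positional argument.
        cmd.append(config)
    return cmd
```

The resulting list can be passed to subprocess.run; for instance, build_command(config="my.ini", processes=4, minpart=4, weight=10) requests clusters covering at least four genomes and drops edges with a weight below 10.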

The program configuration file

The configuration file is required; it is a text file having the traditional structure of such files (an example of the file is included in the test examples for Windows and Linux). Lines of the file must not be wrapped; empty lines are ignored. A line that starts with a ; or # character is considered a comment (i.e., ignored). A part of the line starting with double slash (//) is also considered a comment.

Meaningful lines of the file can be of two kinds: a section name in the format [section_name], and a parameter setting in the format parameter = value. In the latter format, at least one space or tab character must be used as a delimiter before and after the equality sign. The part of the file from a section name to the next section name or the end of the file is referred to as a configuration section. Parameters can appear in any order inside a section. The configuration file can include several sections in any order, with the exception of the [common] section, which must be the first one if present, and the [species] section, which must be the last one. FinDense uses only the sections [common], [findense] and [species] and ignores any other sections of the configuration file.
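The rules above can be summarized by a short Python sketch. It is an illustration only, not the FinDense reader: the names parse_config, lookup and SAMPLE are hypothetical, the sample values are invented, and the fallback from [findense] to [common] is applied here to every parameter although the manual states it only for some of them.

```python
import re

# "parameter = value": whitespace is required around the equality sign.
PARAM = re.compile(r"^(\S+)[ \t]+=(?:[ \t]+(.*))?$")

def parse_config(text):
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()   # drop // comments
        if not line or line[0] in ";#":        # blank or comment line
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            # [species] holds tab-delimited records, not parameters.
            sections[current] = [] if current == "species" else {}
        elif current == "species":
            sections[current].append(line)
        elif current is not None:
            match = PARAM.match(line)
            if match:
                value = (match.group(2) or "").strip('"')
                sections[current][match.group(1)] = value
    return sections

def lookup(sections, name, default=None):
    """Return a parameter from [findense], else [common], else default."""
    for section in ("findense", "common"):
        if name in sections.get(section, {}):
            return sections[section][name]
    return default

SAMPLE = """\
[common]
keysize = 16        // shared with other iHCE programs
hubname = "graph/my graph.hub"
; a comment line
[findense]
minpart = 4
[species]
1\tsp1\tFirst organism\tAaa\tsp1.fasta
"""
```

Note that the sketch would also cut a value containing a double slash, so it is not suitable for such values without refinement.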

The recognized configuration parameters in the [findense] section

splitlog
Sets the logging mode for the parallel variant of the program. On the right-hand side, a Boolean value must be given as TRUE (possible forms are yes, true, 1, +) or FALSE (possible forms are no, false, 0, -). A true value makes each MPI process generate a separate log file with the name specified by the parameters logname and logext or by the command line option -o (the latter has a higher priority). If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is TRUE.
logname
Specifies the log file name (without extension), which may include a directory path. If the value contains spaces, it must be included in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is findense. The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
logext
Specifies the extension of the log file name. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is .log. The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
errname
Specifies the error log file name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is fd_err. If an empty value is specified, the error log will not be produced. The value of this parameter can be changed by the -e command line option, which modifies the two parameters errname, errext at once.
errext
Specifies the extension of the error log file name. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is .log. The value of this parameter can be changed by the -e command line option, which modifies the two parameters errname, errext at once.
fastapath
The common prefix of the input file names such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of the names is specified in the [species] section of the configuration file. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the fasta\ subdirectory of the working directory.
gffpath
Specifies the common prefix of the files containing the genome annotations in GFF v.3 format; usually this is a directory path. If the value contains spaces, it must be enclosed in double quotes. If the parameter has an empty value, annotations will not be used at all. The variable part of the names is specified in the [species] section of the configuration file. The default is the gff\ subdirectory of the working directory. The value of this parameter can be changed by the -g command line option. A separate annotation file is not required if the genome is given in the GenBank format, which normally includes annotations. The annotation files are optional.
hubname
Specifies the name and path of the initial graph primary file built by BldGraph. If the value contains spaces, it must be enclosed in double quotes. If the parameter is not set by the -h command line option and absent from the [findense] section, it must be specified in [common] section; there is no default value.
starname
Specifies the name and path of the initial graph secondary file built by BldGraph. If the value contains spaces, it must be enclosed in double quotes. The name must contain the * character, which is replaced with the organism identifier. The value of this parameter can be changed by the -s command line option. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is graph\*.star.
length
The lower threshold of the word length (only words of greater length are considered). This option allows extra filtering by the length of candidate words, in addition to that already made by the PairHits' and/or BldGraph's -l options. The default value is 60.
keysize
Specifies the minimum length of the exactly matching portion of sought-for words (the same as for PairHits). Recommended value is a multiple of 4 in the range of 16 to 48. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is 16.
serial_del
Specifies the maximum allowed number of consecutive deletions, and total of deletions at lack of insertions (0 means that deletions are not permitted). The value -1 means that there is no limitation on the number of deletions. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is 2.
maxscore
Specifies the maximum allowed total penalty for the words mismatch. The value is rounded to one decimal place. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is 17.5.
score_del
Specifies the penalty for each deletion (the positive cost of a letter insertion or deletion). The value is rounded to one decimal place. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is 2.1.
score_mis
Specifies the penalty for each mismatch (the positive cost of a letter substitution). The value is rounded to one decimal place. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is 1.
clstat
Specifies the name of the cluster statistics file, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The file is described among the output data below. If the name contains # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the minpart parameter). The value of this parameter can be changed by -i command line option. The default value is cluster\clstat_#.txt.
cluster
Specifies the name of the detailed cluster file, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The file is described among the output data below. If the name contains # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the minpart parameter). The value of this parameter can be changed by -k command line option. The default value is cluster\cluster_#.txt.
result
Specifies the name of the result file, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The file is described among the output data below. If the name contains the # character, it will be replaced with specified minimum number of different organisms in the cluster (ref. the minpart parameter). The value of this parameter can be changed by the -r command line option. The default value is cluster\result_#.txt.
hcecount
Specifies the name of the file to write the HCE statistics over genome pairs; the name may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The file is described among the output data below. If the name contains the # character, it will be replaced with specified minimum number of different organisms in the cluster (ref. the minpart parameter). The value of this parameter can be changed by the -j command line option. The default value is cluster\hcecount_#.txt.
minpart
Specifies the minimum allowed number of graph parts (i.e., different genomes) in a cluster. The default value is 3, so the cluster must contain words from at least three different genomes. The zero value means that clusters must contain words from all input genomes. To disable this limitation use the value of 1. The value of this parameter can be changed by the -m command line option.
minweight
Specifies the minimum allowable edge weight (edges with lower weights are deleted). This option allows a user to thin out the graph by deleting low-weight edges. Sometimes it helps to eliminate a giant cluster or decrease its size. The default value of 0 means that all edges are kept. The value of this parameter can be changed by the -w command line option.
diversity
Sets the diversity threshold for identified clusters. The parameter does not affect the FinDense algorithm, it only manages the final result selection. See more detail in the description of the -d command line option, which also allows changing this parameter. The default value is 20000.
milestone
Specifies the number of minutes after which the program reports its progress to the log file. This parameter helps to monitor the work of the program and predict its completion time. The value of this parameter can be changed by the -t command line option. The default value of 0 makes the program report only on completion of the predefined stages.
extra
Specifies the number of extra preceding or following letters added to each word of the cluster. See the -q option for detail. The value of this parameter can be changed by the -x command line option. The default value is 5.
seqmode
The parameter controls output format of the genome sequence as an HCE word in the detailed cluster file (ref. the cluster parameter). The string is case-sensitive and contains up to 3 different letters from the set {U, I, X, A, C} in any order without spaces. See detail in the -q option, which can be used to change this parameter. The default value is xU.
numnode
Specifies a Boolean value in the same form as for the splitlog parameter. If a true value is specified, the detailed cluster file will include, for each vertex, its number as well as the numbers of all adjacent vertices of the final graph. By default, the program uses a false value and does not include vertex numbers in the output data.
annotate
Specifies how long the intersection of a word with an annotated DNA region must be for the annotation to cover the word (and appear in the detailed cluster file). Thus, if the value is 1, the word must share at least one position with the annotated region; if the value is -1, the word may be merely adjacent to the region, and so on. A value of zero means that annotations are not used at all. The value of this parameter can be changed by the -a command line option. The default value is 8.
outgene
Specifies a Boolean value in the same form as for the splitlog parameter. If a true value is specified (the default), then for each word that overlaps with an annotated gene region but does not lie entirely inside it, the detailed cluster file will include the number of positions by which the word extends beyond the gene region. If a false value is specified, the mutual location of words and genes is not considered or reported.
toptype
Specifies the list of top level genomic sequence types to be recognized in annotations. The entire list must be in a single line. The types are separated by the | character; this character must also be in the start and the end of the list. The parameter is required if annotation files are used.
lowtypeX (X=1,2,3)
Each of these three parameters specifies the list of genomic sequence types at lower levels to be recognized in annotations. Every word in the detailed cluster file can be accompanied by annotations of up to three levels, e.g. gene, translated region (CDS) and RNA annotations. In order to recognize the annotation type, a separate list should be provided for each level. Such a list is similar to that of the toptype parameter: it occupies a single line, starts and ends with the | character, and uses it as a delimiter. All three parameters are required if annotation files are used.
portion
The number of records in the portion used for data exchange between parallel processes within the program. A larger portion gives higher performance but requires more buffer memory and increases the risk of deadlocks. If a deadlock occurs, we suggest decreasing the value from the default of 2000 to, e.g., 100.
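Two value formats used in the [findense] section, the Boolean forms (splitlog, numnode, outgene) and the |-delimited type lists (toptype, lowtypeX), can be checked with a sketch like the one below. The function names are hypothetical, case-insensitive matching of the Boolean forms is an assumption, and the type names in the usage note are merely illustrative.

```python
TRUE_FORMS = {"yes", "true", "1", "+"}
FALSE_FORMS = {"no", "false", "0", "-"}

def parse_bool(value):
    """Interpret a Boolean configuration value (case-insensitively)."""
    v = value.strip().lower()
    if v in TRUE_FORMS:
        return True
    if v in FALSE_FORMS:
        return False
    raise ValueError(f"not a Boolean value: {value!r}")

def parse_type_list(value):
    """Split a list such as |gene|pseudogene| into its type names."""
    value = value.strip()
    if len(value) < 2 or not (value.startswith("|") and value.endswith("|")):
        raise ValueError("the list must start and end with the | character")
    return [name for name in value[1:-1].split("|") if name]
```

For instance, parse_type_list("|gene|pseudogene|") returns the two type names, while a list lacking the leading or trailing | character is rejected.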

The list of input data in the [species] section

This section establishes a correspondence between the organism name, identifier, number and data file names. Each line corresponds to an organism. The program ignores blank lines and lines starting with a ; or # character. Leading spaces or tab characters in the line are skipped; then at least 5 fields must follow, delimited by one or more tab characters. A field may include spaces. The fields contain the following data:

  1. the organism number (unique within the input data set);
  2. short denotation (identifier) of the organism, less than 36 characters in the current version;
  3. name of the organism;
  4. taxonomic code (ref. the -d option above);
  5. file name of the full genome in FASTA or GenBank format (however, the whole genome must be in a single file). A path to the file is specified by the fastapath parameter in the configuration file or by the -f command line option.
  6. file name of the genome annotation in GFF v.3 format; this field is optional. Moreover, the separate file with annotations is not necessary if the genome is given in the GenBank format, which normally includes annotations. A path to the file is specified by the gffpath parameter in the configuration file or by the -g command line option.

A template list of input data is provided in the test example for Windows and Linux.
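The field layout of a [species] line can be illustrated by the following Python sketch. The Species tuple and the function parse_species_line are hypothetical names, and the sample record in the usage note is invented for illustration.

```python
import re
from typing import NamedTuple, Optional

class Species(NamedTuple):
    number: int
    identifier: str
    name: str
    tax_code: str
    genome_file: str
    annotation_file: Optional[str]   # the sixth field is optional

def parse_species_line(line):
    """Parse one [species] record; return None for blank or comment lines."""
    stripped = line.lstrip(" \t").rstrip("\r\n")
    if not stripped or stripped[0] in ";#":
        return None
    fields = re.split(r"\t+", stripped)   # one or more tabs delimit fields
    if len(fields) < 5:
        raise ValueError("at least 5 tab-delimited fields are required")
    if len(fields[1]) >= 36:
        raise ValueError("the identifier must be shorter than 36 characters")
    annotation = fields[5] if len(fields) > 5 and fields[5] else None
    return Species(int(fields[0]), fields[1], fields[2], fields[3],
                   fields[4], annotation)
```

A line such as "1<TAB>sp1<TAB>First organism<TAB>Aaa<TAB>sp1.fasta" (with <TAB> standing for a tab character) is parsed into a record without an annotation file.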

Output data

FinDense produces a number of output text files, the names and locations of which are specified in the configuration file and/or command line options. These files are described below using the names resulting from an execution of the test example; the relevant configuration parameters and command line options are given in parentheses. For convenience, these file templates are provided in the myfiles\ directory of the test example.
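The way the # character in a file name template is expanded can be sketched as follows. Both function names are hypothetical. Dropping the # character when separate logging is off is an assumption, made because the test example produces a file named findense.log.

```python
def output_name(template, minpart):
    """Expand # in an output file name with the minpart value."""
    return template.replace("#", str(minpart))

def log_name(template, process, splitlog=True):
    """Expand # in a log file name with the MPI process number.

    Assumption: without separate logging the # character is dropped.
    """
    return template.replace("#", str(process) if splitlog else "")
```

With the default template cluster\clstat_#.txt and minpart equal to 3, the first function gives cluster\clstat_3.txt, which matches the file names used in the descriptions below.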

findense.log: the program log file (splitlog, logname, logext, -o)

This file is mostly self-explanatory. At the beginning, there is the program version number and the number of parallel processes, if FinDense was run in MPI mode, as well as the command line arguments, if any. In parallel mode, each process usually produces a separate log file; the most substantial is the log of the root process (with number 0), which finally identifies the clusters. If the initial graph was built by BldGraph in the mode insensitive to the DNA strand, FinDense tries to bind each cluster word to a specific strand. In rare cases, such binding cannot be done and the log contains the warning “Cannot reconcile cluster…” with the number of the cluster that could not be bound.

matrix.fas: the result file (result, minpart, -r, -m)

The current FinDense version produces the result file for subsequent building of a phylogenetic tree by the RAxML program. The file has FASTA format and contains one sequence for each organism; the organism name appears in the header line of the sequence after the > character. All sequences have the same length, equal to the number of clusters identified, and consist of the characters 0 and 1. The character 1 in the j-th position of the i-th sequence indicates that the HCE with the number j contains a word from the i-th genome; otherwise 0 appears in that position.
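A result file of this kind can be inspected with a few lines of Python. The sketch below is an illustration with hypothetical function names; it accepts sequences wrapped over several lines, although whether FinDense wraps them is not stated here.

```python
def read_presence_matrix(lines):
    """Read a FASTA file of 0/1 strings into {organism name: string}."""
    matrix, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            matrix[name] = ""
        elif name is not None:
            matrix[name] += line       # tolerate wrapped sequences
    return matrix

def genomes_per_hce(matrix):
    """Count, for each HCE (column), the genomes that contribute a word."""
    return [sum(ch == "1" for ch in column)
            for column in zip(*matrix.values())]
```

With open("matrix.fas") as f, read_presence_matrix(f) returns the matrix; every count produced by genomes_per_hce should then be at least the minpart value used for the run.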

hcecount_3.txt: the file with HCE statistics over genome pairs (hcecount, minpart, -j, -m)

The lines of this text file contain a number of fields delimited by the tab character, which allows easy import into a spreadsheet, e.g. Excel. The test example includes the example.xlsx file whose sheet “hcecount” was produced this way; we shall describe the HCE statistics file with reference to this spreadsheet. It contains two square matrices with the order equal to the number of genomes. The organism names are shown to the left and at the top of each matrix. The upper matrix relates to the initial graph, and the lower one to the final graph. The cell values are the numbers of pairs of similar words for each pair of genomes. In other words, these are the numbers of edges with high weights between vertices of each two parts of the graph. The bottom row of each matrix contains the totals for each column.

clstat_3.txt: the cluster statistics file (clstat, minpart, -i, -m)

The lines of this text file contain a number of fields delimited by the tab character, which allows easy import into a spreadsheet, e.g. Excel. The test example includes the example.xlsx file whose sheet “clstat” was produced this way; we shall describe the cluster statistics file with reference to this spreadsheet. The first line of the file provides field headings for the second line, which contains the following data (per column): A: the total number of graph parts, i.e., input genomes; B: the minimum number of parts represented in a cluster (the value of the minpart parameter); C: the number of vertices in the initial graph; D: the number of edges in the initial graph; E: the number of steps done until convergence of the dense subgraph identification algorithm; F: the number of vertices in the final graph; G: the number of edges in the final graph; H: the total number of clusters with sufficient diversity; I: the maximum number of vertices (i.e., words) in a cluster; J: the minimum number of vertices in a cluster; K: the average number of vertices in a cluster (rounded to the nearest integer); L–O: the number of clusters that contain words from the number of genomes shown in the respective heading.

The third line contains the field headings for the subsequent lines of the file, namely (per column): A: the serial number of the cluster; B: the total number of words in the cluster; C: the number of different genomes these words belong to; D–I: the number of words from each genome; K: the number of words overlapping with a gene; L: the number of words overlapping with a translated region (CDS); M: the number of words overlapping with a known RNA. The subsequent lines contain these values for each cluster. The clusters are listed in descending order of their diversity (see the -d option).
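Outside a spreadsheet, the layout described above (a heading line, a summary line, a second heading line, then one line per cluster) can be read with the standard csv module. The sketch is hypothetical and takes the column names from the file itself, since the exact heading texts are not listed in this document; duplicate headings, if any, would overwrite each other in the dictionaries.

```python
import csv
import io

def read_clstat(stream):
    """Return (summary, clusters) from a cluster statistics file."""
    rows = list(csv.reader(stream, delimiter="\t"))
    # Line 1 holds headings for line 2 (the overall summary).
    summary = dict(zip(rows[0], rows[1]))
    # Line 3 holds headings for all subsequent per-cluster lines.
    clusters = [dict(zip(rows[2], row)) for row in rows[3:] if row]
    return summary, clusters

# Invented miniature input; real headings and values will differ.
SAMPLE = io.StringIO(
    "parts\tminpart\n"
    "6\t3\n"
    "no\twords\tgenomes\n"
    "1\t7\t4\n"
    "2\t5\t3\n"
)
```

Calling read_clstat(SAMPLE) returns the summary as one dictionary and the clusters as a list of dictionaries in the order of the file, i.e. in descending order of diversity.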

cluster_3.txt: the detailed cluster file (cluster, minpart, -k, -m)

The lines of this text file contain a number of fields delimited by the tab character, which allows easy import into a spreadsheet, e.g. Excel. The test example includes the example.xlsx file whose sheet “cluster” was produced this way; we shall describe the detailed cluster file with reference to this spreadsheet. The first line of the file provides field headings; the second line is left empty for setting filters in Excel. The subsequent lines describe the clusters in the same order as in the cluster statistics file. Each cluster is represented by a heading line (in a specific format) followed by a data line for each word of the cluster. The data lines have a common format corresponding to the heading line of the file.

Notes:

  1. The word positions in columns G and H are approximate; exact coordinates can be determined by multiple alignment of all words of the cluster (to be implemented in future versions of FinDense).
  2. The word in column J is the genome part corresponding to the values in columns G, H, I; for the negative strand the reverse complement is taken. Since in the general case this word is the result of merging a group of multiple words assigned to vertices of the source graph, the common intersection of those words is shown in uppercase letters, and the rest of their union in lowercase letters, in accordance with the configuration specified in the test example. Other output formats are also possible; see the seqmode parameter and the -q command line option for details.
  3. When the initial graph is built by BldGraph in the mode of indistinguishable DNA strands (strand = false), in rare cases the DNA strand indicator in column I cannot be recovered; this is indicated by the value 0. In such cases the word in column J may fail to match the other words of the cluster, and a warning about that appears in the log file (see above).

Downloadable files

See the corresponding section of the iHCE page.