
Laboratory of Mathematic methods and models in bioinformatics,
Institute for Information Transmission Problems,
Russian Academy of Sciences


FinDense program

The program is a part of the iHCE software package aimed at the identification of highly conserved elements (HCEs) in a set of genomes. It transforms the initial graph built by BldGraph into the final one and identifies m-dense subgraphs (clusters) in the latter graph. These clusters consist of vertices, together with the words assigned to them, that belong to at least m parts of the graph and are connected by edges with the largest total weight. Each cluster (possibly except for one giant cluster) is considered a predicted HCE. The program is intended for parallel computing under MPI on a high-performance cluster running a 64-bit Windows or Linux system.
FinDense is a command line utility written in C++. The command line syntax:

findense [options] [config_file]

(Here, findense is the program name in Linux; findense64 is a variant for Windows 64-bit with MPICH2, findense64ms -- with Microsoft MPI, and findense64nompi -- without MPI.)

The program options can be specified in any of the three formats: /x[ value] -x[ value] --x...xx[=value]; some options lack the value. Most options are used for changing "usual" values of the parameters, which are set in the program configuration file or by default (a value in the command line has the highest priority).

Recognizable command line options and default values
-?
--help
Display brief help on command line options.
-a value
--annotate=value
Specifies the minimum length of the intersection of a word with an annotated DNA region required for the annotation to cover that word (and appear in the detailed cluster file). Thus, if the value equals 1, the word must have at least one position in common with the annotated region; if -1, the word may merely adjoin the region, and so on. Zero value means that annotations will not be used at all. Default value is 8.
-c filename
--config=filename
The configuration file name, which may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. Default is "config.ini" file in the working directory.
-d value
--diversity=value
Sets the diversity threshold for identified clusters. The parameter does not affect the FinDense algorithm; it only controls the selection of final results. For each HCE, an empirical estimate ("diversity") is calculated, and if it is less than the threshold, the cluster is skipped. The diversity is calculated by adding up to 3 kinds of summands for each word of the cluster: 10000 for a new genome of the highest taxon (e.g., for each new genus), 100 for a new genome of a lower taxon (e.g., a new species of the same genus), 1 for a new genome of the lowest taxon (e.g., a new strain of the same species). Specific diversity levels are defined by the user, who sets a taxonomic code for each organism in section [species] of the configuration file. The taxonomic code is a string of at most 3 characters corresponding to the three levels; the characters are used to compare the newness of words from those genomes when calculating the HCE diversity. Zero value means that the program results will include all identified clusters irrespective of their diversity. Default value of 20000 means that all clusters consisting only of words from genomes of the same "genus" (i.e., genomes that have the same first letter in the taxonomic code) are skipped.
-e filename
--err=filename
The name of error log file, which may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. Character '#' in the file name will be replaced by the CPU number, provided that separate logging of each MPI process is set in the configuration file. Default filename is "fd_err#.log". The user can completely disable error logging through the configuration file.
-f name
--fastapath=name
The common prefix of input file names such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). Variable part of the names to be specified in the [species] section of the configuration file. If the name contains blanks, the whole argument must be enclosed in double quotes. Default is the sub-directory "fasta\" of the working directory.
-g name
--gffpath=name
The common prefix of files containing the genome annotations in GFF v.3 formats; usually this is a directory path. Variable part of the names to be specified in the [species] section of the configuration file. If the name contains blanks, the whole argument must be enclosed in double quotes. Default is the sub-directory "gff\" of the working directory. The separate file with annotations is not necessary if the genome is given in the GenBank format, which normally includes annotations. The annotation files are optional.
-h filename
--hub=filename
The name of the initial graph primary file, which may include a directory path. The initial graph is built by the BldGraph program, it consists of the primary file (normally with .hub extension) and multiple secondary files with .star extension, which occur in the same directory (default is "graph\" in the working directory). If the name contains blanks, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name must be specified in the configuration file, there is no default name.
-i filename
--clstat=filename
The name of the cluster statistics file, which may include a directory path. The file is described among output data below. If the name contains '#' character, it will be replaced with specified minimum number of different organisms in the cluster (ref. option -m). If the name contains blanks, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name must be specified in the configuration file. Default value is "cluster\clstat_#.txt".
-j filename
--hcecount=filename
The name of the file to write the HCE statistics over genome pairs; the name may include a directory path. The file is described among output data below. If the name contains '#' character, it will be replaced with specified minimum number of different organisms in the cluster (ref. option -m). If the name contains blanks, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name must be specified in the configuration file. Default value is "cluster\hcecount_#.txt".
-k filename
--cluster=filename
The name of the detailed cluster file, which may include a directory path. The file is described among output data below. If the name contains '#' character, it will be replaced with specified minimum number of different organisms in the cluster (ref. option -m). If the name contains blanks, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name must be specified in the configuration file. Default value is "cluster\cluster_#.txt".
-m value
--minpart=value
The minimum allowable number of graph parts (i.e., different genomes) in a cluster. Default value is 3, so the cluster must contain words from at least three different genomes. Zero value means that clusters must contain words from all input genomes. To disable this control use the value of 1.
-n
--nompi
If specified, the multiprocessor variant of the program is forced to operate in uniprocessor mode even in an MPI environment. This option may not help on some systems without MPI where the program crashes; the uniprocessor variant of the program should be used in such cases.
-o filename
--log=filename
The name of the program log file, which may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. Character '#' in the file name will be replaced by the CPU number, provided that separate logging of each MPI process is set in the configuration file. Default filename is "findense#.log".
-p value
--portion=value
The number of records in the portion used for data exchange between parallel processes within the program. The bigger the portion, the higher the performance, but also the more buffer memory is needed and the higher the risk of deadlocks. If a deadlock occurs, we suggest decreasing the value from the default of 2000 to, e.g., 100.
-q string
--seq=string
This option controls output format of the genome sequence as an HCE word in the detailed cluster file (ref. option -k). The string is case-sensitive and contains up to 3 different letters from the set {U,I,X,A,C} in any order without spaces. Default value is "xU". The sense of the letters:
U
At each vertex of the final graph, a region corresponding to the union of initial words merged at this vertex is shown in uppercase letters. If u is specified, the region is shown in lowercase letters.
I
At each vertex of the final graph, a region corresponding to the intersection of initial words merged at this vertex is shown in uppercase letters (or lowercase letters if i is specified). If both union and intersection are specified in the string, the intersection region will be shown as per the i/I specification, and its complement to the union region as per the u/U specification.
X
In addition to the requested region, N preceding and N following letters are shown in upper case (or lower case if x is specified), unless the beginning or end of the sequence is met. The value of N is set by the option -x or parameter extra in the configuration file. Default value is 5.
A,C
Reserved for future development.
-r filename
--result=filename
The name of the result file, which may include a directory path. The file is described among output data below. If the name contains '#' character, it will be replaced with specified minimum number of different organisms in the cluster (ref. option -m). If the name contains blanks, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name must be specified in the configuration file. Default value is "cluster\result_#.txt".
-s filename
--star=filename
The name of the initial graph secondary file, which may include a directory path (see also option -h). The name must contain character '*', which means organism identifier. If the name contains blanks, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name can be specified in the configuration file, otherwise default value "graph\*.star" will be used.
-t value
--stamp=value
Specifies the number of minutes after which the program reports its progress to the log file. This parameter helps to monitor the work of the program and predict its completion. Default value of 0 makes the program report only on completion of the predefined stages.
-w value
--weight=value
Specifies minimum allowable edge weight (edges with lower weights are deleted). This option allows a user to thin out the graph by deleting low-weight edges. Sometimes it helps to eliminate a giant cluster or decrease its size. Default value of 0 means that all edges are kept.
-x value
--extra=value
Specifies the number of extra preceding or following letters added to each word of the cluster. See also -q option. Default value is 5.
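
The diversity estimate described for option -d can be illustrated with a short Python sketch. The exact rule is not spelled out in this document, so the sketch assumes one plausible reading: each word contributes a summand for every taxonomic level at which the prefix of its genome's taxonomic code has not been seen before in the cluster. The function name diversity and the taxonomic codes in the usage note are hypothetical.

```python
def diversity(tax_codes):
    """Sum diversity summands over the taxonomic codes of a cluster's words.

    Assumed reading: a code prefix of length 1, 2 or 3 that is new to the
    cluster adds 10000, 100 or 1 respectively.
    """
    weights = (10000, 100, 1)  # highest, lower, lowest taxon level
    seen = set()               # prefixes already met in this cluster
    total = 0
    for code in tax_codes:
        for level, weight in enumerate(weights, start=1):
            if len(code) < level:
                break          # code is shorter than three characters
            prefix = code[:level]
            if prefix not in seen:
                seen.add(prefix)
                total += weight
    return total


def passes_threshold(tax_codes, threshold=20000):
    """A zero threshold keeps every cluster; otherwise compare the estimate."""
    return threshold == 0 or diversity(tax_codes) >= threshold
```

Under this reading, two words from codes "Aaa" and "Aab" (same "genus" and "species") fall below the default threshold of 20000, while words from "Aaa" and "Bxx" (two "genera") pass it.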

The first command line argument that is not an above option or its value will be considered as a name of the configuration file (similar to option -c).

Program configuration file

The configuration file is required; it is a text file having the traditional structure of such files (an example of the file is included in the test examples for Windows and Linux). Lines of the file must not be continued; empty lines are ignored. A line that starts with ';' or '#' character is considered as a comment (i.e., ignored). A part of the line starting with double slash (//) is also considered as a comment.

Meaningful lines of the file can be of two kinds: the section header, which has the format [section_name], and the parameter setting in the format parameter = value. In the latter format, at least one blank or tab character must be used as a delimiter before and after the equals sign. The part of the file from a section header until the next section header or end of file is referred to as a configuration section. Parameters can appear in any order inside the section. The configuration file can include several sections in any order with the exception of the [common] section, which must be the first one if present, and the [species] section, which must be the last one. FinDense uses only sections [common], [findense], [species], and ignores any other sections of the configuration file.

Recognizable configuration parameters in section [findense]
splitlog Sets logging mode for the parallel variant of the program. On the right-hand side, a Boolean value must be given as TRUE (possible forms are yes true 1 +) or FALSE (possible forms are no false 0 -). A true value makes each MPI process generate a separate log file with the name specified by parameters logname, logext or command line option -o (the latter has higher priority). If the section [findense] lacks this parameter, a setting from the section [common] will be used. Default value is TRUE.
logname Specifies the log file name (without extension), which may include a directory path. If the value contains blanks, it must be included in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the section [findense] lacks this parameter, a setting from the section [common] will be used. Default value is "findense". The value of this parameter can be changed by -o command line option, which modifies the three parameters splitlog, logname, logext at once.
logext Specifies the extension of the log file name. If the section [findense] lacks this parameter, a setting from the section [common] will be used. Default value is ".log". The value of this parameter can be changed by -o command line option, which modifies the three parameters splitlog, logname, logext at once.
errname Specifies the error log file name (without extension), which may include a directory path. If the value contains blanks, it must be included in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the section [findense] lacks this parameter, a setting from the section [common] will be used. Default value is "fd_err". If empty value is specified, the error log will not be produced. The value of this parameter can be changed by -e command line option, which modifies the two parameters errname, errext at once.
errext Specifies the extension of the error log file name. If the section [findense] lacks this parameter, a setting from the section [common] will be used. Default value is ".log". The value of this parameter can be changed by -e command line option, which modifies the two parameters errname, errext at once.
fastapath The common prefix of input file names such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). Variable part of the names to be specified in the [species] section of the configuration file. If the name contains blanks, the whole argument must be enclosed in double quotes. Default is the sub-directory "fasta\" of the working directory.
gffpath Specifies the common prefix of files containing the genome annotations in GFF v.3 formats; usually this is a directory path. If the value contains blanks, it must be enclosed in double quotes. If the parameter has empty value, annotations will not be used at all. Variable part of the names to be specified in the [species] section of the configuration file. Default is the sub-directory "gff\" of the working directory. The value of this parameter can be changed by -g command line option. The separate file with annotations is not necessary if the genome is given in the GenBank format, which normally includes annotations. The annotation files are optional.
hubname Specifies the name and path of the initial graph primary file built by BldGraph. If the value contains blanks, it must be enclosed in double quotes. If the parameter is not set by the command line option -h and absent from the section [findense], it must be specified in section [common]; there is no default value.
starname Specifies the name and path of the initial graph secondary file built by BldGraph. If the value contains blanks, it must be enclosed in double quotes. The name must contain '*' character, which is replaced with the organism identifier. The value of this parameter can be changed by -s command line option. If the section [findense] lacks this parameter, a setting from the section [common] will be used. Default value is "graph\*.star".
length The lower threshold of the word length (only words of greater length are considered). This option allows extra filtering by the length of candidate words, in addition to that already made by the PairHits' and/or BldGraph's -l options. Default value is 60.
keysize Specifies the minimum length of the exactly matching portion of sought-for words (the same as for PairHits). Recommended value is a multiple of 4 in the range from 16 to 48. If the section [findense] lacks this parameter, a setting from the section [common] will be used. Default value is 16.
serial_del Specifies the maximum allowed number of consecutive deletions, and total of deletions at lack of insertions (0 means that deletions are not permitted). Value -1 means that there is no limitation on the number of deletions. If the section [findense] lacks this parameter, a setting from the section [common] will be used. Default value is 2.
maxscore Specifies the maximum allowed total penalty for the words mismatch. The value is rounded to one decimal place. If the section [findense] lacks this parameter, a setting from the section [common] will be used. Default value is 17.5.
score_del Specifies the penalty for each deletion (the positive cost of a letter insertion or deletion). The value is rounded to one decimal place. If the section [findense] lacks this parameter, a setting from the section [common] will be used. Default value is 2.1.
score_mis Specifies the penalty for each mismatch (the positive cost of a letter substitution). The value is rounded to one decimal place. If the section [findense] lacks this parameter, a setting from the section [common] will be used. Default value is 1.
clstat Specifies the name of the cluster statistics file, which may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. The file is described among output data below. If the name contains '#' character, it will be replaced with specified minimum number of different organisms in the cluster (ref. minpart parameter). The value of this parameter can be changed by -i command line option. Default value is "cluster\clstat_#.txt".
cluster Specifies the name of the detailed cluster file, which may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. The file is described among output data below. If the name contains '#' character, it will be replaced with specified minimum number of different organisms in the cluster (ref. minpart parameter). The value of this parameter can be changed by -k command line option. Default value is "cluster\cluster_#.txt".
result Specifies the name of the result file, which may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. The file is described among output data below. If the name contains '#' character, it will be replaced with specified minimum number of different organisms in the cluster (ref. minpart parameter). The value of this parameter can be changed by -r command line option. Default value is "cluster\result_#.txt".
hcecount Specifies the name of the file to write the HCE statistics over genome pairs; the name may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. The file is described among output data below. If the name contains '#' character, it will be replaced with specified minimum number of different organisms in the cluster (ref. minpart parameter). The value of this parameter can be changed by -j command line option. Default value is "cluster\hcecount_#.txt".
minpart Specifies the minimum allowable number of graph parts (i.e., different genomes) in a cluster. Default value is 3, so the cluster must contain words from at least three different genomes. Zero value means that clusters must contain words from all input genomes. To disable this control use the value of 1. The value of this parameter can be changed by -m command line option.
minweight Specifies minimum allowable edge weight (edges with lower weights are deleted). This option allows a user to thin out the graph by deleting low-weight edges. Sometimes it helps to eliminate a giant cluster or decrease its size. Default value of 0 means that all edges are kept. The value of this parameter can be changed by -w command line option.
diversity Sets the diversity threshold for identified clusters. The parameter does not affect the FinDense algorithm, it only manages the final result selection. See more detail in the description of -d command line option, which also allows changing this parameter. Default value is 20000.
milestone Specifies the number of minutes after which the program reports its progress to the log file. This parameter helps to monitor the work of the program and predict its completion. The value of this parameter can be changed by -t command line option. Default value of 0 makes the program report only on completion of the predefined stages.
extra Specifies the number of extra preceding or following letters added to each word of the cluster. See -q option for detail. The value of this parameter can be changed by -x command line option. Default value is 5.
seqmode The parameter controls output format of the genome sequence as an HCE word in the detailed cluster file (ref. parameter cluster). The string is case-sensitive and contains up to 3 different letters from the set {U,I,X,A,C} in any order without spaces. See detail in -q option, which can be used to change this parameter. Default value is "xU".
numnode Specifies a boolean value in the form like splitlog parameter. If true value is specified, the detailed cluster file will include for each vertex its number as well as numbers of all incident vertices of the final graph. By default, the program uses false value and does not include vertex numbers in output data.
annotate Specifies how long must be the intersection of a word with an annotated DNA region for the annotation to cover such word (and appear in the detailed cluster file). Thus, if the value equals 1, the word must have at least one common position with annotated region; if -1 -- the word can adjoin with the region and so on. Zero value means that annotations will not be used at all. The value of this parameter can be changed by -a command line option. Default value is 8.
outgene Specifies a boolean value in the form like splitlog parameter. If true value is specified (the default), for each word that overlaps with annotated gene region but does not entirely lie inside it, the detailed cluster file will include the number of positions such word juts out the gene. If false value is specified, the mutual location of words and genes is not considered or reported.
toptype Specifies the list of top level genomic sequence types to be recognized in annotations. The entire list must be in a single line. The types are separated one from another by '|' character; this character also must be in the beginning and end of the list. The parameter is required if annotation files are used.
lowtypeX (X=1,2,3) Each of these three parameters specifies the list of genomic sequence types at lower levels to be recognized in annotations. Every word in the detailed cluster file can be accompanied by annotations of up to three levels, e.g. gene, translated region (CDS) and RNA annotations. In order to recognize the annotation type, a separate list should be provided for each level. Such list is similar to that of the toptype parameter: it occurs in a single line, starts and finishes with '|' character, and uses '|' as delimiter between the types. All these three parameters are required if annotation files are used.
portion The number of records in the portion used for data exchange between parallel processes within the program. The bigger the portion, the higher the performance, but also the more buffer memory is needed and the higher the risk of deadlocks. If a deadlock occurs, we suggest decreasing the value from the default of 2000 to, e.g., 100.
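
Two value formats used by the parameters above, the Boolean forms (splitlog, numnode, outgene) and the '|'-delimited type lists (toptype, lowtypeX), can be checked with a small Python sketch. The helper names are hypothetical, case-insensitive matching of Boolean forms is an assumption, and the type names in the usage note are placeholders rather than values required by FinDense.

```python
TRUE_FORMS = {"yes", "true", "1", "+"}
FALSE_FORMS = {"no", "false", "0", "-"}


def parse_bool(value):
    """Map one of the documented Boolean forms to True or False."""
    v = value.strip().lower()  # assumption: letter case is not significant
    if v in TRUE_FORMS:
        return True
    if v in FALSE_FORMS:
        return False
    raise ValueError(f"not a recognized Boolean form: {value!r}")


def parse_type_list(value):
    """Split a list such as |a|b| into its items.

    The list must start and end with '|' and use '|' between the types.
    """
    v = value.strip()
    if len(v) < 2 or not (v.startswith("|") and v.endswith("|")):
        raise ValueError(f"type list must start and end with '|': {value!r}")
    return [item for item in v[1:-1].split("|") if item]
```

For instance, parse_type_list("|gene|pseudogene|") yields the two items gene and pseudogene, and a list without the leading or trailing '|' is rejected.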

The list of input data in section [species]

This section establishes a correspondence between the organism name, identifier, number and data file names. Each line corresponds to an organism. The program ignores blank lines and lines starting with the character ';' or '#'. Leading blanks or tab characters in the line are skipped, then at least 5 fields must follow, delimited by one or more tab characters. A field may include blanks. The fields contain the following data:

  1. the organism number (unique within the input data set);
  2. short denotation (identifier) of the organism, less than 36 characters in the current version;
  3. name of the organism;
  4. taxonomic code (ref. option -d above);
  5. file name of the full genome in FASTA or GenBank format (however, the whole genome must be in a single file). A path to the file is specified by fastapath parameter in the configuration file or by -f command line option.
  6. file name of the genome annotation in GFF v.3 format; this field is optional. Moreover, the separate file with annotations is not necessary if the genome is given in the GenBank format, which normally includes annotations. A path to the file is specified by gffpath parameter in the configuration file or by -g command line option.
A template list of input data is provided in the test example for Windows and Linux.
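
The record layout of the [species] section can be illustrated with a minimal Python reader. It is a sketch only: the names Species and parse_species are hypothetical, the organism number is assumed to be an integer, and blanks surrounding a field are assumed to be insignificant.

```python
import re
from typing import NamedTuple, Optional


class Species(NamedTuple):
    number: int              # field 1: organism number
    ident: str               # field 2: short identifier
    name: str                # field 3: organism name
    tax_code: str            # field 4: taxonomic code (see option -d)
    genome_file: str         # field 5: FASTA or GenBank file name
    gff_file: Optional[str]  # field 6: optional GFF v.3 file name


def parse_species(lines):
    """Parse the lines of the [species] section into Species records."""
    records = []
    for raw in lines:
        line = raw.strip()            # skips leading blanks and tabs
        if not line or line[0] in ";#":
            continue                  # blank line or comment
        # Fields are delimited by one or more tabs and may contain blanks.
        fields = [f.strip() for f in re.split(r"\t+", line)]
        if len(fields) < 5:
            raise ValueError(f"expected at least 5 fields: {raw!r}")
        gff = fields[5] if len(fields) > 5 else None
        records.append(Species(int(fields[0]), fields[1], fields[2],
                               fields[3], fields[4], gff))
    return records
```

The file names in each record are relative to the prefixes given by fastapath and gffpath, so a caller would prepend those prefixes before opening the files.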

Output data

FinDense produces a number of output text files, whose names and locations are specified in the configuration file and/or command line options. Below these files are described using the names resulting from execution of the test example; relevant configuration parameters and command line options are given in parentheses. For convenience, these file templates are provided in "myfiles\" directory of the test example.
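
As a small illustration of how the output names used below arise, the following Python sketch expands the '#' placeholder of an output file name with the minpart value. The function name is hypothetical, and replacing every '#' in the name (rather than only the first one) is an assumption.

```python
def expand_output_name(template, minpart):
    """Replace the '#' placeholder with the minimum number of organisms."""
    return template.replace("#", str(minpart))
```

With the default template "cluster\clstat_#.txt" and minpart equal to 3, this gives "cluster\clstat_3.txt", which matches the file name clstat_3.txt used in the descriptions below.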

findense.log: the program log file (splitlog logname logext -o)
This file is self-explanatory. At its beginning, it shows the program version number, the number of parallel processes if FinDense was run in MPI mode, and the command line arguments, if any. In the parallel mode, each process usually produces a separate log file; the most substantial is the log of the root process (with number 0), which finally identifies the clusters. If the initial graph was built by BldGraph in the mode insensitive to DNA strand, FinDense tries to bind each cluster word to the specific strand. In rare cases, such binding cannot be done and the log contains a warning "Cannot reconcile cluster..." with the number of the cluster that could not be bound.

matrix.fas: the result file (result minpart -r -m)
The current FinDense version produces the result file for subsequent building of a phylogenetic tree by RAxML program. The file has FASTA format and contains one sequence for each organism; the organism name appears in the header line of the sequence after the '>' character. All sequences have the same length equal to the number of clusters identified and consist of characters '0' and '1'. The character '1' in j-th position of i-th string indicates that HCE with the number j contains a word from the i-th genome; otherwise '0' appears in that position.
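
The presence/absence matrix described above can be read back with a few lines of Python. This is a sketch, not part of the iHCE package: the function names are hypothetical, and numbering HCEs from 1 is an assumption about how the positions map to cluster numbers.

```python
def read_presence_matrix(lines):
    """Return {organism_name: string of '0'/'1'} from FASTA-formatted lines."""
    matrix = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()   # organism name follows '>'
            matrix[name] = ""
        elif name is not None:
            matrix[name] += line      # a sequence may span several lines
    if len({len(row) for row in matrix.values()}) > 1:
        raise ValueError("sequences differ in length")
    return matrix


def genomes_in_hce(matrix, j):
    """Organisms whose genome contributes a word to HCE number j (1-based)."""
    return [name for name, row in matrix.items() if row[j - 1] == "1"]
```

The same dictionary can be used to count, for each organism, how many HCEs include a word from its genome by counting the '1' characters in its row.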

hcecount_3.txt: the file with HCE statistics over genome pairs (hcecount minpart -j -m)
The lines of this text file contain a number of fields delimited by the tab character, which allows easy import into a spreadsheet, e.g., Excel. The test example includes file "example.xlsx" whose sheet "hcecount" was produced this way; we shall describe the HCE statistics file with reference to this spreadsheet. It contains two square matrices with the order equal to the number of genomes. The organism names are shown to the left of and above each matrix. The upper matrix pertains to the initial graph, and the lower one to the final graph. The cell values are the numbers of pairs of similar words for each pair of genomes. In other words, these are the numbers of edges with high weights between vertices of each two parts of the graph. The bottom row of each matrix contains the totals for each column.

clstat_3.txt: the cluster statistics file (clstat minpart -i -m)
The lines of this text file contain a number of fields delimited by the tab character, which allows easy import into a spreadsheet, e.g., Excel. The test example includes file "example.xlsx" whose sheet "clstat" was produced this way; we shall describe the cluster statistics file with reference to this spreadsheet. The first line of the file provides field headings for the second line, which contains the following data (per column): A - the total number of graph parts, i.e., input genomes; B - the minimum number of parts represented in a cluster (the value of minpart parameter); C - the number of vertices in the initial graph; D - the number of edges in the initial graph; E - the number of steps done until convergence of the dense subgraphs identification algorithm; F - the number of vertices in the final graph; G - the number of edges in the final graph; H - the total number of clusters with sufficient diversity; I - the maximum number of vertices (i.e., words) in a cluster; J - the minimum number of vertices in a cluster; K - the average number of vertices in a cluster (rounded to nearest integer); L...O - the number of clusters that contain words from the number of genomes shown in the respective heading.
The third line contains the field headings for subsequent lines of the file, namely (per column): A - serial number of the cluster; B - total number of words in the cluster; C - the number of different genomes these words belong to; D...I - the number of words from each genome; K - the number of words overlapping with a gene; L - the number of words overlapping with a translated region (CDS); M - the number of words overlapping with known RNA. The subsequent lines contain these values for each cluster. The clusters are listed in descending order of diversity (see option -d).

cluster_3.txt: the detailed cluster file (cluster minpart -k -m)
The lines of this text file contain a number of fields delimited by the tab character, which allows easy import into a spreadsheet, e.g., Excel. The test example includes file "example.xlsx" whose sheet "cluster" was produced this way; we shall describe the detailed cluster file with reference to this spreadsheet. The first line of the file provides field headings; the second line is left empty for setting filters in Excel. The subsequent lines describe the clusters in the same order as in the cluster statistics file. Each cluster is presented by a heading line (in a specific format) followed by a data line for each word of the cluster. The data lines have a common format corresponding to the headline of the file.
All lines of the file contain the cluster number in column A and the number of genomes in the cluster in column B, which makes it easy to select clusters of interest in a big file.
The heading line of each cluster contains the constant values 99999 and 999 in columns C and D respectively, which makes it easy to select only heading lines. Column E contains the number of words in the cluster, and column F its diversity (see option -d).
Data lines follow the cluster heading line and contain the following data on each vertex of the cluster (per column): C - the number of vertices incident to the given vertex (vertex degree); D - the number of graph parts (i.e., genomes) the given vertex is directly connected to (vertex density); E - the organism name; F - identifier of the genomic sequence containing the word assigned to the given vertex; G - position of the word in the sequence (the lesser of the word begin and end positions); H - the word length; I - DNA strand indicator (1/-1/0); J - the word itself, i.e., a region corresponding to the values of fields G, H, I; K - the name of a gene if overlapped with the word; L - gene begin; M - gene end; N - gene strand; O - gene description; P - the name of CDS if overlapped with the word; Q - CDS begin; R - CDS end; S - CDS strand; T - CDS description; U - the name of RNA if overlapped with the word; V - RNA begin; W - RNA end; X - RNA strand; Y - RNA description; Z - the number of positions by which the word juts out of the intersecting gene (see parameter annotate); this column appears if outgene parameter equals TRUE; AA - the vertex number; AB... - the numbers of incident vertices of the graph. Columns AA etc. appear if numnode parameter equals TRUE.
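
The layout just described (heading lines marked by 99999 and 999 in columns C and D, followed by data lines) can be processed with a short Python sketch. The function name read_clusters and the dictionary keys are hypothetical, and the diversity in column F is assumed to be an integer.

```python
def read_clusters(lines):
    """Group the lines of the detailed cluster file by cluster number."""
    clusters = {}
    for index, raw in enumerate(lines):
        if index < 2:
            continue                  # field headings and the empty filter line
        cols = raw.rstrip("\r\n").split("\t")
        if len(cols) < 6:
            continue                  # not a heading or data line
        number = int(cols[0])         # column A: cluster number
        if cols[2] == "99999" and cols[3] == "999":
            # Heading line: B = genomes, E = words, F = diversity.
            clusters[number] = {
                "genomes": int(cols[1]),
                "words": int(cols[4]),
                "diversity": int(cols[5]),
                "rows": [],
            }
        elif number in clusters:
            clusters[number]["rows"].append(cols)  # data line, columns A onward
    return clusters
```

In each stored row, index 4 holds the organism name (column E) and index 9 the word itself (column J), following the column description above.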

Notes:
  1. The word positions in columns G and H are approximate; exact coordinates can be determined by multiple alignment of all words of the cluster (to be implemented in future versions of FinDense).
  2. The word in column J is the genome part corresponding to the values in columns G, H, I, and for negative strand the reverse complement is taken. Since in general case this word is the result of merging a group of multiple words assigned to vertices of the source graph, the common intersection of those words is shown in uppercase letters, and the union -- in lowercase letters, in accordance with the configuration specified in the test example. Other output formats are also possible, see parameter seqmode and command line option -q for details.
  3. When the initial graph is built by BldGraph in the mode of indistinguishable DNA strands (strand = false), in rare cases the DNA strand indicator in column I cannot be recovered; this is indicated by value 0. In such cases the word in column J may fail to match the other words of the cluster, and a warning about that appears in the log file (see above).