FinDense program
The program is part of the iHCE software package, which identifies highly conserved elements (HCEs) in a set of genomes. It transforms the initial graph built by BldGraph into the final graph and identifies m-dense subgraphs (clusters) in the latter. Each cluster consists of vertices, together with the words assigned to them, that belong to at least m parts of the graph and are connected by edges with the largest total weight. Each cluster (with the possible exception of one giant cluster) is considered a predicted HCE. The program is intended for parallel computing under MPI on a high-performance cluster running 64-bit Windows or Linux.
FinDense is a command line utility written in C++. The command line syntax is:
findense [options] [config_file]
(Here, findense is the program name on Linux; findense64 is the variant for 64-bit Windows with MPICH2, findense64ms is the variant with Microsoft MPI, and findense64nompi is the variant without MPI.)
The program options can be specified in any of three formats: /x[ value], -x[ value], or --x…xx[=value]; some options take no value. Most options override the “usual” values of the parameters, which are set in the program configuration file or by default (a value given on the command line has the highest priority).
The recognized command line options and their default values
- -?
- --help
- Displays a brief help on command line options.
- -a value
- --annotate=value
- Specifies the minimum length of the intersection between a word and an annotated DNA region that is required for the annotation to cover the word (and appear in the detailed cluster file). Thus, if the value equals 1, the word must share at least one position with the annotated region; if the value is -1, the word may be adjacent to the region, and so on. A zero value means that annotations are not used at all. The default value is 8.
- -c filename
- --config=filename
- The configuration file name, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the config.ini file in the working directory.
- -d value
- --diversity=value
- Sets the diversity threshold for identified clusters. The parameter does not affect the FinDense algorithm; it only controls selection of the final results. For each HCE, an empirical “diversity” estimate is calculated, and if it is less than the threshold, the cluster is skipped. The diversity is calculated by adding up to 3 kinds of summands for each word of the cluster: 10000 for a new genome of the highest taxon (e.g., for each new genus), 100 for a new genome of a lower taxon (e.g., a new species of the same genus), and 1 for a new genome of the lowest taxon (e.g., a new strain of the same species). The specific diversity levels are defined by the user, who sets a taxonomic code for each organism in the [species] section of the configuration file. The taxonomic code is a string of at most 3 characters corresponding to the three levels; the characters are compared to decide how novel the words from those genomes are when the HCE diversity is calculated. A zero value means that the program results include all identified clusters irrespective of their diversity. The default value of 20000 means that all clusters consisting of words from genomes of the same “genus” (those having the same first letter in the taxonomic code) are skipped.
- -e filename
- --err=filename
- The name of the error log file, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The # character in the file name will be replaced by the CPU number, provided that separate logging of each MPI process is set in the configuration file. The default file name is fd_err#.log. The user can completely disable error logging through the configuration file.
- -f name
- --fastapath=name
- The common prefix of the input file names, such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of the names is specified in the [species] section of the configuration file. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the fasta\ subdirectory of the working directory.
- -g name
- --gffpath=name
- The common prefix of the files containing genome annotations in GFF v.3 format; usually this is a directory path. The variable part of the names is specified in the [species] section of the configuration file. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the gff\ subdirectory of the working directory. A separate annotation file is not required if the genome is given in GenBank format, which normally includes annotations. The annotation files are optional.
- -h filename
- --hub=filename
- The name of the initial graph primary file, which may include a directory path. The initial graph is built by the BldGraph program; it consists of the primary file (normally with the .hub extension) and multiple secondary files with the .star extension, which reside in the same directory (the default is graph\ in the working directory). If the name contains spaces, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name must be specified in the configuration file; there is no default name.
- -i filename
- --clstat=filename
- The name of the cluster statistics file, which may include a directory path. The file is described among the output data below. If the name contains the # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the -m option). If the name contains spaces, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name can be specified in the configuration file; otherwise the default value cluster\clstat_#.txt is used.
- -j filename
- --hcecount=filename
- The name of the file to which the HCE statistics over genome pairs are written; the name may include a directory path. The file is described among the output data below. If the name contains the # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the -m option). If the name contains spaces, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name can be specified in the configuration file; otherwise the default value cluster\hcecount_#.txt is used.
- -k filename
- --cluster=filename
- The name of the detailed cluster file, which may include a directory path. The file is described among the output data below. If the name contains the # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the -m option). If the name contains spaces, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name can be specified in the configuration file; otherwise the default value cluster\cluster_#.txt is used.
- -m value
- --minpart=value
- The minimum allowable number of graph parts (i.e., different genomes) in a cluster. The default value is 3, so a cluster must contain words from at least three different genomes. A zero value means that clusters must contain words from all input genomes. To disable this check, use the value 1.
- -n
- --nompi
- If specified, the multiprocessor variant of the program is forced to run in uniprocessor mode even in an MPI environment. This option may not help on some systems without MPI, where the program crashes; the uniprocessor variant of the program should be used in such cases.
- -o filename
- --log=filename
- The name of the program log file, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The # character in the file name will be replaced by the CPU number, provided that separate logging of each MPI process is set in the configuration file. The default file name is findense#.log.
- -p value
- --portion=value
- The number of records in the portion used for data exchange between parallel processes within the program. The larger the portion, the higher the performance, but also the greater the buffer memory use and the risk of deadlocks. If a deadlock occurs, we suggest decreasing the value from the default of 2000 to, e.g., 100.
- -q string
- --seq=string
- This option controls the output format of the genome sequence shown as an HCE word in the detailed cluster file (ref. the -k option). The string is case-sensitive and contains up to 3 different letters from the set {U, I, X, A, C} in any order without spaces. The default value is xU. The meaning of the letters:
U/u
- At each vertex of the final graph, the region corresponding to the union of the initial words merged at this vertex is shown in uppercase (U) or lowercase (u) letters.
I/i
- At each vertex of the final graph, the region corresponding to the intersection of the initial words merged at this vertex is shown in uppercase (I) or lowercase (i) letters.
- If both union and intersection are specified in the string, the intersection region is shown as per the I/i specification, and its complement within the union region is shown as per the U/u specification.
X/x
- In addition to the requested region, the N preceding and N following letters are shown in uppercase (X) or lowercase (x), unless the beginning or the end of the sequence is reached. The value of N is set by the -x option or the extra parameter in the configuration file. The default value is 5.
A, C
- Reserved for future development.
- -r filename
- --result=filename
- The name of the result file, which may include a directory path. The file is described among the output data below. If the name contains the # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the -m option). If the name contains spaces, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name can be specified in the configuration file; otherwise the default value cluster\result_#.txt is used.
- -s filename
- --star=filename
- The name of the initial graph secondary file, which may include a directory path (see also the -h option). The name must contain the * character, which stands for the organism identifier. If the name contains spaces, the whole argument must be enclosed in double quotes. If this option is absent from the command line, the file name can be specified in the configuration file; otherwise the default value graph\*.star is used.
- -t value
- --stamp=value
- Specifies the interval, in minutes, at which the program reports its progress to the log file. This parameter helps to monitor the work of the program and predict its completion time. The default value of 0 makes the program report only on completion of the predefined stages.
- -w value
- --weight=value
- Specifies the minimum allowable edge weight (edges with lower weights are deleted). This option allows the user to thin out the graph by deleting low-weight edges. Sometimes this helps to eliminate a giant cluster or decrease its size. The default value of 0 means that all edges are kept.
- -x value
- --extra=value
- Specifies the number of extra preceding and following letters added to each word of the cluster. See also the -q option. The default value is 5.
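The interplay of the U/u, I/i and X/x letters of the -q option can be illustrated with a small sketch. It is not the FinDense implementation: the function name render_word, the half-open 0-based coordinates and the requirement that the intersection lies inside the union are all assumptions made for the illustration.

```python
def render_word(seq, union, inter, mode="xU", extra=5):
    """Render one HCE word according to a -q style mode string.

    seq   -- the whole genomic sequence (hypothetical input)
    union -- (start, end) of the union region, half-open, 0-based
    inter -- (start, end) of the intersection region, inside the union
    """
    def case_for(upper, lower):
        # Pick the case function requested by the mode string, if any.
        if upper in mode:
            return str.upper
        if lower in mode:
            return str.lower
        return None

    u_case = case_for("U", "u")
    i_case = case_for("I", "i")
    x_case = case_for("X", "x")
    u_start, u_end = union
    i_start, i_end = inter

    if u_case and i_case:
        # Intersection as per I/i, the rest of the union as per U/u.
        start, end = u_start, u_end
        core = (u_case(seq[u_start:i_start])
                + i_case(seq[i_start:i_end])
                + u_case(seq[i_end:u_end]))
    elif u_case:
        start, end = u_start, u_end
        core = u_case(seq[start:end])
    elif i_case:
        start, end = i_start, i_end
        core = i_case(seq[start:end])
    else:
        raise ValueError("mode must contain U/u or I/i")

    if x_case:
        # Flanks are clipped at the sequence boundaries.
        left = x_case(seq[max(0, start - extra):start])
        right = x_case(seq[end:end + extra])
        core = left + core + right
    return core
```

With the default mode xU, the union region appears in uppercase with lowercase flanks; a mode such as Iu would show the intersection in uppercase and the rest of the union in lowercase, without flanks.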
The first command line argument that is neither one of the above options nor an option value is treated as the name of the configuration file (as with the -c option).
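The diversity estimate used by the -d option can be sketched as follows. The description leaves some details open, so this is only one plausible reading: it assumes that each word contributes a single summand, chosen by the highest taxonomic level at which its genome is new within the cluster, and that the levels correspond to prefixes of the taxonomic code. The function names are hypothetical.

```python
def cluster_diversity(tax_codes):
    """Sum diversity summands over the words of one cluster.

    tax_codes -- one taxonomic code (up to 3 characters) per word, i.e. the
    code of the genome the word comes from.
    ASSUMPTION: a word adds only the summand of the highest level at which
    its genome is new; the real program may count differently.
    """
    weights = (10000, 100, 1)        # highest, lower, lowest taxon
    seen = (set(), set(), set())     # code prefixes of length 1, 2, 3
    total = 0
    for code in tax_codes:
        prefixes = [code[:k] for k in (1, 2, 3)]
        for level, prefix in enumerate(prefixes):
            if prefix not in seen[level]:
                total += weights[level]
                break                # only the highest new level counts
        for level, prefix in enumerate(prefixes):
            seen[level].add(prefix)
    return total


def passes_threshold(tax_codes, threshold=20000):
    """A zero threshold keeps every cluster; otherwise compare diversities."""
    return threshold == 0 or cluster_diversity(tax_codes) >= threshold
```

Under this reading, a cluster whose words all come from genomes with the same first letter stays below 20000 and is skipped by the default threshold, while words from two different “genera” reach it.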
The program configuration file
The configuration file is required; it is a text file with the traditional structure of such files (an example is included in the test examples for Windows and Linux). Lines of the file must not be wrapped; empty lines are ignored. A line that starts with a ; or # character is considered a comment (i.e., ignored). The part of a line starting with a double slash (//) is also considered a comment.
Meaningful lines of the file are of two kinds: a section name in the format [section_name], and a parameter setting in the format parameter = value. In the latter format, at least one space or tab character must be used as a delimiter before and after the equals sign. The part of the file from a section name to the next section name or the end of the file is referred to as a configuration section. Parameters can appear in any order inside a section.
The configuration file can include several sections in any order, except that the [common] section must be the first one if present, and the [species] section must be the last one. FinDense uses only the [common], [findense], and [species] sections and ignores any other sections of the configuration file.
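The file format described above is simple enough to read with a few lines of code, which can be handy for checking a configuration before a long run. The following is a minimal sketch, not the parser used by FinDense; the function name parse_config and the choice to keep [species] lines unparsed are assumptions made for the illustration.

```python
import re

# "parameter = value": whitespace is required around the equals sign;
# the value itself may be empty.
_SETTING = re.compile(r"(\S+)[ \t]+=(?:[ \t]+(.*))?$")


def parse_config(text):
    """Return ({section: {parameter: value}}, raw [species] lines)."""
    sections, species, current = {}, [], None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].rstrip()     # drop a trailing // comment
        stripped = line.strip()
        if not stripped or stripped[0] in ";#":   # empty line or comment line
            continue
        if stripped.startswith("[") and stripped.endswith("]"):
            current = stripped[1:-1]
            sections.setdefault(current, {})
            continue
        if current == "species":                  # tab-delimited records
            species.append(stripped)
            continue
        match = _SETTING.match(stripped)
        if match is None or current is None:
            raise ValueError("cannot parse line: " + raw)
        value = match.group(2) or ""
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]                   # strip enclosing quotes
        sections[current][match.group(1)] = value
    return sections, species
```

A caller could then look a parameter up in the [findense] section first and fall back to [common], which mirrors the fallback described for many of the parameters below.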
The recognized configuration parameters in the [findense] section
- splitlog
- Sets the logging mode for the parallel variant of the program. On the right-hand side, a Boolean value must be given as TRUE (possible forms are yes, true, 1, +) or FALSE (possible forms are no, false, 0, -). A true value makes each MPI process generate a separate log file with the name specified by the logname and logext parameters or by the -o command line option (the latter has a higher priority). If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is TRUE.
- logname
- Specifies the log file name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is findense. The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
- logext
- Specifies the extension of the log file name. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is .log. The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
- errname
- Specifies the error log file name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is fd_err. If an empty value is specified, the error log will not be produced. The value of this parameter can be changed by the -e command line option, which modifies the two parameters errname, errext at once.
- errext
- Specifies the extension of the error log file name. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is .log. The value of this parameter can be changed by the -e command line option, which modifies the two parameters errname, errext at once.
- fastapath
- The common prefix of the input file names, such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of the names is specified in the [species] section of the configuration file. If the value contains spaces, it must be enclosed in double quotes. The default is the fasta\ subdirectory of the working directory.
- gffpath
- Specifies the common prefix of the files containing the genome annotations in GFF v.3 format; usually this is a directory path. If the value contains spaces, it must be enclosed in double quotes. If the parameter has an empty value, annotations will not be used at all. The variable part of the names is specified in the [species] section of the configuration file. The default is the gff\ subdirectory of the working directory. The value of this parameter can be changed by the -g command line option. A separate annotation file is not required if the genome is given in GenBank format, which normally includes annotations. The annotation files are optional.
- hubname
- Specifies the name and path of the initial graph primary file built by BldGraph. If the value contains spaces, it must be enclosed in double quotes. If the parameter is not set by the -h command line option and is absent from the [findense] section, it must be specified in the [common] section; there is no default value.
- starname
- Specifies the name and path of the initial graph secondary file built by BldGraph. If the value contains spaces, it must be enclosed in double quotes. The name must contain the * character, which is replaced with the organism identifier. The value of this parameter can be changed by the -s command line option. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is graph\*.star.
- length
- The lower threshold of the word length (only words of greater length are considered). This parameter allows extra filtering of candidate words by length, in addition to the filtering already done by the -l options of PairHits and/or BldGraph. The default value is 60.
- keysize
- Specifies the minimum length of the exactly matching portion of the sought-for words (the same as for PairHits). The recommended value is a multiple of 4 in the range of 16 to 48. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is 16.
- serial_del
- Specifies the maximum allowed number of consecutive deletions, which is also the maximum total number of deletions when there are no insertions (0 means that deletions are not permitted). The value -1 means that there is no limit on the number of deletions. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is 2.
- maxscore
- Specifies the maximum allowed total penalty for mismatches between words. The value is rounded to one decimal place. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is 17.5.
- score_del
- Specifies the penalty for each deletion (the positive cost of a letter insertion or deletion). The value is rounded to one decimal place. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is 2.1.
- score_mis
- Specifies the penalty for each mismatch (the positive cost of a letter substitution). The value is rounded to one decimal place. If the [findense] section lacks this parameter, the setting from the [common] section will be used. The default value is 1.
- clstat
- Specifies the name of the cluster statistics file, which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. The file is described among the output data below. If the name contains the # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the minpart parameter). The value of this parameter can be changed by the -i command line option. The default value is cluster\clstat_#.txt.
- cluster
- Specifies the name of the detailed cluster file, which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. The file is described among the output data below. If the name contains the # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the minpart parameter). The value of this parameter can be changed by the -k command line option. The default value is cluster\cluster_#.txt.
- result
- Specifies the name of the result file, which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. The file is described among the output data below. If the name contains the # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the minpart parameter). The value of this parameter can be changed by the -r command line option. The default value is cluster\result_#.txt.
- hcecount
- Specifies the name of the file to which the HCE statistics over genome pairs are written; the name may include a directory path. If the value contains spaces, it must be enclosed in double quotes. The file is described among the output data below. If the name contains the # character, it will be replaced with the specified minimum number of different organisms in the cluster (ref. the minpart parameter). The value of this parameter can be changed by the -j command line option. The default value is cluster\hcecount_#.txt.
- minpart
- Specifies the minimum allowed number of graph parts (i.e., different genomes) in a cluster. The default value is 3, so a cluster must contain words from at least three different genomes. A zero value means that clusters must contain words from all input genomes. To disable this limitation, use the value 1. The value of this parameter can be changed by the -m command line option.
- minweight
- Specifies the minimum allowable edge weight (edges with lower weights are deleted). This parameter allows the user to thin out the graph by deleting low-weight edges. Sometimes this helps to eliminate a giant cluster or decrease its size. The default value of 0 means that all edges are kept. The value of this parameter can be changed by the -w command line option.
- diversity
- Sets the diversity threshold for identified clusters. The parameter does not affect the FinDense algorithm; it only controls selection of the final results. See the description of the -d command line option for more detail; that option can also be used to change this parameter. The default value is 20000.
- milestone
- Specifies the interval, in minutes, at which the program reports its progress to the log file. This parameter helps to monitor the work of the program and predict its completion time. The value of this parameter can be changed by the -t command line option. The default value of 0 makes the program report only on completion of the predefined stages.
- extra
- Specifies the number of extra preceding and following letters added to each word of the cluster. See the -q option for details. The value of this parameter can be changed by the -x command line option. The default value is 5.
- seqmode
- Controls the output format of the genome sequence shown as an HCE word in the detailed cluster file (ref. the cluster parameter). The string is case-sensitive and contains up to 3 different letters from the set {U, I, X, A, C} in any order without spaces. See the -q option for details; that option can also be used to change this parameter. The default value is xU.
- numnode
- Specifies a Boolean value in the same forms as the splitlog parameter. If a true value is specified, the detailed cluster file will include, for each vertex, its number as well as the numbers of all incident vertices of the final graph. By default, the program uses a false value and does not include vertex numbers in the output data.
- annotate
- Specifies the minimum length of the intersection between a word and an annotated DNA region that is required for the annotation to cover the word (and appear in the detailed cluster file). Thus, if the value equals 1, the word must share at least one position with the annotated region; if the value is -1, the word may be adjacent to the region, and so on. A zero value means that annotations are not used at all. The value of this parameter can be changed by the -a command line option. The default value is 8.
- outgene
- Specifies a Boolean value in the same forms as the splitlog parameter. If a true value is specified (the default), then for each word that overlaps an annotated gene region but does not lie entirely inside it, the detailed cluster file will include the number of positions by which the word extends beyond the gene region. If a false value is specified, the relative location of words and genes is neither considered nor reported.
- toptype
- Specifies the list of top-level genomic sequence types to be recognized in annotations. The entire list must be on a single line. The types are separated by the | character; this character must also appear at the start and the end of the list. The parameter is required if annotation files are used.
- lowtypeX (X=1,2,3)
- Each of these three parameters specifies the list of genomic sequence types at a lower level to be recognized in annotations. Every word in the detailed cluster file can be accompanied by annotations of up to three levels, e.g., gene, translated region (CDS) and RNA annotations. In order to recognize the annotation type, a separate list should be provided for each level. Such a list is similar to that of the toptype parameter: it occupies a single line, starts and ends with the | character, and uses it as the delimiter. All three parameters are required if annotation files are used.
- portion
- The number of records in the portion used for data exchange between parallel processes within the program. The larger the portion, the higher the performance, but also the greater the buffer memory use and the risk of deadlocks. If a deadlock occurs, we suggest decreasing the value from the default of 2000 to, e.g., 100.
The list of input data in the [species] section
This section establishes a correspondence between the organism name, identifier, number and data file names. Each line corresponds to an organism. The program ignores blank lines and lines starting with a ; or # character. Leading spaces or tab characters in the line are skipped; then at least 5 fields must follow, delimited by one or more tab characters. A field may include spaces. The fields contain the following data:
- the organism number (unique within the input data set);
- a short denotation (identifier) of the organism, less than 36 characters in the current version;
- the name of the organism;
- the taxonomic code (ref. the -d option above);
- the file name of the full genome in FASTA or GenBank format (the whole genome must be in a single file). The path to the file is specified by the fastapath parameter in the configuration file or by the -f command line option.
- the file name of the genome annotation in GFF v.3 format; this field is optional. Moreover, a separate annotation file is not necessary if the genome is given in GenBank format, which normally includes annotations. The path to the file is specified by the gffpath parameter in the configuration file or by the -g command line option.
A template list of input data is provided in the test example for Windows and Linux.
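As an illustration of the record layout, the sketch below splits one [species] line into its fields. It is not taken from FinDense: the Species class, the function name, the sample file names in the usage note, and the assumption that the organism number is an integer are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Species:
    number: int                            # ASSUMPTION: an integer number
    identifier: str
    name: str
    tax_code: str
    genome_file: str
    annotation_file: Optional[str] = None  # the optional sixth field


def parse_species_line(line):
    """Parse one [species] line; return None for blank or comment lines."""
    stripped = line.lstrip(" \t").rstrip("\r\n")
    if not stripped or stripped[0] in ";#":
        return None
    # Fields are delimited by one or more tab characters and may hold spaces.
    fields = [f.strip() for f in re.split(r"\t+", stripped) if f.strip()]
    if len(fields) < 5:
        raise ValueError("at least 5 tab-delimited fields are required")
    if len(fields[1]) >= 36:
        raise ValueError("identifier must be shorter than 36 characters")
    annotation = fields[5] if len(fields) > 5 else None
    return Species(int(fields[0]), fields[1], fields[2], fields[3],
                   fields[4], annotation)
```

For example, a line with the fields 1, ecoli, Escherichia coli, Aaa, ecoli.fasta would yield a record whose annotation_file is None; the file names are then combined with the fastapath and gffpath prefixes.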
Output data
FinDense produces a number of output text files, whose names and locations are specified in the configuration file and/or by command line options. These files are described below using the names resulting from an execution of the test example; the relevant configuration parameters and command line options are given in parentheses. For convenience, templates of these files are provided in the myfiles\ directory of the test example.
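The output names used below, such as clstat_3.txt, follow from the # placeholder in the default names and a minpart value of 3. A minimal sketch of that substitution is given here; the function names are hypothetical, and replacing every occurrence of the placeholder (rather than only the first) is an assumption.

```python
def expand_output_name(template, minpart):
    """Replace the # placeholder with the minimum number of organisms."""
    return template.replace("#", str(minpart))


def expand_star_name(template, organism_id):
    """Replace the * placeholder with an organism identifier."""
    if "*" not in template:
        raise ValueError("the secondary file name must contain '*'")
    return template.replace("*", organism_id)
```

For instance, the default cluster\clstat_#.txt expands to cluster\clstat_3.txt when minpart is 3.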
findense.log: the program log file (splitlog, logname, logext, -o)
This file is mostly self-explanatory. It begins with the program version number and the number of parallel processes, if FinDense was run in MPI mode, as well as the command line arguments, if any. In parallel mode, each process usually produces a separate log file; the most informative is the log of the root process (number 0), which finally identifies the clusters. If the initial graph was built by BldGraph in the mode insensitive to the DNA strand, FinDense tries to bind each cluster word to a specific strand. In rare cases, such binding cannot be done, and the log contains the warning “Cannot reconcile cluster…” with the number of the cluster that could not be bound.
matrix.fas: the result file (result, minpart, -r, -m)
The current FinDense version produces the result file for subsequent building of a phylogenetic tree with the RAxML program. The file is in FASTA format and contains one sequence for each organism; the organism name appears in the header line of the sequence after the > character. All sequences have the same length, equal to the number of identified clusters, and consist of the characters 0 and 1. The character 1 in a given position means that the organism has a word in the corresponding cluster; otherwise 0 appears in that position.
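A result file of this kind can be loaded and checked with a few lines of code. The sketch below is an illustration only; the function names are hypothetical, and it assumes that 1 marks the presence of an organism in a cluster and that cluster positions are counted from zero.

```python
def read_presence_matrix(fasta_text):
    """Read a 0/1 FASTA matrix into {organism name: sequence}."""
    matrix, name = {}, None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            matrix[name] = ""
        elif name is None:
            raise ValueError("sequence data before the first header line")
        else:
            matrix[name] += line          # sequences may span several lines
    if len({len(seq) for seq in matrix.values()}) > 1:
        raise ValueError("all sequences must have the same length")
    for seq in matrix.values():
        if set(seq) - {"0", "1"}:
            raise ValueError("only the characters 0 and 1 are expected")
    return matrix


def organisms_in_cluster(matrix, index):
    """Names of the organisms marked with 1 at a zero-based position."""
    return [name for name, seq in matrix.items() if seq[index] == "1"]
```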
hcecount_3.txt: the file with HCE statistics over genome pairs (hcecount, minpart, -j, -m)
The lines of this text file contain a number of fields delimited by the tab character, which makes the file easy to import into a spreadsheet, e.g., Excel. The test example includes the example.xlsx file, whose sheet “hcecount” was produced this way; the HCE statistics file is described here with reference to this spreadsheet. It contains two square matrices whose order equals the number of genomes. The organism names are shown to the left of and above each matrix. The upper matrix relates to the initial graph, and the lower one to the final graph. The cell values are the numbers of pairs of similar words for each pair of genomes. In other words, these are the numbers of edges with high weights between the vertices of each two parts of the graph. The bottom row of each matrix contains the totals for each column.
clstat_3.txt: the cluster statistics file (clstat, minpart, -i, -m)
The lines of this text file contain a number of fields delimited by the tab character, which makes the file easy to import into a spreadsheet, e.g., Excel. The test example includes the example.xlsx file, whose sheet “clstat” was produced this way; the cluster statistics file is described here with reference to this spreadsheet. The first line of the file provides field headings for the second line, which contains the following data (per column): A: the total number of graph parts, i.e., input genomes; B: the minimum number of parts represented in a cluster (the value of the minpart parameter); C: the number of vertices in the initial graph; D: the number of edges in the initial graph; E: the number of steps done until convergence of the dense subgraph identification algorithm; F: the number of vertices in the final graph; G: the number of edges in the final graph; H: the total number of clusters with sufficient diversity; I: the maximum number of vertices (i.e., words) in a cluster; J: the minimum number of vertices in a cluster; K: the average number of vertices in a cluster (rounded to the nearest integer); L–O: the number of clusters that contain words from the number of genomes shown in the respective heading.
The third line contains the field headings for the subsequent lines of the file, namely (per column): A: the serial number of the cluster; B: the total number of words in the cluster; C: the number of different genomes these words belong to; D–I: the number of words from each genome; K: the number of words overlapping a gene; L: the number of words overlapping a translated region (CDS); M: the number of words overlapping a known RNA. The subsequent lines contain these values for each cluster. The clusters are listed in descending order of their diversity (see the -d option).
cluster_3.txt: the detailed cluster file (cluster, minpart, -k, -m)
The lines of this text file contain a number of fields delimited by the tab character, which makes the file easy to import into a spreadsheet, e.g., Excel. The test example includes the example.xlsx file, whose sheet “cluster” was produced this way; the detailed cluster file is described here with reference to this spreadsheet. The first line of the file provides field headings; the second line is left empty for setting filters in Excel. The subsequent lines describe the clusters in the same order as in the cluster statistics file. Each cluster is represented by a heading line (in a specific format) followed by a data line for each word of the cluster. The data lines have a common format corresponding to the first line of the file.
- All lines of the file contain the cluster number in column A and the number of genomes in the cluster in column B, which makes it easy to select clusters of interest in a big file.
- The heading line of each cluster contains the constant values 99999 and 999 in columns C and D respectively, which makes it easy to select only the heading lines. Column E contains the number of words in the cluster, and column F contains its diversity (see the -d option).
- Data lines follow the cluster heading line and contain the following data on each vertex of the cluster (per column): C: the number of vertices incident to the given vertex (vertex degree); D: the number of graph parts (i.e., genomes) the given vertex is directly connected to (vertex density); E: the organism name; F: the identifier of the genomic sequence containing the word assigned to the given vertex; G: the position of the word in the sequence (the smaller of the word start and end positions); H: the word length; I: the DNA strand indicator (1/-1/0); J: the word itself, i.e., the region corresponding to the values of fields G, H, I; K: the name of the gene, if the word overlaps it; L: gene start; M: gene end; N: gene strand; O: gene description; P: the name of the CDS, if the word overlaps it; Q: CDS start; R: CDS end; S: CDS strand; T: CDS description; U: the name of the RNA, if the word overlaps it; V: RNA start; W: RNA end; X: RNA strand; Y: RNA description; Z: the number of positions by which the word extends beyond the gene region, if they overlap (see the annotate parameter); this column appears if the outgene parameter equals TRUE; AA: the vertex number; AB…: the numbers of incident vertices of the graph. Columns from AA onward appear if the numnode parameter equals TRUE.
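Outside a spreadsheet, the same constants can be used to separate heading lines from data lines. The following sketch groups the lines of a detailed cluster file by cluster; the function name and the dictionary keys are hypothetical, and the diversity in column F is assumed to be an integer.

```python
def read_clusters(text):
    """Return (field headings, list of clusters with their data rows)."""
    lines = text.splitlines()
    header = lines[0].split("\t")
    clusters = []
    for line in lines[2:]:              # the second line is left empty
        if not line.strip():
            continue
        f = line.split("\t")
        # Heading lines carry 99999 and 999 in columns C and D.
        if len(f) >= 6 and f[2] == "99999" and f[3] == "999":
            clusters.append({
                "number": int(f[0]),    # column A
                "genomes": int(f[1]),   # column B
                "words": int(f[4]),     # column E
                "diversity": int(f[5]), # column F (ASSUMPTION: integer)
                "rows": [],
            })
        elif clusters:
            clusters[-1]["rows"].append(f)
        else:
            raise ValueError("data line before the first cluster heading")
    return header, clusters
```

Each entry in rows keeps the raw fields of one data line, so column E (the organism name) is at index 4, column J (the word) at index 9, and so on.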
Notes:
- The word positions in columns G and H are approximate; exact coordinates can be determined by multiple alignment of all words of the cluster (to be implemented in future versions of FinDense).
- The word in column J is the genome region corresponding to the values in columns G, H, I; for the negative strand the reverse complement is taken. Since in the general case this word is the result of merging a group of words assigned to vertices of the source graph, the common intersection of those words is shown in uppercase letters, and the rest of the union in lowercase letters, in accordance with the configuration specified in the test example. Other output formats are also possible; see the seqmode parameter and the -q command line option for details.
- When the initial graph is built by BldGraph in the mode of indistinguishable DNA strands (strand = false), in rare cases the DNA strand indicator in column I cannot be recovered; this is indicated by the value 0. In such cases a word in column J may fail to match the other words of the cluster, and a warning about that appears in the log file (see above).
Downloadable files
See the corresponding section of the iHCE page.