BldGraph program
The program is part of the iHCE software package, which is aimed at identifying highly conserved elements (HCEs) in a set of genomes. BldGraph compacts the source graph built by the PairHits program, thus generating the initial multipartite graph in which each part corresponds to a genome.
The program is intended for parallel computing under MPI on a high-performance cluster running a 64-bit Windows or Linux system.
BldGraph is a command line utility written in C++. The command line syntax:
bldgraph [options] [config_file]
(Here, bldgraph is the program name in Linux; bldgraph64 is the variant for 64-bit Windows with MPICH2, bldgraph64ms is the variant with Microsoft MPI, and bldgraph64nompi is the variant without MPI.)
The program options can be specified in any of three formats: /x[ value], -x[ value], or --x…xx[=value]; some options take no value. Most options change the “usual” values of the parameters, which are set in the program configuration file or used by default (a value given in the command line has the highest priority).
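The three option formats can be illustrated with a small parser. This is only a minimal sketch in Python, not the parser used by BldGraph itself (which is written in C++). The function name parse_args is hypothetical; the short and long option names come from the option list in this document, and the handling of malformed input is an assumption.

```python
# Short-to-long option names, taken from the option list in this document.
SHORT_TO_LONG = {
    "?": "help", "c": "config", "d": "strand", "e": "statonly",
    "f": "fastapath", "h": "hits", "l": "length", "n": "nompi",
    "o": "log", "p": "portion", "r": "ratio", "s": "common", "u": "usedump",
}
FLAGS = {"help", "nompi"}  # options that take no value


def parse_args(argv):
    """Return (options, config_file) for a list of command line arguments."""
    options, config_file = {}, None
    i = 0
    while i < len(argv):
        arg = argv[i]
        i += 1
        if arg.startswith("--"):                      # --name[=value]
            name, _, value = arg[2:].partition("=")
            options[name] = True if name in FLAGS else value
        elif len(arg) == 2 and arg[0] in "-/" and arg[1] in SHORT_TO_LONG:
            name = SHORT_TO_LONG[arg[1]]              # -x[ value] or /x[ value]
            if name in FLAGS:
                options[name] = True
            elif i < len(argv):
                options[name] = argv[i]               # value is the next argument
                i += 1
            else:
                raise ValueError(f"option {arg} requires a value")
        elif config_file is None:
            config_file = arg                         # first non-option argument
    return options, config_file
```

In this sketch, all three spellings of an option end up under the same long name, so -l 80, /l 80 and --length=80 are treated alike.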
The recognized command line options and their default values
- -?
- --help
- Display brief help on command line options.
- -c filename
- --config=filename
- The configuration file name, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the config.ini file in the working directory.
- -d boolean
- --strand=boolean
- A Boolean value that determines whether the program distinguishes between DNA strands when merging overlapping candidate words. If a TRUE value is specified in any of the forms true, yes, 1, +, then only words on the same strand are merged. If a FALSE value is specified in any of the forms false, no, 0, -, then the program merges words irrespective of their strands. In the latter case, denser clusters are usually built, but some HCEs can include words bound to the wrong strand (in such cases, the program FinDense outputs a list of those HCEs). The default value is TRUE.
- -e boolean
- --statonly=boolean
- A Boolean value specified in the same way as for the -d option. If a TRUE value is specified, the program computes statistics of the source graph edges by its parts and terminates (the information-only mode). The default value is FALSE.
- -f name
- --fastapath=name
- The common prefix of the input file names, such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of the names is specified in the [species] section of the configuration file. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the fasta\ subdirectory of the working directory.
- -h filename
- --hits=filename
- Specifies the file that lists the result files produced by the PairHits program. The name may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. Each line of this text file contains the name and path of one result file; the file can be generated by OS commands such as dir hits\*.txt /b /l /on /s >hits\file.lst in Windows or ls -1 hits/*.txt >hits/file.lst in Linux. The default name is hits\file.lst.
- -l value
- --length=value
- The lower threshold of the word length (only words of greater length are considered). This option allows extra filtering by the length of candidate words, in addition to that already done by the -l option of PairHits. The default value is 60.
- -n
- --nompi
- If specified, the multiprocessor variant of the program is forced to operate in uniprocessor mode even in an MPI environment. On some systems without MPI the multiprocessor variant crashes, and this option does not help; the uniprocessor variant of the program should be used in such cases.
- -o filename
- --log=filename
- The name of the program log file, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The # character in the file name will be replaced by the CPU number, provided that separate logging of each MPI process is set in the configuration file. The default filename is bld_log#.txt.
- -p value
- --portion=value
- The number of records in the portion used for data exchange between parallel processes within the program. A larger portion gives higher performance but requires more buffer memory and increases the risk of deadlocks. If a deadlock occurs, we suggest decreasing the value from the default of 1000 to, e.g., 100.
- -r value
- --ratio=value
- The maximum allowed gzip compression ratio of the sought words; if a word compresses more than this, it is skipped. This option allows extra filtering by the complexity of candidate words, in addition to that already done by the -z option of PairHits. The value 0 means no check. The default value is 2.2.
- -s value
- --common=value
- The minimum length of the common part of several words to be merged into a single word during the compaction of the source graph. The default value equals 32.
- -u value
- --usedump=value
- The number of parts in the intermediate data (a dump). To execute BldGraph in a single run, specify a zero value. Sometimes it is more convenient to start the program on many CPUs (e.g., one per genome) but use fewer processors (or just 1) at the final stage. The results of the initial stage at each CPU are dumped separately to a special directory (set by the dump parameter in the configuration file), after which the program can be interrupted. It can then be run again on a different number of CPUs using the dumped data as input. For this purpose, the value must equal the number of processors used to write the dump. The default value is 0.
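The list file expected by the -h option can also be produced without shell commands. The following is a minimal Python sketch roughly equivalent to the ls command shown for that option; the function name write_hits_list is hypothetical, and the hits directory and file.lst name simply follow the defaults mentioned above.

```python
from pathlib import Path


def write_hits_list(hits_dir="hits", list_name="file.lst"):
    """Write one PairHits result file path per line, sorted by name."""
    hits = Path(hits_dir)
    # Collect the *.txt result files in the directory (no recursion).
    result_files = sorted(p for p in hits.glob("*.txt") if p.is_file())
    list_path = hits / list_name
    list_path.write_text("".join(f"{p}\n" for p in result_files))
    return list_path
```

The resulting file can then be passed to the program with the -h option or the hits configuration parameter.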
The first command line argument that is neither one of the above options nor an option value is treated as the name of the configuration file (as with the -c option).
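As an illustration of the syntax, the sketch below assembles a BldGraph command line in Python without running anything. The helper name bldgraph_command is hypothetical, and launching through mpiexec -n is an assumption about the local MPI installation rather than something this document prescribes.

```python
def bldgraph_command(config_file, program="bldgraph", processes=None, **options):
    """Build an argument list such as ['bldgraph', '--length=80', 'config.ini']."""
    command = []
    if processes:                       # assumption: MPI jobs start via mpiexec
        command += ["mpiexec", "-n", str(processes)]
    command.append(program)
    for name, value in options.items():
        if value is True:               # options without a value, e.g. --nompi
            command.append(f"--{name}")
        else:
            command.append(f"--{name}={value}")
    command.append(config_file)         # configuration file as the last argument
    return command


# Override two parameters of the configuration file from the command line.
filtered = bldgraph_command("config.ini", processes=4, length=80, strand="no")

# Reuse a dump that was written earlier by 8 processes, now on a single one.
final_stage = bldgraph_command("config.ini", processes=1, usedump=8)
```

The returned list can be passed to subprocess.run once the program name and launcher match the actual installation.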
The program configuration file
The configuration file is required; it is a text file with the traditional structure of such files (an example is included in the test examples for Windows and Linux). Lines of the file must not be wrapped; empty lines are ignored. A line that starts with a ; or # character is considered a comment (i.e., it is ignored). The part of a line starting with a double slash (//) is also considered a comment.
Meaningful lines of the file can be of two kinds: a section name, which has the format [section_name], and a parameter setting in the format parameter = value. In the latter format, at least one space or tab character must be used as a delimiter before and after the equals sign. The part of the file from a section name until the next section name or the end of the file is referred to as a configuration section. Parameters can appear in any order inside a section. The configuration file can include several sections in any order, with the exception of the [common] section, which must be the first one if present, and the [species] section, which must be the last one. BldGraph uses only the sections [common], [bldgraph], and [species] and ignores any other sections of the configuration file.
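The rules above can be sketched as a small reader for the parameter sections. This is a minimal Python illustration under stated assumptions, not the program's own parser: the function name parse_config is hypothetical, values are kept as strings, surrounding double quotes are stripped, and lines of the [species] section (which are not in parameter = value form) are simply skipped here.

```python
def parse_config(text):
    """Parse configuration text into {section: {parameter: value}}."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()      # drop a trailing // comment
        if not line or line[0] in ";#":           # empty line or full-line comment
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
            continue
        if current is None:
            continue                              # parameter outside any section
        parts = line.split(None, 2)               # "parameter = value"
        if len(parts) >= 2 and parts[1] == "=":   # whitespace around "=" required
            value = parts[2] if len(parts) == 3 else ""
            current[parts[0]] = value.strip('"')
    return sections
```

Note that this sketch treats any // as the start of a comment, so a value that itself contains a double slash would be truncated.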
The recognized configuration parameters in the [bldgraph] section
- splitlog
- Sets the logging mode for the parallel variant of the program. On the right-hand side, a Boolean value must be given as TRUE (possible forms are yes, true, 1, +) or FALSE (possible forms are no, false, 0, -). A TRUE value makes each MPI process generate a separate log file with the name specified by the parameters logname and logext or by the command line option -o (the latter has a higher priority). If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. The default value is TRUE.
- logname
- Specifies the log file name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. The default value is bldg_log. The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
- logext
- Specifies the extension of the log file name. If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. The default value is .txt. The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
- errname
- Specifies the error log file name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. The default value is bldg_err. If an empty value is specified, the error log will not be produced.
- errext
- Specifies the extension of the error log file name. If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. The default value is .log.
- fastapath
- The common prefix of the input file names, such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of the names is specified in the [species] section of the configuration file. If the value contains spaces, it must be enclosed in double quotes. The default is the fasta\ subdirectory of the working directory.
- hits
- Specifies the file that lists the result files produced by the PairHits program. The name may include a directory path. If the value contains spaces, it must be enclosed in double quotes. Each line of this text file contains the name and path of one result file; the file can be generated by OS commands such as dir hits\*.txt /b /l /on /s >hits\file.lst in Windows or ls -1 hits/*.txt >hits/file.lst in Linux. The default name is hits\file.lst. The value of this parameter can be changed by the -h command line option.
- common
- The minimum length of the common part of several words to be merged into a single word during the compaction of the source graph. The default value equals 32. The value of this parameter can be changed by the -s command line option.
- length
- The lower threshold of the word length (only words of greater length are considered). This parameter allows extra filtering by the length of candidate words, in addition to that already done by the -l option of PairHits. The default value is 60. The value of this parameter can be changed by the -l command line option.
- ratio
- The maximum allowed gzip compression ratio of the sought words; if a word compresses more than this, it is skipped. This parameter allows extra filtering by the complexity of candidate words, in addition to that already done by the -z option of PairHits. The value 0 means no check. The default value is 2.2. The value of this parameter can be changed by the -r command line option.
- strand
- A Boolean value that determines whether the program distinguishes between DNA strands when merging overlapping candidate words. If a TRUE value is specified in any of the forms true, yes, 1, +, then only words on the same strand are merged. If a FALSE value is specified in any of the forms false, no, 0, -, then the program merges words irrespective of their strands. In the latter case, denser clusters are usually built, but some HCEs can include words bound to the wrong strand (in such cases, the program FinDense outputs a list of those HCEs). The default value is TRUE. The value of this parameter can be changed by the -d command line option.
- dump
- A required parameter that specifies the directory path for writing intermediate data files. See also the usedump parameter and the -u command line option.
- usedump
- The number of parts in the intermediate data (a dump). To execute BldGraph in a single run, specify the zero value. Sometimes it is more convenient to start the program on many CPUs (e.g., one per genome) but use fewer processors (or just 1) at the final stage. The results of the initial stage at each CPU are dumped separately to a special directory (set by the dump parameter), after which the program can be interrupted. It can then be run again on a different number of CPUs using the dumped data as input. For this purpose, the value must equal the number of processors used to write the dump. The default value is 0. The value of this parameter can be changed by the -u command line option.
- statonly
- A Boolean value specified in the same way as for the strand parameter. If a TRUE value is specified, the program computes statistics of the source graph edges by its parts and terminates (the information-only mode). The default value is FALSE. The value of this parameter can be changed by the -e command line option.
- hubname
- A required parameter specifying the name (with an optional path) of the primary file of the initial graph, which is the program result. If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. There is no default value for this parameter, and it cannot be modified through the command line.
- starname
- A required parameter specifying the name (with an optional path) of the secondary files of the initial graph, which are the program result. The name must include the asterisk character (*), which will be replaced with the organism identifier. If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. There is no default value for this parameter, and it cannot be modified through the command line.
- hcecount
- An optional parameter specifying the directory path and name of the statistical data file, which contains the edge and vertex counts for each pair of genomes. If the parameter is absent or has an empty value, the statistical data are not computed.
- portion
- The number of records in the portion used for data exchange between parallel processes within the program. A larger portion gives higher performance but requires more buffer memory and increases the risk of deadlocks. If a deadlock occurs, we suggest decreasing the value from the default of 1000 to, e.g., 100.
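Several of the parameters above share two conventions: Boolean values have a fixed set of spellings, and a parameter missing from [bldgraph] falls back to [common] and then to a default. The sketch below illustrates both, together with the composition of the log file name. It is a hedged Python illustration only: the function names and the dictionary-of-dictionaries representation of the configuration are hypothetical, and case-insensitive matching of Boolean forms is an assumption.

```python
TRUE_FORMS = {"true", "yes", "1", "+"}
FALSE_FORMS = {"false", "no", "0", "-"}


def parse_bool(text):
    """Convert any of the documented Boolean spellings to a Python bool."""
    value = text.strip().lower()        # assumption: letter case is ignored
    if value in TRUE_FORMS:
        return True
    if value in FALSE_FORMS:
        return False
    raise ValueError(f"not a Boolean value: {text!r}")


def lookup(config, name, default):
    """Read a parameter from [bldgraph], then [common], then the default."""
    for section in ("bldgraph", "common"):
        if name in config.get(section, {}):
            return config[section][name]
    return default


def log_file_name(config, rank):
    """Compose the log file name for the MPI process with the given rank."""
    name = lookup(config, "logname", "bldg_log")
    ext = lookup(config, "logext", ".txt")
    if parse_bool(lookup(config, "splitlog", "true")):
        name += str(rank)               # zero-based MPI process number
    return name + ext
```

With an empty configuration, log_file_name({}, 0) gives bldg_log0.txt in this sketch, which follows from the documented defaults for logname, logext and splitlog.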
The list of input data in the [species] section
This section establishes a correspondence between the organism name, identifier, number and the data file names. Each line corresponds to an organism. The program ignores blank lines and lines starting with a ; or # character. Leading spaces or tab characters in the line are skipped; then at least 5 fields must follow, delimited by one or more tab characters. A field may include spaces. The fields contain the following data:
- the organism number (unique within the input data set);
- short denotation (identifier) of the organism, less than 36 characters in the current version;
- name of the organism;
- taxonomic code (ref. the diversity parameter of the FinDense program);
- file name of the full genome in FASTA or GenBank format (the whole genome must be in a single file). The path to the file is specified by the fastapath parameter in the configuration file or command line;
- (optional) file name of the genome annotation in GFF format; BldGraph does not use this field.
A template list of input data is provided in the test examples for Windows and Linux.
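The line format of the [species] section can be sketched as follows. This is a minimal Python illustration, not the program's own reader: the Species type and the function name parse_species_line are hypothetical, and treating the organism number as an integer is an assumption.

```python
from typing import NamedTuple, Optional


class Species(NamedTuple):
    number: int
    identifier: str
    name: str
    taxon_code: str
    genome_file: str
    annotation_file: Optional[str] = None   # GFF file, not used by BldGraph


def parse_species_line(line):
    """Return a Species record, or None for blank and comment lines."""
    stripped = line.lstrip(" \t").rstrip("\r\n")
    if not stripped.strip() or stripped[0] in ";#":
        return None
    # Fields are separated by one or more tabs; spaces inside a field are kept.
    fields = [f.strip() for f in stripped.split("\t") if f.strip()]
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    if len(fields[1]) >= 36:
        raise ValueError("identifier must be shorter than 36 characters")
    return Species(int(fields[0]), *fields[1:6])
```

The genome file name returned here is relative; the full path is obtained by prepending the fastapath prefix described earlier.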
Results
The results of BldGraph execution are the log file(s) (ref. the parameters splitlog, logname, logext), intermediate data files in the subdirectory specified by the dump parameter, and the initial graph in the form of primary and secondary files (their normal extensions are .hub and .star, respectively). The initial graph will be used by the FinDense program. We do not describe the structure of the output files, though their templates are included in the test examples.
Important information: Since some of these files contain numerical data in binary formats, their correct interpretation requires that the BldGraph and FinDense program runs be performed on systems with the same byte order, i.e., both big endian or both little endian. This restriction does not apply to the PairHits program, which may run on either kind of system irrespective of the subsequent stages of the algorithm.
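The byte order issue can be illustrated in a few lines. The sketch below is only a generic Python demonstration of why binary numbers written on one kind of system are misread on the other; it does not reflect the actual layout of the .hub and .star files, which is not described here, and the function name same_byte_order is hypothetical.

```python
import struct
import sys


def same_byte_order(recorded_order):
    """Check a recorded byte order ('little' or 'big') against this machine."""
    return recorded_order == sys.byteorder


# The same 32-bit integer has different byte layouts on the two kinds of system.
little = struct.pack("<I", 1)   # b'\x01\x00\x00\x00'
big = struct.pack(">I", 1)      # b'\x00\x00\x00\x01'

# Reading little-endian bytes as big-endian yields a different number.
misread = struct.unpack(">I", little)[0]   # 16777216 instead of 1
```

One practical use of such a check would be to note the value of sys.byteorder on the machine that ran BldGraph and compare it before running FinDense elsewhere.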
Downloadable files
See the corresponding section of the iHCE page.