
Laboratory of Mathematical Methods and Models in Bioinformatics,
Institute for Information Transmission Problems,
Russian Academy of Sciences

BldGraph program

The program is part of the iHCE software package, which identifies highly conserved elements (HCEs) in a set of genomes. BldGraph compacts the source graph built by the PairHits program and thus generates the initial multipartite graph, in which each part corresponds to a genome. The program is intended for parallel computing under MPI on a high-performance cluster running 64-bit Windows or Linux.
BldGraph is a command line utility written in C++. The command line syntax:

bldgraph [options] [config_file]

(Here, bldgraph is the program name in Linux; bldgraph64 is the variant for 64-bit Windows with MPICH2, bldgraph64ms is the variant with Microsoft MPI, and bldgraph64nompi is the variant without MPI.)

Program options can be specified in any of three formats: /x[ value], -x[ value], or --x...xx[=value]; some options take no value. Most options override the "usual" parameter values, which are set in the program configuration file or by default (a value given on the command line has the highest priority).

Recognizable command line options and default values
-?
--help
Display brief help on command line options.
-c filename
--config=filename
The configuration file name, which may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. The default is the file "config.ini" in the working directory.
-d boolean
--strand=boolean
A Boolean value that determines whether the program distinguishes between DNA strands when merging overlapping candidate words. If TRUE is specified (in any of the forms true yes 1 +), only words on the same strand are merged. If FALSE is specified (in any of the forms false no 0 -), the program merges words irrespective of their strands. In the latter case denser clusters are usually built, but some HCEs can include words bound to the wrong strand (in such cases, the FinDense program outputs a list of those HCEs). Default value is TRUE.
-e boolean
--statonly=boolean
A Boolean value, specified in the same way as for the -d option. If TRUE is specified, the program computes statistics of the source graph edges by its parts and terminates (the information-only mode). Default value is FALSE.
-f name
--fastapath=name
The common prefix of the input file names, such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of each name is specified in the [species] section of the configuration file. If the name contains blanks, the whole argument must be enclosed in double quotes. The default is the sub-directory "fasta\" of the working directory.
-h filename
--hits=filename
Specifies the file that lists the result files produced by the PairHits program. The name may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. Default name is "hits\file.lst". Each line of this text file contains the path and name of one result file; the list can be generated by OS commands such as dir hits\*.txt /b /l /on /s >hits\file.lst in Windows or ls -1 hits/*.txt >hits/file.lst in Linux.
-l value
--length=value
The lower threshold of the word length (only words of greater length are considered). This option allows extra filtering by the length of candidate words, in addition to that already made by the PairHits' -l option. Default value is 60.
-n
--nompi
If specified, the multiprocessor variant of the program is forced to operate in uniprocessor mode even in an MPI environment. On some systems without MPI the program crashes even with this option; the uniprocessor variant of the program should be used in such cases.
-o filename
--log=filename
The name of the program log file, which may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. The character '#' in the file name is replaced by the CPU number, provided that separate logging for each MPI process is enabled in the configuration file. Default filename is "bld_log#.txt".
-p value
--portion=value
The number of records in a portion used for data exchange between parallel processes within the program. A larger portion gives higher performance but requires more buffer memory and increases the risk of deadlocks. If a deadlock occurs, we suggest decreasing the value from the default of 1000 to, e.g., 100.
-r value
--ratio=value
The maximum allowed gzip compression ratio of the sought words; a word that compresses by more than this ratio is skipped. This option allows extra filtering by the complexity of candidate words, in addition to that already made by the -z option of PairHits. Value 0 means no check. Default value is 2.2.
-s value
--common=value
The minimum length of the common part of several words to be merged into a single word during the compaction of the source graph. Default value equals 32.
-u value
--usedump=value
The number of parts in the intermediate data (a dump). To execute BldGraph in a single run, specify zero. Sometimes it is more convenient to start the program on many CPUs (e.g. one per genome) but use fewer processors (or just one) at the final stage. The results of the initial stage on each CPU are dumped separately to a special directory (set by the dump parameter in the configuration file), after which the program can be interrupted. It can then be run again on a different number of CPUs using the dumped data as input. For this purpose, the value must equal the number of processors used to write the dump. Default value is 0.
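
The complexity filter controlled by the -r (--ratio) option can be illustrated with a short Python sketch. This page does not define how BldGraph computes the ratio, so the sketch assumes it is the uncompressed size divided by the size after DEFLATE compression (the algorithm behind gzip); the function names are hypothetical and the numbers it produces need not match those of the program.

```python
import zlib


def compression_ratio(word: str) -> float:
    """Uncompressed size divided by compressed size (an assumed definition)."""
    raw = word.encode("ascii")
    return len(raw) / len(zlib.compress(raw, 9))


def passes_ratio_filter(word: str, max_ratio: float = 2.2) -> bool:
    """Keep a word unless it compresses by more than max_ratio; 0 disables the check."""
    if max_ratio == 0:
        return True
    return compression_ratio(word) <= max_ratio


if __name__ == "__main__":
    repeat = "AT" * 100  # a low-complexity word that compresses very well
    print(round(compression_ratio(repeat), 1), passes_ratio_filter(repeat))
```

A simple repeat compresses far better than the default limit allows and is therefore skipped, while a value of 0 lets every word through.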

The first command line argument that is neither one of the options above nor an option value is treated as the name of the configuration file (as with option -c).
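
The command line rules above can be sketched in Python as follows. This is only an illustration, not the parser used by BldGraph: the function name is hypothetical, options are stored under their one-letter names, and the sketch assumes that an explicit -c takes precedence over a positional configuration file name.

```python
# Long option names and their one-letter forms, taken from the option list above.
LONG_TO_SHORT = {
    "help": "?", "config": "c", "strand": "d", "statonly": "e",
    "fastapath": "f", "hits": "h", "length": "l", "nompi": "n",
    "log": "o", "portion": "p", "ratio": "r", "common": "s", "usedump": "u",
}
FLAGS = {"?", "n"}  # options that take no value
KNOWN = set(LONG_TO_SHORT.values())


def parse_command_line(argv):
    """Return a dict of options keyed by their one-letter names."""
    options = {}
    positional = None
    i = 0
    while i < len(argv):
        arg = argv[i]
        i += 1
        if arg.startswith("--"):
            # Long form: --name or --name=value
            name, _, value = arg[2:].partition("=")
            if name not in LONG_TO_SHORT:
                raise ValueError(f"unknown option: {arg}")
            key = LONG_TO_SHORT[name]
            options[key] = True if key in FLAGS else value
        elif len(arg) == 2 and arg[0] in "-/" and arg[1] in KNOWN:
            # Short form: -x or /x, with the value in the next argument
            key = arg[1]
            if key in FLAGS:
                options[key] = True
            else:
                if i >= len(argv):
                    raise ValueError(f"option {arg} requires a value")
                options[key] = argv[i]
                i += 1
        elif positional is None:
            # First argument that is neither an option nor an option value
            positional = arg
    if positional is not None:
        options.setdefault("c", positional)  # assumed precedence: -c wins
    return options


if __name__ == "__main__":
    print(parse_command_line(["-l", "80", "--ratio=2.5", "/n", "my.ini"]))
```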

Program configuration file

The configuration file is required. It is a text file with the traditional structure of such files (an example is included in the test examples for Windows and Linux). A line cannot be continued on the next line; empty lines are ignored. A line that starts with the ';' or '#' character is treated as a comment (i.e., ignored). The part of a line starting with a double slash (//) is also treated as a comment.

Meaningful lines of the file are of two kinds: a section header in the format [section_name], and a parameter setting in the format parameter = value. In the latter format, at least one blank or tab character must separate the equals sign from the text on either side. The part of the file from a section header to the next section header or the end of the file is referred to as a configuration section. Parameters can appear in any order inside a section. The configuration file can include several sections in any order, except that the [common] section must be the first one if present and the [species] section must be the last one. BldGraph uses only the sections [common], [bldgraph], and [species], and ignores any other sections of the configuration file.
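
A minimal Python sketch of a reader for this file structure is given below. It is not the parser used by BldGraph. The function names are hypothetical, and several details are assumptions: leading blanks before a comment character are tolerated, surrounding double quotes are stripped from values, malformed lines are silently skipped, and the tab-delimited lines of the [species] section are left unparsed.

```python
import re

# "parameter = value": blanks or tabs are required around the equals sign;
# the value itself may be empty.
PARAM = re.compile(r"^(\S+)[ \t]+=(?:[ \t]+(.*))?$")


def parse_config(text):
    """Return {section_name: {parameter: value}} for the given file contents."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop a trailing // comment
        if not line or line[0] in ";#":       # empty line or comment line
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            sections[current] = {}
            continue
        if current is None or current == "species":
            continue                          # [species] has its own format
        match = PARAM.match(line)
        if match:
            name, value = match.group(1), match.group(2) or ""
            sections[current][name] = value.strip().strip('"')
    return sections


def lookup(sections, name, default=None):
    """Read a parameter from [bldgraph], then from [common], then the default."""
    for section in ("bldgraph", "common"):
        if name in sections.get(section, {}):
            return sections[section][name]
    return default
```

The lookup helper mirrors the fallback described for parameters such as logname: the [bldgraph] section is consulted first, then [common], and only then the default.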

Recognizable configuration parameters in section [bldgraph]
splitlog Sets the logging mode for the parallel variant of the program. The value must be a Boolean: TRUE (possible forms are yes true 1 +) or FALSE (possible forms are no false 0 -). TRUE makes each MPI process write a separate log file with the name specified by the parameters logname and logext or by the command line option -o (the latter has higher priority). If the section [bldgraph] lacks this parameter, the setting from the section [common] is used. Default value is TRUE.
logname Specifies the log file name (without extension), which may include a directory path. If the value contains blanks, it must be enclosed in double quotes. If splitlog = true is specified, the zero-based number of the MPI process is appended to the name. If the section [bldgraph] lacks this parameter, the setting from the section [common] is used. Default value is "bldg_log". The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
logext Specifies the extension of the log file name. If the section [bldgraph] lacks this parameter, a setting from the section [common] will be used. Default value is ".txt". The value of this parameter can be changed by -o command line option, which modifies the three parameters splitlog, logname, logext at once.
errname Specifies the error log file name (without extension), which may include a directory path. If the value contains blanks, it must be enclosed in double quotes. If splitlog = true is specified, the zero-based number of the MPI process is appended to the name. If the section [bldgraph] lacks this parameter, the setting from the section [common] is used. Default value is "bldg_err". If an empty value is specified, the error log is not produced.
errext Specifies the extension of the error log file name. If the section [bldgraph] lacks this parameter, a setting from the section [common] will be used. Default value is ".log".
fastapath The common prefix of the input file names, such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of each name is specified in the [species] section of the configuration file. If the value contains blanks, it must be enclosed in double quotes. The default is the sub-directory "fasta\" of the working directory. The value of this parameter can be changed by the -f command line option.
hits Specifies the file that lists the result files produced by the PairHits program. The name may include a directory path. If the value contains blanks, it must be enclosed in double quotes. Each line of this text file contains the path and name of one result file; the list can be generated by OS commands such as dir hits\*.txt /b /l /on /s >hits\file.lst in Windows or ls -1 hits/*.txt >hits/file.lst in Linux. Default name is "hits\file.lst". The value of this parameter can be changed by the -h command line option.
common The minimum length of the common part of several words to be merged into a single word during the compaction of the source graph. Default value equals 32. The value of this parameter can be changed by -s command line option.
length The lower threshold of the word length (only words of greater length are considered). This option allows extra filtering by the length of candidate words, in addition to that already made by the PairHits' -l option. Default value is 60. The value of this parameter can be changed by -l command line option.
ratio The maximum allowed gzip compression ratio of the sought words; a word that compresses by more than this ratio is skipped. This parameter allows extra filtering by the complexity of candidate words, in addition to that already made by the -z option of PairHits. Value 0 means no check. Default value is 2.2. The value of this parameter can be changed by the -r command line option.
strand A Boolean value that determines whether the program distinguishes between DNA strands when merging overlapping candidate words. If TRUE is specified (in any of the forms true yes 1 +), only words on the same strand are merged. If FALSE is specified (in any of the forms false no 0 -), the program merges words irrespective of their strands. In the latter case denser clusters are usually built, but some HCEs can include words bound to the wrong strand (in such cases, the FinDense program outputs a list of those HCEs). Default value is TRUE. The value of this parameter can be changed by the -d command line option.
dump The required parameter, which specifies the directory path for writing intermediate data files. See also usedump parameter and command line option -u.
usedump The number of parts in the intermediate data (a dump). To execute BldGraph in a single run, specify zero. Sometimes it is more convenient to start the program on many CPUs (e.g. one per genome) but use fewer processors (or just one) at the final stage. The results of the initial stage on each CPU are dumped separately to a special directory (set by the dump parameter), after which the program can be interrupted. It can then be run again on a different number of CPUs using the dumped data as input. For this purpose, the value must equal the number of processors used to write the dump. Default value is 0. The value of this parameter can be changed by the -u command line option.
statonly A Boolean value, specified in the same way as for the strand parameter. If TRUE is specified, the program computes statistics of the source graph edges by its parts and terminates (the information-only mode). Default value is FALSE. The value of this parameter can be changed by the -e command line option.
hubname A required parameter specifying the name (with optional path) of the primary file of the initial graph, which is the program result. If the section [bldgraph] lacks this parameter, the setting from the section [common] is used. This parameter has no default value and cannot be modified through the command line.
starname A required parameter specifying the name (with optional path) of the secondary files of the initial graph, which are the program result. The name must include the asterisk (*) character, which is replaced with the organism identifier. If the section [bldgraph] lacks this parameter, the setting from the section [common] is used. This parameter has no default value and cannot be modified through the command line.
hcecount An optional parameter specifying the directory path and name of the statistical data file, which contains the edge and vertex counts for each pair of genomes. If the parameter is absent or has an empty value, the statistical data are not computed.
portion The number of records in a portion used for data exchange between parallel processes within the program. A larger portion gives higher performance but requires more buffer memory and increases the risk of deadlocks. If a deadlock occurs, we suggest decreasing the value from the default of 1000 to, e.g., 100. The value of this parameter can be changed by the -p command line option.
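
The Boolean forms and the log file naming described for splitlog, logname and logext can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that Boolean forms are case-insensitive and that the process number is inserted directly between the name and the extension; neither detail is stated on this page.

```python
TRUE_FORMS = {"true", "yes", "1", "+"}
FALSE_FORMS = {"false", "no", "0", "-"}


def parse_bool(text: str) -> bool:
    """Interpret a Boolean parameter value in any of the listed forms."""
    token = text.strip().lower()  # case-insensitivity is an assumption
    if token in TRUE_FORMS:
        return True
    if token in FALSE_FORMS:
        return False
    raise ValueError(f"not a Boolean value: {text!r}")


def log_file_name(rank: int, splitlog: bool = True,
                  logname: str = "bldg_log", logext: str = ".txt") -> str:
    """Build the log file name for the MPI process with the given zero-based number."""
    suffix = str(rank) if splitlog else ""
    return f"{logname}{suffix}{logext}"


if __name__ == "__main__":
    print(log_file_name(3))                       # bldg_log3.txt
    print(log_file_name(3, parse_bool("no")))     # bldg_log.txt
```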

The list of input data in section [species]

This section establishes the correspondence between the organism name, identifier, number, and data file names. Each line corresponds to one organism. The program ignores blank lines and lines starting with the character ';' or '#'. Leading blanks or tab characters in a line are skipped; then at least five fields must follow, delimited by one or more tab characters. A field may include blanks. The fields contain the following data:

  1. the organism number (unique within the input data set);
  2. short denotation (identifier) of the organism, less than 36 characters in the current version;
  3. name of the organism;
  4. taxonomic code (see the diversity parameter of the FinDense program);
  5. file name of the full genome in FASTA or GenBank format (however, the whole genome must be in a single file). A path to the file is specified by fastapath parameter in the configuration file or command line.
  6. (optional) file name of the genome annotation in GFF format; BldGraph does not use this field.
A template list of input data is provided in the test examples for Windows and Linux.
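
The field layout of a [species] line described above can be illustrated with a small Python sketch. The type and function names are hypothetical, the sample organism in the usage lines is invented for illustration, and raising an error on a malformed line is an assumption about reasonable behaviour rather than a description of what BldGraph does.

```python
import re
from typing import NamedTuple, Optional


class Species(NamedTuple):
    number: int
    identifier: str
    name: str
    taxon_code: str
    genome_file: str
    annotation_file: Optional[str] = None  # optional sixth field, unused by BldGraph


def parse_species_line(line: str) -> Optional[Species]:
    """Parse one line of the [species] section; return None for ignored lines."""
    stripped = line.strip()
    if not stripped or stripped[0] in ";#":
        return None                         # blank line or comment
    fields = re.split(r"\t+", stripped)     # one or more tabs between fields
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 tab-delimited fields: {line!r}")
    if len(fields[1]) >= 36:
        raise ValueError(f"identifier must be shorter than 36 characters: {fields[1]!r}")
    annotation = fields[5] if len(fields) > 5 else None
    return Species(int(fields[0]), fields[1], fields[2], fields[3], fields[4], annotation)


if __name__ == "__main__":
    print(parse_species_line("  1\tEco\tEscherichia coli\t\tB\teco.fasta"))
```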

Results

The results of BldGraph execution are the log file(s) (see parameters splitlog, logname, logext), the intermediate data files in the directory specified by the dump parameter, and the initial graph in the form of primary and secondary files (their usual extensions are .hub and .star, respectively). The initial graph is used by the FinDense program. We do not describe the structure of the output files, though their templates are included in the test examples.
Important: Since some of these files contain numerical data in binary formats, BldGraph and FinDense must be run on systems with the same byte order (both big-endian or both little-endian) for the files to be interpreted correctly. Note that this does not concern the PairHits program, which may run on either kind of system irrespective of the subsequent stages of the algorithm.
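
The effect of a byte order mismatch can be demonstrated with a short Python sketch. The layout of the .hub and .star files is not described here, so the 32-bit unsigned integer below is only an illustrative assumption, and the function name is hypothetical.

```python
import struct
import sys


def read_uint32(data: bytes, byteorder: str) -> int:
    """Interpret 4 bytes as an unsigned integer in the given byte order."""
    fmt = "<I" if byteorder == "little" else ">I"
    return struct.unpack(fmt, data)[0]


if __name__ == "__main__":
    written = struct.pack("<I", 1)         # the value 1 as a little-endian system stores it
    print(sys.byteorder)                   # byte order of the current machine
    print(read_uint32(written, "little"))  # 1
    print(read_uint32(written, "big"))     # 16777216
```

The same four bytes yield a different number when read in the other byte order, which is why both programs have to agree on it.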
