BldGraph program

The program is a part of the iHCE software package aimed at the identification of highly conserved elements (HCEs) in a set of genomes. BldGraph compacts the source graph built by the PairHits program, thus generating the initial multipartite graph with each part corresponding to a genome. The program is intended for parallel computing under MPI on a high-performance cluster running a 64-bit Windows or Linux system.

BldGraph is a command line utility written in C++. The command line syntax:

bldgraph [options] [config_file]

(Here, bldgraph is the program name in Linux; bldgraph64 is the variant for 64-bit Windows with MPICH2, bldgraph64ms is the variant with Microsoft MPI, and bldgraph64nompi is the variant without MPI.)

The program options can be specified in any of three formats: /x[ value], -x[ value], --x…xx[=value]; some options take no value. Most options change the “usual” values of parameters that are set in the program configuration file or used by default (a value given on the command line has the highest priority).
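
As an illustration of the three spellings, the following Python sketch splits one command line token into an option name and an inline value. The function name split_option is hypothetical and the sketch does not reproduce BldGraph's own parser; in particular, for the /x and -x forms the value, if any, is the next token and is not handled here.

```python
def split_option(token):
    """Split one command line token into (name, inline_value).

    Returns None if the token does not look like an option.
    Hypothetical helper; not BldGraph's actual parser.
    """
    if token.startswith("--"):
        # Long form: --name or --name=value
        name, sep, value = token[2:].partition("=")
        return (name, value if sep else None)
    if len(token) == 2 and token[0] in "-/":
        # Short forms: -x or /x; a value, if any, is the next token
        return (token[1], None)
    return None


print(split_option("--ratio=2.5"))   # ('ratio', '2.5')
print(split_option("-l"))            # ('l', None)
print(split_option("/l"))            # ('l', None)
print(split_option("config.ini"))    # None
```

A token that is not recognized as an option (the last case) would be a candidate for the configuration file name.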

The recognized command line options and their default values

-?
--help
Display brief help on command line options.
-c filename
--config=filename
The configuration file name, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the config.ini file in the working directory.
-d boolean
--strand=boolean
A Boolean value that determines whether the program distinguishes between DNA strands when merging overlapping candidate words. If TRUE is specified in any of the forms true, yes, 1, +, then only words on the same strand are merged. If FALSE is specified in any of the forms false, no, 0, -, then the program merges words irrespective of their strands. In the latter case, denser clusters are usually built, but some HCEs can include words bound to the wrong strand (in such cases, the FinDense program outputs a list of those HCEs). The default value is TRUE.
-e boolean
--statonly=boolean
A Boolean value, specified in the same way as for the -d option. If TRUE is specified, the program computes statistics of the source graph edges by its parts and terminates (the information-only mode). The default value is FALSE.
-f name
--fastapath=name
The common prefix of input file names such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of the names is specified in the [species] section of the configuration file. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the fasta\ sub-directory of the working directory.
-h filename
--hits=filename
Specifies the file that lists the result files produced by the PairHits program. The name may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. Each line of this text file contains the path and name of one result file; the list can be generated by OS commands such as dir hits\*.txt /b /l /on /s >hits\file.lst in Windows or ls -1 hits/*.txt >hits/file.lst in Linux. The default name is hits\file.lst.
-l value
--length=value
The lower threshold of the word length (only words of greater length are considered). This option allows extra filtering by the length of candidate words, in addition to that already done by the -l option of PairHits. The default value is 60.
-n
--nompi
If specified, the multiprocessor variant of the program is forced to operate in uniprocessor mode even in an MPI environment. This option does not help on some systems without MPI, where the program crashes; the uniprocessor variant of the program should be used in such cases.
-o filename
--log=filename
The name of the program log file, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The # character in the file name will be replaced by the CPU number, provided that separate logging of each MPI process is set in the configuration file. The default filename is bld_log#.txt.
-p value
--portion=value
The number of records in the portion used for data exchange between parallel processes within the program. A larger portion gives higher performance but requires more buffer memory and increases the risk of deadlocks. If a deadlock occurs, we suggest decreasing the value from the default of 1000 to, e.g., 100.
-r value
--ratio=value
The maximum allowed gzip compression ratio of the sought words; if a word is compressed more, it is skipped. This option allows extra filtering by the complexity of candidate words, in addition to that already done by the -z option of PairHits. The value 0 means no check. The default value is 2.2.
-s value
--common=value
The minimum length of the common part of several words to be merged into a single word during the compaction of the source graph. The default value equals 32.
-u value
--usedump=value
The number of parts in the intermediate data (a dump). To execute BldGraph in a single run, specify zero. Sometimes it is more convenient to start the program on many CPUs (e.g., one per genome) but to use fewer processors (or just 1) at the final stage. The results of the initial stage at each CPU are dumped separately to a special directory (set by the dump parameter in the configuration file), after which the program can be interrupted. It can then be run again on a different number of CPUs using the dumped data as input. For this purpose, the value must be equal to the number of processors used to write the dump. The default value is 0.
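
The complexity filter of the -r option can be illustrated with a short Python sketch. It assumes that the compression ratio is the original length divided by the compressed length and uses zlib's DEFLATE as a stand-in for gzip; BldGraph's exact definition (for example, whether header bytes are counted) is not given here, so the numbers are only indicative. The names compression_ratio and passes_ratio_filter are hypothetical.

```python
import zlib


def compression_ratio(word):
    """Original size divided by compressed size (assumed definition)."""
    raw = word.encode("ascii")
    return len(raw) / len(zlib.compress(raw, 9))


def passes_ratio_filter(word, max_ratio=2.2):
    """Keep a word unless it compresses more than max_ratio; 0 disables the check."""
    if max_ratio == 0:
        return True
    return compression_ratio(word) <= max_ratio


low_complexity = "A" * 200          # a homopolymer run compresses very well
print(compression_ratio(low_complexity) > 2.2)            # True
print(passes_ratio_filter(low_complexity))                # False
print(passes_ratio_filter(low_complexity, max_ratio=0))   # True
```

The point of such a filter is that repetitive, low-complexity words compress strongly and are therefore skipped, while a value of 0 switches the check off.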

The first command line argument that is neither one of the above options nor an option value is treated as the name of the configuration file (as with the -c option).

The program configuration file

The configuration file is required; it is a text file with the traditional structure of such files (an example is included in the test examples for Windows and Linux). Lines of the file must not be wrapped; empty lines are ignored. A line that starts with a ; or # character is considered a comment (i.e., ignored). The part of a line starting with a double slash (//) is also considered a comment.

Meaningful lines of the file are of two kinds: a section name in the format [section_name] and a parameter setting in the format parameter = value. In the latter format, at least one space or tab character must be used as a delimiter before and after the equality sign. The part of the file from a section name to the next section name or the end of the file is referred to as a configuration section. Parameters can appear in any order inside a section. The configuration file can include several sections in any order, except that the [common] section, if present, must be the first one and the [species] section must be the last one. BldGraph uses only the sections [common], [bldgraph], and [species] and ignores any other sections of the configuration file.
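
A minimal Python sketch of a reader for this file layout is shown below. It follows the rules stated above (comment lines starting with ; or #, trailing // comments, [section] headers, and parameter = value with whitespace around the equality sign). The function name read_config and the sample values are hypothetical, and details such as a // sequence inside a quoted value are deliberately ignored; this is not BldGraph's own parser.

```python
import re


def read_config(text):
    """Parse configuration text into {section: {parameter: value}}."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in ";#":
            continue                          # empty line or whole-line comment
        line = line.split("//", 1)[0].strip() # drop a trailing // comment
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
            continue
        # parameter = value: whitespace is required on both sides of '='
        match = re.match(r"(\S+)\s+=(?:\s+(.*))?$", line)
        if match and current is not None:
            current[match.group(1)] = (match.group(2) or "").strip('"')
    return sections


# Hypothetical sample; real examples ship with the test data.
sample = """\
[common]
splitlog = yes
[bldgraph]
; whole-line comment
length = 60   // lower threshold of the word length
dump = "dump dir"
"""
config = read_config(sample)
print(config["bldgraph"]["length"])   # 60
print(config["bldgraph"]["dump"])     # dump dir
```

All values are kept as strings here; converting them to numbers or Boolean values would be a separate step.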

The recognized configuration parameters in the [bldgraph] section

splitlog
Sets the logging mode for the parallel variant of the program. The right-hand side must be a Boolean value given as TRUE (possible forms are yes, true, 1, +) or FALSE (possible forms are no, false, 0, -). A TRUE value makes each MPI process generate a separate log file with the name specified by the parameters logname and logext or by the command line option -o (the latter has a higher priority). If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. The default value is TRUE.
logname
Specifies the log file name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. The default value is bldg_log. The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
logext
Specifies the extension of the log file name. If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. The default value is .txt. The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
errname
Specifies the error log file name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. The default value is bldg_err. If an empty value is specified, the error log will not be produced.
errext
Specifies the extension of the error log file name. If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. The default value is .log.
fastapath
The common prefix of the input file names such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of the names is specified in the [species] section of the configuration file. If the value contains spaces, it must be enclosed in double quotes. The default is the fasta\ subdirectory of the working directory.
hits
Specifies the file that lists the result files produced by the PairHits program. The name may include a directory path. If the value contains spaces, it must be enclosed in double quotes. Each line of this text file contains the path and name of one result file; the list can be generated by OS commands such as dir hits\*.txt /b /l /on /s >hits\file.lst in Windows or ls -1 hits/*.txt >hits/file.lst in Linux. The default name is hits\file.lst. The value of this parameter can be changed by the -h command line option.
common
The minimum length of the common part of several words to be merged into a single word during the compaction of the source graph. The default value equals 32. The value of this parameter can be changed by the -s command line option.
length
The lower threshold of the word length (only words of greater length are considered). This parameter allows extra filtering by the length of candidate words, in addition to that already done by the -l option of PairHits. The default value is 60. The value of this parameter can be changed by the -l command line option.
ratio
The maximum allowed gzip compression ratio of the sought words; if a word is compressed more, it is skipped. This parameter allows extra filtering by the complexity of candidate words, in addition to that already done by the -z option of PairHits. The value 0 means no check. The default value is 2.2. The value of this parameter can be changed by the -r command line option.
strand
A Boolean value that determines whether the program distinguishes between DNA strands when merging overlapping candidate words. If TRUE is specified in any of the forms true, yes, 1, +, then only words on the same strand are merged. If FALSE is specified in any of the forms false, no, 0, -, then the program merges words irrespective of their strands. In the latter case, denser clusters are usually built, but some HCEs can include words bound to the wrong strand (in such cases, the FinDense program outputs a list of those HCEs). The default value is TRUE. The value of this parameter can be changed by the -d command line option.
dump
The required parameter, which specifies the directory path for writing intermediate data files. See also the usedump parameter and the -u command line option.
usedump
The number of parts in the intermediate data (a dump). To execute BldGraph in a single run, specify zero. Sometimes it is more convenient to start the program on many CPUs (e.g., one per genome) but to use fewer processors (or just 1) at the final stage. The results of the initial stage at each CPU are dumped separately to a special directory (set by the dump parameter), after which the program can be interrupted. It can then be run again on a different number of CPUs using the dumped data as input. For this purpose, the value must be equal to the number of processors used to write the dump. The default value is 0. The value of this parameter can be changed by the -u command line option.
statonly
A Boolean value, specified in the same way as for the strand parameter. If TRUE is specified, the program computes statistics of the source graph edges by its parts and terminates (the information-only mode). The default value is FALSE. The value of this parameter can be changed by the -e command line option.
hubname
A required parameter specifying the name (with optional path) of the initial graph primary file, which is the program result. If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. There is no default value for this parameter, and it cannot be modified through the command line.
starname
A required parameter specifying the name (with optional path) of the initial graph secondary files, which are the program result. The name must include the asterisk character (*), which will be replaced with the organism identifier. If the [bldgraph] section lacks this parameter, the setting from the [common] section will be used. There is no default value for this parameter, and it cannot be modified through the command line.
hcecount
An optional parameter specifying the directory path and name of the statistical data file that contains the edge and vertex counts for each pair of genomes. If the parameter is absent or has an empty value, the statistical data are not computed.
portion
The number of records in the portion used for data exchange between parallel processes within the program. A larger portion gives higher performance but requires more buffer memory and increases the risk of deadlocks. If a deadlock occurs, we suggest decreasing the value from the default of 1000 to, e.g., 100.
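
The Boolean forms and the fallback from the [bldgraph] section to the [common] section can be sketched in Python as follows. The helper names parse_bool and lookup are hypothetical, the config dictionary is a made-up stand-in for a parsed configuration file, and case-insensitive matching is an assumption. Note also that this document states the [common] fallback only for some parameters (such as splitlog, logname, hubname), whereas the sketch applies it uniformly for brevity.

```python
TRUE_FORMS = {"true", "yes", "1", "+"}
FALSE_FORMS = {"false", "no", "0", "-"}


def parse_bool(text):
    """Convert one of the documented Boolean spellings to bool."""
    value = text.strip().lower()      # case-insensitivity is an assumption
    if value in TRUE_FORMS:
        return True
    if value in FALSE_FORMS:
        return False
    raise ValueError(f"not a Boolean value: {text!r}")


def lookup(config, name, default):
    """Take a parameter from [bldgraph], then [common], then the default."""
    for section in ("bldgraph", "common"):
        if name in config.get(section, {}):
            return config[section][name]
    return default


# Hypothetical parsed configuration
config = {"common": {"splitlog": "no"}, "bldgraph": {"strand": "+"}}
print(parse_bool(lookup(config, "splitlog", "true")))    # False
print(parse_bool(lookup(config, "strand", "true")))      # True
print(parse_bool(lookup(config, "statonly", "false")))   # False
```

A command line value, when given, would simply replace the result of lookup, since the command line has the highest priority.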

The list of input data in the [species] section

This section establishes a correspondence between the organism name, identifier, number and the data file names. Each line corresponds to an organism. The program ignores blank lines and lines starting with a ; or # character. Leading spaces or tab characters in the line are skipped; then at least 5 fields must follow, delimited by one or more tab characters. A field may include spaces. The fields contain the following data:

  1. the organism number (unique within the input data set);
  2. short denotation (identifier) of the organism, less than 36 characters in the current version;
  3. name of the organism;
  4. taxonomic code (see the diversity parameter of the FinDense program);
  5. file name of the full genome in FASTA or GenBank format (however, the whole genome must be in a single file). A path to the file is specified by the fastapath parameter in the configuration file or command line;
  6. (optional) file name of the genome annotation in GFF format; BldGraph does not use this field.
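
As a sketch of the line format just described, the Python function below splits one [species] line into its fields. The function name parse_species_line, the dictionary keys, and the sample organism line are hypothetical; only the rules stated above (leading whitespace skipped, tab-delimited fields, at least five fields, optional sixth) are taken from this document.

```python
import re


def parse_species_line(line):
    """Split one [species] line into named fields, or return None for
    blank lines and comments."""
    line = line.strip()
    if not line or line[0] in ";#":
        return None
    fields = re.split(r"\t+", line)       # one or more tabs delimit fields
    if len(fields) < 5:
        raise ValueError("at least 5 tab-delimited fields are required")
    return {
        "number": int(fields[0]),
        "identifier": fields[1],
        "name": fields[2],
        "taxonomy": fields[3],
        "genome_file": fields[4],
        "annotation_file": fields[5] if len(fields) > 5 else None,
    }


# Hypothetical sample line; real lists come with the test examples.
record = parse_species_line("  1\tsp1\tExample organism\tA.1\tsp1.fasta")
print(record["name"])             # Example organism
print(record["annotation_file"])  # None
```

Because the delimiter is the tab character, a field such as the organism name may contain spaces without any quoting.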

A template list of input data is provided in the test examples for Windows and Linux.

Results

The results of BldGraph execution are the log file(s) (see the parameters splitlog, logname, logext), the intermediate data files in the subdirectory specified by the dump parameter, and the initial graph in the form of primary and secondary files (their normal extensions are .hub and .star, respectively). The initial graph will be used by the FinDense program. We do not describe the structure of the output files, though their templates are included in the test examples.
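
The way the log file name is put together from logname, logext and the process number can be sketched in Python as follows. The function log_file_name is hypothetical; it assumes that the zero-based MPI process number is appended without padding, which this document does not specify.

```python
def log_file_name(logname="bldg_log", logext=".txt", splitlog=True, rank=0):
    """Compose a log file name; rank is the zero-based MPI process number."""
    number = str(rank) if splitlog else ""   # no zero padding (assumption)
    return f"{logname}{number}{logext}"


print(log_file_name(rank=3))            # bldg_log3.txt
print(log_file_name(splitlog=False))    # bldg_log.txt
```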

Important information: Since some of these files contain numerical data in binary formats, their correct interpretation requires that BldGraph and FinDense be run on systems with the same byte order, i.e., both big-endian or both little-endian. This does not concern the PairHits program, which may run on either kind of system irrespective of the subsequent stages of the algorithm.
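
To see which byte order a system uses, and why binary data written on one kind of system is misread on the other, consider this short Python sketch. It is not part of iHCE and does not reflect the actual layout of the .hub or .star files; the 32-bit integer is only an example value.

```python
import struct
import sys

print(sys.byteorder)                # 'little' or 'big', depending on the machine

value = 1
little = struct.pack("<I", value)   # explicit little-endian, 32-bit unsigned
big = struct.pack(">I", value)      # explicit big-endian, 32-bit unsigned

print(little.hex())                 # 01000000
print(big.hex())                    # 00000001

# Reading little-endian bytes as big-endian yields a different number.
misread = struct.unpack(">I", little)[0]
print(misread)                      # 16777216
```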

Downloadable files

See the corresponding section of the iHCE page.