
Laboratory of Mathematical Methods and Models in Bioinformatics,
Institute for Information Transmission Problems,
Russian Academy of Sciences


PairHits program

The program is part of the iHCE software package, which identifies highly conserved elements (HCEs) in a set of genomes. It finds all pairs of approximately matching words in two sequences from different genomes, thus forming the edges of the source graph. The program is intended for parallel computing under MPI on a high-performance cluster running a 64-bit Windows or Linux system.
PairHits is a command line utility written in C++. The command line syntax:

pairhits [options] [config_file]

(Here, pairhits is the program name in Linux; pairhits64 is a variant for Windows 64-bit with MPICH2, pairhits64ms -- with Microsoft MPI, and pairhits64nompi -- without MPI.)

The program options can be specified in any of three formats: /x[ value] -x[ value] --x...xx[=value]; some options take no value. Most options override the "usual" parameter values, which are set in the program configuration file or by default (a value given on the command line has the highest priority).

Recognizable command line options and default values
-?
--help
Display brief help on command line options.
-a
--append
Append results to the end of output file if it exists (see also option -w). Default setting is as per configuration file; if not specified there, the file will be overwritten.
-b value
--belt=value
The maximum allowed number of consecutive deletions, and the maximum total number of deletions when there are no insertions (0 means that deletions are not permitted). Value -1 means that the number of deletions is unlimited. Default value is 2.
-c filename
--config=filename
The configuration file name, which may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. Default is "config.ini" file in the working directory.
-d value
--score_del=value
The penalty for each deletion (the positive cost of a letter insertion or deletion). The value is rounded to one decimal place. Default value is 2.1.
-e value
--minletter=value
A two-digit number. The most-significant digit is the minimum allowed number of different letters in a sought-for word (4 or less), and the least-significant digit is the same minimum for a search key. Zero value means no requirement for different letters. Default value is 43.
-f name
--fastapath=name
The common prefix of input file names, such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of each name is specified in the [species] section of the configuration file. If the name contains blanks, the whole argument must be enclosed in double quotes. Default is the sub-directory "fasta\" of the working directory.
-i value
--score_mis=value
Specifies the penalty for each mismatch (the positive cost of a letter substitution). The value is rounded to one decimal place. Default value is 1.
-j filename
--job=filename
The name of the job file(s) for the program, which may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. The character '#' in a file name stands for a serial number, provided that all job files are numbered in succession starting from a certain number (0 by default, but can be modified by -t option), and each file number consists of the same number of digits, left-padded with zeroes as appropriate. The number of digits is specified by -y option or determined automatically. Default filename is "jobs\#.txt".
-k value
--key=value
The minimum length of the exactly matching portion of sought-for words. Default value is 16.
-l value
--length=value
The lower threshold of the word length (only words of greater length are considered). Default value is 60.
-n
--nompi
If specified, the multiprocessor variant of the program is forced to run in uniprocessor mode even in an MPI environment. This option does not help on some systems without MPI, where the program crashes; use the uniprocessor variant of the program in such cases.
-o filename
--log=filename
The name of the program log file, which may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. Character '#' in the file name will be replaced by the CPU number, provided that separate logging of each MPI process is set in the configuration file. Default filename is "phs_log#.txt".
-p value
--step=value
The step of selecting a key from the "first" sequence when building a hash table. Default value is 1.
-q value
--frequency=value
The maximum number of occurrences of a key in the "first" sequence. Keys that occur more often are skipped. Zero value means that all keys are considered irrespective of their frequency. Default value is 200.
-r filename
--result=filename
The name of the result file(s), which may include a directory path. If the name contains blanks, the whole argument must be enclosed in double quotes. Character '#' in the file name points where the job number will appear (ref. -j option). Default filename is "results\#.txt".
-s value
--maxscore=value
The maximum allowed total penalty for the words mismatch. The value is rounded to one decimal place. Default value is 17.5.
-t value
--start=value
The number of the job file from which the program starts (see also options -j, -u). Default value is 0.
-u value
--use=value
The maximum number of job files the program will try to process (see also options -j, -t); limited to 100,000 in the current version. Default value is 10,000.
-w
--rewrite
Overwrite existing output file(s), see also option -a. Default setting is as per configuration file; if not specified there, the existing file(s) will be overwritten.
-y value
--width=value
The number of digits in the job file number, not greater than 5 in the current version (see also option -j). Default value 0 means that the number of digits is determined automatically (which is not possible in some cases; see the description of the configuration parameter width below).
-z value
--ratio=value
The maximum allowed gzip compression ratio of sought-for words; a word that compresses more strongly is skipped. Value 0 means no check. Default value is 2.2.

The first command line argument that is neither one of the above options nor an option value is taken as the name of the configuration file (as with option -c).
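
As an illustration of how the '#' placeholder in the -j and -r options can be read, here is a minimal Python sketch that expands a pattern into file names. The function name expand_pattern and its arguments are hypothetical and are not part of PairHits; the sketch assumes an explicit width (as set by -y) and does not model the automatic width detection.

```python
def expand_pattern(pattern, number, width):
    """Replace the first '#' in pattern with number, zero-padded to width digits.

    Hypothetical helper: mirrors the documented meaning of '#' in -j / -r,
    with the width given explicitly (1..5, as for the -y option).
    """
    if not 1 <= width <= 5:
        raise ValueError("width must be between 1 and 5")
    return pattern.replace("#", str(number).zfill(width), 1)


# Job files 0, 1, 2 with a three-digit number (forward slashes used for brevity).
job_names = [expand_pattern("jobs/#.txt", n, 3) for n in range(3)]
print(job_names)
```

The same expansion applied to a result pattern such as "results/#.txt" gives the result file name for the corresponding job number.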

Program configuration file

The configuration file is required; it is a text file having the traditional structure of such files (an example of the file is included in the test examples for Windows and Linux). Lines of the file must not be continued; empty lines are ignored. A line that starts with ';' or '#' character is considered as a comment (i.e., ignored). A part of the line starting with double slash (//) is also considered as a comment.

Meaningful lines of the file are of two kinds: a section header, which has the format [section_name], and a parameter setting in the format parameter = value. In the latter format, at least one blank or tab character must be used as a delimiter before and after the equals sign. The part of the file from a section header to the next section header or the end of the file is referred to as a configuration section. Parameters can appear in any order inside a section. The configuration file can include several sections in any order, except that the [common] section must be the first one if present, and the [species] section must be the last one. PairHits uses only the sections [common], [pairhits] and [species], and ignores any other sections of the configuration file.
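
The file structure described above can be read with a few lines of code. The following Python sketch is only an illustration of the stated rules, not the parser used by PairHits: the function name parse_config and the returned layout are assumptions, and stripping surrounding double quotes from values is a simplification.

```python
import re

# "parameter = value": blanks or tabs are required around '='.
# An empty value (as allowed for errname) is accepted.
PARAM = re.compile(r"^(\S+)[ \t]+=(?:[ \t]+(.*))?$")


def parse_config(text):
    """Return {section: {"params": {...}, "lines": [...]}} (hypothetical layout)."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()   # drop a trailing // comment
        if not line or line[0] in ";#":         # empty line or comment line
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            sections.setdefault(current, {"params": {}, "lines": []})
            continue
        if current is None:                     # text before any section header
            continue
        match = PARAM.match(line)
        if match and current != "species":
            value = (match.group(2) or "").strip().strip('"')
            sections[current]["params"][match.group(1)] = value
        else:
            # [species] holds tab-delimited records, kept here as raw lines
            sections[current]["lines"].append(line)
    return sections
```

Values are kept as strings; converting them to numbers or Booleans is left to the caller.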

Recognizable configuration parameters in section [pairhits]
splitlog Sets the logging mode for the parallel variant of the program. On the right-hand side, a Boolean value must be given as TRUE (possible forms are yes true 1 +) or FALSE (possible forms are no false 0 -). A true value makes each MPI process generate a separate log file with the name specified by the parameters logname, logext or the command line option -o (the latter has higher priority). If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default value is TRUE.
logname Specifies the log file name (without extension), which may include a directory path. If the value contains blanks, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default value is "phs_log". The value of this parameter can be changed by -o command line option, which modifies the three parameters splitlog, logname, logext at once.
logext Specifies the extension of the log file name. If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default value is ".txt". The value of this parameter can be changed by -o command line option, which modifies the three parameters splitlog, logname, logext at once.
errname Specifies the error log file name (without extension), which may include a directory path. If the value contains blanks, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default value is "phs_err". If an empty value is specified, the error log will not be produced.
errext Specifies the extension of the error log file name. If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default value is ".log".
jobpath Specifies the name of the job file(s) (without extension), which may include a directory path. If the value contains blanks, it must be enclosed in double quotes. Default value is "jobs\". The name will be appended with a serial number of the job file starting from zero (or a value set by start parameter). The job number is left-padded with zeroes up to the fixed number of digits specified by width parameter or determined automatically. The setting can be changed by -j command line option, which modifies the two parameters jobpath, jobext at once.
jobext Specifies the extension of the job file name. Default value is ".txt". The setting can be changed by -j command line option, which modifies the two parameters jobpath, jobext at once.
fastapath The common prefix of input file names, such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format. The variable part of each name is specified in the [species] section of the configuration file. If the value contains blanks, it must be enclosed in double quotes. If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default is the sub-directory "fasta\" of the working directory. The setting can be changed by -f command line option.
result Specifies the name of the result file(s) (without extension), which may include a directory path. If the value contains blanks, it must be enclosed in double quotes. The name will be appended with a serial number of the job file that produced these results. For each job file, no more than one result file is produced. Default setting is "results\", i.e., the result files are written to the subdirectory "results" of the working directory using the job file number as the result file name. The setting can be changed by -r command line option, which modifies the two parameters result, resext at once.
resext Specifies the extension of the result file name. Default value is ".txt". The setting can be changed by -r command line option, which modifies the two parameters result, resext at once.
append Sets how the program writes to existing log and result files. On the right-hand side, a Boolean value must be given in any of the forms yes|true|1|+ for TRUE or no|false|0|- for FALSE. A true value means that the program begins writing at the end of an existing file, thus preserving old data. Otherwise, the file is overwritten from the very beginning. The setting can be overridden using the command line options -a or -w. If the parameter is not set, the files are overwritten by default.
width Specifies the number of digits in the job file number (no more than 5 in the current version). Default value 0 means that the number of digits is determined automatically, which is possible only if the very first job file (i.e., number zero by default, or the number set by the start parameter) is present in the directory specified in the jobpath parameter. The setting can be changed by -y command line option.
keysize Specifies the minimum length of the exactly matching portion of sought-for words. Recommended value is a multiple of 4 in the range from 16 to 48. If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default value is 16. The setting can be changed by -k command line option.
keystep Specifies the step at which keys are selected from the "first" sequence when building the hash table. Default value is 1. The setting can be modified by -p command line option.
frequency Specifies the maximum number of occurrences of a key in the "first" sequence. Keys that occur more often are skipped. Zero value means that all keys are considered irrespective of their frequency. Default value is 200. The setting can be modified by -q command line option.
length Specifies the lower threshold of the word length (only words of greater length are considered). If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default value is 60. The setting can be changed by -l command line option.
ratio Specifies the maximum allowed gzip compression ratio of sought-for words; a word that compresses more strongly is skipped. Value 0 means no complexity check. If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default value is 2.2. The setting can be changed by -z command line option.
serial_del Specifies the maximum allowed number of consecutive deletions, and the maximum total number of deletions when there are no insertions (0 means that deletions are not permitted). Value -1 means that the number of deletions is unlimited. If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default value is 2. The setting can be changed by -b command line option.
maxscore Specifies the maximum allowed total penalty for the words mismatch. The value is rounded to one decimal place. If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default value is 17.5. The setting can be changed by -s command line option.
score_del Specifies the penalty for each deletion (the positive cost of a letter insertion or deletion). The value is rounded to one decimal place. If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default value is 2.1. The setting can be changed by -d command line option.
score_mis Specifies the penalty for each mismatch (the positive cost of a letter substitution). The value is rounded to one decimal place. If the section [pairhits] lacks this parameter, a setting from the section [common] will be used. Default value is 1. The setting can be changed by -i command line option.
start Specifies the number of the job file from which the program starts. If there is no job file with that number, the program checks the next one, and so on. The setting can be changed by -t command line option. Default value is 0.
stop Specifies the maximum number of job files the program will try to process starting from the number set by start parameter. The setting can be changed by -u command line option. Default value is 10,000; the limit is 100,000 in the current version.
minletter Specifies a two-digit number whose most-significant digit is the minimum allowed number of different letters in a sought-for word (4 or less), and whose least-significant digit is the same minimum for a search key. Zero value means no requirement for different letters. The setting can be changed by -e command line option. Default value is 43.
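
The order in which a setting is looked up (command line first, then [pairhits], then [common], then the built-in default) and the accepted Boolean forms can be summarized in a short Python sketch. The names resolve, parse_bool and DEFAULTS are hypothetical, only three of the documented defaults are listed, and the sketch applies the [common] fallback to every parameter although the text above states it only for some of them. Treating Boolean forms case-insensitively is also an assumption.

```python
# A few of the documented defaults, kept as strings as they would appear in a file.
DEFAULTS = {"keysize": "16", "length": "60", "maxscore": "17.5"}

TRUE_FORMS = {"yes", "true", "1", "+"}
FALSE_FORMS = {"no", "false", "0", "-"}


def resolve(name, cli, pairhits, common):
    """Return the effective value of a parameter.

    cli, pairhits and common are dicts of settings taken from the command
    line, the [pairhits] section and the [common] section respectively.
    """
    for source in (cli, pairhits, common):
        if name in source:
            return source[name]
    return DEFAULTS[name]


def parse_bool(text):
    """Interpret the documented Boolean forms (case-insensitivity is assumed)."""
    word = text.strip().lower()
    if word in TRUE_FORMS:
        return True
    if word in FALSE_FORMS:
        return False
    raise ValueError("not a Boolean value: " + repr(text))
```

For example, resolve("length", {}, {}, {"length": "80"}) picks up the [common] setting because neither the command line nor [pairhits] defines it.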

The list of input data in section [species]

This section establishes a correspondence between the organism name, identifier, number and data file names. Each line corresponds to one organism. The program ignores blank lines and lines starting with the character ';' or '#'. Leading blanks or tab characters in the line are skipped; then at least 5 fields must follow, delimited by one or more tab characters. A field may include blanks. The fields contain the following data:

  1. the organism number (unique within the input data set);
  2. short denotation (identifier) of the organism, less than 36 characters in the current version;
  3. name of the organism;
  4. taxonomic code (ref. parameter diversity of the program FinDense);
  5. file name of the full genome in FASTA or GenBank format (however, the whole genome must be in a single file). A path to the file is specified by fastapath parameter in the configuration file or command line.
  6. (optional) file name of the genome annotation in GFF format; PairHits does not use this field.
A template list of input data is provided in the test examples for Windows and Linux.
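
A line of the [species] section, as described by the field list above, could be split as in the following Python sketch. The Species tuple and the function parse_species_line are hypothetical names chosen for illustration; trimming trailing whitespace is an assumption made for robustness.

```python
import re
from typing import NamedTuple, Optional


class Species(NamedTuple):
    number: int                     # field 1: organism number
    ident: str                      # field 2: short identifier (< 36 characters)
    name: str                       # field 3: organism name
    taxon: str                      # field 4: taxonomic code
    genome_file: str                # field 5: FASTA or GenBank file name
    annotation_file: Optional[str]  # field 6: optional GFF file name


def parse_species_line(line):
    """Return a Species record, or None for a blank or comment line."""
    stripped = line.strip(" \t\r\n")
    if not stripped or stripped[0] in ";#":
        return None
    fields = re.split(r"\t+", stripped)   # one or more tabs between fields
    if len(fields) < 5:
        raise ValueError("at least 5 tab-delimited fields are required")
    if len(fields[1]) >= 36:
        raise ValueError("identifier must be shorter than 36 characters")
    gff = fields[5] if len(fields) > 5 else None
    return Species(int(fields[0]), fields[1], fields[2], fields[3], fields[4], gff)
```

The genome file name in field 5 is relative to the prefix given by fastapath, so the full path is the prefix followed by that name.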

Job files and parallel execution of PairHits

Once invoked, PairHits starts looking for candidate words in pairs of strings from different genomes, doing so sequentially or in parallel depending on the program variant and the launch parameters. The pairs of strings to be processed are specified in a set of job files through the configuration parameters jobpath, jobext, start, stop or the respective command line options. An example set of job files is provided in the test examples for Windows and Linux (ref. "jobs\" directory). Each job file points to a string from the "first" genome (in the first line of the file) and one or more strings from the "second" genome (in subsequent lines of the file). There may be multiple second genomes in one file. Together these lines form as many pairs as there are strings specified from the second genome(s). All lines of the job file have the same format, even though different lines can pertain to different genomes. A line must begin with at least one blank or tab character, which is replaced with an asterisk (*) once the line has been processed. Three numeric fields follow, delimited by at least one blank or tab character:

  1. the organism number (from the list of input data),
  2. serial number of a sequence in the full genome file in FASTA format, or 1 in the case of GenBank file (these numbers will appear in results),
  3. zero-based offset of the sequence header (specifically '>' character) from the beginning of the genome file.
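
To make the job line format concrete, the following Python sketch reads one line and reports whether it has already been processed. The function name parse_job_line and the returned tuple are hypothetical; the sketch only reflects the layout described above (a leading blank, tab or asterisk, then three integers).

```python
def parse_job_line(line):
    """Return (done, organism, sequence_number, offset), or None for an empty line.

    done is True when the first character is '*', i.e. the line was
    already processed in an earlier run.
    """
    line = line.rstrip("\r\n")
    if not line.strip():
        return None
    marker = line[0]
    if marker not in " \t*":
        raise ValueError("line must start with a blank, a tab or '*'")
    fields = line[1:].split()          # fields are separated by blanks or tabs
    if len(fields) < 3:
        raise ValueError("three numeric fields are required")
    organism, sequence_number, offset = (int(f) for f in fields[:3])
    return marker == "*", organism, sequence_number, offset


print(parse_job_line(" 3 1 0"))
print(parse_job_line("*\t5 12 104857"))
```

The first line of a job file parsed this way refers to the "first" genome, and every following line to a string of a "second" genome.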

Each job file is always handled by one MPI process until its pairs are exhausted; then the process switches to the next unprocessed job file, and so on. Processed lines are labeled with '*' in the first position; such lines of the job file are skipped during processing, which allows the program to resume an interrupted computation (for such cases, the append mode of writing to output files is provided, see command line options -a, -w and configuration parameter append). The uniprocessor variant of PairHits handles job files in ascending order of their numbers. In the MPI variant of the program, each process starts from the job file having the same number as the processor (within the requested pool), and the next job file number is obtained by adding the total number of processors requested.
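
The numbering rule for the MPI variant amounts to a strided assignment of job files to processes. The Python sketch below lists the job numbers one process would visit; the function jobs_for_rank is hypothetical, and shifting the sequence by a non-zero start value is an assumption, since the text describes the rule only in terms of the processor number.

```python
def jobs_for_rank(rank, n_procs, n_jobs, start=0):
    """Job file numbers visited by MPI process `rank` out of `n_procs`.

    Jobs are numbered start .. start + n_jobs - 1; the process takes the
    job matching its own number and then steps by the number of processes.
    """
    return list(range(start + rank, start + n_jobs, n_procs))


# Ten job files shared by four processes.
for rank in range(4):
    print(rank, jobs_for_rank(rank, 4, 10))
```

With this rule every job number is visited by exactly one process, so no coordination between processes is needed to avoid duplicate work.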

If only a few job files exist, there is no point in using more processors than files. The total computation time will be determined by that of the biggest job. The time is approximately linear in the string length of the first genome and in the total of the string lengths of the second genome(s) that occur in the job file. If there are few genomes and sequences, it is recommended to split long job files into parts and use correspondingly more CPUs. Conversely, if there are many more job files than available CPUs, the computational load will balance itself if the job files are numbered in decreasing order of computation time. Specifically, it may help to number the genomes in descending order of total length, and the sequences of each genome in descending order of length as well.
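
One way to apply this advice is to estimate the cost of each job from the string lengths and to number the jobs in decreasing order of that estimate. The Python sketch below is a rough illustration: the functions estimated_cost and order_by_cost are hypothetical, and adding the two lengths with equal weights is an assumption, since the text only says that the time is approximately linear in both.

```python
def estimated_cost(first_len, second_lens):
    """Rough cost of a job: first string length plus total of second string lengths."""
    return first_len + sum(second_lens)


def order_by_cost(jobs):
    """Return job indices sorted by decreasing estimated cost.

    jobs is a list of (first_len, [second_len, ...]) tuples.
    """
    return sorted(range(len(jobs)),
                  key=lambda i: estimated_cost(*jobs[i]),
                  reverse=True)


jobs = [(100, [50, 50]), (1000, [10]), (10, [5])]
print(order_by_cost(jobs))
```

The job that comes first in the returned order would receive job file number 0 (or the start value), the next one the following number, and so on.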

Result files

If one or more suitable pairs of words are found while processing a job file, PairHits creates a result file in the directory specified by the result configuration parameter; the name of the result file includes the job file number. The result file is a text file, each line of which contains 11 fields separated by the tab character. The file can be imported directly into a spreadsheet such as Excel (if the file size allows). The fields contain:

  1. "first" genome number
  2. sequence number in the first genome
  3. anchor of a word in this sequence of the first genome (the "anchor" is a sum of begin and end positions of the word in the sequence, i.e., twofold center position of the word)
  4. word length in the first genome
  5. "second" genome number
  6. sequence number in the second genome
  7. anchor of a word in this sequence of the second genome
  8. word length in the second genome
  9. indicator of the strand in the second genome (can be 1, -1 or 0)
  10. tenfold edit distance between the words (total penalty for mismatch)
  11. word compression ratio by gzip algorithm.
Important note: In the current version of the program, positions on the complement (negative) strand are numbered from the sequence end.
Examples of the result files are provided in the test examples for Windows and Linux (ref. "myfiles\hits\" directory).
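
For downstream processing, a result line can be decoded as in the following Python sketch. The names parse_result_line and word_span are hypothetical. Recovering the begin and end positions from the anchor assumes that the word length equals end minus begin plus one (inclusive coordinates), which the text above does not state; on the negative strand the positions are counted from the sequence end, as noted above.

```python
def parse_result_line(line):
    """Split one result line into its 11 tab-separated fields."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) != 11:
        raise ValueError("a result line must contain 11 fields")
    g1, s1, a1, l1, g2, s2, a2, l2, strand = (int(x) for x in f[:9])
    return {
        "first": {"genome": g1, "sequence": s1, "anchor": a1, "length": l1},
        "second": {"genome": g2, "sequence": s2, "anchor": a2, "length": l2},
        "strand": strand,                  # 1, -1 or 0
        "distance": float(f[9]) / 10,      # field 10 stores the tenfold value
        "ratio": float(f[10]),             # gzip compression ratio
    }


def word_span(anchor, length):
    """Begin and end of a word, assuming anchor = begin + end and
    length = end - begin + 1 (the second relation is an assumption)."""
    begin = (anchor - length + 1) // 2
    end = (anchor + length - 1) // 2
    return begin, end


hit = parse_result_line("1\t2\t261\t62\t3\t1\t1261\t60\t-1\t53\t1.85")
print(hit["distance"], word_span(hit["first"]["anchor"], hit["first"]["length"]))
```

Dividing the anchor by two gives the center position of the word directly, without any assumption about how the length is counted.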
