PairHits program

The program is part of the iHCE software package, which is aimed at the identification of highly conserved elements (HCEs) in a set of genomes. It finds all pairs of approximately matching words in two sequences from different genomes, thus forming the edges of the source graph. The program is intended for parallel computing under MPI on a high-performance cluster running a 64-bit Windows or Linux system.

PairHits is a command line utility written in C++. The command line syntax:

pairhits [options] [config_file]

(Here, pairhits is the program name in Linux; pairhits64 is the variant for 64-bit Windows with MPICH2, pairhits64ms is the variant with Microsoft MPI, and pairhits64nompi is the variant without MPI.)

The program options can be specified in any of three formats: /x[ value], -x[ value], --x…xx[=value]; some options take no value. Most options are used to change the “usual” values of the parameters, which are set in the program configuration file or by default (a value given on the command line has the highest priority).

The recognized command line options and their default values

-?
--help
Display a brief help on command line options.
-a
--append
Append results to the end of the output file if it exists (see also the -w option). The default setting is taken from the configuration file; if it is not specified there, the file will be overwritten.
-b value
--belt=value
The maximum allowed number of consecutive deletions, and of deletions in total when there are no insertions (0 means that deletions are not permitted). The value -1 means that the number of deletions is not limited. The default value is 2.
-c filename
--config=filename
The configuration file name, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is config.ini file in the working directory.
-d value
--score_del=value
The penalty for each deletion (the positive cost of a letter insertion or deletion). The value is rounded to one decimal place. The default value is 2.1.
-e value
--minletter=value
A two-digit number is specified. The most-significant digit is the minimum allowed number of different letters in a sought-for word (4 or less), and the least-significant digit is the same minimum for a search key. A zero value means no requirement for different letters. The default value is 43.
-f name
--fastapath=name
The common prefix of the input file names such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of the names is specified in the [species] section of the configuration file. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the subdirectory fasta\ of the working directory.
-i value
--score_mis=value
Specifies the penalty for each mismatch (the positive cost of a letter substitution). The value is rounded to one decimal place. The default value is 1.
-j filename
--job=filename
The name of the job file(s) for the program, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. A # character in a file name stands for a serial number, provided that all job files are numbered consecutively starting from a certain number (0 by default, but it can be modified by the -t option), and each file number consists of the same number of digits, left-padded with zeroes as appropriate. The number of digits is specified by the -y option or determined automatically. The default file name is jobs\#.txt.
-k value
--key=value
The minimum length of the exactly matching portion of sought-for words. The default value is 16.
-l value
--length=value
The lower threshold of the word length (only the words of greater length are considered). The default value is 60.
-n
--nompi
If specified, the multiprocessor variant of the program is forced to run in uniprocessor mode even in an MPI environment. This option does not help on some systems without MPI, where the program crashes; the uniprocessor variant of the program should be used in such cases.
-o filename
--log=filename
The name of the program log file, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. A # character in the file name will be replaced by the CPU number, provided that separate logging of each MPI process is set in the configuration file. The default file name is phs_log#.txt.
-p value
--step=value
The step of selecting a key from the “first” sequence when building a hash table. The default value is 1.
-q value
--frequency=value
The maximum number of occurrences of a key in the “first” sequence. Keys that occur more often will be skipped. A zero value means that all keys are considered irrespective of their frequency. The default value is 200.
-r filename
--result=filename
The name of the result file(s), which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. A # character in the file name marks the place where the job number will appear (see the -j option). The default file name is results\#.txt.
-s value
--maxscore=value
The maximum allowed total penalty for a mismatch between the words. The value is rounded to one decimal place. The default value is 17.5.
-t value
--start=value
The number of the job file from which the program starts (see also the options -j, -u). The default value is 0.
-u value
--use=value
The maximum number of job files the program will try to process (see also the options -j, -t); limited to 100000 in the current version. The default value is 10000.
-w
--rewrite
Overwrite existing output file(s) (see also the -a option). The default setting is taken from the configuration file; if it is not specified there, the existing file(s) will be overwritten.
-y value
--width=value
The number of digits in the job file number, not greater than 5 in the current version (see also the -j option). The default value 0 means that the number of digits will be determined automatically (which is not possible in some cases; see the description of the width configuration parameter below).
-z value
--ratio=value
The maximum allowed gzip compression ratio of sought-for words; if a word is compressed more, it is skipped. The value 0 means no check. The default value is 2.2.

The first command line argument that is not an above option or its value will be treated as the name of the configuration file (similar to the -c option).
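As an illustration of how the # placeholder in the -j and -r arguments combines with the job number and the -y width, the following Python sketch expands a pattern into a concrete file name. The helper expand_pattern is hypothetical and not part of PairHits. It assumes a single # in the pattern and an explicit width, because automatic width detection depends on the files actually present.

```python
def expand_pattern(pattern: str, number: int, width: int) -> str:
    """Replace the '#' in a file name pattern with a zero-padded job number.

    Hypothetical helper: it mirrors the documented naming rule only and
    requires an explicit width (1 to 5 digits, as in the current version).
    """
    if pattern.count("#") != 1:
        # Assumption of this sketch: exactly one placeholder per pattern.
        raise ValueError("pattern must contain exactly one '#' character")
    if not 1 <= width <= 5:
        raise ValueError("width must be between 1 and 5 digits")
    if number < 0 or len(str(number)) > width:
        raise ValueError("job number does not fit into the given width")
    return pattern.replace("#", str(number).zfill(width))


if __name__ == "__main__":
    # Names of the first three job files with three-digit numbers.
    for n in range(3):
        print(expand_pattern("jobs\\#.txt", n, 3))
```

The same function can be applied to a result pattern such as results\#.txt, since the result file carries the number of the job file that produced it.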

The program configuration file

The configuration file is required; it is a text file having the traditional structure of such files (an example of the file is included in the test examples for Windows and Linux). Lines of the file must not be wrapped; empty lines are ignored. A line that starts with a ; or # character is considered a comment (i.e., ignored). A part of the line starting with a double slash (//) is also considered a comment.

Meaningful lines of the file can be of two kinds: a section name, which has the format [section_name], and a parameter setting in the format parameter = value. In the latter format, at least one space or tab character must be used as a delimiter before and after the equals sign. The part of the file from a section name to the next section name or the end of the file is referred to as a configuration section. Parameters can appear in any order inside a section. The configuration file can include several sections in any order, with the exception of the [common] section, which must be the first one if present, and the [species] section, which must be the last one. PairHits uses only the sections [common], [pairhits], [species] and ignores any other sections of the configuration file.
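The rules above can be sketched as a small reader, shown here in Python for illustration only. This is not the parser used by PairHits. The function name parse_config is hypothetical, lines that do not match the parameter = value form are silently skipped (how the program treats them is not documented here), and a value that itself contains // would be cut short by this sketch.

```python
import re

# "name = value": whitespace is required on both sides of the equals sign;
# the value itself may be empty.
PARAM = re.compile(r"^(\w+)[ \t]+=(?:[ \t]+(.*))?$")


def parse_config(text: str) -> dict[str, dict[str, str]]:
    """Collect parameters per section, stopping at the [species] section."""
    sections: dict[str, dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop a trailing // comment
        if not line or line[0] in ";#":       # empty line or full-line comment
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]
            if name == "species":             # tab-delimited list, not parameters
                break
            current = sections.setdefault(name, {})
            continue
        match = PARAM.match(line)
        if match and current is not None:
            value = match.group(2) or ""
            if len(value) >= 2 and value[0] == value[-1] == '"':
                value = value[1:-1]           # strip enclosing double quotes
            current[match.group(1)] = value
    return sections
```

Values are returned as text; converting them to numbers or Boolean values is left to the caller.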

The recognized configuration parameters in the [pairhits] section

splitlog
Sets the logging mode for the parallel variant of the program. The right-hand side must be a Boolean value: TRUE (possible forms are yes, true, 1, +) or FALSE (possible forms are no, false, 0, -). A true value makes each MPI process generate a separate log file with the name specified by the parameters logname, logext or the command line option -o (the latter has a higher priority). If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default value is TRUE.
logname
Specifies the log file name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default value is phs_log. The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
logext
Specifies the extension of the log file name. If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default value is .txt. The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
errname
Specifies the error log file name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process will be appended to the name. If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default value is phs_err. If an empty value is specified, the error log will not be produced.
errext
Specifies the extension of the error log file name. If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default value is .log.
jobpath
Specifies the job file(s) name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. The default value is jobs\. The name will be appended with a serial number of the job file starting from zero (or a value set by the start parameter). The job number is left-padded with zeroes up to the fixed number of digits specified by the width parameter or determined automatically. The setting can be changed by the -j command line option, which modifies the two parameters jobpath, jobext at once.
jobext
Specifies the extension of the job file name. The default value is .txt. The setting can be changed by the -j command line option, which modifies the two parameters jobpath, jobext at once.
fastapath
The common prefix of input file names such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format. The variable part of the names is specified in the [species] section of the configuration file. If the value contains spaces, it must be enclosed in double quotes. If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default is the subdirectory fasta\ of the working directory. The setting can be changed by the -f command line option.
result
Specifies the result file(s) name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. The name will be appended with a serial number of the job file that produced these results. For each job file, no more than one result file is produced. The default setting is results\, i.e., the result files are written to the subdirectory results of the working directory using the job file number as the result file name. The setting can be changed by the -r command line option, which modifies the two parameters result, resext at once.
resext
Specifies the extension of the result file name. The default value is .txt. The setting can be changed by the -r command line option, which modifies the two parameters result, resext at once.
append
Sets how the program writes to existing log and result files. The right-hand side must be a Boolean value in any of the forms yes, true, 1, + for TRUE or no, false, 0, - for FALSE. A true value means that the program begins writing at the end of an existing file, thus preserving the old data. Otherwise, the file is overwritten from the very beginning. The setting can be modified using the command line options -a or -w. If the parameter is not set, the files are overwritten by default.
width
Specifies the number of digits in the job file number (no more than 5 in the current version). The default value 0 means that the number of digits is determined automatically, which is possible only if the very first job file (number zero by default, or the number set by the start parameter) is present in the directory specified by the jobpath parameter. The setting can be changed by the -y command line option.
keysize
Specifies the minimum length of the exactly matching portion of sought-for words. The recommended value is a multiple of 4 in the range from 16 to 48. If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default value is 16. The setting can be changed by the -k command line option.
keystep
Specifies the step of key selection from the “first” sequence when building the hash table. The default value is 1. The setting can be modified by the -p command line option.
frequency
Specifies the maximum number of occurrences of a key in the “first” sequence. Keys that occur more often will be skipped. The value 0 means that all keys are considered irrespective of their frequency. The default value is 200. The setting can be modified by the -q command line option.
length
Specifies the lower threshold of the word length (only words of greater length are considered). If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default value is 60. The setting can be changed by the -l command line option.
ratio
Specifies the maximum allowed gzip compression ratio of sought-for words; if a word is compressed more, it is skipped. The value 0 means no complexity check. If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default value is 2.2. The setting can be changed by the -z command line option.
serial_del
Specifies the maximum allowed number of consecutive deletions, and of deletions in total when there are no insertions (0 means that deletions are not permitted). The value -1 means that the number of deletions is not limited. If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default value is 2. The setting can be changed by the -b command line option.
maxscore
Specifies the maximum allowed total penalty for a mismatch between the words. The value is rounded to one decimal place. If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default value is 17.5. The setting can be changed by the -s command line option.
score_del
Specifies the penalty for each deletion (the positive cost of a letter insertion or deletion). The value is rounded to one decimal place. If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default value is 2.1. The setting can be changed by the -d command line option.
score_mis
Specifies the penalty for each mismatch (the positive cost of a letter substitution). The value is rounded to one decimal place. If the [pairhits] section lacks this parameter, the setting from the [common] section will be used. The default value is 1. The setting can be changed by the -i command line option.
start
Specifies the number of the job file from which the program starts. If there is no job file with that number, the program checks the next one, and so on. The setting can be changed by the -t command line option. The default value is 0.
stop
Specifies the maximum number of job files the program will try to process, starting from the number set by the start parameter. The setting can be changed by the -u command line option. The default value is 10000; the value is limited to 100000 in the current version.
minletter
Specifies a two-digit number whose most-significant digit is the minimum allowed number of different letters in a sought-for word (4 or less) and whose least-significant digit is the same minimum for a search key. The value 0 means no requirement for different letters. The setting can be changed by the -e command line option. The default value is 43.
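To make the encoding of the minletter value concrete, here is a short Python sketch that splits the two-digit number into its two minima and applies one of them to a string. The function names decode_minletter and has_enough_letters are hypothetical, and counting distinct characters case-insensitively is an assumption of this sketch, not a documented property of PairHits.

```python
def decode_minletter(value: int) -> tuple[int, int]:
    """Split the two-digit value into (word_minimum, key_minimum)."""
    if not 0 <= value <= 99:
        raise ValueError("minletter must be a number from 0 to 99")
    # Most-significant digit: sought-for word; least-significant digit: key.
    return divmod(value, 10)


def has_enough_letters(sequence: str, minimum: int) -> bool:
    """Check that a string contains at least `minimum` different letters.

    A minimum of 0 means that there is no requirement.
    """
    return minimum == 0 or len(set(sequence.upper())) >= minimum


if __name__ == "__main__":
    word_min, key_min = decode_minletter(43)  # the default value
    print(word_min, key_min)
    print(has_enough_letters("ACGTACGT", word_min))
    print(has_enough_letters("AAAACCCC", key_min))
```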

The list of input data in the [species] section

This section establishes a correspondence between the organism name, identifier, number and data file names. Each line corresponds to an organism. The program ignores blank lines and lines starting with a ; or # character. Leading spaces or tab characters in the line are skipped, then at least 5 fields must follow delimited by one or more tab characters. A field may include spaces. The fields contain the following data:

  1. the organism number (unique within the input data set);
  2. a short identifier of the organism, less than 36 characters in the current version;
  3. name of the organism;
  4. taxonomic code (see the diversity parameter of the FinDense program);
  5. file name of the full genome in FASTA or GenBank format (the whole genome must be in a single file); the path to the file is specified by the fastapath parameter in the configuration file or on the command line;
  6. (optional) file name of the genome annotation in GFF format; PairHits does not use this field.

A template list of input data is provided in the test examples for Windows and Linux.
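The line format of the [species] section can be illustrated with the following Python sketch. It is not code from PairHits. The names Species and parse_species_line are hypothetical, the taxonomic code is kept as text, and the organism data used in the usage lines are invented for the example.

```python
import re
from typing import NamedTuple, Optional


class Species(NamedTuple):
    number: int                  # organism number, unique within the data set
    ident: str                   # short identifier, fewer than 36 characters
    name: str                    # organism name, may contain spaces
    taxon: str                   # taxonomic code, kept as text here
    genome_file: str             # appended to the fastapath prefix
    gff_file: Optional[str]      # optional annotation file, unused by PairHits


def parse_species_line(line: str) -> Optional[Species]:
    """Parse one [species] line; return None for blank and comment lines."""
    text = line.strip()
    if not text or text[0] in ";#":
        return None
    fields = re.split(r"\t+", text)       # one or more tabs separate fields
    if len(fields) < 5:
        raise ValueError("at least 5 tab-delimited fields are required")
    number, ident, name, taxon, genome_file = fields[:5]
    if len(ident) >= 36:
        raise ValueError("identifier must be shorter than 36 characters")
    gff_file = fields[5] if len(fields) > 5 else None
    return Species(int(number), ident, name, taxon, genome_file, gff_file)


if __name__ == "__main__":
    # Invented example line; real lists come with the test examples.
    print(parse_species_line("  1\teco\tEscherichia coli\t562\teco.fasta"))
```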

Job files and parallel execution of PairHits

Once invoked, PairHits starts looking for candidate words in pairs of strings from different genomes, doing so sequentially or in parallel depending on the program variant and the launch parameters. The pairs of strings to be processed are specified in a set of job files, which is selected through the configuration parameters jobpath, jobext, start, stop or the respective command line options. An example set of job files is provided in the test examples for Windows and Linux (see the jobs\ directory). Each job file points to a string from the “first” genome (in the first line of the file) and one or more strings from the “second” genome (in the subsequent lines of the file). There may be multiple second genomes in one file. Together these lines form as many pairs as there are strings specified from the second genome(s). All lines of the job file have the same format, even though different lines can pertain to different genomes. A line must begin with at least one space or tab character, which is replaced with an asterisk (*) once the line has been processed. Three numeric fields follow, delimited by at least one space or tab character:

  1. the organism number (from the list of input data),
  2. serial number of a sequence in the full genome file in FASTA format, or 1 in the case of GenBank file (these numbers will appear in results),
  3. zero-based offset of the sequence header (specifically the > character) from the beginning of the genome file.
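A job file in this format can be read with a few lines of Python, as in the sketch below. The names JobLine, parse_job_line and pending_pairs are hypothetical. The sketch assumes that the * marker sits in the first character position of a processed line, and it does not look at the marker of the first line, since its role there is not described above.

```python
from typing import Iterable, NamedTuple


class JobLine(NamedTuple):
    done: bool       # True if the line starts with '*', i.e. was processed
    organism: int    # organism number from the list of input data
    sequence: int    # serial number of the sequence in the genome file
    offset: int      # zero-based offset of the sequence header in the file


def parse_job_line(line: str) -> JobLine:
    """Parse one job file line made of a marker and three numeric fields."""
    if not line or line[0] not in " \t*":
        raise ValueError("a job line must start with a space, a tab or '*'")
    fields = line[1:].split()
    if len(fields) < 3:
        raise ValueError("three numeric fields are required")
    organism, sequence, offset = (int(f) for f in fields[:3])
    return JobLine(line[0] == "*", organism, sequence, offset)


def pending_pairs(lines: Iterable[str]) -> list[tuple[JobLine, JobLine]]:
    """Pair the first line with every later line not yet marked as done."""
    parsed = [parse_job_line(l.rstrip("\r\n")) for l in lines if l.strip()]
    if not parsed:
        return []
    first, others = parsed[0], parsed[1:]
    return [(first, second) for second in others if not second.done]


if __name__ == "__main__":
    # Invented job file content: one first-genome string, three second ones.
    job = [" 1 1 0", " 2 1 0", "*2 2 5000000", "\t3 1 0"]
    for first, second in pending_pairs(job):
        print(first.organism, "vs", second.organism, second.sequence)
```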

Each job file is always handled by a single MPI process until its pairs are exhausted; the process then switches to the next unprocessed job file, and so on. Processed lines are marked with * in the first position; such lines are skipped during processing, which allows the program to resume an interrupted computation (for such cases, the append mode of writing to output files is provided; see the command line options -a, -w and the configuration parameter append). The uniprocessor variant of PairHits handles job files in ascending order of their numbers. In the MPI variant of the program, each process starts from the job file whose number equals the processor number (within the requested pool), and each next job file number is obtained by adding the total number of processors requested.
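The numbering rule for the MPI variant can be written down as a stride over job numbers. The Python sketch below is an illustration under stated assumptions, not the program's code: the function jobs_for_process is hypothetical, the start value is assumed to shift every process by the same amount, and count plays the role of the stop parameter (the maximum number of job files to try).

```python
def jobs_for_process(rank: int, processes: int,
                     start: int = 0, count: int = 10000) -> range:
    """Job file numbers visited by process `rank` out of `processes`.

    The process begins at start + rank and advances by the total number
    of processes, staying within the window [start, start + count).
    """
    if not 0 <= rank < processes:
        raise ValueError("rank must be in the range 0..processes-1")
    return range(start + rank, start + count, processes)


if __name__ == "__main__":
    # Four processes sharing ten job files numbered 0..9.
    for rank in range(4):
        print(rank, list(jobs_for_process(rank, 4, count=10)))
```

With a single process the same function yields the job numbers in ascending order, which matches the behavior described for the uniprocessor variant.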

If only a few job files exist, there is no point in using more processors than files. The total computation time is determined by the largest job being processed. The time is approximately linear in the string length of the first genome and in the total of the string lengths of the second genome(s) that occur in the job file. If there are few genomes and sequences, it is recommended to split long job files into parts and to use correspondingly more CPUs. Conversely, if there are many more job files than available CPUs, the computational load balances itself, provided that the job files are numbered in decreasing order of computation time. Specifically, it may help to number the genomes in descending order of total length, and the sequences of each genome in descending order of length as well.
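The effect of the numbering order can be estimated with a toy model, sketched here in Python. The model assumes that job number i is handled by process i modulo the number of processes, as in the stride rule described earlier, and that each job has a known cost. The function finish_time and the cost values are invented for illustration and say nothing about real running times.

```python
def finish_time(costs: list[float], processes: int) -> float:
    """Largest per-process total when job i goes to process i % processes."""
    totals = [0.0] * processes
    for i, cost in enumerate(costs):
        totals[i % processes] += cost
    return max(totals)


if __name__ == "__main__":
    costs = [9, 1, 8, 2, 7, 3]                # invented job costs
    print(finish_time(costs, 2))              # jobs numbered as given
    print(finish_time(sorted(costs, reverse=True), 2))  # largest jobs first
```

In this example, numbering the jobs from the most to the least expensive lowers the estimated finish time, because the expensive jobs no longer all land on the same process.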

Result files

If one or more suitable pairs of words are found while processing a job file, PairHits creates a result file in the directory specified by the result configuration parameter; the name of the result file includes the job file number. A result file is a text file in which every line contains the same set of 11 fields separated by a single tab character. Such a file can be imported directly into a spreadsheet program like Excel (if the file size allows). The fields contain:

  1. “first” genome number
  2. sequence number in the first genome
  3. anchor of a word in this sequence of the first genome (the “anchor” is the sum of the start and end positions of the word in the sequence, i.e. twice the center position of the word)
  4. word length in the first genome
  5. “second” genome number
  6. sequence number in the second genome
  7. anchor of a word in this sequence of the second genome
  8. word length in the second genome
  9. indicator of the strand in the second genome (can be 1, -1 or 0)
  10. ten times the edit distance between the words (total penalty for mismatch)
  11. word compression ratio by the gzip algorithm.

Important note: In the current version of the program, positions on the complementary (negative) strand are numbered from the sequence end.

Examples of the result files are provided in the test examples for Windows and Linux (see the myfiles\hits\ directory).
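For readers who prefer to process result files in a script rather than in a spreadsheet, the following Python sketch parses one line into the 11 fields listed above and converts two of them into more convenient units. The names Hit, parse_hit and center are hypothetical, the sample line is invented, and the sketch assumes that the first ten fields are written as plain numbers.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    genome1: int
    seq1: int
    anchor1: int      # start + end position of the word in the first genome
    length1: int
    genome2: int
    seq2: int
    anchor2: int      # start + end position of the word in the second genome
    length2: int
    strand: int       # 1, -1 or 0
    penalty: float    # field 10 divided by ten, i.e. the total penalty
    ratio: float      # gzip compression ratio of the word


def parse_hit(line: str) -> Hit:
    """Parse one tab-separated result line with exactly 11 fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 11:
        raise ValueError("a result line must contain 11 fields")
    ints = [int(f) for f in fields[:9]]
    return Hit(*ints, penalty=float(fields[9]) / 10, ratio=float(fields[10]))


def center(anchor: int) -> float:
    """Center position of a word: the anchor is twice the center."""
    return anchor / 2


if __name__ == "__main__":
    hit = parse_hit("1\t1\t2059\t60\t2\t3\t8119\t61\t-1\t42\t1.85")  # invented
    print(hit.penalty, center(hit.anchor1), hit.strand)
```

Positions on the complementary strand are numbered from the sequence end, as noted above, so anchors with strand -1 need to be interpreted accordingly.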

Downloadable files

See the corresponding section of the iHCE page.