PairHits program
The program is a part of the iHCE software package aimed
at identification of highly conserved elements (HCEs) in a set of genomes. It finds all pairs
of approximately matching words in two sequences from different genomes, thus forming the edges
of the source graph. The program is intended for parallel computing under MPI on a
high-performance cluster running a 64-bit Windows or Linux system.
PairHits is a command line utility written in C++. The command line syntax:
pairhits [options] [config_file]
(Here, pairhits is the program name in Linux; pairhits64 is a variant for 64-bit Windows
with MPICH2, pairhits64ms is a variant with Microsoft MPI, and pairhits64nompi is a variant
without MPI.)
The program options can be specified in any of three formats: /x[ value], -x[ value],
or --x…xx[=value]; some options take no value. Most options override the “usual” values
of the parameters, which are set in the program configuration file or by default (a value
given on the command line has the highest priority).
The recognized command line options and their default values
- -?
- --help
- Display a brief help on command line options.
- -a
- --append
- Append results to the end of the output file if it exists (see also the -w option). The default setting is taken from the configuration file; if it is not specified there, the file is overwritten.
- -b value
- --belt=value
- The maximum allowed number of consecutive deletions, and the total number of deletions in the absence of insertions (0 means that deletions are not permitted). The value -1 means that the number of deletions is not limited. The default value is 2.
- -c filename
- --config=filename
- The configuration file name, which may include a directory path.
If the name contains spaces, the whole argument must be enclosed in double quotes.
The default is the config.ini file in the working directory.
- -d value
- --score_del=value
- The penalty for each deletion (the positive cost of a letter insertion or deletion).
The value is rounded to one decimal place. The default value is 2.1.
- -e value
- --minletter=value
- A two-digit number. The most significant digit is the minimum allowed number of different letters in a sought-for word (4 or less), and the least significant digit is the same limit for a search key. A zero value means no requirement for different letters. The default value is 43.
- -f name
- --fastapath=name
- The common prefix of the input file names, such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format (the latter is possible only if the genome consists of a single sequence). The variable part of the names is specified in the [species] section of the configuration file. If the name contains spaces, the whole argument must be enclosed in double quotes. The default is the subdirectory fasta\ of the working directory.
- -i value
- --score_mis=value
- Specifies the penalty for each mismatch (the positive cost of a letter substitution). The value is rounded to one decimal place. The default value is 1.
- -j filename
- --job=filename
- The name of the job file(s) for the program, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. A # character in the file name stands for a serial number, provided that all job files are numbered consecutively starting from a certain number (0 by default, which can be changed by the -t option) and each file number has the same number of digits, left-padded with zeros as needed. The number of digits is specified by the -y option or determined automatically. The default file name is jobs\#.txt.
- -k value
- --key=value
- The minimum length of the exactly matching portion of sought-for words. The default value is 16.
- -l value
- --length=value
- The lower threshold of the word length (only words of greater length are considered). The default value is 60.
- -n
- --nompi
- If specified, the multiprocessor variant of the program is forced to run in uniprocessor mode even in an MPI environment. This option does not help on some systems without MPI, where the program crashes; the uniprocessor variant of the program should be used in such cases.
- -o filename
- --log=filename
- The name of the program log file, which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. A # character in the file name is replaced by the CPU number, provided that separate logging of each MPI process is set in the configuration file. The default file name is phs_log#.txt.
- -p value
- --step=value
- The step of selecting a key from the “first” sequence when building a hash table.
The default value is 1.
- -q value
- --frequency=value
- The maximum number of occurrences of a key in the “first” sequence. Keys that occur more often are skipped. A zero value means that all keys are considered irrespective of their frequency. The default value is 200.
- -r filename
- --result=filename
- The name of the result file(s), which may include a directory path. If the name contains spaces, the whole argument must be enclosed in double quotes. A # character in the file name marks the place where the job number will appear (ref. the -j option). The default file name is results\#.txt.
- -s value
- --maxscore=value
- The maximum allowed total penalty for mismatch between the words. The value is rounded to one decimal place. The default value is 17.5.
- -t value
- --start=value
- The number of the job file from which the program starts (see also the options -j, -u). The default value is 0.
- -u value
- --use=value
- The maximum number of job files the program will try to process (see also the options -j, -t); limited to 100000 in the current version. The default value is 10000.
- -w
- --rewrite
- Overwrite existing output file(s); see also the -a option. The default setting is taken from the configuration file; if it is not specified there, the existing file(s) are overwritten.
- -y value
- --width=value
- The number of digits in the job file number, not greater than 5 in the current version (see also the -j option). The default value 0 means that the number of digits is determined automatically (which is not possible in some cases, see the description of the configuration parameter width below).
- -z value
- --ratio=value
- The maximum allowed gzip compression ratio of sought-for words; if a word is compressed more, it is skipped. The value 0 means no check. The default value is 2.2.
The first command line argument that is neither one of the above options nor an option value
is treated as the name of the configuration file (as with the -c option).
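The # placeholder of the -j and -r options can be illustrated with a short Python sketch. It is not taken from the PairHits source: the function name expand_job_name and its parameters are hypothetical, and the sketch assumes that # is simply replaced by the job number left-padded with zeros to a known width (automatic width detection is not modeled).

```python
def expand_job_name(template: str, number: int, width: int) -> str:
    """Replace the '#' placeholder with a zero-padded job number.

    Hypothetical helper: 'template' mimics a -j or -r value, 'number' is the
    job file number, 'width' is the digit count that -y would set.
    """
    if "#" not in template:
        raise ValueError("template has no '#' placeholder")
    if not 1 <= width <= 5:
        raise ValueError("width must be between 1 and 5")
    # Pad on the left with zeros so that all numbers have the same width.
    return template.replace("#", str(number).zfill(width), 1)


# Names of job files 0, 1, 2 with three-digit numbers.
names = [expand_job_name("jobs\\#.txt", n, 3) for n in range(3)]
print(names)
```

Under these assumptions, job 7 with a width of 3 maps to jobs\007.txt, and the matching result file would be results\007.txt.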
The program configuration file
The configuration file is required; it is a text file with the traditional
structure of such files (an example of the file is included in the test examples
for Windows and Linux). Lines of the file must not be wrapped;
empty lines are ignored. A line that starts with a ; or # character
is considered a comment (i.e., ignored). The part of a line starting with a double slash
(//) is also considered a comment.
Meaningful lines of the file are of two kinds: a section name in the format
[section_name], and a parameter setting in the format parameter = value.
In the latter format, at least one space or tab character must be used as a delimiter
before and after the equals sign. The part of the file from a section name up to the next
section name or the end of the file is referred to as a configuration section. Parameters
can appear in any order inside a section. The configuration file can include several
sections in any order, except that the [common] section, if present, must be the first one,
and the [species] section must be the last one. PairHits uses only the sections [common],
[pairhits], and [species] and ignores any other sections of the configuration file.
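The file format described above can be sketched as a small Python parser. This is only an illustration of the stated rules, not the parser used by PairHits: the function name parse_config is hypothetical, the sample text is invented, and the sketch assumes that surrounding double quotes are removed from values. Lines that are not parameter settings (such as the tab-delimited lines of the [species] section) are simply skipped.

```python
import re

SECTION = re.compile(r"^\[([^\]]+)\]$")
# A setting needs at least one space or tab on each side of '='.
SETTING = re.compile(r"^(\S+)[ \t]+=[ \t]+(.*)$")


def parse_config(text: str) -> dict:
    """Collect 'parameter = value' settings into one dict per section."""
    sections: dict = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()   # drop a trailing // comment
        if not line or line[0] in ";#":        # empty line or comment line
            continue
        match = SECTION.match(line)
        if match:
            current = sections.setdefault(match.group(1), {})
            continue
        match = SETTING.match(line)
        if match and current is not None:
            # Assumption: enclosing double quotes are not part of the value.
            current[match.group(1)] = match.group(2).strip().strip('"')
    return sections


sample = """\
; invented example, not the distributed configuration file
[common]
keysize = 16
[pairhits]
length = 60   // lower threshold of the word length
logname = "my logs\\phs_log"
"""
config = parse_config(sample)
print(config["pairhits"]["length"])
```

Note that a line such as keysize=16, written without spaces around the equals sign, does not match the setting pattern and is ignored by this sketch.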
The recognized configuration parameters in the [pairhits] section
- splitlog
- Sets the logging mode for the parallel variant of the program. The right-hand side must be a Boolean value: TRUE (possible forms are yes, true, 1, +) or FALSE (possible forms are no, false, 0, -). A true value makes each MPI process generate a separate log file with the name specified by the parameters logname, logext or the command line option -o (the latter has a higher priority). If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default value is TRUE.
- logname
- Specifies the log file name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process is appended to the name. If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default value is phs_log. The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
- logext
- Specifies the extension of the log file name. If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default value is .txt. The value of this parameter can be changed by the -o command line option, which modifies the three parameters splitlog, logname, logext at once.
- errname
- Specifies the error log file name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. If splitlog = true was specified, a zero-based number of the MPI process is appended to the name. If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default value is phs_err. If an empty value is specified, the error log is not produced.
- errext
- Specifies the extension of the error log file name. If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default value is .log.
- jobpath
- Specifies the job file(s) name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. The default value is jobs\. A serial number of the job file is appended to the name, starting from zero (or from the value set by the start parameter). The job number is left-padded with zeros up to the fixed number of digits specified by the width parameter or determined automatically. The setting can be changed by the -j command line option, which modifies the two parameters jobpath, jobext at once.
- jobext
- Specifies the extension of the job file name. The default value is .txt. The setting can be changed by the -j command line option, which modifies the two parameters jobpath, jobext at once.
- fastapath
- The common prefix of input file names, such as a directory path. These files contain the complete genome sequences in either FASTA or GenBank format. The variable part of the names is specified in the [species] section of the configuration file. If the value contains spaces, it must be enclosed in double quotes. If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default is the subdirectory fasta\ of the working directory. The setting can be changed by the -f command line option.
- result
- Specifies the result file(s) name (without extension), which may include a directory path. If the value contains spaces, it must be enclosed in double quotes. The serial number of the job file that produced the results is appended to the name. For each job file, no more than one result file is produced. The default setting is results\, i.e., the result files are written to the subdirectory results of the working directory using the job file number as the result file name. The setting can be changed by the -r command line option, which modifies the two parameters result, resext at once.
- resext
- Specifies the extension of the result file name. The default value is .txt. The setting can be changed by the -r command line option, which modifies the two parameters result, resext at once.
- append
- Sets how the program writes to existing log and result files. The right-hand side must be a Boolean value in any of the forms yes, true, 1, + for TRUE or no, false, 0, - for FALSE. A true value means that the program begins writing at the end of an existing file, thus preserving the old data. Otherwise, the file is overwritten from the very beginning. The setting can be modified using the command line options -a or -w. If the parameter is not set, the files are overwritten by default.
- width
- Specifies the number of digits in the job file number (no more than 5 in the current version). The default value 0 means that the number of digits is determined automatically, which is possible only if the very first job file (number zero by default, or the number set by the start parameter) is present in the directory specified by the jobpath parameter. The setting can be changed by the -y command line option.
- keysize
- Specifies the minimum length of the exactly matching portion of sought-for words. The recommended value is a multiple of 4 in the range from 16 to 48. If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default value is 16. The setting can be changed by the -k command line option.
- keystep
- Specifies the step of selecting a key from the “first” sequence when building the hash table. The default value is 1. The setting can be modified by the -p command line option.
- frequency
- Specifies the maximum number of occurrences of a key in the “first” sequence. Keys that occur more often are skipped. The value 0 means that all keys are considered irrespective of their frequency. The default value is 200. The setting can be modified by the -q command line option.
- length
- Specifies the lower threshold of the word length (only words of greater length are considered). If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default value is 60. The setting can be changed by the -l command line option.
- ratio
- Specifies the maximum allowed gzip compression ratio of sought-for words; if a word is compressed more, it is skipped. The value 0 means no complexity check. If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default value is 2.2. The setting can be changed by the -z command line option.
- serial_del
- Specifies the maximum allowed number of consecutive deletions, and the total number of deletions in the absence of insertions (0 means that deletions are not permitted). The value -1 means that the number of deletions is not limited. If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default value is 2. The setting can be changed by the -b command line option.
- maxscore
- Specifies the maximum allowed total penalty for mismatch between the words. The value is rounded to one decimal place. If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default value is 17.5. The setting can be changed by the -s command line option.
- score_del
- Specifies the penalty for each deletion (the positive cost of a letter insertion or deletion). The value is rounded to one decimal place. If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default value is 2.1. The setting can be changed by the -d command line option.
- score_mis
- Specifies the penalty for each mismatch (the positive cost of a letter substitution). The value is rounded to one decimal place. If the [pairhits] section lacks this parameter, the setting from the [common] section is used. The default value is 1. The setting can be changed by the -i command line option.
- start
- Specifies the number of the job file from which the program starts. If there is no job file with this number, the program checks the next one, and so on. The setting can be changed by the -t command line option. The default value is 0.
- stop
- Specifies the maximum number of job files the program will try to process starting from the number set by the start parameter. The setting can be changed by the -u command line option. The default value is 10000; the value is limited to 100000 in the current version.
- minletter
- Specifies a two-digit number whose most significant digit is the minimum allowed number of different letters in a sought-for word (4 or less), and whose least significant digit is the same limit for a search key. The value 0 means no requirement for different letters. The setting can be changed by the -e command line option. The default value is 43.
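The order in which settings take effect (command line first, then the [pairhits] section, then the [common] section where a parameter supports it, then the built-in default) can be sketched in Python as follows. The helper names resolve and parse_bool are hypothetical, the sample values are invented, and treating Boolean spellings as case-insensitive is an assumption; the sketch also does not model which parameters lack a [common] fallback.

```python
TRUE_FORMS = {"yes", "true", "1", "+"}
FALSE_FORMS = {"no", "false", "0", "-"}


def parse_bool(text: str) -> bool:
    """Interpret the Boolean spellings listed for splitlog and append."""
    value = text.strip().lower()   # assumption: letter case is ignored
    if value in TRUE_FORMS:
        return True
    if value in FALSE_FORMS:
        return False
    raise ValueError(f"not a Boolean value: {text!r}")


def resolve(name, cmdline, pairhits, common, defaults):
    """Return the first available setting, highest priority first."""
    for source in (cmdline, pairhits, common, defaults):
        if name in source:
            return source[name]
    raise KeyError(name)


# Invented inputs: documented defaults plus a few overrides.
defaults = {"keysize": "16", "length": "60", "splitlog": "true"}
common = {"length": "80"}
pairhits = {"splitlog": "no"}
cmdline = {"keysize": "20"}        # e.g. the effect of -k 20

for name in ("keysize", "length", "splitlog"):
    print(name, "=", resolve(name, cmdline, pairhits, common, defaults))
```

Here keysize comes from the command line, length from the [common] section, and splitlog from the [pairhits] section.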
The list of input data in the [species] section
This section establishes a correspondence between the organism name, identifier, number and
data file names. Each line corresponds to an organism. The program ignores blank lines and lines
starting with a ; or # character. Leading spaces or tab characters in the line are skipped;
they must be followed by at least 5 fields delimited by one or more tab characters.
A field may include spaces. The fields contain the following data:
- the organism number (unique within the input data set);
- short denotation (identifier) of the organism, less than 36 characters in the current version;
- name of the organism;
- taxonomic code (ref. the diversity parameter of the FinDense program);
- file name of the full genome in FASTA or GenBank format (the whole genome must be in a single file). The path to the file is specified by the fastapath parameter in the configuration file or on the command line;
- (optional) file name of the genome annotation in GFF format; PairHits does not use this field.
A template list of input data is provided in the test examples for Windows and Linux.
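The line layout of the [species] section can be illustrated with the following Python sketch. It only mirrors the rules stated above; the names Species and parse_species_line are hypothetical, and the sample line (including its taxonomic code) is invented for the example.

```python
from typing import NamedTuple, Optional


class Species(NamedTuple):
    number: int
    identifier: str
    name: str
    taxonomic_code: str
    genome_file: str
    annotation_file: Optional[str]


def parse_species_line(line: str) -> Optional[Species]:
    """Parse one [species] line; return None for blank and comment lines."""
    text = line.lstrip(" \t").rstrip("\r\n")
    if not text or text[0] in ";#":
        return None
    # Fields are delimited by one or more tabs and may contain spaces.
    fields = [field for field in text.split("\t") if field != ""]
    if len(fields) < 5:
        raise ValueError("a [species] line needs at least 5 fields")
    annotation = fields[5] if len(fields) > 5 else None
    return Species(int(fields[0]), fields[1], fields[2], fields[3],
                   fields[4], annotation)


# Invented example line with a doubled tab between two fields.
record = parse_species_line("  1\thsa\t\tHomo sapiens\tA\thuman.fa\n")
print(record)
```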
Job files and parallel execution of PairHits
Once invoked, PairHits starts looking for candidate words in the pairs of strings
from different genomes, doing so sequentially or in parallel depending on the program variant and
parameters of the launch. The pairs of strings to be processed are specified in the set of job
files through the configuration parameters jobpath, jobext, start, stop
or the respective options in the command line.
An example set of job files is provided in the test examples for
Windows and Linux (ref. the jobs\ directory).
Each job file points to a string from the “first” genome (in the first line of the file) and one or
more strings from the “second” genome (in subsequent lines of the file). There may be multiple
second genomes in one file. Together these lines form as many pairs as there are strings specified
from the second genome(s). All lines of the job file have the same format even though different
lines can pertain to different genomes. A line must begin with at least one space or tab character,
which is replaced with an asterisk (*) when the line has been processed. Three numeric
fields follow, delimited by at least one space or tab character:
- the organism number (from the list of input data),
- serial number of a sequence in the full genome file in FASTA format, or 1 in the case of a GenBank file (these numbers will appear in results),
- zero-based offset of the sequence header (specifically the > character) from the beginning of the genome file.
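The job file layout can be sketched in Python as follows. The names JobLine, parse_job_line and pending_pairs are hypothetical, the sample lines are invented, and the sketch assumes that the first line is always read to identify the first sequence while marked lines among the remaining ones are skipped.

```python
from typing import NamedTuple


class JobLine(NamedTuple):
    done: bool        # True if the line is already marked with '*'
    organism: int     # organism number from the [species] list
    sequence: int     # serial number of the sequence in the genome file
    offset: int       # zero-based offset of the '>' header character


def parse_job_line(line: str) -> JobLine:
    """Parse one job file line: a marker character plus three numbers."""
    if not line or line[0] not in " \t*":
        raise ValueError("line must start with a space, a tab, or '*'")
    fields = line[1:].split()
    if len(fields) < 3:
        raise ValueError("expected three numeric fields")
    organism, sequence, offset = (int(field) for field in fields[:3])
    return JobLine(line[0] == "*", organism, sequence, offset)


def pending_pairs(job_lines):
    """Pair the first line's sequence with each unprocessed later line."""
    first, *others = [parse_job_line(line) for line in job_lines]
    return [(first, other) for other in others if not other.done]


# Invented job file: one first-genome string, three second-genome strings,
# one of which has already been processed.
lines = [" 1 1 0", " 2 1 0", "*2 2 5120", "\t3 1 0"]
for first, second in pending_pairs(lines):
    print(first.organism, "vs", second.organism, second.sequence)
```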
Each job file is handled by a single MPI process until its pairs are exhausted; the
process then switches to the next unprocessed job file, and so on. Processed lines are labeled
with * in the first position; such lines of the job file are skipped during
processing, which allows the program to resume an interrupted computation (for such cases, the
append mode of writing to output files is provided, see the command line options -a, -w
and the configuration parameter append). The uniprocessor variant of PairHits handles job files
in ascending order of their numbers. In the MPI variant of the program, each process starts from
the job file whose number equals the processor number (within the requested pool), and each next
job file number is obtained by adding the total number of processors requested.
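The distribution of job files among MPI processes described above amounts to a round-robin scheme, which the following Python sketch illustrates without using MPI itself. The function name jobs_for_process is hypothetical, and adding the start number to the process number is an assumption; job files that do not exist are not modeled.

```python
def jobs_for_process(rank: int, size: int, count: int, start: int = 0):
    """List the job file numbers visited by process 'rank' out of 'size'.

    'count' plays the role of the stop parameter (how many job numbers are
    tried) and 'start' the role of the start parameter.
    """
    if not 0 <= rank < size:
        raise ValueError("rank must be in the range 0..size-1")
    # First job: start + rank; each next job: add the number of processes.
    return list(range(start + rank, start + count, size))


# Ten job files shared by four processes.
for rank in range(4):
    print(rank, jobs_for_process(rank, 4, 10))
```

With ten job files and four processes, process 1 would take jobs 1, 5 and 9 in this sketch, and every job number is visited by exactly one process.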
If only a few job files exist, there is no point in using more processors than files. The total computation time is determined by that of the largest job being processed. The time is approximately linear in the string length of the first genome and in the total of the string lengths of the second genome(s) that occur in the job file. If there are few genomes and sequences, it is recommended to split long job files into parts and use more CPUs accordingly. Conversely, if there are many more job files than available CPUs, the computational load will balance itself if the job files are numbered in order of decreasing computation time. Specifically, it may help to number the genomes in descending order of total length, and the sequences of each genome in descending order of length as well.
Result files
If one or more suitable pairs of words are found while processing a job file,
PairHits creates a result file in the directory specified by the result
configuration parameter; the name of the result file includes the job file
number. A result file is a text file, all lines of which contain the same set of 11 fields
separated by a single tab character. This file can be imported directly into a spreadsheet
such as Excel (if the file size allows). These fields contain:
- “first” genome number
- sequence number in the first genome
- anchor of a word in this sequence of the first genome (the “anchor” is the sum of the start and end positions of the word in the sequence, i.e., twice the center position of the word)
- word length in the first genome
- “second” genome number
- sequence number in the second genome
- anchor of a word in this sequence of the second genome
- word length in the second genome
- indicator of the strand in the second genome (can be 1, -1 or 0)
- ten times the edit distance between the words (total penalty for mismatch)
- word compression ratio by the gzip algorithm.
Important note: In the current version of the program, positions on the complementary (negative) strand are numbered from the sequence end.
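Reading a result line and recovering word positions from the anchor can be sketched in Python as follows. The names Hit, parse_hit and span are hypothetical, the sample line is invented, and the sketch assumes that the word length equals end - start + 1, so that anchor = start + end determines both positions.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    genome1: int
    sequence1: int
    anchor1: int
    length1: int
    genome2: int
    sequence2: int
    anchor2: int
    length2: int
    strand: int
    distance: float   # edit distance (the file stores ten times this value)
    ratio: float      # gzip compression ratio of the word


def parse_hit(line: str) -> Hit:
    """Split one result line into its 11 tab-separated fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 11:
        raise ValueError("expected 11 tab-separated fields")
    ints = [int(field) for field in fields[:9]]
    return Hit(*ints, float(fields[9]) / 10, float(fields[10]))


def span(anchor: int, length: int):
    """Recover (start, end), assuming length == end - start + 1."""
    start = (anchor - length + 1) // 2
    return start, start + length - 1


# Invented result line, not actual program output.
hit = parse_hit("1\t2\t259\t60\t3\t1\t1001\t62\t-1\t105\t1.85\n")
print(span(hit.anchor1, hit.length1), span(hit.anchor2, hit.length2))
```

For a hit on the complementary strand, the positions obtained this way are counted from the sequence end, as stated in the note above.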
Examples of the result files are provided in the test examples
for Windows and Linux (ref. the myfiles\hits\ directory).
Downloadable files
See the corresponding section of the iHCE page.