Example of the algorithm application

The idea of the algorithm (“The second algorithm” below) is explained in connection with its application for the search of bacterial type promoters.

In this study we took leader regions of up to 1000 bp length from upstream all protein-coding genes (90 genes per species at average) in plastomes of all Streptophyta, Glaucocystophyceae and Bigelowiella natans available from GenBank. In de novo predictions of PEP-promoters we constructed multiple alignments of up to 1000 bp long non-coding regions from upstream these genes. The alignment was constructed to reveal the two bacterial type boxes and to cover the taxonomic diversity of the above mentioned lineages as wide as possible. In a positive prediction, the alignment of the boxes, linker and some flanking regions was required to have a good quality (see below). Otherwise, a negative prediction is produced and a PEP-promoter is not detected with our method.

Notably, the positive predictions contained experimentally proved PEP-promoters and often their TG-extensions, which indicates that these are not false positives. Also, in all negative predictions the alignment had a considerably lower quality compared to the minimal quality among all positive predictions. All predicted РЕР-promoters were located within approximately 40 bp long highly conserved regions flanked by less conserved 3′-areas and highly variable 5′-areas. Our approach does not apply to comparisons within the same species, genus or family, because high similarity of the leader regions at this taxonomic level precludes meaningful promoter predictions. Therefore, in the positive cases, our alignments contained widely sampled representatives from across several taxonomic divisions; only such alignments were considered as positives. The approach might be applicable to NEP-promoters, but those were not considered in this study. Building a multiple alignment and revealing a two-boxed structure are NP-hard problems in terms of computation. We applied two original algorithms to solve it: one to pre-select leader regions with candidate PEP-promoters (usually several candidates were found per region), and the other to build a multiple alignment keeping one of the candidate promoters in each of the regions.

The first algorithm

Given is a set of n leader regions. The goal is to find a subset of the set with one potential promoter in each region such that their total pair-wise similarity is maximal comparing to any other collection of potential promoters in that subset; the subset size is simultaneously maximized.

In order to increase search speed, randomly selected regions are set as “linked” and the promoter similarity is estimated only within linked pairs of regions. It formally means that we consider a graph with n vertices, each assigned a leader region, but only linked regions are connected by an edge in the graph. As a result, the complexity of comparing all pairs of candidate promoters to determine their total similarity is reduced in our algorithm by considering a large number of randomly defined sets of edges, i.e. randomly constructed graphs with n vertices assigned the same way by the same regions but connected by different edges. By doing so, the computing time becomes square to number n of the regions and cubic to their average length.

The algorithm is designed for effective parallelization to enable mass processing of large amounts of long regions in feasible time. The enhanced performance of the parallel implementation allows to compute a solution closer to the maximum quality of the alignment. The algorithm is highly scalable and provides for the approximately linear growth of performance with the number of available processors up to 1000. The parallel implementation of the algorithm uses MPI v.1.1. The software and user's manual are available for download from TwoBox page.

The second algorithm

Along a fixed phylogenetic species tree, the algorithm aligns leader regions with respect to one of the candidate promoters selected by the first algorithm, from the promoter start up to the start codon. It uses a common observation that promoters, as well as transcribed regions, can be well aligned, in contrast to the region upstream the promoter. The algorithm takes a non-binary (which is often the case) species tree and during the run reduces it to a binary tree in a variety (or even all) possible ways.

Each leaf of the tree bears an orthologous gene leader region from the corresponding species. The alignment is constructed as follows. First, each leaf is assigned a nucleotide frequency distribution at each position of the sequence: the distribution contains a unity for the observed nucleotide type and three zeros for the unobserved. A zero distribution contains four zeros. Then, at each inner node, two distribution sequences at its descendant nodes are aligned by any applicable algorithm, with an award for matching two distributions not pre-defined, but calculated anew at each position j taking into account the length of each descendant branch. The award is estimated as a scalar square of the difference between two nonzero distributions weighted for different nucleotide types. The penalty for inserting a gap symbol (i.e., for the alignment of zero and nonzero distributions) is a decreasing function of the number of contiguous gaps: the longer the gap region, the lower the penalty. Two zero distributions are forbidden to align.

At each position of the alignment, the distribution in the ancestral sequence is a half-sum of the two distributions in the descendants. When the root distribution sequence is constructed, the algorithm projects the gaps along the tree to its leaves onto the extant sequences, thus obtaining the final multiple alignment. The complexity is linear to the number of leaves. Different binary tree resolutions are compared on the basis of the corresponding alignments quality, which is estimated as follows: $(N_a + N_s)b + \sum_{i=1}^{N_s}(b + s)(l_i - 1) + N_b c$ where N_a is the number of totally conserved (containing the same character) columns, N_s is the number of totally conserved regions (two or more contiguous totally conserved columns, l_i is the number of columns), N_b is the number of “nearly” conserved columns (with one non-matching character); b, c and s are parameters. Computing an alignment of 16 sequences with the length of 120–223 bases requires less than one second on a 3 GHz Pentium-4 PC. The automatically computed alignments were manually checked and minor corrections were introduced if so required. The software and user's manual are available for download from TreeAl page.

Both algorithms are implemented as 32-bit command line utilities written in ANSI С, which can be compiled with many popular compilers and run under Windows or Linux.

← Back