28/11/22
12:42:37 Laboratory of Mathematic methods and models in bioinformatics,


Ðóññêèé :: English

« back
Example of the algorithm applicationIdea of the algorithm ("The first algorithm" below) is explained in connection with its application for the search of bacterial type promotersIn this study we took leader regions of up to 1000 bp length from upstream all proteincoding genes (90 genes per species at average) in plastomes of all Streptophyta, Glaucocystophyceae and Bigelowiella natans available from GenBank. In de novo predictions of PEPpromoters we constructed multiple alignments of up to 1000 bp long noncoding regions from upstream these genes. The alignment was constructed to reveal the two bacterial type boxes and to cover the taxonomic diversity of the above mentioned lineages as wide as possible. In a positive prediction, the alignment of the boxes, linker and some flanking regions was required to have a good quality (see below). Otherwise, a negative prediction is produced and a PEPpromoter is not detected with our method. Notably, the positive predictions contained experimentally proved PEPpromoters and often their TGextensions, which indicates that these are not false positives. Also, in all negative predictions the alignment had a considerably lower quality compared to the minimal quality among all positive predictions. All predicted ÐÅÐpromoters were located within approximately 40 bp long highly conserved regions flanked by less conserved 3'areas and highly variable 5'areas. Our approach does not apply to comparisons within the same species, genus or family, because high similarity of the leader regions at this taxonomic level precludes meaningful promoter predictions. Therefore, in the positive cases, our alignments contained widely sampled representatives from across several taxonomic divisions; only such alignments were considered as positives. The approach might be applicable to NEPpromoters, but those were not considered in this study. Building a multiple alignment and revealing a twoboxed structure are NPhard problems in terms of computation. We applied two original algorithms to solve it: one to preselect leader regions with candidate PEPpromoters (usually several candidates were found per region), and the other to build a multiple alignment keeping one of the candidate promoters in each of the regions. The first algorithmGiven is a set of n leader regions. The goal is to find a subset of the set with one potential promoter in each region such that their total pairwise similarity is maximal comparing to any other collection of potential promoters in that subset; the subset size is simultaneously maximized. In order to increase search speed, randomly selected regions are set as "linked" and the promoter similarity is estimated only within linked pairs of regions. It formally means that we consider a graph with n vertices, each assigned a leader region, but only linked regions are connected by an edge in the graph. As a result, the complexity of comparing all pairs of candidate promoters to determine their total similarity is reduced in our algorithm by considering a large number of randomly defined sets of edges, i.e. randomly constructed graphs with n vertices assigned the same way by the same regions but connected by different edges. By doing so, the computing time becomes square to number n of the regions and cubic to their average length. The algorithm is designed for effective parallelization to enable mass processing of large amounts of long regions in feasible time. The enhanced performance of the parallel implementation allows to compute a solution closer to the maximum quality of the alignment. The algorithm is highly scalable and provides for the approximately linear growth of performance with the number of available processors up to 1000. The parallel implementation of the algorithm uses MPI v.1.1. The software and user's manual are available for download from TwoBox page. The second algorithmAlong a fixed phylogenetic species tree, the algorithm aligns leader regions with respect to one of the candidate promoters selected by the first algorithm, from the promoter start up to the start codon. It uses a common observation that promoters, as well as transcribed regions, can be well aligned, in contrast to the region upstream the promoter. The algorithm takes a nonbinary (which is often the case) species tree and during the run reduces it to a binary tree in a variety (or even all) possible ways. Each leaf of the tree bears an orthologous gene leader region from the corresponding species. The alignment is constructed as follows. First, each leaf is assigned a nucleotide frequency distribution at each position of the sequence: the distribution contains a unity for the observed nucleotide type and three zeros for the unobserved. A zero distribution contains four zeros. Then, at each inner node, two distribution sequences at its descendant nodes are aligned by any applicable algorithm, with an award for matching two distributions not predefined, but calculated anew at each position j taking into account the length of each descendant branch. The award is estimated as a scalar square of the difference between two nonzero distributions weighted for different nucleotide types. The penalty for inserting a gap symbol (i.e., for the alignment of zero and nonzero distributions) is a decreasing function of the number of contiguous gaps: the longer the gap region, the lower the penalty. Two zero distributions are forbidden to align. At each position of the alignment, the distribution in the ancestral sequence is a halfsum of the two distributions in the descendants. When the root distribution sequence is constructed, the algorithm projects the gaps along the tree to its leaves onto the extant sequences, thus obtaining the final multiple alignment. The complexity is linear to the number of leaves. Different binary tree resolutions are compared on the basis of the corresponding alignments quality, which is estimated as follows: Both algorithms are implemented as 32bit command line utilities written in ANSI Ñ, which can be compiled with many popular compilers and run under Windows or Linux. 