Finding a Multi-Box Regulatory Signal in a Set of Unaligned Sequences

In the Program description we present an original fast algorithm for local multiple alignment of sequences. Such alignments usually yield several sites in each source sequence, and these sites correspond to different local alignments of approximately the same quality. The Application example and the TwoBox distribution provide several examples of local alignments found by our algorithm. The optimal local alignment is usually sought by computing a quality score equal to the sum of pairwise similarities over all constituent sites. Our algorithm instead computes this sum only for some of the constituent sites, which are selected by a special random-choice procedure.
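The quality score described above can be illustrated with a minimal sketch. It assumes that the similarity of two sites is the number of positions at which they carry the same letter, and that the random subset is drawn uniformly; neither assumption comes from the TwoBox documentation, and all function names are hypothetical.

```python
import random
from itertools import combinations


def similarity(a, b):
    """Count positions where two equal-length sites carry the same letter
    (an assumed similarity measure, not necessarily the one TwoBox uses)."""
    return sum(x == y for x, y in zip(a, b))


def full_quality(sites):
    """Sum of pairwise similarities over all pairs of sites."""
    return sum(similarity(a, b) for a, b in combinations(sites, 2))


def sampled_quality(sites, k, rng):
    """Same sum, restricted to a random subset of k sites.
    Uniform sampling is an assumption; the actual selection procedure
    is described only as 'special' in the text above."""
    subset = rng.sample(sites, k)
    return full_quality(subset)


# One candidate site per sequence (made-up example data).
sites = ["TTGACA", "TTGACT", "TTTACA", "CTGACA"]
print(full_quality(sites))
print(sampled_quality(sites, 3, random.Random(0)))
```

The full sum needs a number of comparisons that grows quadratically with the number of sites, so restricting it to a subset is the obvious way to make each evaluation cheaper.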


General information

The program TwoBox V3.17 is designed to find a system of similar sites in a given set of sequences. One site (or none) is chosen from each sequence, and the chosen sites should be as similar to each other as possible. The program primarily tries to find sites in all sequences, but it may reject a few sequences if the resulting system is better in terms of the functionals used.

The sought-for sites can consist of a single box or of multiple boxes separated by linkers whose length is either fixed or varies within a given interval. The lengths of the boxes and the linker intervals may be specified independently of each other. The current version 3.17 supports only sites consisting of one or two boxes, though the same algorithm can be generalized to a greater number of boxes per site.
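To make the site structure concrete, the following sketch enumerates every two-box candidate site in one sequence for given box lengths and a linker length interval. The function name, its parameters, and the example sequence are illustrative assumptions; this is not the search procedure TwoBox itself uses.

```python
def two_box_sites(seq, len1, len2, min_gap, max_gap):
    """Yield (start, gap, box1, box2) for every two-box site that fits in seq.

    len1, len2       -- lengths of the first and second box
    min_gap, max_gap -- inclusive bounds on the linker length
    """
    for gap in range(min_gap, max_gap + 1):
        span = len1 + gap + len2  # total length covered by the site
        for start in range(len(seq) - span + 1):
            box1 = seq[start:start + len1]
            second = start + len1 + gap  # where the second box begins
            box2 = seq[second:second + len2]
            yield start, gap, box1, box2


# Boxes of length 3 and 2, linker of length 1 or 2 (made-up example).
candidates = list(two_box_sites("ACGTACGTAC", 3, 2, 1, 2))
for start, gap, box1, box2 in candidates:
    print(start, gap, box1, box2)
```

A fixed linker length is the special case min_gap == max_gap, and a one-box site corresponds to dropping the linker and the second box altogether.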

In addition, the program allows the user to search for a signal when conserved positions within a box are already known. Such data may be supplied in the form of a “motif” for a box (or boxes). Detailed information about the input data and the program arguments can be found in the program documentation.

The TwoBox program implements an elaboration of a previously developed algorithm for finding a one-box regulatory signal [1–3] by global optimization of a predefined functional of signal quality. The result is a quasi-optimal solution: the one with the greatest value of the functional reached during the search, which is limited by internal criteria and/or by the duration and/or number of algorithm steps.

Because of the computational complexity of the algorithm (which grows as the number of boxes increases), the TwoBox program was intended from the very beginning for a parallel cluster with inter-processor communication via MPI. At least two logical processors are required, so the program can run on a typical dual-core PC. Beyond that, any number of processors will do; the program can occupy all available CPUs, and the calculation time will decrease approximately s−1 times, where s is the number of CPUs.
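As a rough worked example of the scaling statement above, the sketch below estimates the run time on s CPUs from the time measured on the minimal two-CPU configuration. Taking the two-CPU run as the baseline is an assumption made here so that the factor s−1 equals 1 at s = 2; the function name is hypothetical and the figures are illustrative, not measured.

```python
def estimated_time(baseline_time, s):
    """Estimate run time on s CPUs, given the time on the minimal
    two-CPU configuration, using the approximate s-1 speedup factor."""
    if s < 2:
        raise ValueError("at least two logical processors are required")
    return baseline_time / (s - 1)


# Illustrative figures only: a run taking 120 minutes on two CPUs.
for s in (2, 3, 5, 9):
    print(s, estimated_time(120.0, s))
```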

The program is provided as an executable for the x86 architecture. It is intended for a cluster consisting of several PCs running Windows, interconnected by a TCP/IP LAN. The MPI environment is to be set up using the public domain software MPICH2 v.1.2 (by Argonne National Laboratory). This product (or a later version of it) has to be installed on all computers of the cluster. (If installing MPICH2 on a computer is undesirable, the user can copy the MPICH2 libraries into the program folder. However, we cannot guarantee the operation and functionality of the program in that case.) See the program documentation for additional details.

TwoBox uses a command-line interface; it has to be run from a command shell of the operating system. The programming language is C, the compiler is Microsoft Visual Studio 2005 Service Pack 1, and the executable file name is twobox.exe. The target CPU is Intel 32-bit. The operating systems tested were Microsoft Windows XP Service Pack 3 and Microsoft Windows Server 2003 Service Pack 2. Using TwoBox on other processors and/or operating systems is possible, but may require recompiling the program or carrying out additional testing.

Software developer: Dr. L.I. Rubanov, leading scientist, IITP RAS (Kharkevich Institute)
E-mail: rubanov@iitp.ru

References

  1. L.V. Danilova, K.Yu. Gorbunov, M.S. Gelfand, V.A. Lyubetskii. Algorithm of regulatory signal recognition in DNA sequences. Molecular Biology, 2001, Vol. 35, No. 6, P. 841–848. DOI: 10.1023/A:1013282101105
  2. S.N. Istomina, L.I. Rubanov. Parallel algorithm of regulatory signal search in bacterial genomes. Information Processes, 2002, Vol. 2, No. 1, P. 85–90 (in Russian).
  3. L.V. Danilova, V.A. Lyubetsky, M.S. Gelfand. An algorithm for identification of regulatory signals in unaligned DNA sequences, its testing and parallel implementation. In Silico Biology, 2003, Vol. 3, No. 1,2, P. 33–47.