![]() |
14/03/25
19:06:22 Laboratory of Mathematic methods and models in bioinformatics,
|
Русский :: English
|
Aligning a Hidden PalindromeAlgorithms have been developed to optimize the edit distance between sequences, where the first sequence is known and the second is computed as a hidden palindrome of arbitrary length. Importantly, the length of the desired palindrome is determined as a result of the optimization. In another variant of the problem, such a palindrome can be partial, where only the prefix is complementary to the suffix. Such a partial palindrome forms a hairpin. The new algorithms run in quadratic time, which is faster than a full enumeration of possible palindromes. Below is an example of their software implementation in Python. For two sequences \(x\) and \(y\), the program finds the optimal length of the prefix \(w\) of sequence \(y\) such that the minimum edit distance between \(x\) and the palindrome \(w\mathtt{c}(w)\) in task 1 or the partial palindrome \(y\mathtt{c}(w)\) in task 2 is achieved, where \(\mathtt{c}(w)\) denotes the reverse complement of \(w\). The function palindrome_alignment takes as input strings x and y, as well as a boolean value loop, which allows choosing one of the two variants: task 1 when loop=False (default), task 2 when loop=True. The function returns the minimum edit distance alignment and a list of prefix \(w\) lengths for which it is achieved.
Usage example
>>> palindrome_alignment("ACGT", "ACC")
(0, [2])
>>> palindrome_alignment("ACTG", "ACC", loop=True)
(1, [1, 2])
Implementation: palial.pyShow/hide the codefrom typing import List, Sequence, Tuple
import numpy as np
def palindrome_alignment(x: str, y: str, loop: bool = False) -> Tuple[int, List[int]]:
"""For a given pair of nucleotide sequences `x`, `y`, find the optimal
alignment between x and ww' (or yw' if `loop` is `True`) where w is
some prefix of y and w' is the reverse complement of w.
Return the edit distance and the list of optimal w lengths.
>>> palindrome_alignment("ACGT", "ACC")
(0, [2])
>>> palindrome_alignment("ACTG", "ACC", loop=True)
(1, [1, 2])
"""
x = preprocess_nucleotide_sequence(x)
y = preprocess_nucleotide_sequence(y)
forward = prefix_edit_distances(x, y)
reverse = prefix_edit_distances(reverse_complement(x), y)
if loop:
forward = forward[:, -1:]
total = forward + reverse[::-1]
minima = total.min(axis=0)
edit_distance = minima.min()
stem_lengths = np.where(minima == edit_distance)[0].tolist()
return edit_distance, stem_lengths
def preprocess_nucleotide_sequence(s: str) -> str:
"""Preprocess a nucleotide sequence for the Needleman-Wunsch algorithm."""
s = s.upper().replace("U", "T").replace("-", "")
if set(s) - set("ATCG"):
raise ValueError(f"invalid nucleotide sequence: {s}")
return s
def reverse_complement(s: str) -> str:
"""Return the reverse complement of a nucleotide sequence."""
return s[::-1].translate(str.maketrans("ATCGatcg", "TAGCtagc"))
def prefix_edit_distances(s0: str, s1: str) -> np.ndarray:
"""Return the matrix of edit distances between all prefixes of `s0`
and all prefixes of `s1` indexed by the lengths of the prefixes."""
n0, n1 = len(s0), len(s1)
distance = np.zeros((n0 + 1, n1 + 1), np.int32)
distance[:, 0] = np.arange(n0 + 1)
distance[0, :] = np.arange(n1 + 1)
for i in range(1, n0 + 1):
for j in range(1, n1 + 1):
distance[i, j] = min(
distance[i, j - 1] + 1,
distance[i - 1, j] + 1,
distance[i - 1, j - 1] + (s0[i - 1] != s1[j - 1]),
)
return distance
|