![]() |
14/03/25
13:49:53 Laboratory of Mathematic methods and models in bioinformatics,
|
Русский :: English
|
Aligning a Hidden PalindromeAlgorithms have been developed to optimize the edit distance between sequences, where the first sequence is known and the second is computed as a hidden palindrome of arbitrary length. Importantly, the length of the desired palindrome is determined as a result of the optimization. In another variant of the problem, such a palindrome can be partial, where only the prefix is complementary to the suffix. Such a partial palindrome forms a hairpin. The new algorithms run in quadratic time, which is faster than a full enumeration of possible palindromes. Below is an example of their software implementation in Python. For two sequences \(x\) and \(y\), the program finds the optimal length of the prefix \(w\) of sequence \(y\) such that the minimum edit distance between \(x\) and the palindrome \(w\mathtt{c}(w)\) in task 1 or the partial palindrome \(y\mathtt{c}(w)\) in task 2 is achieved, where \(\mathtt{c}(w)\) denotes the reverse complement of \(w\).
The function The new version implements the \(\mathrm{imp}(x)\) function equal to the ratio of the edit distance from task 1 with \(x=y\) to the length of the sequence: $$\mathrm{imp}(x)=\frac{\min\{\mathrm{dist}(x,w\mathrm{c}(w))|x=wz\}}{|x|}.$$ The ratio shows how imperfect the palindrome is. Usage example
>>> palindrome_alignment("ACGT", "ACC")
(0, [2])
>>> palindrome_alignment("ACTG", "ACC", loop=True)
(1, [1, 2])
>>> imp("ACTG")
0.5
Implementation: palial.pyShow/hide the codefrom typing import List, Tuple
import numpy as np
def dist(x: str, y: str) -> int:
return palindrome_alignment(x, y)[0]
def imp(x: str) -> float:
return palindrome_self_alignment(x)[0] / len(x)
def palindrome_alignment(x: str, y: str, loop: bool = False) -> Tuple[int, List[int]]:
"""For a given pair of nucleotide sequences `x`, `y`, find the optimal
alignment between x and ww' (or yw' if `loop` is `True`) where w is
some prefix of y and w' is the reverse complement of w.
Return the edit distance and the list of optimal w lengths.
>>> palindrome_alignment("ACGT", "ACC")
(0, [2])
>>> palindrome_alignment("ACTG", "ACC", loop=True)
(1, [1, 2])
"""
x = preprocess_nucleotide_sequence(x)
y = preprocess_nucleotide_sequence(y)
forward = prefix_edit_distances(x, y)
reverse = prefix_edit_distances(reverse_complement(x), y)
if loop:
forward = forward[:, -1:]
total = forward + reverse[::-1]
minima = total.min(axis=0)
edit_distance = minima.min()
prefix_lengths = np.where(minima == edit_distance)[0].tolist()
return int(edit_distance), prefix_lengths
def palindrome_self_alignment(x: str) -> Tuple[int, List[int]]:
"""A special case of `palindrome_alignment` for self-alignment.
For a given nucleotide sequence `x`, find the optimal alignment
between x and ww' where w is some prefix of x and w' is the
reverse complement of w.
Return the edit distance and the list of optimal w lengths.
>>> palindrome_self_alignment("ACGC")
(1, [2])
"""
x = preprocess_nucleotide_sequence(x)
forward = np.abs(np.arange(len(x) + 1) - np.arange(len(x) + 1)[:, None])
reverse = prefix_edit_distances(reverse_complement(x), x)
total = forward + reverse[::-1]
minima = total.min(axis=0)
edit_distance = minima.min()
prefix_lengths = np.where(minima == edit_distance)[0].tolist()
return int(edit_distance), prefix_lengths
def preprocess_nucleotide_sequence(s: str) -> str:
"""Preprocess a nucleotide sequence for the Needleman-Wunsch algorithm."""
s = s.upper().replace("U", "T").replace("-", "")
if set(s) - set("ATCG"):
raise ValueError(f"invalid nucleotide sequence: {s}")
return s
def reverse_complement(s: str) -> str:
"""Return the reverse complement of a nucleotide sequence."""
return s[::-1].translate(str.maketrans("ATCGatcg", "TAGCtagc"))
def prefix_edit_distances(s0: str, s1: str) -> np.ndarray:
"""Return the matrix of edit distances between all prefixes of `s0`
and all prefixes of `s1` indexed by the lengths of the prefixes."""
n0, n1 = len(s0), len(s1)
distance = np.zeros((n0 + 1, n1 + 1), np.int32)
distance[:, 0] = np.arange(n0 + 1)
distance[0, :] = np.arange(n1 + 1)
for i in range(1, n0 + 1):
for j in range(1, n1 + 1):
distance[i, j] = min(
distance[i, j - 1] + 1,
distance[i - 1, j] + 1,
distance[i - 1, j - 1] + (s0[i - 1] != s1[j - 1]),
)
return distance
The page was updated on 16.12.2024 (previous version). |