14/03/25
19:06:22

Laboratory of Mathematic methods and models in bioinformatics,
Institute for Information Transmission Problems,
Russian Academy of Sciences

Aligning a Hidden Palindrome

Algorithms have been developed to optimize the edit distance between sequences, where the first sequence is known and the second is computed as a hidden palindrome of arbitrary length. Importantly, the length of the desired palindrome is determined as a result of the optimization. In another variant of the problem, such a palindrome can be partial, where only the prefix is complementary to the suffix. Such a partial palindrome forms a hairpin. The new algorithms run in quadratic time, which is faster than a full enumeration of possible palindromes. Below is an example of their software implementation in Python.

For two sequences \(x\) and \(y\), the program finds the optimal length of the prefix \(w\) of sequence \(y\) such that the minimum edit distance between \(x\) and the palindrome \(w\mathtt{c}(w)\) in task 1 or the partial palindrome \(y\mathtt{c}(w)\) in task 2 is achieved, where \(\mathtt{c}(w)\) denotes the reverse complement of \(w\). The function palindrome_alignment takes as input strings x and y, as well as a boolean value loop, which allows choosing one of the two variants: task 1 when loop=False (default), task 2 when loop=True. The function returns the minimum edit distance alignment and a list of prefix \(w\) lengths for which it is achieved.

Usage example
    >>> palindrome_alignment("ACGT", "ACC")
    (0, [2])
    >>> palindrome_alignment("ACTG", "ACC", loop=True)
    (1, [1, 2])
Implementation: palial.py
Show/hide the code
from typing import List, Sequence, Tuple

import numpy as np


def palindrome_alignment(x: str, y: str, loop: bool = False) -> Tuple[int, List[int]]:
    """For a given pair of nucleotide sequences `x`, `y`, find the optimal
    alignment between x and ww' (or yw' if `loop` is `True`) where w is
    some prefix of y and w' is the reverse complement of w.
    Return the edit distance and the list of optimal w lengths.
    >>> palindrome_alignment("ACGT", "ACC")
    (0, [2])
    >>> palindrome_alignment("ACTG", "ACC", loop=True)
    (1, [1, 2])
    """
    x = preprocess_nucleotide_sequence(x)
    y = preprocess_nucleotide_sequence(y)
    forward = prefix_edit_distances(x, y)
    reverse = prefix_edit_distances(reverse_complement(x), y)
    if loop:
        forward = forward[:, -1:]
    total = forward + reverse[::-1]
    minima = total.min(axis=0)
    edit_distance = minima.min()
    stem_lengths = np.where(minima == edit_distance)[0].tolist()
    return edit_distance, stem_lengths


def preprocess_nucleotide_sequence(s: str) -> str:
    """Preprocess a nucleotide sequence for the Needleman-Wunsch algorithm."""
    s = s.upper().replace("U", "T").replace("-", "")
    if set(s) - set("ATCG"):
        raise ValueError(f"invalid nucleotide sequence: {s}")
    return s


def reverse_complement(s: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return s[::-1].translate(str.maketrans("ATCGatcg", "TAGCtagc"))


def prefix_edit_distances(s0: str, s1: str) -> np.ndarray:
    """Return the matrix of edit distances between all prefixes of `s0`
    and all prefixes of `s1` indexed by the lengths of the prefixes."""
    n0, n1 = len(s0), len(s1)
    distance = np.zeros((n0 + 1, n1 + 1), np.int32)
    distance[:, 0] = np.arange(n0 + 1)
    distance[0, :] = np.arange(n1 + 1)
    for i in range(1, n0 + 1):
        for j in range(1, n1 + 1):
            distance[i, j] = min(
                distance[i, j - 1] + 1,
                distance[i - 1, j] + 1,
                distance[i - 1, j - 1] + (s0[i - 1] != s1[j - 1]),
            )
    return distance