Trimming

Algorithms have been developed to optimize \(\mathrm{imp}(x)\) function: $$\mathrm{imp}(x)=\frac{\min\{\mathrm{dist}(x,w\mathrm{c}(w))|x=wz\}}{|x|}$$ on a certain substrings of string \(x\). Substring is a contiguous sequence of characters within a string. The main idea behind those algorithms is checking, whether one of the optimal lengths of the prefix \(w\) of sequence \(x\) differs significantly from halfed length of \(x\). If such case occurs, program trims \(x\) by difference between lengths of \(w\) and half of \(x\). The new algorithms complexity is linear due to linear complexity of string trimming operation. Also, new recursive trimming methods have been developed. Their main idea is computing several consecutive trims of \(x\) for finding long subword of \(x\) with small value of \(\mathrm{imp}(x)\). Below is an example of their software implementation in Python using Numba library and program to compute \(\mathrm{imp}(x)\) developed in our laboratory.

Functions \(pref\_trimmer\), \(suff\_trimmer\) and \(double\_trimmer\) take as input string \(x\), sorted list \(optimal\_lengths\) of optimal lengths of the prefix \(w\) and float \(cutoff\_condition\). Note that \(\mathrm{min}(optimal\_lengths)\) and \(\mathrm{max}(optimal\_lengths)\) are first and last elements of \(optimal\_lengths\) respectively.

The function \(pref\_trimmer\) trims first \(rd\) symbols of \(x\), where $$rd = \mathrm{max}(optimal\_lengths)-\left\lfloor\frac{|x|}{2}\right\rfloor$$ if $$rd \ge |x|\cdot cutoff\_condition.$$

The function \(suff\_trimmer\) trims last \(ld\) symbols of \(x\), where $$ld = \left\lfloor\frac{|x|}{2}\right\rfloor - \mathrm{min}(optimal\_lengths)$$ if $$ld \ge |x|\cdot cutoff\_condition.$$

The function \(double\_trimmer\) initially computes $$rd = \mathrm{max}(optimal\_lengths)-\left\lfloor\frac{|x|}{2}\right\rfloor$$ and $$ld = \left\lfloor\frac{|x|}{2}\right\rfloor - \mathrm{min}(optimal\_lengths).$$ Subsequently it checks if $$rd \ge |x|\cdot cutoff\_condition$$ and $$ld \ge |x|\cdot cutoff\_condition.$$ If the first inequality is satisfied, the function trims the first \(rd\) symbols from the string \(x\). Similarly, if the second inequality is satisfied, the function trims the last \(ld\) symbols from \(x\).

Функции \(pref\_GRT\), \(suff\_GRT\) и \(double\_GRT\) получают на вход строку \(x\), числа с плавающей запятой \(cutoff\_value\), \(imp\_given\) и целое число \(recursion\_depth\).

The function \(pref\_GRT\) trims first \(cutoff\_value\cdot|x|\) symbols of \(x\), if it is possible, and writes the result in \(x'\). After that it computes \(imp(x')\). If it's value is decreased compared to \(imp\_given\), then function returns \(pref\_GRT\) from parameters \(x'\), \(imp(x')\),\(cutoff\_value\) and \(recursion\_depth-1\). Else, function returns \(pref\_GRT\) from parameters \(x\), \(imp\_given\), \(0.5\cdot cutoff\_value\) and \(recursion\_depth-1\). Terminal condition is $$recursion\_depth=0.$$ Then no trimming is performed, and the function returns \(x\) and \(imp\_given\).

The function \(suff\_GRT\) trims last \(cutoff\_value\cdot|x|\) symbols of \(x\), if it is possible, and writes the result in \(x'\). After that it computes \(imp(x')\). If it's value is decreased compared to \(imp\_given\), then function returns \(suff\_GRT\) from parameters \(x'\), \(imp(x')\),\(cutoff\_value\) and \(recursion\_depth-1\). Else, function returns \(suff\_GRT\) from parameters \(x\), \(imp\_given\), \(0.5\cdot cutoff\_value\) and \(recursion\_depth-1\). Terminal condition is $$recursion\_depth=0.$$ Then no trimming is performed, and the function returns \(x\) and \(imp\_given\).

The function \(double\_GRT\) trims first \(cutoff\_value\cdot|x|\) symbols of \(x\), if it is possible, and writes the result in \(x'\). After that it computes \(imp(x')\). If it's value is decreased compared to \(imp\_given\), then function returns \(pref\_GRT\) from parameters \(x'\), \(imp(x')\),\(cutoff\_value\) and \(recursion\_depth-1\). Else, function returns \(double\_GRT\) from parameters \(x\), \(imp\_given\), \(0.5\cdot cutoff\_value\) and \(recursion\_depth-1\). Terminal condition is $$recursion\_depth=0.$$ Then no trimming is performed, and the function returns \(x\) and \(imp\_given\).

Usage example

>>> pref_trimmer("AUGCUGACAUAUUUACUAGAGGGUAAAAUUAAUAACCUUCUAGUAAGAGUGGCAGUCG", [28, 30], cutoff_condition = 0.01)
"UGCUGACAUAUUUACUAGAGGGUAAAAUUAAUAACCUUCUAGUAAGAGUGGCAGUC"
>>> suff_trimmer("CUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAU", [28, 29, 30], cutoff_condition = 0.01)
"CUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAU"
>>> double_trimmer("UGCCAGGCCCGGGAGUGCAGGGAAGCUCAGGGCCUCCUCUCUUGUCUCCUGGCAGG", [25, 26, 27, 28, 29], cutoff_condition = 0.01)
"GCCAGGCCCGGGAGUGCAGGGAAGCUCAGGGCCUCCUCUCUUGUCUCCUGGC"
>>> pref_GRT("AAAAAAAAAACUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAU", cutoff_value=0.05, recursion_depth=3, imp_given=0.27536)
('ACUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAU', 0.18333333333333332)
>>> suff_GRT("CUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAUAAAAAAAAAA", cutoff_value=0.05, recursion_depth=3, imp_given=0.28985)
('CUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAUA', 0.21666666666666667)
>>> double_GRT("AAAAAAACUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAUAAAAAAAAAA", cutoff_value=0.05, recursion_depth=3, imp_given=0.276315)
('ACUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAUAAAA', 0.234375)

Implementation: trimmers.py

Show/hide the code
from palial import imp
import numpy as np
from numba import njit


@njit
def endTrimmer(
    seq: str, optimal_lengths: np.array, cutoff_condition: float = 0.01
) -> str:
    """computes possible right-side-trimming of "seq"
    for a given string "seq" and array "optimal_lengths" of optimal w lengths where w is
    some prefix of seq such that ww' is Levenshtein-closest palindrome to seq(w' is the reverse complement of w).
    Trimming is performed if minimal length of w is smaller than half of the length of seq for more than "cutoff_condition" percents of len(seq)
    >>> endTrimmer("CUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAU", [28, 29, 30], cutoff_condition = 0.01)
    "CUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAU"
    """
    left_dist = int(np.floor(len(seq) / 2) - optimal_lengths[0])
    if (left_dist) > cutoff_condition * len(seq):
        return seq[: len(seq) - left_dist]
    return seq


@njit
def begTrimmer(
    seq: str, optimal_lengths: np.array, cutoff_condition: float = 0.01
) -> str:
    """computes possible left-side-trimming  of "seq"
    for a given string "seq" and array "optimal_lengths" of optimal w lengths where w is
    some prefix of seq such that ww' is Levenshtein-closest palindrome to seq(w' is the reverse complement of w).
    Trimming is performed if maximal length of w is smaller than half of the length of seq for more than "cutoff_condition" percents of len(seq)
    >>> begTrimmer("AUGCUGACAUAUUUACUAGAGGGUAAAAUUAAUAACCUUCUAGUAAGAGUGGCAGUCG", [28, 30], cutoff_condition = 0.01)
    "UGCUGACAUAUUUACUAGAGGGUAAAAUUAAUAACCUUCUAGUAAGAGUGGCAGUC"
    """
    right_dist = int(optimal_lengths[-1] - np.floor(len(seq) / 2))
    if (right_dist) > cutoff_condition * len(seq):
        return seq[right_dist:]
    return seq


@njit
def doubleTrimmer(
    seq: str, optimal_lengths: np.array, cutoff_condition: float = 0.01
) -> str:
    """computes possible both-sides-trimming of "seq"
    for a given string "seq" and array "optimal_lengths" of optimal w lengths where w is
    some prefix of seq such that ww' is Levenshtein-closest palindrome to seq(w' is the reverse complement of w).
    left-side-trimming is performed if maximal length of w is smaller than half of the length of seq for more than "cutoff_condition" percents of len(seq)
    right-side-trimming is performed if minimal length of w is smaller than half of the length of seq for more than "cutoff_condition" percents of len(seq)
    >>> doubleTrimmer("UGCCAGGCCCGGGAGUGCAGGGAAGCUCAGGGCCUCCUCUCUUGUCUCCUGGCAGG", [25, 26, 27, 28, 29], cutoff_condition = 0.01)
    "GCCAGGCCCGGGAGUGCAGGGAAGCUCAGGGCCUCCUCUCUUGUCUCCUGGC"
    """
    left_dist = int(np.floor(len(seq) / 2) - optimal_lengths[0])
    right_dist = int(optimal_lengths[-1] - np.floor(len(seq) / 2))
    new_seq = seq
    if (left_dist) > cutoff_condition * len(seq):
        new_seq = seq[: len(seq) - left_dist]
    if (right_dist) > cutoff_condition * len(seq):
        new_seq = new_seq[right_dist:]
    return new_seq


def pref_GRT(
    seq: str, cutoff_value: float = 0.05, recursion_depth: int = 3, imp_given=0.5
) -> Tuple[str, float]:
    """computes possible right-side-trimming of "seq" using recursion.
    If "recursion_depth"==0, then function returns it's input parameters "seq" and "imp_given"
    From a given "seq" it trims first "cutoff_value"*len("seq") symbols and computes
    imp() function for it. If imp value is lower than "imp_given", then pref_GRT returns itself
    with trimmed sequence, newely computed imp and decremented "recursion_depth" value as the parameters.
    If imp value is higher than "imp_given", then function returns itself with same "seq" and "imp_given"
    parameters but decremented "recursion_depth" and halfed "cutoff_value" ones.

    >>> pref_GRT("AAAAAAAAAACUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAU", cutoff_value=0.05, recursion_depth=3, imp_given=0.27536)
    ('ACUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAU', 0.18333333333333332)
    """
    if recursion_depth == 0:
        return seq, imp_given
    if len(seq) > int(
        np.floor(len(seq) * cutoff_value)
    ):  
        new_seq = seq[int(np.floor(len(seq) * cutoff_value)) :]
        new_imp = imp(new_seq)
    else:
        new_imp = imp_given

    if new_imp < imp_given:
        return pref_GRT(
            new_seq, cutoff_value, recursion_depth - 1, new_imp
        )
    else:
        return pref_GRT(
            seq, cutoff_value / 2, recursion_depth - 1, imp_given
        )


def suff_GRT(
    seq: str, cutoff_value: float = 0.05, recursion_depth: int = 3, imp_given=0.5
) -> Tuple[str, float]:
    """computes possible left-side-trimming of "seq" using recursion.
    If "recursion_depth"==0, then function returns it's input parameters "seq" and "imp_given"
    From a given "seq" it trims last "cutoff_value"*len("seq") symbols and computes
    imp() function for it. If imp value is lower than "imp_given", then suff_GRT returns itself
    with trimmed sequence, newely computed imp and decremented "recursion_depth" value as the parameters.
    If imp value is higher than "imp_given", then function returns itself with same "seq" and "imp_given"
    parameters but decremented "recursion_depth" and halfed "cutoff_value" ones.

    >>> suff_GRT("CUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAUAAAAAAAAAA", cutoff_value=0.05, recursion_depth=3, imp_given=0.28985)
    ('CUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAUA', 0.21666666666666667)
    """
    if recursion_depth == 0:
        return seq, imp_given
    if len(seq) > int(
        np.floor(len(seq) * cutoff_value)
    ):
        new_seq = seq[: len(seq) - int(np.floor(len(seq) * cutoff_value))]
        new_imp = imp(new_seq)
    else:
        new_imp = imp_given

    if new_imp < imp_given:
        return suff_GRT(
            new_seq, cutoff_value, recursion_depth - 1, new_imp
        )
    else:
        return suff_GRT(
            seq, cutoff_value / 2, recursion_depth - 1, imp_given
        )


def double_GRT(
    seq: str, cutoff_value: float = 0.05, recursion_depth: int = 3, imp_given=0.5
) -> Tuple[str, float]:
    """computes possible double-side-trimming of "seq" using recursion.
    If "recursion_depth"==0, then function returns it's input parameters "seq" and "imp_given"
    From a given "seq" it trims first and last "cutoff_value"*len("seq") symbols and computes
    imp() function for it. If imp value is lower than "imp_given", then double_GRT returns itself
    with trimmed sequence, newely computed imp and decremented "recursion_depth" value as the parameters.
    If imp value is higher than "imp_given", then function returns itself with same "seq" and "imp_given"
    parameters but decremented "recursion_depth" and halfed "cutoff_value" ones.

    >>> double_GRT("AAAAAAACUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAUAAAAAAAAAA", cutoff_value=0.05, recursion_depth=3, imp_given=0.276315)
    ('ACUACUUUGCCCCAUUGGGCCUUGUGGUAUAAUUCCAAGGCCCAGCUCGGCAGAGUACAUAAAA', 0.234375)
    """
    if recursion_depth == 0:
        return seq, imp_given
    if len(seq) > 2 * int(
        np.floor(len(seq) * cutoff_value)
    ):
        new_seq = seq[
            int(np.floor(len(seq) * cutoff_value)) : len(seq)
            - int(np.floor(len(seq) * cutoff_value))
        ]
        new_imp = imp(new_seq)
    else:
        new_imp = imp_given

    if new_imp < imp_given:
        return double_GRT(
            new_seq, cutoff_value, recursion_depth - 1, new_imp
        )
    else:
        return double_GRT(
            seq, cutoff_value / 2, recursion_depth - 1, imp_given
        )

Publications

G.A. Khaziev, A.V. Seliverstov, O.A. Zverkov. Searching for an imperfect palindrome. Computer algebra: 6th International Conference Materials, Moscow, Russia, June 23–25 2025, Ed. A. A. Ryabenko, D. S. Kulyabov, Moscow: RUDN University, 2025, P. 62–65. DOI: 10.22363/12585-2025-6-000