How to find palindromes next to specific letters in a sequence?

I have a script that returns the palindromic substrings of a DNA sequence.

sequence="GATCTCTATACCAACTCAAAATGAAGACTCTTCTTTACACTTTCGAGCTCAGCAGGCTTACCGAGAAGAGTCGTCGTTCACATCCCCCCCTGTGCGAGATCAAGAAATTTGGCGACGTCGGCTTATTATCCTCCGCTGTCAATCAGTTGGACACATCTCTCCGGTCACTGCCGGACAAGCCAACCGAAGATTCGATTCTTCAGCAGCTTATCGACATTGCTGGTGGTGAAAAGCCAAGGCACAGCATCATAGTTGCGACCAATACGTCATACGACCGAGAGACATTGGTAAAGATCCTTCAACGATTCCCATACACCATACCTGGTCTGTCAGATTCAGGCTTGGAATCAGAAACACTCGAGGCTCTTGAGCACATCGCTTTTGCATTAGCCGGGCGATTAGCTCATAGATTTGACTACGGGTTCAATCCAGAGGCCAGTATCGTTCAACACCTCGAGATGTTCACCACCCTTTGGCACCAAAGATCTGCATTACCACCTGCGCCTGCCCCGTATCGACTTCCCGTTCCCGTCAATCAAGGAAGAGTCTCCTCATCAGATGATGGCTCTGATACTGAGTCAGAACTGGATGAAAAATACCACAACATCAAGAAGTCAGGACTTTGGAGGTTTCTGGATATGTTCAAAATGAACTTCAAGAGGTCTTAGATAACGGTCTAGTTCTAGTTCTGCAACTCACACTGA"
print(len(sequence))
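# Watson-Crick complement of each base
pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Slide a 6-letter window over the sequence
for i in range(len(sequence) - 6 + 1):
    pal = True
    # Compares the two outermost base pairs, (i, i+5) and (i+1, i+4);
    # the middle pair (i+2, i+3) is not checked
    for j in range(2):
        if pairs[sequence[i+j]] != sequence[i+5-j]:
            pal = False
            break
    if pal:
        print(sequence[i : i+6])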

It returns:

704
GATCTC
GAGCTC
GCAGGC
GTTCAC
GAGATC
TCAAGA
AAATTT
GACGTC
CAGTTG
TGGACA
AAGATT
CTTCAG
CCAAGG
CGACCG
TTGGAA
CTCGAG
TCTTGA
CTTGAG
TGAGCA
CGGGCG
ATAGAT
ACGGGT
TCCAGA
CTCGAG
TCGAGA
TGTTCA
GTTCAC
GGCACC
AGATCT
CACCTG
GCCTGC
GACTTC
CAGATG
AGAACT
TCAAGA
GAAGTC
TCAGGA
AGGACT
TCTGGA
TGTTCA
TTCAAA
TCAAGA
GAGGTC
AGGTCT
TAGATA
AGTTCT
AGTTCT

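I want to find out whether these substrings are located next to "[ATCG]CC" or "[ATCG]GG". That is, I want to get the position of each palindrome in the sequence (e.g. from the i-th to the (i+5)-th letter, since each palindrome is 6 letters long) and then check whether the (i+6)-th to (i+8)-th letters are [ATCG]CC or [ATCG]GG.

Do you know how I could write such a script? Or do you have a better approach? Thanks.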

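I'm not sure I've understood your question correctly, but assuming the values you get are some kind of genetic palindromes and you want to check the letters that immediately follow each one (please correct me if I'm wrong), a simple solution looks something like this: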

sequence="GATCTCTATACCAACTCAAAATGAAGACTCTTCTTTACACTTTCGAGCTCAGCAGGCTTACCGAGAAGAGTCGTCGTTCACATCCCCCCCTGTGCGAGATCAAGAAATTTGGCGACGTCGGCTTATTATCCTCCGCTGTCAATCAGTTGGACACATCTCTCCGGTCACTGCCGGACAAGCCAACCGAAGATTCGATTCTTCAGCAGCTTATCGACATTGCTGGTGGTGAAAAGCCAAGGCACAGCATCATAGTTGCGACCAATACGTCATACGACCGAGAGACATTGGTAAAGATCCTTCAACGATTCCCATACACCATACCTGGTCTGTCAGATTCAGGCTTGGAATCAGAAACACTCGAGGCTCTTGAGCACATCGCTTTTGCATTAGCCGGGCGATTAGCTCATAGATTTGACTACGGGTTCAATCCAGAGGCCAGTATCGTTCAACACCTCGAGATGTTCACCACCCTTTGGCACCAAAGATCTGCATTACCACCTGCGCCTGCCCCGTATCGACTTCCCGTTCCCGTCAATCAAGGAAGAGTCTCCTCATCAGATGATGGCTCTGATACTGAGTCAGAACTGGATGAAAAATACCACAACATCAAGAAGTCAGGACTTTGGAGGTTTCTGGATATGTTCAAAATGAACTTCAAGAGGTCTTAGATAACGGTCTAGTTCTAGTTCTGCAACTCACACTGA"

pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Collect every 6-letter window whose two outermost base pairs are complementary,
# together with its (start, end) position; end is exclusive.
keeper = []
for i in range(len(sequence) - 6 + 1):
    pal = True
    for j in range(2):
        if pairs[sequence[i+j]] != sequence[i+5-j]:
            pal = False
            break
    if pal:
        keeper.append((sequence[i : i+6], (i, i+6)))

# The eight allowed 3-letter flanks: [ATCG]CC and [ATCG]GG.
possible_ends = [a+'CC' for a in "ATCG"]
possible_ends.extend([a+'GG' for a in "ATCG"])

final = []
for the_sequence, (start, end) in keeper:
    # Compare only the three letters after the window, so a window too close to
    # the end of the sequence cannot match on its own letters.
    flank = sequence[end : end+3]
    if flank in possible_ends:
        final.append(the_sequence + flank)

print(final)

Output:

['GCCTGCCCC', 'GAAGTCAGG']

I believe this is the output you're after.
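Since the question also asks for the position of each hit, here is a minimal sketch that returns the start index together with the window and its flank. The function name `find_flanked_palindromes`, its parameters `size` and `checked_pairs`, and the short `demo` string are all made up for illustration; the palindrome rule is the same relaxed one used in the code above (only the outer pairs are compared).

```python
def find_flanked_palindromes(seq, size=6, checked_pairs=2):
    """Return (start, window, flank) for each window followed by NCC or NGG.

    Hypothetical helper: name and parameters are illustrative assumptions.
    """
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    # The eight allowed flanks: ACC, TCC, CCC, GCC, AGG, TGG, CGG, GGG.
    flanks = {base + tail for base in "ATCG" for tail in ("CC", "GG")}
    hits = []
    # Stop early enough that a full 3-letter flank always exists.
    for i in range(len(seq) - (size + 3) + 1):
        window = seq[i:i + size]
        # Same relaxed rule as above: only the outer pairs are compared.
        if all(pairs[window[j]] == window[size - 1 - j]
               for j in range(checked_pairs)):
            flank = seq[i + size:i + size + 3]
            if flank in flanks:
                hits.append((i, window, flank))
    return hits


# Made-up demo string containing the two hits from the output above.
demo = "TTGCCTGCCCCAAGAAGTCAGGTT"
for start, window, flank in find_flanked_palindromes(demo):
    # 0-based positions: the window spans start..start+5,
    # the flank spans start+6..start+8.
    print(start, start + 5, window, flank)
```

Calling `find_flanked_palindromes(sequence)` on the full sequence should report the same windows as the list above, with their 0-based start positions added.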

Just add one extra check.

sequence="GATCTCTATACCAACTCAAAATGAAGACTCTTCTTTACACTTTCGAGCTCAGCAGGCTTACCGAGAAGAGTCGTCGTTCACATCCCCCCCTGTGCGAGATCAAGAAATTTGGCGACGTCGGCTTATTATCCTCCGCTGTCAATCAGTTGGACACATCTCTCCGGTCACTGCCGGACAAGCCAACCGAAGATTCGATTCTTCAGCAGCTTATCGACATTGCTGGTGGTGAAAAGCCAAGGCACAGCATCATAGTTGCGACCAATACGTCATACGACCGAGAGACATTGGTAAAGATCCTTCAACGATTCCCATACACCATACCTGGTCTGTCAGATTCAGGCTTGGAATCAGAAACACTCGAGGCTCTTGAGCACATCGCTTTTGCATTAGCCGGGCGATTAGCTCATAGATTTGACTACGGGTTCAATCCAGAGGCCAGTATCGTTCAACACCTCGAGATGTTCACCACCCTTTGGCACCAAAGATCTGCATTACCACCTGCGCCTGCCCCGTATCGACTTCCCGTTCCCGTCAATCAAGGAAGAGTCTCCTCATCAGATGATGGCTCTGATACTGAGTCAGAACTGGATGAAAAATACCACAACATCAAGAAGTCAGGACTTTGGAGGTTTCTGGATATGTTCAAAATGAACTTCAAGAGGTCTTAGATAACGGTCTAGTTCTAGTTCTGCAACTCACACTGA"
print(len(sequence))
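pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
ans = []
# Each hit needs 6 letters plus a 3-letter flank, so the last usable start
# is 9 letters before the end.
for i in range(len(sequence) - 9 + 1):
    pal = True
    for j in range(2):
        if pairs[sequence[i+j]] != sequence[i+5-j]:
            pal = False
            break
    if not pal:
        continue

    # The flank is sequence[i+6 : i+9]: its first letter can be anything,
    # and the last two must both be C or both be G.
    if (sequence[i+7] == sequence[i+8]) and (sequence[i+7] in ('C', 'G')):
        print(sequence[i : i+9])
        ans.append(sequence[i : i+9])
    else:
        print(sequence[i : i+6] + " (X)")
print("Count of answer: %d" % len(ans))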

Output:

704
GATCTC (X)
GAGCTC (X)
GCAGGC (X)
GTTCAC (X)
GAGATC (X)
TCAAGA (X)
AAATTT (X)
GACGTC (X)
CAGTTG (X)
TGGACA (X)
AAGATT (X)
CTTCAG (X)
CCAAGG (X)
CGACCG (X)
TTGGAA (X)
CTCGAG (X)
TCTTGA (X)
CTTGAG (X)
TGAGCA (X)
CGGGCG (X)
ATAGAT (X)
ACGGGT (X)
TCCAGA (X)
CTCGAG (X)
TCGAGA (X)
TGTTCA (X)
GTTCAC (X)
GGCACC (X)
AGATCT (X)
CACCTG (X)
GCCTGCCCC
GACTTC (X)
CAGATG (X)
AGAACT (X)
TCAAGA (X)
GAAGTCAGG
TCAGGA (X)
AGGACT (X)
TCTGGA (X)
TGTTCA (X)
TTCAAA (X)
TCAAGA (X)
GAGGTC (X)
AGGTCT (X)
TAGATA (X)
AGTTCT (X)
AGTTCT (X)
Count of answer: 2
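
One thing to be aware of in the scripts in this thread: the inner loop `for j in range(2)` compares only the two outermost base pairs of each 6-letter window, so a window such as GATCTC is reported even though it is not identical to its own reverse complement. If a strict reverse-complement palindrome is what you need (this is an assumption about the intent, not something stated in the question), a stricter test could be sketched as follows. The names `COMPLEMENT`, `reverse_complement` and `is_strict_palindrome` are made up for illustration.

```python
# Translation table mapping each base to its complement (A<->T, G<->C).
COMPLEMENT = str.maketrans("ATGC", "TACG")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_strict_palindrome(window):
    """True only if the window equals its own reverse complement."""
    return window == reverse_complement(window)


# GAGCTC and GATCTC both appear in the output of the original script.
for window in ("GAGCTC", "GATCTC"):
    print(window, is_strict_palindrome(window))
```

Replacing the `range(2)` loop with `is_strict_palindrome(sequence[i : i+6])` would check all three base pairs. Under that stricter rule neither GCCTGC nor GAAGTC qualifies, so decide first which definition of a palindrome you actually want.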