Finding palindromic sequences of length 18 in a FASTA file?
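I want to design guide RNAs by finding palindromic sequences in a FASTA file. I would like to write a Python script that finds all palindromic sequences of length 18 across the whole sequence. I have the logic in mind, but I don't know how to express it in Python. My logic is: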
1) if i is [ATCG] and i+17 is [TAGC], then check:
2) if i+1 is [ATCG] and i+16 is [TAGC], then check:
3) if i+2 is [ATCG] and i+15 is [TAGC], then check:
.
.
.
9) if i+8 is [ATCG] and i+9 is [TAGC], and all of the above are true,
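then identify the sequence from i to i+17 as a palindrome. But I need to make sure that for an A at i, it only accepts a T at i+17.
Any idea how I can write this logic in Python?
Thanks,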
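So you want to match A with T and G with C. We can use a dictionary for that. Then we just check whether the bases at opposite ends of the window pair up.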
pairs = {"A":"T", "T":"A", "G":"C", "C":"G"}
for i in range(len(sequence) - 18 + 1):
    pal = True
    for j in range(9):
        if pairs[ sequence[i+j] ] != sequence[i+17-j]:
            pal = False
            break
    if pal:
        print(sequence[i : i+18])
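The pairwise check above is equivalent to asking whether a window equals its own reverse complement. As a minimal sketch of that formulation, assuming an uppercase DNA string, the helper names `reverse_complement` and `find_palindromes` below are made up for illustration; only `str.maketrans` and `str.translate` come from the standard library. Note that `translate` leaves characters outside A, T, G, C unchanged, so windows containing something like N may need to be filtered out separately.

```python
# Translation table mapping each base to its complement (standard library only).
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(s):
    """Return the reverse complement of an uppercase DNA string (hypothetical helper)."""
    return s.translate(COMPLEMENT)[::-1]

def find_palindromes(sequence, n=18):
    """Yield (start, window) for each length-n window equal to its reverse complement."""
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        if window == reverse_complement(window):
            yield i, window

# Made-up example: GAATTC is its own reverse complement.
hits = list(find_palindromes("TTGAATTCAA", n=6))
print(hits)  # [(2, 'GAATTC')]
```

For the question as asked, the call would be `find_palindromes(sequence, n=18)`.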
For a palindrome of any length n (including odd n, in which case the middle base is not checked):
pairs = {"A":"T", "T":"A", "G":"C", "C":"G"}
n = 18
for i in range(len(sequence) - n + 1):
    pal = True
    for j in range(n//2):
        if pairs[ sequence[i+j] ] != sequence[i-j+n-1]:
            pal = False
            break
    if pal:
        print(sequence[i : i+n])
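Both loops assume that `sequence` already holds the DNA as a single string. As a rough sketch of how it could be loaded, assuming a plain-text FASTA file whose records start with a `>` header line, something like the following might work. `read_fasta` is a hypothetical helper written for this example, and the demo uses an in-memory stand-in rather than a real file. If Biopython is installed, `Bio.SeqIO.parse(handle, "fasta")` is an alternative.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle (hypothetical helper)."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks).upper()
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks).upper()

# In-memory stand-in for a file; with a real file, pass open("your_file.fasta") instead.
demo = io.StringIO(">seq1 made-up example\nttgaat\nTCAA\n>seq2\nACGT\n")
records = list(read_fasta(demo))
print(records)  # [('seq1 made-up example', 'TTGAATTCAA'), ('seq2', 'ACGT')]
```

Each sequence is joined across lines and upper-cased so that the dictionary lookup in the loops does not fail on lowercase bases.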
Looping through the string one character at a time takes too long. Python's built-in string processing is more efficient:
# create a random test sequence
import random
random.seed(1234)
seq = "".join(random.choices(["A", "T", "C", "G"], k=99))
n = 4  # not exactly 18, but good enough as a test case
print(seq)
>>>GTAGGCCAGAAGTCCAAAATGACTCACTCCTTAGTCACAATTACACAGGGATATGAAGAGATTTGTGTGGTGGTAATACGTGCCTCGAGTAGCGTATAT
# dictionary mapping each base to its complement
bp = {"A":"T", "T":"A", "G":"C", "C":"G"}
# checks whether the first half equals the complemented, reversed second half
# returns False if not, or if the length ls of s is not an even number
def palin(s):
    ls = len(s)
    if ls % 2:
        return False
    return s[:ls//2] == "".join([bp[i] for i in s[ls:ls//2-1:-1]])
# now the actual test: check all substrings of length n in the test sequence seq
# returns tuples of the index within seq and the matching substring
res = [(i, seq[i:i+n]) for i in range(len(seq)-n+1) if palin(seq[i:i+n])]
print(res)
>>>[(3, 'GGCC'), (38, 'AATT'), (50, 'ATAT'), (77, 'ACGT'), (84, 'TCGA'), (94, 'TATA'), (95, 'ATAT')]
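One caveat for real FASTA data: the lookup `bp[i]` raises a `KeyError` for any character other than A, T, G, or C, such as N or lowercase bases. Below is a minimal sketch of a more forgiving variant, assuming you would rather skip such windows than crash. `palin_safe` is a hypothetical name, and the 18-base test sequence is made up for illustration.

```python
bp = {"A": "T", "T": "A", "G": "C", "C": "G"}

def palin_safe(s):
    """Like palin, but case-insensitive and False for any non-ATGC character."""
    s = s.upper()
    ls = len(s)
    if ls % 2 or any(base not in bp for base in s):
        return False
    half = ls // 2
    # Compare the first half with the complemented, reversed second half.
    return s[:half] == "".join(bp[base] for base in reversed(s[half:]))

# Made-up sequence: an 18-base palindrome flanked by ambiguous N bases.
seq = "NNACGTTGCAATTGCAACGTNN"
n = 18
res = [(i, seq[i:i+n]) for i in range(len(seq)-n+1) if palin_safe(seq[i:i+n])]
print(res)  # [(2, 'ACGTTGCAATTGCAACGT')]
```

Windows that overlap an N are rejected here; whether that is the right policy depends on how ambiguous bases should be treated in your data.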