从大文件中特定模式的出现中提取名称
extract name from specific occurence of a pattern in a large file
我有一个 FASTA 文件,它基本上是一个文本文件,用于描述具有超过 10,000 个 FASTA 序列(以 > 开头)的生物序列数据 (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp)。文件的开头如下所示:
>Gene A
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATT
>Gene B
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATTACACCGTTAGCAGTGTGAGCAAAAACGATTAA
AAAGTAAATATTATAAAAGCCCTC
>Gene C
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
>Gene D
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
等等大约 10,000 个基因。
我想:
- 找出哪些基因包含特定模式 (CTTTGTA)
- 这种模式在那个基因中出现了多少次?
- 导出包含模式频率的基因名称列表。
Bash 或 Python(或 R)中的任何解决方案都值得赞赏。
P.S。到目前为止我已经尝试过但没有奏效:将基因及其序列提取到单独的文件中,然后在单独的文件中查找模式。但是,我可以不生成这些单独的文件。我用了
grep '^>' file.txt > new_file.txt
但我得到的输出是一个只包含所有基因名称的文件。
sequences = open('fastafile.txt').read().split('>') # Creates a list of sequences.
needle = 'CTTTGTA'
occurrences = {}
for sequence in sequences:
occ = sequence.count(needle) # Returns the number of times the substring occurs in the string sequence.
if occ: # If greater than 0, create an entry in our dictionary. The sequence being the key and the count the value.
occurrences[sequence] = occ
output = []
sorted_occurrences = sorted(occurrences.items(), key=operator.itemgetter(1)) # Sort the dictionary by length, so sequences with the highest occurrence of the needle appear at the top.
for seq, occ_count in sorted_occurrences.iteritems():
gene_name, sequence = seq.split('\n')
formatted_line = '{gene_name} - {occ_count}'.format(gene_name=gene_name, occ_count=str(occ_count)) # Format the lines the way you want.
output.append(formatted_line)
with open('occurences.txt') as o_f:
o_f.write('\n'.join(output))
这里是使用 stringi
包的 R 解决方案。由于没有单个文本文件或类似文件可以作为可重现的示例进行访问,因此我使用 cat()
和 readlines()
来读取代表您提供的行的副本的临时文件。另请检查时序基准,大文件可能会感兴趣。
library(stringi)
cat(">Gene A
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATT
>Gene B
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATTACACCGTTAGCAGTGTGAGCAAAAACGATTAA
AAAGTAAATATTATAAAAGCCCTC
>Gene C
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
>Gene D
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
", file = "tempfile.txt")
genes <- readLines("tempfile.txt", n=-1)
unlink("tempfile.txt")
genes <- unlist(stri_split_fixed(paste(genes, collapse = " "), ">"))
genes <- genes[ genes != ""]
genenames <- unlist(stri_extract_all_regex(genes, "Gene \w+"))
genes <- stri_replace_all_fixed(genes, genenames, "")
names(genes) <- genenames
genes <- gsub("\s+", "", genes, perl = T)
gene_pattern_freq <- function(str, patterns) {
res <- sapply(patterns, function(p) {
stringi::stri_count_fixed(str, p)
}, USE.NAMES = T)
rownames(res) <- names(str)
return(res)
}
searchpatterns <- c("AA", "GT", "GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAA")
result <- gene_pattern_freq(genes, searchpatterns)
result
# AA GT GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAA
# Gene A 14 6 1
# Gene B 21 10 1
# Gene C 52 18 0
# Gene D 52 18 0
library(microbenchmark)
microbenchmark(gene_pattern_freq(genes, searchpatterns))
# Unit: microseconds
# expr min lq mean median uq max neval
# gene_pattern_freq(genes, searchpatterns) 68.687 77.371 123.438 78.161 79.345 4479.19 100
#export
write.csv(result, file = "../mypath/gene_pattern_freq_result.csv" )
我有一个 FASTA 文件,它基本上是一个文本文件,用于描述具有超过 10,000 个 FASTA 序列(以 > 开头)的生物序列数据 (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp)。文件的开头如下所示:
>Gene A
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATT
>Gene B
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATTACACCGTTAGCAGTGTGAGCAAAAACGATTAA
AAAGTAAATATTATAAAAGCCCTC
>Gene C
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
>Gene D
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
等等大约 10,000 个基因。 我想:
- 找出哪些基因包含特定模式 (CTTTGTA)
- 这种模式在那个基因中出现了多少次?
- 导出包含模式频率的基因名称列表。
Bash 或 Python(或 R)中的任何解决方案都值得赞赏。
P.S。到目前为止我已经尝试过但没有奏效:将基因及其序列提取到单独的文件中,然后在单独的文件中查找模式。但是,我可以不生成这些单独的文件。我用了
grep '^>' file.txt > new_file.txt
但我得到的输出是一个只包含所有基因名称的文件。
sequences = open('fastafile.txt').read().split('>') # Creates a list of sequences.
needle = 'CTTTGTA'
occurrences = {}
for sequence in sequences:
occ = sequence.count(needle) # Returns the number of times the substring occurs in the string sequence.
if occ: # If greater than 0, create an entry in our dictionary. The sequence being the key and the count the value.
occurrences[sequence] = occ
output = []
sorted_occurrences = sorted(occurrences.items(), key=operator.itemgetter(1)) # Sort the dictionary by length, so sequences with the highest occurrence of the needle appear at the top.
for seq, occ_count in sorted_occurrences.iteritems():
gene_name, sequence = seq.split('\n')
formatted_line = '{gene_name} - {occ_count}'.format(gene_name=gene_name, occ_count=str(occ_count)) # Format the lines the way you want.
output.append(formatted_line)
with open('occurences.txt') as o_f:
o_f.write('\n'.join(output))
这里是使用 stringi
包的 R 解决方案。由于没有单个文本文件或类似文件可以作为可重现的示例进行访问,因此我使用 cat()
和 readlines()
来读取代表您提供的行的副本的临时文件。另请检查时序基准,大文件可能会感兴趣。
library(stringi)
cat(">Gene A
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATT
>Gene B
GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAATTCCAGAAGTTGGACAGACATATATAGACAGCACATATATTA
TCTTTATTTTTTTATGTATGATAACATTAAATATAACGTTCAACAATTACACCGTTAGCAGTGTGAGCAAAAACGATTAA
AAAGTAAATATTATAAAAGCCCTC
>Gene C
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
>Gene D
AACAACAAATTGCCATCTACCCGTTTGAATCCTGTAATAATAACTTGCCCAGATTTGCTGCAGCATACTCCTAGAGTTGG
GCTGGGTGGCCCACACAAGCGATAATAACATTTAACAATTGTTTGATATATGTACTTTTTTTTAAGTTTTTTTCTCCTCG
TACTTGCCTTCCAAAAACTCGTTAGCTTTGTACACATACGCCTTTAATTAAAATACTGATAGATGCGTACCACTTACGTC
ATTAGAAAAAGTCACCAAAAGGAAAAATATGGACGACACAAGAACGAGGAGATCTAAGCCACTCGTAGACCACTAAGCAC
AAAATACCCGAAAAATATAACTGATATGATTGCCAACTACCCTGCGACTATGTAAACCCAACCTTCCCCCCTCCTTTACC
CTCTTATTCAAATCGACGCGTGTGTAGAAGATACACTTATTATATTTTTTTTCTGAGATACAATTATAAACACAAAAACG
ACTTTTAACTATATATTAAATAAAAACAAAAGGAAAAACATAATAATTT
", file = "tempfile.txt")
genes <- readLines("tempfile.txt", n=-1)
unlink("tempfile.txt")
genes <- unlist(stri_split_fixed(paste(genes, collapse = " "), ">"))
genes <- genes[ genes != ""]
genenames <- unlist(stri_extract_all_regex(genes, "Gene \w+"))
genes <- stri_replace_all_fixed(genes, genenames, "")
names(genes) <- genenames
genes <- gsub("\s+", "", genes, perl = T)
gene_pattern_freq <- function(str, patterns) {
res <- sapply(patterns, function(p) {
stringi::stri_count_fixed(str, p)
}, USE.NAMES = T)
rownames(res) <- names(str)
return(res)
}
searchpatterns <- c("AA", "GT", "GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAA")
result <- gene_pattern_freq(genes, searchpatterns)
result
# AA GT GAACTACACAAACGTAAAATGTAAAACAAAGGTATAAA
# Gene A 14 6 1
# Gene B 21 10 1
# Gene C 52 18 0
# Gene D 52 18 0
library(microbenchmark)
microbenchmark(gene_pattern_freq(genes, searchpatterns))
# Unit: microseconds
# expr min lq mean median uq max neval
# gene_pattern_freq(genes, searchpatterns) 68.687 77.371 123.438 78.161 79.345 4479.19 100
#export
write.csv(result, file = "../mypath/gene_pattern_freq_result.csv" )