解析一个 gff 文件,获取脚手架名称并创建一个新的数据框
Parse a gff file, get scaffold name and creat a new dataframe
我解释一下我的问题。
我其实有一个gff文件(Augustus输出)
在这个 gff 文件中,我有不同支架中基因的定位,例如:
# Predicted genes for sequence number 1 on both strands
# start gene g1
scaffold_0 AUGUSTUS gene 5629 8516 0.13 + . g1
scaffold_0 AUGUSTUS transcript 5629 8516 0.13 + . g1.t1
scaffold_0 AUGUSTUS start_codon 5629 5631 . + 0 transcript_id "g1.t1"; gene_id "g1";
scaffold_0 AUGUSTUS intron 5662 7695 0.18 + . transcript_id "g1.t1"; gene_id "g1";
scaffold_0 AUGUSTUS intron 8219 8424 0.55 + . transcript_id "g1.t1"; gene_id "g1";
scaffold_0 AUGUSTUS CDS 5629 5661 0.41 + 0 transcript_id "g1.t1"; gene_id "g1";
scaffold_0 AUGUSTUS CDS 7696 8218 0.56 + 0 transcript_id "g1.t1"; gene_id "g1";
scaffold_0 AUGUSTUS CDS 8425 8516 0.59 + 2 transcript_id "g1.t1"; gene_id "g1";
scaffold_0 AUGUSTUS stop_codon 8514 8516 . + 0 transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MRVKSGNLRENNEARGTSANVALCTVSRGYGGRGARGGGVSGANSGPSGARGGSTRARSRAESVVRLPLSPRTLCAPG
# PVAGPHTGAQAQQHPQQAQSRDHHASQPLTGGAGTGSGLEDGPGSADEPMVHLERPQQAGTWICCVPCYWLRRSKAVHKALLTFAMLLVTSLLVTSPV
# LFLITTLPEAASIRHTYACLFSLSKIPLPPHMKTNGTPF]
# end gene g1
例如,这里的g1.t1 位于scaffold 0 中。
我还有一个包含多个信息的选项卡文件,例如:
scaf_name scaf_length cov_depth GC
scaffold_0 541519 79.7321515957889 0.398087601727733
scaffold_1 251893 80.0520141488648 0.398288162037055
scaffold_10 169987 79.1650185014148 0.400994512896853
scaffold_100 168110 77.8095354232348 0.417265648727282
scaffold_1000 34744 70.048756619848 0.419531948061824
scaffold_10000 648 73.8009259259259 0.427052956957891
scaffold_100000 118 0 0.401301375405743
scaffold_100001 118 0 0.389793951804145
我想做的是用这些信息创建一个数据框,例如,如果我有 5 个基因:
g1.t1 in scaffold 256
g2.t1 in scaffold 890
g3.t1 in scaffold 6
g4.t1 in scaffold 6
g5.t1 in scaffold 1234
这些脚手架的信息是:
scaf_name scaf_length cov_depth GC
scaffold 6 56786 79 0.39
scaffold 256 789765 86 0.42
scaffold 890 3456 85 0.40
scaffold 1234 590 90 0.41
我想要这样的数据框:
gene_name scaf_name scaf_length cov_depth GC
g3.t1 scaffold 6 56786 79 0.39
g4.t1 scaffold 6 56786 79 0.39
g1.t1 scaffold 256 789765 86 0.42
g2.t1 scaffold 890 3456 85 0.40
g5.t1 scaffold 1234 590 90 0.41
我想用 python 和 pandas 是可以做到的,不是吗?
感谢您的帮助:)
这是我感谢 Ami 制作的脚本:
import pandas as pd
gff_0042="/Users/etudiant/Desktop/Horizon_project/Augustus/Augustus_out/run_augustus_0042.out"
liste=["scaf_name","source","feature","start","end","score","strand","frame","gene"]
gene_info=pd.read_csv(gff_0042,comment='"',sep='\s+',header=None,names=liste)
scaf_info=pd.read_csv("cov_GC.tab",sep='\t')
scaf_info.scaf_name = scaf_info.scaf_name.str.replace(' ', '_')
new=pd.merge(gene_info, scaf_info, on='scaf_name')
print(new)
这是我得到的输出的一部分:
scaf_name source feature start end score strand frame gene scaf_length cov_depth GC
0 scaffold_0 AUGUSTUS gene 5793 11216 0.13 + . g1 541519 79.7321515957889 0.39808760172773294
1 scaffold_0 AUGUSTUS transcript 5793 11216 0.13 + . g1.t1 541519 79.7321515957889 0.39808760172773294
2 scaffold_0 AUGUSTUS start_codon 5793 5795 . + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
3 scaffold_0 AUGUSTUS intron 5829 7371 0.41 + . transcript_id 541519 79.7321515957889 0.39808760172773294
4 scaffold_0 AUGUSTUS intron 7405 7695 0.8 + . transcript_id 541519 79.7321515957889 0.39808760172773294
5 scaffold_0 AUGUSTUS intron 8248 11054 0.37 + . transcript_id 541519 79.7321515957889 0.39808760172773294
6 scaffold_0 AUGUSTUS CDS 5793 5828 0.45 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
7 scaffold_0 AUGUSTUS CDS 7372 7404 0.81 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
8 scaffold_0 AUGUSTUS CDS 7696 8247 0.84 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
9 scaffold_0 AUGUSTUS CDS 11055 11216 0.4 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
10 scaffold_0 AUGUSTUS stop_codon 11214 11216 . + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
11 scaffold_0 AUGUSTUS gene 38793 39207 0.91 - . g2 541519 79.7321515957889 0.39808760172773294
12 scaffold_0 AUGUSTUS transcript 38793 39207 0.91 - . g2.t1 541519 79.7321515957889 0.39808760172773294
13 scaffold_0 AUGUSTUS stop_codon 38793 38795 . - 0 transcript_id 541519 79.7321515957889 0.39808760172773294
14 scaffold_0 AUGUSTUS intron 38904 39066 0.92 - . transcript_id 541519 79.7321515957889 0.39808760172773294
15 scaffold_0 AUGUSTUS CDS 38793 38903 0.95 - 0 transcript_id 541519 79.7321515957889 0.39808760172773294
16 scaffold_0 AUGUSTUS CDS 39067 39207 0.91 - 0 transcript_id 541519 79.7321515957889 0.39808760172773294
17 scaffold_0 AUGUSTUS start_codon 39205 39207 . - 0 transcript_id 541519 79.7321515957889 0.39808760172773294
18 scaffold_0 AUGUSTUS gene 50409 58153 0.07 + . g3 541519 79.7321515957889 0.39808760172773294
19 scaffold_0 AUGUSTUS transcript 50409 58153 0.07 + . g3.t1 541519 79.7321515957889 0.39808760172773294
20 scaffold_0 AUGUSTUS start_codon 50409 50411 . + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
21 scaffold_0 AUGUSTUS intron 50520 50762 0.65 + . transcript_id 541519 79.7321515957889 0.39808760172773294
22 scaffold_0 AUGUSTUS intron 50895 51080 0.91 + . transcript_id 541519 79.7321515957889 0.39808760172773294
23 scaffold_0 AUGUSTUS intron 51180 51284 0.61 + . transcript_id 541519 79.7321515957889 0.39808760172773294
24 scaffold_0 AUGUSTUS intron 51458 55603 0.47 + . transcript_id 541519 79.7321515957889 0.39808760172773294
25 scaffold_0 AUGUSTUS intron 55791 56101 0.98 + . transcript_id 541519 79.7321515957889 0.39808760172773294
26 scaffold_0 AUGUSTUS intron 56279 56384 1 + . transcript_id 541519 79.7321515957889 0.39808760172773294
27 scaffold_0 AUGUSTUS intron 56486 56573 1 + . transcript_id 541519 79.7321515957889 0.39808760172773294
28 scaffold_0 AUGUSTUS intron 56630 56952 0.97 + . transcript_id 541519 79.7321515957889 0.39808760172773294
29 scaffold_0 AUGUSTUS intron 57240 57365 0.57 + . transcript_id 541519 79.7321515957889 0.39808760172773294
30 scaffold_0 AUGUSTUS intron 57525 57736 0.81 + . transcript_id 541519 79.7321515957889 0.39808760172773294
31 scaffold_0 AUGUSTUS intron 57814 58053 0.75 + . transcript_id 541519 79.7321515957889 0.39808760172773294
32 scaffold_0 AUGUSTUS CDS 50409 50519 0.39 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
33 scaffold_0 AUGUSTUS CDS 50763 50894 0.74 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
34 scaffold_0 AUGUSTUS CDS 51081 51179 0.91 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
35 scaffold_0 AUGUSTUS CDS 51285 51457 0.57 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
36 scaffold_0 AUGUSTUS CDS 55604 55790 0.98 + 1 transcript_id 541519 79.7321515957889 0.39808760172773294
37 scaffold_0 AUGUSTUS CDS 56102 56278 0.98 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
38 scaffold_0 AUGUSTUS CDS 56385 56485 1 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
39 scaffold_0 AUGUSTUS CDS 56574 56629 0.98 + 1 transcript_id 541519 79.7321515957889 0.39808760172773294
40 scaffold_0 AUGUSTUS CDS 56953 57239 0.64 + 2 transcript_id 541519 79.7321515957889 0.39808760172773294
41 scaffold_0 AUGUSTUS CDS 57366 57524 0.81 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
42 scaffold_0 AUGUSTUS CDS 57737 57813 0.9 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
43 scaffold_0 AUGUSTUS CDS 58054 58153 0.5 + 1 transcript_id 541519 79.7321515957889 0.39808760172773294
44 scaffold_0 AUGUSTUS stop_codon 58151 58153 . + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
如果可能的话,我希望只得到:
scaf_name source feature start end score strand frame gene scaf_length cov_depth GC
1 scaffold_0 AUGUSTUS transcript 5793 11216 0.13 + . g1.t1 541519 79.7321515957889 0.39808760172773294
12 scaffold_0 AUGUSTUS transcript 38793 39207 0.91 - . g2.t1 541519 79.7321515957889 0.39808760172773294
19 scaffold_0 AUGUSTUS transcript 50409 58153 0.07 + . g3.t1 541519 79.7321515957889 0.39808760172773294
等等..
首先,使用read_csv
读取gff文件。注意以下几点:
- 设置
comment='"'
- 设置 `sep='\s+'
- 设置
header=None
- 将
names=...
设置为一些指示名称的列表,第一个名称为 'scaf_name'
,最后一个名称为 'gene'
。
放在gene_info
.
现在使用read_csv
读取标签限制文件,设置sep='\t'
。将它放在 scaf_info
.
运行
scaf_info.scaf_name = scaf_info.scaf_name.str.replace(' ', '_')
以便将 scaffold 6
之类的内容更改为 scaffold_6
。
现在
pd.merge(gene_info, scaf_info, on='scaf_name')
我解释一下我的问题。
我其实有一个gff文件(Augustus输出) 在这个 gff 文件中,我有不同支架中基因的定位,例如:
# Predicted genes for sequence number 1 on both strands
# start gene g1
scaffold_0 AUGUSTUS gene 5629 8516 0.13 + . g1
scaffold_0 AUGUSTUS transcript 5629 8516 0.13 + . g1.t1
scaffold_0 AUGUSTUS start_codon 5629 5631 . + 0 transcript_id "g1.t1"; gene_id "g1";
scaffold_0 AUGUSTUS intron 5662 7695 0.18 + . transcript_id "g1.t1"; gene_id "g1";
scaffold_0 AUGUSTUS intron 8219 8424 0.55 + . transcript_id "g1.t1"; gene_id "g1";
scaffold_0 AUGUSTUS CDS 5629 5661 0.41 + 0 transcript_id "g1.t1"; gene_id "g1";
scaffold_0 AUGUSTUS CDS 7696 8218 0.56 + 0 transcript_id "g1.t1"; gene_id "g1";
scaffold_0 AUGUSTUS CDS 8425 8516 0.59 + 2 transcript_id "g1.t1"; gene_id "g1";
scaffold_0 AUGUSTUS stop_codon 8514 8516 . + 0 transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MRVKSGNLRENNEARGTSANVALCTVSRGYGGRGARGGGVSGANSGPSGARGGSTRARSRAESVVRLPLSPRTLCAPG
# PVAGPHTGAQAQQHPQQAQSRDHHASQPLTGGAGTGSGLEDGPGSADEPMVHLERPQQAGTWICCVPCYWLRRSKAVHKALLTFAMLLVTSLLVTSPV
# LFLITTLPEAASIRHTYACLFSLSKIPLPPHMKTNGTPF]
# end gene g1
例如,这里的g1.t1 位于scaffold 0 中。 我还有一个包含多个信息的选项卡文件,例如:
scaf_name scaf_length cov_depth GC
scaffold_0 541519 79.7321515957889 0.398087601727733
scaffold_1 251893 80.0520141488648 0.398288162037055
scaffold_10 169987 79.1650185014148 0.400994512896853
scaffold_100 168110 77.8095354232348 0.417265648727282
scaffold_1000 34744 70.048756619848 0.419531948061824
scaffold_10000 648 73.8009259259259 0.427052956957891
scaffold_100000 118 0 0.401301375405743
scaffold_100001 118 0 0.389793951804145
我想做的是用这些信息创建一个数据框,例如,如果我有 5 个基因:
g1.t1 in scaffold 256
g2.t1 in scaffold 890
g3.t1 in scaffold 6
g4.t1 in scaffold 6
g5.t1 in scaffold 1234
这些脚手架的信息是:
scaf_name scaf_length cov_depth GC
scaffold 6 56786 79 0.39
scaffold 256 789765 86 0.42
scaffold 890 3456 85 0.40
scaffold 1234 590 90 0.41
我想要这样的数据框:
gene_name scaf_name scaf_length cov_depth GC
g3.t1 scaffold 6 56786 79 0.39
g4.t1 scaffold 6 56786 79 0.39
g1.t1 scaffold 256 789765 86 0.42
g2.t1 scaffold 890 3456 85 0.40
g5.t1 scaffold 1234 590 90 0.41
我想用 python 和 pandas 是可以做到的,不是吗? 感谢您的帮助:)
这是我感谢 Ami 制作的脚本:
import pandas as pd
gff_0042="/Users/etudiant/Desktop/Horizon_project/Augustus/Augustus_out/run_augustus_0042.out"
liste=["scaf_name","source","feature","start","end","score","strand","frame","gene"]
gene_info=pd.read_csv(gff_0042,comment='"',sep='\s+',header=None,names=liste)
scaf_info=pd.read_csv("cov_GC.tab",sep='\t')
scaf_info.scaf_name = scaf_info.scaf_name.str.replace(' ', '_')
new=pd.merge(gene_info, scaf_info, on='scaf_name')
print(new)
这是我得到的输出的一部分:
scaf_name source feature start end score strand frame gene scaf_length cov_depth GC
0 scaffold_0 AUGUSTUS gene 5793 11216 0.13 + . g1 541519 79.7321515957889 0.39808760172773294
1 scaffold_0 AUGUSTUS transcript 5793 11216 0.13 + . g1.t1 541519 79.7321515957889 0.39808760172773294
2 scaffold_0 AUGUSTUS start_codon 5793 5795 . + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
3 scaffold_0 AUGUSTUS intron 5829 7371 0.41 + . transcript_id 541519 79.7321515957889 0.39808760172773294
4 scaffold_0 AUGUSTUS intron 7405 7695 0.8 + . transcript_id 541519 79.7321515957889 0.39808760172773294
5 scaffold_0 AUGUSTUS intron 8248 11054 0.37 + . transcript_id 541519 79.7321515957889 0.39808760172773294
6 scaffold_0 AUGUSTUS CDS 5793 5828 0.45 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
7 scaffold_0 AUGUSTUS CDS 7372 7404 0.81 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
8 scaffold_0 AUGUSTUS CDS 7696 8247 0.84 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
9 scaffold_0 AUGUSTUS CDS 11055 11216 0.4 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
10 scaffold_0 AUGUSTUS stop_codon 11214 11216 . + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
11 scaffold_0 AUGUSTUS gene 38793 39207 0.91 - . g2 541519 79.7321515957889 0.39808760172773294
12 scaffold_0 AUGUSTUS transcript 38793 39207 0.91 - . g2.t1 541519 79.7321515957889 0.39808760172773294
13 scaffold_0 AUGUSTUS stop_codon 38793 38795 . - 0 transcript_id 541519 79.7321515957889 0.39808760172773294
14 scaffold_0 AUGUSTUS intron 38904 39066 0.92 - . transcript_id 541519 79.7321515957889 0.39808760172773294
15 scaffold_0 AUGUSTUS CDS 38793 38903 0.95 - 0 transcript_id 541519 79.7321515957889 0.39808760172773294
16 scaffold_0 AUGUSTUS CDS 39067 39207 0.91 - 0 transcript_id 541519 79.7321515957889 0.39808760172773294
17 scaffold_0 AUGUSTUS start_codon 39205 39207 . - 0 transcript_id 541519 79.7321515957889 0.39808760172773294
18 scaffold_0 AUGUSTUS gene 50409 58153 0.07 + . g3 541519 79.7321515957889 0.39808760172773294
19 scaffold_0 AUGUSTUS transcript 50409 58153 0.07 + . g3.t1 541519 79.7321515957889 0.39808760172773294
20 scaffold_0 AUGUSTUS start_codon 50409 50411 . + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
21 scaffold_0 AUGUSTUS intron 50520 50762 0.65 + . transcript_id 541519 79.7321515957889 0.39808760172773294
22 scaffold_0 AUGUSTUS intron 50895 51080 0.91 + . transcript_id 541519 79.7321515957889 0.39808760172773294
23 scaffold_0 AUGUSTUS intron 51180 51284 0.61 + . transcript_id 541519 79.7321515957889 0.39808760172773294
24 scaffold_0 AUGUSTUS intron 51458 55603 0.47 + . transcript_id 541519 79.7321515957889 0.39808760172773294
25 scaffold_0 AUGUSTUS intron 55791 56101 0.98 + . transcript_id 541519 79.7321515957889 0.39808760172773294
26 scaffold_0 AUGUSTUS intron 56279 56384 1 + . transcript_id 541519 79.7321515957889 0.39808760172773294
27 scaffold_0 AUGUSTUS intron 56486 56573 1 + . transcript_id 541519 79.7321515957889 0.39808760172773294
28 scaffold_0 AUGUSTUS intron 56630 56952 0.97 + . transcript_id 541519 79.7321515957889 0.39808760172773294
29 scaffold_0 AUGUSTUS intron 57240 57365 0.57 + . transcript_id 541519 79.7321515957889 0.39808760172773294
30 scaffold_0 AUGUSTUS intron 57525 57736 0.81 + . transcript_id 541519 79.7321515957889 0.39808760172773294
31 scaffold_0 AUGUSTUS intron 57814 58053 0.75 + . transcript_id 541519 79.7321515957889 0.39808760172773294
32 scaffold_0 AUGUSTUS CDS 50409 50519 0.39 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
33 scaffold_0 AUGUSTUS CDS 50763 50894 0.74 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
34 scaffold_0 AUGUSTUS CDS 51081 51179 0.91 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
35 scaffold_0 AUGUSTUS CDS 51285 51457 0.57 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
36 scaffold_0 AUGUSTUS CDS 55604 55790 0.98 + 1 transcript_id 541519 79.7321515957889 0.39808760172773294
37 scaffold_0 AUGUSTUS CDS 56102 56278 0.98 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
38 scaffold_0 AUGUSTUS CDS 56385 56485 1 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
39 scaffold_0 AUGUSTUS CDS 56574 56629 0.98 + 1 transcript_id 541519 79.7321515957889 0.39808760172773294
40 scaffold_0 AUGUSTUS CDS 56953 57239 0.64 + 2 transcript_id 541519 79.7321515957889 0.39808760172773294
41 scaffold_0 AUGUSTUS CDS 57366 57524 0.81 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
42 scaffold_0 AUGUSTUS CDS 57737 57813 0.9 + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
43 scaffold_0 AUGUSTUS CDS 58054 58153 0.5 + 1 transcript_id 541519 79.7321515957889 0.39808760172773294
44 scaffold_0 AUGUSTUS stop_codon 58151 58153 . + 0 transcript_id 541519 79.7321515957889 0.39808760172773294
如果可能的话,我希望只得到:
scaf_name source feature start end score strand frame gene scaf_length cov_depth GC
1 scaffold_0 AUGUSTUS transcript 5793 11216 0.13 + . g1.t1 541519 79.7321515957889 0.39808760172773294
12 scaffold_0 AUGUSTUS transcript 38793 39207 0.91 - . g2.t1 541519 79.7321515957889 0.39808760172773294
19 scaffold_0 AUGUSTUS transcript 50409 58153 0.07 + . g3.t1 541519 79.7321515957889 0.39808760172773294
等等..
首先,使用read_csv
读取gff文件。注意以下几点:
- 设置
comment='"'
- 设置 `sep='\s+'
- 设置
header=None
- 将
names=...
设置为一些指示名称的列表,第一个名称为'scaf_name'
,最后一个名称为'gene'
。
放在gene_info
.
现在使用read_csv
读取标签限制文件,设置sep='\t'
。将它放在 scaf_info
.
运行
scaf_info.scaf_name = scaf_info.scaf_name.str.replace(' ', '_')
以便将 scaffold 6
之类的内容更改为 scaffold_6
。
现在
pd.merge(gene_info, scaf_info, on='scaf_name')