计算每个转录本的平均外显子数
Calculating the average number of exons per transcript
我希望能够计算每个转录本的外显子的平均数量我的档案。
仅供参考: 我正在使用的文件格式称为 GENCODE GTF 文件。
要找到每个转录本的外显子数量,我们必须在字段 9 (</code>) 的分号分隔列表中进行搜索,但仅搜索 <em> 外显子特征行</em>(字符串“<code>exon
”在制表符分隔文件的字段 3 (</code>) 中的行)。</p>
<p>每个独特的 <code>transcript_id
都有一系列关联的外显子特征线,字段 9 (</code>) 中的“<code>exon_number n
”数量不断增加:
...; transcript_id "ENST00000488147.1"; ...; exon_number 1; ...
...; transcript_id "ENST00000488147.1"; ...; exon_number 2; ...
...
...; transcript_id "ENST00000488147.1"; ...; exon_number n; ...
要获得每个转录本的 平均 外显子数,只需将每个唯一 trasnscript_id
的最大值 exon_number
相加,然后除以总外显子数唯一 transcript_id
已使用。
以下是我文件中任何给定行的 10 种可能格式:
chr19 HAVANA gene 405438 409170 . - . gene_id "ENSG00000183186.7"; gene_type "protein_coding"; gene_name "C2CD4C"; level 2; havana_gene "OTTHUMG00000180534.3";
chr19 HAVANA transcript 405438 409170 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA exon 409006 409170 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.5"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA exon 405438 408401 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA CDS 407099 408361 . - 0 gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA start_codon 408359 408361 . - 0 gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA stop_codon 407096 407098 . - 0 gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA UTR 409006 409170 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.5"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA UTR 405438 407098 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA UTR 408362 408401 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
注意:字符串“exon_number n
”出现在所有要素类型(字段3)的行中,除了“gene
”和“transcript
”,因此计算应该特定于字段 3 ONLY.
中带有“exon
”的行
我的实际文件中的示例行:
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 12010 12057 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 12179 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 12613 12697 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 3; exon_id "ENSE00001758273.2"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 12975 13052 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 4; exon_id "ENSE00001799933.2"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 13221 13374 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 5; exon_id "ENSE00001746346.2"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 13453 13670 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 6; exon_id "ENSE00001863096.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA gene 14404 29570 . - . gene_id "ENSG00000227232.5"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; level 2; havana_gene "OTTHUMG00000000958.1";
chr1 HAVANA transcript 14404 29570 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 29534 29570 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 1; exon_id "ENSE00001890219.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 24738 24891 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 2; exon_id "ENSE00003507205.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 18268 18366 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 3; exon_id "ENSE00003477500.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 17915 18061 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 4; exon_id "ENSE00003565697.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 17606 17742 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 5; exon_id "ENSE00003475637.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 17233 17368 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 6; exon_id "ENSE00003502542.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 16858 17055 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 7; exon_id "ENSE00003553898.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 16607 16765 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 8; exon_id "ENSE00003621279.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 15796 15947 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 9; exon_id "ENSE00002030414.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 15005 15038 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 10; exon_id "ENSE00001935574.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 14404 14501 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 11; exon_id "ENSE00001843071.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 ENSEMBL gene 17369 17436 . - . gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_name "MIR6859-1"; level 3;
基于我的示例的所需输出:
6.66667
由于ENST00000456328.2
有3个外显子,ENST00000450305.2
有6个外显子,ENST00000488147.1
有11个外显子~(3 + 6 + 11)/3 = 6.66667
尝试过的解决方案
(i) 我尝试为字段 3 中具有“exon
”的所有行打印列表字段 9。以下输出 gene_id 1194547
:
if [ == "exon" ]; then
awk '{a[]++}END{for(i in a){print i, a[i]}}' file
fi
我相信这意味着我的代码只计算第 9 列的第一部分:gene_id
。不幸的是,我无法得到一个数字的平均值。
(ii) 我尝试打印字段 3 中包含“exon
”的所有行,然后计算 exon_number
出现的次数独特的“transcript_id
”,但我现在对 awk
的了解非常薄弱,所以我觉得脚本还差得远。下面输出一个唯一的 gene_id
列表及其各自的计数,使得输出文件的每一行看起来像这样 "ENSG00000242201.3"; 1
, "ENSG00000132478.9"; 41
, "ENSG00000155099.7"; 34
:
awk '=="exon" {print [=15=]}' file | awk '{count[]++} END {for (word in count) print word, count[word]}'
产生的计数不准确,因为它们只是将所有与 外显子 相同的 gene_id
的行相加。这意味着 ENSG00000183186.7
在文件中可能 found/counted 47
次,但是这个 gene 实际上会产生许多不同大小的转录本,由于选择性剪接(这个问题的生物学相关性),而代码忽略了这一点。它应该特定于“transcript_id
”,而不是“gene_id
”。
(iii) 此代码正在进行中,需要大量修改。我尝试(1)仅当字段 3 包含字符串 exon
时才提取字段 9,(2)拆分“;
”分隔列表并提取 only每个唯一“transcript_id
”的最高值 exon_number n
,然后 (3) 计算每个转录本的 平均 外显子数。
awk ' == "exon" {print }' file
#step needed here to split the “;” separated fields to extract the max exon_number
awk '{ total += } END { print total/NR }' file
好吧,我还是不明白整个故事。所以无论如何:
# extract the lines with 3rd column equal to "exon"
awk ' == "exon"' |
# Extract the interesting parts with a regex. I like `sed` more, so I'll use it.
# extract the transcript_id and exon number
# separator is ';'
sed -n 's/.*transcript_id \([^;]*\).*exon_number \([^;]*\).*/;/p' |
# because I noticed that number is equal to count of lines
# just count the lines with the same transcript_id
# (actually exon_number is ignored)
# and then calculate the average of the array
awk -F';' '{a[]++} END{for (i in a) {cnt++; sum += a[i]} print sum/cnt}'
# or you could also like get the highest number from exon_number with like
# awk ... '{ if (a[] < ) a[]= } ...'
对于作为 tested on repl the script outputs 提供的输入:
6.66667
我在上面留下了 3 个命令,所以它看起来很冗长,同样的工作可以用一个 awk 脚本完成,这比 3 个进程链要快。只需在 awk
中使用 gensub
而不是 awk
中的 sed
和 split
即可将输入拆分为 awk 中的数组。
我希望能够计算每个转录本的外显子的平均数量我的档案。
仅供参考: 我正在使用的文件格式称为 GENCODE GTF 文件。
要找到每个转录本的外显子数量,我们必须在字段 9 (</code>) 的分号分隔列表中进行搜索,但仅搜索 <em> 外显子特征行</em>(字符串“<code>exon
”在制表符分隔文件的字段 3 (</code>) 中的行)。</p>
<p>每个独特的 <code>transcript_id
都有一系列关联的外显子特征线,字段 9 (</code>) 中的“<code>exon_number n
”数量不断增加:
...; transcript_id "ENST00000488147.1"; ...; exon_number 1; ...
...; transcript_id "ENST00000488147.1"; ...; exon_number 2; ...
...
...; transcript_id "ENST00000488147.1"; ...; exon_number n; ...
要获得每个转录本的 平均 外显子数,只需将每个唯一 trasnscript_id
的最大值 exon_number
相加,然后除以总外显子数唯一 transcript_id
已使用。
以下是我文件中任何给定行的 10 种可能格式:
chr19 HAVANA gene 405438 409170 . - . gene_id "ENSG00000183186.7"; gene_type "protein_coding"; gene_name "C2CD4C"; level 2; havana_gene "OTTHUMG00000180534.3";
chr19 HAVANA transcript 405438 409170 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA exon 409006 409170 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.5"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA exon 405438 408401 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA CDS 407099 408361 . - 0 gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA start_codon 408359 408361 . - 0 gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA stop_codon 407096 407098 . - 0 gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA UTR 409006 409170 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.5"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA UTR 405438 407098 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19 HAVANA UTR 408362 408401 . - . gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
注意:字符串“exon_number n
”出现在所有要素类型(字段3)的行中,除了“gene
”和“transcript
”,因此计算应该特定于字段 3 ONLY.
exon
”的行
我的实际文件中的示例行:
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 12010 12057 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 12179 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 12613 12697 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 3; exon_id "ENSE00001758273.2"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 12975 13052 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 4; exon_id "ENSE00001799933.2"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 13221 13374 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 5; exon_id "ENSE00001746346.2"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA exon 13453 13670 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 6; exon_id "ENSE00001863096.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1 HAVANA gene 14404 29570 . - . gene_id "ENSG00000227232.5"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; level 2; havana_gene "OTTHUMG00000000958.1";
chr1 HAVANA transcript 14404 29570 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 29534 29570 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 1; exon_id "ENSE00001890219.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 24738 24891 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 2; exon_id "ENSE00003507205.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 18268 18366 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 3; exon_id "ENSE00003477500.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 17915 18061 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 4; exon_id "ENSE00003565697.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 17606 17742 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 5; exon_id "ENSE00003475637.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 17233 17368 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 6; exon_id "ENSE00003502542.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 16858 17055 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 7; exon_id "ENSE00003553898.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 16607 16765 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 8; exon_id "ENSE00003621279.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 15796 15947 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 9; exon_id "ENSE00002030414.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 15005 15038 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 10; exon_id "ENSE00001935574.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 HAVANA exon 14404 14501 . - . gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 11; exon_id "ENSE00001843071.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1 ENSEMBL gene 17369 17436 . - . gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_name "MIR6859-1"; level 3;
基于我的示例的所需输出:
6.66667
由于ENST00000456328.2
有3个外显子,ENST00000450305.2
有6个外显子,ENST00000488147.1
有11个外显子~(3 + 6 + 11)/3 = 6.66667
尝试过的解决方案
(i) 我尝试为字段 3 中具有“exon
”的所有行打印列表字段 9。以下输出 gene_id 1194547
:
if [ == "exon" ]; then
awk '{a[]++}END{for(i in a){print i, a[i]}}' file
fi
我相信这意味着我的代码只计算第 9 列的第一部分:gene_id
。不幸的是,我无法得到一个数字的平均值。
(ii) 我尝试打印字段 3 中包含“exon
”的所有行,然后计算 exon_number
出现的次数独特的“transcript_id
”,但我现在对 awk
的了解非常薄弱,所以我觉得脚本还差得远。下面输出一个唯一的 gene_id
列表及其各自的计数,使得输出文件的每一行看起来像这样 "ENSG00000242201.3"; 1
, "ENSG00000132478.9"; 41
, "ENSG00000155099.7"; 34
:
awk '=="exon" {print [=15=]}' file | awk '{count[]++} END {for (word in count) print word, count[word]}'
产生的计数不准确,因为它们只是将所有与 外显子 相同的 gene_id
的行相加。这意味着 ENSG00000183186.7
在文件中可能 found/counted 47
次,但是这个 gene 实际上会产生许多不同大小的转录本,由于选择性剪接(这个问题的生物学相关性),而代码忽略了这一点。它应该特定于“transcript_id
”,而不是“gene_id
”。
(iii) 此代码正在进行中,需要大量修改。我尝试(1)仅当字段 3 包含字符串 exon
时才提取字段 9,(2)拆分“;
”分隔列表并提取 only每个唯一“transcript_id
”的最高值 exon_number n
,然后 (3) 计算每个转录本的 平均 外显子数。
awk ' == "exon" {print }' file
#step needed here to split the “;” separated fields to extract the max exon_number
awk '{ total += } END { print total/NR }' file
好吧,我还是不明白整个故事。所以无论如何:
# extract the lines with 3rd column equal to "exon"
awk ' == "exon"' |
# Extract the interesting parts with a regex. I like `sed` more, so I'll use it.
# extract the transcript_id and exon number
# separator is ';'
sed -n 's/.*transcript_id \([^;]*\).*exon_number \([^;]*\).*/;/p' |
# because I noticed that number is equal to count of lines
# just count the lines with the same transcript_id
# (actually exon_number is ignored)
# and then calculate the average of the array
awk -F';' '{a[]++} END{for (i in a) {cnt++; sum += a[i]} print sum/cnt}'
# or you could also like get the highest number from exon_number with like
# awk ... '{ if (a[] < ) a[]= } ...'
对于作为 tested on repl the script outputs 提供的输入:
6.66667
我在上面留下了 3 个命令,所以它看起来很冗长,同样的工作可以用一个 awk 脚本完成,这比 3 个进程链要快。只需在 awk
中使用 gensub
而不是 awk
中的 sed
和 split
即可将输入拆分为 awk 中的数组。