How do I get Snakemake to apply all samples to a single rule, before proceeding to the next rule?
On a machine with j cores, given a RuleB that depends on RuleA, I would like Snakemake to execute my workflow as follows:
RuleA Sample1 using j threads
RuleA Sample2 using j threads
...
RuleA SampleN using j threads
RuleB Sample1 using 1 thread
RuleB Sample2 using 1 thread
...
RuleB SampleN using 1 thread
with RuleB executing on j samples at once.
Instead, the workflow executes as follows:
RuleA Sample1 using j threads
RuleB Sample1 using 1 thread
RuleA Sample2 using j threads
RuleB Sample2 using 1 thread
...
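with RuleB executing on only 1 sample at a time.
Executed in that order, RuleB cannot be parallelized, and the workflow runs much slower than it could.
More specifically, I want to align reads to a genome using STAR and quantify them using RNA-SeQC. The RNA-SeQC tool is single-threaded, whereas STAR can use multiple threads on a single sample.
This results in Snakemake aligning the reads in sample 1 and then quantifying them with rnaseqc, after which it proceeds to do the same for sample 2. I would like it to align all samples first and only then proceed to quantify them (that way it would be able to run multiple instances of the single-threaded rnaseqc tool at once).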
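To put rough numbers on the difference, here is a back-of-the-envelope sketch in Python. The function names and the durations are made up for illustration: about 8 minutes per alignment is roughly what the log further down shows for one sample, while 5 minutes per quantification is a pure guess. It assumes each STAR job is given all the cores, so nothing else runs alongside an alignment.

```python
import math

def interleaved_runtime(n_samples, align_time, quant_time):
    """Each sample is aligned and then quantified before the next one starts."""
    return n_samples * (align_time + quant_time)

def batched_runtime(n_samples, cores, align_time, quant_time):
    """All alignments run first, one at a time on every core; afterwards the
    single-threaded quantifications run `cores` at a time."""
    align_phase = n_samples * align_time
    quant_phase = math.ceil(n_samples / cores) * quant_time
    return align_phase + quant_phase

# Hypothetical numbers: 80 samples, 18 cores, 8 min per alignment, 5 min per quantification.
print(interleaved_runtime(80, 8, 5))   # minutes when the rules alternate
print(batched_runtime(80, 18, 8, 5))   # minutes when all alignments come first
```

Under these assumed numbers the alternating order takes 1040 minutes and the batched order 665 minutes; the longer the single-threaded step is relative to the alignment, the larger the gap.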
Relevant excerpt of the Snakefile:
sample_basename = ["RNA-seq_L{}_S{}".format(x, y) for x, y in zip(range(1, 41), range(1, 41))]
sample_lane = [seq + "_L00{}".format(x) for x in [1, 2] for seq in sample_basename]

rule all:
    input:
        expand("rnaseqc/{s_l}/{s_l}.gene_tpm.gct", s_l=sample_lane)

rule run_star:
    input:
        index_dir=rules.star_index.output.index_dir,
        fq1="data/fastq/{sample}_R1_001.fastq.gz",
        fq2="data/fastq/{sample}_R2_001.fastq.gz",
    output:
        "star/{sample}/{sample}Aligned.sortedByCoord.out.bam",
        "star/{sample}/{sample}Aligned.toTranscriptome.out.bam",
        "star/{sample}/{sample}ReadsPerGene.out.tab",
        "star/{sample}/{sample}Log.final.out"
    log:
        "logs/star/{sample}.log"
    params:
        extra="--quantMode GeneCounts TranscriptomeSAM --chimSegmentMin 20 --outSAMtype BAM SortedByCoordinate",
        sample_name="{sample}"
    threads: 18
    script:
        "scripts/star_align.py"

rule rnaseqc:
    input:
        bam="star/{sample}/{sample}Aligned.sortedByCoord.out.bam",
        gtf="data/gencode.v19.annotation.patched.collapsed.gtf"
    output:
        "rnaseqc/{sample}/{sample}.exon_reads.gct",
        "rnaseqc/{sample}/{sample}.gene_fragments.gct",
        "rnaseqc/{sample}/{sample}.gene_reads.gct",
        "rnaseqc/{sample}/{sample}.gene_tpm.gct",
        "rnaseqc/{sample}/{sample}.metrics.tsv"
    params:
        extra="-s {sample} --legacy",
        output_dir="rnaseqc/{sample}"
    log:
        "logs/rnaseqc/{sample}"
    shell:
        "rnaseqc.v2.3.4.linux {params.extra} {input.gtf} {input.bam} {params.output_dir} 2> {log}"
Oddly enough, doing a dry run with snakemake -np -j does the right thing:
[Mon Oct 21 13:08:11 2019]
rule run_star:
input: data/STAR/, data/fastq/RNA-seq_L182_S16_L002_R1_001.fastq.gz, data/fastq/RNA-seq_L182_S16_L002_R2_001.fastq.gz
output: star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Aligned.sortedByCoord.out.bam, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Aligned.toTranscriptome.out.bam, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002ReadsPerGene.out.tab, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Log.final.out
log: logs/star/RNA-seq_L182_S16_L002.log
jobid: 1026
wildcards: sample=RNA-seq_L182_S16_L002
threads: 18
[Mon Oct 21 13:08:11 2019]
rule run_star:
input: data/STAR/, data/fastq/RNA-seq_L173_S7_L001_R1_001.fastq.gz, data/fastq/RNA-seq_L173_S7_L001_R2_001.fastq.gz
output: star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Aligned.sortedByCoord.out.bam, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Aligned.toTranscriptome.out.bam, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001ReadsPerGene.out.tab, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Log.final.out
log: logs/star/RNA-seq_L173_S7_L001.log
jobid: 737
wildcards: sample=RNA-seq_L173_S7_L001
threads: 18
...
[Mon Oct 21 13:10:50 2019]
rule rnaseqc:
input: star/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
output: rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.exon_reads.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_fragments.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_reads.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_tpm.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.metrics.tsv
log: logs/rnaseqc/RNA-seq_L221_S15_L001
jobid: 215
wildcards: sample=RNA-seq_L221_S15_L001
rnaseqc.v2.3.4.linux -s RNA-seq_L221_S15_L001 --legacy data/gencode.v19.annotation.patched.collapsed.gtf star/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001Aligned.sortedByCoord.out.bam rnaseqc/RNA-seq_L221_S15_L001 2> logs/rnaseqc/RNA-seq_L221_S15_L001
[Mon Oct 21 13:10:50 2019]
rule rnaseqc:
input: star/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
output: rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.exon_reads.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_fragments.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_reads.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_tpm.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.metrics.tsv
log: logs/rnaseqc/RNA-seq_L284_S38_L001
jobid: 278
wildcards: sample=RNA-seq_L284_S38_L001
But executing snakemake -j without the -np flag does not:
[Mon Oct 21 13:13:49 2019]
rule run_star:
input: data/STAR/, data/fastq/RNA-seq_L249_S3_L001_R1_001.fastq.gz, data/fastq/RNA-seq_L249_S3_L001_R2_001.fastq.gz
output: star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.sortedByCoord.out.bam, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.toTranscriptome.out.bam, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001ReadsPerGene.out.tab, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Log.final.out
log: logs/star/RNA-seq_L249_S3_L001.log
jobid: 813
wildcards: sample=RNA-seq_L249_S3_L001
threads: 18
Aligning RNA-seq_L249_S3_L001
[Mon Oct 21 13:21:33 2019]
Finished job 813.
2 of 478 steps (0.42%) done
[Mon Oct 21 13:21:33 2019]
rule rnaseqc:
input: star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
output: rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.exon_reads.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_fragments.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_reads.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_tpm.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.metrics.tsv
log: logs/rnaseqc/RNA-seq_L249_S3_L001
jobid: 243
wildcards: sample=RNA-seq_L249_S3_L001
I am using the latest version of Snakemake available through Conda: 5.5.2
Maybe what you are looking for is to give the rule running STAR a higher priority than the rule running rnaseqc. If so, have a look at the priority directive, e.g.:
rule star:
    priority: 50
    ...

rule rnaseqc:
    priority: 0
    ...
(Not tested) This should run all the STAR jobs first, one at a time since each of them needs 18 cores, and then all the rnaseqc jobs in parallel.
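To illustrate why a priority can change the order, here is a toy greedy scheduler in plain Python. It is only a sketch and not Snakemake's actual scheduler, which is more involved. The Job class, simulate, build_jobs, the durations, and the scaled-down machine (4 cores instead of 18, 3 samples) are all made up for illustration. When priorities are equal, this toy breaks ties in favour of jobs that have upstream dependencies, an assumption picked only to reproduce the interleaving seen in the question's log.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    threads: int
    duration: int
    priority: int = 0
    deps: list = field(default_factory=list)

def simulate(jobs, cores):
    """Whenever cores free up, start the ready jobs with the highest priority
    that still fit. Returns (job names in start order, total runtime)."""
    pending = list(jobs)
    running = []        # (end_time, job) pairs
    finished = set()
    order = []
    now = 0
    while pending or running:
        free = cores - sum(job.threads for _, job in running)
        ready = [j for j in pending if all(d in finished for d in j.deps)]
        # Highest priority first; ties favour downstream jobs (toy assumption).
        ready.sort(key=lambda j: (-j.priority, -len(j.deps)))
        for job in ready:
            if job.threads <= free:
                free -= job.threads
                running.append((now + job.duration, job))
                pending.remove(job)
                order.append(job.name)
        if not running:
            raise RuntimeError("no runnable job fits into the available cores")
        # Advance the clock to the next job completion.
        now = min(end for end, _ in running)
        finished.update(job.name for end, job in running if end == now)
        running = [(end, job) for end, job in running if end > now]
    return order, now

def build_jobs(n_samples, star_priority):
    """One multi-threaded alignment and one single-threaded quantification per sample."""
    jobs = [Job(f"star_{i}", threads=4, duration=8, priority=star_priority)
            for i in range(1, n_samples + 1)]
    jobs += [Job(f"rnaseqc_{i}", threads=1, duration=5, deps=[f"star_{i}"])
             for i in range(1, n_samples + 1)]
    return jobs

print(simulate(build_jobs(3, star_priority=0), cores=4))
print(simulate(build_jobs(3, star_priority=50), cores=4))
```

With equal priorities the toy alternates star and rnaseqc jobs and finishes after 39 time units. With the alignment priority raised to 50 it starts all three star jobs first, then runs the three rnaseqc jobs side by side, and finishes after 29.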