Snakemake syntax for multiple outputs with the use of checkpoint

Question

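I am building a pipeline with Snakemake. I have a checkpoint that should produce multiple output files. These output files are later all used in the expand call of my rule all. The problem is that I don't know how many files will be generated, so I can't specify the values in the expand.

The files are generated in an R script.

Example: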

rule all:
    input:
        expand("results/{output}",
               output=????)  # unknown until the R script has run



checkpoint rscript:
    input:
        "foo.input"
    output:
        report("somedir/{output}"),
    script:
        "../scripts/foo.R"

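Of course this is only a small excerpt, but I basically have a loop in my R script that writes multiple files into somedir. Because I don't know how many there will be, and because they are only determined when the R script runs, I can't set the outputs in the expand.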

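Maybe this is a trivial or even a silly question for some of you, and there is a better way to do this. If so, I would still appreciate an answer, because I have trouble understanding most Snakemake functions as they are described in English.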

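If there are further questions, I am happy to answer them. (The best case for me would be for the outputs to have names that I can specify at runtime in the R script.)

(I also cannot aggregate the created files in another rule, because each file shows a different plot.)

Edit: The main problem still seems to be that the checkpoint rscript cannot create multiple {output} files in "somedir/". I tried using touch("rscript_finish.flag"), with the loop in my R script writing to snakemake@output[[1]] on every iteration. That seems to either write only the svg file as "rscript_finish.flag" or overwrite "rscript_finish.flag".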

Answer 1

There are no stupid questions :). I hope I understood it correctly, and this is actually not a trivial problem at all!

def all_input(wildcards):
    checkpoints.rscript.get()  # make sure that checkpoint rscript is executed
    filenames, = glob_wildcards("somedir/{filenames}.png")  # find all the output files of rscript (names without ".png")
    return expand("somedir_cp/{fn}.png", fn=filenames)  # must match the output pattern of add_to_report


rule all:
    input:
        all_input


rule add_to_report:
    input:
        "somedir/{filename}.png"
    output:
        report("somedir_cp/{filename}.png")
    shell:
        "cp {input} {output}"


checkpoint rscript:
    input:
        "foo.input"
    output:
        touch("rscript_finish.flag")
    script:
        "../scripts/foo.R"

I haven't actually tested the code, so I'm not sure it works right away, but I think the logic is correct.

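The way to solve this is with an extra rule, which I called add_to_report. All this rule does is copy the existing output of rscript and add it to the report. rule all works through its input function all_input, which first triggers the execution of checkpoint rscript. Once the checkpoint has run, the function finds all the files it generated. It then declares that rule all needs a copy of every file generated by rscript as input. Those copies are produced by rule add_to_report, and so the files are added to the report.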
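To make the discovery step more concrete, here is a minimal sketch in plain Python (no Snakemake needed) of roughly what all_input computes once the checkpoint has finished. The function name discover_targets and the file names plot_a.png, plot_b.png and notes.txt are made up for illustration. Unlike glob_wildcards, this sketch only looks at the top level of the directory.

```python
import tempfile
from pathlib import Path


def discover_targets(src_dir, dst_dir, suffix=".png"):
    """Map every src_dir/<name><suffix> file to dst_dir/<name><suffix>.

    Hypothetical helper: it imitates the glob_wildcards + expand step of
    all_input, but only scans the top level of src_dir.
    """
    # Strip the suffix to get the wildcard value, as glob_wildcards would.
    names = sorted(p.name[: -len(suffix)] for p in Path(src_dir).glob(f"*{suffix}"))
    # Re-attach the suffix so each target matches "somedir_cp/{filename}.png".
    return [f"{dst_dir}/{name}{suffix}" for name in names]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for the R script: the number of plots is not known in advance.
        for name in ("plot_a.png", "plot_b.png", "notes.txt"):
            (Path(tmp) / name).touch()
        print(discover_targets(tmp, "somedir_cp"))
        # ['somedir_cp/plot_a.png', 'somedir_cp/plot_b.png']
```

The point to take away is that the list of targets is built only after the files exist, and that each target has to carry the same ".png" suffix as the output pattern of add_to_report. Otherwise no rule can produce it.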

pipeline

snakemake