Introduction to Snakemake

Introduction to Snakemake workflows

2024-10-15

Chloé QUIGNOT (BIOI2 @I2BC) - ORCID: 0000-0001-8504-232X

Source: adapted from FAIRbioinfo 2021 training material of the IFB and Snakemake introduction tutorial from BIOI2

Material under CC-BY-SA licence

The principle behind Snakemake

Snakemake = Python (aka “snake”, a programming language) + Make (a rule-based automation tool)

Workflows are like legos:

Workflows are made up of blocks, each block performs a specific (set of) instruction(s)

workflow divided into rules

1 “block” = 1 rule:

- 1 rule = 1 instruction (ideally)
- inputs and outputs are one or multiple files
- at least 1 input and/or 1 output per rule

Linking data flows

Rule order is not important…

execution order ≠ code order => Snakemake does a pick & mix of the rules it needs at execution

…but matching file names is key!

Rules are linked together by Snakemake using matching filenames in their input and output directives.

2 rules linked together

At execution, Snakemake builds a DAG (directed acyclic graph) of the jobs to run and follows it to generate the final output of your pipeline.
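As a rough illustration of this linking step, here is a minimal plain-Python sketch. It is not Snakemake's actual implementation: the rule names (`align`, `merge`) and file names are hypothetical, and the rules are deliberately declared in the "wrong" order to show that only matching file names matter.

```python
# Hypothetical rules, declared in arbitrary order.
# Each rule only knows its own input and output file names.
rules = {
    "align": {"input": ["merged.fasta"], "output": ["aligned.fasta"]},
    "merge": {"input": ["a.fasta", "b.fasta"], "output": ["merged.fasta"]},
}


def link_rules(rules):
    """Return (upstream, downstream, file) edges wherever the output
    of one rule has the same name as the input of another rule."""
    # Map every output file to the rule that produces it.
    producers = {f: name for name, r in rules.items() for f in r["output"]}
    edges = []
    for name, r in rules.items():
        for f in r["input"]:
            if f in producers:
                edges.append((producers[f], name, f))
    return edges


edges = link_rules(rules)
print(edges)
```

The single edge found goes from `merge` to `align` through `merged.fasta`, even though `align` was declared first. Files such as `a.fasta` that no rule produces are simply expected to exist already.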

A workflow example

Below is a workflow example using 2 tools sequentially to align 2 protein sequences:

In this example, we have:

2 linked rules: fusionFasta and mafft
input protein sequence files named *.fasta
an intermediate file generated by fusionFasta named *fused.fasta
the final output named *aligned.fasta generated by mafft

How Snakemake creates your workflow

How Snakemake creates your workflow (summary)

Snakemake (Smk) steps:

- Smk creates the DAG from the Snakefile
- Smk sees that the final output `*aligned.fasta` doesn’t exist but knows it can create it with the `mafft` rule
- `mafft` needs files matching `*fused.fasta` (which don’t exist) but the `fusionFasta` rule can generate them
- `fusionFasta` needs `.fasta` files
- `.fasta` files exist! Smk stops backtracking
- Smk runs the `fusionFasta` rule
- `P10415_P01308_fused.fasta` now exists and feeds the `mafft` rule
- the final output (`P10415_P01308_aligned.fasta`) is generated, the workflow has finished
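The backtracking described in these steps can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not what Snakemake really does internally: the `plan` function and the `existing` set (standing in for the files on disk) are hypothetical, and timestamps are ignored.

```python
def plan(target, rules, existing):
    """Return the rule names to run, in execution order, to create `target`."""
    if target in existing:
        return []  # the file is already there: stop backtracking
    for name, rule in rules.items():
        if target in rule["output"]:
            order = []
            # First plan whatever is needed to create the inputs of this rule.
            for f in rule["input"]:
                for step in plan(f, rules, existing):
                    if step not in order:
                        order.append(step)
            return order + [name]
    raise FileNotFoundError(f"no rule produces {target}")


rules = {
    "mafft": {
        "input": ["P10415_P01308_fused.fasta"],
        "output": ["P10415_P01308_aligned.fasta"],
    },
    "fusionFasta": {
        "input": ["P10415.fasta", "P01308.fasta"],
        "output": ["P10415_P01308_fused.fasta"],
    },
}
existing = {"P10415.fasta", "P01308.fasta"}  # stand-in for files on disk

order = plan("P10415_P01308_aligned.fasta", rules, existing)
print(order)
```

Starting from the final output, the sketch walks back through `mafft` and `fusionFasta` until it reaches files that exist, then returns the rules in the order they have to run.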

Rules are run when outputs are missing… but not only

Snakemake’s job is to make sure that everything is up-to-date, otherwise it (re-)runs the rules that need to be run…

Rules are run if:

output doesn’t exist
output exists but is older than the input
changes detected in parameters, code or tool versions since last execution
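The first two criteria can be illustrated with a small plain-Python sketch based on file modification times. The `needs_rerun` helper is hypothetical and only covers the "missing" and "older than the input" cases; detecting changes in parameters, code or tool versions relies on metadata that Snakemake stores itself and is not shown here.

```python
import os
import tempfile


def needs_rerun(output, inputs):
    """True if `output` is missing or older than any of its `inputs`."""
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(i) > out_mtime for i in inputs)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.fasta")
    out = os.path.join(tmp, "out.fasta")
    open(src, "w").close()

    missing = needs_rerun(out, [src])  # output does not exist yet

    open(out, "w").close()
    os.utime(src, (1000, 1000))  # input older than output
    os.utime(out, (2000, 2000))
    fresh = needs_rerun(out, [src])

    os.utime(src, (3000, 3000))  # input modified after the output was made
    stale = needs_rerun(out, [src])

print(missing, fresh, stale)
```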

The Snakemake world

Many default files constitute the “Snakemake system” & there are standards on how to organise them.

They are not all necessary for a basic pipeline execution.

The most important is the Snakefile, that’s where all the code is saved.

For more information: https://github.com/snakemake-workflows/snakemake-workflow-template

Within the Snakefile…

The Snakefile is where rules are defined
The basic syntax of a rule is:

rule myRuleName:
    input: "myInputFile"
    output: "myOutputFile"
    shell: "echo {input} > {output}"

single rule example

=> Rules usually have a unique name which defines them

=> input, output, shell etc. are called directives

=> "myInputFile" & "myOutputFile" specify 1 or more input & output files
=> shell specifies what to do (shell commands in this case -> alternative directives exist)

=> code alignment (=indentations) is important
=> files and shell directives should be given within quotes (', " or """ for multi-line code)
=> additional & optional directives exist, e.g.: params:, resources:, log:, etc. (we’ll see them later)

For more information: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html

Snakefile of the previous example

rule fusionFasta:
    input:
        p1="P10415.fasta",
        p2="P01308.fasta",
    output:
        "P10415_P01308_fused.fasta",
    shell:
        """
        cat {input.p1} {input.p2} > {output}
        """
    
rule mafft:
    input:
        "P10415_P01308_fused.fasta",
    output:
        "P10415_P01308_aligned.fasta",
    shell:
        """
        mafft {input} > {output} 
        """

2 rules: fusionFasta & mafft
fusionFasta: 2 input (p1 & p2) & 1 output file
mafft: 1 input & 1 output file

NB: input & output files can be named
e.g. p1="P10415.fasta"
and explicitly accessed in shell
e.g. {input.p1} or {input[0]}
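To see why both `{input.p1}` and `{input[0]}` can point to the same file, here is a plain-Python sketch using `str.format`. The `NamedFiles` class is a hypothetical helper written for this illustration; it only mimics the behaviour described above and is not Snakemake's own code.

```python
class NamedFiles(list):
    """A list of file names whose items can also be reached by name
    (hypothetical helper, for illustration only)."""

    def __init__(self, **named):
        super().__init__(named.values())
        for name, value in named.items():
            setattr(self, name, value)

    def __str__(self):
        # Used when the whole object is inserted, e.g. "{input}"
        return " ".join(self)


inputs = NamedFiles(p1="P10415.fasta", p2="P01308.fasta")

by_name = "cat {input.p1} {input.p2}".format(input=inputs)
by_index = "cat {input[0]} {input[1]}".format(input=inputs)
whole = "cat {input}".format(input=inputs)

print(by_name)
print(by_index)
print(whole)
```

All three strings expand to the same command, which is why the choice between names, indices or the whole `{input}` is mostly a matter of readability.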

The concept of target rule

If you try running the previous example, only fusionFasta will be run. Why?

=> because in this case, fusionFasta is the target rule…

What’s a target rule?

it specifies the final result files that should be generated
it’s the first (and “only”) rule executed by Snakemake
if other rules are run, it’s only because they’re needed to create the files expected by the target rule

How do you define a rule as target?

Technically, any rule could be a target rule…

Default: first rule in the file (i.e. fusionFasta)
Or: use default_target: True directive
Or: specify the rule or its output in the command line at execution

For more information: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#target-rules/

Adding a target rule to the previous example

Here, we could make mafft the target rule. But it’s common practice to create a new dedicated rule, often called target or all. This rule will list all final output files of the pipeline.

rule all:
    input: "P10415_P01308_aligned.fasta",

rule fusionFasta:
    input:
        p1="P10415.fasta",
        p2="P01308.fasta",
    output:
        "P10415_P01308_fused.fasta",
    shell:
        """
        cat {input.p1} {input.p2} > {output}
        """
    
rule mafft:
    input:
        "P10415_P01308_fused.fasta",
    output:
        "P10415_P01308_aligned.fasta",
    shell:
        """
        mafft {input} > {output}
        """

Here, we created a third rule called all.

It’s the target rule here because it’s the first rule in the Snakefile.

It lists all final output files of our pipeline.

The power of wildcards

In Snakemake, rules can be generalised using wildcards to replace parts of file names:

reduces hardcoding: more flexible input and output directives, and adaptable to new data without modification
automatically resolved (i.e. each wildcard is matched by the regular expression ".+" by default)
written between braces: {}
specific per rule

The same file can be matched by different patterns, e.g. the file 101/file.A.txt could be described by:
- output : "{set}1/file.{grp}.txt" in which case set=10, grp=A
- output : "{set}/file.A.{ext}" in which case set=101, ext=txt
- output : "{filename}.txt" in which case filename=101/file.A
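The three patterns above can be checked with a short plain-Python sketch that turns each `{name}` into a named regex group matching `.+`. The helper names (`pattern_to_regex`, `match_wildcards`) are hypothetical and the translation is deliberately simplified (for instance, it does not handle a wildcard name used twice in one pattern), so treat it as an approximation of the idea rather than Snakemake's implementation.

```python
import re


def pattern_to_regex(pattern):
    """Turn 'dir/{x}.txt' into a regex with one named '.+' group per wildcard."""
    # re.split with a capturing group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    return re.compile(regex + "$")


def match_wildcards(pattern, filename):
    """Return the wildcard values as a dict, or None if the file does not match."""
    m = pattern_to_regex(pattern).match(filename)
    return m.groupdict() if m else None


filename = "101/file.A.txt"
first = match_wildcards("{set}1/file.{grp}.txt", filename)
second = match_wildcards("{set}/file.A.{ext}", filename)
third = match_wildcards("{filename}.txt", filename)

print(first)
print(second)
print(third)
```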

The most important condition with wildcards: if they are used in both input and output directives, they should match (i.e. same number, same names). The names (or keywords) used are totally arbitrary.

For more information on wildcards: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards/

Adding wildcards to the previous example

Here, we generalised rules fusionFasta and mafft to any fasta input thanks to wildcards. In this case, the values of the given wildcards are deduced by Snakemake using the final output files defined in the all rule.

rule all:
    input: "P10415_P01308_aligned.fasta",

rule fusionFasta:
    input:
        p1="{upid1}.fasta",
        p2="{upid2}.fasta",
    output:
        "{upid1}_{upid2}.fasta",
    shell:
        """
        cat {input.p1} {input.p2} > {output}
        """
    
rule mafft:
    input:
        "{prefix}.fasta",
    output:
        "{prefix}_aligned.fasta",
    shell:
        """
        mafft {input} > {output} 
        """

mafft DAG

A last word about wildcards

adding constraints

You can change the default regex of given wildcards with the wildcard_constraints: directive

For example:

regex: wildcard_constraints: upid1="[A-Z0-9]+"
list of values: wildcard_constraints: upid1="P10415|P01308"
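The effect of such a constraint can be illustrated with plain Python regular expressions. In this sketch, the hypothetical `match` helper hand-writes the regex for the pattern `{upid1}_{upid2}.fasta` and, as an assumption made for the example, applies the same regex to both wildcards.

```python
import re


def match(filename, upid_regex):
    """Match '{upid1}_{upid2}.fasta', using `upid_regex` for both wildcards."""
    regex = rf"(?P<upid1>{upid_regex})_(?P<upid2>{upid_regex})\.fasta$"
    m = re.match(regex, filename)
    return m.groupdict() if m else None


# Default regex ".+": the pattern also matches the aligned file
unconstrained = match("P10415_P01308_aligned.fasta", ".+")

# Constrained regex: lowercase letters and "_" are no longer allowed
constrained = match("P10415_P01308_aligned.fasta", "[A-Z0-9]+")
fused = match("P10415_P01308.fasta", "[A-Z0-9]+")

print(unconstrained)
print(constrained)
print(fused)
```

With the default `.+`, the pattern greedily accepts `P10415_P01308` and `aligned` as wildcard values. With the constraint, only the intended pair of identifiers is matched.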

accessing wildcards

Within input & output directives: {wildcard_name}
Within other directives of the same rule:
- within shell: use the wildcards keyword, e.g. shell: "echo {wildcards.upid1}"
- within other directives: use functions (will see this later)

How to run a Snakemake pipeline?

Once Snakemake is installed:

move into the directory containing the Snakefile
type snakemake --cores 1 to run the pipeline (--cores specifies the number of cores to use)

Snakemake’s monologue & its hidden treasure chest

When you run Snakemake, it prints a full progress report on the screen:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
fusionFasta    1
mafft          1
all            1
total          3

[...]

3 of 3 steps (100%) done
Complete log: .snakemake/log/2024-02-20T150605.574089.snakemake.log

When it’s finished, a .snakemake folder will appear in your working directory:

it can be heavy (when using environments)
it can contain a lot of files (unsuited for some file systems)
it’s a hidden folder, so use ls -a to see it
don’t forget to remove it once you’re sure you’ve finished your analysis

Useful debugging options

Visualise the Snakemake DAG

Snakemake can print the complete workflow (--dag), the rule dependencies (--rulegraph) or the rule dependencies with their I/O files (--filegraph) in the dot language. Pipe this output into the dot tool of the graphviz package to create a png, pdf or other format:

snakemake --dag | dot -Tpng > dag.png
snakemake --rulegraph | dot -Tpng > rule.png
snakemake --filegraph | dot -Tpng > file.png

Use the `--dry-run` option

Using this option will perform a “dry-run” i.e. nothing will be executed but everything that would’ve been run is displayed on the screen

Other useful options for debugging when running Snakemake

print the shell commands that are run: -p, --printshellcmds
print a detailed summary and status of each output file: -D, --detailed-summary, e.g.

output_file	date	rule	log-file(s)	input-file(s)	shellcmd	status	plan
`P10415.fasta`	01/10/24	loadData			`wget https://www.uniprot.org/uniprot/P10415.fasta`	ok	no update

All command line options: https://snakemake.readthedocs.io/en/stable/executing/cli.html#all-options

Conclusion

So far, we’ve seen:

Snakemake workflow = set of rules
Rules are written in Snakefiles
Snakemake links rules together by matching up common input/output files
Rules are defined by their name and contain directives (of which input and output to specify input & output files):

rule myRuleName:
    input: "myInputFile"
    output: "myOutputFile"
    shell: "echo {input} > {output}"

Snakemake only executes the target rule and the rules needed to generate the files it asks for
Rules can be generalised using wildcards
A Snakefile is run with the snakemake --cores 1 command (+ other options available)
Debugging options: --dag, --rulegraph, --filegraph and --dry-run
