2024-10-15
Chloé QUIGNOT (BIOI2 @I2BC) - ORCID: 0000-0001-8504-232X
Source: adapted from FAIRbioinfo 2021 training material of the IFB and Snakemake introduction tutorial from BIOI2
Material under CC-BY-SA licence
Snakemake = Python (aka “snake”, a programming language) + Make (a rule-based automation tool)
Workflows are made up of blocks, each block performs a specific (set of) instruction(s)
- 1 rule = 1 instruction (ideally)
- inputs and outputs are one or multiple files
- at least 1 input and/or 1 output per rule
execution order ≠ code order => Snakemake does a pick & mix of the rules it needs at execution
Rules are linked together by Snakemake using matching filenames in their input and output directives.
At execution, Snakemake creates a DAG (directed acyclic graph), that it will follow to generate the final output of your pipeline.
Below is a workflow example using 2 tools sequentially to align 2 protein sequences:
In this example, we have:
- 2 rules: fusionFasta and Mafft
- input files: *.fasta
- an intermediate file: *fused.fasta
- a final output: *aligned.fasta, generated by Mafft

Snakemake (Smk) steps | running path |
---|---|
Smk creates the DAG from the snakefile | |
Smk sees that the final output *aligned.fasta doesn’t exist but knows it can create it with the Mafft rule | |
Mafft needs files matching *fused.fasta (which don’t exist) but the fusionFasta rule can generate them | |
fusionFasta needs .fasta files | |
Snakemake steps | running path |
---|---|
.fasta files exist! Smk stops backtracking | |
Smk runs the fusionFasta rule | |
P10415_P01308_fused.fasta exists and feeds the Mafft rule | |
the final output (P10415_P01308_aligned.fasta) is generated, the workflow has finished | |
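The backtracking described in the steps above can be sketched in plain Python. This is only a simplified model, not Snakemake's implementation: the `RULES` dictionary and the `plan` function are hypothetical names introduced for illustration, file names are fixed (no wildcards), and timestamps are ignored.

```python
# Hypothetical, simplified model of the two rules of the example.
RULES = {
    "fusionFasta": {
        "input": ["P10415.fasta", "P01308.fasta"],
        "output": "P10415_P01308_fused.fasta",
    },
    "mafft": {
        "input": ["P10415_P01308_fused.fasta"],
        "output": "P10415_P01308_aligned.fasta",
    },
}


def plan(target, existing_files, rules=RULES):
    """Return the rule names to run, in execution order, to produce `target`."""
    if target in existing_files:
        return []  # the file is already there: stop backtracking
    for name, rule in rules.items():
        if rule["output"] == target:
            order = []
            # Backtrack through every input of the rule that produces the target.
            for needed in rule["input"]:
                for step in plan(needed, existing_files, rules):
                    if step not in order:
                        order.append(step)
            return order + [name]
    raise FileNotFoundError(f"no rule produces {target!r}")


existing = {"P10415.fasta", "P01308.fasta"}
print(plan("P10415_P01308_aligned.fasta", existing))
# ['fusionFasta', 'mafft']
```

The order in which the rules are written in `RULES` does not matter: the plan is derived from the requested file, which mirrors "execution order ≠ code order".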
Snakemake’s job is to make sure that everything is up-to-date, otherwise it (re-)runs the rules that need to be run…
Rules are run if:
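One condition that Snakemake shares with Make is file freshness: a rule has to be (re-)run when its output is missing or older than one of its inputs. The sketch below illustrates only that check, using a hypothetical `needs_run` helper and hypothetical file names; Snakemake takes other triggers into account as well, which are not modelled here.

```python
import os
import tempfile


def needs_run(inputs, output):
    """True if `output` is missing or older than at least one input file."""
    if not os.path.exists(output):
        return True
    output_time = os.path.getmtime(output)
    return any(os.path.getmtime(path) > output_time for path in inputs)


with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "P10415.fasta")
    result = os.path.join(workdir, "result.fasta")  # hypothetical output name
    open(source, "w").close()
    print(needs_run([source], result))  # True: the output does not exist yet

    open(result, "w").close()
    os.utime(source, (100, 100))  # set explicit timestamps for the demo
    os.utime(result, (200, 200))
    print(needs_run([source], result))  # False: the output is newer

    os.utime(source, (300, 300))
    print(needs_run([source], result))  # True: the input changed afterwards
```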
Many default files constitute the “Snakemake system” & there are standards on how to organise them.
They are not all necessary for a basic pipeline execution.
The most important is the Snakefile, that’s where all the code is saved.
For more information: https://github.com/snakemake-workflows/snakemake-workflow-template
rule myRuleName:
    input: "myInputFile"
    output: "myOutputFile"
    shell: "echo {input} > {output}"

=> Rules usually have a unique name which defines them
=> input, output, shell etc. are called directives
=> "myInputFile" & "myOutputFile" specify 1 or more input & output files
=> shell specifies what to do (shell commands in this case -> alternative directives exist)
=> {input} & {output} are placeholders & are replaced by input & output file names at execution
=> code alignment (= indentation) is important
=> files and shell directives should be given within quotes (', " or """ for multi-line code)
=> additional & optional directives exist, e.g. params:, resources:, log:, etc. (we’ll see them later)
For more information: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
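The placeholder substitution can be pictured with Python's own `str.format`, which uses the same `{name}` syntax. This is an analogy under that assumption, not Snakemake's actual code; the `rule` dictionary and `render_shell` function are hypothetical.

```python
# Hypothetical stand-in for the rule shown above.
rule = {
    "input": "myInputFile",
    "output": "myOutputFile",
    "shell": "echo {input} > {output}",
}


def render_shell(rule):
    """Replace the {input} and {output} placeholders by the rule's file names."""
    return rule["shell"].format(input=rule["input"], output=rule["output"])


print(render_shell(rule))  # echo myInputFile > myOutputFile
```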
rule fusionFasta:
    input:
        p1="P10415.fasta",
        p2="P01308.fasta",
    output:
        "P10415_P01308_fused.fasta",
    shell:
        """
        cat {input.p1} {input.p2} > {output}
        """

rule mafft:
    input:
        "P10415_P01308_fused.fasta",
    output:
        "P10415_P01308_aligned.fasta",
    shell:
        """
        mafft {input} > {output}
        """
Here, we have 2 rules, fusionFasta & mafft:
- fusionFasta: 2 inputs (p1 & p2) & 1 output file
- mafft: 1 input & 1 output file

NB: input & output files can be named, e.g. p1="P10415.fasta", and explicitly accessed in shell, e.g. {input.p1} or {input[0]}
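To see why both {input.p1} and {input[0]} can point to the same file, here is a minimal Python sketch of a list whose items are also reachable by name. `NamedFiles` is a hypothetical class written for this illustration; it only imitates the behaviour described above and is not Snakemake's own data structure.

```python
class NamedFiles(list):
    """A list of file names whose items can also be reached by name."""

    def __init__(self, **named):
        super().__init__(named.values())
        for name, path in named.items():
            setattr(self, name, path)  # enables the {input.p1} form

    def __str__(self):
        # A bare {input} expands to all files, separated by spaces.
        return " ".join(self)


inputs = NamedFiles(p1="P10415.fasta", p2="P01308.fasta")
print("cat {input.p1} {input.p2}".format(input=inputs))
print("cat {input[0]} {input[1]}".format(input=inputs))
print("cat {input}".format(input=inputs))
# all three lines print: cat P10415.fasta P01308.fasta
```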
If you try running the previous example, it will only run fusionFasta. Why?
=> because in this case, fusionFasta is the target rule…
Technically, any rule could be a target rule:
- by default, it is the first rule in the Snakefile (here, fusionFasta)
- another rule can be chosen with the default_target: True directive

For more information: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#target-rules/
Here, we could make mafft the target rule. But it’s common practice to create a new dedicated rule, often called target or all. This rule will list all final output files of the pipeline.
rule all:
    input: "P10415_P01308_aligned.fasta",

rule fusionFasta:
    input:
        p1="P10415.fasta",
        p2="P01308.fasta",
    output:
        "P10415_P01308_fused.fasta",
    shell:
        """
        cat {input.p1} {input.p2} > {output}
        """

rule mafft:
    input:
        "P10415_P01308_fused.fasta",
    output:
        "P10415_P01308_aligned.fasta",
    shell:
        """
        mafft {input} > {output}
        """
Here, we created a third rule called all.
It’s the target rule here because it’s the first rule in the Snakefile.
It lists all final output files of our pipeline.
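The choice of the target rule can be summarised by a small Python sketch. It assumes a simplified view in which rules are a list of (name, options) pairs in Snakefile order; `pick_target` is a hypothetical helper, and targets given on the command line are not modelled.

```python
def pick_target(rules):
    """rules: list of (rule name, options dict) pairs, in Snakefile order."""
    for name, options in rules:
        if options.get("default_target"):
            return name  # a rule flagged with default_target: True wins
    return rules[0][0]  # otherwise, the first rule of the Snakefile


# First version of the Snakefile: fusionFasta comes first.
print(pick_target([("fusionFasta", {}), ("mafft", {})]))
# fusionFasta

# Version with the dedicated rule "all" placed first.
print(pick_target([("all", {}), ("fusionFasta", {}), ("mafft", {})]))
# all
```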
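In Snakemake, rules can be generalised using wildcards to replace parts of file names:
- by default, a wildcard matches the regular expression ".+" (any non-empty string)
- wildcards are written within curly brackets {}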
The same file can be matched by different patterns, e.g. the file 101/file.A.txt could be encoded as:
- output : "{set}1/file.{grp}.txt" in which case set=10, grp=A
- output : "{set}/file.A.{ext}" in which case set=101, ext=txt
- output : "{filename}.txt" in which case filename=101/file.A
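The three readings of 101/file.A.txt can be reproduced with Python's `re` module by turning each {name} field into a named group that matches ".+". This is a sketch of the idea, not Snakemake's code: `pattern_to_regex` and `wildcards_for` are hypothetical helpers, and each wildcard is assumed to appear only once per pattern.

```python
import re


def pattern_to_regex(pattern):
    """Turn every {name} field into a named group matching ".+"."""
    regex = ""
    position = 0
    for field in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[position:field.start()])  # literal text
        regex += f"(?P<{field.group(1)}>.+)"                  # the wildcard
        position = field.end()
    regex += re.escape(pattern[position:])
    return re.compile(regex)


def wildcards_for(pattern, filename):
    """Return the wildcard values if `filename` fits `pattern`, else None."""
    match = pattern_to_regex(pattern).fullmatch(filename)
    return match.groupdict() if match else None


for pattern in ("{set}1/file.{grp}.txt", "{set}/file.A.{ext}", "{filename}.txt"):
    print(pattern, "->", wildcards_for(pattern, "101/file.A.txt"))
# {set}1/file.{grp}.txt -> {'set': '10', 'grp': 'A'}
# {set}/file.A.{ext} -> {'set': '101', 'ext': 'txt'}
# {filename}.txt -> {'filename': '101/file.A'}
```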
The most important condition with wildcards: if they are used in both input and output directives, they should match, i.e. every wildcard used in input must also appear, under the same name, in output. The names (or keywords) used are totally arbitrary.
For more information on wildcards: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards/
Here, we generalised rules fusionFasta and mafft to any fasta input thanks to wildcards. In this case, the values of the given wildcards are deduced by Snakemake using the final output files defined in the all rule.
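As an illustration of that deduction, the sketch below takes the file requested by the all rule, matches it against the output pattern of mafft ("{prefix}_aligned.fasta") and fills the resulting value into its input pattern ("{prefix}.fasta"). It is a minimal Python analogy with a hand-written regular expression; `mafft_input_for` is a hypothetical helper, not part of Snakemake.

```python
import re

# Output pattern of the mafft rule, with {prefix} written as a ".+" group.
MAFFT_OUTPUT = re.compile(r"(?P<prefix>.+)_aligned\.fasta")
# Input pattern of the mafft rule.
MAFFT_INPUT = "{prefix}.fasta"


def mafft_input_for(requested_output):
    """Deduce the input file mafft needs to produce `requested_output`."""
    match = MAFFT_OUTPUT.fullmatch(requested_output)
    if match is None:
        return None  # the mafft rule cannot produce this file
    return MAFFT_INPUT.format(**match.groupdict())


print(mafft_input_for("P10415_P01308_aligned.fasta"))  # P10415_P01308.fasta
```

The same step can then be repeated on P10415_P01308.fasta with the patterns of fusionFasta.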
rule all:
    input: "P10415_P01308_aligned.fasta",

rule fusionFasta:
    input:
        p1="{upid1}.fasta",
        p2="{upid2}.fasta",
    output:
        "{upid1}_{upid2}.fasta",
    shell:
        """
        cat {input.p1} {input.p2} > {output}
        """

rule mafft:
    input:
        "{prefix}.fasta",
    output:
        "{prefix}_aligned.fasta",
    shell:
        """
        mafft {input} > {output}
        """
You can change the default regex of given wildcards with the wildcard_constraints: directive. For example:
wildcard_constraints: upid1="[A-Z0-9]+"
wildcard_constraints: upid1="P10415|P01308"
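The effect of such a constraint can be observed with Python's `re` module on the output pattern "{upid1}_{upid2}.fasta". The sketch assumes that a wildcard behaves like a regex group (".+" by default); `match_fusion_output` is a hypothetical helper, and the result shown is how Python resolves the match.

```python
import re


def match_fusion_output(filename, upid1_regex=".+"):
    """Match "{upid1}_{upid2}.fasta", with an optional constraint on upid1."""
    pattern = rf"(?P<upid1>{upid1_regex})_(?P<upid2>.+)\.fasta"
    match = re.fullmatch(pattern, filename)
    return match.groupdict() if match else None


name = "P10415_P01308_aligned.fasta"

# Default regex: upid1 is free to swallow an underscore.
print(match_fusion_output(name))
# {'upid1': 'P10415_P01308', 'upid2': 'aligned'}

# Constrained regex: upid1 can only contain capital letters and digits.
print(match_fusion_output(name, upid1_regex="[A-Z0-9]+"))
# {'upid1': 'P10415', 'upid2': 'P01308_aligned'}
```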
- Within input & output directives: {wildcard_name}
- Within other directives of the same rule, e.g. shell: use the wildcards keyword, e.g. shell: "echo {wildcards.upid1}"
When Snakemake is installed:
- work in the directory that contains your Snakefile
- run snakemake --cores 1 to run the pipeline (--cores specifies the number of cores to use)

When you run Snakemake, a full report of its progress is printed on the screen:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job            count
-----------  -------
fusionFasta        1
mafft              1
all                1
total              3
[...]
3 of 3 steps (100%) done
Complete log: .snakemake/log/2024-02-20T150605.574089.snakemake.log
When it’s finished, a .snakemake folder will appear in your working directory (use ls -a to see it).

Snakemake can export the complete workflow (--dag), the rule dependencies (--rulegraph) or the rule dependencies with their I/O files (--filegraph) in dot language. Use the dot tool of the graphviz package to convert them to png, pdf or another format:
snakemake --dag | dot -Tpng > dag.png
snakemake --rulegraph | dot -Tpng > rule.png
snakemake --filegraph | dot -Tpng > file.png
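For readers unfamiliar with the dot language, the sketch below builds a small rule graph by hand in Python. It only illustrates what a dot description of rule dependencies looks like; the text that Snakemake itself prints is formatted differently, and `rulegraph_dot` and the `rules` dictionary are hypothetical.

```python
def rulegraph_dot(rules):
    """rules: {rule name: {"input": [...], "output": [...]}} -> dot text."""
    producers = {f: name for name, rule in rules.items() for f in rule["output"]}
    lines = ["digraph rulegraph {"]
    for name in rules:
        lines.append(f'    "{name}";')  # one node per rule
    for name, rule in rules.items():
        for needed in rule["input"]:
            if needed in producers:  # edge: producing rule -> consuming rule
                lines.append(f'    "{producers[needed]}" -> "{name}";')
    lines.append("}")
    return "\n".join(lines)


rules = {
    "all": {"input": ["P10415_P01308_aligned.fasta"], "output": []},
    "fusionFasta": {
        "input": ["P10415.fasta", "P01308.fasta"],
        "output": ["P10415_P01308_fused.fasta"],
    },
    "mafft": {
        "input": ["P10415_P01308_fused.fasta"],
        "output": ["P10415_P01308_aligned.fasta"],
    },
}
print(rulegraph_dot(rules))
```

Saved to a file, this text can be given to the dot tool in the same way as in the commands above.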
The --dry-run option: using this option performs a “dry-run”, i.e. nothing is executed but everything that would have been run is displayed on the screen.

Other options:
- -p / --printshellcmds: print the shell commands that will be executed
- -D: print a detailed summary of the files handled by the workflow, e.g.

output_file | date | rule | log-file(s) | input-file(s) | shellcmd | status | plan |
---|---|---|---|---|---|---|---|
P10415.fasta | 01/10/24 | loadData | | | wget https://www.uniprot.org/uniprot/P10415.fasta | ok | no update |
All command line options: https://snakemake.readthedocs.io/en/stable/executing/cli.html#all-options
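When Snakemake is launched from a Python script, the options seen above can be assembled programmatically. The helper below is a hypothetical convenience function, not part of Snakemake: it only builds the command line, which can then be passed to `subprocess.run`.

```python
import shlex


def snakemake_command(cores=1, dry_run=False, print_shell=False):
    """Build a snakemake command line as a list of arguments."""
    command = ["snakemake", "--cores", str(cores)]
    if dry_run:
        command.append("--dry-run")
    if print_shell:
        command.append("--printshellcmds")
    return command


command = snakemake_command(cores=1, dry_run=True, print_shell=True)
print(shlex.join(command))  # snakemake --cores 1 --dry-run --printshellcmds
# To launch it for real: subprocess.run(command, check=True)
```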
So far, we’ve seen:
- how to write rules (with input and output to specify input & output files):

rule myRuleName:
    input: "myInputFile"
    output: "myOutputFile"
    shell: "echo {input} > {output}"

- how to generalise rules with wildcards
- how to run a pipeline with the snakemake --cores 1 command (+ other options available)
- how to visualise or check a pipeline with --dag, --rulegraph, --filegraph and --dry-run