Introduction to Snakemake - Exercises


A. Setup

1. connect to the IFB cluster

Through ssh:

ssh -X -o ServerAliveInterval=60 -l mylogin core.cluster.france-bioinformatique.fr

For more information: https://ifb-elixirfr.gitlab.io/cluster/doc/logging-in/

Or through OpenOnDemand:

For more information: https://ifb-elixirfr.gitlab.io/cluster/doc/software/openondemand/

2. prepare your working environment

cd /shared/projects/2417_wf4bioinfo/
mkdir -p $USER/day2-introsmk
cp snakemake-workshop/day2-introsmk/Snakefile $USER/day2-introsmk
cd $USER/day2-introsmk


The following sections are divided into questions (Q) and tasks (T). Explanations and answers can be found in the corresponding sections by expanding the dedicated links.

B. Understanding the pipeline

A colleague gave me a Snakemake pipeline, now what?!

Q1: plot the rule graph to visualise the pipeline & understand what it does

Hints:

Click to see an example solution

Make sure you’re in the right folder:

cd /shared/projects/2417_wf4bioinfo/$USER/day2-introsmk

Load the right modules:

module load graphviz/2.40.1
module load snakemake/8.9.0

Run snakemake:

snakemake --rulegraph | dot -Tpng > rule.png
Click here for explanations

We can see from the rule graph that there are 4 rules:

  • loadData
  • fusionFasta
  • mafft
  • all

Given the linear look of the graph, all rules seem to be applied sequentially (i.e. the output of one rule is the input of the next, and so on).
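
Side note: --rulegraph draws one node per rule. If you want one node per job instead (useful when a rule runs several times, e.g. once per downloaded sequence), Snakemake can also plot the full job DAG:

snakemake --dag | dot -Tpng > dag.png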

Q2: find the Snakemake options to list rules and possible target rules

Hints:

Click to see the answer

The options are --list-rules and --list-target-rules:

snakemake --list-rules
snakemake --list-target-rules
Click here for explanations

Rule list:

all
fusionFasta
loadData
mafft

The list of rules output by the command corresponds to the ones we see in the rule graph.

Target rule list:

all

There seems to be only one official target rule in the Snakefile: all

T3: let’s have a look at the Snakefile

Open up the Snakefile with your favorite editor (e.g. vi, nano, emacs, …) or viewer (e.g. less, more, cat, …). Look at the rules in the same order as in the rule graph & try to understand what they do (pay specific attention to the shell directives).

Explanation:

In a nutshell:

This pipeline doesn’t need any input files: it downloads sequences directly from the web given their UniProt identifiers, fuses them into a single file, and then aligns them with mafft.
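
To make the pipeline structure more concrete, here is a minimal sketch of what such a Snakefile could look like, based only on the rule names, file paths and wildcards that appear in the dry-run output shown below. It is not the actual workshop Snakefile: in particular, the UniProt download URL and the exact shell commands are assumptions.

rule all:
    input:
        "results/alignment/P10415_P01308_aligned.fasta"

rule loadData:
    # download one sequence from UniProt given its identifier (URL is an assumption)
    output:
        "results/data/{sample}.fasta"
    shell:
        "wget -O {output} https://rest.uniprot.org/uniprotkb/{wildcards.sample}.fasta"

rule fusionFasta:
    # concatenate the two downloaded sequences into a single FASTA file;
    # the constraints stop this rule from also matching the *_aligned.fasta output of mafft
    wildcard_constraints:
        uniprotid1="[A-Z0-9]+",
        uniprotid2="[A-Z0-9]+"
    input:
        "results/data/{uniprotid1}.fasta",
        "results/data/{uniprotid2}.fasta"
    output:
        "results/alignment/{uniprotid1}_{uniprotid2}.fasta"
    shell:
        "cat {input} > {output}"

rule mafft:
    # align the fused FASTA file with mafft
    input:
        "results/alignment/{prefix}.fasta"
    output:
        "results/alignment/{prefix}_aligned.fasta"
    shell:
        "mafft {input} > {output}"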

Side note: Python within Snakemake

If you’re familiar with Python, you’ll have recognised plenty of Python-like bits in this Snakefile. Although you don’t have to know Python to understand Snakemake, it’s worth keeping it in mind while you’re using it. If you have a look at more advanced code, you’ll see that you can transfer quite a bit of your Python skills into Snakemake (you can import packages, for example, or define functions). Also keep in mind that, like Python, Snakemake is sensitive to indentation.
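
For instance, a Snakefile can contain plain Python at the top level. The sketch below is purely illustrative (the names SAMPLES, fasta_path and all_fastas are made up and do not come from the workshop Snakefile):

# plain Python mixed with Snakemake syntax (hypothetical example)
import os                                   # a standard Python import

SAMPLES = ["P10415", "P01308"]              # an ordinary Python list

def fasta_path(sample):
    # an ordinary Python function, used below to build output paths
    return os.path.join("results", "data", sample + ".fasta")

rule all_fastas:
    input:
        [fasta_path(s) for s in SAMPLES]    # a Python list comprehension inside a directive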

Q4: dry-run the code

Dry-running the code is a good way of doing a final check to make sure everything will work once you run the pipeline on the cluster’s compute nodes.

NB: A dry-run doesn’t actually use any major resources, so running it directly on the cluster’s login node is fine.

Hints:

Click to see an example solution

Make sure you’re in the right folder:

cd /shared/projects/2417_wf4bioinfo/$USER/day2-introsmk

Load the right modules:

module load snakemake/8.9.0

Run snakemake:

snakemake --dry-run
Click here for explanations

You should get an output on your screen similar to the following:

Building DAG of jobs...
Job stats:
job            count
-----------  -------
all                1
fusionFasta        1
loadData           2
mafft              1
total              5

Execute 2 jobs...

[Mon Aug 19 08:32:28 2024]
localrule loadData:
    output: results/data/P01308.fasta
    jobid: 4
    reason: Missing output files: results/data/P01308.fasta
    wildcards: sample=P01308
    resources: tmpdir=<TBD>


[Mon Aug 19 08:32:28 2024]
localrule loadData:
    output: results/data/P10415.fasta
    jobid: 3
    reason: Missing output files: results/data/P10415.fasta
    wildcards: sample=P10415
    resources: tmpdir=<TBD>

Execute 1 jobs...

[Mon Aug 19 08:32:28 2024]
localrule fusionFasta:
    input: results/data/P10415.fasta, results/data/P01308.fasta
    output: results/alignment/P10415_P01308.fasta
    jobid: 2
    reason: Missing output files: results/alignment/P10415_P01308.fasta; Input files updated by another job: results/data/P10415.fasta, results/data/P01308.fasta
    wildcards: uniprotid1=P10415, uniprotid2=P01308
    resources: tmpdir=<TBD>

Execute 1 jobs...

[Mon Aug 19 08:32:28 2024]
localrule mafft:
    input: results/alignment/P10415_P01308.fasta
    output: results/alignment/P10415_P01308_aligned.fasta
    jobid: 1
    reason: Missing output files: results/alignment/P10415_P01308_aligned.fasta; Input files updated by another job: results/alignment/P10415_P01308.fasta
    wildcards: prefix=P10415_P01308
    resources: tmpdir=<TBD>

Execute 1 jobs...

[Mon Aug 19 08:32:28 2024]
localrule all:
    input: results/alignment/P10415_P01308_aligned.fasta
    jobid: 0
    reason: Input files updated by another job: results/alignment/P10415_P01308_aligned.fasta
    resources: tmpdir=<TBD>

Job stats:
job            count
-----------  -------
all                1
fusionFasta        1
loadData           2
mafft              1
total              5

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        all, fusionFasta, mafft
    output files have to be generated:
        fusionFasta, loadData, mafft

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

The Job stats section (lines 2-9, and again lines 61-68) tells you that 5 jobs will be run in total, and the Reasons section (lines 70-75) explains why. In this case, output files are missing, and depending on which ones are missing, some or all of the rules have to be rerun.

The log also contains one section per job that would be run by Snakemake (e.g. loadData for P01308, lines 13-19). Each section shows the input and output file paths, the reason why the job is run, and information specific to that job, i.e. the resources used, its jobid, the wildcard values…
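
As the last line of the log suggests, --dry-run has the short form -n. It can be combined with -p (--printshellcmds) to also print the shell command each job would execute, which makes the final check even more informative:

snakemake -n -p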


C. Running the pipeline

Let’s run the pipeline on the IFB cluster. We’ll go through a SLURM submission script to run the Snakefile within a SLURM job. SLURM quick-start guide: https://ifb-elixirfr.gitlab.io/cluster/doc/quick-start/

Q5: Create a SLURM submission script

Hints:

Click to see an example solution

Example run_script.sh:

#!/bin/bash

#SBATCH --job-name=day2-introsmk
#SBATCH --mem=100Mb
#SBATCH --account=2417_wf4bioinfo
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

# load necessary modules
module load snakemake/8.9.0
module load mafft/7.515

# move to WD
cd /shared/projects/2417_wf4bioinfo/$USER/day2-introsmk

# run snakemake with minimum options
snakemake -c 1
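
If you first want to check that SLURM accepts the script without actually running anything, sbatch has a --test-only option that validates the submission and prints an estimated start time:

sbatch --test-only run_script.sh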

Q6: Run the pipeline (i.e. submit the script) & follow your job

Click to see an example solution

Assuming you have all SLURM options within your run script:

sbatch run_script.sh

Then follow your submitted jobs with:

squeue -u $USER

or

squeue --me
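
squeue only shows pending and running jobs. Once the job has finished, you can still inspect it with sacct, for example (41257618 is the example job ID used in section D, replace it with your own):

sacct -j 41257618 --format=JobID,JobName,State,Elapsed,MaxRSS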

D. Checking the output

T8: Have a look at your working directory

For example with ls or with tree. You see a results folder (the output of the pipeline), your script, the Snakefile and the SLURM output log. At first glance, all outputs seem to have been generated.

Output of tree command:

.
├── results
│   ├── alignment
│   │   ├── P10415_P01308_aligned.fasta
│   │   └── P10415_P01308.fasta
│   └── data
│       ├── P01308.fasta
│       └── P10415.fasta
├── run_script.sh
├── slurm-41257618.out
└── Snakefile

3 directories, 7 files

T9: Have a closer look at SLURM’s output log:

For example with cat slurm-41257618.out. The output should look a lot like the one you’ve generated during the dry-run. We can also see from the log that the pipeline ended without an error, cf. 5 of 5 steps (100%) done and no mention of errors:

[Fri Aug 16 18:35:03 2024]
Finished job 0.
5 of 5 steps (100%) done
Complete log: .snakemake/log/2024-08-16T183501.992097.snakemake.log

As you can see, Snakemake also saves the log printed to the screen in a <date-time>.snakemake.log file within the .snakemake/log folder.

T10: Have a quick glance at the .snakemake folder:

.snakemake is a hidden folder that Snakemake creates in the directory from which it is run (i.e. its working directory, which you can change with the --directory option). This folder contains log files, metadata (such as run statistics) and environments that Snakemake collects and generates during a run. It can therefore get quite big (in volume and in number of files) and is not deleted automatically.

.snakemake/
├── auxiliary
├── conda
├── conda-archive
├── incomplete
├── locks
├── log
│   └── 2024-08-16T183501.992097.snakemake.log
├── metadata
│   ├── cmVzdWx0cy9hbGlnbm1lbnQvUDEwNDE1X1AwMTMwOC5mYXN0YQ==
│   ├── cmVzdWx0cy9hbGlnbm1lbnQvUDEwNDE1X1AwMTMwOF9hbGlnbmVkLmZhc3Rh
│   ├── cmVzdWx0cy9kYXRhL1AwMTMwOC5mYXN0YQ==
│   └── cmVzdWx0cy9kYXRhL1AxMDQxNS5mYXN0YQ==
├── shadow
└── singularity

9 directories, 5 files
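
Side note on the locks subfolder: Snakemake locks the working directory while it runs. If a run is interrupted abruptly, a stale lock can remain and the next run will refuse to start. In that case (and only when no other Snakemake instance is using the directory), you can remove the lock with:

snakemake --unlock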

E. Conclusion

Accomplished objectives:

  • visualise and understand a pipeline someone else wrote (rule graph, rule lists, reading the Snakefile)
  • dry-run the pipeline to check it before launching it
  • run the pipeline on the IFB cluster through a SLURM submission script
  • check the outputs, the SLURM log and the .snakemake folder

Command summary for Snakemake:

snakemake --rulegraph | dot -Tpng > rule.png   # plot the rule graph
snakemake --list-rules                         # list all rules
snakemake --list-target-rules                  # list possible target rules
snakemake --dry-run                            # dry-run (short form: -n)
snakemake -c 1                                 # run the pipeline with 1 core

T11: The last task of this exercise: clean up your current directory

rm -rf .snakemake/

You can also delete the results and SLURM’s output file if you wish.
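
For example (the SLURM log file name will contain your own job ID):

rm -rf results/
rm -f slurm-*.out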