Introduction to Snakemake

In this Exercise, we’ll be starting where we left off in Exercise A and we’ll be using the snakefile you wrote in the objective 6.

In the following objectives, we will be building on the Snakefile from Exercise 1A which successively runs FastQC then MultiQC on a set of RNA-seq data.

Each step will reply to an objective. Thus, we will be doing several cycles of executing snakemake, observing the results and improving the code. Each code version will be noted ex1b_oX.smk, with X a progressive digit corresponding to the objective number.

Setup

It’s the same setup as for Exercise A, we will work on a node of the cluster with Snakemake, FastQC and MultiQC modules loaded.

As a reminder, you should be in your working directory from Exercise A in which you have the Data/ folder containing all example files.

login@node06:/shared/projects/2417_wf4bioinfo/login/day2-session$ ls -a
.  ..  .snakemake  Data  FastQC  multiqc_data  multiqc_report.html

module load snakemake/8.9.0 fastqc/0.12.1 multiqc/1.13 graphviz/2.40.1

NB: If you’ve skipped Exercise A, you can copy-paste the example solution script from Exercise A objective 6.

Objective 1 - glob_wildcards()

Make Snakemake find your input files for you, using the glob_wildcards() function (similar to the glob module in Python).

Where to start?

Test the script

Next, let’s check if your pipeline still works as it should using the following command:

snakemake -s ex1b_o1.smk -c 1 -p
# -c is short for --cores

Observe the output

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
fastqc         3
multiqc        1
total          5

Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:00:35 2024]
localrule fastqc:
    input: Data/SRR3105697_chr18.fastq.gz
    output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    jobid: 6
    reason: Missing output files: FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.zip
    wildcards: sample=SRR3105697_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz
application/gzip
Started analysis of SRR3105697_chr18.fastq.gz
Approx 5% complete for SRR3105697_chr18.fastq.gz
Approx 10% complete for SRR3105697_chr18.fastq.gz
Approx 15% complete for SRR3105697_chr18.fastq.gz
Approx 20% complete for SRR3105697_chr18.fastq.gz
Approx 25% complete for SRR3105697_chr18.fastq.gz
Approx 30% complete for SRR3105697_chr18.fastq.gz
Approx 35% complete for SRR3105697_chr18.fastq.gz
Approx 40% complete for SRR3105697_chr18.fastq.gz
Approx 45% complete for SRR3105697_chr18.fastq.gz
Approx 50% complete for SRR3105697_chr18.fastq.gz
Approx 55% complete for SRR3105697_chr18.fastq.gz
Approx 60% complete for SRR3105697_chr18.fastq.gz
Approx 65% complete for SRR3105697_chr18.fastq.gz
Approx 70% complete for SRR3105697_chr18.fastq.gz
Approx 75% complete for SRR3105697_chr18.fastq.gz
Approx 80% complete for SRR3105697_chr18.fastq.gz
Approx 85% complete for SRR3105697_chr18.fastq.gz
Approx 90% complete for SRR3105697_chr18.fastq.gz
Approx 95% complete for SRR3105697_chr18.fastq.gz
Analysis complete for SRR3105697_chr18.fastq.gz
[Thu Oct  3 17:00:44 2024]
Finished job 6.
1 of 5 steps (20%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:00:45 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    jobid: 4
    reason: Missing output files: FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.zip
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz
application/gzip
Started analysis of SRR3105698_chr18.fastq.gz
Approx 5% complete for SRR3105698_chr18.fastq.gz
Approx 10% complete for SRR3105698_chr18.fastq.gz
Approx 15% complete for SRR3105698_chr18.fastq.gz
Approx 20% complete for SRR3105698_chr18.fastq.gz
Approx 25% complete for SRR3105698_chr18.fastq.gz
Approx 30% complete for SRR3105698_chr18.fastq.gz
Approx 35% complete for SRR3105698_chr18.fastq.gz
Approx 40% complete for SRR3105698_chr18.fastq.gz
Approx 45% complete for SRR3105698_chr18.fastq.gz
Approx 50% complete for SRR3105698_chr18.fastq.gz
Approx 55% complete for SRR3105698_chr18.fastq.gz
Approx 60% complete for SRR3105698_chr18.fastq.gz
Approx 65% complete for SRR3105698_chr18.fastq.gz
Approx 70% complete for SRR3105698_chr18.fastq.gz
Approx 75% complete for SRR3105698_chr18.fastq.gz
Approx 80% complete for SRR3105698_chr18.fastq.gz
Approx 85% complete for SRR3105698_chr18.fastq.gz
Approx 90% complete for SRR3105698_chr18.fastq.gz
Approx 95% complete for SRR3105698_chr18.fastq.gz
Analysis complete for SRR3105698_chr18.fastq.gz
[Thu Oct  3 17:00:55 2024]
Finished job 4.
2 of 5 steps (40%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:00:55 2024]
localrule fastqc:
    input: Data/SRR3105699_chr18.fastq.gz
    output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    jobid: 2
    reason: Missing output files: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    wildcards: sample=SRR3105699_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz
application/gzip
Started analysis of SRR3105699_chr18.fastq.gz
Approx 5% complete for SRR3105699_chr18.fastq.gz
Approx 10% complete for SRR3105699_chr18.fastq.gz
Approx 15% complete for SRR3105699_chr18.fastq.gz
Approx 20% complete for SRR3105699_chr18.fastq.gz
Approx 25% complete for SRR3105699_chr18.fastq.gz
Approx 30% complete for SRR3105699_chr18.fastq.gz
Approx 35% complete for SRR3105699_chr18.fastq.gz
Approx 40% complete for SRR3105699_chr18.fastq.gz
Approx 45% complete for SRR3105699_chr18.fastq.gz
Approx 50% complete for SRR3105699_chr18.fastq.gz
Approx 55% complete for SRR3105699_chr18.fastq.gz
Approx 60% complete for SRR3105699_chr18.fastq.gz
Approx 65% complete for SRR3105699_chr18.fastq.gz
Approx 70% complete for SRR3105699_chr18.fastq.gz
Approx 75% complete for SRR3105699_chr18.fastq.gz
Approx 80% complete for SRR3105699_chr18.fastq.gz
Approx 85% complete for SRR3105699_chr18.fastq.gz
Approx 90% complete for SRR3105699_chr18.fastq.gz
Approx 95% complete for SRR3105699_chr18.fastq.gz
Analysis complete for SRR3105699_chr18.fastq.gz
[Thu Oct  3 17:01:05 2024]
Finished job 2.
3 of 5 steps (60%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:01:05 2024]
localrule multiqc:
    input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    jobid: 7
    reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip; Set of input files has changed since last execution
    resources: tmpdir=/tmp

multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip

  /// MultiQC 🔍 | v1.13

|           multiqc | MultiQC Version v1.25.1 now available!
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3099585_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3105699_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3099586_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3105698_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3099587_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/cquignot/day2-session1/FastQC/SRR3105697_chr18_fastqc.zip
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 6/6  
|            fastqc | Found 6 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data
|           multiqc | MultiQC complete
[Thu Oct  3 17:01:09 2024]
Finished job 7.
4 of 5 steps (80%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:01:09 2024]
localrule all:
    input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, multiqc_report.html
    jobid: 0
    reason: Input files updated by another job: FastQC/SRR3105699_chr18_fastqc.html, multiqc_report.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html
    resources: tmpdir=/tmp

[Thu Oct  3 17:01:09 2024]
Finished job 0.
5 of 5 steps (100%) done
Complete log: .snakemake/log/2024-10-03T170034.983240.snakemake.log

As you can see, Snakemake found 3 extra files in you Data/ directory to run FastQC on and regenerated the MultiQC report to include them.

Objective 2 - config file

Adapt your pipeline (ex1b_o2.smk) to use a configuration file called myConfig.yml that specifies the path of the input data directory, and replace the hard-coding values (Data/) in the snakefile.

Where to start?

Why use a configuration file? In this file, you will place all hard-coding values of the snakefile (paths to files, core numbers, parameter values, etc). In this way, you have a generalised analysis pipeline that you can easily apply to new input data by simply changing the configfile rather than the snakefile directly.

Test the script

Next, let’s check if your pipeline still works as it should using the following command:

snakemake -s ex1b_o2.smk -c 1 -p --configfile ex1.yml

Observe the output

As you can see, Snakemake doesn’t see anything that changed that might influence the outputs that were already generated and tells you that there’s “Nothing to be done”.

To go further - more parameters

The configfile can be used to specify as many parameters as you want, for example, the name of the output directory in the fastqc rule. In order to add parameters to the shell command line, you can use the params directive:

How values of directives are used in “shell”

Let’s take a moment to compare the various “wildcards” that we’ve used within the shell calls and the way we specify the files in our directives. For example, in the fastqc rule:

As you can see, you are free to add as many files to your directives as you wish: input has only 1, output has 2 and params has only 1 (and we’ve named it “outdir”). Note that if you provide more than one file to a directive, these should be separated by a comma (see output directive above for an example). If you are afraid of forgetting it, you can systematically add a comma at the end of your directive lines.

To use the files specified in your directives within the shell directive, you can use the wildcard syntax ({variable}). In the case of our input directive, as we only have 1 file, we used the directive name directly ({input}). However, as soon as you have several files, you can use Python’s list syntax. For example, if we had to explicitly mention our output files in the shell command, we would have specified them with {output[0]} and {output[1]}. We can also give names to each element if there are several to avoid using indexing (i.e. [0], [1] etc.). For example, in the params directive, we chose to name the file, so we use the syntax {directive.name} instead.

Transfer your Python skills - bonus for Python addicts only 😉

And finally, it’s important to note that the language Snakemake uses is Python, so don’t hesitate to use Python’s modules within the snakefile if you’re familiar with the language. For example, you could use the os.path module’s join() function instead of concatenating your paths with the + operator. You could also use Python’s string formats or triple quote strings too.

Objective 3 - logs

Create a new snakefile named ex1b_o3.smk in which we redirect the standard output and error streams to log files.

Where to start?

Test the script

snakemake -s ex1b_o3.smk -c 1 -p --configfile ex1.yml

Observe the output

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
fastqc         6
multiqc        1
total          8

Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:05:46 2024]
localrule fastqc:
    input: Data/SRR3105697_chr18.fastq.gz
    output: FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.html
    log: Logs/SRR3105697_chr18_fastqc.std, Logs/SRR3105697_chr18_fastqc.err
    jobid: 6
    reason: Code has changed since last execution; Params have changed since last execution
    wildcards: sample=SRR3105697_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105697_chr18.fastq.gz 1>Logs/SRR3105697_chr18_fastqc.std 2>Logs/SRR3105697_chr18_fastqc.err
[Thu Oct  3 17:05:57 2024]
Finished job 6.
1 of 8 steps (12%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:05:57 2024]
localrule fastqc:
    input: Data/SRR3105699_chr18.fastq.gz
    output: FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.html
    log: Logs/SRR3105699_chr18_fastqc.std, Logs/SRR3105699_chr18_fastqc.err
    jobid: 2
    reason: Code has changed since last execution; Params have changed since last execution
    wildcards: sample=SRR3105699_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105699_chr18.fastq.gz 1>Logs/SRR3105699_chr18_fastqc.std 2>Logs/SRR3105699_chr18_fastqc.err
[Thu Oct  3 17:06:07 2024]
Finished job 2.
2 of 8 steps (25%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:06:07 2024]
localrule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    log: Logs/SRR3099585_chr18_fastqc.std, Logs/SRR3099585_chr18_fastqc.err
    jobid: 1
    reason: Code has changed since last execution; Params have changed since last execution
    wildcards: sample=SRR3099585_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz 1>Logs/SRR3099585_chr18_fastqc.std 2>Logs/SRR3099585_chr18_fastqc.err
[Thu Oct  3 17:06:17 2024]
Finished job 1.
3 of 8 steps (38%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:06:17 2024]
localrule fastqc:
    input: Data/SRR3099587_chr18.fastq.gz
    output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
    log: Logs/SRR3099587_chr18_fastqc.std, Logs/SRR3099587_chr18_fastqc.err
    jobid: 5
    reason: Code has changed since last execution; Params have changed since last execution
    wildcards: sample=SRR3099587_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz 1>Logs/SRR3099587_chr18_fastqc.std 2>Logs/SRR3099587_chr18_fastqc.err
[Thu Oct  3 17:06:27 2024]
Finished job 5.
4 of 8 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:06:27 2024]
localrule fastqc:
    input: Data/SRR3099586_chr18.fastq.gz
    output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
    log: Logs/SRR3099586_chr18_fastqc.std, Logs/SRR3099586_chr18_fastqc.err
    jobid: 3
    reason: Code has changed since last execution; Params have changed since last execution
    wildcards: sample=SRR3099586_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz 1>Logs/SRR3099586_chr18_fastqc.std 2>Logs/SRR3099586_chr18_fastqc.err
[Thu Oct  3 17:06:37 2024]
Finished job 3.
5 of 8 steps (62%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:06:37 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
    jobid: 4
    reason: Code has changed since last execution; Params have changed since last execution
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err
[Thu Oct  3 17:06:48 2024]
Finished job 4.
6 of 8 steps (75%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:06:48 2024]
localrule multiqc:
    input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    log: Logs/multiqc.std, Logs/multiqc.err
    jobid: 7
    reason: Input files updated by another job: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105699_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3105697_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.zip
    resources: tmpdir=/tmp

multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3105699_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3105698_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip FastQC/SRR3105697_chr18_fastqc.zip 1>Logs/multiqc.std 2>Logs/multiqc.err
[Thu Oct  3 17:06:53 2024]
Finished job 7.
7 of 8 steps (88%) done
Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 17:06:53 2024]
localrule all:
    input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, multiqc_report.html
    jobid: 0
    reason: Input files updated by another job: multiqc_report.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3105698_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3105697_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3105699_chr18_fastqc.html
    resources: tmpdir=/tmp

[Thu Oct  3 17:06:53 2024]
Finished job 0.
8 of 8 steps (100%) done
Complete log: .snakemake/log/2024-10-03T170546.019954.snakemake.log

As you can see, the fastqc steps don’t generate as much text as before. Also, if you have a look at your working directory, you should see a Logs folder in there now, containing all the individual logs of your input files and rules:

john.doe@cluster-i2bc:/data/work/I2BC/john.doe/snakemake_tutorial$ ls Logs/
multiqc.err                  SRR3099586_chr18_fastqc.err  SRR3105697_chr18_fastqc.err  SRR3105699_chr18_fastqc.err
multiqc.std                  SRR3099586_chr18_fastqc.std  SRR3105697_chr18_fastqc.std  SRR3105699_chr18_fastqc.std
SRR3099585_chr18_fastqc.err  SRR3099587_chr18_fastqc.err  SRR3105698_chr18_fastqc.err
SRR3099585_chr18_fastqc.std  SRR3099587_chr18_fastqc.std  SRR3105698_chr18_fastqc.std

Objective 4 - rerun

Motivation

Have you noticed how Snakemake sometimes decides to re-run everything (although output files already exist) and sometimes not?

Snakemake always justifies its choices in the output log. Sometimes its because the files are missing, other times it might because the code has changed, etc.. For example:

[Tue Feb 20 15:35:31 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    log: Logs/SRR3105698_chr18_fastqc.std, Logs/SRR3105698_chr18_fastqc.err
    jobid: 2
    reason: Code has changed since last execution
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/var/tmp/pbs.743371.pbsserver

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz 1>Logs/SRR3105698_chr18_fastqc.std 2>Logs/SRR3105698_chr18_fastqc.err

[Tue Feb 20 15:06:06 2024]
localrule fastqc:
    input: Data/SRR3105698_chr18.fastq.gz
    output: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    jobid: 2
    reason: Missing output files: FastQC/SRR3105698_chr18_fastqc.zip, FastQC/SRR3105698_chr18_fastqc.html
    wildcards: sample=SRR3105698_chr18
    resources: tmpdir=/var/tmp/pbs.743371.pbsserver

fastqc --outdir FastQC Data/SRR3105698_chr18.fastq.gz

This behaviour is because Snakemake takes into account all provenance information to define which jobs to rerun and not solely the fact that outputs exist or not. Thus, parameter, code and software environment changes, as well as changes in the set of input files of a job will make Snakemake rerun all the steps, even if the output files already exist.

Of note: in previous versions of Snakemake (before v.7.8.0), rerunning jobs relied purely on file modification times so Snakemake would only re-execute the pipeline if the timestamp of output files were older than the ones of input files.

How to control re-running criteria?

There is a command line option that you can use in Snakemake in order to specify your re-running criteria: --rerun-triggers. By default, all triggers are used (code,input,mtime,params,software-env), which guarantees that results are consistent with the workflow code and configuration. To revert to Snakemake’s behaviour before v.7.8.0, you can use --rerun-triggers mtime. This option will tell Snakemake to only use modification time when determining whether a job should be executed or not. For example:

snakemake -s ex1b_o3.smk -c 1 -p --rerun-triggers mtime

Force re-run

If you rerun your snakemake command line now, without changing anything to the code (with or without the --rerun-trigger mtime option), you should see a message from Snakemake telling you that nothing needs doing:

However, if you’d like to force Snakemake to re-run the whole pipeline, you can do one of the following:

– delete all output folders and results before re-running the Snakemake command

rm -rf FastQC multiqc*
snakemake -s ex1b_o3.smk -c 1 -p --configfile ex1.yml

– use Snakemake’s --forcerun (-R) or --forceall (-F) options when you run the Snakemake command. --forcerun reruns a specific rule or input which you will have to specify in the command line. --forceall forces everything to be re-run. For example:

# specify a rule
snakemake -s ex1b_o3.smk -c 1 -p -R fastqc --configfile ex1.yml
# specify an input/output file
snakemake -s ex1b_o3.smk -c 1 -p -R FastQC/SRR3099585_chr18_fastqc.zip --configfile ex1.yml
# rerun everything
snakemake -s ex1b_o3.smk -c 1 -p -F --configfile ex1.yml

Recap

Some more useful options when running Snakemake

– print the reason for each rule execution: -r --reason (for Snakemake version <8 only)

– automatically create HTML reports (--report report.html) containing runtime statistics, a visualisation of the workflow topology, used software and data provenance information (need to add the jinja2 package as a dependency)