Introduction to Snakemake

Exercise A - creating a first basic pipeline from scratch

Introduction

For this practical exercise, we will:

access snakemake (setup)
create a first snakefile with a single rule (o1)
upscale your pipeline with a second input (o2)
add a second rule to create a first workflow (o3)
discover & understand the use of a target rule (o4)
understand how rules are linked (o4)
learn how to generalise inputs and outputs of rules (#wildcards) (o5 & o6)
get accustomed to using wildcards within a snakemake rule (o5 & o6)
learn how to visualise your pipeline (recap)
learn how to simulate the execution with dry-run (recap)

Our input: bulk RNA-seq data in fastq format

Our objective: to evaluate the quality of our data

Our tools: FastQC and MultiQC are tools commonly used to analyse the quality of sequencing data, we would like to run these within a Snakemake pipeline

Our final objective is to create a snakefile to manage this small workflow:

How this exercise is organised:

We will be building the pipeline progressively together. Each step will reply to an objective. Thus, we will be doing several cycles of executing snakemake, observing the results and improving the code. Each code version will be noted ex1_oX.smk, with X a progressive digit corresponding to the objective number.

Setup

Connect to IFB cluster

Through ssh:

ssh -X -o ServerAliveInterval=60 -l mylogin core.cluster.france-bioinformatique.fr

For more information: https://ifb-elixirfr.gitlab.io/cluster/doc/logging-in/

Or through OpenOnDemand:

For more information: https://ifb-elixirfr.gitlab.io/cluster/doc/software/openondemand/

Prepare your working environment

Once on the cluster, connect to a node :

Hints:

don’t forget to add the 2417_wf4bioinfo project with --account

login@clust-slurm-client:~$ srun --account 2417_wf4bioinfo --pty bash
srun: job 41998304 queued and waiting for resources
srun: job 41998304 has been allocated resources
login@cpu-node7

Then load the appropriate tools using the module command:

module load snakemake/8.9.0
module load fastqc/0.12.1
module load multiqc/1.13
module load graphviz/2.40.1

Why these 4 tools? Because we will be using all four in our pipeline.

Double-check that you’ve loaded the modules correctly, for example, by checking their version:

login@cpu-node7:~$ snakemake --version
8.9.0

And create & move to your chosen working space, for example:

cd /shared/projects/2417_wf4bioinfo/
mkdir -p $USER/day2-session1
cd $USER/day2-session1

Fetch the example files

Example files are accessible on Zenodo as a tar.gz archive. To fetch and extract the files, you can use the following command lines:

# download archive
wget "https://zenodo.org/record/3997237/files/FAIR_Bioinfo_data.tar.gz"
# extract files
tar -zxf FAIR_Bioinfo_data.tar.gz
# delete the archive
rm FAIR_Bioinfo_data.tar.gz

You should now see in your working space, a folder called Data containing compressed fastq files (*.fastq.gz) and O. tauri genome files (*.gff and *.fna).

Output of tree command:

└── Data/
    ├── O.tauri_annotation.gff 
    ├── O.tauri_genome.fna
    ├── SRR3099585_chr18.fastq.gz
    ├── SRR3099586_chr18.fastq.gz
    ├── SRR3099587_chr18.fastq.gz
    ├── SRR3105697_chr18.fastq.gz
    ├── SRR3105698_chr18.fastq.gz
    └── SRR3105699_chr18.fastq.gz
1 directories, 8 files

Objective 1 - 1 rule 1 input

Create a snakemake file named ex1_o1.smk including the first step of the RNAseq workflow (aka. execution of the QC analysis using the FastQC tool) on one of the RNA-seq files.

Where to start?

This training session is not about learning how to analyse RNA-seq data, so here are a few hints about FastQC usage:

Input file: SRR3099585_chr18.fastq.gz in the ${PWD}/Data directory
FastQC command: fastqc --outdir FastQCResultDirectory inputFileName
Expected output: the 2 result files (*_fastqc.zip and *_fastqc.html) will be located in your outdir and named using the base name of your input file (eg. SRR3099585_chr18_fastqc.zip)

Click to see an example solution

Your Code for ex1_o1.smk should look like this:

rule fastqc:
  input: 
    "Data/SRR3099585_chr18.fastq.gz" 
  output: 
    "FastQC/SRR3099585_chr18_fastqc.zip", 
    "FastQC/SRR3099585_chr18_fastqc.html"
  shell: 
    "fastqc --outdir FastQC/ {input}"

Test the script

Next, let’s check if your pipeline actually works using the following command to run your snakefile:

snakemake --snakefile ex1_o1.smk --cores 1

Observe the output

You should see something similar to the following output on your screen:

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job       count
------  -------
fastqc        1
total         1

Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 10:46:30 2024]
localrule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    jobid: 0
    reason: Missing output files: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    resources: tmpdir=/tmp

application/gzip
Started analysis of SRR3099585_chr18.fastq.gz
Approx 5% complete for SRR3099585_chr18.fastq.gz
Approx 10% complete for SRR3099585_chr18.fastq.gz
Approx 15% complete for SRR3099585_chr18.fastq.gz
Approx 20% complete for SRR3099585_chr18.fastq.gz
Approx 25% complete for SRR3099585_chr18.fastq.gz
Approx 30% complete for SRR3099585_chr18.fastq.gz
Approx 35% complete for SRR3099585_chr18.fastq.gz
Approx 40% complete for SRR3099585_chr18.fastq.gz
Approx 45% complete for SRR3099585_chr18.fastq.gz
Approx 50% complete for SRR3099585_chr18.fastq.gz
Approx 55% complete for SRR3099585_chr18.fastq.gz
Approx 60% complete for SRR3099585_chr18.fastq.gz
Approx 65% complete for SRR3099585_chr18.fastq.gz
Approx 70% complete for SRR3099585_chr18.fastq.gz
Approx 75% complete for SRR3099585_chr18.fastq.gz
Approx 80% complete for SRR3099585_chr18.fastq.gz
Approx 85% complete for SRR3099585_chr18.fastq.gz
Approx 90% complete for SRR3099585_chr18.fastq.gz
Approx 95% complete for SRR3099585_chr18.fastq.gz
Analysis complete for SRR3099585_chr18.fastq.gz
[Thu Oct  3 10:46:49 2024]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2024-10-03T104629.964160.snakemake.log

Have a look at the log that’s printed on your screen. The “Job stats” table in Snakemake’s output log will show you a summary of exactly what rules will be run and how many times (i.e. on how many input files). In this case, the fastqc rule will be run once.

Job stats:
job       count
------  -------
fastqc        1
total         1

Now, let’s have a look at your working directory:

login@node06:/shared/projects/2417_wf4bioinfo/login/day2-session$ ls -a 
.  ..  Data ex1_o1.smk FastQC .snakemake

Snakemake saw that the output directory FastQC was missing and created it before running FastQC (and that’s a good thing since FastQC doesn’t do this naturally and would have created an error otherwise!).

Also, notice the existence of a hidden folder called .snakemake. This is were Snakemake stores temporary files relative to the jobs that were executed. By default, this folder is created in the current working directory (i.e. the one from which the Snakemake command was executed).

Objective 2 - 1 rule 2 inputs

Let’s scale up a little bit! Create a new snakefile named ex1_o2.smk which will run FastQC on two RNA-seq input files.

Where to start?

Input files: SRR3099585_chr18.fastq.gz and SRR3099586_chr18.fastq.gz in the ${PWD}/Data directory
Expected output: we now expect result files for both of these inputs, so don’t forget to add the respective *_fastqc.zip and *_fastqc.html files for the second input

Click to see an example solution

Your Code for ex1_o2.smk should look like this:

rule fastqc:
  input: 
    "Data/SRR3099585_chr18.fastq.gz",
    "Data/SRR3099586_chr18.fastq.gz"
  output: 
    "FastQC/SRR3099585_chr18_fastqc.zip", 
    "FastQC/SRR3099585_chr18_fastqc.html",
    "FastQC/SRR3099586_chr18_fastqc.zip", 
    "FastQC/SRR3099586_chr18_fastqc.html" 
  shell: 
    "fastqc --outdir FastQC/ {input}"

Test the script

Next, let’s check again if your pipeline still works:

snakemake -s ex1_o2.smk --cores 1 -p
# -s: short form of the --snakefile option
# -p: prints the commands

Observe the output

You should see something similar to the following output on your screen:

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job       count
------  -------
fastqc        1
total         1

Select jobs to execute...
Execute 1 jobs...

[Thu Oct  3 10:50:07 2024]
localrule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz, Data/SRR3099586_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
    jobid: 0
    reason: Missing output files: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
    resources: tmpdir=/tmp

fastqc --outdir FastQC/ Data/SRR3099585_chr18.fastq.gz Data/SRR3099586_chr18.fastq.gz
application/gzip
application/gzip
Started analysis of SRR3099585_chr18.fastq.gz
Approx 5% complete for SRR3099585_chr18.fastq.gz
Approx 10% complete for SRR3099585_chr18.fastq.gz
Approx 15% complete for SRR3099585_chr18.fastq.gz
Approx 20% complete for SRR3099585_chr18.fastq.gz
Approx 25% complete for SRR3099585_chr18.fastq.gz
Approx 30% complete for SRR3099585_chr18.fastq.gz
Approx 35% complete for SRR3099585_chr18.fastq.gz
Approx 40% complete for SRR3099585_chr18.fastq.gz
Approx 45% complete for SRR3099585_chr18.fastq.gz
Approx 50% complete for SRR3099585_chr18.fastq.gz
Approx 55% complete for SRR3099585_chr18.fastq.gz
Approx 60% complete for SRR3099585_chr18.fastq.gz
Approx 65% complete for SRR3099585_chr18.fastq.gz
Approx 70% complete for SRR3099585_chr18.fastq.gz
Approx 75% complete for SRR3099585_chr18.fastq.gz
Approx 80% complete for SRR3099585_chr18.fastq.gz
Approx 85% complete for SRR3099585_chr18.fastq.gz
Approx 90% complete for SRR3099585_chr18.fastq.gz
Approx 95% complete for SRR3099585_chr18.fastq.gz
Analysis complete for SRR3099585_chr18.fastq.gz
Started analysis of SRR3099586_chr18.fastq.gz
Approx 5% complete for SRR3099586_chr18.fastq.gz
Approx 10% complete for SRR3099586_chr18.fastq.gz
Approx 15% complete for SRR3099586_chr18.fastq.gz
Approx 20% complete for SRR3099586_chr18.fastq.gz
Approx 25% complete for SRR3099586_chr18.fastq.gz
Approx 30% complete for SRR3099586_chr18.fastq.gz
Approx 35% complete for SRR3099586_chr18.fastq.gz
Approx 40% complete for SRR3099586_chr18.fastq.gz
Approx 45% complete for SRR3099586_chr18.fastq.gz
Approx 50% complete for SRR3099586_chr18.fastq.gz
Approx 55% complete for SRR3099586_chr18.fastq.gz
Approx 60% complete for SRR3099586_chr18.fastq.gz
Approx 65% complete for SRR3099586_chr18.fastq.gz
Approx 70% complete for SRR3099586_chr18.fastq.gz
Approx 75% complete for SRR3099586_chr18.fastq.gz
Approx 80% complete for SRR3099586_chr18.fastq.gz
Approx 85% complete for SRR3099586_chr18.fastq.gz
Approx 90% complete for SRR3099586_chr18.fastq.gz
Approx 95% complete for SRR3099586_chr18.fastq.gz
Analysis complete for SRR3099586_chr18.fastq.gz
[Thu Oct  3 10:50:25 2024]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2024-10-03T105007.097748.snakemake.log

As you can see in the highlighted line above, Snakemake detects that FastQC wasn’t yet run on your second input (“Missing output files: FastQC/SRR3099586_ch18_fastqc.html, FastQC/SRR3099586_ch18_fastqc.zip”, line 20) and then re-executes the fastqc command on this input.

Have a look at your output folder, you should now have 4 files in there:

login@node06:/shared/projects/2417_wf4bioinfo/login/day2-session$ ls FastQC
SRR3099585_chr18_fastqc.html  SRR3099585_chr18_fastqc.zip  SRR3099586_chr18_fastqc.html  SRR3099586_chr18_fastqc.zip

Objective 3 - 2 rules 2 inputs

We’ve seen how to add inputs, now let’s add a new rule! Create a new snakefile named ex1_o3.smk in which we add a new rule which will run MultiQC on a list of all output files of FastQC.

Where to start?

MultiQC is a tool used to aggregate the results of multiple other tools into a single html file.

Input files: FastQC’s zip file outputs.
MultiQC command: multiqc *fastqc.zip
Expected output: 2 files: multiqc_report.html and a multiqc_data repository
NB: when the output is a directory, you have to specify this using the directory() function. In this case, you would have to put: directory("multiqc_data")

Click to see an example solution

Your code for ex1_o3.smk should look like this:

rule fastqc:
  input: 
    "Data/SRR3099585_chr18.fastq.gz",
    "Data/SRR3099586_chr18.fastq.gz",
  output: 
    "FastQC/SRR3099585_chr18_fastqc.zip", 
    "FastQC/SRR3099585_chr18_fastqc.html",
    "FastQC/SRR3099586_chr18_fastqc.zip", 
    "FastQC/SRR3099586_chr18_fastqc.html", 
  shell: "fastqc --outdir FastQC {input}"

rule multiqc: 
  input: 
    "FastQC/SRR3099585_chr18_fastqc.zip", 
    "FastQC/SRR3099586_chr18_fastqc.zip", 
  output:
    "multiqc_report.html", 
    directory("multiqc_data") 
  shell: 
    "multiqc {input}"

Test the script

Next, let’s check again if your pipeline works:

snakemake -s ex1_o3.smk --cores 1 -p

Observe the output

Depending on the Snakemake version, you might get a short message saying “nothing to be done”, or it might re-run fastqc because “reason: Code has changed since last execution”.

Wait, what??! What about my new multiqc rule?

Expected behaviour:

Job stats:
job       count
------  -------
fastqc        1
multiqc       1
total         1

Current behaviour:

Job stats:
job       count
------  -------
fastqc        1
total         1

By default, Snakemake will only execute the first rule it encounters in your Snakefile, it’s called the target rule. If (and only if) the necessary input files to execute this rule are missing, will it scan the other rules in your Snakefile to generate them. In our case, the fastqc rule is the target rule as it’s written first. Since all the necessary input files are already available for the fastqc rule, Snakemake doesn’t execute any of the other rules in the file. Let’s see in the next objective how to fix this.

Objective 4 - target rule

Create a new snakefile named ex1_o4.smk in which we add a new & dedicated target rule at the beginning of the Snakefile.

Motivation - how can I run my multiqc rule?

Your first thought might be to change the order of your rules in the Snakefile to make sure multiqc is first. There’s actually a way of telling Snakemake at execution what rule you want to be the target rule (which avoids editing the file), by just adding its name. For example, the following command specifies multiqc as the target:

snakemake -s ex1_o3.smk --cores 1 -p multiqc

This solution works but it’s limited to sequential pipelines (i.e. the output of one is the direct input of the next). For less linear pipelines (e.g. output of rule1 is input for rules 2 and 3), this solution won’t work as only one rule can be the target rule.

As we now know, if the input files to the target rule don’t exist, Snakemake will first execute any of the other rules in the Snakefile that will help it generate these missing files. If we use this knowledge to our advantage, we can add a dedicated target rule at the top of our Snakefile which requires the outputs of key rules in our Snakefile. This will enable us to execute them all via a single line of code. We will be doing this below.

Where to start?

add a new rule: define a new rule (above the other rules in your Snakefile) containing only an input directive.
fill in the input directive: list all files that you expect as output in your pipeline (in this case, you could list all the outputs of fastqc and multiqc, or just the outputs of multiqc since our pipeline is completely linear).

more about target rules

Click to see an example solution

Your code for ex1_o4.smk should look like this:

rule all:
  input: 
    "FastQC/SRR3099585_chr18_fastqc.zip", 
    "FastQC/SRR3099585_chr18_fastqc.html",
    "FastQC/SRR3099586_chr18_fastqc.zip", 
    "FastQC/SRR3099586_chr18_fastqc.html", 
    "multiqc_report.html", 
    "multiqc_data", 

rule fastqc:
  input: 
    "Data/SRR3099585_chr18.fastq.gz",
    "Data/SRR3099586_chr18.fastq.gz",
  output: 
    "FastQC/SRR3099585_chr18_fastqc.zip", 
    "FastQC/SRR3099585_chr18_fastqc.html",
    "FastQC/SRR3099586_chr18_fastqc.zip", 
    "FastQC/SRR3099586_chr18_fastqc.html", 
  shell: "fastqc --outdir FastQC {input}"

rule multiqc: 
  input: 
    "FastQC/SRR3099585_chr18_fastqc.zip", 
    "FastQC/SRR3099586_chr18_fastqc.zip", 
  output:
    "multiqc_report.html", 
    directory("multiqc_data") 
  shell: 
    "multiqc {input}"

Note that, technically, you could remove all FastQC html and zip files from the new “all” target rule and only keep the multiqc outputs since our pipeline is completely linear in this example (i.e. Snakemake is capable of retracing the sequence of rules it needs to run in order to get to the final MultiQC output).

Test the script

Next, let’s run the pipeline and see if it works:

snakemake -s ex1_o4.smk --cores 1 -p

Observe the output

You should see something similar to the following output on your screen:

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
multiqc        1
total          2

Select jobs to execute...
Execute 1 jobs...

[Wed Oct  2 14:47:28 2024]
localrule multiqc:
    input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    jobid: 2
    reason: Missing output files: multiqc_report.html, multiqc_data
    resources: tmpdir=/tmp

multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip

 /// MultiQC 🔍 | v1.13

|           multiqc | MultiQC Version v1.25.1 now available!
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/ngoue/day2-session1/FastQC/SRR3099585_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/ngoue/day2-session1/FastQC/SRR3099586_chr18_fastqc.zip
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2  
Matplotlib is building the font cache; this may take a moment.
|            fastqc | Found 2 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data
|           multiqc | MultiQC complete
[Wed Oct  2 14:47:48 2024]
Finished job 2.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Oct  2 14:47:48 2024]
localrule all:
    input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html, multiqc_report.html, multiqc_data
    jobid: 0
    reason: Input files updated by another job: multiqc_report.html, multiqc_data
    resources: tmpdir=/tmp

[Wed Oct  2 14:47:48 2024]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake/log/2024-10-02T144728.099838.snakemake.log

This time, multiqc is executed and you should be able to see its outputs in your working directory:

login@node06:/shared/projects/2417_wf4bioinfo/login/day2-session1$ ls
Data  ex1_o1.smk  ex1_o2.smk ex1_o3.smk ex1_o4.smk FastQC multiqc_data  multiqc_report.html

Objective 5 - expand

Create a new snakefile named ex1_o5.smk in which we add an extra input to the fastqc rule.

Where to start?

Tired of explicitly writing all input and output file names?

=> Use Snakemake’s expand() function to manage all your input RNA-seq files at once.

So, how should we go about it?

Create a list: create a Python list at the beginning of the snakefile containing all the base names of your input files (don’t include the .fastq.gz suffix)
NB: in Python, a list of strings can be defined like this:
list_name = ["item1","item2",...,"itemN"]
Integrate expand(): replace the list of file names in the input and output directives of your rules by the expand() function.

For example, the input directive of your fastqc rule could look like this:

expand("Data/{sample}",sample=list_name)

Explanation: This function expects a string as input defining the file paths, in which you replace the variable parts by “placeholders” called wildcards. In this case, there is only one, that we named sample and that should be written between braces: {sample}.

As additional inputs to this function, we also have to specify the elements we would like to replace each wildcard with. In this case, we have to give the function our list of base names (sample=list_name).

more about expand()

Click to see an example solution

Your code for ex1_o5.smk should look like this (we added the SRR3099587_chr18 sample to the previous 2):

SAMPLES=["SRR3099585_chr18","SRR3099586_chr18","SRR3099587_chr18"]

rule all:
  input:
    expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),    
    expand("FastQC/{sample}_fastqc.zip", sample=SAMPLES), 
    "multiqc_report.html",
    "multiqc_data",

rule fastqc:
  input:
    expand("Data/{sample}.fastq.gz", sample=SAMPLES)
  output:
    expand("FastQC/{sample}_fastqc.zip", sample=SAMPLES),
    expand("FastQC/{sample}_fastqc.html", sample=SAMPLES)
  shell: "fastqc --outdir FastQC {input}"

rule multiqc:
  input:
    expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell: "multiqc {input}"

Test the script

Next, let’s run your pipeline again:

snakemake -s ex1_o5.smk --cores 1 -p

Observe the output

You should see something similar to the following output on your screen:

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
fastqc         1
multiqc        1
total          3

Select jobs to execute...
Execute 1 jobs...

[Wed Oct  2 15:16:37 2024]
localrule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz, Data/SRR3099586_chr18.fastq.gz, Data/SRR3099587_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html
    jobid: 1
    reason: Missing output files: FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.zip; Set of input files has changed since last execution; Code has changed since last execution
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz Data/SRR3099586_chr18.fastq.gz Data/SRR3099587_chr18.fastq.gz
application/gzip
application/gzip
Started analysis of SRR3099585_chr18.fastq.gz
application/gzip
Approx 5% complete for SRR3099585_chr18.fastq.gz
Approx 10% complete for SRR3099585_chr18.fastq.gz
Approx 15% complete for SRR3099585_chr18.fastq.gz
Approx 20% complete for SRR3099585_chr18.fastq.gz
Approx 25% complete for SRR3099585_chr18.fastq.gz
Approx 30% complete for SRR3099585_chr18.fastq.gz
Approx 35% complete for SRR3099585_chr18.fastq.gz
Approx 40% complete for SRR3099585_chr18.fastq.gz
Approx 45% complete for SRR3099585_chr18.fastq.gz
Approx 50% complete for SRR3099585_chr18.fastq.gz
Approx 55% complete for SRR3099585_chr18.fastq.gz
Approx 60% complete for SRR3099585_chr18.fastq.gz
Approx 65% complete for SRR3099585_chr18.fastq.gz
Approx 70% complete for SRR3099585_chr18.fastq.gz
Approx 75% complete for SRR3099585_chr18.fastq.gz
Approx 80% complete for SRR3099585_chr18.fastq.gz
Approx 85% complete for SRR3099585_chr18.fastq.gz
Approx 90% complete for SRR3099585_chr18.fastq.gz
Approx 95% complete for SRR3099585_chr18.fastq.gz
Analysis complete for SRR3099585_chr18.fastq.gz
Started analysis of SRR3099586_chr18.fastq.gz
Approx 5% complete for SRR3099586_chr18.fastq.gz
Approx 10% complete for SRR3099586_chr18.fastq.gz
Approx 15% complete for SRR3099586_chr18.fastq.gz
Approx 20% complete for SRR3099586_chr18.fastq.gz
Approx 25% complete for SRR3099586_chr18.fastq.gz
Approx 30% complete for SRR3099586_chr18.fastq.gz
Approx 35% complete for SRR3099586_chr18.fastq.gz
Approx 40% complete for SRR3099586_chr18.fastq.gz
Approx 45% complete for SRR3099586_chr18.fastq.gz
Approx 50% complete for SRR3099586_chr18.fastq.gz
Approx 55% complete for SRR3099586_chr18.fastq.gz
Approx 60% complete for SRR3099586_chr18.fastq.gz
Approx 65% complete for SRR3099586_chr18.fastq.gz
Approx 70% complete for SRR3099586_chr18.fastq.gz
Approx 75% complete for SRR3099586_chr18.fastq.gz
Approx 80% complete for SRR3099586_chr18.fastq.gz
Approx 85% complete for SRR3099586_chr18.fastq.gz
Approx 90% complete for SRR3099586_chr18.fastq.gz
Approx 95% complete for SRR3099586_chr18.fastq.gz
Analysis complete for SRR3099586_chr18.fastq.gz
Started analysis of SRR3099587_chr18.fastq.gz
Approx 5% complete for SRR3099587_chr18.fastq.gz
Approx 10% complete for SRR3099587_chr18.fastq.gz
Approx 15% complete for SRR3099587_chr18.fastq.gz
Approx 20% complete for SRR3099587_chr18.fastq.gz
Approx 25% complete for SRR3099587_chr18.fastq.gz
Approx 30% complete for SRR3099587_chr18.fastq.gz
Approx 35% complete for SRR3099587_chr18.fastq.gz
Approx 40% complete for SRR3099587_chr18.fastq.gz
Approx 45% complete for SRR3099587_chr18.fastq.gz
Approx 50% complete for SRR3099587_chr18.fastq.gz
Approx 55% complete for SRR3099587_chr18.fastq.gz
Approx 60% complete for SRR3099587_chr18.fastq.gz
Approx 65% complete for SRR3099587_chr18.fastq.gz
Approx 70% complete for SRR3099587_chr18.fastq.gz
Approx 75% complete for SRR3099587_chr18.fastq.gz
Approx 80% complete for SRR3099587_chr18.fastq.gz
Approx 85% complete for SRR3099587_chr18.fastq.gz
Approx 90% complete for SRR3099587_chr18.fastq.gz
Approx 95% complete for SRR3099587_chr18.fastq.gz
Analysis complete for SRR3099587_chr18.fastq.gz
[Wed Oct  2 15:17:25 2024]
Finished job 1.
1 of 3 steps (33%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Oct  2 15:17:25 2024]
localrule multiqc:
    input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    jobid: 2
    reason: Input files updated by another job: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip
    resources: tmpdir=/tmp

multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip

  /// MultiQC 🔍 | v1.13

|           multiqc | MultiQC Version v1.25.1 now available!
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/ngoue/day2-session1/FastQC/SRR3099585_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/ngoue/day2-session1/FastQC/SRR3099586_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/ngoue/day2-session1/FastQC/SRR3099587_chr18_fastqc.zip
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 3/3  
|            fastqc | Found 3 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data
|           multiqc | MultiQC complete
[Wed Oct  2 15:17:30 2024]
Finished job 2.
2 of 3 steps (67%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Oct  2 15:17:30 2024]
localrule all:
    input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, multiqc_report.html, multiqc_data
    jobid: 0
    reason: Input files updated by another job: FastQC/SRR3099585_chr18_fastqc.zip, multiqc_report.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html, multiqc_data, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.zip
    resources: tmpdir=/tmp

[Wed Oct  2 15:17:30 2024]
Finished job 0.
3 of 3 steps (100%) done
Complete log: .snakemake/log/2024-10-02T151637.309287.snakemake.log

Snakemake detects that the output files of SRR3099587_chr18 are missing (“reason: Missing output files: FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.zip”) and also that “Set of input files has changed since last execution” so it re-ran the fastqc rule.

Have a look at your output folder, you should now have 6 files in there:

login@node06:/shared/projects/2417_wf4bioinfo/login/day2-session1$ ls FastQC
SRR3099585_chr18_fastqc.html  SRR3099586_chr18_fastqc.html  SRR3099587_chr18_fastqc.html  
SRR3099585_chr18_fastqc.zip   SRR3099586_chr18_fastqc.zip   SRR3099587_chr18_fastqc.zip

Have you noticed that, up until now, Snakemake re-executed FastQC on all inputs every time the fastqc rule was run, indifferent of the fact that the output files were already generated or not?

No? Have a closer look at the output log that you just got… In fact, if you look closely, fastqc is always run only once but on a variable number of inputs (depending on the input of our rule):

Job stats:
job        count
-------  -------
all            1
fastqc         1
multiqc        1
total          3
...

fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz Data/SRR3099586_chr18.fastq.gz Data/SRR3099587_chr18.fastq.gz

Snakemake re-runs fastqc on all inputs because we gave as input to the fastqc rule a list of files rather than just a single file. However, this is sub-optimal, especially on a computer cluster on which you have access to a large amount of available resources to run your jobs on. We’ll see how to change this in the next objective.

Objective 6 - individual vs cumulative rules

Create a new snakefile named ex1_o6.smk in which fastqc runs on each input individually.

Where to start?

Our culprit: the expand() function in the fastqc rule provides a list of files which are then interpreted as a combined input rather than individual files.

change the fastqc rule: remove the expand() in the input and output of the fastqc rule but keep the wildcard-containing strings

Click to see an example solution

Your code for ex1_o6.smk should look like this:

SAMPLES=["SRR3099585_chr18","SRR3099586_chr18","SRR3099587_chr18"]

rule all:
  input:
    expand("FastQC/{sample}_fastqc.html", sample=SAMPLES),
    expand("FastQC/{sample}_fastqc.zip", sample=SAMPLES),
    "multiqc_report.html",
    "multiqc_data",

rule fastqc: 
  input:
    "Data/{sample}.fastq.gz" 
  output:
    "FastQC/{sample}_fastqc.zip",  
    "FastQC/{sample}_fastqc.html" 
  shell: "fastqc --outdir FastQC {input}"

rule multiqc:
  input:
    expand("FastQC/{sample}_fastqc.zip", sample = SAMPLES)
  output:
    "multiqc_report.html",
    directory("multiqc_data")
  shell: "multiqc {input}"

Explanation: we used wildcards to “generalise” the input and output of the fastqc rule. You can see Data/{sample}.fastq.gz, FastQC/{sample}_fastqc.zip and FastQC/{sample}_fastqc.html as “templates” (with “{sample}” being the only variable part) for input and output file names for the fastqc rule.

Of note, {sample} doesn’t have to match the wildcard name given in the previous expand functions. In theory, we could have used any other wildcard name as long as input and output directives of a same rule match (e.g. {mysample} instead of {sample}).

Test the script

Next, let’s check again if your pipeline works:

snakemake -s ex1_o6.smk --cores 1 -p

Observe the output

You should see something similar to the following output on your screen.

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job        count
-------  -------
all            1
fastqc         3
multiqc        1
total          5

Select jobs to execute...
Execute 1 jobs...

[Wed Oct  2 15:36:11 2024]
localrule fastqc:
    input: Data/SRR3099587_chr18.fastq.gz
    output: FastQC/SRR3099587_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.html
    jobid: 3
    reason: Set of input files has changed since last execution
    wildcards: sample=SRR3099587_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3099587_chr18.fastq.gz
application/gzip
Started analysis of SRR3099587_chr18.fastq.gz
Approx 5% complete for SRR3099587_chr18.fastq.gz
Approx 10% complete for SRR3099587_chr18.fastq.gz
Approx 15% complete for SRR3099587_chr18.fastq.gz
Approx 20% complete for SRR3099587_chr18.fastq.gz
Approx 25% complete for SRR3099587_chr18.fastq.gz
Approx 30% complete for SRR3099587_chr18.fastq.gz
Approx 35% complete for SRR3099587_chr18.fastq.gz
Approx 40% complete for SRR3099587_chr18.fastq.gz
Approx 45% complete for SRR3099587_chr18.fastq.gz
Approx 50% complete for SRR3099587_chr18.fastq.gz
Approx 55% complete for SRR3099587_chr18.fastq.gz
Approx 60% complete for SRR3099587_chr18.fastq.gz
Approx 65% complete for SRR3099587_chr18.fastq.gz
Approx 70% complete for SRR3099587_chr18.fastq.gz
Approx 75% complete for SRR3099587_chr18.fastq.gz
Approx 80% complete for SRR3099587_chr18.fastq.gz
Approx 85% complete for SRR3099587_chr18.fastq.gz
Approx 90% complete for SRR3099587_chr18.fastq.gz
Approx 95% complete for SRR3099587_chr18.fastq.gz
Analysis complete for SRR3099587_chr18.fastq.gz
[Wed Oct  2 15:36:25 2024]
Finished job 3.
1 of 5 steps (20%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Oct  2 15:36:25 2024]
localrule fastqc:
    input: Data/SRR3099585_chr18.fastq.gz
    output: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099585_chr18_fastqc.html
    jobid: 1
    reason: Set of input files has changed since last execution
    wildcards: sample=SRR3099585_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3099585_chr18.fastq.gz
application/gzip
Started analysis of SRR3099585_chr18.fastq.gz
Approx 5% complete for SRR3099585_chr18.fastq.gz
Approx 10% complete for SRR3099585_chr18.fastq.gz
Approx 15% complete for SRR3099585_chr18.fastq.gz
Approx 20% complete for SRR3099585_chr18.fastq.gz
Approx 25% complete for SRR3099585_chr18.fastq.gz
Approx 30% complete for SRR3099585_chr18.fastq.gz
Approx 35% complete for SRR3099585_chr18.fastq.gz
Approx 40% complete for SRR3099585_chr18.fastq.gz
Approx 45% complete for SRR3099585_chr18.fastq.gz
Approx 50% complete for SRR3099585_chr18.fastq.gz
Approx 55% complete for SRR3099585_chr18.fastq.gz
Approx 60% complete for SRR3099585_chr18.fastq.gz
Approx 65% complete for SRR3099585_chr18.fastq.gz
Approx 70% complete for SRR3099585_chr18.fastq.gz
Approx 75% complete for SRR3099585_chr18.fastq.gz
Approx 80% complete for SRR3099585_chr18.fastq.gz
Approx 85% complete for SRR3099585_chr18.fastq.gz
Approx 90% complete for SRR3099585_chr18.fastq.gz
Approx 95% complete for SRR3099585_chr18.fastq.gz
Analysis complete for SRR3099585_chr18.fastq.gz
[Wed Oct  2 15:36:39 2024]
Finished job 1.
2 of 5 steps (40%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Oct  2 15:36:39 2024]
localrule fastqc:
    input: Data/SRR3099586_chr18.fastq.gz
    output: FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.html
    jobid: 2
    reason: Set of input files has changed since last execution
    wildcards: sample=SRR3099586_chr18
    resources: tmpdir=/tmp

fastqc --outdir FastQC Data/SRR3099586_chr18.fastq.gz
application/gzip
Started analysis of SRR3099586_chr18.fastq.gz
Approx 5% complete for SRR3099586_chr18.fastq.gz
Approx 10% complete for SRR3099586_chr18.fastq.gz
Approx 15% complete for SRR3099586_chr18.fastq.gz
Approx 20% complete for SRR3099586_chr18.fastq.gz
Approx 25% complete for SRR3099586_chr18.fastq.gz
Approx 30% complete for SRR3099586_chr18.fastq.gz
Approx 35% complete for SRR3099586_chr18.fastq.gz
Approx 40% complete for SRR3099586_chr18.fastq.gz
Approx 45% complete for SRR3099586_chr18.fastq.gz
Approx 50% complete for SRR3099586_chr18.fastq.gz
Approx 55% complete for SRR3099586_chr18.fastq.gz
Approx 60% complete for SRR3099586_chr18.fastq.gz
Approx 65% complete for SRR3099586_chr18.fastq.gz
Approx 70% complete for SRR3099586_chr18.fastq.gz
Approx 75% complete for SRR3099586_chr18.fastq.gz
Approx 80% complete for SRR3099586_chr18.fastq.gz
Approx 85% complete for SRR3099586_chr18.fastq.gz
Approx 90% complete for SRR3099586_chr18.fastq.gz
Approx 95% complete for SRR3099586_chr18.fastq.gz
Analysis complete for SRR3099586_chr18.fastq.gz
[Wed Oct  2 15:36:52 2024]
Finished job 2.
3 of 5 steps (60%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Oct  2 15:36:52 2024]
localrule multiqc:
    input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    jobid: 4
    reason: Input files updated by another job: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip
    resources: tmpdir=/tmp

multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip

  /// MultiQC 🔍 | v1.13

|           multiqc | MultiQC Version v1.25.1 now available!
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/ngoue/day2-session1/FastQC/SRR3099585_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/ngoue/day2-session1/FastQC/SRR3099586_chr18_fastqc.zip
|           multiqc | Search path : /shared/ifbstor1/projects/2417_wf4bioinfo/ngoue/day2-session1/FastQC/SRR3099587_chr18_fastqc.zip
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 3/3  
|            fastqc | Found 3 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data
|           multiqc | MultiQC complete
[Wed Oct  2 15:36:58 2024]
Finished job 4.
4 of 5 steps (80%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Oct  2 15:36:58 2024]
localrule all:
    input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip, multiqc_report.html, multiqc_data
    jobid: 0
    reason: Input files updated by another job: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, multiqc_data, FastQC/SRR3099587_chr18_fastqc.zip, multiqc_report.html, FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html
    resources: tmpdir=/tmp

[Wed Oct  2 15:36:58 2024]
Finished job 0.
5 of 5 steps (100%) done
Complete log: .snakemake/log/2024-10-02T153608.543855.snakemake.log

We can see that each input is now run with FastQC individually. You can see this when you look at the “Job stats” table (3 fastqc jobs), but also when you look at the fastqc command lines that were run (there is now 1 command per file). Note that Snakemake’s order of execution can be quite random for independent jobs.

Job stats:
job        count
-------  -------
all            1
fastqc         3
multiqc        1
total          5

Recap

Now you have your first functional pipeline! To get there, we’ve seen:

snakefile: rule and workflow concepts
how rules are linked thanks to input/output files, and the use of a target rule
how to generalise the inputs of a rule and the expand function

As a last step to this exercise, we will see below:

how to visualise your pipeline
how to simulate the execution with dry-run

Visualising your pipeline

Now that we have a complete pipeline, let’s ask Snakemake to make us a summary plot. This is especially interesting for big pipelines to make sure that your workflow is doing what you think it should be doing!

Snakemake uses the dot tool from the graphviz package to create diagrams of the complete workflow with the --dag option or of the rule dependencies with the --rulegraph option:

snakemake --dag -s ex1_o6.smk --cores 1 -p | dot -Tpng > ex1_o6_dag.png
snakemake --rulegraph -s ex1_o6.smk --cores 1 -p | dot -Tpng > ex1_o6_rule.png

Diagramme of the complete workflow

Diagramme of the rule dependencies

As you can see above, all fastqc outputs have arrows going to both the multiqc and to the target “all” rule. This is linked to what we specified in your input directive of the “all” rule in our snakefile. In theory, it only needs the multiqc output, but we decided to also specify the fastqc outputs in there. Both options should work.

You can also see, at the top of the left plot, that Snakemake specifies the different values of our “sample” placeholder.

Dry runs

Another way of verifying your pipeline is to use the dry run option. Snakemake will then simulate the run and you can see from the log on your screen what steps it would have executed and why.

For this, you can whether add -n (short) or --dryrun (long) to your snakemake command line:

# -R multiqc is used to force re-execution of the multiqc rule
snakemake -s ex1_o6.smk --cores 1 -p -R multiqc --dry-run

And you’ll get something like this as output:

Building DAG of jobs...
Job stats:
job        count
-------  -------
all            1
multiqc        1
total          2

Execute 1 jobs...

[Thu Feb  8 14:14:51 2024]
localrule multiqc:
    input: FastQC/SRR3099585_chr18_fastqc.zip, FastQC/SRR3099586_chr18_fastqc.zip, FastQC/SRR3099587_chr18_fastqc.zip
    output: multiqc_report.html, multiqc_data
    jobid: 0
    reason: Forced execution
    resources: tmpdir=<TBD>

multiqc FastQC/SRR3099585_chr18_fastqc.zip FastQC/SRR3099586_chr18_fastqc.zip FastQC/SRR3099587_chr18_fastqc.zip
Execute 1 jobs...

[Thu Feb  8 14:14:51 2024]
localrule all:
    input: FastQC/SRR3099585_chr18_fastqc.html, FastQC/SRR3099586_chr18_fastqc.html, FastQC/SRR3099587_chr18_fastqc.html, multiqc_report.html
    jobid: 4
    reason: Input files updated by another job: multiqc_report.html, multiqc_data
    resources: tmpdir=<TBD>

Job stats:
job        count
-------  -------
all            1
multiqc        1
total          2

Reasons:
    (check individual jobs above for details)
    forced:
        multiqc
    input files updated by another job:
        all

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.