Through ssh:
ssh -X -o ServerAliveInterval=60 -l mylogin core.cluster.france-bioinformatique.fr
For more information: https://ifb-elixirfr.gitlab.io/cluster/doc/logging-in/
Or through OpenOnDemand:
For more information: https://ifb-elixirfr.gitlab.io/cluster/doc/software/openondemand/
cd /shared/projects/2417_wf4bioinfo/
mkdir -p $USER/day2-introsmk
cp snakemake-workshop/day2-introsmk/Snakefile $USER/day2-introsmk
cd $USER/day2-introsmk
The following sections are divided into questions (Q) and tasks (T). Explanations and answers can be found in the corresponding sections by expanding the dedicated links.
A colleague gave me a Snakemake pipeline, now what?!
Q1: plot the rule graph to visualise the pipeline & understand what it does
Hints:
Load the snakemake and graphviz modules before running snakemake.
Be in the same folder as the Snakefile, or use the -s path/to/Snakefile option.
Make sure you’re in the right folder:
cd /shared/projects/2417_wf4bioinfo/$USER/day2-introsmk
Load the right modules:
module load graphviz/2.40.1
module load snakemake/8.9.0
Run snakemake:
snakemake --rulegraph | dot -Tpng > rule.png
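The PNG is written on the cluster, so to actually look at it you can either open it through OpenOnDemand or copy it to your own machine. For example (an illustrative command, run from your local computer, replacing mylogin with your own login):
scp mylogin@core.cluster.france-bioinformatique.fr:/shared/projects/2417_wf4bioinfo/mylogin/day2-introsmk/rule.png .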
Q2: find, among the snakemake options, how to list rules and possible target rules
Hints:
Use snakemake --help to get the help; you can combine it with grep, for example.
The options are --list-rules and --list-target-rules:
snakemake --list-rules
snakemake --list-target-rules
Rule list:
all
fusionFasta
loadData
mafft
The list of rules output by the command corresponds to the ones we see in the rule graph.
Target rule list:
all
There seems to be only one official target rule in the Snakefile:
all
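Side note: the output of snakemake --help is quite long; as the hint above suggests, you can filter it with grep to find these options, for example:
snakemake --help | grep -- "--list"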
T3: let’s have a look at the Snakefile
Open up the Snakefile with your favorite editor (e.g. vi, nano, emacs, …) or viewer (e.g. less, more, cat, …).
Look at the rules in the same order as in the rule graph & try to understand what they do (pay specific attention to the shell directives).
Explanation:
loadData: runs wget to fetch fasta sequences from the UniProt website
fusionFasta: runs cat to concatenate the 2 fasta files
mafft: runs mafft to align the 2 sequences within the fused output of fusionFasta
all: specifies the final output. It’s the target rule because it’s the first rule in the Snakefile.
In a nutshell: this pipeline doesn’t need any input files; it downloads sequences given their UniProt identifiers directly from the web, then fuses them into a single file to align them with mafft.
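To make this more concrete, here is a minimal sketch of what such a Snakefile could look like. It is reconstructed from the rule names, wildcards and file paths seen in this exercise rather than copied from the workshop Snakefile, and the UniProt download URL is only illustrative:
rule all:
    input:
        "results/alignment/P10415_P01308_aligned.fasta"    # the final file we want

rule loadData:
    output:
        "results/data/{sample}.fasta"
    shell:
        "wget -O {output} https://rest.uniprot.org/uniprotkb/{wildcards.sample}.fasta"

rule fusionFasta:
    # constrain the IDs so this rule cannot clash with the mafft rule below
    wildcard_constraints:
        uniprotid1="[^_/]+",
        uniprotid2="[^_/]+"
    input:
        "results/data/{uniprotid1}.fasta",
        "results/data/{uniprotid2}.fasta"
    output:
        "results/alignment/{uniprotid1}_{uniprotid2}.fasta"
    shell:
        "cat {input} > {output}"

rule mafft:
    input:
        "results/alignment/{prefix}.fasta"
    output:
        "results/alignment/{prefix}_aligned.fasta"
    shell:
        "mafft {input} > {output}"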
Side note: Python within Snakemake
If you’re familiar with Python, you’ll have recognised plenty of Python-like bits in this Snakefile. Although you don’t have to know Python to understand Snakemake, it’s still a good thing to keep in mind while you’re using it. If you have a look at more advanced code, you’ll see that you can transfer quite a bit of your Python skills into Snakemake (you can import packages for example, or create functions). Also, keep in mind that, as in Python, Snakemake is sensitive to indentation.
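For instance, here is a purely illustrative snippet (hypothetical names, not part of the workshop Snakefile) mixing plain Python with a rule:
import os                                  # any Python import works at the top of a Snakefile

SAMPLES = ["P10415", "P01308"]             # a plain Python list

def fasta_path(sample):                    # a plain Python helper function
    return os.path.join("results", "data", sample + ".fasta")

rule exampleRule:
    input:
        [fasta_path(s) for s in SAMPLES]   # Python expressions can be used inside directives
    output:
        "results/example_concatenated.fasta"
    shell:
        "cat {input} > {output}"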
Q4: dry-run the code
Dry-running the code is a good way of doing a final check to make sure everything will work once you run the pipeline on the compute nodes of the cluster.
NB: dry runs don’t actually use any major resources, so running one directly from the cluster’s login (master) node is OK.
Hints:
Load the snakemake module before running snakemake (if not already done).
Be in the same folder as the Snakefile, or use the -s path/to/Snakefile option.
Check snakemake --help to find the right option.
Make sure you’re in the right folder:
cd /shared/projects/2417_wf4bioinfo/$USER/day2-introsmk
Load the right modules:
module load snakemake/8.9.0
Run snakemake:
snakemake --dry-run
You should get an output on your screen similar to the following:
Building DAG of jobs...
Job stats:
job count
----------- -------
all 1
fusionFasta 1
loadData 2
mafft 1
total 5
Execute 2 jobs...
[Mon Aug 19 08:32:28 2024]
localrule loadData:
output: results/data/P01308.fasta
jobid: 4
reason: Missing output files: results/data/P01308.fasta
wildcards: sample=P01308
resources: tmpdir=<TBD>
[Mon Aug 19 08:32:28 2024]
localrule loadData:
output: results/data/P10415.fasta
jobid: 3
reason: Missing output files: results/data/P10415.fasta
wildcards: sample=P10415
resources: tmpdir=<TBD>
Execute 1 jobs...
[Mon Aug 19 08:32:28 2024]
localrule fusionFasta:
input: results/data/P10415.fasta, results/data/P01308.fasta
output: results/alignment/P10415_P01308.fasta
jobid: 2
reason: Missing output files: results/alignment/P10415_P01308.fasta; Input files updated by another job: results/data/P10415.fasta, results/data/P01308.fasta
wildcards: uniprotid1=P10415, uniprotid2=P01308
resources: tmpdir=<TBD>
Execute 1 jobs...
[Mon Aug 19 08:32:28 2024]
localrule mafft:
input: results/alignment/P10415_P01308.fasta
output: results/alignment/P10415_P01308_aligned.fasta
jobid: 1
reason: Missing output files: results/alignment/P10415_P01308_aligned.fasta; Input files updated by another job: results/alignment/P10415_P01308.fasta
wildcards: prefix=P10415_P01308
resources: tmpdir=<TBD>
Execute 1 jobs...
[Mon Aug 19 08:32:28 2024]
localrule all:
input: results/alignment/P10415_P01308_aligned.fasta
jobid: 0
reason: Input files updated by another job: results/alignment/P10415_P01308_aligned.fasta
resources: tmpdir=<TBD>
Job stats:
job count
----------- -------
all 1
fusionFasta 1
loadData 2
mafft 1
total 5
Reasons:
(check individual jobs above for details)
input files updated by another job:
all, fusionFasta, mafft
output files have to be generated:
fusionFasta, loadData, mafft
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
The Job stats section (printed near the top of the log and repeated near the end) tells you that 5 jobs will be run in total, and the Reasons section (at the very end) explains why. In this case, output files are missing, and depending on which ones are missing, some or all of the rules will have to be rerun.
In the log, there is also one section dedicated to each job that would have been run by Snakemake (e.g. loadData for P01308). Each section shows the input and output file paths, the reason why the job is run, and information specific to that job, i.e. the resources used, its jobid, the wildcard values…
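Tip: if you also want the dry run to print the exact shell command each job would execute, you can add the --printshellcmds (or -p) option:
snakemake --dry-run --printshellcmds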
Let’s run the pipeline on the IFB cluster. We’ll use a SLURM submission script to run the Snakefile within a SLURM job. SLURM quick-start guide: https://ifb-elixirfr.gitlab.io/cluster/doc/quick-start/
Q5: Create a SLURM submission script
Hints:
Use the 2417_wf4bioinfo project with --account.
Load the snakemake and mafft modules before running snakemake.
Run snakemake -c 1 from within the folder containing the Snakefile.
Example run_script.sh:
#!/bin/bash
#SBATCH --job-name=day2-introsmk
#SBATCH --mem=100Mb
#SBATCH --account=2417_wf4bioinfo
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
# load necessary modules
module load snakemake/8.9.0
module load mafft/7.515
# move to WD
cd /shared/projects/2417_wf4bioinfo/$USER/day2-introsmk
# run snakemake with minimum options
snakemake -c 1
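Side note: this script asks SLURM for a single task and gives Snakemake a single core (snakemake -c 1), so jobs will run one after the other. If you wanted Snakemake to run several jobs in parallel, you would request matching resources from SLURM, for example (illustrative values):
#SBATCH --cpus-per-task=4
snakemake -c 4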
Q6: Run the pipeline (i.e. submit the script) & follow your job
Assuming you have all SLURM options within your run script:
sbatch run_script.sh
Then follow your submitted jobs with:
squeue -u $USER
or
squeue --me
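Once the job has disappeared from the queue, you can check how it went through SLURM’s accounting (if enabled on the cluster), for example with sacct, replacing the job ID with the one sbatch printed for you:
sacct -j 41257618 --format=JobID,JobName,State,Elapsed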
T8: Have a look at your working directory
For example with ls or with tree. You see a results folder (the output of the pipeline), your script, the Snakefile and the SLURM output log. At first glance, all outputs seem to have been generated.
Output of the tree command:
.
├── results
│ ├── alignment
│ │ ├── P10415_P01308_aligned.fasta
│ │ └── P10415_P01308.fasta
│ └── data
│ ├── P01308.fasta
│ └── P10415.fasta
├── run_script.sh
├── slurm-41257618.out
└── Snakefile
3 directories, 7 files
T9: Have a closer look at SLURM’s output log:
For example with cat slurm-41257618.out. The output should look a lot like the one you generated during the dry run. We can also see from the log that the pipeline ended without an error, cf. 5 of 5 steps (100%) done and no mention of errors:
[Fri Aug 16 18:35:03 2024]
Finished job 0.
5 of 5 steps (100%) done
Complete log: .snakemake/log/2024-08-16T183501.992097.snakemake.log
As you can see, Snakemake also saves the log printed on the screen in a <date-time>.snakemake.log file within the .snakemake folder.
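You can open that log like any other text file (adapting the timestamp to your own run), for example:
less .snakemake/log/2024-08-16T183501.992097.snakemake.log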
T10: Have a quick glance at the .snakemake folder:
.snakemake is a hidden folder and is the default working directory of Snakemake. It’s generated within the directory in which Snakemake is run (you can change this with the --directory option). This folder contains log files, metadata (such as run statistics) and environments that Snakemake collects and generates during a run. Thus, it can get quite big (in volume & number of files) and is not automatically deleted.
.snakemake/
├── auxiliary
├── conda
├── conda-archive
├── incomplete
├── locks
├── log
│ └── 2024-08-16T183501.992097.snakemake.log
├── metadata
│ ├── cmVzdWx0cy9hbGlnbm1lbnQvUDEwNDE1X1AwMTMwOC5mYXN0YQ==
│ ├── cmVzdWx0cy9hbGlnbm1lbnQvUDEwNDE1X1AwMTMwOF9hbGlnbmVkLmZhc3Rh
│ ├── cmVzdWx0cy9kYXRhL1AwMTMwOC5mYXN0YQ==
│ └── cmVzdWx0cy9kYXRhL1AxMDQxNS5mYXN0YQ==
├── shadow
└── singularity
9 directories, 5 files
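Since it can grow large, it can be worth checking its size from time to time, for example with du:
du -sh .snakemake/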
Accomplished objectives:
Command summary for Snakemake:
snakemake --cores N (N: number of cores accessible to Snakemake)
-s path/to/Snakefile if the Snakefile is called differently or is somewhere else
--dry-run to run a dry run of your pipeline
--help to get the list of all options
snakemake [-s path/to/Snakefile] --dag | dot -Tpng > dag.png (all jobs and rules)
snakemake [-s path/to/Snakefile] --rulegraph | dot -Tpng > rulegraph.png (just the rules)
snakemake [-s path/to/Snakefile] --filegraph | dot -Tpng > filegraph.png (a condensed version of the dag)
T11: The last task of this exercise: clean up your current directory
rm -rf .snakemake/
You can also delete the results and SLURM’s output file if you wish.