s |
focus search bar ( enter to select, ▲ / ▼ to change selection) |
g c |
go to cluster |
g e |
go to edctools |
g f |
go to facility |
g g |
go to guidelines |
h |
toggle this help ( esc also exits) |
Maintained by BiBs. Last update : 20/06/2023. Methylator v0.1.
Implemented by BiBs-EDC, this workflow for DNA methylation data analysis runs effectively on both IFB and iPOP-UP clusters. If you encounter troubles or need additional tools or features, you can create an issue on the GitHub repository, email directly BiBs, or pass by the 366b room.
metadata.tsv
and config_main.yaml
sbatch Workflow.sh wgbs
Here is a simplified scheme of the workflow. The main steps are indicated in the blue boxes. Methylator will allow you to choose which steps you want to execute for your project. In the green circles are the input files you have to give for the different steps.
Please find instructions using the links below
In order to install Methylator, you have to clone the Methylator GitHub repository to your cluster project.
If you’re using the Jupyter Hub on IFB, you can open a Terminal by clicking on the corresponding icon.
Before clonning Methylator, go to your project using the cd
command.
[username @ cpu-node-12 ]$ cd /shared/projects/YourProjectName
Now you can clone the repository (use -b v0.1
to specify the version).
[username@clust-slurm-client YourProjectName]$ git clone https://github.com/parisepigenetics/Methylator
Cloning into 'Methylator'...
remote: Enumerating objects: 670, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 670 (delta 1), reused 31 (delta 1), pack-reused 628
Receiving objects: 100% (670/670), 999.24 MiB | 32.76 MiB/s, done.
Resolving deltas: 100% (315/315), done.
Checking out files: 100% (84/84), done.
Enter Methylator
directory (cd
) and look at the files using tree
or ls
.
[username@clust-slurm-client YourProjectName]$ cd Methylator
[username@clust-slurm-client Methylator]$ tree -L 2
.
????????????????
Methylator is launched as a python script named main_cluster.py
which calls the workflow manager named Snakemake. Snakemake will execute rules that are defined in workflow/xxx.rules
and distribute the corresponding jobs to the computing nodes via SLURM.
On the cluster, the main python script is launched via the shell script Workflow.sh
, which basically contains only one command python main_cluster.py
(+ loading of basic modules and information about the run).
A configuration file with the specifity of your cluster is needed. It has to be named configs/cluster_config.yaml
. Predefined files are available for IFB and iPOP-UP clusters.
For IFB, you should type:
cp configs/cluster_config_ifb.yaml configs/cluster_config.yaml
For iPOP-UP:
cp configs/cluster_config_ipop.yaml configs/cluster_config.yaml
For other clusters, you have to edit the file yourself…
Before running your analyses you can use the test dataset to make and check your installation. First copy the configuration file corresponding to the test.
[username@clust-slurm-client Methylator]$ cp TestDataset/configs/config_main.yaml configs/
Then start the workflow.
[username@clust-slurm-client Methylator]$ sbatch Workflow.sh wgbs
This will run the quality control of the raw FASTQ. See FASTQ quality control for detailed explanations. If everything goes find you will see the results in TestDataset/results/Test1/fastqc
. See also how to follow your jobs to know how to check that the run went fine.
You can now move on with your own data, or run the rest of the workflow on the test dataset. To do so you have to modify configs/config_main.yaml
turning QC
entry from “yes” to “no”. If you don’t know how to do that, see Preparing the run. Then restart the workflow.
[username@clust-slurm-client Methylator]$ sbatch Workflow.sh wgbs
Detailed explanation of the outputs are available in Workflow results.
If you want to use your own data, you should transfer the FASTQ files into your project folder /shared/projects/YourProjectName
before doing your analysis. Alternatively the workflow allows you to download data from SRA simply giving the SRRxxx
IDs, see below metadata.tsv.
The workflow is expecting gzip-compressed FASTQ files with names formatted as
SampleName_R1.fastq.gz
and SampleName_R2.fastq.gz
for pair-end data,SampleName.fastq.gz
for single-end data.If your files are not fitting this format, please see how to correct the names of a batch of FASTQ files.
It is highly recommended to check the md5sum for big files. If your raw FASTQ files are on your computer in PathTo/MethylProject/Fastq/
, you type in a terminal:
You@YourComputer:~$ cd PathTo/MethylProject
You@YourComputer:~/PathTo/MethylProject$ md5sum Fastq/* > Fastq/fastq.md5
You can then copy the Fastq
folder to the cluster using rsync
, replacing username
by your login:
You@YourComputer:~/PathTo/MethylProject$ rsync -avP Fastq/ username@core.cluster.france-bioinformatique.fr:/shared/projects/YourProjectName/Raw_fastq
In this example the FASTQ files are copied from PathTo/MethylProject/Fastq/
on your computer into a folder named Raw_fastq
in your project folder on IFB core cluster. On iPOP-UP cluster, only the address is different:
You@YourComputer:~/PathTo/MethylProject$ rsync -avP Fastq/ username@ipop-up.rpbs.univ-paris-diderot.fr:/shared/projects/YourProjectName/Raw_fastq
Feel free to name your folders as you want! You will be asked to enter your password, and then the transfer will begin. If it stops before the end, rerun the last command, it will only add the incomplete/missing files.
After the transfer, connect to the cluster (IFB, iPOP-UP) and check the presence of the files in Raw_fastq
using ls
or ll
command.
[username@clust-slurm-client YourProjectName]$ ll Raw_fastq
Check that the transfer went fine using md5sum
.
[username@clust-slurm-client YourProjectName]$ cd Raw_fastq
[username@clust-slurm-client Raw_fastq]$ md5sum -c fastq.md5
There are 2 files that you have to modify before running your analysis: metadata.tsv
and the yaml file corresponding to your sequencing technology (config_wgbs.yaml
or config_nanopore.yaml
) in the configs
folder. They control the workflow and many tool parameters.
To modify the text files from the terminal you can use vi or nano on iPOP-UP cluster, plus emacs and gedit (the last one being easier to use) on IFB.
You can also work on your computer and copy the files to the cluster using the scp
command or the graphic interface FileZilla.
You can find useful help to manage your data on IFB core documentation.
If you’re using the IFB Jupyter Hub, it’s easier as text and table editors are included, you just have to double click on the file you want to edit, modify and save it using the menu File/Save or Ctrl+S.
The experimental description is set up in config/metadata.tsv
:
[username@clust-slurm-client Methylator]$ cat configs/metadata.tsv
sample group
SRR9016926 WT
SRR9016927 1KO
SRR9016928 DKO
SRR11806587 WT
SRR11806588 1KO
SRR11806589 DKO
On Jupyter Hub:
The first column contains the sample names that have to correspond to the FASTQ names (for instance here D197-D192T27_R1.fastq.gz). The second column describes the group the sample belongs to and will be used for differential expression analysis. You can rename or move that file, as long as you adapt the METAFILE
entry in config_main.yaml
(see below).
The configuration of the workflow for BSseq (WGBS or RRBS) data (see step by step description below) is done in config/config_wgbs.yaml
.
This configuration file contains 3 parts:
[username@clust-slurm-client Methylator]$ cat configs/config_main.yaml
# Project name
PROJECT: Awesome_experience
???????????????????
Here you define where the FASTQ files are stored, where is the file describing the experimental set up, the name and localization of the folders where the results will be saved. The results (detailed in Workflow results) are separated into two folders:
BIGDATAPATH
RESULTPATH
/
). Here you also precise if your data are paired-end or single-end and the number of CPUs you want to use for your analysis.# ================== Shared parameters for some or all of the sub-workflows ==================
## key file if the data is stored remotely, otherwise leave it empty
KEY:
## the path to fastq files
READSPATH: /shared/projects/YourProjectName/Raw_fastq
????????????
Here you precise parameters that are specific to one of the steps of the workflow. See detailed description in step by step analysis.
??????
When the configuration files are ready, you can start the run by sbatch Workflow.sh wgbs
.
[username@clust-slurm-client Methylator]$ sbatch Workflow.sh wgbs
Please see below detailed explanation.
Prerequisite:
/shared/projects/YourProjectName/Raw_fastq
(but you can name your folders as you want, as long as you adjust the READSPATH
parameter in config_main.yaml
).config/metadata.tsv
according to your experimental design (with sample names or SRR identifiers).Now you have to check in config/config_main.yaml
that:
# Project name
PROJECT: EXAMPLE
Control of the workflow
, QC
or SRA
is set to yes
:If you want to download data from SRA, set SRA
to yes
. The QC will follow automatically. If you use your own data, put QC
to yes
and SRA
to no
.
## Do you want to download FASTQ files from public from Sequence Read Archive (SRA) ?
SRA: no # "yes" or "no". If set to "yes", the workflow will stop after the QC to let you decide whether you want to trim your raw data or not. In order to run the rest of the workflow, you have to set it to "no".
## Do you need to do quality control?
QC: yes # "yes" or "no". If set to "yes", the workflow will stop after the QC to let you decide whether you want to trim your raw data or not. In order to run the rest of the workflow, you have to set it to "no".
The rest of the part Control of the workflow
will be ignored. The software will stop after the QC to give you the opportunity to decide if trimming is necessary or not.
## the path to fastq files
READSPATH: /shared/projects/YourProjectName/Raw_fastq
## the meta file describing the experiment settings
METAFILE: /shared/projects/YourProjectName/Methylator/configs/metadata.tsv
## paths for intermediate and final results
BIGDATAPATH: /shared/projects/YourProjectName/Methylator/data # for big files
RESULTPATH: /shared/projects/YourProjectName/Methylator/results
???????????
When this is done, you can start the QC by running:
[username@clust-slurm-client Methylator]$ sbatch Workflow.sh wgbs
You can check if your job is running using squeue.
[username@clust-slurm-client Methylator]$ squeue --me
You should also check SLURM output files. See Description of the log files.
If everything goes fine, fastQC results will be in results/EXAMPLE/fastqc/
. For every sample you will have something like:
[username@clust-slurm-client Methylator]$ ll results/EXAMPLE/fastqc
total 38537
-rw-rw----+ 1 username username 640952 May 11 15:16 Sample1_forward_fastqc.html
-rw-rw----+ 1 username username 867795 May 11 15:06 Sample1_forward_fastqc.zip
-rw-rw----+ 1 username username 645532 May 11 15:16 Sample1_reverse_fastqc.html
-rw-rw----+ 1 username username 871080 May 11 15:16 Sample1_reverse_fastqc.zip
Those are individual fastQC reports. MultiQC is called after FastQC, so you will also find report_quality_control.html
that is a summary for all the samples.
You can copy those reports to your computer by typing (in a new local terminal):
You@YourComputer:~$ scp -pr username@core.cluster.france-bioinformatique.fr:/shared/projects/YourEXAMPLE/Methylator/results/EXAMPLE/fastqc PathTo/WhereYouWantToSave/
or look at them directly in the Jupyter Hub.
It’s time to decide if how much trimming you need. Trimming is generally necessary with WGBS or RRBS data.
??????? faire le trimming tout le temps????????
If you put TRIMMED: no
, there will be no trimming and the original FASTQ sequences will be mapped.
If you put TRIMMED: yes
, Trim Galore will remove low quality and very short reads, and cut the adapters. If you also want to remove a fixed number of bases in 5’ or 3’, you have to configure it. For instance if you want to remove the first 10 bases:
???????
# ================== Configuration for trimming ==================
## Number of trimmed bases
## put "no" for TRIM3 and TRIM5 if you don't want to trim a fixed number of bases.
TRIM5: 10 # integer or "no", remove N bp from the 5' end of reads. This may be useful if the qualities were very poor, or if there is some sort of unwanted bias at the 5' end.
TRIM3: no # integer or "no", remove N bp from the 3' end of reads AFTER adapter/quality trimming has been performed.
At this step you have to provide the path to your genome index as well as to a GTF annotation file and a BED file with CpG island coordinates.
[username@clust-slurm-client ~]$ tree -L 2 /shared/bank/homo_sapiens/
/shared/bank/homo_sapiens/
├── GRCh37
│ ├── bowtie2
│ ├── fasta
│ ├── gff
│ ├── star -> star-2.7.2b
│ ├── star-2.6
│ └── star-2.7.2b
├── GRCh38
│ ├── bwa
│ ├── fasta
│ ├── gff
│ ├── star -> star-2.6.1a
│ └── star-2.6.1a
├── hg19
│ ├── bowtie
│ ├── bowtie2
│ ├── bwa
│ ├── fasta
│ ├── gff
│ ├── hisat2
│ ├── picard
│ ├── star -> star-2.7.2b
│ ├── star-2.6
│ └── star-2.7.2b
└── hg38
├── bowtie2
├── fasta
├── star -> star-2.7.2b
├── star-2.6
└── star-2.7.2b
30 directories, 0 files
If you don’t find what you need, you can ask for it on IFB or iPOP-UP community support. In case you don’t have a quick answer, you can download (for instance here) or produce the indexes you need in your folder (and remove it when it’s available in the common banks). Copy the path to the file you need and then paste the link to wget
. When downloading is over, you might have to decompress the file.
[username@clust-slurm-client Methylator]$ wget ???????
[username@clust-slurm-client index]$ tar -zxvf ????
GTF files can be downloaded from GenCode (mouse and human), ENSEMBL, NCBI (RefSeq, help here), …
Similarly you can download them to the server using wget
.
CpG Islands from UCSC Table ????
Be sure you give the right path to those files and adjust the other settings to your need:
# ================== Control of the workflow ==================
??????????
For an easy visualisation on a genome browser, BigWig files are generated.
As at the moment the default project quota in 250 Go you might be exceeding the space you have (and may or may not get error messages…). So if the mapping fails, try removing files to get more space, or ask to increase your quota on IFB Community support or iPOP-UP Community support. To see the space you have you can run:
[username@clust-slurm-client Methylator]$ du -h --max-depth=1 /shared/projects/YourProjectName/
and
[username@clust-slurm-client Methylator]$ lfsgetquota YourProjectName
Finally you have to set the parameters for the differential methylation analysis. You have to define the comparisons you want to do (pairs of conditions).
COMPARISON : [["WT","1KO"], ["WT","DKO"], ["1KO","DKO"]]
When the configuration files are fully adapted to your experimental set up and needs, you can start the workflow by running:
[username@clust-slurm-client Methylator]$ sbatch Workflow.sh wgbs
The first job is the main script. This job will call one or several snakefiles (.rules
files) that define small workflows of the individual steps. There are SLURM outputs at the 3 levels.
The configuration of the run and the timing are also saved:
Where to find those outputs and what do they contain?
The output is in slurm_output
and named Methylator-xxx.out
(default) or in the specified folder if you modified Workflow.sh
. A copy is also available in the logs
folder. It contains global information about your run.
Typically the main job output looks like :
[username@clust-slurm-client Methylator]$ cat slurm_output/Methylator-???.out
???
You can see at the end if this file if an error occured during the run. See Errors.
There are 4 snakefiles (visible in the workflow
folder) that correspond to the different steps of the analysis:
The SLURM outputs of those different steps are stored in the logs
folder and named as the date plus the corresponding snakefile: for instance
20230615T1540_trim.txt
or 20230615T1540_mapping.txt
.
Here is a description of one of those files (splitted):
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 30
???
????
You have here the corresponding job ID. You can follow that particular job in slurm-7908074.out
.
[Tue May 12 17:48:30 2020]
Finished job 4.
1 of 5 steps (20%) done
[Tue May 12 17:48:30 2020]
rule trimstart:
input: /shared/projects/lxactko_analyse/RASflow/data/LXACT_1-test/trim/reads/Test_forward.fastq.gz, /shared/projects/lxactko_analyse/RASflow/data/LXACT_1-test/trim/reads/Test_reverse.fastq.gz
output: /shared/projects/lxactko_analyse/RASflow/data/LXACT_1-test/trim/Test_forward.91bp_3prime.fq.gz, /shared/projects/lxactko_analyse/RASflow/data/LXACT_1-test/trim/Test_reverse.91bp_3prime.fq.gz
jobid: 3
wildcards: sample=Test
Submitted DRMAA job 3 with external jobid 7908075.
Removing temporary output file /shared/projects/lxactko_analyse/RASflow/data/LXACT_1-test/trim/reads/Test_forward.fastq.gz.
Removing temporary output file /shared/projects/lxactko_analyse/RASflow/data/LXACT_1-test/trim/reads/Test_reverse.fastq.gz.
[Tue May 12 17:49:10 2020]
Finished job 3.
2 of 5 steps (40%) done
[Tue May 12 17:51:10 2020]
Finished job 0.
5 of 5 steps (100%) done
Complete log: /shared/mfs/data/projects/lxactko_analyse/RASflow/.snakemake/log/2020-05-12T174816.622252.snakemake.log
Every job generate a slurm-JOBID.out
file. It is localised in the working directory as long as the workflow is running. It is then moved to the slurm_output
folder. SLURM output specifies the rule, the sample (or samples) involved, and gives outputs specific to the tool:
[username@clust-slurm-client Methylator]$ cat slurm_output/
??????
Three extra files can be found in the logs
folder:
20230526T1623_running_time.txt
stores running times.[username@clust-slurm-client Methylator]$ cat logs/20230526T1623_running_time.txt
Project name: Test
Start time: Fri May 26 16:23:34 2023
Time of running trimming: 0:00:53
Time of running genome alignment: 0:02:55
Finish time: Fri May 26 16:30:09 2023
20230526T1623_configuration.txt
keeps a track of the configuration of the run (config_main.yaml
followed by metadata.tsv
, the environment contained in the Apptainer image with all tool versions, and resource configuration workflow/cluster.yaml
) ?????A changer en resources.yaml ???????[username@clust-slurm-client Methylator]$: cat logs/20230526T1623_configuration.txt
??????????
20200925T1057_free_disk.txt
stores the disk usage during the run (every minute, the remaining space is measured).
# quota:1 limit:1.5
time free_disk
20210726T1611 661
20210726T1612 660
20210726T1613 660
20210726T1614 660
20210726T1615 660
20210726T1616 660
If a run stops with no error, it can be that you don’t have enough space in your project. You can see in this file that the free disk tends to 0.
The results are separated into two folders :
configs/config_main.yaml
at BIGDATAPATH
## paths for intermediate and final results
BIGDATAPATH: /shared/projects/YourProjectName/Methylator/Big_Data # for big files
[username@clust-slurm-client Methylator]$ tree -L 2 Big_Data/EXAMPLE/
Big_Data/EXAMPLE/
???
...
configs/config_main.yaml
at RESULTPATH
RESULTPATH: /shared/projects/YourProjectName/Methylator/Results
[username@clust-slurm-client Methylator]$ tree -L 2 Results/EXAMPLE/
???????
This way you can get all the results on your computer by running (from your computer):
You@YourComputer:~$ scp -pr username@core.cluster.france-bioinformatique.fr:/shared/projects/YourProjectName/Methylator/results/EXAMPLE/ PathTo/WhereYouWantToSave/
and the huge files will stay on the server. You can of course download them as well if you have space (and this is recommended for the long term).
??? AJUSTER ???
A report named as 20210727T1030_report.html
summarizes your experiment and your results. You’ll find links to fastQC results, to mapping quality report, to exploratory analysis of all the samples and finally to pairwise differential expression analyses. Interactive plots are included in the report. They are very helpful to dig into the results.
TODO???
A compressed archive named 20210727T1030_report.tar.bz2
is also generated and contains the report and the targets of the different links, excluding big files to make it small enough to be sent to your collaborators.
An example of report is visible here.
Detailed description of all the outputs of the workflow is included below.
After trimming, the FASTQ are stored in the data folder defined in configs/config_main.yaml
at BIGDATAPATH:
.
In this examples the trim FASTQ files will be stored in /shared/projects/YourProjectName/Methylator/data/EXAMPLE/trim/
. They are named
In results/EXAMPLE/trimming
you’ll find trimming reports such as Sample1_forward.fastq.gz_trimming_report.txt
for each samples. You’ll find information about the tools and parameters, as well as trimming statistics:
SUMMARISING RUN PARAMETERS
==========================
Input filename: /shared/projects/wgbs_flow/Nanopore_data/ONT_data/rrms_2022.07/bisulfite/raw_data/COLO1_R1.fq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 4.2
Python version: could not detect
Number of cores used for trimming: 4
Quality Phred score cutoff: 28
Quality encoding type selected: ASCII+33
Using Illumina adapter for trimming (count: 37651). Second best hit was Nextera (count: 0)
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
All Read 1 sequences will be trimmed by 10 bp from their 5' end to avoid poor qualities or biases
All Read 2 sequences will be trimmed by 10 bp from their 5' end to avoid poor qualities or biases (e.g. M-bias for BS-Seq applications)
All Read 1 sequences will be trimmed by 9 bp from their 3' end to avoid poor qualities or biases
All Read 2 sequences will be trimmed by 9 bp from their 3' end to avoid poor qualities or biases
Running FastQC on the data once trimming has completed
Output file will be GZIP compressed
This is cutadapt 4.2 with Python 3.7.12
Command line parameters: -j 4 -e 0.1 -q 28 -O 1 -a AGATCGGAAGAGC /shared/projects/wgbs_flow/Nanopore_data/ONT_data/rrms_2022.07/bisulfite/raw_data/COLO1_R1.fq.gz
Processing single-end reads on 4 cores ...
Finished in 148.610 s (3.960 µs/read; 15.15 M reads/minute).
=== Summary ===
Total reads processed: 37,527,910
Reads with adapters: 19,815,625 (52.8%)
Reads written (passing filters): 37,527,910 (100.0%)
Total basepairs processed: 1,876,395,500 bp
Quality-trimmed: 11,610,210 bp (0.6%)
Total written (filtered): 1,798,598,760 bp (95.9%)
=== Adapter 1 ===
Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 19815625 times
Minimum overlap: 1
No. of allowed errors:
1-9 bp: 0; 10-13 bp: 1
Bases preceding removed adapters:
A: 49.8%
C: 24.1%
G: 13.9%
T: 12.2%
none/other: 0.0%
Overview of removed sequences
length count expect max.err error counts
1 14655096 9381977.5 0 14655096
2 252919 2345494.4 0 252919
3 408890 586373.6 0 408890
[...]
RUN STATISTICS FOR INPUT FILE: /shared/projects/wgbs_flow/Nanopore_data/ONT_data/rrms_2022.07/bisulfite/raw_data/COLO1_R1.fq.gz
=============================================
37527910 sequences processed in total
This information is summarized in the MultiQC report, see below.
After the trimming, fastQC is automatically run on the new FASTQ and the results are also in the folder results/EXAMPLE/fastqc_trimming/
:
As previously MultiQC gives a summary for all the samples : results/EXAMPLE/fastqc_trimming/report_quality_control_after_trimming.html
. You’ll find information from the trimming report (for instance you can rapidly see the % of trim reads for the different samples) as well as from fastQC. It is included in the final report (ie ????.html
).
The mapped reads are stored as deduplicated, sorted bam in the data folder, in our example in Big_Data/EXAMPLE/mapping_BOWTIE2/Deduplicate/
, together with their .bai
index. They can be visualized using a genome browser such as IGV but this is not very convenient as the files are heavy. BigWig files, that summarize the information converting the individual read positions into a number of reads per bin of a given size, are more adapted.
To facilitate visualization on a genome browser, BigWig files are generated (window size of xx ?? bp). There are in ????
.
If not already done, you can specifically get the BigWig files on your computer running:
You@YourComputer:~$ scp -pr username@core.cluster.france-bioinformatique.fr:/shared/projects/YourProjectName/Methylator/??????, PathTo/WhereYouWantToSave/
Snapshot of BigWig tracks visualized on IGV. TODO
Qualimap is used to check the mapping quality. You’ll find qualimap reports in Results/EXAMPLE/mapping_BOWTIE2/alignmentQC
. Those reports contain a lot of information:
Once again MultiQC aggregates the results of all the samples and you can have a quick overview by looking at Results/EXAMPLE/mapping_BOWTIE2/multiqc/report_mapping_bismark.html
or in the final report (ie ????_report.html
).
Differential methylation results are in ???
.
???
All those plots are included in the final report.
You can check the jobs that are running using squeue
.
[username@clust-slurm-client Methylator]$ squeue --me
The sacct
command gives you information about past and running jobs. The documentation is here. You can get different information with the --format
option. For instance:
[username@clust-slurm-client Methylator]$ sacct --format=JobID,JobName,Start,CPUTime,MaxRSS,ReqMeM,State
JobID JobName Start CPUTime MaxRSS ReqMem State
------------ ---------- ------------------- ---------- ---------- ---------- ----------
...
9875767 BigWig 2020-07-27T16:02:48 00:00:59 80000Mn COMPLETED
9875767.bat+ batch 2020-07-27T16:02:48 00:00:59 87344K 80000Mn COMPLETED
9875768 BigWigR 2020-07-27T16:02:51 00:00:44 80000Mn COMPLETED
9875768.bat+ batch 2020-07-27T16:02:51 00:00:44 85604K 80000Mn COMPLETED
9875769 PCA 2020-07-27T16:02:52 00:01:22 2000Mn COMPLETED
9875769.bat+ batch 2020-07-27T16:02:52 00:01:22 600332K 2000Mn COMPLETED
9875770 multiQC 2020-07-27T16:02:52 00:01:16 2000Mn COMPLETED
9875770.bat+ batch 2020-07-27T16:02:52 00:01:16 117344K 2000Mn COMPLETED
9875773 snakejob 2020-07-27T16:04:35 00:00:42 2000Mn COMPLETED
9875773.bat+ batch 2020-07-27T16:04:35 00:00:42 59360K 2000Mn COMPLETED
9875774 DEA 2020-07-27T16:05:25 00:05:49 2000Mn RUNNING
Here you have the job ID and name, its starting time, its running time, the maximum RAM used, the memory you requested (it has to be higher than MaxRSS, otherwise the job fails, but not much higher to allow the others to use the resource), and job status (failed, completed, running).
Add -S MMDD
to have older jobs (default is today only).
[username@clust-slurm-client Methylator]$ sacct --format=JobID,JobName,Start,CPUTime,MaxRSS,ReqMeM,State -S 0518
If you want to cancel a job: scancel job number
[username@clust-slurm-client Methylator]$ scancel 8016984
Nota: when snakemake is working on a folder, this folder is locked so that you can’t start another DAG and create a big mess. If you cancel the main job, snakemake won’t be able to unlock the folder (see below).
To quickly check if everything went fine, you have to check the main log. If everything went fine you’ll have :
???
If not, you’ll see a summary of the errors:
??????
And you can check the problem looking as the specific log file, here logs/20231104T0921_mapping.txt
???
You can have the description of the error in the SLURM output corresponding to the external jobid, here 13605307:
[username @ clust-slurm-client Methylator]$ cat slurm_output/slurm-13605307.out
If you encounter an error starting gedit
[unsername @ clust-slurm-client 16:04]$ ~ : gedit toto.txt
(gedit:6625): Gtk-WARNING **: cannot open display:
Be sure to include -X
when connecting to the cluster (-X
option is necessary to run graphical applications remotely).
Use :
You@YourComputer:~$ ssh -X unsername@core.cluster.france-bioinformatique.fr
or
You@YourComputer:~$ ssh -X -o "ServerAliveInterval 10" unsername@core.cluster.france-bioinformatique.fr
The option -o "ServerAliveInterval 10"
is facultative, it keeps the connection alive even if you’re not using your shell for a while.
If you don’t get MultiQC report_quality_control.html
report in results/EXAMPLE/fastqc
, you may have some fastq files not fitting the required format:
Please see how to correct a batch of FASTQ files.
I set up the memory necessary for each rule, but it is possible that big datasets induce a memory excess error. In that case the job stops and you get in the corresponding Slurm output something like this:
slurmstepd: error: Job 8430179 exceeded memory limit (10442128 > 10240000), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 8430179 ON cpu-node-13 CANCELLED AT 2020-05-20T09:58:05 ***
Will exit after finishing currently running jobs.
In that case, you can increase the memory request by modifying in workflow/resources.yaml
the mem
entry corresponding to the rule that failed.
[username@clust-slurm-client Methylator]$ cat workflow/resources.yaml
__default__:
mem: 500
name: rnaseq
cpus: 1
qualityControl:
mem: 6000
name: QC
cpus: 2
trim:
mem: 6000
name: trimming
cpus: 8
...
If the rule that failed is not listed here, you can add it respecting the format. And restart your workflow.
When snakemake is working on a folder, this folder is locked so that you can’t start another DAG and create a big mess. If you cancel the main job, snakemake won’t be able to unlock the folder and next time you run Workflow.sh wgbs
, you will get the following error:
Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/shared/mfs/data/projects/awesome/Methylator
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.
In order to remove the lock, run:
[username@clust-slurm-client Methylator]$ sbatch Unlock.sh
Then you can restart your workflow.
Sometimes you may reach the quota you have for your project. To check the quota, run:
[username@clust-slurm-client Methylator]$ lfsgetquota YourProjectName
In principle it should raise an error, but sometimes it doesn’t and it’s hard to find out what is the problem. So if a task fails with no error (typically during mapping), try to make more space (or ask for more space on IFB Community support or iPOP-UP Community support) before trying again.
Workflow.sh wgbs
. It’s easier to find the outputs you’re interested in days/weeks/months/years later.TODO : rename image to methylator.simg ????
If you work on several projects [as defined by cluster documentation], you can either
wgbsflow.simg
from your first project:
[username@clust-slurm-client PROJECT2]$ cd Methylator
[username@clust-slurm-client Methylator]$ ln -s /shared/projects/YourFirstProjectName/Methylator/wgbsflow.simg/ wgbsflow.simg
To save time avoiding typing long commands again and again, you can add aliases to your .bashrc
file (change only the aliases, unless you know what you’re doing).
[username@clust-slurm-client ]$ cat ~/.bashrc
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=
# User specific aliases and functions
alias qq="squeue -u username"
alias sa="sacct --format=JobID,JobName,Start,CPUTime,MaxRSS,ReqMeM,State"
alias ll="ls -lht --color=always"
It will work next time you connect to the server. When you type sa
, you will get the command sacct --format=JobID,JobName,Start,CPUTime,MaxRSS,ReqMeM,State
running.
It is possible to quickly rename all your samples using mv
. For instance if your samples are named with dots instead of underscores and without R
:
[username@clust-slurm-client Raw_fastq]$ ls
D192T27.1.fastq.gz
D192T27.2.fastq.gz
D192T28.1.fastq.gz
D192T28.2.fastq.gz
D192T29.1.fastq.gz
D192T29.2.fastq.gz
D192T30.1.fastq.gz
D192T30.2.fastq.gz
D192T31.1.fastq.gz
D192T31.2.fastq.gz
D192T32.1.fastq.gz
D192T32.2.fastq.gz
You can modify them using mv
and a loop on sample numbers.
[username@clust-slurm-client Raw_fastq]$ for i in `seq 27 32`; do mv D192T$i\.1.fastq.gz D192T$i\_R1.fastq.gz; done
[username@clust-slurm-client Raw_fastq]$ for i in `seq 27 32`; do mv D192T$i\.2.fastq.gz D192T$i\_R2.fastq.gz; done
Now sample names are OK:
[username@clust-slurm-client Raw_fastq]$ ls
D192T27_R1.fastq.gz
D192T27_R2.fastq.gz
D192T28_R1.fastq.gz
D192T28_R2.fastq.gz
D192T29_R1.fastq.gz
D192T29_R2.fastq.gz
D192T30_R1.fastq.gz
D192T30_R2.fastq.gz
D192T31_R1.fastq.gz
D192T31_R2.fastq.gz
D192T32_R1.fastq.gz
D192T32_R2.fastq.gz
BiBs
2024 parisepigenetics
https://github.com/parisepigenetics/bibs |
programming pages theme v0.5.22 (https://github.com/pixeldroid/programming-pages) |
s |
focus search bar ( enter to select, ▲ / ▼ to change selection) |
g c |
go to cluster |
g e |
go to edctools |
g f |
go to facility |
g g |
go to guidelines |
h |
toggle this help ( esc also exits) |