DNA methylation analysis using Methylator

Detailed tutorial

Maintained by BiBs. Last update : 20/06/2023. Methylator v0.1.

This is a BETA version of the workflow, and of this documentation! There is no guarantee!!!

If you use this workflow to analyse your data, don’t forget to acknowledge BiBs in all your communications !

EDC people
“We thank the Bioinformatics and Biostatistics Core Facility, Paris Epigenetics and Cell Fate Center for bioinformatics support.”

External users
“We thank the Bioinformatics and Biostatistics Core Facility, Paris Epigenetics and Cell Fate Center for sharing their analysis workflows.”

Implemented by BiBs-EDC, this workflow for DNA methylation data analysis runs effectively on both IFB and iPOP-UP clusters. If you encounter troubles or need additional tools or features, you can create an issue on the GitHub repository, email directly BiBs, or pass by the 366b room.

Table of content

Your analysis in a nutshell
Resources
Before starting : create an account and a project on your favorite cluster
Methylator installation and description
Adapt the workflow to your cluster
Quick start with the test dataset
Transfer your data
Preparing the run
Running your analysis step by step
Description of the log files
Workflow results
Statisical analysis results
Extra help!

Your analysis in a nutshell

Get an account on a cluster and create a project
Clone Methylator repository
Adapt the workflow to your cluster
Transfer your data to the cluster
Modify metadata.tsv and config_main.yaml
Run the workflow typing sbatch Workflow.sh wgbs
Look at the results

Here is a simplified scheme of the workflow. The main steps are indicated in the blue boxes. Methylator will allow you to choose which steps you want to execute for your project. In the green circles are the input files you have to give for the different steps.

drawing

Resources

iPOP-UP
- Documentation
- Community support
- BiBs practical guide
IFB
- Create and manage your account
- Community support
- Documentation
- Jupyter Hub
- BiBs practical guide
Tools implemented
- FastQC
- MultiQC
- Trim Galore
- Samtools
- deepTools
- Qualimap
- ???

Before starting : create an account and a project on your favorite cluster

Please find instructions using the links below

Methylator installation and description

In order to install Methylator, you have to clone the Methylator GitHub repository to your cluster project.

If you’re using the Jupyter Hub on IFB, you can open a Terminal by clicking on the corresponding icon.

drawing

Before clonning Methylator, go to your project using the cd command.

[username @ cpu-node-12 ]$ cd /shared/projects/YourProjectName

Now you can clone the repository (use -b v0.1 to specify the version).

[username@clust-slurm-client YourProjectName]$ git clone https://github.com/parisepigenetics/Methylator
Cloning into 'Methylator'...
remote: Enumerating objects: 670, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 670 (delta 1), reused 31 (delta 1), pack-reused 628
Receiving objects: 100% (670/670), 999.24 MiB | 32.76 MiB/s, done.
Resolving deltas: 100% (315/315), done.
Checking out files: 100% (84/84), done.

Enter Methylator directory (cd) and look at the files using tree or ls.

[username@clust-slurm-client YourProjectName]$ cd Methylator
[username@clust-slurm-client Methylator]$ tree -L 2
.
????????????????

Methylator is launched as a python script named main_cluster.py which calls the workflow manager named Snakemake. Snakemake will execute rules that are defined in workflow/xxx.rules and distribute the corresponding jobs to the computing nodes via SLURM.

drawing

On the cluster, the main python script is launched via the shell script Workflow.sh, which basically contains only one command python main_cluster.py (+ loading of basic modules and information about the run).

Adapt the workflow to your cluster

A configuration file with the specifity of your cluster is needed. It has to be named configs/cluster_config.yaml. Predefined files are available for IFB and iPOP-UP clusters.

For IFB, you should type:

cp configs/cluster_config_ifb.yaml configs/cluster_config.yaml

For iPOP-UP:

cp configs/cluster_config_ipop.yaml configs/cluster_config.yaml

For other clusters, you have to edit the file yourself…

Quick start with the test dataset

Before running your analyses you can use the test dataset to make and check your installation. First copy the configuration file corresponding to the test.

[username@clust-slurm-client Methylator]$ cp TestDataset/configs/config_main.yaml configs/

Then start the workflow.

[username@clust-slurm-client Methylator]$ sbatch Workflow.sh wgbs

This will run the quality control of the raw FASTQ. See FASTQ quality control for detailed explanations. If everything goes find you will see the results in TestDataset/results/Test1/fastqc. See also how to follow your jobs to know how to check that the run went fine.
You can now move on with your own data, or run the rest of the workflow on the test dataset. To do so you have to modify configs/config_main.yaml turning QC entry from “yes” to “no”. If you don’t know how to do that, see Preparing the run. Then restart the workflow.

[username@clust-slurm-client Methylator]$ sbatch Workflow.sh wgbs

Detailed explanation of the outputs are available in Workflow results.

Transfer your data

If you want to use your own data, you should transfer the FASTQ files into your project folder /shared/projects/YourProjectName before doing your analysis. Alternatively the workflow allows you to download data from SRA simply giving the SRRxxx IDs, see below metadata.tsv.

FASTQ names

The workflow is expecting gzip-compressed FASTQ files with names formatted as

SampleName_R1.fastq.gz and SampleName_R2.fastq.gz for pair-end data,
SampleName.fastq.gz for single-end data.

If your files are not fitting this format, please see how to correct the names of a batch of FASTQ files.

Generate md5sum

It is highly recommended to check the md5sum for big files. If your raw FASTQ files are on your computer in PathTo/MethylProject/Fastq/, you type in a terminal:

You@YourComputer:~$ cd PathTo/MethylProject
You@YourComputer:~/PathTo/MethylProject$ md5sum Fastq/* > Fastq/fastq.md5

Copy to the cluster

You can then copy the Fastq folder to the cluster using rsync, replacing username by your login:

You@YourComputer:~/PathTo/MethylProject$ rsync -avP  Fastq/ username@core.cluster.france-bioinformatique.fr:/shared/projects/YourProjectName/Raw_fastq

In this example the FASTQ files are copied from PathTo/MethylProject/Fastq/ on your computer into a folder named Raw_fastq in your project folder on IFB core cluster. On iPOP-UP cluster, only the address is different:

You@YourComputer:~/PathTo/MethylProject$ rsync -avP  Fastq/ username@ipop-up.rpbs.univ-paris-diderot.fr:/shared/projects/YourProjectName/Raw_fastq

Feel free to name your folders as you want! You will be asked to enter your password, and then the transfer will begin. If it stops before the end, rerun the last command, it will only add the incomplete/missing files.

Check md5sum

After the transfer, connect to the cluster (IFB, iPOP-UP) and check the presence of the files in Raw_fastq using ls or ll command.

[username@clust-slurm-client YourProjectName]$ ll Raw_fastq

Check that the transfer went fine using md5sum.

[username@clust-slurm-client YourProjectName]$ cd Raw_fastq
[username@clust-slurm-client Raw_fastq]$ md5sum -c fastq.md5

Preparing the run

There are 2 files that you have to modify before running your analysis, metadata.tsv and the config yaml file associate withe your sequencing technologies (config_wgbs.yaml or config_nanopore.yaml in the configs folder). The config yaml file is the most important file. It controls the workflow and many tool parameters.

To modify the text files from the terminal you can use vi or nano on iPOP-UP cluster, plus emacs and gedit (the last one being easier to use) on IFB.

Nota
In order to use gedit, be sure that you included -X when connecting to the IFB cluster (-X option is necessary to run graphical applications remotely). See common errors.

Tip
In order to navigate easily in your files with your regular file manager, you can mount your project folder as a disk on your local system. Please follow the instructions for Windows or Linux. This way, you can modify your files directly using any local text editor.

You can also work on your computer and copy the files to the cluster using the scp command or the graphic interface FileZilla.

Caution
Never use word processor (like Microsoft Word or LibreOffice Writer) to modify your code and never copy/past code to/from those softwares. Use only text editors and UTF-8 encoding.

You can find useful help to manage your data on IFB core documentation.

If you’re using the IFB Jupyter Hub, it’s easier as text and table editors are included, you just have to double click on the file you want to edit, modify and save it using the menu File/Save or Ctrl+S.

metadata.tsv

The experimental description is set up in config/metadata.tsv:

[username@clust-slurm-client Methylator]$ cat configs/metadata.tsv 
sample	group
SRR9016926	WT
SRR9016927	1KO
SRR9016928	DKO
SRR11806587	WT
SRR11806588	1KO
SRR11806589	DKO

Important
The columns have to be tab-separated and the header to remain unchanged.

Important
Please, don’t name groups with and underscore

On Jupyter Hub:

drawing

The first column contains the sample names that have to correspond to the FASTQ names (for instance here D197-D192T27_R1.fastq.gz). The second column describes the group the sample belongs to and will be used for differential expression analysis. You can rename or move that file, as long as you adapt the METAFILE entry in config_main.yaml (see below).

Tip
It is also possible to download and use directly SRA data! That’s easy, just enter the SRRxxxx IDs in the first column instead of the sample names!

config_wgbs.yaml

The configuration of the workflow for BSseq (WGBS or RRBS) data (see step by step description below) is done in config/config_wgbs.yaml.

Important
The yaml format is key:[space]value. The space is mandatory.

This configuration file contains 3 parts:

1) Define a project name, paths and shared parameters

[username@clust-slurm-client Methylator]$ cat configs/config_main.yaml

# Project name
PROJECT: Awesome_experience

## paths for intermediate final results
BIGDATAPATH: Big_Data # for big files
RESULTPATH: Results 

# ===================== Configuration for BS-seq data =================== #
# ===================================================================== #
DATATYPE: "WGBS" # "WGBS" (Whole genome) or "RRBS" (Reduced representation) 
METAFILE: /shared/projects/YourProjectName/configs/metadata_wgbs.tsv   # the meta file describing the experiment settings
READSPATH: /shared/projects/YourProjectName/fastq # the path to fastq files
COMPARISON: [["WT","1KO"], ["WT","DKO"], ["1KO","DKO"]]

Here you define where the FASTQ files are stored, where is the file describing the experimental set up, the name and localization of the folders where the results will be saved. The results (detailed in Workflow results) are separated into two folders:

the big files: trimmed FASTQ, bam files are in an specific folder defined at BIGDATAPATH
the small files: QC reports, count tables, BigWig, etc. are in the final result folder defined at RESULTPATH
Examples are given in the configuration file, but you’re free to name and organise them as you want. Be sure to include the full path (starting from /). Here you also precise if your data are paired-end or single-end and the number of CPUs you want to use for your analysis.

Important
For differentially methylation analysis, you need to give the comparison(s) you want to do. If multiple comparisons, specify each pair (CONTROL & TREAT) in order respectively.

2) Steps of the workflow you want to run

# ========================= Control of the workflow ========================= #
# =========================================================================== #

## Do you want to download FASTQ files from public from Sequence Read Archive (SRA) ? 
SRA: no # "yes" or "no". 

## Do you need to do quality control ?
QC: no # "yes" or "no". 

## Do you need to do trimming ?
TRIMMED: no # "yes" or "no"

## Do you need to do mapping ?
MAPPING: no # "yes" or "no"

## Do you want to do the globale exploration ?
EXPLORATION: yes # "yes" or "no"

## Do you need to do differentially methylation analysis ?
DIFFERENTIAL: yes # "yes" or "no"

## Do you need a final report ?
REPORT: no # "yes" or "no"

Important
if QC or SRA is set to yes, the workflow will stop after the QC to let you decide whether you want to trim your raw data or not. In order to run the rest of the workflow, you have to set both QC and SRA to no.

Tip
We recommend that you perform the exploration before starting the differential analysis. Exploration ensures that samples are sufficiently distinct for differential analysis to be useful. It also enables you to refine the choice of certain parameters appropriate to your data (minimum coverage, unification, etc.).

3) Configuration of the specific tools

Here you precise parameters that are specific to one of the steps of the workflow. See detailed description in step by step analysis.

config_nanopore.yaml

The configuration of the workflow for nanopore data.

Important
The yaml format is key:[space]value. The space is mandatory.

Like the config_wgbs this configuration file contains 3 parts:

1) Define a project name, paths and shared parameters

# Project name
PROJECT: RRMS

## paths for intermediate final results
BIGDATAPATH: Big_Data # for big files
RESULTPATH: Results

## genome files
GENOMEPATH: /shared/projects/edc_nanopore/mm39/mm39.fa  # path to the reference genome's folder 

## genome
ASSEMBLY: mm39 # mm10 name of the assembly used for the analysis (now use mm39, new version)

# ===================== Configuration for Nanopore data ==================== #
# ========================================================================== #
DATATYPE: NANOPORE # do not touch!
METAFILE: configs/metadata_rrms.tsv
NANO_BAM_PATH: /shared/projects/edc_nanopore/BAM_files/
COMPARISON: [["WT","NP95"]] 
# Did you select specific regions during sequencing?
RRMS: yes 
BED_RRMS: /shared/projects/edc_nanopore/rrms_mm39_v2_corr_end.bed
5HMC: yes

Important
if you have perform a RRMS (selection of specific regions during the sequencing), don’t forget to put the path to the BED file used for selection below.

Unlike BSseq sequencing, in nanopore it’s possible to differentiate between 5mc and 5hmc methylation. If you have basecalled FAST5 files including 5hmC detection, you can explore this mark.

2) Steps of the workflow you want to run

# ========================= Control of the workflow ========================= #
# =========================================================================== #

## Do you want to do the globale exploration ?
EXPLORATION: yes # "yes" or "no"

## Do you need to do differentially methylation analysis ?
DIFFERENTIAL: yes # "yes" or "no"

With nanopore data, the first workflow steps are not necessary. Only the statistical analysis is common to both.

3) Configuration of the specific tools

Running your analysis step by step

When the configuration files are ready, you can start the run by sbatch Workflow.sh wgbs or sbatch Workflow.sh nanopore. Specifying the data type allows the workflow to directly select the correct configuration file.

[username@clust-slurm-client Methylator]$ sbatch Workflow.sh wgbs

[username@clust-slurm-client Methylator]$ sbatch Workflow.sh nanopore

Please see below detailed explanation.

FASTQ quality control (eventually after SRA data retrieval)

Prerequisite:

Using your own data: your FASTQ files are on the cluster, in our example in /shared/projects/YourProjectName/Raw_fastq (but you can name your folders as you want, as long as you adjust the READSPATH parameter in config_main.yaml).
You have modified config/metadata.tsv according to your experimental design (with sample names or SRR identifiers).

Now you have to check in config/config_main.yaml that:

you gave a project name

# Project name
PROJECT: EXAMPLE

In Control of the workflow, QC or SRA is set to yes:

If you want to download data from SRA, set SRA to yes. The QC will follow automatically. If you use your own data, put QC to yes and SRA to no.

## Do you want to download FASTQ files from public from Sequence Read Archive (SRA) ? 
SRA: no  # "yes" or "no". If set to "yes", the workflow will stop after the QC to let you decide whether you want to trim your raw data or not. In order to run the rest of the workflow, you have to set it to "no".

## Do you need to do quality control?
QC: yes  # "yes" or "no". If set to "yes", the workflow will stop after the QC to let you decide whether you want to trim your raw data or not. In order to run the rest of the workflow, you have to set it to "no".

The rest of the part Control of the workflow will be ignored. The software will stop after the QC to give you the opportunity to decide if trimming is necessary or not.

The shared parameters are correct (paths to the FASTQ files, metadata.tsv, result folders, single or paired-end data).

## the path to fastq files
READSPATH: /shared/projects/YourProjectName/Raw_fastq

## the meta file describing the experiment settings
METAFILE: /shared/projects/YourProjectName/configs/metadata.tsv

## paths for intermediate and final results
BIGDATAPATH: /shared/projects/YourProjectName/data # for big files
RESULTPATH: /shared/projects/YourProjectName/results

When this is done, you can start the QC by running:

[username@clust-slurm-client Methylator]$ sbatch Workflow.sh wgbs

You can check if your job is running using squeue.

[username@clust-slurm-client Methylator]$ squeue --me

You should also check SLURM output files. See Description of the log files.

FastQC results

If everything goes fine, fastQC results will be in results/EXAMPLE/fastqc/. For every sample you will have something like:

[username@clust-slurm-client Methylator]$ ll results/EXAMPLE/fastqc
total 38537
-rw-rw----+ 1 username username  640952 May 11 15:16 Sample1_forward_fastqc.html
-rw-rw----+ 1 username username  867795 May 11 15:06 Sample1_forward_fastqc.zip
-rw-rw----+ 1 username username  645532 May 11 15:16 Sample1_reverse_fastqc.html
-rw-rw----+ 1 username username  871080 May 11 15:16 Sample1_reverse_fastqc.zip

Those are individual fastQC reports. MultiQC is called after FastQC, so you will also find report_quality_control.html that is a summary for all the samples. You can copy those reports to your computer by typing (in a new local terminal):

You@YourComputer:~$ scp -pr username@core.cluster.france-bioinformatique.fr:/shared/projects/YourEXAMPLE/Methylator/results/EXAMPLE/fastqc PathTo/WhereYouWantToSave/

or look at them directly in the Jupyter Hub.
It’s time to decide if how much trimming you need. Trimming is generally necessary with WGBS or RRBS data.

Satisfactory data quality?
In principle you can now run all the rest of the pipeline at once. To do so you have set SRA and QC to “no” and to configure the other parts of config_main.yaml.

Trimming

If you put TRIMMED: no, there will be no trimming and the original FASTQ sequences will be mapped.

If you put TRIMMED: yes, Trim Galore will remove low quality and very short reads, and cut the adapters. If you also want to remove a fixed number of bases in 5’ or 3’, you have to configure it. For instance if you want to remove the first 10 bases:

# ================== Configuration for trimming ==================

## Number of trimmed bases
## put "no" for TRIM3 and TRIM5 if you don't want to trim a fixed number of bases.
TRIM5: 10 #  integer or "no", remove N bp from the 5' end of reads. This may be useful if the qualities were very poor, or if there is some sort of unwanted bias at the 5' end. 
TRIM3: no # integer or "no", remove N bp from the 3' end of reads AFTER adapter/quality trimming has been performed. 

Mapping

At this step you have to provide the path to your genome index as well as to a GTF annotation file and a BED file with CpG island coordinates.

Use common banks!
Some reference files are shared between cluster users. Before downloading a new reference, check what is available at /shared/bank/ (IFB) or /shared/banks/ (iPOP-UP).

[username@clust-slurm-client ~]$ tree -L 2 /shared/bank/homo_sapiens/
/shared/bank/homo_sapiens/
├── GRCh37
│   ├── bowtie2
│   ├── fasta
│   ├── gff
│   ├── star -> star-2.7.2b
│   ├── star-2.6
│   └── star-2.7.2b
├── GRCh38
│   ├── bwa
│   ├── fasta
│   ├── gff
│   ├── star -> star-2.6.1a
│   └── star-2.6.1a
├── hg19
│   ├── bowtie
│   ├── bowtie2
│   ├── bwa
│   ├── fasta
│   ├── gff
│   ├── hisat2
│   ├── picard
│   ├── star -> star-2.7.2b
│   ├── star-2.6
│   └── star-2.7.2b
└── hg38
    ├── bowtie2
    ├── fasta
    ├── star -> star-2.7.2b
    ├── star-2.6
    └── star-2.7.2b

30 directories, 0 files

If you don’t find what you need, you can ask for it on IFB or iPOP-UP community support. In case you don’t have a quick answer, you can download (for instance here) or produce the indexes you need in your folder (and remove it when it’s available in the common banks). Copy the path to the file you need and then paste the link to wget. When downloading is over, you might have to decompress the file.

[username@clust-slurm-client Methylator]$ wget ???????
[username@clust-slurm-client index]$ tar -zxvf ???? 

GTF files can be downloaded from GenCode (mouse and human), ENSEMBL, NCBI (RefSeq, help here), … Similarly you can download them to the server using wget.
CpG Islands files can be downloaded from UCSC goldenPath. For obtain the bed file, you can use this command (example with hg38):

 wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cpgIslandExt.txt.gz \
   | gunzip -c \
   | awk 'BEGIN{ OFS="\t"; }{ print $2, $3, $4, $5$6, $7, $8, $9, $10, $11, $12 }' \
   | sort-bed - \
   > cpgIslandExt.hg38.bed

Fill common banks!
Don’t forget to give the links to the new references you made/downloaded to IFB or to iPOP-UP support so that they can add them to the common banks.

Be sure you give the right path to those files and adjust the other settings to your need:

## genome
ASSEMBLY: hg38 # short name of the assembly used for the analysis

## Library type
LIBRARY_TYPE_OPT: non_directional   # "directional", "non_directional" or "pbat" sequencing library 

## params in Bismark alignment (--score_min {function},{val1},{val2}). 
# More information about Bismark options: https://felixkrueger.github.io/Bismark/options/alignment/

TYPEFUNC: L # 'L' or 'G' for linear or logarithmic fonction
VARFUNCT1: 0 # first value for the function 0 is default
VARFUNCT2: -0.6 # second value for the function -0.2 is default
MISMATCH: 0 # nomber of mismatch wanted
ADVANCE_OPT: # for advanced bismark user (mapping only), IMPLEMENTED IN THE CODE BUT NOT TESTED

Développer les paramètres pour bismark !!

For an easy visualisation on a genome browser, BigWig files are generated. The path to this file is : /Results/YOUR_PROJECT/mapping_BOWTIE2/BigWig/
Two types of BigWig are generated for each sample, a file with the methylation percentage for each CpG, and a file with the read coverage on each CpG

BigWig/{sample}_methExtract.bw
BigWig/{sample}_cov.bw

You can open and view these files with IGV software.

As at the moment the default project quota in 250 Go you might be exceeding the space you have (and may or may not get error messages…). So if the mapping fails, try removing files to get more space, or ask to increase your quota on IFB Community support or iPOP-UP Community support. To see the space you have you can run:

[username@clust-slurm-client Methylator]$ du -h --max-depth=1 /shared/projects/YourProjectName/

and

[username@clust-slurm-client Methylator]$ lfsgetquota YourProjectName

Process BAM to BED (only for nanopore)

If you’re launching the workflow with nanopore data, the first step is to convert the BAM files to BED using the modbam2bed tool (recommended by oxford nanopore technologie). If you already have BED files from modbam2bed, you can enter the path to the folder containing them. Please note that the BED must be named Sample_Merged.cpg.bed.

# # ===================== Configuration for process BAM files  ================== #
# =========================================================================== #

STARTFROMBED: no # put no if you start from BAM files

Statistical analysis

In this step, you select the global parameters for statistical analysis.

LEVEL: CpG # "CpG" or "Tiles" study by Tiles or by CpG 
TILESIZE: 250 
STEPSIZE: 1  # Tiles relative step size
NB_CPG_TILES: 1 # minimal number of CpG to keep a tile in the analysis

If Tiles was selected, define tile size, step size, and minimal number of CpG by tile.
STEPSIZE corresponds to the relative size of the gap between two tiles. If set to 1, tiles are non-overlapping; if set to 0.5, tiles overlap by half.

Exploratory analysis

# ===== Exploratory analysis ===== #
## params 
MINCOV: 5  # int, minimum coverage depth for the analysis
COV.PERC: 99.9 # to the coverage filter, choose the percentile for remove top ..% 
MINQUALI: 20  # int, minimum quality to keep a CPG for the analysis
DESTRAND: yes # combine information from F et R reads
UNITE: all # 'all' or 'one' (at least one per group) or (2,3,4 ... per group)

COV.PERC corresponds to the percentage of the most covered CpG or tiles to be retained. For example, if COV.PERC = 99.9, only the top 1% are deleted. This parameter eliminates illumina sequencing bias (in repeated regions, for example). Caution: in the case of RRBS, set this argument to 100 to avoid deleting intentionally enriched regions.

UNITE allows you to choose how you unify your replicates. By default (all), a CpG must be present in each replicate for it to be taken into account. Otherwise, specify the minimum number of times it must appear to be retained. For example, if you have three replicates per condition and you select 2, it is sufficient for a CpG to be present in two of the three replicates for it to be retained. If you select a number that is not compatible with your replicates, differential analysis is performed with ‘all’ by default.

Differential methylation analysis

Finally you have to set the parameters for the differential methylation analysis. You have to define the comparisons you want to do (pairs of conditions).

COMPARISON : [["WT","1KO"], ["WT","DKO"], ["1KO","DKO"]] 

And set your analysis parameters :

# ===== Differential analysis ===== #

DMR: yes # if yes, run DMR

## params (CPG or TILES)
LIST_SIGNIDIF: [10, 20, 25, 30] # SigDiffMeth en %, used in MKit_diff_bed.R
LIST_QVALUE : [0.001, 0.01, 0.05]  # used in MKit_diff_bed.R
QVALUE: 0.05  # QValue (used in Mkit_differential.Rmd)
SIGNIDIF: 10

## params (only DMR)
DMR_TYPE: regions  # regions or blocks, by default regions 
MIN_CPG: 5  # minimum number of CpGs to consider for a candidate DMR, by default 5, minimum 3
MAX_GAP: 1000 # maximum number of bp in between neighboring CpGs to be included in the same DMR, by default 1000
CUTOFF: 0.1 # cutoff of the single CpG methylation difference that is used to discover candidate DMR. by default 0.1
FDR: 0.05 # QVALUE for select significant DMR

LIST_SIGNIDIF: list of minimum percentage threshold values to consider that a CpG or a tile is differentially methylated under two distinct biological conditions.
LIST_QVALUE: list of q-values thresholds for considering a Cpg or a tile to be differentially methylated as being significantly different.
LIST_SIGNIDIF and LIST_QVALUE are used to generate bedgraphs. For each difference threshold and for each q-value threshold, a bedgraph is generated. Help with choosing SIGNIDIF and QVALUE parameters for differential analysis.

If DMR is set to YES, you perform a DMR analysis with the same comparison peers as in CpG or Tiles.
The default parameters are designed to focus on local DMRs (call regions), generally in the range of hundreds to thousands of bp. If you choose blocks , the range increase on the order of hundreds of thousands to millions of bp. In this case, it’s advised to decrease the CUTOFF. Blocks are designed to analyze cancer data.

Please note that the package used to infer DMR was designed for WGBS analysis. If you wish to perform an RRBS analysis, you can try removing smoothing, or modifying certain parameters: ?

Start the workflow

When the configuration files are fully adapted to your experimental set up and needs, you can start the workflow by running:

[username@clust-slurm-client Methylator]$ sbatch Workflow.sh wgbs

Description of the log files

The first job is the main script. This job will call one or several snakefiles (.rules files) that define small workflows of the individual steps. There are SLURM outputs at the 3 levels.

Main script
Snakefiles
Individual tasks

The configuration of the run and the timing are also saved:

Configuration and timing

Where to find those outputs and what do they contain?

Main script

The output is in slurm_output and named Methylator-xxx.out (default) or in the specified folder if you modified Workflow.sh. A copy is also available in the logs folder. It contains global information about your run. Typically the main job output looks like :

[username@clust-slurm-client Methylator]$ cat slurm_output/Methylator-???.out 

??? 

You can see at the end if this file if an error occured during the run. See Errors.

Snakefiles

There are 6 snakefiles (visible in the workflow folder) that correspond to the different steps of the analysis:

fastq_dump_QC.rules (QC)
trim.rules (reads trimming/filtering)
mapping.rules (mapping)
exploration.rules (statistical analysis)
differential.rules (statistical analysis)
report.rules (build the final report)

The SLURM outputs of those different steps are stored in the logs folder and named as the date plus the corresponding snakefile: for instance 20230615T1540_trim.txt or 20230615T1540_mapping.txt.

Here is a description of one of those files (splitted):

Building the DAG (directed acyclic graph): Define the jobs that will be launched and in which order.

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 30

Start the first job (or jobs if there are several independant jobs). The rule is indicated, with the expected outputs. For the first steps one job is started per sample.

????

You have here the corresponding job ID. You can follow that particular job in slurm-7908074.out.

End of that job, start of the next one:

[Tue May 12 17:48:30 2020]
Finished job 4.
1 of 5 steps (20%) done

[Tue May 12 17:48:30 2020]
rule trimstart:
    input: /shared/projects/lxactko_analyse/RASflow/data/LXACT_1-test/trim/reads/Test_forward.fastq.gz, /shared/projects/lxactko_analyse/RASflow/data/LXACT_1-test/trim/reads/Test_reverse.fastq.gz
    output: /shared/projects/lxactko_analyse/RASflow/data/LXACT_1-test/trim/Test_forward.91bp_3prime.fq.gz, /shared/projects/lxactko_analyse/RASflow/data/LXACT_1-test/trim/Test_reverse.91bp_3prime.fq.gz
    jobid: 3
    wildcards: sample=Test

Submitted DRMAA job 3 with external jobid 7908075.

At the end of the job, it removes temporary files if any:

Removing temporary output file /shared/projects/lxactko_analyse/RASflow/data/LXACT_1-test/trim/reads/Test_forward.fastq.gz.
Removing temporary output file /shared/projects/lxactko_analyse/RASflow/data/LXACT_1-test/trim/reads/Test_reverse.fastq.gz.
[Tue May 12 17:49:10 2020]
Finished job 3.
2 of 5 steps (40%) done

And so on… Finally:

[Tue May 12 17:51:10 2020]
Finished job 0.
5 of 5 steps (100%) done
Complete log: /shared/mfs/data/projects/lxactko_analyse/RASflow/.snakemake/log/2020-05-12T174816.622252.snakemake.log

Individual tasks

Every job generate a slurm-JOBID.out file. It is localised in the working directory as long as the workflow is running. It is then moved to the slurm_output folder. SLURM output specifies the rule, the sample (or samples) involved, and gives outputs specific to the tool:

[username@clust-slurm-client Methylator]$ cat slurm_output/slurm-4455334.out

...

[Mon Nov 27 09:14:54 2023]
rule DMR:
    input: Big_Data/Test_CD/Methylator_WGBS/Rdata/Prep_SRR1104848_5mC.RData, Big_Data/Test_CD/Methylator_WGBS/Rdata/Prep_SRR1104855_5mC.RData, Big_Data/Test_CD/Methylator_WGBS/Rdata/Prep_SRR1067571_5mC.RData, Big_Data/Test_CD/Methylator_WGBS/Rdata/Prep_SRR1067573_5mC.RData, Big_Data/Test_CD/Methylator_WGBS/Rdata/Prep_SRR1104851_5mC.RData, Big_Data/Test_CD/Methylator_WGBS/Rdata/Prep_SRR1104862_5mC.RData, Big_Data/Test_CD/Annotation/Annotatr.RData
    output: Big_Data/Test_CD/Methylator_WGBS/Rdata/CpG_mincov5/CD14_CD56_DMR.RData
    jobid: 0
    reason: Missing output files: Big_Data/Test_CD/Methylator_WGBS/Rdata/CpG_mincov5/CD14_CD56_DMR.RData
    wildcards: comparison=CD14_CD56
    resources: mem_mb=3399, mem_mib=3242, disk_mb=3399, disk_mib=3242, tmpdir=/tmp

Configuration and timing

Three extra files can be found in the logs folder:

A log file named 20230526T1623_running_time.txt stores running times.

[username@clust-slurm-client Methylator]$ cat logs/20230526T1623_running_time.txt 

Project name: Test
Start time: Fri May 26 16:23:34 2023
Time of running trimming: 0:00:53
Time of running genome alignment: 0:02:55
Finish time: Fri May 26 16:30:09 2023

A log file named 20230526T1623_configuration.txt keeps a track of the configuration of the run (config_main.yaml followed by metadata.tsv, the environment contained in the Apptainer image with all tool versions, and resource configuration workflow/cluster.yaml) ?????A changer en resources.yaml ???????

[username@clust-slurm-client Methylator]$: cat logs/20230526T1623_configuration.txt 
 
... 
## aligner
ALIGNER: BOWTIE2 # "BOWTIE2"

## genome files
GENOMEPATH: /shared/banks/genomes/homo_sapiens/hg38/fasta/hg38.fa #TestDataset/my_bank/mm39_chr19_mini.fa #path to the reference genome's folder 

## genome
ASSEMBLY: hg38 # short name of the assembly used for the analysis
...

==========================================

SAMPLE PLAN
sample	group
SRR1104848	CD14
SRR1104855	CD14
SRR1067571	CD56
SRR1067573	CD56
SRR1104851	CD56d
SRR1104862	CD56d

==========================================

SINGULARITY IMAGE - yaml file 

name: wgbsflow
channels:
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1
  - _openmp_mutex=4.5
  - _r-mutex=1.0.1
  - aioeasywebdav=2.4.0
  - aiohttp=3.8.3
  - aiosignal=1.3.1
  - alsa-lib=1.2.8
  - amply=0.1.5
  - appdirs=1.4.4
  - async-timeout=4.0.2
  - asynctest=0.13.0
  ...


==========================================

CLUSTER

### The name visible in squeue is 8 characters long. mem unit is Mo.  

__default__:
  mem: 500
  name: snakejob
  cpus: 1

fastqDump:
  mem: 10000
  name: fastqDump
  cpus: 4
...

A log file named 20200925T1057_free_disk.txt stores the disk usage during the run (every minute, the remaining space is measured).
```
# quota:1 limit:1.5
time	free_disk
20210726T1611	661
20210726T1612	660
20210726T1613	660
20210726T1614	660
20210726T1615	660
20210726T1616	660
```
If a run stops with no error, it can be that you don’t have enough space in your project. You can see in this file that the free disk tends to 0.

Workflow results

The results are separated into two folders :

the big files : trimmed FASTQ, BAM files, RData objects are in the data folder defined in configs/config_main.yaml at BIGDATAPATH

## paths for intermediate and final results
BIGDATAPATH: /shared/projects/YourProjectName/Methylator/Big_Data # for big files

[username@clust-slurm-client Methylator]$ tree -L 2 Big_Data/EXAMPLE/
Big_Data/EXAMPLE/
.
├── Annotation
│   └── Annotatr.RData
├── hg38.chrom_sizes
├── mapping_BOWTIE2
│   ├── bam_byName
│   ├── bismark_genome
│   ├── chrom.sizes
│   ├── Deduplicate
│   └── Meth_Extractor
├── Methylator_WGBS
│   └── Rdata
└── trimmed
    ├── SRR1067571_R1_clean_fastqc.html
    ├── SRR1067571_R1_clean_fastqc.zip
    ├── SRR1067571_R1_clean.fastq.gz
    ├── SRR1067571_R1.fastq.gz_trimming_report.txt
    ├── SRR1067571_R2_clean_fastqc.html
    ├── SRR1067571_R2_clean_fastqc.zip
    ├── SRR1067571_R2_clean.fastq.gz
    ├── SRR1067571_R2.fastq.gz_trimming_report.txt
    ...

the small files: QC reports, BigWig, Analysis reports, etc. are in the final result folder defined in configs/config_main.yaml at RESULTPATH

RESULTPATH: /shared/projects/YourProjectName/Methylator/Results

[username@clust-slurm-client Methylator]$ tree -L 2 Results/EXAMPLE/
.
├── logs
│   ├── 20231120T1447_configuration.txt
│   ├── 20231120T1447_free_disk.txt
│   ├── 20231120T1447_mapping.txt
│   ├── 20231120T1447_running_time.txt
│   ├── 20231120T1447_trim_report.txt
│   ├── 20231120T1447_trim.txt
|   |    ... 
├── mapping_BOWTIE2
│   ├── alignmentQC
│   ├── BigWig
│   ├── bismark_summary_report.html
│   ├── bismark_summary_report.txt
│   ├── mapping_report_SRR11806587_sub500000_chr19.html
│   ├── mapping_report_SRR11806588_sub500000_chr19.html
|   |   ... 
│   └── multiqc
├── Methylator_WGBS
│   ├── Annotatr.log
│   └── CpG_mincov5
└── trimmed
    ├── fastqc_trimming
    └── trimming

This way you can get all the results on your computer by running (from your computer):

You@YourComputer:~$ scp -pr username@core.cluster.france-bioinformatique.fr:/shared/projects/YourProjectName/Methylator/results/EXAMPLE/ PathTo/WhereYouWantToSave/

and the huge files will stay on the server. You can of course download them as well if you have space (and this is recommended for the long term).

Final report

A report named as 20210727T1030_report.html summarizes your experiment and your results. You’ll find links to fastQC results, to mapping quality report, to exploratory analysis of all the samples and finally to pairwise differential expression analyses.

A compressed archive named 20210727T1030_report.tar.bz2 is also generated and contains the report and the targets of the different links, excluding big files to make it small enough to be sent to your collaborators.

[username@clust-slurm-client Methylator]$ tree -L 2 Results/EXAMPLE/20210727T1030_report 
.
├── 1KO_DKO_DMR.html
├── 1KO_DKO.html
├── bismark_summary_report.html
├── Exploration_5mC.html
├── report_mapping_bismark.html
├── report_quality_control_after_trimming.html
├── WT_1KO_DMR.html
├── WT_1KO.html
├── WT_DKO_DMR.html
└── WT_DKO.html

Tip for Windows users
Unlike Linux and Mac, the tar.bz2 format is not natively supported by Windows, but you can use the free PeaZip or 7-zip softwares to decompress the xxx_report.tar.bz2 archive.

An example of report is visible here.

Detailed description of all the outputs of the workflow is included below.

Trimmed reads

After trimming, the FASTQ are stored in the data folder defined in configs/config_main.yaml at BIGDATAPATH:.

In this examples the trim FASTQ files will be stored in /shared/projects/YourProjectName/Methylator/data/EXAMPLE/trim/. They are named

Sample1_R1_val_1.fq
Sample1_R2_val_2.fq

Trimming report

In results/EXAMPLE/trimming you’ll find trimming reports such as Sample1_forward.fastq.gz_trimming_report.txt for each samples. You’ll find information about the tools and parameters, as well as trimming statistics:

SUMMARISING RUN PARAMETERS
==========================
Input filename: /shared/projects/wgbs_flow/Nanopore_data/ONT_data/rrms_2022.07/bisulfite/raw_data/COLO1_R1.fq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 4.2
Python version: could not detect
Number of cores used for trimming: 4
Quality Phred score cutoff: 28
Quality encoding type selected: ASCII+33
Using Illumina adapter for trimming (count: 37651). Second best hit was Nextera (count: 0)
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
All Read 1 sequences will be trimmed by 10 bp from their 5' end to avoid poor qualities or biases
All Read 2 sequences will be trimmed by 10 bp from their 5' end to avoid poor qualities or biases (e.g. M-bias for BS-Seq applications)
All Read 1 sequences will be trimmed by 9 bp from their 3' end to avoid poor qualities or biases
All Read 2 sequences will be trimmed by 9 bp from their 3' end to avoid poor qualities or biases
Running FastQC on the data once trimming has completed
Output file will be GZIP compressed


This is cutadapt 4.2 with Python 3.7.12
Command line parameters: -j 4 -e 0.1 -q 28 -O 1 -a AGATCGGAAGAGC /shared/projects/wgbs_flow/Nanopore_data/ONT_data/rrms_2022.07/bisulfite/raw_data/COLO1_R1.fq.gz
Processing single-end reads on 4 cores ...
Finished in 148.610 s (3.960 µs/read; 15.15 M reads/minute).

=== Summary ===

Total reads processed:              37,527,910
Reads with adapters:                19,815,625 (52.8%)
Reads written (passing filters):    37,527,910 (100.0%)

Total basepairs processed: 1,876,395,500 bp
Quality-trimmed:              11,610,210 bp (0.6%)
Total written (filtered):  1,798,598,760 bp (95.9%)

=== Adapter 1 ===

Sequence: AGATCGGAAGAGC; Type: regular 3'; Length: 13; Trimmed: 19815625 times

Minimum overlap: 1
No. of allowed errors:
1-9 bp: 0; 10-13 bp: 1

Bases preceding removed adapters:
  A: 49.8%
  C: 24.1%
  G: 13.9%
  T: 12.2%
  none/other: 0.0%

Overview of removed sequences
length	count	expect	max.err	error counts
1	14655096	9381977.5	0	14655096
2	252919	2345494.4	0	252919
3	408890	586373.6	0	408890
[...]

RUN STATISTICS FOR INPUT FILE: /shared/projects/wgbs_flow/Nanopore_data/ONT_data/rrms_2022.07/bisulfite/raw_data/COLO1_R1.fq.gz
=============================================
37527910 sequences processed in total

This information is summarized in the MultiQC report, see below.

FastQC of trimmed reads

After the trimming, fastQC is automatically run on the new FASTQ and the results are also in the folder results/EXAMPLE/fastqc_trimming/:

Sample1_R1_trimmed_fastqc.html
Sample1_R1_trimmed_fastqc.zip
Sample1_R2_trimmed_fastqc.html
Sample1_R2_trimmed_fastqc.zip

As previously MultiQC gives a summary for all the samples : results/EXAMPLE/fastqc_trimming/report_quality_control_after_trimming.html. You’ll find information from the trimming report (for instance you can rapidly see the % of trim reads for the different samples) as well as from fastQC. It is included in the final report (ie ????.html).

Mapped reads

The mapped reads are stored as deduplicated, sorted bam in the data folder, in our example in Big_Data/EXAMPLE/mapping_BOWTIE2/Deduplicate/, together with their .bai index. They can be visualized using a genome browser such as IGV but this is not very convenient as the files are heavy. BigWig files, that summarize the information converting the individual read positions into a number of reads per bin of a given size, are more adapted.

BigWig

To facilitate visualization on a genome browser, BigWig files are generated (window size of xx ?? bp). There are in ???? .

If not already done, you can specifically get the BigWig files on your computer running:

You@YourComputer:~$ scp -pr username@core.cluster.france-bioinformatique.fr:/shared/projects/YourProjectName/Methylator/??????, PathTo/WhereYouWantToSave/

Snapshot of BigWig tracks visualized on IGV. TODO

drawing

Mapping QC

Qualimap is used to check the mapping quality. You’ll find qualimap reports in Results/EXAMPLE/mapping_BOWTIE2/alignmentQC. Those reports contain a lot of information:

information about the mapper
number and % of mapped reads/pairs
number of indels and mismatches
coverage per chromosome
insert size histogram
…

Once again MultiQC aggregates the results of all the samples and you can have a quick overview by looking at Results/EXAMPLE/mapping_BOWTIE2/multiqc/report_mapping_bismark.html or in the final report (ie ????_report.html).

Statisical analysis results

All files of the staticticals analysis are in Results/EXAMPLE/Methylator_DATATYPE/LEVEL_mincovMINCOV/
DATATYPE corresponding to the type of data ( : WGBS, RRBS or NANOPORE).
LEVEL corresponding to the level of analysis (per base or tiles). And MINCOV corresponding at the minimum of coverage choose for retain CpG. So, with the same samples, if you change this parameters to perform a new analysis a new folder is created for don’t overwrite laste files.

Exploratory methylation analysis results

Exploratory methylation results are in Results/EXAMPLE/Methylator_DATATYPE/{LEVEL}_mincov{MINCOV}/exploratory/ This folder contain :

a HTML report file
a folder Exploration_5mC_files/figure-html with all figures generated by the report in .PNG format.

Differential methylation (DMC/DMT) results

Differential methylation results are in Results/EXAMPLE/Methylator_DATATYPE/{LEVEL}_mincov{MINCOV}/differential/
This folder contain :

a HTML report file for each comparison (indicate in the config file)
a bed folder for each comparisons with :
- a bed file for all DMC/DMT
- a bed file for significants DMC or DMT
- a bed file for hyper-methylated significants DMC or DMT
- a bed file for hypo-methylated significants DMCor DMT

Please note! Each bed is generated for each combination of values in the LIST_DIFF and LIST_QV lists. So if the LIST_DIFF list has 4 threshold values and the LIST_QV list has 4 threshold values, we’ll generate 16 x bed files for each condition, i.e. 16x4x3 = 192 bedgraphs !!!

???

Differential methylation (DMR only) results

Differential methylation results are in Results/EXAMPLE/Methylator_DATATYPE/LEVEL_mincovMINCOV/DMR/
This folder contain for each comparaison a folder with :

a HTML report file (indicate in the config file)
a CSV file with all DMRs detected
a CSV file with only significant DMRs
a CSV file with only significant DMRS associate with genes annotations

The DMR gene file can be a starting point for Gene Set Enrichment Analysis (GSEA). This feature may later be added directly to the workflow.
Care should be taken, however, as the presence of a DMR in a gene does not necessarily mean that the gene is over- or underexpressed. Additional analyses (e.g. RNAseq) are required.

All this HTML files are included in the final report.

Extra help!

How to follow your jobs

Running jobs

You can check the jobs that are running using squeue.

[username@clust-slurm-client Methylator]$ squeue --me

Information about past jobs

The sacct command gives you information about past and running jobs. The documentation is here. You can get different information with the --format option. For instance:

[username@clust-slurm-client Methylator]$ sacct --format=JobID,JobName,Start,CPUTime,MaxRSS,ReqMeM,State
       JobID    JobName               Start    CPUTime     MaxRSS     ReqMem      State 
------------ ---------- ------------------- ---------- ---------- ---------- ---------- 
...
9875767          BigWig 2020-07-27T16:02:48   00:00:59               80000Mn  COMPLETED 
9875767.bat+      batch 2020-07-27T16:02:48   00:00:59     87344K    80000Mn  COMPLETED 
9875768         BigWigR 2020-07-27T16:02:51   00:00:44               80000Mn  COMPLETED 
9875768.bat+      batch 2020-07-27T16:02:51   00:00:44     85604K    80000Mn  COMPLETED 
9875769             PCA 2020-07-27T16:02:52   00:01:22                2000Mn  COMPLETED 
9875769.bat+      batch 2020-07-27T16:02:52   00:01:22    600332K     2000Mn  COMPLETED 
9875770         multiQC 2020-07-27T16:02:52   00:01:16                2000Mn  COMPLETED 
9875770.bat+      batch 2020-07-27T16:02:52   00:01:16    117344K     2000Mn  COMPLETED 
9875773        snakejob 2020-07-27T16:04:35   00:00:42                2000Mn  COMPLETED 
9875773.bat+      batch 2020-07-27T16:04:35   00:00:42     59360K     2000Mn  COMPLETED 
9875774             DEA 2020-07-27T16:05:25   00:05:49                2000Mn    RUNNING 

Here you have the job ID and name, its starting time, its running time, the maximum RAM used, the memory you requested (it has to be higher than MaxRSS, otherwise the job fails, but not much higher to allow the others to use the resource), and job status (failed, completed, running).

Add -S MMDD to have older jobs (default is today only).

[username@clust-slurm-client Methylator]$ sacct --format=JobID,JobName,Start,CPUTime,MaxRSS,ReqMeM,State -S 0518

Cancelling a job

If you want to cancel a job: scancel job number

[username@clust-slurm-client Methylator]$ scancel 8016984

Nota: when snakemake is working on a folder, this folder is locked so that you can’t start another DAG and create a big mess. If you cancel the main job, snakemake won’t be able to unlock the folder (see below).

Having errors?

To quickly check if everything went fine, you have to check the main log. If everything went fine you’ll have :

???

If not, you’ll see a summary of the errors:

??????

And you can check the problem looking as the specific log file, here logs/20231104T0921_mapping.txt

???

You can have the description of the error in the SLURM output corresponding to the external jobid, here 13605307:

[username @ clust-slurm-client Methylator]$ cat slurm_output/slurm-13605307.out

Common errors

Error starting gedit on IFB

If you encounter an error starting gedit

[unsername @ clust-slurm-client 16:04]$ ~ : gedit toto.txt
(gedit:6625): Gtk-WARNING **: cannot open display: 

Be sure to include -X when connecting to the cluster (-X option is necessary to run graphical applications remotely). Use :

You@YourComputer:~$ ssh -X unsername@core.cluster.france-bioinformatique.fr

You@YourComputer:~$ ssh -X -o "ServerAliveInterval 10" unsername@core.cluster.france-bioinformatique.fr

The option -o "ServerAliveInterval 10" is facultative, it keeps the connection alive even if you’re not using your shell for a while.

Initial QC fails

If you don’t get MultiQC report_quality_control.html report in results/EXAMPLE/fastqc, you may have some fastq files not fitting the required format:

SampleName_R1.fastq.gz and SampleName_R2.fastq.gz for pair-end data,
SampleName.fastq.gz for single-end data.

Please see how to correct a batch of FASTQ files.

Memory

I set up the memory necessary for each rule, but it is possible that big datasets induce a memory excess error. In that case the job stops and you get in the corresponding Slurm output something like this:

slurmstepd: error: Job 8430179 exceeded memory limit (10442128 > 10240000), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 8430179 ON cpu-node-13 CANCELLED AT 2020-05-20T09:58:05 ***
Will exit after finishing currently running jobs.

In that case, you can increase the memory request by modifying in workflow/resources.yaml the mem entry corresponding to the rule that failed.

[username@clust-slurm-client Methylator]$ cat workflow/resources.yaml
__default__:
  mem: 500
  name: rnaseq
  cpus: 1

qualityControl:
  mem: 6000
  name: QC
  cpus: 2

trim:
  mem: 6000
  name: trimming
  cpus: 8
...  

If the rule that failed is not listed here, you can add it respecting the format. And restart your workflow.

Folder locked

When snakemake is working on a folder, this folder is locked so that you can’t start another DAG and create a big mess. If you cancel the main job, snakemake won’t be able to unlock the folder and next time you run Workflow.sh wgbs, you will get the following error:

Error: Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/shared/mfs/data/projects/awesome/Methylator
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.

In order to remove the lock, run:

[username@clust-slurm-client Methylator]$ sbatch Unlock.sh

Then you can restart your workflow.

Storage space

Sometimes you may reach the quota you have for your project. To check the quota, run:

[username@clust-slurm-client Methylator]$ lfsgetquota YourProjectName

In principle it should raise an error, but sometimes it doesn’t and it’s hard to find out what is the problem. So if a task fails with no error (typically during mapping), try to make more space (or ask for more space on IFB Community support or iPOP-UP Community support) before trying again.

Paired-end or single-end ?

If you’re working with public data, directly download from SRA. Sometimes, sequencing is indicated as pair-end when it is actually single-end and vice versa. In this case, the FASTQ quality control step makes it easy to see. Open one of the sample_R1_fastqc.html files and look at the Per base sequence content section. If the data are pair-end when they were indicated as single-end, you will get a figure like this:

Ajouter image !! (j’en ai une au labo dans mon journal de bord)

If the data are single-end when they were indicated as paired-end, you will get a figure like this:

Ajouter image !!

Good practice

Always save job ID or the dateTtime (ie 20230615T1540) in your notes when launching Workflow.sh wgbs. It’s easier to find the outputs you’re interested in days/weeks/months/years later.

Juggling with several projects

TODO : rename image to methylator.simg ????

If you work on several projects [as defined by cluster documentation], you can either

have one independant installation of Methylator / project with its own Singularity Image (2 Go). To do that, git clone Methylator repository in each project.
have an independant Methylator folder, but share the Singularity Image. To do that, git clone Methylator repository in each project, but, before running any analysis, create a symbolic link of wgbsflow.simg from your first project:
```
[username@clust-slurm-client PROJECT2]$ cd Methylator
[username@clust-slurm-client Methylator]$ ln -s /shared/projects/YourFirstProjectName/Methylator/wgbsflow.simg/ wgbsflow.simg
```
Have only one Methylator folder and start all your analysis from the same project, but reading input files and writing result files in a different project.

Tricks

Make aliases

To save time avoiding typing long commands again and again, you can add aliases to your .bashrc file (change only the aliases, unless you know what you’re doing).

[username@clust-slurm-client ]$ cat ~/.bashrc 
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
	. /etc/bashrc
fi

# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=

# User specific aliases and functions

alias qq="squeue -u username"
alias sa="sacct --format=JobID,JobName,Start,CPUTime,MaxRSS,ReqMeM,State"
alias ll="ls -lht --color=always"

It will work next time you connect to the server. When you type sa, you will get the command sacct --format=JobID,JobName,Start,CPUTime,MaxRSS,ReqMeM,State running.

Quickly change fastq names

It is possible to quickly rename all your samples using mv. For instance if your samples are named with dots instead of underscores and without R:

[username@clust-slurm-client Raw_fastq]$ ls
D192T27.1.fastq.gz  
D192T27.2.fastq.gz  
D192T28.1.fastq.gz  
D192T28.2.fastq.gz  
D192T29.1.fastq.gz  
D192T29.2.fastq.gz  
D192T30.1.fastq.gz  
D192T30.2.fastq.gz  
D192T31.1.fastq.gz  
D192T31.2.fastq.gz  
D192T32.1.fastq.gz  
D192T32.2.fastq.gz  

You can modify them using mv and a loop on sample numbers.

[username@clust-slurm-client Raw_fastq]$ for i in `seq 27 32`; do mv D192T$i\.1.fastq.gz D192T$i\_R1.fastq.gz; done
[username@clust-slurm-client Raw_fastq]$ for i in `seq 27 32`; do mv D192T$i\.2.fastq.gz D192T$i\_R2.fastq.gz; done

Now sample names are OK:

[username@clust-slurm-client Raw_fastq]$ ls
D192T27_R1.fastq.gz  
D192T27_R2.fastq.gz  
D192T28_R1.fastq.gz  
D192T28_R2.fastq.gz  
D192T29_R1.fastq.gz  
D192T29_R2.fastq.gz  
D192T30_R1.fastq.gz  
D192T30_R2.fastq.gz  
D192T31_R1.fastq.gz  
D192T31_R2.fastq.gz  
D192T32_R1.fastq.gz  
D192T32_R2.fastq.gz  

Specificity of BSseq data

Ajouter des infos ici

Per base sequence content

In FASTQ quality control, the “Per base sequence content” section displays an error message. This is linked to the conversion of unmethylated cytosine to thymine. This error should be ignored.

Mapping

###

BiBs 2025 parisepigenetics
https://github.com/parisepigenetics/bibs

programming pages theme v0.5.22 (https://github.com/pixeldroid/programming-pages)

s	focus search bar ( enter to select, ▲ / ▼ to change selection)
g c	go to cluster
g e	go to edctools
g f	go to facility
g g	go to guidelines
g t	go to training
h	toggle this help ( esc also exits)

DNA methylation analysis using Methylator

Detailed tutorial

Table of content

Your analysis in a nutshell

Resources

Before starting : create an account and a project on your favorite cluster

Methylator installation and description

Adapt the workflow to your cluster

Quick start with the test dataset

Transfer your data

FASTQ names

Generate md5sum

Copy to the cluster

Check md5sum

Preparing the run

metadata.tsv

config_wgbs.yaml

1) Define a project name, paths and shared parameters

2) Steps of the workflow you want to run

3) Configuration of the specific tools

config_nanopore.yaml

1) Define a project name, paths and shared parameters

2) Steps of the workflow you want to run

3) Configuration of the specific tools

Running your analysis step by step

FASTQ quality control (eventually after SRA data retrieval)

FastQC results

Trimming

Mapping

Process BAM to BED (only for nanopore)

Statistical analysis

Exploratory analysis

Differential methylation analysis

Start the workflow

Description of the log files

Main script

Snakefiles

Individual tasks

Configuration and timing

Workflow results

Final report

Trimmed reads

Trimming report

FastQC of trimmed reads

Mapped reads

BigWig

Mapping QC

Statisical analysis results

Exploratory methylation analysis results

Differential methylation (DMC/DMT) results

Differential methylation (DMR only) results

Extra help!

How to follow your jobs

Running jobs

Information about past jobs

Cancelling a job

Having errors?

Common errors

Error starting gedit on IFB

Initial QC fails

Memory

Folder locked

Storage space

Paired-end or single-end ?

Good practice

Juggling with several projects

Tricks

Make aliases

Quickly change fastq names

Specificity of BSseq data

Per base sequence content

Mapping

Keyboard Shortcuts