Software tools to extract mRNA features from a list of ENSEMBL gene IDs.
authors: Costas BOUYIOUKOS, Antoine LU and Arnold Franz AKE
A set of computational tools to extract user-defined mRNA features from a list of ENSEMBL gene IDs, using the ENSEMBL BioMart web API and custom computations. Conceived and developed by Costas Bouyioukos @cbouyio at Paris Epigenetics @parisepigenetics and Universite Paris Diderot. Development involved two bioinformatics master's students: Antoine LU @antoinezl, who started the work as a coding project during the second year of his degree, and Franz-Arnold AKE @franzx5, a second-year master's student who mainly worked on the clustering part of the project.
To install the tools in your local Python environment (under your $HOME directory), type:
./setup.py install --user
(The --user flag installs the software in your personal account; no root privileges are required.)
All Python dependencies are available for installation via pip install <package_name>
For external tools please follow the installation guidelines in the provided links.
geneIDs2fasta.py ENSEMBL_geneIDs_file fasta_output_file
and
fasta2table.py ENSEMBL_fasta_output_file features_table_file
This program takes a text file with a list of ENSEMBL gene IDs and returns a FASTA-formatted file of the corresponding cDNA sequences. Each header line contains the following metadata fields, in this order:
>ENSEMBL_transcript_ID |Gene stable ID | Gene name | cDNA start | cDNA end | TSL | APRIS | HAVANA_ENSEMBL | gene description | Source:|
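To work with these headers downstream, the pipe-separated fields can be split into a dictionary. The following is a minimal sketch, not part of the tools themselves: the function name parse_header, the dictionary keys, and the example header values are all hypothetical, and it assumes that no field contains a "|" character.

```python
# Hypothetical field names, following the order given in the header description.
FIELDS = [
    "transcript_id", "gene_id", "gene_name", "cdna_start", "cdna_end",
    "tsl", "apris", "havana_ensembl", "description", "source",
]

def parse_header(line):
    """Split one FASTA header line into a dict keyed by FIELDS.

    Assumes fields are separated by '|' and that no field contains '|'.
    """
    parts = [p.strip() for p in line.strip().lstrip(">").split("|")]
    # A trailing '|' leaves an empty last element; drop it.
    if parts and parts[-1] == "":
        parts.pop()
    # zip() stops at the shorter sequence, so missing fields are simply absent.
    return dict(zip(FIELDS, parts))

# Placeholder header with made-up values, for illustration only.
example = (">TRANSCRIPT_X |GENE_X | NAME_X | 100 | 900 | tsl1 | "
           "principal1 | ensembl_havana | example description | Source:example|")
record = parse_header(example)
print(record["gene_name"], record["cdna_start"], record["cdna_end"])
```

Numeric fields such as the cDNA start and end are returned as strings and would need an explicit int() conversion before use.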
This program takes the FASTA-formatted file returned by the previous script, geneIDs2fasta, as input and returns a semicolon-separated table with the following header:
ensembl_gene_id;gene_name;coding_len;5pUTR_len;5pUTR_GC;5pUTR_MFE;5pUTR_MfeBP;3pUTR_len;3pUTR_GC;3pUTR_MFE;3pUTR_MfeBP;TOP_localScore;CAI;Kozak_Sequence;Kozak_Context
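Because the table is semicolon-separated, it can be loaded with Python's standard csv module by setting the delimiter. The sketch below is illustrative only: read_features is a hypothetical helper, and the single data row is filled with dummy placeholders (v0, v1, ...) rather than real output from fasta2table.py.

```python
import csv
import io

# Column header as documented above.
HEADER = ("ensembl_gene_id;gene_name;coding_len;5pUTR_len;5pUTR_GC;5pUTR_MFE;"
          "5pUTR_MfeBP;3pUTR_len;3pUTR_GC;3pUTR_MFE;3pUTR_MfeBP;TOP_localScore;"
          "CAI;Kozak_Sequence;Kozak_Context")

def read_features(handle):
    """Read a semicolon-separated features table into a list of dicts."""
    return list(csv.DictReader(handle, delimiter=";"))

# Build a dummy one-row table in memory; the values are placeholders, not real data.
n_columns = len(HEADER.split(";"))
dummy_row = ";".join(f"v{i}" for i in range(n_columns))
rows = read_features(io.StringIO(HEADER + "\n" + dummy_row + "\n"))

print(n_columns, rows[0]["ensembl_gene_id"], rows[0]["Kozak_Context"])
```

For a real file, pass an open file object instead, for example open("features_table_file", newline=""), where the file name is whatever was given as the second argument to fasta2table.py.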
The test directory contains two files for testing and demonstrating the functionality of the tools.
test/testENSEMBLids.txt Contains 6 genes with their ENSEMBL IDs.
test/testTransExpr.csv Contains the expression levels of each individual transcript of the above genes from a case study.