Setup
1. Option 1 (preferred): using a Docker image
1.1 Installing Docker
The preferred option to install all softwares and packages is to use a tailor-made Docker image. See this nice introduction to Docker here.
There are two Docker images necessary to complete this RNA-seq lesson:
- The command-line Docker
fastq-2021
image necessary to perform all bioinformatic analyses on the sequencing files: trimming, alignment and count table generation. - The RStudio Docker
rnaseq-2021
image necessary to perform all count-related analyses: EDA, differential expression and downstream functional analyses.
So first thing first, we need to install Docker itself.
Install Docker
Unfortunately, in many common situations installing Docker on your laptop will not straightforward if you do not have a large amount of technical experience. We have helpers on hand that have worked their way through the install process but be prepared for some troubleshooting. Please try to install the appropriate software from the list below depending on the operating system that your laptop is running:
Microsoft Windows
You must have admin rights to run docker! Some parts of the lesson will work without running as admin but if you are unable to
Run as admin
on your machine some of this workshop might not work easily.If you have Windows 10 Pro Edition:
- First try to install the Docker Desktop (Windows), or failing that;
- Install the Docker Toolbox (Windows).
If you have Windows 10 Home Edition:
- Install the Docker Toolbox (Windows).
Apple macOS
Either:
- First, try to install the Docker Desktop (Mac), or failing that:
- Install the Docker Toolbox (Mac).
Linux
There are too many varieties of Linux to give precise instructions here, but hopefully you can locate documentation for getting Docker installed on your Linux distribution. It may already be installed. Note that Docker do list a number of versions of the Docker Engine for different Linux distributions here.
Troubleshooting
Sometimes with git-bash and Windows, you can get issues listed here:
the input device is not a TTY. If you are using mintty, try prefixing the command with 'winpty'
. This can be troubleshooted following this blog post.
1.2 The fastq-2022
image for bioinformatic steps (episodes 03 and 04)
This Docker image will allow you to complete the episodes 03 and 04 that work on .fastq
sequencing files.
The Docker image is called fastq-2022
and contains softwares and data required for the command-line part of the lesson. It can be found found at the Science Park Study Group DockerHub with the tag fastq-2022
.
Before you start
Before the training, please make sure you have done the following:
- First, install Docker desktop for your operating system (Mac OS X or Windows).
- If needed, install Shell Bash: follow these instructions.
- Open a new Shell Bash window and navigate to a folder that will be your workspace. For instance, you could create a folder named
rnaseq-tutorial/
on your Desktop and move inside with the Shell usingcd ~/Desktop/rnaseq-tutorial/
.- In a Shell Bash window, type the following command:
docker run -it --name bioinfo -v $PWD:/workspace/ scienceparkstudygroup/master-gls:fastq-2021
. This will download a Docker image for the bioinformatic part of the course, create and run a container where Bash will be running. You will enter the container directly where you can start working.- To quit, type
exit
and you will exit the container and be on your machine file system again. The container will be stopped.- To go back to the container, type
docker start bioinfo
and thendocker exec -it bioinfo bash
. You will enter inside the container again where you can find all softwares and data.- Type
exit
to go back to your file system.
Docker command-line explanations:
- The
--it
starts an interactive session in which you directly start AND enter the container. - The
--name
gives a name to the container for easy retrieval. - The
-v $PWD:/workspace/
maps your working directory (e.g.~/Desktop/rnaseq-tutorial
) to the container/workspace/
folder.
1.3 The rnaseq-2022
image for the gene counts data analysis (episodes 05, 06 and 07)
This image is based on a Bioconductor Docker image release 3.14 image with additional packages such as pheatmap
or tidyverse
.
The image can be found at the Science Park Study Group DockerHub with the tag rnaseq-2022
.
Before you start
Before the training, please make sure you have done the following:
- First, install Docker desktop for your operating system.
- If needed, install Shell Bash: follow these instructions.
- Open a new Shell Bash window and navigate to a folder that will be your workspace. For instance, you could create a folder named
workspace/
on your Desktop and move inside with the Shell usingcd ~/Desktop/workspace/
.- In a Shell Bash window, type the following command:
docker run --detach --name machine01 -e PASSWORD=mypwd -p 8787:8787 scienceparkstudygroup/master-gls:rnaseq-2021
. This will download a Docker image for the course, create and run a container where RStudio will be running.- Navigate to http://localhost:8787 in your web browser. You should have an RStudio session running. Type
rstudio
as the user name andmypwd
as your password.- To quit, close the web browser window where RStudio is running and exit the Shell too.
Important note
You can save files to your disk when working inside the Docker-powered R session. You need to save them as you would normally. The files (e.g.
my_plot.png
) will be where you were working (the directory from which you launched the Docker container).
Docker command-line explanations:
- The
--rm
removes the container when it has been run. No need to store it into your computer after use. - The
--name
gives a name to the running container for easy retrieval. - The
-p 8787:8787
follow the format-p host_port:container_port
. Therefore the port 8787 inside the container will be exposed to the outside port on the host machine. That way, the running instance of RStudio can be access through the:port format.
2. Option 2: manual installation
This is the second way to install softwares and packages. It should work but there is no guarantee that it will work since R and packages versions on your machine might be different from the software and package versions used in this lesson. Thus, the preferred way is still to use the Docker image (option 1).
2.1 Softwares and packages
Before you start.
Before the training, please make sure you have done the following:
- Download and install up-to-date versions of:
- R: https://cloud.r-project.org.
- RStudio: http://www.rstudio.com/download.
- The
DESeq2
package: https://bioconductor.org/packages/release/bioc/html/DESeq2.html.- The
tidyverse
package: https://www.tidyverse.org/.- The
EnhancedVolcano
package: http://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html.- The
pheatmap
package: https://cran.r-project.org/web/packages/pheatmap/index.html.- The
biomartr
package: https://cran.r-project.org/web/packages/biomartr/index.html.- The
clusterProfiler
package: https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html.- The
org.At.tair.db
package: https://www.bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html.- The
biomaRt
package: https://bioconductor.org/packages/release/bioc/html/biomaRt.html.- The
pwr
package: https://cran.r-project.org/web/packages/pwr/index.html.- Read the workshop Code of Conduct to make sure this workshop stays welcoming for everybody.
- Get comfortable: if you’re not in a physical workshop, be set up with two screens if possible. You will be following along in RStudio on your own computer while also following this tutorial on your own. More instructions are available on the workshop website in the Setup section.
2.2 Data files
What you need to download for the part completed in the Shell (fastq QC, alignment, counting)
Please download the necessary data files for the lesson from the Zenodo archive.
- Arabidopsis_sample1/2/3/4.fq.gz: A
FASTQ
file containing a sample sequenced mRNA-seq reads in the FASTQ format.- AtChromosome1.fa.gz: the gzipped chromosome 1 sequence of the Arabidopsis thaliana genome in
FASTA
format.- ath_annotation.gff3.gz: the gzipped genome annotation of Arabidopsis thaliana for chromosome 1 in the
GFF3
format. This indicates the positions of genes, their exons and 5’ or 3’ UTR on the chromosome and is used to generate the gene counts.- adapters.fasta: the Illumina adapter sequences used for read trimming using Trimmomatic.
What you need to download for the part completed in R (PCA, DEseq2, Clustering)
Please download the necessary data files for the lesson from the Zenodo archive.
- Counts: A
raw_counts.csv
dataframe of the sample raw counts. It is a tab separated file therefore data are in tabulated separated columns.- Samples to experimental conditions: the
samples_to_conditions.csv
dataframe indicates the correspondence between samples and experimental conditions (e.g. control, treated).- Differentially expressed genes:
differential_genes.csv
dataframe contains the result of the DESeq2 analysis.- Please read the original study description below and have a look at the file preview to understand their format.
- These
raw_counts.csv
file was obtained by running thev0.1.1
version of a RNA-Seq bioinformatic pipeline on the mRNA-Seq sequencing files from Vogel et al. (2016).
3. Original study
This RNA-seq lesson will make use of a dataset from a study on the model plant Arabidopsis thaliana inoculated with commensal leaf bacteria (Methylobacterium extorquens or Sphingomonas melonis) and infected or not with a leaf bacterial pathogen called Pseudomonas syringae. Leaf samples were collected from Arabidopsis plantlets from plants inoculated or not with commensal bacteria and infected or not with the leaf pathogen either after two days (2 dpi, dpi: days post-inoculation) or seven days (6 dpi).
All details from the study are available in Vogel et al. in 2016 and was published in New Phytologist.
3.1 Gene counts
The dimension of this table are 33,769 rows x 49 columns.
- 33,769 rows: one for gene and sample names and the rest for gene counts.
- 49 columns: one for the gene id and the rest for sample accession identifiers (from the EBI European Nucleotide Archive).
Geneid | ERR1406259 | ERR1406260 | ERR1406261 | ERR1406262 | ERR1406263 | ERR1406264 | ERR1406265 | ERR1406266 | ERR1406268 | ERR1406269 | ERR1406270 | ERR1406271 | ERR1406272 | ERR1406273 | ERR1406274 | ERR1406275 | ERR1406276 | ERR1406277 | ERR1406278 | ERR1406279 | ERR1406280 | ERR1406281 | ERR1406282 | ERR1406284 | ERR1406285 | ERR1406286 | ERR1406287 | ERR1406288 | ERR1406289 | ERR1406290 | ERR1406291 | ERR1406292 | ERR1406293 | ERR1406294 | ERR1406296 | ERR1406297 | ERR1406298 | ERR1406299 | ERR1406300 | ERR1406301 | ERR1406302 | ERR1406303 | ERR1406304 | ERR1406305 | ERR1406306 | ERR1406307 | ERR1406308 | ERR1406309 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AT1G01010 | 59 | 81 | 40 | 51 | 57 | 110 | 93 | 87 | 99 | 131 | 80 | 79 | 142 | 216 | 102 | 76 | 92 | 116 | 100 | 126 | 151 | 249 | 61 | 189 | 161 | 92 | 80 | 125 | 77 | 106 | 90 | 86 | 164 | 71 | 64 | 83 | 100 | 86 | 91 | 214 | 142 | 76 | 84 | 123 | 91 | 69 | 75 | 85 |
AT1G01020 | 365 | 466 | 440 | 424 | 393 | 567 | 397 | 468 | 465 | 365 | 382 | 365 | 595 | 509 | 323 | 422 | 325 | 358 | 415 | 403 | 498 | 501 | 441 | 498 | 409 | 396 | 472 | 566 | 422 | 462 | 504 | 434 | 717 | 534 | 408 | 346 | 757 | 456 | 443 | 976 | 517 | 467 | 533 | 648 | 457 | 393 | 538 | 579 |
AT1G03987 | 8 | 16 | 13 | 19 | 13 | 20 | 19 | 24 | 8 | 10 | 10 | 14 | 11 | 13 | 10 | 9 | 11 | 20 | 14 | 10 | 10 | 8 | 14 | 25 | 14 | 13 | 18 | 17 | 19 | 4 | 12 | 14 | 29 | 15 | 19 | 47 | 28 | 6 | 21 | 20 | 5 | 5 | 8 | 17 | ||||
AT1G01030 | 111 | 200 | 189 | 164 | 141 | 389 | 200 | 175 | 127 | 186 | 140 | 189 | 147 | 193 | 102 | 101 | 103 | 128 | 136 | 120 | 162 | 229 | 124 | 177 | 125 | 136 | 169 | 197 | 141 | 217 | 214 | 180 | 253 | 161 | 98 | 152 | 371 | 219 | 170 | 566 | 441 | 99 | 207 | 220 | 169 | 117 | 123 | 183 |
AT1G03993 | 131 | 179 | 169 | 157 | 114 | 156 | 138 | 184 | 193 | 143 | 135 | 155 | 218 | 236 | 159 | 194 | 149 | 156 | 168 | 128 | 174 | 269 | 183 | 215 | 176 | 165 | 171 | 247 | 179 | 181 | 177 | 199 | 313 | 236 | 154 | 169 | 313 | 201 | 202 | 332 | 169 | 218 | 203 | 250 | 190 | 188 | 223 | 218 |
AT1G01040 | 1491 | 1617 | 1418 | 1543 | 1224 | 1635 | 1524 | 1665 | 1565 | 1566 | 1496 | 1499 | 2244 | 1881 | 1177 | 1751 | 1444 | 1631 | 1393 | 1407 | 1880 | 2311 | 1529 | 1919 | 1662 | 1537 | 1691 | 2142 | 1469 | 1733 | 1910 | 1873 | 3079 | 2179 | 1486 | 1471 | 2840 | 1891 | 1924 | 3136 | 1520 | 1901 | 1950 | 2596 | 1802 | 1851 | 2133 | 1984 |
AT1G01046 | 35 | 30 | 48 | 32 | 28 | 50 | 51 | 56 | 36 | 26 | 29 | 38 | 48 | 30 | 15 | 44 | 23 | 31 | 22 | 27 | 33 | 51 | 41 | 35 | 48 | 38 | 41 | 49 | 27 | 36 | 39 | 50 | 57 | 49 | 41 | 30 | 54 | 41 | 43 | 85 | 42 | 42 | 59 | 65 | 49 | 64 | 50 | 46 |
ath-miR838 | 12 | 11 | 22 | 18 | 15 | 21 | 22 | 24 | 16 | 12 | 10 | 15 | 17 | 16 | 7 | 20 | 11 | 14 | 6 | 11 | 16 | 17 | 17 | 15 | 26 | 12 | 17 | 13 | 15 | 12 | 18 | 25 | 26 | 25 | 15 | 15 | 22 | 20 | 14 | 37 | 20 | 20 | 22 | 27 | 17 | 21 | 23 | 23 |
AT1G01050 | 1484 | 1483 | 1237 | 1544 | 1119 | 1453 | 1280 | 1256 | 1768 | 1869 | 1709 | 1649 | 2431 | 1858 | 1195 | 1518 | 1325 | 2013 | 1645 | 1666 | 2056 | 2258 | 1530 | 1834 | 1477 | 1532 | 1609 | 2220 | 1552 | 1976 | 1706 | 1807 | 2656 | 1873 | 1329 | 1512 | 2915 | 1646 | 1983 | 2687 | 1548 | 1740 | 1632 | 2330 | 1578 | 1521 | 1970 | 1977 |
… many more lines …
3.2 Experimental design table
dpi: days post-inoculation.
sample | growth | infected | dpi |
---|---|---|---|
ERR1406259 | MgCl2 | mock | 2 |
ERR1406271 | MgCl2 | mock | 2 |
ERR1406282 | MgCl2 | mock | 2 |
ERR1406294 | MgCl2 | mock | 2 |
ERR1406305 | MgCl2 | mock | 7 |
ERR1406306 | MgCl2 | mock | 7 |
ERR1406307 | MgCl2 | mock | 7 |
ERR1406308 | MgCl2 | mock | 7 |
ERR1406260 | MgCl2 | Pseudomonas_syringae_DC3000 | 2 |
ERR1406261 | MgCl2 | Pseudomonas_syringae_DC3000 | 2 |
ERR1406262 | MgCl2 | Pseudomonas_syringae_DC3000 | 2 |
ERR1406309 | MgCl2 | Pseudomonas_syringae_DC3000 | 2 |
ERR1406263 | MgCl2 | Pseudomonas_syringae_DC3000 | 7 |
ERR1406264 | MgCl2 | Pseudomonas_syringae_DC3000 | 7 |
ERR1406265 | MgCl2 | Pseudomonas_syringae_DC3000 | 7 |
ERR1406266 | MgCl2 | Pseudomonas_syringae_DC3000 | 7 |
ERR1406287 | Methylobacterium_extorquens_PA1 | mock | 2 |
ERR1406288 | Methylobacterium_extorquens_PA1 | mock | 2 |
ERR1406289 | Methylobacterium_extorquens_PA1 | mock | 2 |
ERR1406290 | Methylobacterium_extorquens_PA1 | mock | 2 |
ERR1406291 | Methylobacterium_extorquens_PA1 | mock | 7 |
ERR1406292 | Methylobacterium_extorquens_PA1 | mock | 7 |
ERR1406293 | Methylobacterium_extorquens_PA1 | mock | 7 |
ERR1406296 | Methylobacterium_extorquens_PA1 | mock | 7 |
ERR1406297 | Methylobacterium_extorquens_PA1 | Pseudomonas_syringae_DC3000 | 2 |
ERR1406298 | Methylobacterium_extorquens_PA1 | Pseudomonas_syringae_DC3000 | 2 |
ERR1406299 | Methylobacterium_extorquens_PA1 | Pseudomonas_syringae_DC3000 | 2 |
ERR1406300 | Methylobacterium_extorquens_PA1 | Pseudomonas_syringae_DC3000 | 2 |
ERR1406301 | Methylobacterium_extorquens_PA1 | Pseudomonas_syringae_DC3000 | 7 |
ERR1406302 | Methylobacterium_extorquens_PA1 | Pseudomonas_syringae_DC3000 | 7 |
ERR1406303 | Methylobacterium_extorquens_PA1 | Pseudomonas_syringae_DC3000 | 7 |
ERR1406304 | Methylobacterium_extorquens_PA1 | Pseudomonas_syringae_DC3000 | 7 |
ERR1406268 | Sphingomonas_melonis_Fr1 | mock | 2 |
ERR1406269 | Sphingomonas_melonis_Fr1 | mock | 2 |
ERR1406270 | Sphingomonas_melonis_Fr1 | mock | 2 |
ERR1406272 | Sphingomonas_melonis_Fr1 | mock | 2 |
ERR1406273 | Sphingomonas_melonis_Fr1 | mock | 7 |
ERR1406274 | Sphingomonas_melonis_Fr1 | mock | 7 |
ERR1406275 | Sphingomonas_melonis_Fr1 | mock | 7 |
ERR1406276 | Sphingomonas_melonis_Fr1 | mock | 7 |
ERR1406277 | Sphingomonas_melonis_Fr1 | Pseudomonas_syringae_DC3000 | 2 |
ERR1406278 | Sphingomonas_melonis_Fr1 | Pseudomonas_syringae_DC3000 | 2 |
ERR1406279 | Sphingomonas_melonis_Fr1 | Pseudomonas_syringae_DC3000 | 2 |
ERR1406280 | Sphingomonas_melonis_Fr1 | Pseudomonas_syringae_DC3000 | 2 |
ERR1406281 | Sphingomonas_melonis_Fr1 | Pseudomonas_syringae_DC3000 | 7 |
ERR1406284 | Sphingomonas_melonis_Fr1 | Pseudomonas_syringae_DC3000 | 7 |
ERR1406285 | Sphingomonas_melonis_Fr1 | Pseudomonas_syringae_DC3000 | 7 |
ERR1406286 | Sphingomonas_melonis_Fr1 | Pseudomonas_syringae_DC3000 | 7 |