Introduction
|
Tidy data principles are essential to increase data analysis efficiency and code readability.
Using R and RStudio, it becomes easier to implement good practices in data analysis.
You can make your workflow more reproducible and collaborative by using git and GitHub.
|
R & RStudio, R Markdown
|
R and RStudio make a powerful duo for creating R scripts and R Markdown notebooks.
RStudio offers a text editor, a console and some extra features (environment, files, etc.).
R is a functional programming language: everything revolves around functions.
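As a minimal sketch of what "everything revolves around functions" means in practice, the base R snippet below defines a small helper, passes a function to another function, and calls an operator as a function. The helper name `standardise()` and the example values are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: centre and scale a numeric vector.
standardise <- function(x) {
  (x - mean(x)) / sd(x)
}

values <- c(2, 4, 6, 8)
z <- standardise(values)

# Functions are ordinary objects and can be passed to other functions.
measurements <- list(a = c(1, 2, 3), b = c(10, 20, 30))
means <- sapply(measurements, mean)

# Operators are functions too: this is the same as 1 + 2.
three <- `+`(1, 2)
```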
|
Visualizing data with ggplot2
|
ggplot2 relies on the grammar of graphics, an advanced methodology to visualise data.
ggplot() creates a coordinate system that you can add layers to.
You pass a mapping using aes() to link dataset variables to visual properties.
You add one or more layers (or geoms) to the ggplot coordinate system and aes mapping.
Building a minimal plot requires supplying a dataset, aesthetic mappings and geometric layers (geoms).
ggplot2 offers advanced graphical visualisations to plot extra information from the dataset.
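The three ingredients listed above (dataset, aesthetic mapping, geoms) can be sketched as follows. This is only an illustrative example: it assumes the ggplot2 package is installed and uses the `mtcars` dataset that ships with R, which is not necessarily the dataset used in the lesson.

```r
library(ggplot2)

# mtcars ships with R: wt = weight, mpg = fuel efficiency, cyl = number of cylinders.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +  # dataset + aesthetic mapping
  geom_point(aes(colour = factor(cyl))) +                      # layer 1: points coloured by cylinder count
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)      # layer 2: linear trend line

print(p)  # draws the plot
```

Each `+` adds one layer on top of the coordinate system created by `ggplot()`; mapping `colour` inside `aes()` is one way to show extra information from the dataset.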
|
Data transformation with dplyr
|
The filter() function subsets a dataframe by rows.
The select() function subsets a dataframe by columns.
The mutate() function creates new columns in a dataframe.
The group_by() function groups rows that share the same values in one or more columns.
This grouping information is used by summarize() to make new columns that define aggregate values across groupings.
The pipe operator %>% (read as "then") lets you chain successive operations without defining intermediate variables, which keeps an analysis concise and easy to read.
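The verbs above can be chained into a single pipeline, as in the sketch below. It assumes the dplyr package is installed; the `plants` dataframe and its column names are hypothetical and invented purely for illustration.

```r
library(dplyr)

# Hypothetical toy dataset, invented for illustration only.
plants <- data.frame(
  genotype  = c("wt", "wt", "mut", "mut", "mut"),
  height_cm = c(10, 12, 7, 9, 30),
  leaves    = c(5, 6, 3, 4, 8)
)

summary_tbl <- plants %>%
  filter(height_cm < 20) %>%               # rows: drop the 30 cm plant
  select(genotype, height_cm) %>%          # columns: keep only what is needed
  mutate(height_mm = height_cm * 10) %>%   # new column derived from an existing one
  group_by(genotype) %>%                   # one group per genotype
  summarize(
    mean_height_mm = mean(height_mm),      # aggregate value per group
    n_plants = n()                         # number of rows per group
  )

print(summary_tbl)
```

Without the pipe, the same analysis would need either nested function calls or one intermediate variable per step.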
|
Data tidying with tidyr
|
The gather() function turns columns into rows (makes a dataset tidy).
The spread() function turns rows into columns (makes a dataset wide).
Tidy datasets go hand in hand with ggplot2 plotting.
The complete() function fills in implicitly missing observations (balances the number of observations).
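A small round trip through these three functions might look like the sketch below. It assumes the tidyr package is installed; the `wide` dataframe is a hypothetical example. Note that in tidyr 1.0.0 and later, gather() and spread() still work but are superseded by pivot_longer() and pivot_wider().

```r
library(tidyr)

# Hypothetical wide dataset: one column per measurement day.
wide <- data.frame(
  plant = c("p1", "p2"),
  day1  = c(1.0, 1.5),
  day2  = c(2.0, 2.5)
)

# Columns into rows: day1/day2 become values of a new 'day' column.
long <- gather(wide, key = "day", value = "height", day1, day2)

# Rows back into columns.
wide_again <- spread(long, key = day, value = height)

# Drop one observation, then let complete() restore the missing
# plant/day combination as an explicit NA.
incomplete <- long[-4, ]
balanced <- complete(incomplete, plant, day)
```

The long form is the one ggplot2 expects, because `day` can then be mapped to an aesthetic such as `x` or `colour`.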
|
Programming with R
|
An R script is a plain text file with an .R extension that you can execute.
Comments in an R script are written with a # (hash sign).
Loops allow you to automate a series of similar actions.
Conditional if/else statements help you control the execution of your R script.
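The points above can be combined in a short script. The sketch below uses only base R; the `heights` vector and the threshold of 10 are hypothetical values chosen for illustration.

```r
# Lines starting with # are comments and are ignored by R.

# Hypothetical measurements, invented for illustration.
heights <- c(4, 15, 9, 22)
labels <- character(length(heights))

# The loop repeats the same action for every element.
for (i in seq_along(heights)) {
  # if/else decides which branch runs for this element.
  if (heights[i] >= 10) {
    labels[i] <- "tall"
  } else {
    labels[i] <- "short"
  }
}

print(labels)
```

For a simple case like this, the vectorised `ifelse(heights >= 10, "tall", "short")` gives the same result without an explicit loop.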
|
Version control with git and GitHub
|
git and GitHub allow you to version control files and go back in time if needed.
In a version control system, file names do not reflect their versions.
An RStudio project folder can be fully version controlled and synchronized online with GitHub.
Working locally in RStudio with a synchronised online folder will make your work more stable and understandable for you and others.
|
Collaborating with GitHub
|
GitHub allows you to synchronise work efforts and collaborate with other scientists on (R) code.
GitHub can be used to make a custom website visible on the internet.
Merge conflicts can arise even when you work alone, for example when you edit the same file on different machines.
Merge conflicts also arise when you collaborate and are a safe way to handle discordance.
Efficient collaboration on data analysis is possible using GitHub.
|
Become a champion of open (data) science
|
Make your data and code available to others
Make your analyses reproducible
Make a sharp distinction between exploratory and confirmatory research
|