This lesson is in the early stages of development (Alpha version)

Introduction to Open Data Science with R: Glossary

Key Points

Introduction
  • Tidy data principles are essential to increase data analysis efficiency and code readability.

  • Using R and RStudio, it becomes easier to implement good practices in data analysis.

  • You can make your workflow more reproducible and collaborative by using git and GitHub.

R & RStudio, R Markdown
  • R and RStudio make a powerful duo to create R scripts and R Markdown notebooks.

  • RStudio offers a text editor, a console and some extra features (environment, files, etc.).

  • R is a functional programming language: everything revolves around functions.
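
As a small illustration of the last point, here is a minimal sketch in base R. The helper name `celsius_to_fahrenheit` and the temperature values are made up for this example and are not part of the lesson.

```r
# A function is an ordinary object: you create it with function() and assign it to a name.
# (celsius_to_fahrenheit is a hypothetical helper used only for illustration.)
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}

temperatures_c <- c(0, 37, 100)

# Arithmetic in R is vectorised, so a single call converts all three values.
temperatures_f <- celsius_to_fahrenheit(temperatures_c)
print(temperatures_f)

# Functions can be passed as arguments to other functions, such as sapply().
rounded <- sapply(temperatures_f, round)

# Even operators are functions: `+`(1, 2) is the same as 1 + 2.
three <- `+`(1, 2)
```

You can run these lines in the RStudio console or save them in an R script.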

Visualizing data with ggplot2
  • ggplot2 relies on the grammar of graphics, an advanced methodology to visualise data.

  • ggplot() creates a coordinate system that you can add layers to.

  • You pass a mapping using aes() to link dataset variables to visual properties.

  • You add one or more layers (or geoms) to the ggplot coordinate system and aes mapping.

  • Building a minimal plot requires supplying a dataset, aesthetic mappings and geometric layers (geoms).

  • ggplot2 offers advanced graphical visualisations to plot extra information from the dataset.
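
The points above can be sketched as follows. This example assumes the `mpg` dataset that ships with ggplot2; the lesson itself may use a different dataset, and the choice of variables is only illustrative.

```r
library(ggplot2)

# Minimal plot: a dataset, an aesthetic mapping and one geometric layer.
# (mpg is an example dataset bundled with ggplot2, assumed here for illustration.)
minimal_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Extra information: map a third variable to colour in the point layer
# and add a second layer with a linear trend line.
richer_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(colour = class)) +
  geom_smooth(method = "lm", se = FALSE)

# Printing a plot object draws it.
print(minimal_plot)
```

Each `+` adds one layer, so a plot can be built up step by step and inspected along the way.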

Data transformation with dplyr
  • The filter() function subsets a dataframe by rows.

  • The select() function subsets a dataframe by columns.

  • The mutate() function creates new columns in a dataframe.

  • The group_by() function creates groups of unique column values.

  • This grouping information is used by summarize() to make new columns that define aggregate values across groupings.

  • The pipe operator %>% (read it as "then") lets you chain successive operations without defining intermediate variables, which keeps the analysis concise and easy to read.
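
A minimal sketch of how these verbs combine in one pipeline. The `measurements` dataframe and its column names are hypothetical and only serve as an illustration.

```r
library(dplyr)

# A small made-up dataset (hypothetical values, for illustration only).
measurements <- data.frame(
  site   = c("A", "A", "B", "B", "C"),
  year   = c(2019, 2020, 2019, 2020, 2020),
  height = c(10, 12, 20, 22, 5),
  width  = c(2, 3, 4, 5, 1)
)

site_summary <- measurements %>%
  filter(height >= 10) %>%              # keep the rows that satisfy a condition
  select(site, height, width) %>%       # keep only these columns
  mutate(area = height * width) %>%     # create a new column
  group_by(site) %>%                    # one group per unique site
  summarize(mean_area = mean(area))     # one aggregate value per group

# mean_area is 28 for site A and 95 for site B; site C is removed by filter().
print(site_summary)
```

Reading the pipeline from top to bottom gives the order of operations, with no intermediate objects to name.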

Data tidying with tidyr
  • The gather() function turns columns into rows (makes a dataset tidy).

  • The spread() function turns rows into columns (makes a dataset wide).

  • Tidy datasets go hand in hand with ggplot2 plotting.

  • The complete() function fills in implicitly missing observations (balances the number of observations).
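
The three functions can be tried on a toy example. The `wide_counts` dataframe below is made up for illustration. Note that gather() and spread() still work, although recent tidyr releases recommend pivot_longer() and pivot_wider() for the same tasks.

```r
library(tidyr)

# A small made-up "wide" dataset: one column per season (hypothetical values).
wide_counts <- data.frame(
  site   = c("A", "B"),
  spring = c(5, 8),
  autumn = c(7, NA)
)

# gather(): columns -> rows. Each row becomes one observation (site, season, count).
# na.rm = TRUE drops the missing value, so site B in autumn is now implicitly missing.
long_counts <- gather(wide_counts, key = "season", value = "count", -site, na.rm = TRUE)

# complete(): add the implicitly missing site/season combination back as an explicit NA.
balanced_counts <- complete(long_counts, site, season)

# spread(): rows -> columns, back to a wide layout.
wide_again <- spread(balanced_counts, key = season, value = count)

print(balanced_counts)
```

After gather() the data has three rows, and complete() restores the fourth one so that every site has the same number of observations.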

Programming with R
  • An R script is a plain text file with an .R extension that you can execute.

  • Comments in an R script start with a # (hashtag).

  • Loops allow you to automate a series of similar actions.

  • Conditional statements (if/else) help you control the execution of your R script.
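
A minimal sketch that combines a comment, a loop and an if/else condition. The values and the threshold of 10 are made up for illustration.

```r
# Lines starting with # are comments and are ignored when the script runs.

# Made-up input values for illustration.
sample_sizes <- c(12, 45, 7)

# Pre-allocate the result, then let the loop repeat the same action for each element.
size_labels <- character(length(sample_sizes))

for (i in seq_along(sample_sizes)) {
  # if/else decides which branch runs for the current value.
  if (sample_sizes[i] >= 10) {
    size_labels[i] <- "large enough"
  } else {
    size_labels[i] <- "too small"
  }
}

print(size_labels)
```

Saved in a file with an .R extension, these lines can be executed as a script from RStudio.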

Version control with git and GitHub
  • git and GitHub allow you to version control files and go back in time if needed.

  • In a version control system, file names do not need to reflect their versions, because the system records the history of each file.

  • An RStudio project folder can be fully version controlled and synchronized online with GitHub.

  • Working locally in RStudio with a synchronised online folder will make your work more stable and understandable for you and others.

Collaborating with GitHub
  • GitHub allows you to synchronise work efforts and collaborate with other scientists on (R) code.

  • GitHub can be used to publish a custom website on the internet.

  • Merge conflicts can arise even when you work alone, for example when you edit the same file on different machines.

  • Merge conflicts arise when you collaborate and are a safe way to handle conflicting changes.

  • GitHub makes efficient collaboration on data analysis possible.

Become a champion of open (data) science
  • Make your data and code available to others

  • Make your analyses reproducible

  • Make a sharp distinction between exploratory and confirmatory research
