This lesson is still being designed and assembled (Pre-Alpha version)

Project and file management

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • How can I consistently organise my folder during my research project?

  • Is there a good, consistent way to name files?

  • What should I avoid when it comes to file naming?

  • How can I add simple explanations about a folder's content?

  • Are there specific tips regarding publications and scripts (version control)?

Objectives

1. Folder structure

To avoid a chaotic Desktop filled with hundreds of files in different formats and with different creation dates, set up a consistent and meaningful folder hierarchy. This is a worthwhile investment at the beginning that will save you time and brain power later on. It can even save your research project as a whole by making backups and reliable external storage easier.

Chaotic bunch of files on the Desktop

Discussion

Pair up with your neighbour. Compare the folder and file structure of a research project each of you works on. List what is clear and what is not!

When you start a research project, it is often a good idea to think about a way to structure the information you will collect. Think about a way to organise the datasets, additional information, etc. that are related to your project.

Question

How many different types of research data can you list off the top of your head?

How many did you come up with? The list of research data types is actually quite long and differs depending on your research domain.

Solution

This is a proposal rather than the definitive solution. Here are a few suggestions:

  1. Image files.
  2. Spreadsheet documents, e.g. comma-separated (.csv) files from measurements.
  3. Presentations in open (e.g. PDF, .pdf) or proprietary formats (e.g. Microsoft .pptx).
  4. Document files in open (plain text, ASCII or UTF-8 encoded) or proprietary file formats (.doc). This is a data type that comes in many different flavors.
  5. Domain-specific files, e.g. gas chromatography mass spectrometry data (.mzXML).

We have probably forgotten a few. Here is a more complete list taken from the research data management (RDM) section of the University of Leeds (UK).

A more complete list

Types of research data

  • documents
  • spreadsheets
  • laboratory and field notebooks
  • questionnaires
  • transcripts from conversations
  • audiotapes
  • videotapes
  • photographs
  • database contents (video, audio, text, images)
  • models, algorithms, scripts
  • methodologies and workflows
  • standard operating procedures and protocols

1.1 Good practices for folder organisation

Prefix each folder name with a number (e.g. 01.admin/) so that the folder keeps its position in a sorted listing even if the rest of its name is changed.
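As a minimal sketch of this numbering convention, the commands below create a small hierarchy and rename one folder. Only 01.admin/ comes from the text above; the other folder names and the temporary project location are hypothetical placeholders.

```shell
# Throwaway location for the demonstration (hypothetical project name)
project_root="$(mktemp -d)/my_project"

# Zero-padded numeric prefixes fix the order of the folders in a listing.
# Only 01.admin comes from the lesson; the other names are assumptions.
mkdir -p "$project_root/01.admin" \
         "$project_root/02.experiments" \
         "$project_root/03.analysis" \
         "$project_root/04.manuscript"

# Renaming the descriptive part does not change the folder's position
mv "$project_root/03.analysis" "$project_root/03.scripts"

# The renamed folder is still listed third
ls "$project_root"
```

Zero-padding (01, 02, ...) matters once there are more than nine folders, because 10.something would otherwise sort before 2.something.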

1.2 The README user guide file

Adding a small text file to some folders can help to disambiguate file naming conventions, provide guidance for future users, etc.

For example, you could have a “01.Experiments/Molecular_Biology/” folder with a README.txt file at the top of this folder.
Raw data need a README.txt file in the same folder with some human-readable explanation. This README.txt file should for instance contain:

  1. Lab notebook comments such as “This experiment will determine whether a loss-of-function of the gene EATME of lettuce (Lactuca sativa) has an effect on the fresh weight of first stage caterpillar larvae from the Manduca sexta (tobacco moth) species after 14 days. etc.”.
  2. A data dictionary for the column names in your spreadsheets (e.g. “weight: weight of the larvae in mg”) that explains the meaning of each column and states measurement units and anything else that could be ambiguous.
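A README.txt of this kind could be written directly from the shell, as in the sketch below. The folder name, experiment description and "weight" entry are taken from the text above; the section titles ("Purpose", "Data dictionary") and the temporary location are assumptions made for illustration.

```shell
# Throwaway copy of the example folder from the lesson
experiment_dir="$(mktemp -d)/01.Experiments/Molecular_Biology"
mkdir -p "$experiment_dir"

# Write the README with a quoted here-document so the text is kept verbatim.
# The two section titles are hypothetical; use whatever layout suits your group.
cat > "$experiment_dir/README.txt" <<'EOF'
Purpose
-------
This experiment will determine whether a loss-of-function of the gene EATME
of lettuce (Lactuca sativa) has an effect on the fresh weight of first stage
caterpillar larvae from Manduca sexta after 14 days.

Data dictionary
---------------
weight: weight of the larvae in mg
EOF

# Display the result
cat "$experiment_dir/README.txt"
```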

1.3 A real-life example

This is an example taken from a research group.

Example folder file organisation

Discussion

Do you see missing information that could go into its own folder?
Do you find it easy to understand what is inside each folder at a glance?

2. File naming

2.1 Consistency and convention

When starting a new research project (e.g. your PhD), one of the most profitable RDM practices is to implement a consistent and understandable folder and file naming structure.

The same scheme should be applied to all data of the same type. Think about an experiment where you collect samples from a field site and want to record information on:

2.2 Good practices

  1. Date formatting is perhaps both the most important and the easiest convention to set. If you use the YYYYMMDD format, sorting your files by name also sorts them by date.
  2. Avoid special characters and spaces in file names: do not use spaces or characters like “?@{[<>” to separate the elements of a file name. Some software programs (e.g. R) do not work well with file names containing these characters. Use dashes and underscores instead.
  3. Find a consistent naming scheme and stick to it: for instance, for images recorded with a stereomicroscope, a naming scheme could be:
    • YYYYMMDD_genotype1_treated_rep1_original.tiff for original images.
    • YYYYMMDD_genotype1_treated_rep1_modified.tiff for modified/cropped images.
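The practices above can be sketched in a few lines of shell. The file name follows the stereomicroscope scheme from the list; is_safe_name is a hypothetical helper written for this example, and the set of allowed characters (letters, digits, dot, dash, underscore) is one possible reading of the advice, not a fixed rule.

```shell
# Build a file name that starts with the date in YYYYMMDD format
acquisition_date="$(date +%Y%m%d)"
file_name="${acquisition_date}_genotype1_treated_rep1_original.tiff"
echo "$file_name"

# is_safe_name (hypothetical helper): succeeds only if the name is non-empty
# and made exclusively of letters, digits, dots, dashes and underscores
is_safe_name() {
    case "$1" in
        ""|*[!A-Za-z0-9._-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# A name following the scheme passes the check
is_safe_name "$file_name" && echo "ok: $file_name"

# A name with a space and a question mark is flagged for renaming
is_safe_name "my image?.tiff" || echo "rename: my image?.tiff"
```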


3. Hosting publications and scripts on a git hosting service (e.g. GitHub)

We often think of a scientific publication as the holy grail of a scientist’s work. While this is particularly true in academic environments where scientific prestige and reputation are directly tied to the journals you publish in and the citations you receive, the pièce de résistance (main content) lies in the collection of files underlying your research publication:

3.1 Host your publication on a social network of code (e.g. GitHub)

3.2 Set up a paper repository

In version control hosting service jargon, a “repository” is a place to put your precious publication-related data. Creating one is the first step. This is what it could look like in the end:

Hosting a publication on GitHub

You can add a tag to a certain version of the paper such as “version submitted to Nature journal on December 4, 2020”. The tag points to a specific commit, making it easy to trace back what was submitted, when, and who had contributed to the paper at that time.
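A minimal sketch of such a tag, assuming git is installed: the repository location, file name, author identity and tag name (submitted-2020-12-04) are all hypothetical, and only the tag message comes from the text above.

```shell
# Create a throwaway repository to stand in for the paper repository
repo_dir="$(mktemp -d)/paper"
git init --quiet "$repo_dir"
cd "$repo_dir"

# Placeholder identity, stored in this repository only
git config user.name "Example Author"
git config user.email "author@example.org"

# Commit a first version of the manuscript (placeholder file)
echo "Draft of the manuscript" > manuscript.txt
git add manuscript.txt
git commit --quiet -m "Add manuscript draft"

# Annotated tag marking the submitted version
git tag -a submitted-2020-12-04 \
    -m "Version submitted to Nature journal on December 4, 2020"

# List tags with their message, then show the commit the tag points to
git tag -n1
git rev-list -n 1 submitted-2020-12-04
```

An annotated tag (-a) records who created it and when, which fits the goal of tracing what was submitted and at what time.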

3.3 Display contributions and paper history

In combination with git, GitHub will act as a “time machine” where all changes to files and figures are recorded. This makes it easy to undo things if necessary or to point to a specific change in history.

commit history

GitHub lists the number of contributions of each author. This can also be a rational way to see who contributed the most to a publication, for instance.

Discussion

How does this version

GitHub view of contributors in a given repository

As you can see from the number of commits, two authors have contributed relatively equally to this publication. Yet, one commit can be rather small (changing one file and committing) or result from more extensive work (creating a figure, adding the related script, etc.). This depends on the “commit behaviour” of each person.
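The same information is available locally from git itself. The sketch below builds a small demonstration repository, then prints the commit history and the number of commits per author. The author names, file names and commit messages are hypothetical, and commit_as is a helper written for this example.

```shell
# Throwaway repository with a few commits from two hypothetical authors
repo_dir="$(mktemp -d)/paper"
git init --quiet "$repo_dir"
cd "$repo_dir"

# commit_as NAME EMAIL FILE MESSAGE (hypothetical helper):
# appends MESSAGE to FILE and commits the change under the given identity
commit_as() {
    echo "$4" >> "$3"
    git add "$3"
    git -c user.name="$1" -c user.email="$2" commit --quiet -m "$4"
}

commit_as "Author One" "one@example.org" manuscript.txt "Add introduction"
commit_as "Author Two" "two@example.org" figure1.R "Add script for figure 1"
commit_as "Author One" "one@example.org" manuscript.txt "Revise introduction"

# One line per commit, most recent first
git log --oneline

# Number of commits per author, sorted by count
git shortlog -s -n HEAD
```

As noted above, a commit count says nothing about the size of each commit, so these numbers are a rough indication at best.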

Discussion

What do you think are some pros and cons for hosting your publication on a git hosting service? Can you think of short-term and long-term benefits for your research?

One drawback of using a version control system together with a hosting service is that it is not meant to store big data files. Exclude these files by listing the directories that contain them in a .gitignore file.
Assuming your big raw data files are in a directory called raw_data/, you could type the following command in your favorite shell terminal, from the root of your repository, to ignore them:

echo "raw_data/" >> .gitignore
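Running the command above twice would add the same line twice. A slightly more defensive sketch, assuming git is installed, appends the entry only if it is missing and then asks git to confirm that a file in raw_data/ is ignored. The repository location and the file run1.csv are hypothetical.

```shell
# Throwaway repository containing a raw_data/ directory
repo_dir="$(mktemp -d)/paper"
git init --quiet "$repo_dir"
cd "$repo_dir"
mkdir raw_data
echo "stand-in for a large measurement file" > raw_data/run1.csv

# Append the entry only when .gitignore does not already list it
grep -qxF "raw_data/" .gitignore 2>/dev/null || echo "raw_data/" >> .gitignore

# Show which .gitignore rule matches the file
git check-ignore -v raw_data/run1.csv

# raw_data/ no longer appears among the untracked files
git status --short
```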

4. Software Carpentry episode on data management

Some simple tips from the Software Carpentry Data Management lesson:

5. Resources

5.1 Blog post

5.2 Tutorials

Key Points

  • .