Analysing Data
Overview
Teaching: 30 min
Exercises: 30 min
Questions
What happens to my data when I start analysing them?
What is the difference between raw and processed data?
How can I ensure the traceability of my results with regard to raw and processed data?
What is a scientific workflow manager?
Objectives
Find out where to process large datasets and store research data.
Find out how to store data to be archived.
Find out how to analyse a dataset using an HPC environment.
Table of Contents
- Overview
- What you will learn
- Tools To Connect
- Connecting
- Transferring Data
- Analysing Data/Running a Job
What you will learn
- Where to process large datasets and store research data
- How does the current high-performance compute environment (HPCe) job system work?
- Where to store data to be archived
- What is a tape archiving system?
- How can I store my data on tape?
Tools To Connect
The tools used for uploading, accessing, and transferring data are common to both data processing and archive management.
Some differ depending on the operating system.
Windows
- The SSH client MobaXterm is recommended if you don’t already have a preferred method.
- To use a Windows 10 machine as if it were Linux: Windows Subsystem for Linux (WSL). Details on setting this up can be found here. THIS IS NOT A NECESSARY STEP; the SSH client above can be used instead.
MacOS/Linux
- Terminal is built in on both Mac and Linux machines and can be used as an SSH client.
Files can also be transferred between the data processing and archiving systems using GUI file managers like Cyberduck.
Connecting
Using command line
Open your preferred SSH client and type the relevant command:
For Genseq
ssh -X USER@genseq-cn02.science.uva.nl
For Crunchomics
ssh -X USER@omics-h0.science.uva.nl
For SURF Data Archive
ssh -X USER@archive.surfsara.nl
Using a GUI
Cyberduck is the recommended GUI for file transfer with the Data Archive. See the Cyberduck website for more information, or SURF’s documentation for some tips.
Transferring Data
Using command line
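Files can be copied between your local machine and the servers with scp or rsync, which are available in any SSH-capable terminal. A minimal sketch, assuming you are copying to your Crunchomics home directory (the file and directory names are placeholders):
# Copy a single file from your local machine to Crunchomics
scp results.txt USER@omics-h0.science.uva.nl:~/
# Copy a whole directory; rsync can resume interrupted transfers
rsync -avP my_project/ USER@omics-h0.science.uva.nl:~/my_project/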
Using a GUI
When using the GUI for file transfer between the servers, be sure to open two separate windows with separate connections.
Analysing Data/Running a Job
The transition from Genseq to Crunchomics means a more regulated compute environment. This benefits everyone, as jobs are scheduled with explicit constraints on time and computing power. You therefore need to estimate how long your job will run and how many nodes it needs. Crunchomics uses Slurm, the “Simple Linux Utility for Resource Management”.
Slurm
Slurm is a resource manager and job scheduler: it is used to allocate resources within the cluster and to launch and manage jobs. It has built-in knowledge about nodes, sockets, cores, and hyperthreads (network topology). When there is more work than resources, Slurm will consider the following:
- Network topology
- Fair share scheduling
- Advanced reservations
More in depth information can be found here.
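To see the cluster from Slurm’s point of view, you can, for example, list its partitions and nodes (the output will vary per cluster):
sinfo                  # partitions, node states, and time limits
scontrol show node     # detailed per-node information (CPUs, memory)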
Key points to run a job:
You typically need a job file, which is essentially a list of the (shell) commands you want to run. This file is put in the queue; if sufficient resources are available, the job will start right away.
Example Job File:
#!/bin/bash
#
#SBATCH --ntasks=1           # run a single task (one CPU)
#SBATCH --time=10:00         # maximum run time: 10 minutes
#SBATCH --mem-per-cpu=100    # memory per CPU, in MB
#
command -options -file -arguments -etc.
This file should have the file extension .sh
To submit the job, use the command:
sbatch jobfile.sh
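sbatch replies with a job ID, which you can use to follow or cancel the job with standard Slurm commands (the job ID 12345 below is a placeholder):
squeue -u $USER    # list your pending and running jobs
scancel 12345      # cancel a job by its ID
sacct -j 12345     # accounting details, if job accounting is enabled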
Importantly, you should estimate how long your job will take on a given number of nodes and include this in your job file. The example job file above specifies one CPU for 10 minutes and 100 MB of RAM.
This is of course a very small amount of processing power and time. For the moment there are no limits set by Crunchomics on power and time apart from those of the system itself: 64 CPUs (or cores) and 512 GB of memory. However, it is recommended to use a maximum of 8 cores per job, because not all parallel processes scale linearly; for example, alignment using hisat2 on 32 cores is not four times faster than on 8 cores. The fewer cores you request, the sooner your job is likely to start: 8 cores are freed up more often than 32!
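For example, a job file requesting this recommended maximum of 8 cores could look as follows; the time and memory values and the hisat2 command line are placeholder estimates you should adapt to your own job:
#!/bin/bash
#
#SBATCH --job-name=alignment    # label shown in the queue
#SBATCH --cpus-per-task=8       # the recommended maximum of 8 cores
#SBATCH --time=02:00:00         # estimated run time: 2 hours
#SBATCH --mem=16G               # estimated total memory
#
hisat2 -p 8 -x genome_index -U reads.fastq -S output.sam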
For step-by-step tutorials on job scheduling, see here and here.
Slurm and Snakemake
Snakemake is a commonly used workflow manager for data analysis pipelines. If you have pre-existing pipelines made for a non-Slurm system (i.e. Genseq), these can still be run using the snakemake command, and no adaptations to the original files are necessary. What you need to do is create an additional JSON file specifying the Slurm requirements for each rule.
To run the pipeline, it is still necessary to activate the required Python environment.
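A minimal sketch of such a run, assuming a Snakemake version that supports the --cluster-config option; the environment name, file names, and resource values are all placeholders:
# activate the Python environment that provides snakemake and your tools
conda activate my-pipeline-env
# cluster.json maps each rule to Slurm resources, e.g.:
# { "__default__": { "ntasks": 1, "time": "10:00", "mem": 100 } }
snakemake --jobs 8 \
    --cluster-config cluster.json \
    --cluster "sbatch --ntasks {cluster.ntasks} --time {cluster.time} --mem-per-cpu {cluster.mem}"
Each rule is then submitted to Slurm as its own job, with its resources looked up in cluster.json; rules without their own entry fall back to __default__.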
References
https://slurm.schedmd.com/faq.html#steps
Teaching materials
This lesson was originally formatted according to the Carpentries Foundation lesson template, following their recommendations on how to teach researchers good practices in programming and data analysis.
This material has been collected from the various links as well as communications with the Crunchomics team.
Key Points
A workflow (e.g. an R script) links raw data to processed data.
A workflow is essential for keeping track of the steps a dataset has undergone.