This lesson is still being designed and assembled (Pre-Alpha version)

Data archiving

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • What is data archiving?

  • What is the SURFsara Data Archive?

  • What is a tape archive?

  • How can I connect and upload data to the SURF Data Archive?

Objectives
  • Find out what to do with non-research data and where to back up data.

  • Be able to use the command line to connect to the SURF data archive.

  • Transfer data between your machine and the archive.

1. Introduction

1.1 Backup and archiving

Storing data on various hard drives scattered around the university or at home is not good practice. Hard drives can be physically destroyed if they fall, and they can be lost or stolen.

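Data archiving should not be confused with data backup: a backup is a copy kept so that data in active use can be restored after a loss, whereas an archive preserves data for the long term once it is no longer needed day to day.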

1.2 Long-term data preservation becomes more crucial

Due to the increasing volume of data being generated in research (e.g. DNA sequencing or imaging), it has become more complex to store large and heterogeneous datasets easily. Yet the ability to re-use research data, rather than automatically generating new data, is also becoming more important as techniques that aggregate numerous datasets become commonplace. One example is machine learning, which requires large amounts of well-curated and labeled data in order to generate meaningful results.

In addition, research funding agencies and universities increasingly require that research data are stored for 3 to 10+ years, as can be seen from the requirements of several national research agencies.

2. Tape Archive

When it comes to data archiving, storing data on hard drives can quickly become expensive. One of the reasons is that hard drives need to be electrically powered in order to be readable. Current estimates are €120 per terabyte per year (source: SURFsara, 2019). At that rate, keeping 10 TB for 10 years would cost about €12,000.

Short-term versus long-term storage

Data volumes have increased significantly. Data on tape needs to be staged before it can be read, which means slower access to the data.

Hard disk drive versus tape drive

Hard disk drive | Magnetic tape drive
Short-term storage | Long-term storage (no electricity required)
Maximum storage capacity per unit: 100 TB (SSD, 2018) | Maximum storage capacity per unit: 45 TB (LTO9, 2018)
Instantaneous data access | Sequential access: data needs to be staged on disk first

2. The SURFsara Data Archive

SURF is defined as:

“a cooperative association of Dutch educational and research institutions in which its members join forces. The members are the owners of SURF.”

Source: SURF website

The SURF organisation runs a data archive service.

Data on the archive is stored in two physical locations, which is much more secure than keeping a single copy. The data can be accessed from any machine with an internet connection and login credentials.

The SURFsara data archive is housed at two sites: one in Almere and one in Amsterdam, inside the Amsterdam Data Tower, a 72 m tall building with 13 floors (see below).

Figure: The Amsterdam Data Tower


Access to the data center is controlled by ID and fingerprint checks at the entrance.

Figure: Identity check to enter the servers within the Amsterdam Data Tower


SURFsara rents space inside the tower to host its Tape Archive service. In the picture, you can see the magnetic tapes on the different rows. When data is requested, a robot reads the tapes and stages them on a hard disk, which allows the data to be read.

Figure: The SURF tape archiver

3. How to archive data with SURF data archive

3.1 Tools To Connect

The same tools are used for uploading, accessing and transferring data, whether you are processing data or managing archived data.

Some of them differ depending on your operating system.

Cross-platform tools with a GUI: Cyberduck and FileZilla

Some file managers have a Graphical User Interface (GUI) and can be installed on both Windows and Mac OS X operating systems. Files can be transferred across the data processing/archiving systems using a GUI file manager like Cyberduck or FileZilla.

Warning: FileZilla is not yet suitable for cross-server transfer.

Windows specific (choose one option):

MacOS/Linux

3.2 Connecting to SURF data archive

Using a GUI: Cyberduck is the recommended GUI for file transfer with the SURF Data Archive. See the Cyberduck website for more information, or the tips provided by SURF.

Using command line/ssh client: If you use an ssh client on Windows, you can enter the required connection details in the same way as for the GUI option.

Otherwise, open your preferred method for ssh connections (for example a terminal) and type the following, replacing USER with your username:

ssh USER@archive.surfsara.nl

Then input your password when prompted.
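If you connect often, an entry in your ssh configuration can save some typing. The sketch below writes such an entry to a scratch file so that it is safe to run as-is; in practice the same lines would go into ~/.ssh/config. The alias surf-archive is a hypothetical name chosen for illustration, and USER stands for your own username.

```shell
# Assumption: a scratch file is used here; the real location would be ~/.ssh/config
config_file="$(mktemp)"

# "surf-archive" is a hypothetical alias; USER is a placeholder for your username
cat > "$config_file" <<'EOF'
Host surf-archive
    HostName archive.surfsara.nl
    User USER
EOF

# Show what was written
cat "$config_file"
```

With these lines in ~/.ssh/config, typing ssh surf-archive would be equivalent to ssh USER@archive.surfsara.nl.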

Transferring data

When transferring data, remember that the tape system is not instant and that your files will be stored distributed across the system. For this reason, if you want to store lots of small files, they need to be collected together into a tar archive beforehand.

How to create a tar archive

Non-Command Line Interface

Windows has no built-in tarballing. 7-Zip is currently recommended to add this functionality. A tutorial on how to use it can be found here. A 7-Zip equivalent is also available for Mac, see the 7-Zip link.

Mac/Linux/WSL, using the terminal:

tar -cvf NEWFILENAME.tar path/to/originalfile

More details on tar can be found here
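As a minimal sketch of the bundling step, the commands below build a scratch directory containing a few small files, pack it into a single tar archive and then list the archive to check what went in. The directory name my_dataset and the file names are hypothetical placeholders.

```shell
# Work in a scratch directory so none of your real data is touched
workdir="$(mktemp -d)"
cd "$workdir"

# Hypothetical dataset made of several small files
mkdir my_dataset
for i in 1 2 3; do
    echo "sample $i" > "my_dataset/sample_$i.txt"
done

# -c create an archive, -v print each file as it is added, -f name of the archive file
tar -cvf my_dataset.tar my_dataset

# -t lists the contents without extracting: a quick check before uploading
tar -tf my_dataset.tar
```

Pointing tar at a directory rather than at individual files keeps the folder structure inside the archive, so the files unpack into one folder again later.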

Using a GUI

When using Cyberduck for file transfer between servers, be sure to open two separate windows with separate connections. This is the easiest method of data transfer, but also the slowest, as there is no way to stage your data before making the transfer.

Using command line

At the command line there are a number of different ways to transfer data. SURF currently recommends rclone. High-performance transfer is available but needs to be set up on your local machine/server, info here.

Easy method: rsync. Rsync is already installed. Currently you must log into Crunchomics and “push” to the archive using the following commands:

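# Start a screen session so the transfer keeps running if the terminal closes
screen
# Push FILES to the shared folder on the archive (-a already implies -r)
rsync -avz FILES USERNAME@archive.surfsara.nl:/nfs/archive02/whiteprj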

Replace FILES and USERNAME with your own details, and make sure the file is already a tar archive. The screen command is needed for larger files unless you want to keep the terminal open for the whole transfer. Rsync will ask for your archive password and then sync the file to the shared folder (called whiteprj).
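Because a long transfer can be interrupted, it may help to record a checksum of the tar archive before pushing it, so that a retrieved copy can be verified later. This is a sketch rather than a step prescribed by SURF. The file name my_dataset.tar is a placeholder, and sha256sum is assumed to be available (on macOS, shasum -a 256 plays the same role).

```shell
workdir="$(mktemp -d)"
cd "$workdir"

# Placeholder archive standing in for your real tar file
echo "example content" > data.txt
tar -cf my_dataset.tar data.txt

# Record the checksum in a small sidecar file next to the archive
sha256sum my_dataset.tar > my_dataset.tar.sha256

# Later, with both files in the same folder, re-compute the checksum and compare
sha256sum -c my_dataset.tar.sha256
```

The .sha256 file is tiny and can be pushed together with the archive by listing both after the rsync options.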

Please ensure adequate metadata is provided in your tar archive.
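One way to do this, sketched below, is to place a short README file inside the directory before creating the tar archive. The fields shown and all names are hypothetical examples; use whatever your group or data manager requires.

```shell
workdir="$(mktemp -d)"
cd "$workdir"

# Placeholder dataset
mkdir my_dataset
echo "sample 1" > my_dataset/sample_1.txt

# Hypothetical metadata fields; adapt them to your group's conventions
cat > my_dataset/README.txt <<EOF
Owner:       YOUR NAME
Project:     PROJECT NAME
Archived on: $(date +%Y-%m-%d)
Description: what the files contain and how they were generated
EOF

# The README travels inside the archive together with the data
tar -cf my_dataset.tar my_dataset
tar -tf my_dataset.tar
```

Keeping the description inside the archive means it cannot get separated from the data once the tar file is on tape.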

Archive Internal Data Transfer

Transferring data between members of the group can be done using Cyberduck. There is a shared folder for the group, which all group members can access through Cyberduck. To give someone else access to your data, copy or deposit it in this folder. The data manager can then make sure it is transferred to the correct account and delete it from the shared folder. It is recommended to bookmark this folder within Cyberduck, otherwise finding it is a bit tedious. In this example the shared folder is called whiteprj.

Key Points

  • The SURF data archive is to be used for larger files or a tar collection of smaller files, and not for day-to-day access.