Arrays in Bash

Basics

In Bash, an array is a collection of values stored under a single variable name.

To illustrate how arrays are used, assume that we are working with three protein FASTA files and want to perform an action on all of them, for example concatenating them. To do this, we can store the names of the three files in an array and then pass the array to the command that concatenates them into a single file.

We begin by creating three dummy files, each containing only a FASTA header line as a stand-in for an imaginary protein sequence:

# Create dummy files
echo ">protein1" > s1.faa
echo ">protein2" > s2.faa
echo ">protein3" > s3.faa

Next, we store the names of these three files in an array:

# Declare an empty array (the parentheses tell Bash that we want to create an array)
FASTA_FILES=()

# Append each individual file to the array using `+=`
for ID in s1 s2 s3; do
    FASTA_FILES+=("${ID}.faa")
done

Note that the above is equivalent to the code below; the for-loop is simply more concise. If you have not used for-loops before, have a look at this tutorial.

FASTA_FILES=() 

# Append the first file 
FASTA_FILES+=(s1.faa)

# Append the second file 
FASTA_FILES+=(s2.faa)

# Append the third file 
FASTA_FILES+=(s3.faa)

Now that we have generated the array, we might want to print its contents. There are several ways to do this, as shown below:

# Show what is in the array
## `[0]` accesses the first element in the array
## Bash arrays are 0-indexed, which is why the first element has index 0, not 1
echo "First file: ${FASTA_FILES[0]}"

## `[@]` accesses all elements in the array
echo "All files: ${FASTA_FILES[@]}"

## Adding # before the array name inside ${} (together with [@]) returns the number of elements in the array
echo "Number of files: ${#FASTA_FILES[@]}"

These commands return:

First file: s1.faa
All files: s1.faa s2.faa s3.faa
Number of files: 3
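
Because echo joins all elements with spaces, it can be hard to see where one element ends and the next begins. As a minimal sketch, assuming the same three file names as above, the array can also be printed one element per line with printf, or walked element by element with a for-loop:

```bash
# Rebuild the example array so that this block runs on its own
FASTA_FILES=(s1.faa s2.faa s3.faa)

# Print one element per line; printf reuses the format for every argument
printf '%s\n' "${FASTA_FILES[@]}"

# Loop over the elements; the double quotes keep each element intact
for FILE in "${FASTA_FILES[@]}"; do
    echo "Found: ${FILE}"
done
```

The loop variable name FILE is only chosen for illustration. The one-per-line form is often easier to scan when the array holds many files.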

Next, we might want to use the array in a command that takes several files as input. Note that the command below is equivalent to cat s1.faa s2.faa s3.faa:

# Use the array
cat "${FASTA_FILES[@]}"
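
The goal stated at the beginning was to combine the files into a single file. A minimal sketch of that step, assuming a hypothetical output name combined.faa (not part of the original example), redirects the output of cat into a file and then counts the header lines as a quick sanity check:

```bash
# Recreate the dummy files and the array so that this block runs on its own
echo ">protein1" > s1.faa
echo ">protein2" > s2.faa
echo ">protein3" > s3.faa
FASTA_FILES=(s1.faa s2.faa s3.faa)

# Concatenate all files in the array into one file
# (combined.faa is a hypothetical output name)
cat "${FASTA_FILES[@]}" > combined.faa

# Count the sequences: every FASTA record starts with ">"
N_SEQS=$(grep -c "^>" combined.faa)
echo "Sequences in combined.faa: ${N_SEQS}"
```

With the three dummy files, the count should be 3, one header per input file.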

Common bugs

When using arrays, you need to be very precise with the syntax. Just changing += to = introduces a serious bug that can be hard to detect when working with large files, because the code still runs without an error but produces an incorrect result:

FASTA_FILES=()

# A common bug: files are not appended
for ID in s1 s2 s3; do
    FASTA_FILES=("${ID}.faa") # = replaces the array every iteration instead of appending
done

# Bash does not output an error here - it silently uses only the last value
echo "Files that will be used: ${FASTA_FILES[@]}"
cat "${FASTA_FILES[@]}"

Which produces:

Files that will be used: s3.faa
>protein3

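Similarly, always use parentheses ( ) when declaring an array and when appending to it. Without them, Bash treats the value as a single string rather than as a separate array element, which can lead to unexpected behavior.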
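
To make the effect of missing parentheses concrete, the sketch below repeats the loop from the beginning of this section but leaves out the parentheses after +=. It is meant as an illustration of the bug, not as code to reuse:

```bash
FASTA_FILES=()

# Bug: without parentheses, += appends text to element 0
# instead of adding a new element to the array
for ID in s1 s2 s3; do
    FASTA_FILES+="${ID}.faa"
done

echo "Number of files: ${#FASTA_FILES[@]}"
echo "First file: ${FASTA_FILES[0]}"
```

Run as written, this should report 1 file with the name s1.faas2.faas3.faa, because all three names were glued together into a single element. As with the = bug, Bash does not raise an error at this point; the problem only becomes visible once a command such as cat cannot find a file with that name.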

Recommendations

We have seen how easily an incorrectly used array can produce a wrong result. It is therefore recommended to:

  1. Always verify your code. For example, run echo "${FASTA_FILES[@]}" to check that the array contains the entries you expect.
  2. Keep it simple. An array can often be replaced with a simpler solution that is also easier to debug. For the example here, a glob (a Bash pattern that matches multiple file names at once) is the recommended alternative:
# Check all files are captured by the glob 
ls s*.faa

# Combine the files
cat s*.faa

The benefit of the code above is that the glob is expanded from the files that actually exist in the folder, so there is no append step that can silently drop files the way the array loop can.
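
If an array is used anyway, the first recommendation can be made more systematic than a visual check with echo. The sketch below is one possible way to do this, assuming that we know in advance how many files to expect; the names EXPECTED_N and CHECK are hypothetical and only used for illustration:

```bash
# Recreate the dummy files and the array so that this block runs on its own
echo ">protein1" > s1.faa
echo ">protein2" > s2.faa
echo ">protein3" > s3.faa

FASTA_FILES=()
for ID in s1 s2 s3; do
    FASTA_FILES+=("${ID}.faa")
done

EXPECTED_N=3   # hypothetical: the number of files we expect
CHECK="ok"

# Check 1: does the array hold the expected number of elements?
if [ "${#FASTA_FILES[@]}" -ne "${EXPECTED_N}" ]; then
    echo "Expected ${EXPECTED_N} files, found ${#FASTA_FILES[@]}" >&2
    CHECK="failed"
fi

# Check 2: does every element point to an existing, non-empty file?
for FILE in "${FASTA_FILES[@]}"; do
    if [ ! -s "${FILE}" ]; then
        echo "Missing or empty file: ${FILE}" >&2
        CHECK="failed"
    fi
done

echo "Array check: ${CHECK}"
```

With the = bug from the previous section, the first check would fail because the array would hold one element instead of three.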

A note on GenAI-generated code: Arrays are a pattern that GenAI tools frequently add when code contains loops. That is because, in production-level code, arrays are useful when files do not share a common path or naming pattern. In bioinformatics workflows, however, input files are almost always generated by a previous step in a task-specific folder. This means that a glob should be sufficient in most cases. If you encounter an array in GenAI-generated code, always check whether a glob can be used instead.
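
One way to check whether a glob can replace an array is to run both versions and compare their output. The sketch below does this for the example files; the output names from_array.txt and from_glob.txt are hypothetical and were chosen so that they do not match the s*.faa pattern themselves:

```bash
# Recreate the dummy files so that this block runs on its own
echo ">protein1" > s1.faa
echo ">protein2" > s2.faa
echo ">protein3" > s3.faa

# Array version
FASTA_FILES=()
for ID in s1 s2 s3; do
    FASTA_FILES+=("${ID}.faa")
done
cat "${FASTA_FILES[@]}" > from_array.txt

# Glob version
cat s*.faa > from_glob.txt

# cmp -s prints nothing and succeeds only if both files are identical
if cmp -s from_array.txt from_glob.txt; then
    echo "Array and glob give the same result"
else
    echo "Array and glob differ" >&2
fi
```

If both versions give the same result, the glob is usually the easier one to keep. If they differ, it is worth finding out which files the glob matched in addition to, or instead of, the ones listed in the array.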