For-loops in bash

What are we talking about?

A for loop in Bash is a control structure that allows you to execute a sequence of commands repeatedly. It’s particularly useful when you need to perform an operation a fixed number of times or iterate over a list of items.

In its simplest form, a for loop in Bash has the following syntax:

for item in list
do
    # commands to execute for each item
done

Here’s what each part of the for loop means:

  • for: This keyword indicates the start of the loop.
  • item: This is a variable that will hold each item from the list in each iteration of the loop.
  • in: This keyword separates the variable name from the list of items.
  • list: This is a list of items over which the loop will iterate. It can be specified explicitly (e.g., 1 2 3), through a wildcard pattern (e.g., *.faa to loop over matching files in a directory), or through a command substitution (e.g., $(cat list) to loop over names stored in a file).
  • do: This keyword indicates the start of the block of commands to execute in each iteration.
  • done: This keyword marks the end of the loop.

During each iteration of the loop, the variable item takes on the value of the next item in the list, and the commands within the loop are executed with this value. The loop continues until all items in the list have been processed.
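To see the moving parts in action, here is a minimal sketch that loops over an explicit list. The three words and the counter variable count are arbitrary choices made for this illustration.

```shell
# Loop over an explicit list of three (arbitrary) words
count=0
for item in alpha beta gamma
do
    # $item holds the current word; $count tracks how many we have seen
    echo "Current item: $item"
    count=$((count + 1))
done
echo "Processed $count items"
```

After the loop has finished, item still holds the last value from the list.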

A for loop to get started

Let’s delve into a basic example to kickstart your understanding of for loops in Bash. Imagine the task is to print numbers from 01 to 10, each on a separate line. Here’s how you can achieve this with a for loop:

for i in {01..10};
do
echo $i
done

Here:

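  • {01..10} is a Bash brace expansion. Brace expansion is a feature in Bash that generates strings based on a specified pattern. In the case of {01..10}, it expands to 01 02 03 04 05 06 07 08 09 10. The for loop itself then iterates over each of these values. Note that the zero-padded form needs Bash 4 or newer; older versions, such as the Bash 3.2 that ships with macOS, print the numbers without the leading zeros.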
  • echo $i: This command prints the value of the loop variable i to the terminal.

We can add some spaces to enhance readability. Although this formatting is optional, it’s a good practice to improve code clarity. Personally, I prefer using a tab (or 4 spaces) for indentation, but feel free to adopt your preferred style.

for i in {01..10};
do
    echo $i
done

You might also see something like this:

for i in {01..10};  do  echo $i; done

This is the same code as above, condensed into a single line. For more complex loops, it’s useful to spread the code over multiple lines for readability, but you might encounter this syntax in the wild as well.

Preparing some example files

Let’s explore a more practical example that you might encounter in biology, such as working with genome files and modifying their headers. First, let’s download some example genome files from the NCBI database:

mkdir data 

#download some example genomes from NCBI
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/008/085/GCA_000008085.1_ASM808v1/GCA_000008085.1_ASM808v1_protein.faa.gz  -P data
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/945/GCA_000017945.1_ASM1794v1/GCA_000017945.1_ASM1794v1_protein.faa.gz  -P data
gzip -d data/*gz

#view the header of one file
head data/GCA_000017945.1_ASM1794v1_protein.faa 

Upon opening the file, you’ll notice that the header looks something like this: >ABU81185.1 translation initiation factor aIF-2 [Ignicoccus hospitalis KIN4/I]. The header contains unnecessary characters, such as spaces, which can potentially disrupt downstream analyses. We can remedy this using the cut command.

cut -f1 -d " " data/GCA_000017945.1_ASM1794v1_protein.faa | head

Running this command shortens the header to >ABU81185.1.
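If you want to try the cut step without downloading anything, the following sketch runs it on a tiny temporary file. The header is the one quoted above; the sequence line is made up for illustration.

```shell
# Write a small FASTA record to a temporary file (the sequence is made up)
tmp_fasta=$(mktemp)
printf '%s\n' \
    '>ABU81185.1 translation initiation factor aIF-2 [Ignicoccus hospitalis KIN4/I]' \
    'MKKLLELKALNS' > "$tmp_fasta"

# Keep only the first space-delimited field of every line
shortened=$(cut -f1 -d " " "$tmp_fasta")
echo "$shortened"

# Clean up the temporary file
rm -f "$tmp_fasta"
```

Lines that do not contain the delimiter, such as the sequence line, are passed through unchanged, which is why cut can safely be applied to the whole file.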

In bioinformatics, it can be useful to modify your files this way, since downstream processes, such as functional annotation, might not run successfully if there are special symbols (e.g. spaces) in the fasta header or if the fasta header is too long. Generally, the following is recommended:

  • Make sure the file header is concise, does not contain ANY spaces, and ideally uses a ‘-’ (or any other unique delimiter) to separate the genome ID from the protein ID. Also avoid any unusual symbols, such as |, (, ), {, }…
  • If you work with more than one genome, include not only the protein ID but also the genome ID in the sequence header.
  • If the header is concise and contains the genome (or bin) ID, it is easy to concatenate the protein sequences of your genomes into one single file and still know which genome each sequence originally came from.
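The tutorial itself only shortens the header, but one possible way to also prepend the genome ID is sketched below. The use of sed here is an assumption of this sketch, not part of the original workflow, and the variable names genome_id, header and new_header are chosen for illustration.

```shell
# Genome ID and header taken from the example file used in this tutorial
genome_id="GCA_000017945.1"
header=">ABU81185.1 translation initiation factor aIF-2 [Ignicoccus hospitalis KIN4/I]"

# 1) cut keeps everything before the first space
# 2) sed inserts "<genome_id>-" right after the leading ">"
new_header=$(printf '%s\n' "$header" | cut -f1 -d " " | sed "s/^>/>${genome_id}-/")
echo "$new_header"
```

This prints >GCA_000017945.1-ABU81185.1, so the genome and the protein ID can later be separated again at the ‘-’.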

Now, let’s learn how to apply this operation to multiple files, ranging from 2 to 100 (or more), using a loop.

Building the core structure

Let’s lay the groundwork for our loop by starting with the most basic command: printing a list of the files that we want to iterate over. Once we confirm that this works, we’ll gradually build up the loop step by step.

for i in data/*faa;
do
echo $i
done

When we execute this code snippet, we observe that both files, along with their relative paths, are printed:

data/GCA_000008085.1_ASM808v1_protein.faa
data/GCA_000017945.1_ASM1794v1_protein.faa

This output confirms that the loop is correctly identifying the files we want to process.

Adding the first variable

To ensure flexibility in naming the output files, let’s extract the file names using the basename command and store the output in a variable. We will see why this is useful once we have made it to the final command:

for i in data/*faa;
do
basename=$(basename $i)
echo $basename
done

Executing this code snippet should yield the following names without the relative path. This change is what later allows us to store our output files somewhere other than the data folder:

GCA_000008085.1_ASM808v1_protein.faa
GCA_000017945.1_ASM1794v1_protein.faa

Let’s delve into some new syntax we have used in this section:

  • In Bash, a variable acts as temporary storage for a string or a number. Here, we want to store a file name in the variable called basename.
  • The $(..) syntax represents command substitution, where Bash executes a command in a subshell environment and then replaces the command substitution with the standard output of the command. In other words, whatever the basename command prints gets stored in the basename variable.
  • Defining variables and the use of quotes:
    • variable_name="xxx": This assigns the string xxx to the variable variable_name. The quotes are required as soon as the string contains spaces or other special characters.
    • variable_name=xxx: This also assigns the literal string xxx to the variable variable_name. Even if xxx happens to be the name of a command, the command is not run. To store the output of a command, wrap it in command substitution: variable_name=$(xxx).
    • In our example, we use command substitution to assign the output of the basename command to the basename variable. Quotes around the $(..) are optional on the right-hand side of an assignment like this, but quoting the variable inside the command, as in $(basename "$i"), protects against file names that contain spaces.
    • It’s important NOT to add spaces between the variable name, the = sign, and the value, as Bash would otherwise interpret the variable name as a command.
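The different kinds of assignment can be compared directly with a short sketch. The variable names and the example path are hypothetical and only serve this illustration.

```shell
# A bare word is stored as a literal string; the date command is NOT run
literal=date

# Quotes are needed when the value contains spaces
quoted="two words"

# Command substitution stores whatever the command prints
from_command=$(basename "data/example file.faa")

echo "$literal"
echo "$quoted"
echo "$from_command"
```

The first echo prints the word date rather than today's date, and the last one prints example file.faa, because basename strips the leading directory.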

Removing the extension from the basename

To ensure flexibility in naming the output files, let’s extract the file names AND exclude the extension by using the basename command with a slight modification:

for i in data/*faa;
do
basename=$(basename $i .faa)
echo $basename
done

Executing this code snippet should yield the following names:

GCA_000008085.1_ASM808v1_protein 
GCA_000017945.1_ASM1794v1_protein

Simplifying the basename

Now the basename looks cleaner. However, I don’t like that it is so long. When naming files, it’s generally good to keep the filenames short and precise. We can do this by combining the basename command with a pipe.

Put simply, a pipe allows us to pass the output of the basename command into the cut command to further modify the filename. Here, we use cut to shorten the basename by removing everything after the second underscore.

for i in data/*faa;
do
basename=$(basename $i .faa | cut -f1-2 -d "_")
echo $basename
done

Executing this code snippet should yield the following names:

GCA_000008085.1 
GCA_000017945.1

Additionally, let’s delve into some new syntax:

  • The cut command is used to extract sections from each line of input (or from files) and write the result to standard output.
  • In our example, -f1-2 specifies that we want to retain fields 1 to 2 from each line, and -d "_" specifies that the field delimiter is the underscore.
  • So, the cut command ensures that only the part of the basename before the second underscore is retained, effectively shortening the basename.
  • You would need to adjust this for your own file names, but hopefully this gives you an idea about how flexible the command line is
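The two steps can also be tried on a single path outside of a loop, which makes it easier to see what each command contributes. This is a small sketch; the variable names path, name and short are chosen for illustration.

```shell
# One of the file paths from this tutorial
path="data/GCA_000008085.1_ASM808v1_protein.faa"

# Step 1: strip the directory and the .faa extension
name=$(basename "$path" .faa)

# Step 2: keep only the first two underscore-separated fields
short=$(echo "$name" | cut -f1-2 -d "_")

echo "$name"
echo "$short"
```

Here the underscore splits the name into the fields GCA, 000008085.1, ASM808v1 and protein, and cut keeps the first two, joined again by the delimiter.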

Adding a variable for the output file

To facilitate the creation of output file names, let’s introduce a variable named outfile_name. We’ll use this variable to specify the path and name for the modified files:

mkdir data/renamed 

for i in data/*faa;
do
basename=$(basename $i .faa | cut -f1-2 -d "_")
outfile_name="data/renamed/${basename}.faa"
echo $outfile_name
done

This should print:

data/renamed/GCA_000008085.1.faa 
data/renamed/GCA_000017945.1.faa

Now, we can finally see how this whole exercise becomes useful. We were able to (a) simplify the file name itself by making it shorter and (b) ensure that we can store the modified file in a new directory.

Also, notice how we now use quotes to assign a string to the variable outfile_name?

Let’s also discuss a new syntax: using brackets {} around our variable:

  • We enclose variables in {} whenever they are part of a string that includes other text, as demonstrated in the line with the outfile_name variable. This is especially important when the variable is adjacent to other characters that might be confused as part of the variable name.
  • However, if the variable is not surrounded by other characters, as in the echo command, we don’t need to use the brackets.
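A short sketch shows what can go wrong without the braces. The variable names are hypothetical; the important detail is that an underscore is a valid character in a variable name, while a dot is not.

```shell
sample="GCA_000008085.1"

# With braces, Bash knows the variable name ends after "sample"
with_braces="${sample}_renamed.faa"

# Without braces, Bash looks for a variable called "sample_renamed",
# which does not exist, so only ".faa" remains
without_braces="$sample_renamed.faa"

echo "$with_braces"
echo "$without_braces"
```

The first echo prints GCA_000008085.1_renamed.faa, the second only .faa. In the outfile_name line of the loop the variable is followed by a dot, so the braces are not strictly required there, but using them consistently avoids this kind of surprise.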

Finalizing the loop

Let’s complete the loop by implementing the desired modifications to the files.

for i in data/*faa;
do
basename=$(basename $i .faa | cut -f1-2 -d "_")
outfile_name="data/renamed/${basename}.faa"
cut -f1 -d " " $i > $outfile_name
done

#check if that worked 
ls data/renamed/
head data/renamed/GCA_000008085.1.faa

After running this code, you should observe two new files in the data/renamed folder, each with a shortened file name compared to the original:

GCA_000008085.1.faa
GCA_000017945.1.faa

Furthermore, you’ll notice that the fasta headers have been shortened:

>AAR38856.1
MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLH
NKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEP
IEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRG
EAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR
>AAR38857.1
MKKPQPYKDEEIYSILEEPVKQWFKEKYKTFTPPQRYAIMEIHKRNNVLISSPTGSGKTLAAFLAIINELIKLSHKGKLE
NRVYAIYVSPLRSLNNDVKKNLETPLKEIKEKAKELNYYIGDIRIAVRTSDTKESEKAKMLKQPPHILITTPESLAIILS
TKKFREHIKKVEFVVVDEIHALAESKRGTHLALSLERLNYLTNFVRIGLSATIHPLEEVAKFLFGYENGKPREGYIIDVS
FEKPIEIQVYSPVDDIIYSSQEELMRNLYKFIGEQLKKYRTILIFTNTRHGAESVAYHLKKAFPDLEKYIAVHHSSLSRE

Do you see how we elegantly use $outfile_name to create a new file with a cleaner name in a different folder compared to the input file? This is a very good example of how flexible you can be with your files with some basic Bash knowledge.
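For reference, here is a self-contained sketch of the same loop that can be run without the downloaded genomes. It works in a temporary directory, uses a made-up one-record FASTA file, and quotes every variable so that paths containing spaces would also work. The names workdir and base are assumptions of this sketch.

```shell
# Set up a temporary working directory with one small example file
workdir=$(mktemp -d)
mkdir -p "$workdir/data" "$workdir/renamed"
printf '%s\n' '>ABU81185.1 translation initiation factor aIF-2' 'MKK' \
    > "$workdir/data/GCA_000017945.1_ASM1794v1_protein.faa"

for i in "$workdir"/data/*.faa
do
    # Shorten the file name to the first two underscore-separated fields
    base=$(basename "$i" .faa | cut -f1-2 -d "_")
    outfile_name="$workdir/renamed/${base}.faa"
    # Shorten every header to its first space-delimited field
    cut -f1 -d " " "$i" > "$outfile_name"
done

# Check the result, then clean up
first_header=$(head -n 1 "$workdir/renamed/GCA_000017945.1.faa")
echo "$first_header"
rm -rf "$workdir"
```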

Using Loops with Files

In addition to iterating over files directly within a directory using wildcard patterns (e.g., *.txt), Bash also allows you to iterate over a list of filenames stored in a file. This can be achieved by using command substitution to capture the output of a command that generates the list of filenames. Here’s how you can use this approach:

# Assume 'list' is a file containing a list of filenames, with each filename on a separate line
# Iterate over each filename in the list
for filename in $(cat list); do
    # Perform operations on each filename
    echo "Processing file: $filename"
    # Add more commands here
done

In this example, cat list is used within command substitution $(...) to generate the list of filenames stored in the file named list. The loop then iterates over each filename in the list, performing operations as needed.
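One caveat: a for loop over $(cat list) splits the content on any whitespace, not only on line breaks. If file names may contain spaces, a while read loop is a common alternative. Note that this swaps the for loop for a different construct; the sketch below compares both on a temporary list with made-up file names.

```shell
# Create a temporary list; the first name contains a space on purpose
list_file=$(mktemp)
printf '%s\n' 'genome one.faa' 'genome_two.faa' > "$list_file"

# for + $(cat ...) splits on every space: three items instead of two
for_count=0
for filename in $(cat "$list_file")
do
    for_count=$((for_count + 1))
done

# while + read processes the file line by line: two items
while_count=0
while IFS= read -r filename
do
    echo "Processing file: $filename"
    while_count=$((while_count + 1))
done < "$list_file"

echo "for loop saw $for_count items, while loop saw $while_count"
rm -f "$list_file"
```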

Why This Might Be Useful:

Using loops with files allows for greater flexibility and automation in handling sets of files that may not follow a consistent naming pattern or reside in different directories. This approach is particularly useful in scenarios where:

  1. Dynamic File Lists: The list of files to process may vary over time or depend on external factors. By storing filenames in a separate file, you can easily update the list without modifying the loop structure.
  2. Complex Filtering: You need to filter files based on specific criteria or conditions that cannot be expressed using wildcard patterns alone. For example, you may need to select files based on metadata stored in another file or database. Imagine that some of your genomes are eukaryotic and the others are bacterial. You may need to process them separately because eukaryotes require different options than bacteria.
  3. Integration with External Tools: The filenames need to be processed or transformed before being used in the loop. You can leverage command-line tools or scripting languages to generate the list of filenames dynamically based on certain criteria or conditions.
  4. Improved Readability and Maintainability: Storing filenames in a separate file can make your scripts more readable and maintainable, especially when dealing with large sets of files or complex directory structures. It also allows for easier collaboration and sharing of scripts among team members.
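As a sketch of the filtering scenario, the list of files can be generated from a metadata table before the loop runs. Everything here is hypothetical: the tab-separated table, the file names genome_A.faa to genome_C.faa, and the use of awk to select rows.

```shell
# Hypothetical metadata table: file name <TAB> domain
metadata=$(mktemp)
printf '%s\t%s\n' \
    'genome_A.faa' 'bacteria' \
    'genome_B.faa' 'eukaryote' \
    'genome_C.faa' 'bacteria' > "$metadata"

# Write the names of all bacterial genomes to a list file
bacteria_list=$(mktemp)
awk -F '\t' '$2 == "bacteria" { print $1 }' "$metadata" > "$bacteria_list"

# Loop over the filtered list only
selected=""
for filename in $(cat "$bacteria_list")
do
    echo "Processing bacterial genome: $filename"
    selected="$selected $filename"
done

rm -f "$metadata" "$bacteria_list"
```

A second list for the eukaryotic genomes could be built the same way and processed by a separate loop with different options.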

By leveraging loops with files, you can streamline your workflow and efficiently process large sets of files with ease and flexibility.

Addon: Alternative syntax

For more complex code, it might be useful to break things up for readability. Consider this command (the details are not important here, however, see how we use exactly the same loop structure to run a completely different command?):

for i in results/prokka/*/bin*/*.gbk; do
    base_name=$(basename "$i" .gbk)
    outdir="results/pseudofinder/${base_name}"
    pseudofinder.py annotate -g $i -op $outdir -db /zfs/omics/projects/bioinformatics/databases/uniprot/uniprot_sprot_070524.fasta -t 20
done

The line that runs the tool pseudofinder.py is becoming longer, and for readability, we might want to break this command down and display each option on an individual line. We can do this as follows:

for i in results/prokka/*/bin*/*.gbk; do
    base_name=$(basename "$i" .gbk)
    outdir="results/pseudofinder/${base_name}"
    pseudofinder.py annotate \
      -g $i \
      -op $outdir \
      -db /zfs/omics/projects/bioinformatics/databases/uniprot/uniprot_sprot_070524.fasta \
      -t 20
done

Here, we use the \ to split a long line into several shorter lines, without signifying an end of the statement. If you use this, be careful with the syntax, as \ needs to be immediately followed by a new line. So don’t add any spaces afterwards.
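The line continuation can be tested with any harmless command before applying it to a real tool. The printf call below is only a stand-in chosen for this sketch.

```shell
# One command spread over four lines; each \ is the last character on its line
joined=$(printf '%s-%s-%s\n' \
    "one" \
    "two" \
    "three")
echo "$joined"
```

This prints one-two-three, exactly as if the whole command had been written on a single line.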

Error handling, or why does my loop not work?

While writing and executing loops in Bash can greatly streamline your workflow, it’s essential to anticipate and handle potential errors that may arise during script execution. Here are some common issues you might encounter and strategies for troubleshooting them:

0. General recommendations

When getting started with loops, I recommend starting as simple as possible and building your way up, similar to the approach used in this workflow. Here’s a suggested workflow to follow:

  1. Begin by running your command or software on a single file. Explore the outputs generated and observe how they are named. Understanding these outputs will help you determine how to structure your loop effectively.
  2. Build the most basic loop by simply printing the names of the files you want to iterate over. This step allows you to verify that your loop is correctly identifying the files you intend to process.
  3. Extend the loop structure step-by-step, adding functionality incrementally. This iterative approach enables you to catch mistakes early, as it’s easier to debug a simple command compared to a complex loop.

Additionally, keep in mind that syntax matters a lot in Bash scripting. Common syntax-related mistakes to avoid include:

  1. Using curly quotes (“ ”) instead of straight quotes ("). The shell does not recognize curly quotes as quotes. Depending on your text editor, you might inadvertently type curly quotes, causing syntax errors. Be mindful of this difference and adjust your editor settings if necessary. For example, for Mac users, check out the 3rd comment in this post.
  2. When defining variables, use the proper syntax and do not put spaces around the = sign, e.g. variable="hello".
  3. Whitespace and Indentation: Ensure consistent whitespace and indentation in your Bash scripts to enhance readability and maintainability. Pay attention to spaces around operators and arguments to commands within your script to avoid syntax errors or unexpected behavior. Also ensure that you always use spaces after command names and command options.
  4. Forgetting to Quote Variables or Use Brackets Around Variables: Properly quote variables, especially when dealing with filenames or strings containing spaces or special characters, to ensure their values are interpreted correctly by the shell. Additionally, use brackets {} around variables when they are part of a string with other text to avoid ambiguity and potential issues.
  5. Failure to Handle Empty or Missing Files: Be vigilant about handling scenarios where files might be empty or missing in your loops. Implement checks within your script to gracefully handle such situations, either by skipping empty or missing files or by providing appropriate error messages or default values.
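A check for missing or empty files could look roughly like the sketch below. It uses the standard test operators -e (file exists) and -s (file exists and is not empty); the temporary directory and the file names are made up for illustration.

```shell
# Temporary directory with one normal and one empty file
workdir=$(mktemp -d)
printf '>seq1\nMKK\n' > "$workdir/full.faa"
: > "$workdir/empty.faa"

processed=0
skipped=0
for f in "$workdir"/*.faa
do
    # If the pattern matched nothing, $f is the literal pattern: skip it
    [ -e "$f" ] || continue
    if [ -s "$f" ]; then
        echo "Processing $f"
        processed=$((processed + 1))
    else
        echo "Skipping empty file: $f" >&2
        skipped=$((skipped + 1))
    fi
done

echo "processed=$processed skipped=$skipped"
rm -rf "$workdir"
```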

1. File Not Found

If your loop is not working as expected, one common issue could be that the specified files are not found in the designated directory. This could occur due to incorrect file paths or missing files. To address this:

  • Double-check the file paths specified in your loop to ensure they are accurate and point to the correct directory.
  • Simplify your command and use echo to check that the correct filename is printed.
  • Verify that the files you intend to process actually exist in the specified directory. You can do this using the ls command, and you can inspect the file contents with nano or head.

2. Incorrect File Format

Another potential issue is if the files you’re attempting to process are not in the expected format. For example, if your loop expects fasta files but encounters files in a different format, it may not function correctly. To handle this:

  • Confirm that the files in the specified directory are in the expected format. You can examine file contents using commands like head or cat to verify their format.
  • Implement checks within your loop to validate file formats before processing them. This can help prevent errors and ensure that only compatible files are processed.
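One simple format check, sketched here under the assumption that a valid fasta file starts with a ‘>’ on its first line, is to look at the first character before processing. The helper function looks_like_fasta and the example files are hypothetical.

```shell
# Succeeds if the first character of the file is ">"
looks_like_fasta() {
    first_char=$(head -n 1 "$1" | cut -c 1)
    [ "$first_char" = ">" ]
}

# Two temporary example files: one valid, one not
workdir=$(mktemp -d)
printf '>seq1\nMKK\n' > "$workdir/good.faa"
printf 'not a fasta file\n' > "$workdir/bad.faa"

accepted=""
for f in "$workdir"/*.faa
do
    if looks_like_fasta "$f"; then
        accepted="$accepted $(basename "$f")"
    else
        echo "Skipping $f: first line does not start with '>'" >&2
    fi
done

echo "Accepted:$accepted"
rm -rf "$workdir"
```

This is only a quick sanity check, not a full validation of the file.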

3. Permissions

Permissions issues can also prevent your loop from executing properly, particularly if you’re attempting to write output to a directory where you don’t have sufficient permissions. To resolve this:

  • Ensure that you have the necessary permissions to read input files and write output files to the designated directories. You can use the ls -l command to view file permissions and ownership.
  • If necessary, adjust file permissions using the chmod command to grant yourself the required access.
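Permissions can also be tested inside a script before the loop starts. The sketch below uses the standard test operators -r (readable) and -w (writable) on a temporary directory and file created just for this example.

```shell
# Temporary output directory and input file for the check
outdir=$(mktemp -d)
infile="$outdir/input.faa"
printf '>seq1\nMKK\n' > "$infile"

# Only continue if the input is readable and the output directory is writable
if [ -r "$infile" ] && [ -w "$outdir" ]; then
    status="ok"
else
    status="permission problem"
fi

echo "$status"
rm -rf "$outdir"
```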

4. Syntax Errors

Syntax errors in your loop script can also cause it to fail. Common syntax errors include missing semicolons, incorrect variable assignments, or mismatched quotes. To address syntax errors:

  • Carefully review your loop script for any syntax errors or typos. Pay close attention to variable assignments, loop syntax, usage of spaces and command usage.
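If the loop is saved in a script file, Bash can check the syntax without running any of the commands by using the -n option. The sketch below writes a deliberately broken script (the closing done is missing) to a temporary file and checks it.

```shell
# A loop without its closing "done", written to a temporary script
broken_script=$(mktemp)
printf '%s\n' 'for i in 1 2 3' 'do' '    echo "$i"' > "$broken_script"

# bash -n only parses the script; a non-zero exit status means a syntax error
if bash -n "$broken_script" 2>/dev/null; then
    syntax_ok=yes
else
    syntax_ok=no
fi

echo "Syntax OK: $syntax_ok"
rm -f "$broken_script"
```

Without the 2>/dev/null redirection, Bash also prints the line at which it detected the problem.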

5. Handling DOS File Format Issues

In some cases, you may encounter issues related to the file format when working with loops in Bash, particularly if your files originate from a Windows environment. DOS (or Windows) text files use different line-ending characters compared to Unix-based systems, which can lead to unexpected behavior when processing files in Bash.

DOS text files typically use a combination of carriage return (\r) and line feed (\n) characters to denote line endings, whereas Unix-based systems use only a line feed character. When processing DOS files in Bash, the presence of extra carriage return characters can cause issues, such as unexpected line breaks or syntax errors.

To handle DOS line endings:

  • Use utilities like dos2unix to convert DOS-formatted text files to Unix format before processing them in your loop. This command removes any extraneous carriage return characters, ensuring consistent line endings.

  • If dos2unix is not available, you can use tr (translate or delete characters) to remove carriage return characters from DOS-formatted files. Here’s how you can accomplish this:

      # Remove carriage return characters from DOS-formatted files
      for file in *.txt; do
          tr -d '\r' < "$file" > "$file.tmp" && mv "$file.tmp" "$file"
      done
  • This loop iterates over all .txt files in the current directory, removes carriage return characters (\r) using tr, and then overwrites the original files with the processed versions. For line endings, this achieves the same result as dos2unix, using the standard tr utility instead.

  • You can modify the file extension (*.txt) to match the specific file types you’re working with in your environment.
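Before converting anything, it can help to confirm that a file really contains carriage returns. The sketch below is one way to do this with grep, using a temporary file with DOS line endings created for the example; the variable names are assumptions of this sketch.

```shell
# Create a temporary file with DOS (\r\n) line endings
dos_file=$(mktemp)
printf 'line one\r\nline two\r\n' > "$dos_file"

# Store a literal carriage return so that grep can search for it
cr=$(printf '\r')

if grep -q "$cr" "$dos_file"; then before=yes; else before=no; fi

# Remove the carriage returns as in the loop above
tr -d '\r' < "$dos_file" > "$dos_file.tmp" && mv "$dos_file.tmp" "$dos_file"

if grep -q "$cr" "$dos_file"; then after=yes; else after=no; fi

echo "Carriage returns before: $before, after: $after"
rm -f "$dos_file"
```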