File verifications in bash
In job scripts you sometimes want to check whether a file exists and only run an action if it doesn’t exist. For example you might want to run a quality cleaning step on multiple files on an HPC. Here, you might want to skip the cleaning step if the output file already exists to avoid re-running computationally costly tasks. Or you might want to generate a clear error message if an input file is missing for easier troubleshooting.
For this tutorial, first create two files, a sample list and an empty file, which we will use in later sections:
# Generate a list of samples that we want to analyse
echo -e "s1\ns2\ns3\ns4" > samples
# Generate an empty file to test some commands below
touch empty_file
Action when a file does not exist
Let’s begin with a basic example: a verification for a file that doesn’t exist yet.
For this tutorial, we assume that you want to do some quality cleaning but to keep your jobs efficient, you only want to run the cleaning step if the file does not exist. For this example, we assume that the cleaning step produces a file called cleaned.fastq.
The -f operator returns TRUE if the file exists and is a regular file (not a directory). We can use this in an if/else statement to perform a specific action only when the operator returns TRUE and do something else when it does not.
Notes:
- When writing this yourself, a common error is to omit the spaces after [ and before ]. These spaces are an important part of the syntax; leaving them out will create an error.
- In the code below the touch command is used to simulate the cleaning script generating an output file. In a real scenario you would remove this line and add your actual command there instead.
OUTPUT="cleaned.fastq"
# Check if file exists
if [ -f "$OUTPUT" ]; then
echo -e "\n${OUTPUT} exists. Nothing to be done"
else
echo -e "\nRun cleaning script. Generating output:"
touch "$OUTPUT"
fi
# confirm the file was created
ls "$OUTPUT"
This returns:
Run cleaning script. Generating output:
cleaned.fastq
So we see that the cleaning script would be executed. If we re-run the command, the script recognizes that the cleaned.fastq file already exists and does not rerun the cleaning step:
OUTPUT="cleaned.fastq"
if [ -f "$OUTPUT" ]; then
echo -e "\n${OUTPUT} exists. Nothing to be done"
else
echo -e "\nRun cleaning script. Generating output:"
touch "$OUTPUT"
fi
This returns:
cleaned.fastq exists. Nothing to be done
You can also negate things and check if a file does not exist using !:
# Remove the file created in the previous example to start fresh
rm cleaned.fastq
OUTPUT="cleaned.fastq"
# Check if the file does not exist and then do something
if [ ! -f "$OUTPUT" ]; then
echo -e "\n${OUTPUT} does not exist. Run cleaning script. Generating output:"
touch "$OUTPUT"
fi
# confirm the file was created
ls "$OUTPUT"
This returns:
cleaned.fastq does not exist. Run cleaning script. Generating output:
cleaned.fastq
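One detail worth keeping in mind is that -f is only true for regular files, so a directory with the expected name does not pass the check. The sketch below illustrates this with a small helper; the function name describe_path and the temporary directory are assumptions made for this illustration and are not part of the tutorial files.

```bash
# Hypothetical helper: report what kind of path we are looking at
describe_path() {
    local target="$1"
    if [ -f "$target" ]; then
        echo "regular file"      # exists and is a normal file
    elif [ -d "$target" ]; then
        echo "directory"         # exists, but -f is false for it
    elif [ -e "$target" ]; then
        echo "other"             # exists, but is neither file nor directory
    else
        echo "missing"
    fi
}

# Try it in a throwaway directory so no real files are touched
WORKDIR=$(mktemp -d)
mkdir "$WORKDIR/results"
touch "$WORKDIR/cleaned.fastq"

for NAME in results cleaned.fastq not_there.fastq; do
    echo "${NAME}: $(describe_path "$WORKDIR/$NAME")"
done

rm -r "$WORKDIR"
```

If you only care whether something with that name exists at all, -e is the more general test, while -d checks specifically for a directory.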
Action when a file exists but is empty
Sometimes tools might run incorrectly and generate an output but this output is empty. In this case you would want to rerun your cleaning step on that specific file.
The -s operator returns true if the file exists and has a size greater than zero (i.e., is non-empty).
OUTPUT="empty_file"
if [ -s "$OUTPUT" ]; then
echo -e "\n${OUTPUT} exists and is non-empty. Nothing to be done"
else
echo -e "\n${OUTPUT} is empty, run cleaning script. Generating output:"
touch "$OUTPUT"
fi
# confirm the file was created
ls "$OUTPUT"
This returns:
empty_file is empty, run cleaning script. Generating output:
empty_file
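Note: in a real script, the else branch would rerun the tool that generates the file. Here we use touch as a placeholder. Since touch leaves the file empty, the check would trigger again on the next run. Also note that -s is false for a file that does not exist at all, so the else branch runs for both missing and empty files.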
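Because -s alone cannot tell a missing file from an empty one, it can help to combine it with -f when you want different messages for the two cases. The following is a minimal sketch; the helper name check_output and the file names inside the temporary directory are hypothetical.

```bash
# Hypothetical helper: classify an output file as missing, empty or ok
check_output() {
    local file="$1"
    if [ ! -f "$file" ]; then
        echo "missing"           # the tool never wrote the file
    elif [ ! -s "$file" ]; then
        echo "empty"             # the file was created but has size zero
    else
        echo "ok"
    fi
}

# Demonstrate the three cases in a throwaway directory
WORKDIR=$(mktemp -d)
touch "$WORKDIR/empty_file"
echo "some content" > "$WORKDIR/full_file"

for NAME in empty_file full_file no_such_file; do
    echo "${NAME}: $(check_output "$WORKDIR/$NAME")"
done

rm -r "$WORKDIR"
```

This prints empty for empty_file, ok for full_file and missing for no_such_file, which makes it easier to see whether a tool never ran or ran and produced nothing.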
Combining checks
We can also check if two files exist, which might be useful if your tool requires two inputs. To do this, you can combine checks with the AND operator &&.
Note: When combining conditions, use [[ instead of [. This construct is a bash keyword that handles compound conditions more reliably. You can use [[ even if you are not combining two checks, because [[ has other useful features, such as handling empty, unquoted variables without returning a syntax error.
if [[ -f samples && -f cleaned.fastq ]]; then
echo -e "\nBoth files exist. Nothing to be done"
else
echo "File missing. Re-run file generation step"
fi
This returns:
Both files exist. Nothing to be done
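To illustrate why [[ is the safer choice with empty variables, the sketch below compares both constructs using an empty, unquoted variable. The variable names INPUT, SINGLE and DOUBLE are assumptions for this example. In this particular case [ does not even raise an error but silently gives the wrong answer.

```bash
INPUT=""   # simulates a variable that was never filled in

# Inside [ ] the unquoted empty variable disappears and the test becomes [ -f ].
# With a single argument, [ only checks that the string "-f" is non-empty,
# so the test succeeds even though no file was checked.
if [ -f $INPUT ]; then
    SINGLE="true"
else
    SINGLE="false"
fi

# Inside [[ ]] the variable is not split away, so bash tests the empty
# file name, which does not exist.
if [[ -f $INPUT ]]; then
    DOUBLE="true"
else
    DOUBLE="false"
fi

echo "[ -f \$INPUT ]   -> $SINGLE"
echo "[[ -f \$INPUT ]] -> $DOUBLE"
```

The first line reports true and the second false. Quoting the variable as "$INPUT" fixes the [ version as well, which is why the examples in this tutorial always quote their variables.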
You can be more defensive and use an elif statement to know exactly which file does not exist. Here, negation is again useful.
if [[ ! -f samples ]]; then
echo -e "\nsamples does not exist."
elif [[ ! -f cleaned.fastq ]]; then
echo -e "\ncleaned.fastq does not exist."
else
echo "All files exist. Nothing to be done."
fi
This returns:
All files exist. Nothing to be done.
Note: elif checks conditions sequentially and stops at the first match. If both files are missing, only the first missing file will be reported. If you wanted to report each file, you can instead use two separate if statements.
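As a sketch of that alternative, the separate if statements can be generalized into a small loop that checks every required file and reports each missing one. The function name report_missing is hypothetical; samples and cleaned.fastq are the files used in this section.

```bash
# Hypothetical helper: print every missing file, succeed only if none is missing
report_missing() {
    local missing=0
    local file
    for file in "$@"; do
        if [[ ! -f "$file" ]]; then
            echo "${file} does not exist."
            missing=$((missing + 1))
        fi
    done
    # The status of this last test becomes the return status of the function
    [[ $missing -eq 0 ]]
}

if report_missing samples cleaned.fastq; then
    echo "All files exist. Nothing to be done."
else
    echo "File missing. Re-run file generation step"
fi
```

Unlike the elif version, this reports both files when both are missing, at the cost of being slightly longer.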
Putting everything together
Looping across all samples
Assume now that we want to run our cleaning step on all samples. A loop would be a natural way to do this. If you want to learn more about loops go here. In our case we could do the following:
# Generate a dummy output file that exists
echo "some content" > s1.cleaned.fastq
for ID in $(cat samples); do
OUTPUT="${ID}.cleaned.fastq"
echo "Processing $ID"
echo "---------------"
if [[ -s "$OUTPUT" ]]; then
echo -e " ${OUTPUT} exists and is non-empty, skipping\n"
else
echo -e "Running cleaning step and generating ${OUTPUT} \n"
fi
done
We see that s1 is handled differently since its output file already exists:
Processing s1
---------------
s1.cleaned.fastq exists and is non-empty, skipping
Processing s2
---------------
Running cleaning step and generating s2.cleaned.fastq
Processing s3
---------------
Running cleaning step and generating s3.cleaned.fastq
Processing s4
---------------
Running cleaning step and generating s4.cleaned.fastq
break: Stopping a loop if something is wrong
Sometimes you want to stop the entire loop if something is wrong, rather than skipping a sample. This is useful when a missing or empty file indicates a problem in a previous step that needs to be resolved before continuing.
For this example, consider the following case: you performed the quality cleaning step and want to proceed with the next step, assembling the reads into contigs, i.e. into longer stretches of DNA. Maybe something went wrong during the quality cleaning and some of your cleaned read files are empty. Here, you might want to stop the pipeline, since the assembly is computationally expensive, and first get a useful error message that allows you to troubleshoot before trying again.
The assembly process typically needs two inputs, the forward and the reverse reads. While the details of these files are not important here, this is a good use case for the OR operator ||, which lets us test whether the forward or the reverse file is missing. We will also use a new keyword, break, which stops the loop if an issue occurs.
# Only generate dummy files for s1 to simulate a partially completed pipeline
# Forward and reverse reads are typically input files for many computation tasks
echo "some content" > s1_forward_cleaned.fastq
echo "some content" > s1_reverse_cleaned.fastq
# Run the loop
for ID in $(cat samples); do
CLEANED_READS_F="${ID}_forward_cleaned.fastq"
CLEANED_READS_R="${ID}_reverse_cleaned.fastq"
echo "Processing $ID"
echo "---------------"
if [[ ! -s "$CLEANED_READS_F" || ! -s "$CLEANED_READS_R" ]]; then
echo "ERROR: ${CLEANED_READS_F} or ${CLEANED_READS_R} is missing or empty. Stopping pipeline."
break
fi
echo -e "Both input files exist and are non-empty. Running assembly for $ID\n"
done
echo -e "\nScript finished"
We see:
Processing s1
---------------
Both input files exist and are non-empty. Running assembly for s1
Processing s2
---------------
ERROR: s2_forward_cleaned.fastq or s2_reverse_cleaned.fastq is missing or empty. Stopping pipeline.
Script finished
So the loop now stops as soon as one of the s2 files is missing. Note that "Script finished" is still printed, because break only leaves the loop.
exit 1: Stopping a script if something is wrong
Important for longer scripts: break stops the loop but does not stop the script. Any commands written after the done keyword will still be executed, even if break was triggered. To stop the entire script instead, replace break with exit 1. The 1 signals to the system that the script ended due to an error (exit code 0 means success):
for ID in $(cat samples); do
CLEANED_READS_F="${ID}_forward_cleaned.fastq"
CLEANED_READS_R="${ID}_reverse_cleaned.fastq"
echo "Processing $ID"
echo "---------------"
if [[ ! -s "$CLEANED_READS_F" || ! -s "$CLEANED_READS_R" ]]; then
echo "ERROR: ${CLEANED_READS_F} or ${CLEANED_READS_R} is missing or empty. Stopping pipeline."
exit 1
fi
echo -e "Both input files exist and are non-empty. Running assembly for $ID\n"
done
echo -e "\nScript finished" # This line is never reached because exit 1 stops the script
You now will see:
Processing s1
---------------
Both input files exist and are non-empty. Running assembly for s1
Processing s2
---------------
ERROR: s2_forward_cleaned.fastq or s2_reverse_cleaned.fastq is missing or empty. Stopping pipeline.
logout
If you run this interactively, exit 1 ends your current shell session, so you may have to press enter to restart the terminal. The logout line is produced by the terminal itself when the shell session ends and is not part of the script output. In a SLURM job script, exit 1 will not produce logout or the exit code message. Instead, the script will simply stop, and the job will appear as FAILED when checked with sacct. Because echo writes to standard output, the error message appears in the SLURM output file (by default this file also collects standard error unless a separate error file is requested). To send the message to standard error, add >&2 to the echo command.
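If you want to try exit 1 interactively without closing your terminal, one option is to wrap the commands in parentheses so that they run in a subshell. The sketch below assumes a hypothetical input path that does not exist and stores the exit code in a variable called STATUS.

```bash
# Hypothetical input path that does not exist on this system
INPUT="/nonexistent_dir/s2_forward_cleaned.fastq"

STATUS=0
# Parentheses start a subshell: exit 1 ends only the subshell, not your terminal.
# The "|| STATUS=$?" part records the exit code if the subshell fails.
(
    if [[ ! -s "$INPUT" ]]; then
        echo "ERROR: ${INPUT} is missing or empty. Stopping pipeline."
        exit 1
    fi
    echo "This line is never reached"
) || STATUS=$?

echo "Exit code: $STATUS"
```

The last line prints Exit code: 1, which is the same value SLURM records for a job script that stopped with exit 1.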
continue keyword
The above works, but we might want a different behavior: run the assembly for every sample whose input files exist and print a clear error message for all other samples. The continue keyword does just that. Let’s first see what happens if we use neither break nor continue:
# Create dummy files for every sample except S2 to illustrate the process
echo "some content" > s3_forward_cleaned.fastq
echo "some content" > s3_reverse_cleaned.fastq
echo "some content" > s4_forward_cleaned.fastq
echo "some content" > s4_reverse_cleaned.fastq
# Run the loop
for ID in $(cat samples); do
CLEANED_READS_F="${ID}_forward_cleaned.fastq"
CLEANED_READS_R="${ID}_reverse_cleaned.fastq"
echo "Processing $ID"
echo "---------------"
if [[ ! -s "$CLEANED_READS_F" || ! -s "$CLEANED_READS_R" ]]; then
echo "ERROR: ${CLEANED_READS_F} or ${CLEANED_READS_R} is missing or empty. Skipping sample."
fi
echo -e "Both input files exist and are non-empty. Running assembly for $ID\n"
done
This prints:
Processing s1
---------------
Both input files exist and are non-empty. Running assembly for s1
Processing s2
---------------
ERROR: s2_forward_cleaned.fastq or s2_reverse_cleaned.fastq is missing or empty. Skipping sample.
Both input files exist and are non-empty. Running assembly for s2
Processing s3
---------------
Both input files exist and are non-empty. Running assembly for s3
Processing s4
---------------
Both input files exist and are non-empty. Running assembly for s4
We proceed to work through all 4 samples, but looking closely at what happens with s2, we see that the script:
- Acknowledges that files are missing
- Tries to run the assembly anyhow, which in a real example would produce an error
By using continue we can skip the rest of the loop body for a sample where something is missing:
# Run the loop
for ID in $(cat samples); do
CLEANED_READS_F="${ID}_forward_cleaned.fastq"
CLEANED_READS_R="${ID}_reverse_cleaned.fastq"
echo "Processing $ID"
echo "---------------"
if [[ ! -s "$CLEANED_READS_F" || ! -s "$CLEANED_READS_R" ]]; then
echo "ERROR: ${CLEANED_READS_F} or ${CLEANED_READS_R} is missing or empty. Skipping sample."
continue
fi
echo -e "Both input files exist and are non-empty. Running assembly for $ID\n"
done
This returns:
Processing s1
---------------
Both input files exist and are non-empty. Running assembly for s1
Processing s2
---------------
ERROR: s2_forward_cleaned.fastq or s2_reverse_cleaned.fastq is missing or empty. Skipping sample.
Processing s3
---------------
Both input files exist and are non-empty. Running assembly for s3
Processing s4
---------------
Both input files exist and are non-empty. Running assembly for s4
Now we see that nothing is done for S2, exactly what we want.
Important: if you use this inside a SLURM script, the message printed by echo goes to the job's output file. Add >&2 to the echo command if you want it to appear in the error file instead. Make sure to always read these files to catch any issues with your script.
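To make skipped samples harder to overlook, one possible pattern is to write the error to standard error and to collect the skipped sample IDs for a summary at the end. The sketch below uses a throwaway directory with forward reads for s1 and s3 only; the array name SKIPPED and the reduced sample list are assumptions for this illustration.

```bash
# Set up dummy inputs: s2 is deliberately missing
WORKDIR=$(mktemp -d)
echo "some content" > "$WORKDIR/s1_forward_cleaned.fastq"
echo "some content" > "$WORKDIR/s3_forward_cleaned.fastq"

SKIPPED=()   # collects the IDs of all skipped samples
for ID in s1 s2 s3; do
    CLEANED_READS_F="$WORKDIR/${ID}_forward_cleaned.fastq"
    if [[ ! -s "$CLEANED_READS_F" ]]; then
        # >&2 sends the message to standard error instead of standard output
        echo "ERROR: ${ID}_forward_cleaned.fastq is missing or empty. Skipping sample." >&2
        SKIPPED+=("$ID")
        continue
    fi
    echo "Running assembly for $ID"
done

# Report all skipped samples once more at the very end of the run
if [[ ${#SKIPPED[@]} -gt 0 ]]; then
    echo "WARNING: skipped ${#SKIPPED[@]} sample(s): ${SKIPPED[*]}" >&2
fi

rm -r "$WORKDIR"
```

A summary at the end means you do not have to scroll through the whole log to find out whether any sample was dropped.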
Alternative approach: The same behavior can be written using if/else, which makes both the error and the action branch explicitly visible. This can be easier to read when the loop body is long or when sharing code with others who are less familiar with continue.
for ID in $(cat samples); do
CLEANED_READS_F="${ID}_forward_cleaned.fastq"
CLEANED_READS_R="${ID}_reverse_cleaned.fastq"
echo "Processing $ID"
echo "---------------"
if [[ ! -s "$CLEANED_READS_F" || ! -s "$CLEANED_READS_R" ]]; then
echo "ERROR: files missing or empty for ${ID}"
else
echo -e "Running assembly for $ID\n"
fi
done
Summary
You have now learned how to design code that handles missing input files more explicitly:
- Use -f to check whether a file exists and -s to check whether a file exists and is non-empty. Prefer [[ over [ when writing conditions in bash scripts
- Use break to stop a loop entirely when a missing or empty file indicates a problem that needs to be resolved before continuing
- Use continue to skip the current sample and proceed to the next iteration when a file is missing. Note that continue is only meaningful when there is code after the if block that should be skipped. Without such code, the loop advances to the next iteration anyway, making continue redundant
- Use if/else as a readable alternative to continue, particularly for longer loop bodies or when sharing code with others
A note on GenAI-generated code
Code generated by GenAI tools will often include file verification patterns. While these patterns are not wrong, they are frequently written with production pipelines in mind rather than scientific workflows. Concretely this means:
- Silent skipping of samples using continue without a clear error message, meaning a sample can disappear from your results without any warning
- File existence checks (-f) where an integrity check would be more appropriate. A file can exist and be non-empty but still be corrupt, for example after a SLURM timeout
- Defensive patterns like || continue placed directly in command chains, where they are difficult to read and harder to reason about. Avoid these and use the patterns you have seen above.
When reviewing GenAI-generated code, always ask: if this check fails, what actually happens to my data? A script that runs without errors is not the same as a script that produced correct results.
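As an illustration of what a simple integrity check could look like, the sketch below tests whether an uncompressed FASTQ file has a line count that is a multiple of four, since each FASTQ record spans four lines. The function name looks_like_complete_fastq and the dummy files are hypothetical. This is only a weak heuristic: it assumes four-line records and a trailing newline, and a truncated file can still pass by chance.

```bash
# Hypothetical helper: weak integrity check for an uncompressed FASTQ file
looks_like_complete_fastq() {
    local file="$1"
    local lines
    [[ -s "$file" ]] || return 1          # missing or empty files fail
    lines=$(wc -l < "$file")              # count the lines in the file
    (( lines % 4 == 0 ))                  # complete records come in blocks of 4
}

# Create one complete and one truncated dummy file in a throwaway directory
WORKDIR=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$WORKDIR/complete.fastq"
printf '@read1\nACGT\n+\nIIII\n@read2\nAC' > "$WORKDIR/truncated.fastq"

for NAME in complete.fastq truncated.fastq; do
    if looks_like_complete_fastq "$WORKDIR/$NAME"; then
        echo "${NAME}: passes the line-count check"
    else
        echo "ERROR: ${NAME} fails the line-count check" >&2
    fi
done

rm -r "$WORKDIR"
```

Here truncated.fastq exists and is non-empty, so it would pass -s, but it fails the line-count check. Dedicated validation tools for your file format are usually the more reliable choice.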