File verifications in bash

In job scripts you often want to check whether a file exists before running an action. For example, when running a quality cleaning step on many files on an HPC system, you might want to skip the step if the output file already exists, to avoid re-running computationally costly tasks. Or you might want to print a clear error message if an input file is missing, which makes troubleshooting easier.

For this tutorial, first create two files, a sample list and an empty file, which we will use in later sections:

# Generate a list of samples that we want to analyse
echo -e "s1\ns2\ns3\ns4" > samples

# Generate an empty file to test some commands below
touch empty_file

Action when a file does not exist

Let’s begin with a basic example and add a check for a file that doesn’t exist yet.

For this tutorial, we assume that you want to do some quality cleaning but to keep your jobs efficient, you only want to run the cleaning step if the file does not exist. For this example, we assume that the cleaning step produces a file called cleaned.fastq.

The -f operator returns true if the file exists and is a regular file (that is, not a directory or another special file type). We can use this in an if/else statement to perform one action when the operator returns true and a different action when it does not.

Notes:

A common error when writing this yourself is to omit the space after [ (as in [ -f) or the space before ]. These spaces are a required part of the syntax, and leaving them out produces an error.
In the code below the touch command is used to simulate the cleaning script generating an output file. In a real scenario you would remove this line and add your actual command there instead.

OUTPUT="cleaned.fastq"

# Check if file exists
if [ -f "$OUTPUT" ]; then 
    echo -e "\n${OUTPUT} exists. Nothing to be done"
else 
    echo -e "\nRun cleaning script. Generating output:" 
    touch "$OUTPUT"
fi

# confirm the file was created
ls "$OUTPUT"

This returns:

Run cleaning script. Generating output:
cleaned.fastq

So we see that the cleaning script would be executed. If we re-run the command, the script recognizes that the cleaned.fastq file already exists and won’t rerun the cleaning step:

OUTPUT="cleaned.fastq"

if [ -f "$OUTPUT" ]; then 
    echo -e "\n${OUTPUT} exists. Nothing to be done"
else 
    echo -e "\nRun cleaning script. Generating output:" 
    touch "$OUTPUT"
fi

cleaned.fastq exists. Nothing to be done
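Because -f is only true for regular files, a directory that happens to have the expected name does not pass the check. The following is a minimal sketch of how -f, -d (directory) and -e (any kind of path) differ. The helper check_path and the names demo_dir and demo_file.txt are hypothetical and used only for this illustration:

```bash
# Hypothetical names used only for this illustration
mkdir -p demo_dir
touch demo_file.txt

# Hypothetical helper: report what kind of path $1 is
check_path() {
    if [ -f "$1" ]; then
        echo "$1: regular file"
    elif [ -d "$1" ]; then
        echo "$1: directory"
    elif [ -e "$1" ]; then
        echo "$1: exists, but is neither a regular file nor a directory"
    else
        echo "$1: does not exist"
    fi
}

check_path demo_file.txt
check_path demo_dir
check_path no_such_path

# Clean up the demo files
rm -r demo_dir demo_file.txt
```

Assuming no_such_path does not exist in your working directory, the three calls report a regular file, a directory, and a missing path, in that order.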

You can also negate things and check if a file does not exist using !:

# Remove the file created in the previous example to start fresh
rm cleaned.fastq

OUTPUT="cleaned.fastq"

# Check if the file does not exist and then do something 
if [ ! -f "$OUTPUT" ]; then 
    echo -e "\n${OUTPUT} does not exist. Run cleaning script. Generating output:" 
    touch "$OUTPUT"
fi

# confirm the file was created
ls "$OUTPUT"

cleaned.fastq does not exist. Run cleaning script. Generating output:
cleaned.fastq

Action when a file exists but is empty

Sometimes a tool fails partway through and generates an output file that is empty. In this case you would want to rerun your cleaning step on that specific file.

The -s operator returns true if the file exists and has a size greater than zero (i.e., is non-empty).

OUTPUT="empty_file"

if [ -s "$OUTPUT" ]; then 
    echo -e "\n${OUTPUT} exists and is non-empty. Nothing to be done"
else 
    echo -e "\n${OUTPUT} is empty, run cleaning script. Generating output:" 
    touch "$OUTPUT"
fi

# confirm the file was created
ls "$OUTPUT"

empty_file is empty, run cleaning script. Generating output:
empty_file

Note: in a real script, the else branch would rerun the tool that generates the file. Here we use touch as a placeholder. Since touch only updates the timestamp of an existing file, the file stays empty and the check would trigger again on the next run.
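Keep in mind that -s is false both when a file is missing and when it exists but is empty. If you want to tell these cases apart in an error message, one option is to combine -e and -s. This is a minimal sketch; the helper file_state and the file names filled_file and no_file are hypothetical:

```bash
# Hypothetical helper: report whether a file is missing, empty, or has content
file_state() {
    if [[ ! -e "$1" ]]; then
        echo "missing"
    elif [[ ! -s "$1" ]]; then
        echo "empty"
    else
        echo "non-empty"
    fi
}

touch empty_file                     # the empty file from the start of the tutorial
echo "some content" > filled_file    # hypothetical non-empty file

echo "empty_file:  $(file_state empty_file)"
echo "filled_file: $(file_state filled_file)"
echo "no_file:     $(file_state no_file)"   # assumed not to exist

# Remove the demo file again
rm filled_file
```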

Combining checks

We can also check if two files exist, which might be useful if your tool requires two inputs. To do this, you can combine checks with the AND statement using &&.

Note: When combining conditions, use [[ instead of [. The [[ construct is a bash keyword that supports && and || directly inside the brackets. You can use [[ even if you are not combining two checks, because it has other useful features. For example, it does not split unquoted variables into words, so an empty or unset variable does not break the test.

if [[ -f samples && -f cleaned.fastq ]]; then 
    echo -e "\nBoth files exist. Nothing to be done"
else    
    echo "File missing. Re-run file generation step"
fi

Both files exist. Nothing to be done
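One concrete case where [[ is safer than [ is an unquoted variable that happens to be empty. With [, the variable disappears during expansion and the test silently becomes [ -f ], which only checks that the string "-f" is non-empty and therefore succeeds. The sketch below compares the three variants; the variable names are chosen for illustration only:

```bash
EMPTY_VAR=""

# [ -f $EMPTY_VAR ] expands to [ -f ], which only tests that the
# string "-f" is non-empty, so the check wrongly succeeds
if [ -f $EMPTY_VAR ]; then
    SINGLE="true"
else
    SINGLE="false"
fi

# [[ does not word-split, so the empty string is tested as a file name
if [[ -f $EMPTY_VAR ]]; then
    DOUBLE="true"
else
    DOUBLE="false"
fi

# Quoting the variable makes [ behave correctly as well
if [ -f "$EMPTY_VAR" ]; then
    QUOTED="true"
else
    QUOTED="false"
fi

echo "single brackets, unquoted: $SINGLE"
echo "double brackets, unquoted: $DOUBLE"
echo "single brackets, quoted:   $QUOTED"
```

Only the first variant reports true. This is why quoting variables, as done throughout this tutorial, is a good habit even when using [[.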

You can be more defensive and use an elif statement to know exactly which file does not exist. Here, negation is again useful.

if [[ ! -f samples ]]; then 
    echo -e "\nsamples does not exist."
elif [[ ! -f cleaned.fastq ]]; then
   echo -e "\ncleaned.fastq does not exist."
else    
    echo "All files exist. Nothing to be done."
fi

All files exist. Nothing to be done.

Note: elif checks conditions sequentially and stops at the first match. If both files are missing, only the first missing file will be reported. If you wanted to report each file, you can instead use two separate if statements.
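If you have more than a couple of files, writing one if statement per file becomes repetitive. As a sketch of the same idea, the hypothetical helper report_missing below loops over all file names it is given, prints one error per missing file, and signals through its return value whether anything was missing:

```bash
# Hypothetical helper: print an error for every missing file and
# return non-zero if at least one file is missing
report_missing() {
    local missing=0
    local file
    for file in "$@"; do
        if [[ ! -f "$file" ]]; then
            echo "ERROR: $file does not exist."
            missing=$((missing + 1))
        fi
    done
    # The function succeeds only if no file was missing
    [[ "$missing" -eq 0 ]]
}

if report_missing samples cleaned.fastq; then
    echo "All files exist. Nothing to be done."
else
    echo "File(s) missing. Re-run file generation step."
fi
```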

Putting everything together

Looping across all samples

Assume now that we want to run our cleaning step on all samples. A loop would be a natural way to do this. If you want to learn more about loops go here. In our case we could do the following:

# Generate a dummy output file that exists 
echo "some content" > s1.cleaned.fastq

for ID in $(cat samples); do
    OUTPUT="${ID}.cleaned.fastq"

    echo "Processing $ID"
    echo "---------------"

    if [[ -s "$OUTPUT" ]]; then
        echo -e "${OUTPUT} exists and is non-empty, skipping\n"
    else
        echo -e "Running cleaning step and generating ${OUTPUT} \n"
    fi
done

We see that s1 is handled differently since its output file already exists and is non-empty:

Processing s1
---------------
s1.cleaned.fastq exists and is non-empty, skipping

Processing s2
---------------
Running cleaning step and generating s2.cleaned.fastq

Processing s3
---------------
Running cleaning step and generating s3.cleaned.fastq

Processing s4
---------------
Running cleaning step and generating s4.cleaned.fastq
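The for ID in $(cat samples) form works well for simple IDs like the ones used here. A commonly used alternative reads the sample list line by line with while read, which also copes with IDs that contain spaces. The sketch below recreates the two input files so that it runs on its own; the counters SKIPPED and TO_RUN are additions for illustration:

```bash
# Recreate the inputs from above so the sketch is self-contained
printf "s1\ns2\ns3\ns4\n" > samples
echo "some content" > s1.cleaned.fastq

SKIPPED=0
TO_RUN=0

# Read one sample ID per line; -r keeps backslashes literal and
# IFS= preserves leading and trailing whitespace
while IFS= read -r ID; do
    OUTPUT="${ID}.cleaned.fastq"

    if [[ -s "$OUTPUT" ]]; then
        echo "${OUTPUT} exists and is non-empty, skipping"
        SKIPPED=$((SKIPPED + 1))
    else
        echo "Running cleaning step and generating ${OUTPUT}"
        TO_RUN=$((TO_RUN + 1))
    fi
done < samples

echo "Skipped: $SKIPPED, to run: $TO_RUN"
```

Printing a short count at the end is a cheap way to confirm that the number of processed samples matches what you expect.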

`break`: Stopping a loop if something is wrong

Sometimes you want to stop the entire loop if something is wrong, rather than skipping a sample. This is useful when a missing or empty file indicates a problem in a previous step that needs to be resolved before continuing.

For this example, consider the following case: you performed the quality cleaning step and want to proceed with the next step, assembling the reads into contigs, i.e. into longer stretches of DNA. Maybe something went wrong during the quality cleaning and some of your cleaned read files are empty. Here, you might want to stop the pipeline, since the assembly is computationally expensive, and get a useful error message that allows you to troubleshoot before trying again.

The assembly process typically needs two inputs, called the forward and reverse reads. The details of these files are not important here, but this is a good use case for the OR operator ||, which lets us test whether the forward or the reverse file is missing. We will also use a new keyword, break, which stops the loop if an issue occurs.

# Only generate dummy files for s1 to simulate a partially completed pipeline
# Forward and reverse reads are typically input files for many computation tasks
echo "some content" > s1_forward_cleaned.fastq
echo "some content" > s1_reverse_cleaned.fastq

# Run the loop
for ID in $(cat samples); do
    CLEANED_READS_F="${ID}_forward_cleaned.fastq"
    CLEANED_READS_R="${ID}_reverse_cleaned.fastq"

    echo "Processing $ID"
    echo "---------------"

    if [[ ! -s "$CLEANED_READS_F" || ! -s "$CLEANED_READS_R" ]]; then
        echo "ERROR: ${CLEANED_READS_F} or ${CLEANED_READS_R} is missing or empty. Stopping pipeline."
        break
    fi

    echo -e "Both input files exist and are non-empty. Running assembly for $ID\n"
done

echo -e "\nScript finished"

We see:

Processing s1
---------------
Both input files exist and are non-empty. Running assembly for s1

Processing s2
---------------
ERROR: s2_forward_cleaned.fastq or s2_reverse_cleaned.fastq is missing or empty. Stopping pipeline.

Script finished

So the loop now stops as soon as one of the s2 files is missing, and s3 and s4 are never processed.

`exit 1`: Stopping a script if something is wrong

Important for longer scripts: break stops the loop but does not stop the script. Any commands written after the done keyword will still be executed, even if break was triggered. To stop the entire script instead, replace break with exit 1 immediately after the error message. The 1 signals to the system that the script ended due to an error (exit code 0 means success):

for ID in $(cat samples); do
    CLEANED_READS_F="${ID}_forward_cleaned.fastq"
    CLEANED_READS_R="${ID}_reverse_cleaned.fastq"

    echo "Processing $ID"
    echo "---------------"

    if [[ ! -s "$CLEANED_READS_F" || ! -s "$CLEANED_READS_R" ]]; then
        echo "ERROR: ${CLEANED_READS_F} or ${CLEANED_READS_R} is missing or empty. Stopping pipeline."
        exit 1
    fi

    echo -e "Both input files exist and are non-empty. Running assembly for $ID\n"
done

echo -e "\nScript finished" # This line is never reached because exit 1 stops the script

You will now see:

Processing s1
---------------
Both input files exist and are non-empty. Running assembly for s1

Processing s2
---------------
ERROR: s2_forward_cleaned.fastq or s2_reverse_cleaned.fastq is missing or empty. Stopping pipeline.
logout

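If you paste this code directly into an interactive terminal, exit 1 ends the shell session itself. The logout line is printed when the session ends and is not part of the script output; depending on your terminal, you may be able to press enter to restart it. In a SLURM job script, exit 1 will not produce the logout line. Instead, the script will simply stop, and the job will appear as FAILED when checked, for example, with sacct. Note that echo writes to standard output, so the error message will appear in the SLURM output file. It only appears in a separate SLURM error file if you redirect the message to standard error (for example by adding >&2 to the echo command) and have configured a separate error file for the job.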
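To inspect the exit code without closing your own terminal, one option is to run the check in a subshell, written with parentheses, and read $? afterwards. The sketch below also sends the error message to standard error with >&2. The function check_inputs, the variable STATUS and the file names are hypothetical:

```bash
# Hypothetical helper: exit with code 1 if either input is missing or empty
check_inputs() {
    if [[ ! -s "$1" || ! -s "$2" ]]; then
        # >&2 writes the message to standard error instead of standard output
        echo "ERROR: $1 or $2 is missing or empty. Stopping pipeline." >&2
        exit 1
    fi
    echo "Both input files exist and are non-empty."
}

# The parentheses start a subshell, so exit 1 ends only the subshell
# and not the terminal session. STATUS records its exit code.
STATUS=0
( check_inputs no_such_forward.fastq no_such_reverse.fastq ) || STATUS=$?

echo "Exit code: $STATUS"
```

Assuming the two files do not exist, this prints the error message followed by Exit code: 1, and your terminal stays open.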

`continue` keyword

The above works, but we might want a different behavior: run the assembly on every sample whose input files exist, and print a clear error message for all other samples. The continue keyword does just that. Let’s first see what happens if we simply remove the break keyword:

# Create dummy files for every remaining sample except s2 to illustrate the process
echo "some content" > s3_forward_cleaned.fastq
echo "some content" > s3_reverse_cleaned.fastq
echo "some content" > s4_forward_cleaned.fastq
echo "some content" > s4_reverse_cleaned.fastq

# Run the loop
for ID in $(cat samples); do
    CLEANED_READS_F="${ID}_forward_cleaned.fastq"
    CLEANED_READS_R="${ID}_reverse_cleaned.fastq"

    echo "Processing $ID"
    echo "---------------"

    if [[ ! -s "$CLEANED_READS_F" || ! -s "$CLEANED_READS_R" ]]; then
        echo "ERROR: ${CLEANED_READS_F} or ${CLEANED_READS_R} is missing or empty. Skipping sample."
    fi

    echo -e "Both input files exist and are non-empty. Running assembly for $ID\n"
done

This prints:

Processing s1
---------------
Both input files exist and are non-empty. Running assembly for s1

Processing s2
---------------
ERROR: s2_forward_cleaned.fastq or s2_reverse_cleaned.fastq is missing or empty. Skipping sample.
Both input files exist and are non-empty. Running assembly for s2

Processing s3
---------------
Both input files exist and are non-empty. Running assembly for s3

Processing s4
---------------
Both input files exist and are non-empty. Running assembly for s4

We work through all four samples, but looking closely at what happens with s2, we see that the script:

Acknowledges that files are missing
Still tries to run the assembly, which in a real example would produce an error

By using continue we can skip the rest of the loop body for a sample where something is missing and move straight on to the next sample:

# Run the loop
for ID in $(cat samples); do
    CLEANED_READS_F="${ID}_forward_cleaned.fastq"
    CLEANED_READS_R="${ID}_reverse_cleaned.fastq"

    echo "Processing $ID"
    echo "---------------"

    if [[ ! -s "$CLEANED_READS_F" || ! -s "$CLEANED_READS_R" ]]; then
        echo "ERROR: ${CLEANED_READS_F} or ${CLEANED_READS_R} is missing or empty. Skipping sample."
        continue
    fi

    echo -e "Both input files exist and are non-empty. Running assembly for $ID\n"
done

Returns:

Processing s1
---------------
Both input files exist and are non-empty. Running assembly for s1

Processing s2
---------------
ERROR: s2_forward_cleaned.fastq or s2_reverse_cleaned.fastq is missing or empty. Skipping sample.
Processing s3
---------------
Both input files exist and are non-empty. Running assembly for s3

Processing s4
---------------
Both input files exist and are non-empty. Running assembly for s4

Now we see that no assembly is attempted for s2, which is exactly what we want.

Important: if you use this inside a SLURM script, the message printed by echo is written to the SLURM output file, or to the SLURM error file if you redirect it to standard error with >&2. Make sure to always read these files to catch any issues with your script.

Alternative approach: The same behavior can be written using if/else, which makes both the error and the action branch explicitly visible. This can be easier to read when the loop body is long or when sharing code with others who are less familiar with continue.

for ID in $(cat samples); do
    CLEANED_READS_F="${ID}_forward_cleaned.fastq"
    CLEANED_READS_R="${ID}_reverse_cleaned.fastq"

    echo "Processing $ID"
    echo "---------------"

    if [[ ! -s "$CLEANED_READS_F" || ! -s "$CLEANED_READS_R" ]]; then
        echo "ERROR: files missing or empty for ${ID}"
    else
        echo -e "Running assembly for $ID\n"
    fi
done
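Whichever variant you choose, it can help to record which samples were skipped and to report them once the loop has finished, so that a missing sample is hard to overlook in a long log. This is a minimal sketch that recreates the example files so it runs on its own; the array FAILED is a hypothetical addition:

```bash
# Recreate the example inputs: every sample except s2 has both files
printf "s1\ns2\ns3\ns4\n" > samples
for ID in s1 s3 s4; do
    echo "some content" > "${ID}_forward_cleaned.fastq"
    echo "some content" > "${ID}_reverse_cleaned.fastq"
done

FAILED=()   # sample IDs with missing or empty inputs

for ID in $(cat samples); do
    CLEANED_READS_F="${ID}_forward_cleaned.fastq"
    CLEANED_READS_R="${ID}_reverse_cleaned.fastq"

    if [[ ! -s "$CLEANED_READS_F" || ! -s "$CLEANED_READS_R" ]]; then
        echo "ERROR: files missing or empty for ${ID}" >&2
        FAILED+=("$ID")
    else
        echo "Running assembly for $ID"
    fi
done

# Summarise at the end so skipped samples cannot go unnoticed
if [[ "${#FAILED[@]}" -gt 0 ]]; then
    echo "Samples skipped: ${FAILED[*]}"
else
    echo "All samples processed"
fi
```

With the files created above, the summary line reports s2 as the only skipped sample.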

Summary

You have now learned how to write code that handles missing input files more explicitly:

Use -f to check whether a file exists and is a regular file, and -s to check whether a file exists and is non-empty. Prefer [[ over [ when writing conditions in bash scripts
Use break to stop a loop entirely when a missing or empty file indicates a problem that needs to be resolved before continuing, and exit 1 to stop the whole script
Use continue to skip the current sample and proceed to the next iteration when a file is missing. Note that continue is only meaningful when there is code after the if block that should be skipped. Without such code, the loop advances to the next iteration anyway, making continue redundant
Use if/else as a readable alternative to continue, particularly for longer loop bodies or when sharing code with others

A note on GenAI-generated code

Code generated by GenAI tools will often include file verification patterns. While these patterns are not wrong, they are frequently written with production pipelines in mind rather than scientific workflows. Concretely, watch out for:

Silent skipping of samples using continue without a clear error message, meaning a sample can disappear from your results without any warning
File existence checks (-f) where an integrity check would be more appropriate. A file can exist and be non-empty but still be corrupt or truncated, for example after a SLURM timeout
Defensive patterns like || continue placed directly in command chains where they are difficult to read and harder to reason about. Avoid these and use the patterns you have seen above.
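As one illustration of a check that goes beyond existence, the sketch below tests whether the line count of an uncompressed FASTQ file is a multiple of four, on the assumption that each record spans exactly four lines. This is only a heuristic and not a full validation: a file cut off exactly at a record boundary would still pass, and dedicated validation tools may be more appropriate for real data. The helper fastq_looks_complete and the demo file names are hypothetical:

```bash
# Hypothetical helper: succeed if the file is non-empty and its line
# count is a positive multiple of 4 (assumes 4-line, uncompressed records)
fastq_looks_complete() {
    local n_lines
    [[ -s "$1" ]] || return 1
    n_lines=$(wc -l < "$1")
    (( n_lines > 0 && n_lines % 4 == 0 ))
}

# One complete record and one truncated record for illustration
printf "@read1\nACGT\n+\nIIII\n" > complete_demo.fastq
printf "@read1\nACGT\n+\n" > truncated_demo.fastq

for FILE in complete_demo.fastq truncated_demo.fastq; do
    if fastq_looks_complete "$FILE"; then
        echo "$FILE: line count is consistent with complete records"
    else
        echo "WARNING: $FILE may be truncated or empty"
    fi
done

# Remove the demo files again
rm complete_demo.fastq truncated_demo.fastq
```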

When reviewing GenAI-generated code, always ask: if this check fails, what actually happens to my data? A script that runs without errors is not the same as a script that produced correct results.