Autocycler

Introduction

Autocycler is a tool for generating consensus long-read assemblies for microbial genomes (Wick 2025). It is the successor to Trycyler, and for most users it is recommended to Autocycler over Trycycler. If you want to use Autocycler, keep in mind that the input assemblies need to mostly be complete: one sequence per piece of DNA in the genome. Therefore, its best used for microbial genomes (archaea, bacteria, mitochondria, chloroplast).

For installation instructions, usage, deeper explanations and more, head over to the Autocycler wiki!

Installation

Installed on Crunchomics: Yes,

  • Autocycler v0.1.2 is installed as part of the bioinformatics share. If you have access to Crunchomics and have not yet access to the bioinformatics share, then you can send an email with your Uva netID to Nina Dombrowski, n.dombrowski@uva.nl.
  • Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command):
conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/

If you want to install it yourself, you can run the following to install the tool from source:

# Change to directory you want to download the github repository to
cd software/

# Create an empty conda env 
mamba create -p /zfs/omics/projects/bioinformatics/software/miniconda3/envs/autocycler_0.1.2 
conda activate autocycler_0.1.2

# Install autocycler
# This will build an executable in ``target/release/autocycler``
git clone https://github.com/rrwick/Autocycler.git
cd Autocycler
cargo build --release 

# Install various different assemblers 
# for this first remove the name field to install it in our empty conda environment we setup before
# ensure that you run mamba env update only after you activated the autocycler_0.1.2 environment
grep -v "name" scripts/environment.yml > scripts/environment_edit.yml
mamba env update --file scripts/environment_edit.yml

# Optional but recommended: Install GNU parallel
mamba install conda-forge::parallel 

# Check that all packages exist
conda list 

# Copy scripts to conda bin folder to be able to access them without having to write out the path path
cp target/release/autocycler "$CONDA_PREFIX"/bin/
cp scripts/*.py scripts/*.sh "$CONDA_PREFIX"/bin/

Usage

Single sample, one assembly at a time

If you want to know what is happening at each step, please visit the tools wiki.

# Activate the right conda environment
conda activate autocycler_0.1.2

# Go to the project directory in which you also have the sequencing data
cd autocyler_analysis

# Define necessary input variables
# The genome size can also be set manually if you know the value
reads=data/my_data.fastq.gz
threads=8  
genome_size=$(genome_size_raven.sh "$reads" "$threads") 

# View estimated genome size:
# In this example, we have an estimated genome size of 4,497,636 bp
echo $genome_size

# Step 1: subsample the long-read set into multiple files. This generates 4 sub-sampled read sets by default
# Make sure to record some of the useful statistics such as: used default for min_read_depth 25, input fastq read count: 213459 and N50 length: 3760 bp; Total read depth: 131.5x; Mean read length: 2770 bp
autocycler subsample --reads "$reads" --out_dir subsampled_reads --genome_size "$genome_size"

# Step 2: assemble each of the 4 subsampled files with 6 different assemblers (this can take a bit, for suggestions to speed this up, see below)
# Adjust in case you could not install all assemblers or don't want to use any of the listed assemblers
mkdir assemblies
for assembler in canu flye miniasm necat nextdenovo raven; do
    for i in 01 02 03 04; do
        srun --cpus-per-task $threads --mem=50G \
            "$assembler".sh subsampled_reads/sample_"$i".fastq assemblies/"$assembler"_"$i" "$threads" "$genome_size"
    done
done

# Sanity check: Count number of contigs/assembly
grep -c ">" assemblies/*fasta

# Optional step: remove the subsampled reads to save space
rm subsampled_reads/*.fastq

# Step 3: compress the input assemblies into a unitig graph
autocycler compress -i assemblies -a autocycler_out

# Step 4: cluster the input contigs into putative genomic sequences
autocycler cluster -a autocycler_out

# Steps 5 and 6: trim and resolve each QC-pass cluster
for c in autocycler_out/clustering/qc_pass/cluster_*; do
    autocycler trim -c "$c"
    autocycler resolve -c "$c"
done

# Step 7: combine resolved clusters into a final assembly
# Record assembly statistics: 1 unitig, 1 link, total length:  4491993 bp
# The final consensus assembly will be named: autocycler_out/consensus_assembly.fasta
autocycler combine -a autocycler_out -i autocycler_out/clustering/qc_pass/cluster_*/5_final.gfa

# Optional: generate a TSV line from the various metrix
autocycler table > metrics.tsv
autocycler table -a autocycler_out >> metrics.tsv

Single sample, several assemblies run in parallel

To speed things up we can run the assembly (Step 2), which is the most time intensive step, with GNU parallel. This allows us to run several assemblies at the same time.

# Define our input variables
# Ensure that this works with your computer specs, i.e. here we run 4 jobs in parallel each with 8 cpus
# So here we need to have 4x8 = 24 threads available for things to run
jobs=4
threads=8 

mkdir -p assemblies
rm -f assemblies/jobs.txt

for assembler in canu flye miniasm necat nextdenovo raven; do
    for i in 01 02 03 04; do
        echo "srun --cpus-per-task $threads --mem=50G $assembler.sh subsampled_reads/sample_$i.fastq assemblies/${assembler}_$i $threads $genome_size" >> assemblies/jobs.txt
    done
done

parallel --jobs "$jobs" --joblog assemblies/joblog.txt --results assemblies/logs < assemblies/jobs.txt

Several samples run in parallel

To run Autocycler on more than one genome you can run the code above with either a for loop or with a job scheduler, such as SLURM. To do so, we provide two bash scripts.

For these to work ensure that both Autocycler and GNU parallel are installed on your system (when running this outside of Crunchomics).

Version 1: Without a job scheduler

To run Autocycler on several genomes when you have no job scheduler available you can use autocycler_bash.sh. This scripts requires the following arguments to run:

  • -d: The path to the folder with the raw read in .fastq.gz format. Note, that the name of the file will be used as the sample name throughout the script, so name those files accordingly.
  • -t: The number of threads to use
  • -m: The minimum number of CPUs per assembly. These assemblies will be run in parallel based on that number so set this accordingly. For example, if you use a total of 16 threads and want to run at least four assemblies in parallel you would set this to 4.

The results of this script will be generated in the results folder which will contain a sub-directory for each genome. The final assembly will be found in results/<Genome_name>/autocycler_out/consensus_assembly.fasta.

cd project_dir

# Make the bash script executable (needs to be run only once if you run this outside of Crunchomics)
chmod +x scripts/autocycler_bash.sh 

# Run Autocycler on several genomes
conda activate autocycler_0.1.2

autocycler_bash.sh -d data -t 32 -m 8

# Generate metrics file
autocycler table > metrics.tsv 

for i in data/*fastq.gz; do
    sample=$(basename $i .fastq.gz)
    echo $sample
    autocycler table -a results/${sample}/autocycler_out -n ${sample} >> metrics.tsv  # append a TSV row
done

conda deactivate

# Optional: The scripts generates a log file for each genome in the log folder, we can extract some key information as follows 
for i in logs/*log; do 
  sample=$(basename $i .log)
  
  echo -e "\nInformation on Assembly $sample"
  echo "----------------------------------------"
  echo "Estimated $(grep "Genome size for" $i | uniq)"
  
  echo -e "\nInformation on Input FastQ for $sample"
  grep "Read count:" $i | sed 's/^[ \t]*//'
  grep "Read bases:" $i | sed 's/^[ \t]*//'
  grep "Read N50 length:" $i | sed 's/^[ \t]*//'
  grep "Total read depth:" $i
  grep "Mean read length:" $i
  grep "reads per subset" $i | sed 's/^[ \t]*//'

  echo -e "\nInformation on the assembled genome for $sample"
  grep -E "[0-9]* unitig, [0-9]* link"  $i | tail -n1 | sed 's/^[ \t]*//'
  grep "total length:" logs/Genome1.log | tail -n1 | sed 's/^[ \t]*//'
  grep "fully resolv" $i
done 
Version 2: With the SLURM scheduler

If you want to run Autocycler on an HPC with SLURM, such as Crunchomics, you can use autocycler_array.sh which is available on Crunchomics at /zfs/omics/projects/bioinformatics/scripts/ or can be downloaded here.

This is the recommended option when working with many genomes, since some assemblers, such as CANU, can take some time to finish.

Before running the script, you should open the script once and adjust the following:

  1. #SBATCH --cpus-per-task=32: Provide the number of threads to use, adjust based on what the system has available.
  2. #SBATCH --array=1-2: Define the job array size, i.e. the number of genomes to run this script on. Set the second number to the total number of genomes that you want to analyse (here: 2). If you have many genomes to analyse it is recommended to not start all at once to leave resources for others. For example, if you have 100 genomes and want to start them in batches of five, you would set this to #SBATCH --array=1-100%5
  3. data_folder="data": Provide the path to the folder that contains the assemblies (here: data). Note, that the name of the file will be used as the sample name throughout the script, so name those files accordingly.
  4. MIN_CPUS_PER_ASSEMBLY=8. Provide the minimum number of CPUs per assembly. These assemblies will be run in parallel based on that number so set this accordingly. For example, if you use a total of 32 threads and want to run at least four assemblies in parallel you would set this to 8.
  5. conda activate autocycler_0.1.2: Defines the name of the conda environment that has autocycler and GNU parallel installed. If you have these tools installed outside of a conda environment you can delete this line.

The results of this script will be generated in the results folder which will contain a sub-directory for each genome. The final assembly will be found in results/<Genome_name>/autocycler_out/consensus_assembly.fasta.

cd project_dir

# Run Autocycler on several genomes using SLURM
sbatch scripts/autocycler_array.sh

# Generate metrics file
conda activate autocycler_0.1.2

autocycler table > metrics.tsv 

for i in data/*fastq.gz; do
    sample=$(basename $i .fastq.gz)
    echo $sample
    autocycler table -a results/${sample}/autocycler_out -n ${sample} >> metrics.tsv  # append a TSV row
done

conda deactivate

# Optional: The scripts generates a log file for each genome in the log folder, we can extract some key information as follows 
for i in logs/*log; do 
  sample=$(basename $i .log)
  
  echo -e "\nInformation on Assembly $sample"
  echo "----------------------------------------"
  echo "Estimated $(grep "Genome size for" $i | uniq)"
  
  echo -e "\nInformation on Input FastQ for $sample"
  grep "Read count:" $i | sed 's/^[ \t]*//'
  grep "Read bases:" $i | sed 's/^[ \t]*//'
  grep "Read N50 length:" $i | sed 's/^[ \t]*//'
  grep "Total read depth:" $i
  grep "Mean read length:" $i
  grep "reads per subset" $i | sed 's/^[ \t]*//'

  echo -e "\nInformation on the assembled genome for $sample"
  grep -E "[0-9]* unitig, [0-9]* link"  $i | tail -n1 | sed 's/^[ \t]*//'
  grep "total length:" logs/Genome1.log | tail -n1 | sed 's/^[ \t]*//'
  grep "fully resolv" $i
done 

Things to keep in mind

  • When to use autocycler: For Autocycler to work, the input assemblies need to mostly be complete: one sequence per piece of DNA in the genome. Therefore, its best used for microbial genomes (archaea, bacteria, mitochondria, chloroplast). However, if you sequence eukaryotes and if T2T assemblies are possible, then Autocycler should work as well.
  • Polishing: Since Autocycler assemblies are long-read-only, they may still contain errors. If assembling Oxford Nanopore reads you can consider polishing with for example Medaka. If you also have short reads Polypolish and Pypolca are options to consider
  • Genome orientation: Autocycler does not rotate circular sequences to start at a particular gene (e.g. dnaA). To do this, check out Dnaapler.
  • Assessing the assembly
    • The Autocycler combine command produces a final assembly by combining all of the clusters. Hopefully, each cluster resolved to a single contig, in which case it will print this at the end of its stderr output: Consensus assembly is fully resolved. The Metrics page describes all of the assembly metrics generated by Autocycler, but for assessment purposes, the most useful is likely the consensus_assembly_fully_resolved metric, which can be true or false. You can find this metric in the consensus_assembly.yaml file made by Autocycler combine in your Autocycler output directory.
    • If your assembly went poorly you can consider doing the following:
      • Try using other assemblers
      • Try different parameters when making your input assemblies. Some assemblers (e.g. Canu) have a large number of parameters that can influence the result.
      • Manually curate your input assemblies before using them with Autocycler. Specifically, discard any assemblies that appear to be incomplete.
      • If none of the above work well, then your read set is likely insufficient, in which case you may need to sequence again aiming for deeper and longer reads.

References

Wick, Ryan. 2025. Rrwick/Autocycler. https://github.com/rrwick/Autocycler.