BUSCO

Introduction

BUSCO can be used to evaluate the quality of several data types: genome assemblies of single isolates, assembled transcriptomes, annotated gene sets, and metagenome-assembled genomes for which the taxonomic origin of the species is unknown (Manni et al. 2021).

BUSCO provides a quantitative assessment of the completeness in terms of expected gene content of a genome assembly, transcriptome, or annotated gene set. The results are simplified into categories of Complete and single-copy, Complete and duplicated, Fragmented, or Missing BUSCOs, where “BUSCOs” is shorthand for “BUSCO marker genes”.

For more information, see the user guide; for updates, check the tool's GitLab repository.

Installation

Installed on Crunchomics: Yes

  • BUSCO v5.8.0 is installed as part of the bioinformatics share. If you have access to Crunchomics but do not yet have access to the bioinformatics share, send an email with your UvA netID to Nina Dombrowski, n.dombrowski@uva.nl.
  • Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command):
conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/

If you want to install it yourself, you can run:

mamba create -n busco_5.8.0 -c bioconda -c conda-forge busco=5.8.0

Usage

# List available lineages
# Note, BUSCO will automatically download the requested dataset if it is not already present
busco --list-datasets

# Run BUSCO and automatically select the closest lineage on a eukaryotic genome using 20 cores
busco -i genome.fasta -m genome \
    -c 20 --out results/busco/ \
    --auto-lineage-euk

# Generate a plot with your results
generate_plot.py -wd results/busco/
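
The four categories described in the introduction are reported by BUSCO as a compact one-line score string in its short summary text files. The sketch below shows one way to pull individual percentages out of such a string with sed. The helper name busco_field and the example string are made up for illustration, and the string format is assumed to follow the usual C:..%[S:..%,D:..%],F:..%,M:..%,n:.. layout.

```shell
# Print the percentage that follows one category letter in a BUSCO score string.
# $1 = category letter (C, S, D, F or M), $2 = score string
busco_field() {
    printf '%s\n' "$2" | sed -n "s/.*$1:\([0-9.]*\)%.*/\1/p"
}

# Invented example string, not real output.
summary='C:95.0%[S:93.0%,D:2.0%],F:3.0%,M:2.0%,n:255'

# Print one tab-separated line per category.
for letter in C S D F M; do
    printf '%s\t%s\n' "$letter" "$(busco_field "$letter" "$summary")"
done
```

To use this on a real run, replace the example string with the score line from your own summary file, for instance by grepping for the line that contains "C:".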

Please note:

BUSCO completeness results make sense only in the context of the biology of your organism. You have to understand whether missing or duplicated genes are of biological or technical origin. For instance, a high level of duplication may be explained by a recent whole-genome duplication event (biological) or a chimeric assembly of haplotypes (technical). Transcriptomes and protein sets that are not filtered for isoforms will lead to a high proportion of duplicates. Therefore, you should filter them before a BUSCO analysis. Finally, focusing on specific tissues or specific life stages and conditions in a transcriptomic experiment is unlikely to produce a BUSCO-complete transcriptome. In this case, consistency across your samples is what you will be aiming for.
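
One possible way to filter isoforms is to keep only the longest sequence per gene. The sketch below is not part of BUSCO. It assumes that isoforms of the same gene share a header ID and differ only in a trailing .tN suffix (for example >geneA.t1 and >geneA.t2); the function name longest_isoform is hypothetical, and the suffix pattern needs to be adapted to the naming scheme of your own files.

```shell
# Keep the longest sequence per gene from a FASTA file.
# ASSUMPTION: isoform IDs look like geneA.t1, geneA.t2, ... (gene ID + ".tN").
# $1 = input FASTA; output is written to stdout with unwrapped sequences.
longest_isoform() {
    awk '
        # Store the record just read if it is the longest seen for its gene.
        function keep_longest() {
            if (id != "" && (!(gene in best_len) || length(seq) > best_len[gene])) {
                best_len[gene] = length(seq)
                best_hdr[gene] = hdr
                best_seq[gene] = seq
            }
        }
        /^>/ {
            keep_longest()
            hdr = $0
            id = substr($1, 2)            # ID without the leading ">"
            gene = id
            sub(/\.t[0-9]+$/, "", gene)   # strip the assumed isoform suffix
            if (!(gene in seen)) { seen[gene] = 1; order[++n] = gene }
            seq = ""
            next
        }
        { seq = seq $0 }                  # join wrapped sequence lines
        END {
            keep_longest()
            # Print genes in the order they first appeared.
            for (i = 1; i <= n; i++) {
                print best_hdr[order[i]]
                print best_seq[order[i]]
            }
        }
    ' "$1"
}
```

With a placeholder input file, this could be called as longest_isoform proteins.faa > proteins.longest.faa, and the filtered file then passed to BUSCO with -i.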

Mandatory parameters:

  • -i SEQUENCE_FILE, --in SEQUENCE_FILE Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set. Also possible to use a path to a directory containing multiple input files.
  • -o OUTPUT, --out OUTPUT Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. The path to the output folder is set with --out_path.
  • -m MODE, --mode MODE Specify which BUSCO analysis mode to run. There are three valid modes:
    • geno or genome, for genome assemblies (DNA)
    • tran or transcriptome, for transcriptome assemblies (DNA)
    • prot or proteins, for annotated gene sets (protein)

Recommended parameters:

  • -l LINEAGE, --lineage_dataset LINEAGE Specify the name of the BUSCO lineage to be used, e.g. hymenoptera_odb10. A full list of available datasets can be viewed by entering busco --list-datasets. You should always select the dataset that is most closely related to the assembly or gene set you are assessing. If you are unsure, you can use the --auto-lineage option to automatically select the most appropriate dataset. BUSCO will automatically download the requested dataset if it is not already present in the download folder. You can optionally provide a path to a local dataset instead of a name, e.g. -l /path/to/my/dataset.
  • -c N, --cpu N Specify the number (N=integer) of threads/cores to use.
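
As a sketch of how the recommended parameters fit together, the snippet below runs BUSCO on a protein set with an explicitly chosen lineage. The file name proteins.faa and the run name busco_proteins are placeholders. Reading the thread count from SLURM_CPUS_PER_TASK assumes the command runs inside a SLURM job; outside of one it falls back to 4.

```shell
# Use the cores allocated by SLURM if available, otherwise default to 4.
threads="${SLURM_CPUS_PER_TASK:-4}"

# Only start the run if busco is on the PATH and the placeholder input exists.
if command -v busco >/dev/null 2>&1 && [ -f proteins.faa ]; then
    busco -i proteins.faa -m proteins \
        -l hymenoptera_odb10 \
        -c "$threads" -o busco_proteins
else
    echo "busco or proteins.faa not found; nothing was run" >&2
fi
```

Matching -c to the number of allocated cores avoids requesting more threads than the job was given.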

Optional parameters:

  • --augustus Use augustus gene predictor for eukaryote runs
  • --augustus_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2" Pass additional arguments to Augustus. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
  • --augustus_species AUGUSTUS_SPECIES Specify a species for Augustus training.
  • --auto-lineage Run auto-lineage to find optimum lineage path
  • --auto-lineage-euk Run auto-placement just on eukaryote tree to find optimum lineage path
  • --auto-lineage-prok Run auto-lineage just on non-eukaryote trees to find optimum lineage path
  • --contig_break n Number of contiguous Ns to signify a break between contigs. Default is n=10.
  • --download [dataset ...] Download dataset. Possible values are a specific dataset name, “all”, “prokaryota”, “eukaryota”, or “virus”. If used together with other command line arguments, make sure to place this last.
  • --download_base_url DOWNLOAD_BASE_URL Set the url to the remote BUSCO dataset location
  • --download_path DOWNLOAD_PATH Specify local filepath for storing BUSCO dataset downloads
  • -e N, --evalue N E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)
  • -f, --force Force rewriting of existing files. Must be used when output files with the provided name already exist.
  • --limit N How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
  • --list-datasets Print the list of available BUSCO datasets
  • --long Optimization Augustus self-training mode (Default: Off); adds considerably to the run time, but can improve results for some non-model organisms
  • --metaeuk Use Metaeuk gene predictor
  • --metaeuk_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2" Pass additional arguments to Metaeuk for the first run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
  • --metaeuk_rerun_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2" Pass additional arguments to Metaeuk for the second run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
  • --miniprot Use Miniprot gene predictor
  • --skip_bbtools Skip BBTools for assembly statistics
  • --offline To indicate that BUSCO cannot attempt to download files
  • --opt-out-run-stats Opt out of data collection. Information on the data collected is available in the user guide.
  • -q, --quiet Disable the info logs, displays only errors
  • -r, --restart Continue a run that had already partially completed.
  • --scaffold_composition Writes ACGTN content per scaffold to a file scaffold_composition.txt
  • --tar Compress some subdirectories with many files to save space
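
Several of these options can be combined for machines without internet access. The sketch below first downloads a lineage to a fixed folder and then reuses that local copy with --offline. The folder ~/busco_downloads, the input genome.fasta, the lineage bacteria_odb10, and the run name busco_offline are placeholders chosen for illustration.

```shell
# Placeholder location for the downloaded lineage data.
db_dir="$HOME/busco_downloads"

if command -v busco >/dev/null 2>&1 && [ -f genome.fasta ]; then
    # Step 1 (needs internet): fetch the lineage once; --download is placed last.
    busco --download_path "$db_dir" --download bacteria_odb10

    # Step 2 (no internet needed): run against the local copy.
    busco -i genome.fasta -m genome -l bacteria_odb10 -c 8 \
        --offline --download_path "$db_dir" -o busco_offline
else
    echo "busco or genome.fasta not found; nothing was run" >&2
fi
```

In practice the two steps would usually run at different times or on different nodes: the download where internet access is available, the analysis wherever the data folder is visible.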

References

Manni, Mosè, Matthew R Berkeley, Mathieu Seppey, Felipe A Simão, and Evgeny M Zdobnov. 2021. “BUSCO Update: Novel and Streamlined Workflows Along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes.” Edited by Joanna Kelley. Molecular Biology and Evolution 38 (10): 4647–54. https://doi.org/10.1093/molbev/msab199.