conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/BUSCO
Introduction
BUSCO can be used to evaluated the quality of different data types that can range from genome assemblies of single isolates and assembled transcriptomes and annotated gene sets to metagenome-assembled genomes where the taxonomic origin of the species is unknown (Manni et al. 2021).
BUSCO provides a quantitative assessment of the completeness in terms of expected gene content of a genome assembly, transcriptome, or annotated gene set. The results are simplified into categories of Complete and single-copy, Complete and duplicated, Fragmented, or Missing BUSCOs, where “BUSCOs” is shorthand for “BUSCO marker genes”.
For more information, check out the User guide and for updates check the tools Gitlab.
Installation
Installed on crunchomics: Yes,
- BUSCO v5.8.0 is installed as part of the bioinformatics share. If you have access to Crunchomics and have not yet access to the bioinformatics share, then you can send an email with your Uva netID to Nina Dombrowski, n.dombrowski@uva.nl.
- Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command):
If you want to install it yourself, you can run:
mamba create -n busco_5.8.0 -c bioconda -c conda-forge busco=5.8.0Usage
# List available lineages
# Note, BUSCO will automatically download the requested dataset if it is not already present
busco --list-datasets
# Run BUSCO and automatically select the closest lineage on a eukaryotic genome using 20 cores
busco -i genome.fasta -m genome \
-c 20 --out results/busco/ \
--auto-lineage-euk
# Generate a plot with your results
generate_plot.py -wd results/busco/Please note:
BUSCO completeness results make sense only in the context of the biology of your organism. You have to understand whether missing or duplicated genes are of biological or technical origin. For instance, a high level of duplication may be explained by a recent whole duplication event (biological) or a chimeric assembly of haplotypes (technical). Transcriptomes and protein sets that are not filtered for isoforms will lead to a high proportion of duplicates. Therefore, you should filter them before a BUSCO analysis. Finally, focusing on specific tissues or specific life stages and conditions in a transcriptomic experiment is unlikely to produce a BUSCO-complete transcriptome. In this case, consistency across your samples is what you will be aiming for.
Mandatory parameters:
-i SEQUENCE_FILE,--in SEQUENCE_FILEInput sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set. Also possible to use a path to a directory containing multiple input files.-o OUTPUT,--out OUTPUTGive your analysis run a recognisable short name. Output folders and files will be labelled with this name. The path to the output folder is set with –out_path.-m MODE,--mode MODESpecify which BUSCO analysis mode to run. There are three valid modes:- geno or genome, for genome assemblies (DNA)
- tran or transcriptome, for transcriptome assemblies (DNA)
- prot or proteins, for annotated gene sets (protein)
Recommended parameters:
-l LINEAGE,--lineage_dataset LINEAGESpecify the name of the BUSCO lineage to be used, e.g.hymenoptera_odb10. A full list of available datasets can be viewed by enteringbusco --list-datasets. You should always select the dataset that is most closely related to the assembly or gene set you are assessing. If you are unsure, you can use the--auto-lineageoption to automatically select the most appropriate dataset. BUSCO will automatically download the requested dataset if it is not already present in the download folder. You can optionally provide a path to a local dataset instead of a name, e.g.-l /path/to/my/dataset.-c N,--cpu NSpecify the number (N=integer) of threads/cores to use.
Optional parameters
--augustusUse augustus gene predictor for eukaryote runs--augustus_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"Pass additional arguments to Augustus. All arguments should be contained within a single string with no white space, with each argument separated by a comma.--augustus_species AUGUSTUS_SPECIES Specify a species for Augustus training.--auto-lineageRun auto-lineage to find optimum lineage path--auto-lineage-eukRun auto-placement just on eukaryote tree to find optimum lineage path--auto-lineage-prokRun auto-lineage just on non-eukaryote trees to find optimum lineage path--contig_break nNumber of contiguous Ns to signify a break between contigs. Default is n=10.--download [dataset ...]Download dataset. Possible values are a specific dataset name, “all”, “prokaryota”, “eukaryota”, or “virus”. If used together with other command line arguments, make sure to place this last.--download_base_url DOWNLOAD_BASE_URL Set the url to the remote BUSCO dataset location--download_path DOWNLOAD_PATHSpecify local filepath for storing BUSCO dataset downloads-e N, --evalue NE-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)-f,--forceForce rewriting of existing files. Must be used when output files with the provided name already exist.--limit NHow many candidate regions (contig or transcript) to consider per BUSCO (default: 3)--list-datasetsPrint the list of available BUSCO datasets--longOptimization Augustus self-training mode (Default: Off); adds considerably to the run time, but can improve results for some non-model organisms--metaeukUse Metaeuk gene predictor--metaeuk_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"Pass additional arguments to Metaeuk for the first run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.--metaeuk_rerun_parameters“–PARAM1=VALUE1,–PARAM2=VALUE2” Pass additional arguments to Metaeuk for the second run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.--miniprotUse Miniprot gene predictor--skip_bbtoolsSkip BBTools for assembly statistics--offlineTo indicate that BUSCO cannot attempt to download files--opt-out-run-statsOpt out of data collection. Information on the data collected is available in the user guide.-q,--quietDisable the info logs, displays only errors-r,--restartContinue a run that had already partially completed.--scaffold_compositionWrites ACGTN content per scaffold to a file scaffold_composition.txt--tarCompress some subdirectories with many files to save space