QualiBact Results

What is QualiBact?

QualiBact is a set of thresholds for assessing the quality of bacterial genome assemblies. We have evaluated genomes against a range of metrics to help researchers identify high-quality genomes for downstream analysis. The thresholds described here are implemented in SpecCheck. Source code for this process is available at QualiBact.

Quick Links

📋 Methods - Detailed methodology and criteria
🦠 All Species - Complete list of analyzed species
📊 Summary Data - Main summary and criteria tables

Use the navigation menu above to explore:

Methods - Technical details about the analysis pipeline
All species - List of all species included here, with links to species-specific overviews
Summary page - The QC criteria and summary tables for all genera and species

Considerations for QualiBact

✅ General Strengths

The pipeline is fully automated, generic, and can be applied to any set of genomes — including arbitrary subsets such as species, clonal complexes, or lineages.
Quality assessment is based on multiple standard metrics (e.g. N50, number of contigs, genome size, GC%), allowing reproducible filtering.
Species-specific thresholds can be derived from available reference genomes, and thresholds can be updated as more genomes are added.
Variation between species — even within a genus — supports the need for species-level cutoffs, which this approach accommodates.
Variation between SRA and RefSeq: We have observed that genome size and assembly length distributions differ significantly between RefSeq and SRA (i.e. ATB). The cause is unclear, but relying on RefSeq-derived thresholds alone may unfairly exclude valid genomes. This approach combines both datasets to produce a more inclusive and representative set of thresholds.
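
To make the metrics named above concrete, here is a minimal Python sketch that computes assembly length, number of contigs, N50, and GC% from a list of contig sequences. It is an illustration only, not the QualiBact or SpecCheck implementation; the function name `assembly_metrics` and the returned keys are assumptions made for this example.

```python
def assembly_metrics(contigs):
    """Return basic QC metrics for a list of contig sequences (strings).

    Hypothetical helper for illustration; not part of QualiBact or SpecCheck.
    """
    # Ignore empty records and sort contig lengths from longest to shortest.
    lengths = sorted((len(c) for c in contigs if c), reverse=True)
    total = sum(lengths)
    if total == 0:
        raise ValueError("assembly contains no sequence")

    # N50: the length of the contig at which the running total, counted from
    # the longest contig down, first reaches half of the assembly length.
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    # GC% is computed over unambiguous bases only (A, C, G, T), so runs of N
    # do not distort the value.
    joined = "".join(contigs).upper()
    gc = joined.count("G") + joined.count("C")
    acgt = gc + joined.count("A") + joined.count("T")
    gc_percent = 100.0 * gc / acgt if acgt else 0.0

    return {
        "total_length": total,
        "num_contigs": len(lengths),
        "n50": n50,
        "gc_percent": gc_percent,
    }
```

In practice the contig list would come from parsing an assembly FASTA file. The resulting dictionary can then be compared against species-level cutoffs.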

⚠️ Caveats

Species Definitions Depend on GTDB: This tool uses Sylph for species designation, so all GTDB-related quirks apply. For example, Shigella spp. are included in E. coli, and there are issues separating Burkholderia mallei from Burkholderia pseudomallei, and Bordetella pertussis and Bordetella parapertussis from Bordetella bronchiseptica.
No Ground Truth Claims: This evaluation reflects what has been previously observed in available datasets. It does not attempt to define a universal "ground truth" for any species.
Assembly-Method Specific: The metrics (e.g. N50, number of contigs) are meaningful primarily for assemblies generated with Shovill (or similar SPAdes-based pipelines). Exact thresholds will vary for long-read assemblies or alternative assemblers such as SKESA. However, not using Shovill implies rejection of the Torstyverse, which is heresy.
Long-Read Assemblies Not Explicitly Handled: These cutoffs are not designed for long-read assemblies. That said, the genome size and GC content thresholds should still apply, and it is reasonable to expect long-read assemblies to meet or exceed the contiguity thresholds derived from short-read assemblies rather than fall below them.
Generic vs. Specific Tradeoff: While the generic approach is broadly applicable, it may miss species-specific quality nuances or lineage-level exceptions.
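
The caveats about assembly method can be sketched in code. The example below checks an assembly against a set of species-level cutoffs and, for long-read assemblies, applies only the genome size and GC content checks. Everything here is an assumption made for illustration: the `Thresholds` class, the `check_assembly` function, the metric keys, and the numeric values are placeholders, not QualiBact's published criteria or SpecCheck's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for one species' cutoffs."""
    min_length: int
    max_length: int
    min_gc: float
    max_gc: float
    min_n50: int
    max_contigs: int


def check_assembly(metrics, thresholds, long_read=False):
    """Return the names of the metrics that fall outside the thresholds.

    An empty list means the assembly passes every check that was applied.
    """
    failures = []

    # Genome size and GC content are expected to hold for any assembly method.
    if not thresholds.min_length <= metrics["total_length"] <= thresholds.max_length:
        failures.append("total_length")
    if not thresholds.min_gc <= metrics["gc_percent"] <= thresholds.max_gc:
        failures.append("gc_percent")

    # Contiguity cutoffs are derived from short-read assemblies, so this
    # sketch skips them when the assembly is flagged as long-read.
    if not long_read:
        if metrics["n50"] < thresholds.min_n50:
            failures.append("n50")
        if metrics["num_contigs"] > thresholds.max_contigs:
            failures.append("num_contigs")

    return failures


# Placeholder values for illustration only; these are NOT QualiBact thresholds.
example_thresholds = Thresholds(
    min_length=4_500_000,
    max_length=5_500_000,
    min_gc=50.0,
    max_gc=51.5,
    min_n50=20_000,
    max_contigs=500,
)
```

Returning the list of failed metrics, rather than a single pass/fail flag, makes it easier to see why a genome was excluded, which matters when a lineage legitimately sits near the edge of a species-level cutoff.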

Citation

If you use QualiBact, please cite the following:

Alikhan, NF. Species specific quality control of bacterial de novo genome assemblies using QualiBact. Available at: https://github.com/happykhan/qualibact (Accessed: [insert date]).