Contributing to QualiBact
We welcome contributions to QualiBact! Contributors who make major contributions to source code, manuscript development, metric validation and adoption, or who provide additional data for calibrating quality thresholds, will be granted authorship on publications. Pull requests are welcome through GitHub.
We Need More Data!
Currently, the QC thresholds are based on AllTheBacteria data, which used Shovill as the genome assembly software. We are aware that the choice of assembler will affect certain metrics (such as N50 and number of contigs). To address this limitation and improve the robustness of our quality thresholds, we are actively seeking contributions of:
- Genome assembly statistics
- Nucleotide composition data for genome assemblies
- Metadata describing assembly methods and parameters
- OPTIONAL: CheckM2 quality assessment results
The data you provide should be for genomes you consider to be of sufficient quality for tasks such as genotyping (e.g., cgMLST, nkST), antimicrobial resistance gene detection, and SNP-based phylogeny.
Data Requirements
Please provide the following files for each dataset:
- Nucleotide counts (TSV format, compressed as .xz or .gz).
- Metadata description TSV file with assembly/sequencing details.
- Genome assembly statistics TSV file with metrics such as N50.
- CheckM2 results (TSV format, compressed as .xz or .gz) - This is optional.
File Naming Convention
- Nucleotide counts:
nucleotide_counts_[dataset_name].tsv.xz
- Metadata:
metadata_[dataset_name].tsv.xz
- Assembly stats:
assembly_stats_[dataset_name].tsv.xz
- CheckM2 results:
checkm2_results_[dataset_name].tsv.xz
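Before emailing links, it can help to confirm that the files for a dataset follow the naming convention above. The following is a minimal sketch, not part of QualiBact: the function name `check_submission_files` and the assumption that all files sit in one directory are ours, and it accepts either `.xz` or `.gz` as the compression suffix.

```shell
#!/bin/bash
# Hypothetical helper (not part of QualiBact): check that the files for one
# dataset follow the naming convention. Assumes all files are in one directory.
check_submission_files() {
    local dir="$1" dataset="$2" prefix missing=0
    # Required files
    for prefix in nucleotide_counts metadata assembly_stats; do
        if [ ! -e "$dir/${prefix}_${dataset}.tsv.xz" ] &&
           [ ! -e "$dir/${prefix}_${dataset}.tsv.gz" ]; then
            echo "MISSING: ${prefix}_${dataset}.tsv.xz" >&2
            missing=1
        fi
    done
    # CheckM2 results are optional, so their absence is only reported
    if [ ! -e "$dir/checkm2_results_${dataset}.tsv.xz" ] &&
       [ ! -e "$dir/checkm2_results_${dataset}.tsv.gz" ]; then
        echo "NOTE: no CheckM2 results found (optional)" >&2
    fi
    return "$missing"
}
```

For example, `check_submission_files /path/to/files my_dataset` returns a non-zero exit status if any required file is missing.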
You can email links to these files to: nabil.alikhan@cgps.group
Metadata Requirements
Please include the following information in your metadata file:
- Assembly software/pipeline and version
- Assembly parameters used
- Sequencing platform and instrument
- Species name
- Data source and accession numbers (if applicable)
The result would be a TSV file like:
filename platform instrument species accession software version custom_parameters Notes
SAMN40089455.fa Illumina NextSeq550 Salmonella enterica SAMN40089455 SKESA 1.1 None SKESA, default parameters
SAMN40089456.fa Illumina NextSeq550 Escherichia coli SAMN40089456 SPAdes 2.3 --isolate SPAdes
my_reads_pri.fa ONT MinION Escherichia coli NA Flye 1.3 None None
The Filename should match what you have provided in the other files.
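A quick way to catch a malformed metadata file is to check that every row has the same number of tab-separated columns as the example header (nine). This is only a sketch under that assumption: the function name `check_metadata_columns` is hypothetical, and the file is assumed to be uncompressed.

```shell
#!/bin/bash
# Hypothetical helper: report rows of an uncompressed metadata TSV that do not
# have the nine tab-separated columns shown in the example header.
check_metadata_columns() {
    awk -F '\t' -v expected=9 '
        NF != expected {
            printf "line %d: %d columns, expected %d\n", NR, NF, expected
            bad = 1
        }
        END { exit bad + 0 }   # exit status 1 if any row was malformed
    ' "$1"
}
```

Running `check_metadata_columns metadata_my_dataset.tsv` prints one line per malformed row and returns a non-zero exit status if any were found.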
Genome assembly statistics
We require the following metrics:
| Metric | Description |
|---|---|
| total_length | Total number of base pairs across all contigs/scaffolds |
| number | Total number of contigs/scaffolds |
| mean_length | Average contig length (total_length / number) |
| longest | Length of the longest contig |
| shortest | Length of the shortest contig |
| N_count | Total number of ambiguous bases (N) in the assembly |
| Gaps | Number of separate gaps (usually inferred from stretches of Ns) |
| N50 | Contig length such that 50% of the assembly is in contigs ≥ this size |
| N50n | Number of contigs ≥ N50 in length |
| N70 | Contig length such that 70% of the assembly is in contigs ≥ this size |
| N70n | Number of contigs ≥ N70 in length |
| N90 | Contig length such that 90% of the assembly is in contigs ≥ this size |
| N90n | Number of contigs ≥ N90 in length |
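To make the N50/N70/N90 definitions concrete, here is a small worked sketch using the conventional definition: sort contigs from longest to shortest and accumulate lengths until the chosen percentage of the total is reached. The function name `nx_from_lengths` is hypothetical, and assembly-stats may handle edge cases such as ties differently, so please still use assembly-stats for the values you submit.

```shell
#!/bin/bash
# Illustrative only: compute Nx and Nxn from contig lengths read from stdin,
# one length per line. Usage: nx_from_lengths 50
nx_from_lengths() {
    local pct="${1:-50}"
    sort -rn | awk -v pct="$pct" '
        { len[NR] = $1; total += $1 }
        END {
            target = total * pct / 100
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                # The first contig at which the running total reaches the
                # target gives Nx; its rank in the sorted list gives Nxn.
                if (cum >= target) {
                    printf "N%d\t%d\tN%dn\t%d\n", pct, len[i], pct, i
                    exit
                }
            }
        }'
}

printf '%s\n' 100 80 60 40 20 | nx_from_lengths 50
```

With contig lengths of 100, 80, 60, 40 and 20 (total 300), half of the assembly is 150 bp. The two longest contigs sum to 180, so N50 is 80 and N50n is 2.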
DO NOT APPLY A MINIMUM CONTIG SIZE FILTER.
For consistency, please use assembly-stats as described by AllTheBacteria.
assembly-stats is available through conda and BioContainers; you can also download the source from GitHub and compile it yourself.
It is easy to use. The following command runs it on every file in the folder that matches the wildcard (the -t flag produces tab-delimited output):
assembly-stats -t /workdir/assembly/*.fa.gz
The file should look something like this:
filename total_length number mean_length longest shortest N_count Gaps N50 N50n N70 N70n N90 N90n
SAMD00127152.fa.gz 5180815 165 31398.88 486012 206 0 0 143489 12 83834 22 25975 43
SAMD00127153.fa.gz 5278830 113 46715.31 633980 211 100 1 245414 6 143082 12 43792 24
SAMD00127154.fa.gz 5263743 218 24145.61 390782 208 100 1 147090 12 94642 22 26394 41
SAMD00127155.fa.gz 5261650 147 35793.54 533596 209 100 1 178971 10 119101 16 33179 32
SAMD00127156.fa.gz 5262861 207 25424.45 588167 208 100 1 147090 11 96125 20 30157 39
Nucleotide Composition Analysis
Other tools provide GC content, but some of our analyses require more detailed nucleotide composition data. We need a table containing the count of each nucleotide in every genome assembly:
Required Output Format
Filename A T G C N Other
SAMN40089455.fa 1234567 1200000 1345000 1320000 1000 50
SAMN40089456.fa 1000000 980000 1010000 990000 500 10
The Filename should match what you have provided in the other files.
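Because the filename column has to agree across the metadata, assembly stats and nucleotide count files, a quick cross-check can save a round of emails. The sketch below is an illustration under assumptions: the helper names are hypothetical, the files are uncompressed TSVs, each has a header row, and the filename is in the first column.

```shell
#!/bin/bash
# Print the unique values of column 1 of a TSV, skipping the header row.
first_column() {
    tail -n +2 "$1" | cut -f1 | sort -u
}

# Hypothetical helper: list filenames that appear in only one of two TSV files.
# Prints nothing when the two files contain the same set of filenames.
mismatched_filenames() {
    { first_column "$1"; first_column "$2"; } | sort | uniq -u
}
```

For example, `mismatched_filenames metadata_my_dataset.tsv nucleotide_counts_my_dataset.tsv` lists any filename present in one table but not the other.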
Example script
Here is an example script that you can adapt:
#!/bin/bash
# Usage: ./count_bases.sh path/to/file.fa
FASTA="$1"
FILENAME=$(basename "$FASTA")

# Write the header row
echo -e "Filename\tA\tT\tG\tC\tN\tOther"

# Count nucleotides (case-insensitive), skipping FASTA header lines
grep -v "^>" "$FASTA" | tr -d '\n' | awk -v file="$FILENAME" '
BEGIN {
    A = 0; T = 0; G = 0; C = 0; N = 0; other = 0;
}
{
    for (i = 1; i <= length($0); i++) {
        b = toupper(substr($0, i, 1));
        if (b == "A") A++;
        else if (b == "T") T++;
        else if (b == "G") G++;
        else if (b == "C") C++;
        else if (b == "N") N++;
        else other++;
    }
}
END {
    printf "%s\t%d\t%d\t%d\t%d\t%d\t%d\n", file, A, T, G, C, N, other;
}'
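The script above handles one file and prints a header each time. If you have many assemblies, one possible variation is sketched below: it writes the header once and then one row per file, counting with awk's `gsub` instead of a per-character loop. The function name `count_bases_table` is hypothetical, and the sketch assumes uncompressed FASTA files.

```shell
#!/bin/bash
# Hypothetical variation: one header, then one row per uncompressed FASTA file.
count_bases_table() {
    local fasta
    printf 'Filename\tA\tT\tG\tC\tN\tOther\n'
    for fasta in "$@"; do
        awk -v file="$(basename "$fasta")" '
            /^>/ { next }    # skip FASTA header lines
            {
                seq = toupper($0)
                total += length(seq)
                # gsub returns the number of replacements made,
                # which is the count of that base on this line
                a += gsub(/A/, "", seq)
                t += gsub(/T/, "", seq)
                g += gsub(/G/, "", seq)
                c += gsub(/C/, "", seq)
                n += gsub(/N/, "", seq)
            }
            END {
                other = total - a - t - g - c - n
                printf "%s\t%d\t%d\t%d\t%d\t%d\t%d\n", file, a, t, g, c, n, other
            }
        ' "$fasta"
    done
}
```

It could be called as `count_bases_table /workdir/assembly/*.fa > nucleotide_counts_my_dataset.tsv`, after which the table can be compressed with xz or gzip.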
Running CheckM2 - Optional
We are aware that running CheckM2 on thousands of genomes can be a tall order, so CheckM2 results are optional for submission.
To ensure consistency with existing analyses, please follow the same protocol used by AllTheBacteria:
Requirements
- CheckM2 version 1.0.1
- CheckM2 database: uniref100.KO.1.dmnd
We recommend using the same singularity container used by AllTheBacteria:
Container download:
- Source: https://osf.io/7vpy3
wget -O checkm2.1.0.1--pyh7cba7a3_0.img https://osf.io/download/7vpy3/
CheckM2 database download:
- Source: https://osf.io/x5vtj
wget -O uniref100.KO.1.dmnd https://osf.io/download/x5vtj/
Example CheckM2 Command
# Define variables
WORKDIR="/path/to/working_directory"
IMG="$WORKDIR/checkm2.1.0.1--pyh7cba7a3_0.img"
DB="$WORKDIR/path/to/uniref100.KO.1.dmnd"
OUTDIR="$WORKDIR/output"
FASTA="$WORKDIR/path/to/assembly.fa"
# Run CheckM2 inside the Singularity container
singularity exec --bind "$WORKDIR" "$IMG" checkm2 predict --allmodels --lowmem --database_path "$DB" --remove_intermediates --force -i "$FASTA" --threads 4 -o "$OUTDIR"
The output from CheckM2 will look like:
Name Completeness_General Contamination Completeness_Specific Completeness_Model_Used Translation_Table_Used Coding_Density Contig_N50 Average_Gene_Length Genome_Size GC_Content Total_Coding_Sequences Additional_Notes
SAMD00127152 100.0 0.09 100.0 Neural Network (Specific Model) 11 0.875 143489 303.8069048574869 5180815 0.5 4982 None
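Once CheckM2 has finished, the report needs to be renamed and compressed to match the file naming convention. The sketch below assumes CheckM2 wrote its table to `quality_report.tsv` inside the output directory; the function name `package_checkm2_report` is hypothetical. It uses gzip, and xz works the same way if you prefer a `.tsv.xz` file.

```shell
#!/bin/bash
# Hypothetical helper: compress the CheckM2 report into the current directory
# using the submission naming convention.
# Assumes the report is at <outdir>/quality_report.tsv.
package_checkm2_report() {
    local outdir="$1" dataset="$2"
    gzip -c "$outdir/quality_report.tsv" > "checkm2_results_${dataset}.tsv.gz"
}
```

For example, `package_checkm2_report "$OUTDIR" my_dataset` creates `checkm2_results_my_dataset.tsv.gz` in the current directory.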
Getting Help
If you have questions about:
- Data formats: Check our example files in the repository
- Technical issues: Open an issue on GitHub
- Collaboration opportunities: Contact nabil.alikhan@cgps.group
Thank you for contributing to improving bacterial genome quality assessment!