Xanthomonas oryzae¶
This is the QualiBact page for Xanthomonas oryzae. For detailed methods on how these thresholds were calculated, please see Methods. The suggested thresholds are:
metric | lower_bounds | upper_bounds |
---|---|---|
N50 | 4000.0 | |
no_of_contigs | 1630.0 | |
GC_Content | 63.0 | 65.0 |
Completeness | 84.0 | |
Contamination | 5.0 | |
Total_Coding_Sequences | 4000.0 | 5300.0 |
Genome_Size | 4000000.0 | 5300000.0 |
These thresholds are based on 212 genomes from RefSeq and 328 genomes from ATB / SRA.
These thresholds were applied to the AllTheBacteria (ATB) dataset, which resulted in removing 19 genomes and retaining 309.
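As a rough illustration of how bounds like these can be applied to a single assembly, the sketch below checks each metric against its lower and upper limit. It is not the QualiBact implementation. The function name `failed_metrics` and the example values are hypothetical, and the single values listed for no_of_contigs and Contamination are assumed here to act as upper limits.

```python
# Bounds copied from the table above; None means the table lists no bound.
# Assumption: the single values for no_of_contigs and Contamination are
# treated as upper limits in this sketch.
THRESHOLDS = {
    "N50": (4000.0, None),
    "no_of_contigs": (None, 1630.0),
    "GC_Content": (63.0, 65.0),
    "Completeness": (84.0, None),
    "Contamination": (None, 5.0),
    "Total_Coding_Sequences": (4000.0, 5300.0),
    "Genome_Size": (4000000.0, 5300000.0),
}


def failed_metrics(genome):
    """Return the names of metrics whose value falls outside its bounds."""
    failures = []
    for metric, (lower, upper) in THRESHOLDS.items():
        value = genome[metric]
        if lower is not None and value < lower:
            failures.append(metric)
        elif upper is not None and value > upper:
            failures.append(metric)
    return failures


# Hypothetical assembly statistics, for illustration only.
good = {
    "N50": 250000,
    "no_of_contigs": 120,
    "GC_Content": 63.7,
    "Completeness": 99.1,
    "Contamination": 0.4,
    "Total_Coding_Sequences": 4600,
    "Genome_Size": 4950000,
}
poor = dict(good, N50=2500, GC_Content=61.9)

print(failed_metrics(good))
print(failed_metrics(poor))
```

A genome would be retained when the returned list is empty. Otherwise the list doubles as the reason for rejection.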
The list of genomes retained (i.e. high quality) and the list of genomes rejected (filtered) can be downloaded below. These files are in .xz format. The rejected genomes file also includes the reason for each rejection.
Download high quality genomes list
Download rejected genomes list
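The downloaded .xz files can be read without decompressing them first. The sketch below uses Python's standard `lzma` and `csv` modules and assumes the file is tab-separated text with a header row. The file name and the `sample` and `reason` columns are made up for the demonstration, so check the real files for their actual layout.

```python
import csv
import lzma
import os
import tempfile


def read_xz_table(path, delimiter="\t"):
    """Read an xz-compressed delimited text file into a list of dicts."""
    with lzma.open(path, mode="rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


# Demonstration with a small hypothetical file; the real column names
# and delimiter may differ.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "rejected_genomes.tsv.xz")
    with lzma.open(demo_path, mode="wt", newline="") as handle:
        handle.write("sample\treason\n")
        handle.write("SAMPLE_A\tN50 below lower bound\n")
    rows = read_xz_table(demo_path)

for row in rows:
    print(row["sample"], row["reason"])
```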
Summary Tables¶
These tables provide a summary of the distribution of each metric, including the standard deviation, mean, median, and percentiles.
Download simple summary tables
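For readers who want to reproduce this kind of summary on their own data, a minimal sketch using Python's `statistics` module follows. The genome sizes are invented, and the quartiles (25th, 50th and 75th percentiles) use the module's default interpolation method, which may not match the method used to build the downloadable tables.

```python
import statistics


def summarise(values):
    """Return basic distribution statistics for one metric."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "quartiles": statistics.quantiles(values, n=4),
    }


# Hypothetical genome sizes in base pairs, for illustration only.
genome_sizes = [4800000, 4900000, 5000000, 5100000, 5200000]
summary = summarise(genome_sizes)

for name, value in summary.items():
    print(name, value)
```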
Plots and Visualizations¶
This plot is a histogram comparing genome sizes between the SRA and RefSeq datasets. Each bar represents the density of genomes within a specific size range for both datasets. By comparing the shapes and positions of the bars, you can identify differences in genome size distributions, such as shifts, peaks, or outliers. This visualization helps reveal whether one dataset tends to have larger or smaller genomes, or if there are notable differences in variability or coverage between SRA and RefSeq.
This plot is a QQ (quantile-quantile) plot, which compares the distribution of the SRA data with that of RefSeq. Points falling along the diagonal line indicate that the two datasets follow the same distribution. Deviations from the line indicate where the distributions differ, for example through skewness or outliers. This helps assess whether the two datasets are consistently distributed or whether there are systematic differences between them.
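The points of a two-sample QQ plot can be computed by pairing matching quantiles from each dataset. The sketch below is one simple way to do this with the standard library. The function name `qq_points`, the choice of deciles and the genome sizes are all assumptions for illustration, not the method used to draw the plot on this page.

```python
import statistics


def qq_points(sample_a, sample_b, n=10):
    """Pair the matching quantiles of two samples (n=10 gives deciles)."""
    quantiles_a = statistics.quantiles(sample_a, n=n)
    quantiles_b = statistics.quantiles(sample_b, n=n)
    return list(zip(quantiles_a, quantiles_b))


# Hypothetical genome sizes: the second set is the first shifted by 100 kb.
refseq_sizes = [4700000 + 50000 * i for i in range(12)]
sra_sizes = [size + 100000 for size in refseq_sizes]

for x, y in qq_points(refseq_sizes, sra_sizes):
    # A constant offset from the diagonal indicates a shifted distribution.
    print(x, y, y - x)
```

With identical samples every point lies on the diagonal. In this example each point sits the same distance above it, which is the signature of a uniform shift rather than a change in shape.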
This plot shows the relationship between the number of coding sequences (CDS) and genome size. It helps to visualize how genome size correlates with the number of genes. This should be linear: as the genome size increases, the number of coding sequences should also increase. Any secondary trend lines or non-linear behaviour indicate either bona fide separate populations within the retained genomes or some remaining contamination.
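One way to check the expected linear relationship numerically is to fit a straight line by ordinary least squares and inspect the residuals. The sketch below does this by hand with no third-party libraries. The data points are invented, and a large residual is only a prompt for closer inspection, not a rule used by QualiBact.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Observed minus fitted CDS count for each genome."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical (genome size, CDS count) pairs, for illustration only.
sizes = [4000000, 4500000, 5000000]
cds_counts = [4000, 4500, 5000]

slope, intercept = fit_line(sizes, cds_counts)
print("CDS per base pair:", slope)
print("residuals:", residuals(sizes, cds_counts, slope, intercept))
```

Genomes that form a second cluster of large residuals with the same sign would correspond to the secondary trend lines described above.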
Additional Plots¶
These plots provide additional insights into the genome characteristics:
- GC Content Histogram
- GC Content QQ Plot
- Total Coding Sequences Histogram
- Total Coding Sequences QQ Plot
- Genome Size Histogram
- Genome Size QQ Plot
Illustrating the filtering process¶
These plots illustrate the data before and after filtering to demonstrate which types of outliers have been removed. While filtering was applied to all metrics, we demonstrate it here using total assembly length and N50.
N50 vs total length for all genomes in the dataset.
N50 vs total length for genomes in the dataset, coloured according to whether they are an anomaly or not.
N50 vs total length post filtering on the dataset.
Additional Plots¶
These plots provide additional insights into the genome characteristics:
- N50 vs number of contigs, all genomes
- N50 vs number of contigs, sampled genomes
- N50 vs number of contigs, filtered genomes
- GC Content vs Total Length, all genomes
- GC Content vs Total Length, sampled genomes
- GC Content vs Total Length, filtered genomes
- Longest Contig vs Total Length, all genomes
- Longest Contig vs Total Length, sampled genomes
- Longest Contig vs Total Length, filtered genomes