We demonstrate the utility of Spacedust by detecting conserved clusters in an all-versus-all comparison of 1,308 representative bacterial genomes from different genera with a total of 4.2 million protein-coding genes. Spacedust recovers previously annotated gene clusters, for example, operons, antiviral defense systems and BGCs. It is able to assign 58% of all 4.2 million genes and 35% of genes without any annotation to conserved gene clusters. Spacedust also discovers the vast majority of antiphage defense systems in this dataset and achieves better results in identifying 207 manually annotated BGCs than three specialized tools.
Spacedust takes as input a set Q of query genomes and a set T of target genomes (which may be equal to Q) and, for each pair (q, t) ∈ Q × T of query and target genome, it finds all gene clusters whose gene arrangement is at least partially conserved between q and t (Fig. 1 and Methods). For that purpose, Spacedust first identifies homologous matches ('hits') between proteins in Q and proteins in T using our sensitive structure search tool Foldseek and our sequence search tool MMseqs2 (ref. ; steps 1-5 in Fig. 1a). For every Q-T pair, it detects clusters of hits with significant conservation of gene neighborhood, using a greedy cluster detection algorithm (step 6 in Fig. 1a,b). The cluster detection starts with each protein hit in its own cluster and adds protein hits to the cluster matches one at a time. If the significance score of the cluster match improves, the addition is accepted and the algorithm continues, until the significance of the cluster matches cannot be improved further. The significance score is calculated as the sum of the negative logarithms of a clustering P value and an ordering P value. The clustering P value is the probability of finding 'by chance' at least k matches within a window of at most m genes in both the query and the target genome. The ordering P value is the probability to find 'by chance' at least n pairs of genes of the cluster match in conserved order in both genomes. The cluster detection algorithm thereby identifies positionally conserved clusters between all Q-T pairs of genomes. Optionally, the cluster matches for each query genome, aggregated across multiple reference genomes, can be visualized as a measure of conservation strength.
Despite the availability of tens of thousands of complete bacterial genomes, the gene cluster conservation landscape has yet to be surveyed systematically. To address this gap, we curated a dataset of 1,308 bacterial reference genomes, covering a broad phylogenetic range. These genomes were selected such that they belong to different bacterial genera (Methods), to focus on detecting remote homology and globally conserved clusters across higher taxonomic ranks. This choice means species-specific and genus-specific gene clusters cannot be found in this analysis. We subjected all predicted genes (4.19 million) from the 1,308 genomes to an all-versus-all Foldseek+MMseqs2 search using Spacedust. The all-versus-all homology search and hit-filtering process took 72 h to complete on two servers with two 64-core AMD EPYC ROME 7,742 CPUs each, or 150 ms per genome-genome comparison (Supplementary Information), and the subsequent cluster detection required 51 min. The runtime of Spacedust scales quadratically with the total number of genomes and proteins in the genomes, owing to the all-versus-all search and cluster detection. The search yielded 321.2 million cluster hits in 106.6 million cluster matches, with an average of three genes per cluster match (Fig. 2a).
These pairwise cluster matches were subsequently grouped on the level of genome and genes, which yielded 72,483 nonredundant clusters comprising 2.45 million genes, representing 58% of the dataset. We classified 4.19 million genes based on their eggNOG-mapper annotations: 'annotated' if the gene was assigned a specific function (3.13 million, or 75% of the dataset), or 'unannotated' if the gene was labeled as 'hypothetical protein', 'protein with unknown function', or lacking any annotation (1.06 million, or 25% of the dataset). Notably, 66% of the annotated genes were found in nonredundant clusters present in more than one genome. Additionally, 35% of the unannotated genes were found in nonredundant clusters (Fig. 2b).
To evaluate the functional associations within the nonredundant clusters, we evaluated the congruence of KEGG module IDs for gapped gene pairs separated by up to four genes (i, i + 1)...(i, i + 4; Fig. 2c and Extended Data Fig. 1). A gene pair is considered a true positive if both genes share a common KEGG module ID, and false positive if not. The area under the precision-recall curve indicates that Spacedust identifies cluster matches with considerably higher accuracy than the baseline model, which assumes any neighboring gene pair (i, i + x) to be functionally associated. Similarly, we assessed the functional association within the nonredundant clusters predicted by the Foldseek-only search mode (Fig. 2e,f), with the area under the precision-recall curve for gapped gene pairs (i, i + 1)...(i, i + 4) slightly higher than that of the default Foldseek+MMseqs search mode.
To illustrate how Spacedust can support functional annotation of genomes, we took one example genome of a unicellular cyanobacterium Synechocystis sp. Pasteur Culture Collection (PCC) 6803 from the reference database with the 1,308 genomes. This genome comprises one chromosome (GenBank accession: BA000022.2) and four plasmids (AP004311.1, AP004312.1, AP004310.1, AP006585.1), totaling 3,551 protein-coding genes. All the detected clusters are visualized as an interactive cluster heat map (Extended Data Figs. 2-4). For better visibility, we zoomed in on a specific region spanning protein location indices 500 to 800 (Fig. 3a; corresponding to 0.0007% of the total dataset) and integrated functional annotation data obtained from eggNOG-mapper. This allowed us to assign functions to many of the proteins. From this selected genomic region containing 300 genes, we identified three distinct cyanobacteria-specific clusters, indicative of functional conservation across related species. Additionally, we detected 21 clusters shared with other phyla (Supplementary Tables 1-3). Some clusters corresponded to single operons, while others spanned multiple operons.
Cluster 1 (Fig. 3b and Extended Data Fig. 5) comprises genes associated with photosystem II (PSII), the protein-pigment complex that drives oxygenic photosynthesis. The first two genes, rubredoxin and ycf48, are crucial for PSII activity and assembly. The remaining genes, psbEFLJ, form an operon encoding components of the core PSII complex. In many cases, psbL and psbJ are absent, possibly owing to poor conservation or the short length of their sequences.
Cluster 2 (Extended Data Fig. 6) forms an operon encompassing components of the phycobilisome complex rod, a large protein complex in cyanobacteria responsible for capturing sunlight and transferring energy to the photosynthetic reaction centers. The genes cpcA and cpcB encode two major subunits of the rod, while cpcD, cpcC and cpcC2 encode linker components connecting the rod to the PBS core. Conversely, homologous clusters in some other cyanobacteria only contain one copy of the cpcC gene, suggesting that cpcC and cpcC2 might have been created by gene duplication. In some genomes, the genes are still colocalized despite the order of the genes being only partially conserved.
In cluster 3 (Extended Data Fig. 7), the first two genes are both annotated as spkA by eggNOG-mapper, encoding a eukaryotic-type serine/threonine protein kinase involved in signal transduction and mobility. Alignment with other clusters revealed gene fusion of these two genes in other cyanobacteria. The third gene in the cluster is highly conserved in other genomes but could not be annotated using eggNOG-mapper.
To further assess the ability of Spacedust to identify conserved gene clusters, we focused on two categories of known, specialized gene clusters, antiviral defense systems and BGCs.
We used PADLOC (v1.1.0) to identify all known antiviral defense systems in the 1,308 bacterial reference genomes. We removed any predicted region consisting only of single genes. Spacedust was able to recover 5,255 (95%) of 5,520 multi-gene defense system clusters detected by PADLOC (Fig. 4a,b), with 93% (4,888) of the clusters matching fully and 7% (367) matching partially to the PADLOC prediction. Most partial cluster matches resulted from missing matches to one or two short genes at the edge of longer clusters such as CRISPR-Cas systems. For 73 of the 106 defense system types, more than 90% of all defense system clusters were discovered in their entirety by Spacedust (Fig. 4a), despite that restriction-modification type II clusters are the most abundant type of defense systems in the dataset yet most challenging for Spacedust to detect.
To evaluate Spacedust's ability to recover BGCs, we utilized a gold-standard dataset consisting of nine complete genomes available at the NCBI that were fully annotated with BGC and non-BGC regions. We queried these nine genomes against our reference set of 1,308 bacterial genomes using Spacedust. We compared the results with three tools specialized in identifying BGCs, ClusterFinder, DeepBGC (using a cutoff of 10% false positive rate as recommended) and GECCO. As Spacedust returns all conserved clusters and not exclusively BGCs, it was not feasible to compare the precision of BGC detection based on all predictions. Therefore, we evaluated the F1 score, equal to the harmonic mean of precision and recall, for each of the annotated BGCs (Fig. 5). The precision is the fraction of genes in the overlapping predicted region that were annotated as BGC genes, and the recall is the fraction of genes in the annotated BGC region that were predicted as BGC region by the tool. Spacedust achieves higher F1 scores than ClusterFinder, DeepBGC and GECCO (Fig. 5a-c) owing to its higher precision than DeepBGC and GECCO and its higher recall than ClusterFinder (Extended Data Figs. 8 and 9). All tools failed to detect a few instances of BGCs that were identified by Spacedust. Figure 5d shows the cumulative distribution of the F1 score for the three tools. The average F1 score over all BGCs is 0.44 for ClusterFinder, 0.39 for DeepBGC, 0.43 for GECCO and 0.61 for Spacedust.
We used AntiSMASH (version 8), a tool for profile-based BGC detection, to functionally annotate the genes as either 'biosynthetic-related' (biosynthetic, biosynthetic additional, transport, regulatory) or 'other genes'. We manually inspected the clusters reported by Spacedust, ClusterFinder, DeepBGC and GECCO. We observed that the regions reported by Spacedust often miss the transport and regulatory genes but cover the core and additional biosynthetic-related genes, sometimes with multiple short gene clusters within the annotated BGC region (Extended Data Fig. 10).
Next, we investigated Spacedust's utility in identifying new instances of known gene cluster families. One such example is the recently discovered CRISPR subtype III-E, which comprises a single effector protein known as Cas7-11 (ref. ). Notably, Cas7-11 is a protein fusing four Cas7 proteins with a putative Cas11-like protein. The fusion yields a single-protein programmable RNase that shows high sequence specificity and no evidence of collateral activity. Previous screening for Cas7-11 across bacterial genomic sequences led to the identification of subtype III-E systems in 17 loci.
To expand our knowledge of subtype III-E systems beyond the reported loci, we queried the proteins in the 17 loci reported by ref. against the GTDB database. Because we were unable to map a substantial portion of the query proteins and GTDB proteins to known structures, we used the MMseqs2 iterative search with three iterations to perform the homology search. We identified an additional seven instances of subtype III-E clusters in the GTDB database by demanding the presence of the gene encoding Cas7-11 (Fig. 6). In three out of seven genomes, all components of the respective system were identified, demonstrating the high sensitivity of the method.
To facilitate the use of Spacedust for a broad user base, we have set up a Google Colaboratory environment, which allows users to easily run tests and reproduce results without requiring a local installation or configuration. We provide a comprehensive IPython notebook (ipynb file) that includes steps for installing all dependencies and databases, executing the program, and interactively visualizing the clusters. Within the Colab framework, users can either run Spacedust in an all-versus-all mode or annotate query genomes against a pre-compiled reference database with just a single click. With the help of interactive visualization, they can explore the evolutionary conservation of their genome of interest in other genomes at different resolutions and generate gene neighborhood plots for any gene clusters.