The R package spARI is designed to compute two novel clustering evaluation metrics—the spatially aware Rand index (spRI) and its adjusted version (spARI)—for assessing spatial transcriptomics clustering. Unlike the traditional Rand index (RI) and adjusted Rand index (ARI), spRI and spARI incorporate spatial distance information into clustering evaluation. When comparing two partitions, spRI assigns a weight to each disagreement pair—two objects in the same cluster of one partition but in different clusters of the other—based on their spatial distance, allowing for a more refined distinction between disagreement pairs. The spRI value ranges between zero and one, and the spARI value is less than or equal to one with an expected value of zero.
To implement this package, users are required to input two partitions and either spatial coordinates or a precomputed distance matrix. The main function “spARI” then automatically computes spRI and spARI in a fast and user-friendly manner.
If spatial coordinates are provided, a built-in normalization step is applied to remove the effect of spatial location unit.
If a distance matrix is provided, the built-in distance matrix calculation step is skipped.
For large-scale spatial data (e.g., 100,000 objects), users can input a sparse distance matrix of class “dgCMatrix” or “dgTMatrix”, for instance, by retaining only distances to the \(k\) nearest neighbors for each object.
The spARI
package is compatible with Windows, Linux, and macOS, and can be easily installed on all three platforms.
Users can install the spARI
package from GitHub.
devtools::install_github("yinqiaoyan/spARI", dependencies = TRUE)
Other required packages also need to be installed in advance.
install.packages("FNN") # find k-nearest neighbors
The following code shows an example (the third simulation study in the manuscript) that runs the main function “spARI” in our package.
Import the required R package.
library(spARI)
Load the example data stored in this package. The example data includes:
data("spARI_example_data")
true_labels <- spARI_example_data$true_labels
c1_labels <- spARI_example_data$c1_labels
c2_labels <- spARI_example_data$c2_labels
coords <- spARI_example_data$coords
Run “spARI” function to compute spRI and spARI. The meaning of each argument in the main function is listed below.
res_value1 <- spARI(r_labels=true_labels, c_labels=c1_labels, coords=coords)
res_value2 <- spARI(r_labels=true_labels, c_labels=c2_labels, coords=coords)
print(res_value1)
## spRI spARI
## 0.9260064 0.6716089
print(res_value2)
## spRI spARI
## 0.917020 0.631726
The reference partition and the two clustering partitions are shown in the following figure.
cols <- c("#D27786","#A5D7F1","#c599f3","#ffb610")
par(mfrow=c(1,3), mar=c(3,3,1,1))
plot(coords[, 1], coords[, 2], col=cols[true_labels], pch=16, main="Reference")
plot(coords[, 1], coords[, 2], col=cols[c1_labels], pch=16, main="Clustering i")
plot(coords[, 1], coords[, 2], col=cols[c2_labels], pch=16, main="Clustering ii")
It can be seen that spRI and spARI prefer clustering partitions with more compact spatial structures.
Finally, we provide a scenario where the differences of spRI/spARI between two clustering partitions are noticeably large.
set.seed(1)
coords <- data.frame(x = c(runif(30,0,1), runif(30,1.2,2.2)),
y = runif(60,0,1))
# reference
ref <- c(rep(1,30), rep(2,30))
# clustering A
cluA <- sample(ref)
# clustering B
cluB <- ref
par(mfrow=c(1,2), mar=c(3,3,1,1))
plot(coords$x, coords$y, col=cluA, pch=16, main="Clustering A (lower spRI/spARI)")
plot(coords$x, coords$y, col=cluB, pch=16, main="Clustering B (higher spRI/spARI)")
# compute spRI and spARI
res_A <- spARI(ref, cluA, coords = as.matrix(coords))
res_B <- spARI(ref, cluB, coords = as.matrix(coords))
print(res_A)
## spRI spARI
## 0.6612889 -0.1293616
print(res_B)
## spRI spARI
## 1 1
The following code shows an example that runs the main function “spARI” using the distance matrix as input.
library(spARI)
data("spARI_example_data")
true_labels <- spARI_example_data$true_labels
c1_labels <- spARI_example_data$c1_labels
c2_labels <- spARI_example_data$c2_labels
coords <- spARI_example_data$coords
## Compute the distance matrix
coords_norm <- coords
coords_norm[,1] <- (coords[,1] - min(coords[,1])) / (max(coords[,1]) - min(coords[,1]))
coords_norm[,2] <- (coords[,2] - min(coords[,2])) / (max(coords[,2]) - min(coords[,2]))
dist_mat <- as.matrix(stats::dist(coords_norm))
Run “spARI” function to compute spRI and spARI using dist_mat
.
res_value1 <- spARI(r_labels=true_labels, c_labels=c1_labels, dist_mat=dist_mat)
res_value2 <- spARI(r_labels=true_labels, c_labels=c2_labels, dist_mat=dist_mat)
print(res_value1)
## spRI spARI
## 0.9260064 0.6716089
print(res_value2)
## spRI spARI
## 0.917020 0.631726
The computation yields identical results to those obtained when using coordinates as input.
In this example, we generate two partitions for 100,000 objects with simulated spatial coordinates to examine the computational efficiency of the spARI
package for large-scale spatial data.
Since storing a dense matrix of 100,000x100,000 would require more than 16GB of memory, we generate a sparse distance matrix of the objects, containing approximately 600,000 non-empty entries.
library(FNN)
library(Matrix)
## Define sparse distance matrix generation function
build_symmetric_knn_distance_matrix <- function(coord_mat, k = 5) {
N <- nrow(coord_mat)
# Find k nearest neighbors for each object (exclude itself)
knn_res <- FNN::get.knn(coord_mat, k = k)
i_idx <- rep(1:N, each = k)
j_idx <- as.vector(t(knn_res$nn.index))
d_val <- as.vector(t(knn_res$nn.dist))
# Generate sparse matrix (symmetric)
D_temp <- sparseMatrix(
i = i_idx,
j = j_idx,
x = d_val,
dims = c(N, N)
)
D1 <- as(D_temp, "dgTMatrix")
D2 <- as(t(D_temp), "dgTMatrix")
i_all <- c(D1@i, D2@i)
j_all <- c(D1@j, D2@j)
x_all <- c(D1@x, D2@x)
# Combine and aggregate max
df <- data.frame(i = i_all + 1, j = j_all + 1, x = x_all) # +1 for R indexing
df_agg <- aggregate(x ~ i + j, data = df, FUN = max)
# Create symmetric sparse matrix
D_sym <- sparseMatrix(i = df_agg$j, j = df_agg$i, x = df_agg$x)
return(D_sym)
}
## Generate sparse distance matrix
set.seed(123)
N <- 1e5 # 100,000 objects
coords <- matrix(runif(N*2), N, 2) # coordinates ranging between 0 and 1
dist_mat <- build_symmetric_knn_distance_matrix(coords, k = 5)
dim(dist_mat)
## [1] 100000 100000
# 100000 100000
length(dist_mat@x)
## [1] 595850
# 595850
Remark1. The sparse matrix should be stored in the R classes “dgCMatrix” or “dgTMatrix”.
Remark2. For each object, we initially retain only the distances to its \(k=5\) nearest neighbors (recorded entries in the sparse matrix). However, since the \(k\)-nearest neighbor relationship is not symmetric (i.e., object \(i\) may be among the \(k\) nearest neighbors of object \(j\), but \(j\) is not necessarily among the \(k\) nearest neighbors of \(i\)), the resulting matrix \(D_{\text{temp}}\) is asymmetric. To ensure symmetry, we include the distance between two objects if either one is among the \(k\) nearest neighbors of the other.
Remark3. For object pairs whose distances are empty (i.e., not recorded) in the sparse matrix, their corresponding values of the weights \(\widetilde{W}(i,j)\)’s in spRI/spARI are set to zero if they form a disagreement pair and still to one if they form an agreement pair. In addition, if we denote the set of cell pairs with recorded distances by \(\Xi\), then \(F=\sum_{(i,j) \in \Xi}f_{ij}\) and \(H=\sum_{(i,j) \in \Xi}h_{ij}\) for the sparse distance matrix situation.
Then we generate the synthetic reference and clustering partitions, with the clustering result containing 10,000 misclustered objects.
K <- 15
set.seed(123)
# reference partition
true_labels <- sample(1:K, N, replace = TRUE)
# clustering partition
c_labels <- true_labels
ids <- sample(which(true_labels == 1), 5000)
c_labels[ids] <- 2
ids <- sample(which(true_labels == 3), 5000)
c_labels[ids] <- 4
Compute spRI and spARI values.
library(spARI)
stime <- Sys.time()
res <- spARI(true_labels, c_labels, dist_mat=dist_mat)
etime <- Sys.time()
print(res)
## spRI spARI
## 0.9834065 0.8752946
# spRI spARI
# 0.9834065 0.8752946
print(etime-stime)
## Time difference of 22.77945 secs
# Execution time is about 4.8 seconds
# at a MacBook Air powered by Apple M4 CPU with 16GB of RAM
spARI is also compatible with spatial data characterized by an undirected graph structure, where each node corresponds to an object and edges denote connections. In this setting, the spatial proximity between any two objects is represented by an adjacency matrix, where the \((i,j)\)-th entry equals one if objects \(i\) and \(j\) are connected and equals zero otherwise. This 0-1 symmetric adjacency matrix can be viewed as a special case of a distance matrix. We notice that the zero corresponds to a non-recorded or missing distance that is not accounted for in computing spRI/spARI rather than a zero distance. The calculation procedure with the 0-1 adjacency matrix as input is the same as that in a sparse distance matrix.
The following code exhibits the use of a synthetic adjacency matrix as input, constructed for a dataset of 10 objects.
set.seed(12)
## Generate adjacency matrix
n <- 10
p <- 0.4
adj_mat <- matrix(0L, n, n)
up_tri <- upper.tri(adj_mat)
adj_mat[up_tri] <- rbinom(sum(up_tri), size = 1, prob = p)
adj_mat <- adj_mat + t(adj_mat)
diag(adj_mat) <- 0
print(adj_mat)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 1 0 0 0 0 1 0 1
## [2,] 0 0 1 0 1 1 0 0 1 1
## [3,] 1 1 0 0 0 0 0 1 0 0
## [4,] 0 0 0 0 0 0 1 0 1 0
## [5,] 0 1 0 0 0 0 0 0 1 0
## [6,] 0 1 0 0 0 0 0 0 1 0
## [7,] 0 0 0 1 0 0 0 0 1 1
## [8,] 1 0 1 0 0 0 0 0 1 0
## [9,] 0 1 0 1 1 1 1 1 0 1
## [10,] 1 1 0 0 0 0 1 0 1 0
## Generate synthetic reference and clustering partitions
ref <- sample(1:3, n, replace = TRUE)
clu <- ref
clu[c(6,7)] <- 1
Run “spARI” function to compute spRI and spARI.
library(spARI)
res <- spARI(r_labels = ref, c_labels = clu, dist_mat = adj_mat)
print(res)
## spRI spARI
## 0.7403798 0.3479276
We conduct the permutation test for the observed spARI value of the two clustering partitions in the example data.
First, we import spARI
package and load the example data
library(spARI)
data("spARI_example_data")
true_labels <- spARI_example_data$true_labels
c1_labels <- spARI_example_data$c1_labels
c2_labels <- spARI_example_data$c2_labels
coords <- spARI_example_data$coords
Then we carry out the permutation test for the clustering partitions “c1_labels” and “c2_labels”. By default, the number of permutations is set to 100.
set.seed(42)
perm_test(r_labels=true_labels, c_labels=c1_labels, coords=coords,
use_parallel=FALSE)
## spARI_obs p-value
## 0.6716089 0.0000000
perm_test(r_labels=true_labels, c_labels=c2_labels, coords=coords,
use_parallel=FALSE)
## spARI_obs p-value
## 0.631726 0.000000
Both the p-values are identical to zero, indicating that the observed spARI value is significantly larger than zero, and the null hypothesis “spARI=0” is rejected.
SpatialExperiment
objectsThe R package is also compatible with SpatialExperiment
object, which is a commonly used R object to store ST data.
First, we generate a synthetic ST data stored in SpatialExperiment
object.
library(SpatialExperiment)
library(S4Vectors)
set.seed(123)
count_matrix <- matrix(
sample(0:10, 100, replace = TRUE),
nrow = 10, # 10 genes
ncol = 10 # 10 spots
)
rownames(count_matrix) <- paste0("gene", 1:10)
colnames(count_matrix) <- paste0("spot", 1:10)
# Construct gene annotations (rowData)
gene_annotation <- DataFrame(
gene_id = rownames(count_matrix),
gene_name = paste0("Gene_", 1:10)
)
# Construct spot metadata (colData)
spot_metadata <- DataFrame(
spot_id = colnames(count_matrix),
sample_id = rep("sample1", 10),
cell_type = c(2, 3, 2, 2, 1, 3, 1, 3, 1, 3),
cluster = c(2, 3, 2, 3, 1, 3, 3, 3, 1, 3)
)
# Construct spatial coordinates (spatialCoords)
coords_matrix <- cbind(
x = runif(10, min = 0, max = 100),
y = runif(10, min = 0, max = 100)
)
rownames(coords_matrix) <- colnames(count_matrix)
# Construct the SpatialExperiment object
spe <- SpatialExperiment(
assays = list(counts = count_matrix),
rowData = gene_annotation,
colData = spot_metadata,
spatialCoords = coords_matrix
)
Then we use the SpatialExperiment object to calculate the spRI and spARI values and conduct the permutation test.
library(spARI)
spARI(spe=spe)
## spRI spARI
## 0.8650783 0.4584890
set.seed(42)
perm_test(spe=spe, use_parallel=FALSE)
## spARI_obs p-value
## 0.458489 0.010000
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SpatialExperiment_1.19.1 SingleCellExperiment_1.31.1
## [3] SummarizedExperiment_1.39.1 Biobase_2.69.0
## [5] GenomicRanges_1.61.1 Seqinfo_0.99.2
## [7] IRanges_2.43.0 S4Vectors_0.47.0
## [9] BiocGenerics_0.55.1 generics_0.1.4
## [11] MatrixGenerics_1.21.0 matrixStats_1.5.0
## [13] Matrix_1.7-4 FNN_1.1.4.1
## [15] spARI_0.99.15 BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] jsonlite_2.0.0 rjson_0.2.23 compiler_4.5.1
## [4] BiocManager_1.30.26 crayon_1.5.3 tinytex_0.57
## [7] Rcpp_1.1.0 magick_2.8.7 jquerylib_0.1.4
## [10] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-7
## [13] R6_2.6.1 XVector_0.49.0 S4Arrays_1.9.1
## [16] knitr_1.50 DelayedArray_0.35.2 bookdown_0.44
## [19] bslib_0.9.0 rlang_1.1.6 cachem_1.1.0
## [22] xfun_0.53 sass_0.4.10 SparseArray_1.9.1
## [25] cli_3.6.5 magrittr_2.0.3 digest_0.6.37
## [28] grid_4.5.1 lifecycle_1.0.4 evaluate_1.0.5
## [31] abind_1.4-8 rmarkdown_2.29 tools_4.5.1
## [34] htmltools_0.5.8.1