Here, we demonstrate a grid search of clustering parameters with a mouse hippocampus VeraFISH dataset. BANKSY currently provides four algorithms for clustering the BANKSY matrix with clusterBanksy: Leiden (default), Louvain, k-means, and model-based clustering. In this vignette, we run only Leiden clustering. See ?clusterBanksy for more details on the parameters for different clustering methods.

Loading the data

The dataset comprises gene expression for 10,944 cells and 120 genes in 2 spatial dimensions. See ?Banksy::hippocampus for more details.

# Load libs
library(Banksy)

library(SummarizedExperiment)
library(SpatialExperiment)
library(scuttle)

library(scater)
library(cowplot)
library(ggplot2)

# Load data
data(hippocampus)
gcm <- hippocampus$expression
locs <- as.matrix(hippocampus$locations)

Here, gcm is a gene by cell matrix, and locs is a matrix specifying the coordinates of the centroid for each cell.

head(gcm[,1:5])
#>         cell_1276 cell_8890 cell_691 cell_396 cell_9818
#> Sparcl1        45         0       11       22         0
#> Slc1a2         17         0        6        5         0
#> Map            10         0       12       16         0
#> Sqstm1         26         0        0        2         0
#> Atp1a2          0         0        4        3         0
#> Tnc             0         0        0        0         0
head(locs)
#>                 sdimx    sdimy
#> cell_1276  -13372.899 15776.37
#> cell_8890    8941.101 15866.37
#> cell_691   -14882.899 15896.37
#> cell_396   -15492.899 15835.37
#> cell_9818   11308.101 15846.37
#> cell_11310  14894.101 15810.37

Initialize a SpatialExperiment object and perform basic quality control. We keep cells whose total transcript count lies strictly between the 5th and 98th percentiles:

se <- SpatialExperiment(assay = list(counts = gcm), spatialCoords = locs)
colData(se) <- cbind(colData(se), spatialCoords(se))

# QC based on total counts
qcstats <- perCellQCMetrics(se)
thres <- quantile(qcstats$total, c(0.05, 0.98))
keep <- (qcstats$total > thres[1]) & (qcstats$total < thres[2])
se <- se[, keep]

Next, perform normalization of the data.

# Normalization to mean library size
se <- computeLibraryFactors(se)
aname <- "normcounts"
assay(se, aname) <- normalizeCounts(se, log = FALSE)
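
To illustrate what normalization to mean library size does, here is a minimal base-R sketch on a made-up 3 x 3 count matrix (toy_counts and all other names are hypothetical). It assumes that size factors are library sizes scaled to have mean 1, and that normalized counts are raw counts divided by those factors.

```r
# Toy gene-by-cell count matrix (hypothetical values, not the hippocampus data)
toy_counts <- matrix(
    c(10, 0, 30,
      20, 5, 15,
       1, 1,  2),
    nrow = 3,
    dimnames = list(paste0("gene_", 1:3), paste0("cell_", 1:3))
)

# Library size = total counts per cell
lib_size <- colSums(toy_counts)

# Size factors: library sizes rescaled so that their mean is 1
size_factor <- lib_size / mean(lib_size)

# Divide each cell (column) by its size factor; no log transform
toy_norm <- sweep(toy_counts, 2, size_factor, "/")

# After normalization every cell has the same total: the mean library size
colSums(toy_norm)
mean(lib_size)
```

Within each cell, the relative proportions of genes are unchanged; only the per-cell totals are equalized.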

Parameters

BANKSY has a few key parameters. We describe these below.

AGF usage

For characterising neighborhoods, BANKSY computes the weighted neighborhood mean (H_0) and the azimuthal Gabor filter (H_1), which estimates gene expression gradients. Setting compute_agf=TRUE computes both H_0 and H_1.
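
To give some intuition for H_0 and H_1, the following base-R sketch computes both for a single cell on a toy 5 x 5 grid. It is a simplification, not the package's implementation: neighbors are weighted uniformly (an assumption made for brevity), no centering is applied, and neighbor_features is a hypothetical helper. H_0 is the mean expression of the neighbors, and H_1 is the magnitude of the same sum after each neighbor is multiplied by exp(i * phi), where phi is the angle from the cell to that neighbor.

```r
# Toy data: 25 cells on a regular grid, one gene at a time
coords <- as.matrix(expand.grid(x = 1:5, y = 1:5))
flat <- rep(3, nrow(coords))   # constant expression, no spatial structure
ramp <- coords[, "x"]          # expression increases along x

# Hypothetical helper: H_0 and |H_1| for cell i, using k nearest neighbors
# and uniform weights (a simplifying assumption)
neighbor_features <- function(expr, coords, i, k) {
    delta <- sweep(coords, 2, coords[i, ])      # offsets from cell i
    d <- sqrt(rowSums(delta^2))
    nn <- order(d)[2:(k + 1)]                   # skip the cell itself
    phi <- atan2(delta[nn, 2], delta[nn, 1])    # azimuthal angle to each neighbor
    w <- rep(1 / k, k)
    c(H0 = sum(w * expr[nn]),
      H1 = Mod(sum(w * expr[nn] * exp(1i * phi))))
}

centre <- 13   # the cell at (3, 3)
f_flat <- neighbor_features(flat, coords, centre, k = 8)
f_ramp <- neighbor_features(ramp, coords, centre, k = 8)
f_flat
f_ramp
```

Both neighborhoods have the same mean, so H_0 cannot tell them apart. H_1 is essentially zero for the flat gene and clearly positive for the ramp, which is the sense in which it responds to a gradient.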

k-geometric

k_geom specifies the number of neighbors used to compute each H_m for m=0,1. If a single value is specified, the same k_geom will be used for each feature matrix. Alternatively, multiple values of k_geom can be provided for each feature matrix. Here, we use k_geom[1]=15 and k_geom[2]=30 for H_0 and H_1 respectively. More neighbors are used to compute gradients.

For datasets generated using Visium v1/v2, use k_geom=18 (or k_geom <- c(18, 18) if compute_agf = TRUE), since that corresponds to taking as neighbourhood two concentric rings of spots around each spot.
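
The value 18 can be checked with a little arithmetic on an idealized hexagonal lattice with unit spacing between adjacent spots (an assumption of this sketch; real spot coordinates are scaled and may be slightly irregular). The first ring contains 6 spots and the second ring contains 12.

```r
# Idealized hexagonal lattice with unit spacing, centred on the origin
idx <- expand.grid(i = -4:4, j = -4:4)
x <- idx$i + 0.5 * idx$j
y <- sqrt(3) / 2 * idx$j
d <- sqrt(x^2 + y^2)   # distance of every spot from the central spot

# First ring: spots one step away. Two rings: spots at most two steps away.
ring1 <- sum(d > 0 & d < 1.01)
two_rings <- sum(d > 0 & d < 2.01)

ring1
two_rings
```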

We compute the neighborhood feature matrices using normalized expression (normcounts in the se object).

k_geom <- c(15, 30)
se <- computeBanksy(se, assay_name = aname, compute_agf = TRUE, k_geom = k_geom)
#> Computing neighbors...
#> Spatial mode is kNN_median
#> Parameters: k_geom=15
#> Done
#> Computing neighbors...
#> Spatial mode is kNN_median
#> Parameters: k_geom=30
#> Done
#> Computing harmonic m = 0
#> Using 15 neighbors
#> Done
#> Computing harmonic m = 1
#> Using 30 neighbors
#> Centering
#> Done

computeBanksy populates the assays slot with H_0 and H_1 in this instance:

se
#> class: SpatialExperiment 
#> dim: 120 10205 
#> metadata(1): BANKSY_params
#> assays(4): counts normcounts H0 H1
#> rownames(120): Sparcl1 Slc1a2 ... Notch3 Egfr
#> rowData names(0):
#> colnames(10205): cell_1276 cell_691 ... cell_11635 cell_10849
#> colData names(4): sample_id sdimx sdimy sizeFactor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : sdimx sdimy
#> imgData names(1): sample_id

lambda

The lambda parameter is a mixing parameter in [0,1] which determines how much spatial information is incorporated for downstream analysis. With smaller values of lambda, BANKSY operates in cell-typing mode, while at higher values of lambda, BANKSY operates in domain-finding mode. As a starting point, we recommend lambda=0.2 for cell-typing and lambda=0.8 for domain-finding, except for datasets generated using the Visium v1/v2 technology, for which we recommend lambda=0.2 for domain-finding. See the note in the tutorial on the main page for more info.
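
As a rough illustration of the mixing, the sketch below stacks a cell's own expression on top of its neighborhood mean, weighting the two blocks by sqrt(1 - lambda) and sqrt(lambda). This weighting is an assumption of the sketch for the case without the AGF; the package also scales features and handles the H_1 block, both of which are omitted here. mix_features and the random toy matrices are hypothetical.

```r
set.seed(1)
own <- matrix(rnorm(12), nrow = 3)   # toy own-expression block: 3 genes x 4 cells
nbr <- matrix(rnorm(12), nrow = 3)   # toy neighborhood-mean block, same shape

# Hypothetical helper: stack the two blocks with lambda-dependent weights
mix_features <- function(own, nbr, lambda) {
    stopifnot(lambda >= 0, lambda <= 1)
    rbind(sqrt(1 - lambda) * own, sqrt(lambda) * nbr)
}

b_nonspatial <- mix_features(own, nbr, lambda = 0)     # neighborhood block is all zero
b_celltype   <- mix_features(own, nbr, lambda = 0.2)   # mostly own expression
b_domain     <- mix_features(own, nbr, lambda = 0.8)   # mostly neighborhood

dim(b_celltype)
```

With lambda=0 the neighborhood block vanishes, so any downstream PCA or clustering sees only each cell's own expression.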

Here, we run lambda=0 which corresponds to non-spatial clustering, and lambda=0.2 for spatially-informed cell-typing. We compute PCs with and without the AGF (H_1).

lambda <- c(0, 0.2)
se <- runBanksyPCA(se, use_agf = c(FALSE, TRUE), lambda = lambda, seed = 1000)
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000

runBanksyPCA populates the reducedDims slot with one embedding for each combination of use_agf and lambda provided.

reducedDimNames(se)
#> [1] "PCA_M0_lam0"   "PCA_M0_lam0.2" "PCA_M1_lam0"   "PCA_M1_lam0.2"

Clustering parameters

Next, we cluster the BANKSY embedding with Leiden graph-based clustering. This admits two parameters: k_neighbors and resolution. k_neighbors sets the number of nearest neighbors used to construct the shared nearest neighbors graph. Leiden clustering is then performed on the resulting graph at the given resolution. For reproducibility, we set a seed for each parameter combination.

k <- 50
res <- 1
se <- clusterBanksy(se, use_agf = c(FALSE, TRUE), lambda = lambda, k_neighbors = k, resolution = res, seed = 1000)
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000

clusterBanksy populates colData(se) with cluster labels:

colnames(colData(se))
#> [1] "sample_id"                "sdimx"                   
#> [3] "sdimy"                    "sizeFactor"              
#> [5] "clust_M0_lam0_k50_res1"   "clust_M0_lam0.2_k50_res1"
#> [7] "clust_M1_lam0_k50_res1"   "clust_M1_lam0.2_k50_res1"
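
The column names above encode the parameter grid. The base-R sketch below enumerates the same grid with expand.grid and rebuilds the names; the naming pattern is inferred from the output shown here rather than taken from the package source, and param_grid is a hypothetical name.

```r
# Enumerate every combination of AGF usage (M = 0 or 1), lambda, k and resolution
param_grid <- expand.grid(
    lambda = c(0, 0.2),
    M = c(0, 1),     # M = 0: without AGF, M = 1: with AGF
    k = 50,
    res = 1
)

# Rebuild the column names following the pattern seen in colData(se)
param_grid$name <- paste0(
    "clust_M", param_grid$M,
    "_lam", param_grid$lambda,
    "_k", param_grid$k,
    "_res", param_grid$res
)

param_grid
```

Supplying more values for any parameter grows the grid multiplicatively, so a sketch like this is a quick way to see how many clustering runs a call would produce.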

Comparing cluster results

To compare clustering runs visually, different runs can be relabeled to minimise their differences with connectClusters:

se <- connectClusters(se)
#> clust_M1_lam0_k50_res1 --> clust_M0_lam0_k50_res1
#> clust_M0_lam0.2_k50_res1 --> clust_M1_lam0_k50_res1
#> clust_M1_lam0.2_k50_res1 --> clust_M0_lam0.2_k50_res1
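
Relabeling matters because cluster labels are arbitrary: two runs can produce the same partition under different names. The toy sketch below shows the idea with a brute-force search over label permutations that maximizes the overlap with a reference labeling. It is not how connectClusters is implemented; it assumes both runs have the same small number of clusters, and perms and relabel_to_match are hypothetical helpers.

```r
# All permutations of a vector (fine for a handful of clusters)
perms <- function(v) {
    if (length(v) <= 1) return(list(v))
    out <- list()
    for (i in seq_along(v)) {
        for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
    }
    out
}

# Rename the labels in `new` so that they agree with `ref` as much as possible
relabel_to_match <- function(ref, new) {
    tab <- table(as.character(new), as.character(ref))   # rows: new, cols: ref
    k <- nrow(tab)
    stopifnot(k == ncol(tab))                            # same number of clusters
    candidates <- perms(seq_len(k))
    overlap <- vapply(candidates,
                      function(p) sum(tab[cbind(seq_len(k), p)]),
                      numeric(1))
    best <- candidates[[which.max(overlap)]]
    colnames(tab)[best[match(as.character(new), rownames(tab))]]
}

ref <- c(1, 1, 2, 2, 3, 3)
new <- c(3, 3, 1, 1, 2, 2)   # same partition, different label names
relabel_to_match(ref, new)
```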

Visualise spatial coordinates with cluster labels.

cnames <- colnames(colData(se))
cnames <- cnames[grep("^clust", cnames)]
cplots <- lapply(cnames, function(cnm) {
    plotColData(se, x = "sdimx", y = "sdimy", point_size = 0.1, colour_by = cnm) +
        coord_equal() +
        labs(title = cnm) +
        theme(legend.title = element_blank()) +
        guides(colour = guide_legend(override.aes = list(size = 2)))
})

plot_grid(plotlist = cplots, ncol = 2)

Compare all cluster outputs with compareClusters. This function computes pairwise cluster comparison metrics between the clusters in colData(se) based on adjusted Rand index (ARI):

compareClusters(se, func = "ARI")
#>                          clust_M0_lam0_k50_res1 clust_M0_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1                    1.000                     0.67
#> clust_M0_lam0.2_k50_res1                  0.670                     1.00
#> clust_M1_lam0_k50_res1                    1.000                     0.67
#> clust_M1_lam0.2_k50_res1                  0.747                     0.87
#>                          clust_M1_lam0_k50_res1 clust_M1_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1                    1.000                    0.747
#> clust_M0_lam0.2_k50_res1                  0.670                    0.870
#> clust_M1_lam0_k50_res1                    1.000                    0.747
#> clust_M1_lam0.2_k50_res1                  0.747                    1.000

or normalized mutual information (NMI):

compareClusters(se, func = "NMI")
#>                          clust_M0_lam0_k50_res1 clust_M0_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1                    1.000                    0.741
#> clust_M0_lam0.2_k50_res1                  0.741                    1.000
#> clust_M1_lam0_k50_res1                    1.000                    0.741
#> clust_M1_lam0.2_k50_res1                  0.782                    0.915
#>                          clust_M1_lam0_k50_res1 clust_M1_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1                    1.000                    0.782
#> clust_M0_lam0.2_k50_res1                  0.741                    0.915
#> clust_M1_lam0_k50_res1                    1.000                    0.782
#> clust_M1_lam0.2_k50_res1                  0.782                    1.000

See ?compareClusters for the full list of comparison measures.
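
For readers unfamiliar with the ARI, the following base-R sketch computes it from the contingency table of two labelings using the standard pair-counting formula. It is meant only to show what the metric measures; adjusted_rand_index is a hypothetical helper, not a Banksy function.

```r
# Adjusted Rand index from the contingency table of two label vectors
adjusted_rand_index <- function(a, b) {
    stopifnot(length(a) == length(b))
    tab <- table(a, b)
    comb2 <- function(n) n * (n - 1) / 2        # number of pairs among n items

    pairs_both <- sum(comb2(tab))               # pairs together in both labelings
    pairs_a <- sum(comb2(rowSums(tab)))         # pairs together in a
    pairs_b <- sum(comb2(colSums(tab)))         # pairs together in b

    expected <- pairs_a * pairs_b / comb2(length(a))
    maximum <- (pairs_a + pairs_b) / 2
    (pairs_both - expected) / (maximum - expected)
}

# Identical partitions with different label names
ari_same <- adjusted_rand_index(c(1, 1, 2, 2), c(2, 2, 1, 1))

# Partially agreeing partitions
ari_partial <- adjusted_rand_index(c(1, 1, 1, 2, 2, 2), c(1, 1, 2, 2, 3, 3))

ari_same
ari_partial
```

Because the index depends only on which cells are grouped together, it is unaffected by how the clusters are named, which is why the M0 and M1 runs at lambda=0 can score 1 against each other.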

Session information

Vignette runtime:

#> Time difference of 27.6314 secs
sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.2.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Detroit
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] cowplot_1.1.3               scater_1.30.1              
#>  [3] ggplot2_3.4.4               scuttle_1.12.0             
#>  [5] SpatialExperiment_1.12.0    SingleCellExperiment_1.24.0
#>  [7] SummarizedExperiment_1.32.0 Biobase_2.62.0             
#>  [9] GenomicRanges_1.54.1        GenomeInfoDb_1.38.6        
#> [11] IRanges_2.36.0              S4Vectors_0.40.2           
#> [13] BiocGenerics_0.48.1         MatrixGenerics_1.14.0      
#> [15] matrixStats_1.2.0           Banksy_0.99.12             
#> [17] BiocStyle_2.30.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] bitops_1.0-7              gridExtra_2.3            
#>  [3] rlang_1.1.3               magrittr_2.0.3           
#>  [5] compiler_4.3.2            sccore_1.0.4             
#>  [7] DelayedMatrixStats_1.24.0 systemfonts_1.0.5        
#>  [9] vctrs_0.6.5               stringr_1.5.1            
#> [11] pkgconfig_2.0.3           crayon_1.5.2             
#> [13] fastmap_1.1.1             magick_2.8.2             
#> [15] XVector_0.42.0            labeling_0.4.3           
#> [17] utf8_1.2.4                rmarkdown_2.25           
#> [19] ggbeeswarm_0.7.2          ragg_1.2.7               
#> [21] purrr_1.0.2               xfun_0.42                
#> [23] zlibbioc_1.48.0           cachem_1.0.8             
#> [25] beachmat_2.18.0           jsonlite_1.8.8           
#> [27] highr_0.10                DelayedArray_0.28.0      
#> [29] BiocParallel_1.36.0       irlba_2.3.5.1            
#> [31] parallel_4.3.2            aricode_1.0.3            
#> [33] R6_2.5.1                  bslib_0.6.1              
#> [35] stringi_1.8.3             leidenAlg_1.1.2          
#> [37] jquerylib_0.1.4           Rcpp_1.0.12              
#> [39] bookdown_0.37             knitr_1.45               
#> [41] Matrix_1.6-5              igraph_2.0.1.1           
#> [43] tidyselect_1.2.0          viridis_0.6.5            
#> [45] rstudioapi_0.15.0         abind_1.4-5              
#> [47] yaml_2.3.8                codetools_0.2-19         
#> [49] lattice_0.22-5            tibble_3.2.1             
#> [51] withr_3.0.0               evaluate_0.23            
#> [53] desc_1.4.3                mclust_6.0.1             
#> [55] pillar_1.9.0              BiocManager_1.30.22      
#> [57] generics_0.1.3            dbscan_1.1-12            
#> [59] RCurl_1.98-1.14           sparseMatrixStats_1.14.0 
#> [61] munsell_0.5.0             scales_1.3.0             
#> [63] glue_1.7.0                tools_4.3.2              
#> [65] BiocNeighbors_1.20.2      data.table_1.15.0        
#> [67] ScaledMatrix_1.10.0       fs_1.6.3                 
#> [69] grid_4.3.2                colorspace_2.1-0         
#> [71] GenomeInfoDbData_1.2.11   RcppHungarian_0.3        
#> [73] beeswarm_0.4.0            BiocSingular_1.18.0      
#> [75] vipor_0.4.7               cli_3.6.2                
#> [77] rsvd_1.0.5                textshaping_0.3.7        
#> [79] fansi_1.0.6               viridisLite_0.4.2        
#> [81] S4Arrays_1.2.0            dplyr_1.1.4              
#> [83] uwot_0.1.16               gtable_0.3.4             
#> [85] sass_0.4.8                digest_0.6.34            
#> [87] ggrepel_0.9.5             SparseArray_1.2.4        
#> [89] farver_2.1.1              rjson_0.2.21             
#> [91] memoise_2.0.1             htmltools_0.5.7          
#> [93] pkgdown_2.0.7             lifecycle_1.0.4