vignettes/parameter-selection.Rmd
parameter-selection.RmdHere, we demonstrate a grid search of clustering parameters with a
mouse hippocampus VeraFISH dataset. BANKSY currently provides
four algorithms for clustering the BANKSY matrix with
clusterBanksy: Leiden (default), Louvain, k-means, and
model-based clustering. In this vignette, we run only Leiden clustering.
See ?clusterBanksy for more details on the parameters for
different clustering methods.
The dataset comprises gene expression for 10,944 cells and 120 genes
in 2 spatial dimensions. See ?Banksy::hippocampus for more
details.
# Load libs
library(Banksy)
library(SummarizedExperiment)
library(SpatialExperiment)
library(scuttle)
library(scater)
library(cowplot)
library(ggplot2)
# Load data
data(hippocampus)
gcm <- hippocampus$expression
locs <- as.matrix(hippocampus$locations)Here, gcm is a gene by cell matrix, and
locs is a matrix specifying the coordinates of the centroid
for each cell.
head(gcm[,1:5])
#> cell_1276 cell_8890 cell_691 cell_396 cell_9818
#> Sparcl1 45 0 11 22 0
#> Slc1a2 17 0 6 5 0
#> Map 10 0 12 16 0
#> Sqstm1 26 0 0 2 0
#> Atp1a2 0 0 4 3 0
#> Tnc 0 0 0 0 0
head(locs)
#> sdimx sdimy
#> cell_1276 -13372.899 15776.37
#> cell_8890 8941.101 15866.37
#> cell_691 -14882.899 15896.37
#> cell_396 -15492.899 15835.37
#> cell_9818 11308.101 15846.37
#> cell_11310 14894.101 15810.37Initialize a SpatialExperiment object and perform basic quality control. We keep cells with total transcript count within the 5th and 98th percentile:
se <- SpatialExperiment(assay = list(counts = gcm), spatialCoords = locs)
colData(se) <- cbind(colData(se), spatialCoords(se))
# QC based on total counts
qcstats <- perCellQCMetrics(se)
thres <- quantile(qcstats$total, c(0.05, 0.98))
keep <- (qcstats$total > thres[1]) & (qcstats$total < thres[2])
se <- se[, keep]Next, perform normalization of the data.
# Normalization to mean library size
se <- computeLibraryFactors(se)
aname <- "normcounts"
assay(se, aname) <- normalizeCounts(se, log = FALSE)BANKSY has a few key parameters. We describe these below.
For characterising neighborhoods, BANKSY computes the
weighted neighborhood mean (H_0) and the azimuthal Gabor
filter (H_1), which estimates gene expression gradients.
Setting compute_agf=TRUE computes both H_0 and
H_1.
k_geom specifies the number of neighbors used to compute
each H_m for m=0,1. If a single value is
specified, the same k_geom will be used for each feature
matrix. Alternatively, multiple values of k_geom can be
provided for each feature matrix. Here, we use k_geom[1]=15
and k_geom[2]=30 for H_0 and H_1
respectively. More neighbors are used to compute gradients.
For datasets generated using Visium v1/v2, use
k_geom=18(ork_geom <- c(18, 18)ifcompute_agf = TRUE), since that corresponds to taking as neighbourhood two concentric rings of spots around each spot.
We compute the neighborhood feature matrices using normalized
expression (normcounts in the se object).
k_geom <- c(15, 30)
se <- computeBanksy(se, assay_name = aname, compute_agf = TRUE, k_geom = k_geom)
#> Computing neighbors...
#> Spatial mode is kNN_median
#> Parameters: k_geom=15
#> Done
#> Computing neighbors...
#> Spatial mode is kNN_median
#> Parameters: k_geom=30
#> Done
#> Done
#> Centering
#> DonecomputeBanksy populates the assays slot
with H_0 and H_1 in this instance:
se
#> class: SpatialExperiment
#> dim: 120 10205
#> metadata(1): BANKSY_params
#> assays(4): counts normcounts H0 H1
#> rownames(120): Sparcl1 Slc1a2 ... Notch3 Egfr
#> rowData names(0):
#> colnames(10205): cell_1276 cell_691 ... cell_11635 cell_10849
#> colData names(4): sample_id sdimx sdimy sizeFactor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : sdimx sdimy
#> imgData names(1): sample_idThe lambda parameter is a mixing parameter in
[0,1] which determines how much spatial information is
incorporated for downstream analysis. With smaller values of
lambda, BANKY operates in cell-typing mode, while
at higher levels of lambda, BANKSY operates in
domain-finding mode. As a starting point, we recommend
lambda=0.2 for cell-typing and lambda=0.8 for
zone-finding, except for datasets generated using the Visium
v1/v2 technology, for which we recommend
lambda=0.2 for domain finding. See the note in the tutorial
on the main page
for more info.
Here, we run lambda=0 which corresponds to non-spatial
clustering, and lambda=0.2 for spatially-informed
cell-typing. We compute PCs with and without the AGF
(H_1).
lambda <- c(0, 0.2)
se <- runBanksyPCA(se, use_agf = c(FALSE, TRUE), lambda = lambda, seed = 1000)
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000runBanksyPCA populates the reducedDims
slot, with each combination of use_agf and
lambda provided.
reducedDimNames(se)
#> [1] "PCA_M0_lam0" "PCA_M0_lam0.2" "PCA_M1_lam0" "PCA_M1_lam0.2"Next, we cluster the BANKSY embedding with Leiden graph-based
clustering. This admits two parameters: k_neighbors and
resolution. k_neighbors determines the number
of k nearest neighbors used to construct the shared nearest neighbors
graph. Leiden clustering is then performed on the resultant graph with
resolution resolution. For reproducibiltiy we set a seed
for each parameter combination.
k <- 50
res <- 1
se <- clusterBanksy(se, use_agf = c(FALSE, TRUE), lambda = lambda, k_neighbors = k, resolution = res, seed = 1000)
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000clusterBanksy populates colData(se) with
cluster labels:
To compare clustering runs visually, different runs can be relabeled
to minimise their differences with connectClusters:
se <- connectClusters(se)
#> clust_M1_lam0_k50_res1 --> clust_M0_lam0_k50_res1
#> clust_M0_lam0.2_k50_res1 --> clust_M1_lam0_k50_res1
#> clust_M1_lam0.2_k50_res1 --> clust_M0_lam0.2_k50_res1Visualise spatial coordinates with cluster labels.
cnames <- colnames(colData(se))
cnames <- cnames[grep("^clust", cnames)]
cplots <- lapply(cnames, function(cnm) {
plotColData(se, x = "sdimx", y = "sdimy", point_size = 0.1, colour_by = cnm) +
coord_equal() +
labs(title = cnm) +
theme(legend.title = element_blank()) +
guides(colour = guide_legend(override.aes = list(size = 2)))
})
plot_grid(plotlist = cplots, ncol = 2)
Compare all cluster outputs with compareClusters. This
function computes pairwise cluster comparison metrics between the
clusters in colData(se) based on adjusted Rand index
(ARI):
compareClusters(se, func = "ARI")
#> clust_M0_lam0_k50_res1 clust_M0_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1 1.000 0.67
#> clust_M0_lam0.2_k50_res1 0.670 1.00
#> clust_M1_lam0_k50_res1 1.000 0.67
#> clust_M1_lam0.2_k50_res1 0.747 0.87
#> clust_M1_lam0_k50_res1 clust_M1_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1 1.000 0.747
#> clust_M0_lam0.2_k50_res1 0.670 0.870
#> clust_M1_lam0_k50_res1 1.000 0.747
#> clust_M1_lam0.2_k50_res1 0.747 1.000or normalized mutual information (NMI):
compareClusters(se, func = "NMI")
#> clust_M0_lam0_k50_res1 clust_M0_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1 1.000 0.741
#> clust_M0_lam0.2_k50_res1 0.741 1.000
#> clust_M1_lam0_k50_res1 1.000 0.741
#> clust_M1_lam0.2_k50_res1 0.782 0.915
#> clust_M1_lam0_k50_res1 clust_M1_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1 1.000 0.782
#> clust_M0_lam0.2_k50_res1 0.741 0.915
#> clust_M1_lam0_k50_res1 1.000 0.782
#> clust_M1_lam0.2_k50_res1 0.782 1.000See ?compareClusters for the full list of comparison
measures.
Vignette runtime:
#> Time difference of 27.51106 secs
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.6.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/London
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] cowplot_1.1.3 scater_1.32.1
#> [3] ggplot2_3.5.1 scuttle_1.14.0
#> [5] SpatialExperiment_1.14.0 SingleCellExperiment_1.26.0
#> [7] SummarizedExperiment_1.34.0 Biobase_2.64.0
#> [9] GenomicRanges_1.56.1 GenomeInfoDb_1.40.1
#> [11] IRanges_2.38.1 S4Vectors_0.42.1
#> [13] BiocGenerics_0.48.1 MatrixGenerics_1.16.0
#> [15] matrixStats_1.3.0 Banksy_1.5.0
#> [17] BiocStyle_2.32.1
#>
#> loaded via a namespace (and not attached):
#> [1] gridExtra_2.3 rlang_1.1.4
#> [3] magrittr_2.0.3 compiler_4.4.1
#> [5] sccore_1.0.5 DelayedMatrixStats_1.26.0
#> [7] systemfonts_1.1.0 vctrs_0.6.5
#> [9] pkgconfig_2.0.3 crayon_1.5.3
#> [11] fastmap_1.2.0 magick_2.8.4
#> [13] XVector_0.44.0 labeling_0.4.3
#> [15] utf8_1.2.4 rmarkdown_2.27
#> [17] UCSC.utils_1.0.0 ggbeeswarm_0.7.2
#> [19] ragg_1.3.2 xfun_0.52
#> [21] zlibbioc_1.50.0 cachem_1.1.0
#> [23] beachmat_2.20.0 jsonlite_1.8.8
#> [25] highr_0.11 DelayedArray_0.30.1
#> [27] BiocParallel_1.38.0 irlba_2.3.5.1
#> [29] parallel_4.4.1 aricode_1.0.3
#> [31] R6_2.5.1 bslib_0.8.0
#> [33] leidenAlg_1.1.3 jquerylib_0.1.4
#> [35] Rcpp_1.0.13 bookdown_0.43
#> [37] knitr_1.48 Matrix_1.7-0
#> [39] igraph_2.0.3 tidyselect_1.2.1
#> [41] rstudioapi_0.16.0 abind_1.4-5
#> [43] yaml_2.3.10 viridis_0.6.5
#> [45] codetools_0.2-20 lattice_0.22-6
#> [47] tibble_3.2.1 withr_3.0.1
#> [49] evaluate_0.24.0 desc_1.4.3
#> [51] mclust_6.1.1 pillar_1.9.0
#> [53] BiocManager_1.30.23 generics_0.1.3
#> [55] dbscan_1.2-0 sparseMatrixStats_1.16.0
#> [57] munsell_0.5.1 scales_1.3.0
#> [59] glue_1.7.0 tools_4.4.1
#> [61] BiocNeighbors_1.22.0 data.table_1.15.4
#> [63] ScaledMatrix_1.12.0 fs_1.6.4
#> [65] grid_4.4.1 colorspace_2.1-1
#> [67] GenomeInfoDbData_1.2.11 RcppHungarian_0.3
#> [69] beeswarm_0.4.0 BiocSingular_1.20.0
#> [71] vipor_0.4.7 cli_3.6.3
#> [73] rsvd_1.0.5 textshaping_0.4.0
#> [75] fansi_1.0.6 viridisLite_0.4.2
#> [77] S4Arrays_1.4.1 dplyr_1.1.4
#> [79] uwot_0.2.2 gtable_0.3.5
#> [81] sass_0.4.9 digest_0.6.36
#> [83] SparseArray_1.4.8 ggrepel_0.9.5
#> [85] farver_2.1.2 rjson_0.2.21
#> [87] htmlwidgets_1.6.4 htmltools_0.5.8.1
#> [89] pkgdown_2.1.0 lifecycle_1.0.4
#> [91] httr_1.4.7