vignettes/parameter-selection.Rmd
parameter-selection.Rmd
Here, we demonstrate a grid search of clustering parameters with a
mouse hippocampus VeraFISH dataset. BANKSY currently provides
four algorithms for clustering the BANKSY matrix with
clusterBanksy: Leiden (default), Louvain, k-means, and
model-based clustering. In this vignette, we run only Leiden clustering.
See ?clusterBanksy
for more details on the parameters for
different clustering methods.
The dataset comprises gene expression for 10,944 cells and 120 genes
in 2 spatial dimensions. See ?Banksy::hippocampus
for more
details.
# Load libs
library(Banksy)
library(SummarizedExperiment)
library(SpatialExperiment)
library(scuttle)
library(scater)
library(cowplot)
library(ggplot2)
# Load data
data(hippocampus)
gcm <- hippocampus$expression
locs <- as.matrix(hippocampus$locations)
Here, gcm
is a gene by cell matrix, and
locs
is a matrix specifying the coordinates of the centroid
for each cell.
head(gcm[,1:5])
#> cell_1276 cell_8890 cell_691 cell_396 cell_9818
#> Sparcl1 45 0 11 22 0
#> Slc1a2 17 0 6 5 0
#> Map 10 0 12 16 0
#> Sqstm1 26 0 0 2 0
#> Atp1a2 0 0 4 3 0
#> Tnc 0 0 0 0 0
head(locs)
#> sdimx sdimy
#> cell_1276 -13372.899 15776.37
#> cell_8890 8941.101 15866.37
#> cell_691 -14882.899 15896.37
#> cell_396 -15492.899 15835.37
#> cell_9818 11308.101 15846.37
#> cell_11310 14894.101 15810.37
Initialize a SpatialExperiment object and perform basic quality control. We keep cells with total transcript count within the 5th and 98th percentile:
se <- SpatialExperiment(assay = list(counts = gcm), spatialCoords = locs)
colData(se) <- cbind(colData(se), spatialCoords(se))
# QC based on total counts
qcstats <- perCellQCMetrics(se)
thres <- quantile(qcstats$total, c(0.05, 0.98))
keep <- (qcstats$total > thres[1]) & (qcstats$total < thres[2])
se <- se[, keep]
Next, perform normalization of the data.
# Normalization to mean library size
se <- computeLibraryFactors(se)
aname <- "normcounts"
assay(se, aname) <- normalizeCounts(se, log = FALSE)
BANKSY has a few key parameters. We describe these below.
For characterising neighborhoods, BANKSY computes the
weighted neighborhood mean (H_0
) and the azimuthal Gabor
filter (H_1
), which estimates gene expression gradients.
Setting compute_agf=TRUE
computes both H_0
and
H_1
.
k_geom
specifies the number of neighbors used to compute
each H_m
for m=0,1
. If a single value is
specified, the same k_geom
will be used for each feature
matrix. Alternatively, multiple values of k_geom
can be
provided for each feature matrix. Here, we use k_geom[1]=15
and k_geom[2]=30
for H_0
and H_1
respectively. More neighbors are used to compute gradients.
For datasets generated using Visium v1/v2, use
k_geom=18
(ork_geom <- c(18, 18)
ifcompute_agf = TRUE
), since that corresponds to taking as neighbourhood two concentric rings of spots around each spot.
We compute the neighborhood feature matrices using normalized
expression (normcounts
in the se
object).
k_geom <- c(15, 30)
se <- computeBanksy(se, assay_name = aname, compute_agf = TRUE, k_geom = k_geom)
#> Computing neighbors...
#> Spatial mode is kNN_median
#> Parameters: k_geom=15
#> Done
#> Computing neighbors...
#> Spatial mode is kNN_median
#> Parameters: k_geom=30
#> Done
#> Computing harmonic m = 0
#> Using 15 neighbors
#> Done
#> Computing harmonic m = 1
#> Using 30 neighbors
#> Centering
#> Done
computeBanksy
populates the assays
slot
with H_0
and H_1
in this instance:
se
#> class: SpatialExperiment
#> dim: 120 10205
#> metadata(1): BANKSY_params
#> assays(4): counts normcounts H0 H1
#> rownames(120): Sparcl1 Slc1a2 ... Notch3 Egfr
#> rowData names(0):
#> colnames(10205): cell_1276 cell_691 ... cell_11635 cell_10849
#> colData names(4): sample_id sdimx sdimy sizeFactor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : sdimx sdimy
#> imgData names(1): sample_id
The lambda
parameter is a mixing parameter in
[0,1]
which determines how much spatial information is
incorporated for downstream analysis. With smaller values of
lambda
, BANKY operates in cell-typing mode, while
at higher levels of lambda
, BANKSY operates in
domain-finding mode. As a starting point, we recommend
lambda=0.2
for cell-typing and lambda=0.8
for
zone-finding, except for datasets generated using the Visium
v1/v2 technology, for which we recommend
lambda=0.2
for domain finding. See the note in the tutorial
on the main page
for more info.
Here, we run lambda=0
which corresponds to non-spatial
clustering, and lambda=0.2
for spatially-informed
cell-typing. We compute PCs with and without the AGF
(H_1
).
lambda <- c(0, 0.2)
se <- runBanksyPCA(se, use_agf = c(FALSE, TRUE), lambda = lambda, seed = 1000)
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
runBanksyPCA
populates the reducedDims
slot, with each combination of use_agf
and
lambda
provided.
reducedDimNames(se)
#> [1] "PCA_M0_lam0" "PCA_M0_lam0.2" "PCA_M1_lam0" "PCA_M1_lam0.2"
Next, we cluster the BANKSY embedding with Leiden graph-based
clustering. This admits two parameters: k_neighbors
and
resolution
. k_neighbors
determines the number
of k nearest neighbors used to construct the shared nearest neighbors
graph. Leiden clustering is then performed on the resultant graph with
resolution resolution
. For reproducibiltiy we set a seed
for each parameter combination.
k <- 50
res <- 1
se <- clusterBanksy(se, use_agf = c(FALSE, TRUE), lambda = lambda, k_neighbors = k, resolution = res, seed = 1000)
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
#> Using seed=1000
clusterBanksy
populates colData(se)
with
cluster labels:
To compare clustering runs visually, different runs can be relabeled
to minimise their differences with connectClusters
:
se <- connectClusters(se)
#> clust_M1_lam0_k50_res1 --> clust_M0_lam0_k50_res1
#> clust_M0_lam0.2_k50_res1 --> clust_M1_lam0_k50_res1
#> clust_M1_lam0.2_k50_res1 --> clust_M0_lam0.2_k50_res1
Visualise spatial coordinates with cluster labels.
cnames <- colnames(colData(se))
cnames <- cnames[grep("^clust", cnames)]
cplots <- lapply(cnames, function(cnm) {
plotColData(se, x = "sdimx", y = "sdimy", point_size = 0.1, colour_by = cnm) +
coord_equal() +
labs(title = cnm) +
theme(legend.title = element_blank()) +
guides(colour = guide_legend(override.aes = list(size = 2)))
})
plot_grid(plotlist = cplots, ncol = 2)
Compare all cluster outputs with compareClusters
. This
function computes pairwise cluster comparison metrics between the
clusters in colData(se)
based on adjusted Rand index
(ARI):
compareClusters(se, func = "ARI")
#> clust_M0_lam0_k50_res1 clust_M0_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1 1.000 0.67
#> clust_M0_lam0.2_k50_res1 0.670 1.00
#> clust_M1_lam0_k50_res1 1.000 0.67
#> clust_M1_lam0.2_k50_res1 0.747 0.87
#> clust_M1_lam0_k50_res1 clust_M1_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1 1.000 0.747
#> clust_M0_lam0.2_k50_res1 0.670 0.870
#> clust_M1_lam0_k50_res1 1.000 0.747
#> clust_M1_lam0.2_k50_res1 0.747 1.000
or normalized mutual information (NMI):
compareClusters(se, func = "NMI")
#> clust_M0_lam0_k50_res1 clust_M0_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1 1.000 0.741
#> clust_M0_lam0.2_k50_res1 0.741 1.000
#> clust_M1_lam0_k50_res1 1.000 0.741
#> clust_M1_lam0.2_k50_res1 0.782 0.915
#> clust_M1_lam0_k50_res1 clust_M1_lam0.2_k50_res1
#> clust_M0_lam0_k50_res1 1.000 0.782
#> clust_M0_lam0.2_k50_res1 0.741 0.915
#> clust_M1_lam0_k50_res1 1.000 0.782
#> clust_M1_lam0.2_k50_res1 0.782 1.000
See ?compareClusters
for the full list of comparison
measures.
Vignette runtime:
#> Time difference of 27.6314 secs
sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.2.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/Detroit
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] cowplot_1.1.3 scater_1.30.1
#> [3] ggplot2_3.4.4 scuttle_1.12.0
#> [5] SpatialExperiment_1.12.0 SingleCellExperiment_1.24.0
#> [7] SummarizedExperiment_1.32.0 Biobase_2.62.0
#> [9] GenomicRanges_1.54.1 GenomeInfoDb_1.38.6
#> [11] IRanges_2.36.0 S4Vectors_0.40.2
#> [13] BiocGenerics_0.48.1 MatrixGenerics_1.14.0
#> [15] matrixStats_1.2.0 Banksy_0.99.12
#> [17] BiocStyle_2.30.0
#>
#> loaded via a namespace (and not attached):
#> [1] bitops_1.0-7 gridExtra_2.3
#> [3] rlang_1.1.3 magrittr_2.0.3
#> [5] compiler_4.3.2 sccore_1.0.4
#> [7] DelayedMatrixStats_1.24.0 systemfonts_1.0.5
#> [9] vctrs_0.6.5 stringr_1.5.1
#> [11] pkgconfig_2.0.3 crayon_1.5.2
#> [13] fastmap_1.1.1 magick_2.8.2
#> [15] XVector_0.42.0 labeling_0.4.3
#> [17] utf8_1.2.4 rmarkdown_2.25
#> [19] ggbeeswarm_0.7.2 ragg_1.2.7
#> [21] purrr_1.0.2 xfun_0.42
#> [23] zlibbioc_1.48.0 cachem_1.0.8
#> [25] beachmat_2.18.0 jsonlite_1.8.8
#> [27] highr_0.10 DelayedArray_0.28.0
#> [29] BiocParallel_1.36.0 irlba_2.3.5.1
#> [31] parallel_4.3.2 aricode_1.0.3
#> [33] R6_2.5.1 bslib_0.6.1
#> [35] stringi_1.8.3 leidenAlg_1.1.2
#> [37] jquerylib_0.1.4 Rcpp_1.0.12
#> [39] bookdown_0.37 knitr_1.45
#> [41] Matrix_1.6-5 igraph_2.0.1.1
#> [43] tidyselect_1.2.0 viridis_0.6.5
#> [45] rstudioapi_0.15.0 abind_1.4-5
#> [47] yaml_2.3.8 codetools_0.2-19
#> [49] lattice_0.22-5 tibble_3.2.1
#> [51] withr_3.0.0 evaluate_0.23
#> [53] desc_1.4.3 mclust_6.0.1
#> [55] pillar_1.9.0 BiocManager_1.30.22
#> [57] generics_0.1.3 dbscan_1.1-12
#> [59] RCurl_1.98-1.14 sparseMatrixStats_1.14.0
#> [61] munsell_0.5.0 scales_1.3.0
#> [63] glue_1.7.0 tools_4.3.2
#> [65] BiocNeighbors_1.20.2 data.table_1.15.0
#> [67] ScaledMatrix_1.10.0 fs_1.6.3
#> [69] grid_4.3.2 colorspace_2.1-0
#> [71] GenomeInfoDbData_1.2.11 RcppHungarian_0.3
#> [73] beeswarm_0.4.0 BiocSingular_1.18.0
#> [75] vipor_0.4.7 cli_3.6.2
#> [77] rsvd_1.0.5 textshaping_0.3.7
#> [79] fansi_1.0.6 viridisLite_0.4.2
#> [81] S4Arrays_1.2.0 dplyr_1.1.4
#> [83] uwot_0.1.16 gtable_0.3.4
#> [85] sass_0.4.8 digest_0.6.34
#> [87] ggrepel_0.9.5 SparseArray_1.2.4
#> [89] farver_2.1.1 rjson_0.2.21
#> [91] memoise_2.0.1 htmltools_0.5.7
#> [93] pkgdown_2.0.7 lifecycle_1.0.4