In this tutorial, we analyze two datasets: one from Zheng et al. (Nature Communications, 2016) and the other from Biase et al. (Genome Research, 2014). The Zheng dataset contains 500 human peripheral blood mononuclear cells (PBMCs) sequenced on the GemCode platform and comprises three cell types: CD56+ natural killer cells, CD19+ B cells, and CD4+/CD25+ regulatory T cells. The original data can be downloaded from the 10x Genomics website. The Biase dataset contains 49 mouse embryo cells, which were sequenced with SMART-Seq and are available from NCBI GEO under accession GSE57249.
library("SAMEclustering")
data("data_SAME")
dim(data_SAME$Zheng.expr)
## [1] 32738 500
data_SAME$Zheng.expr[1:5, 1:5]
## CTACAACTCATACG CAACGAACTGGTTG AACGCCCTTTTGCT TATGTGCTAGTGTC
## MIR1302-10 0 0 0 0
## FAM138A 0 0 0 0
## OR4F5 0 0 0 0
## RP11-34P13.7 0 0 0 0
## RP11-34P13.8 0 0 0 0
## CTAAGGTGTTTGCT
## MIR1302-10 0
## FAM138A 0
## OR4F5 0
## RP11-34P13.7 0
## RP11-34P13.8 0
Here we perform single-cell clustering using five popular methods: SC3, CIDR, Seurat, t-SNE + k-means, and SIMLR. Genes expressed in fewer than 10% or more than 90% of cells are removed before CIDR, t-SNE + k-means, and SIMLR clustering. To improve the performance of the cluster ensemble, we keep the four most diverse of the five sets of clustering results.
cluster.result <- individual_clustering(
  inputTags = data_SAME$Zheng.expr, datatype = "count", percent_dropout = 10,
  SC3 = TRUE, CIDR = TRUE, nPC.cidr = NULL,
  Seurat = TRUE, nPC.seurat = NULL, resolution = 0.9,
  tSNE = TRUE, dimensions = 2, perplexity = 30,
  SIMLR = TRUE, diverse = TRUE, SEED = 123
)
## Performing SC3 clustering...
## Estimating k...
## Setting SC3 parameters...
## Calculating distances between the cells...
## Performing transformations and calculating eigenvectors...
## Performing k-means clustering...
## Calculating consensus matrix...
## Performing CIDR clustering...
## Performing Seurat clustering...
## Regressing out: nUMI
## Scaling data matrix
## Performing tSNE + k-means clustering...
## Performing tSNE + k-means clustering...
## Selecting clusteirng methods for ensemble...
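The gene filter mentioned above can be illustrated with a minimal base-R sketch. It assumes that "expressed" means a non-zero value and that genes are kept when their expression percentage lies between `percent_dropout` and `100 - percent_dropout`. The helper `filter_genes` and the toy matrix are hypothetical and are not part of the SAMEclustering package, whose internal filtering may differ in detail.

```r
# Hypothetical helper (not from SAMEclustering): keep genes whose percentage
# of expressing cells lies within [percent_dropout, 100 - percent_dropout].
filter_genes <- function(expr, percent_dropout = 10) {
  # Percentage of cells (columns) in which each gene (row) is non-zero
  pct_expressed <- rowMeans(expr > 0) * 100
  keep <- pct_expressed >= percent_dropout &
    pct_expressed <= 100 - percent_dropout
  expr[keep, , drop = FALSE]
}

# Toy matrix with made-up values: 5 genes (rows) x 10 cells (columns)
toy <- rbind(
  silent     = rep(0, 10),               # expressed in 0% of cells
  low        = c(4, 1, 2, rep(0, 7)),    # 30%
  mid        = c(rep(3, 5), rep(0, 5)),  # 50%
  high       = c(rep(2, 8), 0, 0),       # 80%
  ubiquitous = rep(1, 10)                # 100%
)

filtered <- filter_genes(toy, percent_dropout = 10)
rownames(filtered)
```

In this toy example the never-expressed and the always-expressed genes are dropped, since neither is likely to help separate cell types.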
The function individual_clustering outputs a matrix in which each row holds the cluster assignments from one method and each column represents a cell. Users can also extend SAME-clustering to other scRNA-seq clustering methods by putting all clustering results into an \(M \times N\) matrix, with M clustering methods and N cells.
cluster.result[1:4, 1:10]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## CIDR 1 2 2 2 2 2 3 1 2 2
## Seurat 3 3 6 6 6 6 3 3 3 3
## tSNE+kmeans 1 1 1 1 1 1 1 1 1 1
## SIMLR 1 1 1 1 1 1 1 1 1 1
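As a rough illustration of the \(M \times N\) layout described above, the following sketch stacks the labels from two stand-in methods, k-means and hierarchical clustering from the base `stats` package, into one matrix. The simulated data and the name `custom.result` are hypothetical, and the sketch assumes that integer cluster labels are all the ensemble step needs.

```r
set.seed(123)

# Simulated stand-in data: 30 cells x 2 features in three separated groups
cells <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)

# Two stand-in clustering methods, each returning one label per cell
km <- kmeans(cells, centers = 3, nstart = 10)$cluster
hc <- cutree(hclust(dist(cells), method = "average"), k = 3)

# Stack the label vectors row-wise: M methods (rows) x N cells (columns)
custom.result <- rbind(kmeans = km, hclust = hc)
dim(custom.result)
```

A matrix built this way has the same orientation as `cluster.result`, so it would be transposed before being handed to SAMEclustering, mirroring the `Y = t(cluster.result)` call used in this tutorial.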
Using the clustering results generated in the previous step, we perform cluster ensemble using the EM algorithm.
cluster.ensemble <- SAMEclustering(Y = t(cluster.result), rep = 3, SEED = 123)
The function SAMEclustering outputs a list containing the optimal ensemble clustering and the corresponding number of clusters, selected by AIC and by BIC, respectively.
cluster.ensemble
## $AICcluster
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3
## [334] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [371] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [408] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [445] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [482] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##
## $final_k_AIC
## [1] 3
##
## $BICcluster
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3
## [334] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [371] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [408] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [445] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [482] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##
## $final_k_BIC
## [1] 3
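One way to inspect such a result is to check whether the AIC and BIC solutions describe the same partition and to tabulate the cluster sizes. The sketch below uses a small hypothetical list, `toy.ensemble`, that only mimics the four fields printed above; with a real result, `cluster.ensemble` would take its place.

```r
# Hypothetical stand-in with the same four fields as the printed output
toy.ensemble <- list(
  AICcluster  = c(1, 1, 2, 2, 3, 3),
  final_k_AIC = 3,
  BICcluster  = c(2, 2, 1, 1, 3, 3),  # same partition, labels permuted
  final_k_BIC = 3
)

# Cross-tabulate the two solutions; they define the same partition when
# every row and every column has exactly one non-empty cell
tab <- table(AIC = toy.ensemble$AICcluster, BIC = toy.ensemble$BICcluster)
same_partition <- all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)

# Number of cells assigned to each cluster under the AIC solution
cluster_sizes <- table(toy.ensemble$AICcluster)

same_partition
cluster_sizes
```

Comparing through a cross-tabulation rather than `identical()` avoids false mismatches when two solutions differ only in how the clusters are numbered.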
We can compare the clustering results to the true labels using the Adjusted Rand Index (ARI).
library(cidr)
# Cell labels of ground truth
head(data_SAME$Zheng.celltype)
## [1] cd56_NK cd56_NK cd56_NK cd56_NK cd56_NK cd56_NK
## Levels: bcell cd56_NK regulatory T
# Calculating ARI for cluster ensemble
adjustedRandIndex(cluster.ensemble$AICcluster, data_SAME$Zheng.celltype)
## [1] 0.9941685
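To make clear what the ARI measures, here is a minimal base-R sketch of the standard pair-counting formula, computed from the contingency table of two labelings. The function `ari` and the toy labels are hypothetical illustrations; the `adjustedRandIndex` call above remains the one used in this tutorial, and this sketch does not handle the degenerate case where the denominator is zero.

```r
# Hypothetical illustration of the Adjusted Rand Index (pair-counting form)
ari <- function(x, y) {
  tab <- table(x, y)                        # contingency table
  sum_ij <- sum(choose(tab, 2))             # pairs together in both labelings
  sum_a  <- sum(choose(rowSums(tab), 2))    # pairs together in x
  sum_b  <- sum(choose(colSums(tab), 2))    # pairs together in y
  expected  <- sum_a * sum_b / choose(length(x), 2)  # value expected by chance
  max_index <- (sum_a + sum_b) / 2                   # value at perfect agreement
  (sum_ij - expected) / (max_index - expected)
}

truth <- c("a", "a", "a", "b", "b", "b")

# Same partition as the truth, only the label names differ
ari_perfect <- ari(c(2, 2, 2, 1, 1, 1), truth)

# A partition that splits the true groups differently
ari_partial <- ari(c(1, 1, 2, 2, 3, 3), truth)

ari_perfect
ari_partial
```

Because the index depends only on which cells are grouped together, it is unaffected by how the clusters are numbered, which is why integer cluster IDs can be compared directly with cell-type names.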
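We now repeat the analysis on the Biase dataset.
# Dimensions of the Biase expression matrix: genes (rows) x cells (columns)
dim(data_SAME$Biase.expr)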
data_SAME$Biase.expr[1:5, 1:5]
## GSM1377859 GSM1377860 GSM1377861 GSM1377862 GSM1377863
## ENSMUSG00000000001 25.8078 36.7561 8.87692 24.5712 31.2255
## ENSMUSG00000000028 93.4291 92.1165 94.59080 107.0380 121.4490
## ENSMUSG00000000031 0.0000 0.0000 0.00000 0.0000 0.0000
## ENSMUSG00000000037 37.9544 22.4305 23.34200 42.2728 23.8579
## ENSMUSG00000000049 0.0000 0.0000 0.00000 0.0000 0.0000
Here we perform single-cell clustering using the same five methods: SC3, CIDR, Seurat, t-SNE + k-means, and SIMLR. Genes expressed in fewer than 10% or more than 90% of cells are removed before CIDR, t-SNE + k-means, and SIMLR clustering. Because the Biase dataset has only 49 cells, the resolution parameter is set to 1.2, following our benchmarking results. The four most diverse of the five sets of clustering results are kept for the downstream cluster ensemble.
cluster.result <- individual_clustering(
  inputTags = data_SAME$Biase.expr, datatype = "FPKM", percent_dropout = 10,
  SC3 = TRUE, CIDR = TRUE, nPC.cidr = NULL,
  Seurat = TRUE, nPC.seurat = NULL, seurat_min_cell = 200, resolution_min = 1.2,
  tSNE = TRUE, dimensions = 2, tsne_min_cells = 200, tsne_min_perplexity = 10,
  SIMLR = TRUE, diverse = TRUE, SEED = 123
)
## Performing SC3 clustering...
## Estimating k...
## Setting SC3 parameters...
## Calculating distances between the cells...
## Performing transformations and calculating eigenvectors...
## Performing k-means clustering...
## Calculating consensus matrix...
## Performing CIDR clustering...
## Performing Seurat clustering...
## Regressing out: nUMI
## Scaling data matrix
## 1 singletons identified. 3 final clusters.
## Performing tSNE + k-means clustering...
## Performing tSNE + k-means clustering...
## Selecting clusteirng methods for ensemble...
Using the clustering results, we perform cluster ensemble using the EM algorithm.
cluster.ensemble <- SAMEclustering(Y = t(cluster.result), rep = 3, SEED = 123)
cluster.ensemble
## $AICcluster
## [1] 1 1 1 1 1 1 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2
## [39] 2 2 2 2 2 2 2 2 2 2 2
##
## $final_k_AIC
## [1] 3
##
## $BICcluster
## [1] 1 1 1 1 1 1 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2
## [39] 2 2 2 2 2 2 2 2 2 2 2
##
## $final_k_BIC
## [1] 3
Compare the cluster ensemble results to the true labels.
# Cell labels of ground truth
head(data_SAME$Biase.celltype)
## [1] zygote zygote zygote zygote zygote zygote
## Levels: Four-cell Two-cell zygote
# Calculating ARI for cluster ensemble
adjustedRandIndex(cluster.ensemble$AICcluster, data_SAME$Biase.celltype)
## [1] 0.9482629