The current implementation of iSMNN encompasses two major steps: one optional cluster harmonizing step and the other batch effect correction step. In the first step, the clusters/cell type labels are matched/harmonized across multiple scRNA-seq batches using unifiedClusterLabelling function from SMNN package (Yang et al. 2020). This entire clustering step can be by-passed by feeding iSMNN cell cluster labels. With cell cluster label information, iSMNN iteratively searches mutual nearest neighbors within each harmonized cell type, and performs batch effect correction using the iSMNN function.
In this tutorial, we will perform batch effect correction using iSMNN in a toy example containing two batches. The first batch contains 400 cells from three cell types, namely fibroblasts, macrophages and endothelial cells. And the second batches has 500 cells from the same three cell types. Both two batches contain 3000 genes.
iSMNN package can be directly installed from GitHub with:
install.packages("devtools")
devtools::install_github("yycunc/iSMNN")
library("iSMNN")
data("data_iSMNN")
dim(data_iSMNN$batch1.mat)
## [1] 3000 400
data_iSMNN$batch1.mat[1:5, 1:5]
## batch1_CAACTAGTCGCGCCAA batch1_CATTATCTCGGTGTTA
## Smr3a 0 0
## Kif18a 0 0
## Arhgap25 0 0
## Msl3l2 0 0
## Gm11060 0 0
## batch1_CTTCTCTTCTTAGAGC batch1_TGACTTTGTCAACTGT
## Smr3a 0 0
## Kif18a 0 0
## Arhgap25 0 0
## Msl3l2 0 0
## Gm11060 0 0
## batch1_AGTTGGTAGAAGGGTA
## Smr3a 0
## Kif18a 0
## Arhgap25 0
## Msl3l2 0
## Gm11060 0
dim(data_iSMNN$batch2.mat)
## [1] 3000 500
The function unifiedClusterLabelling from SMNN package is used to match/harmonize the clusters/cell type labels across multiple scRNA-seq batches. It takes as input raw expression matrices from two or more batches, a list of marker genes and their corresponding cluster labels, and outputs harmonized cluster label for every single cells across all batches.
Two pieces of information are needed: - Marker genes - Corresponding cell type labels for each marker gene
# Maker genes
markers <- c("Col1a1", "Pdgfra", "Ptprc", "Pecam1")
# Corresponding cell type labels for each marker gene
cluster.info <- c("fibroblast", "fibroblast", "macrophage", "endothelial cells")
library(SMNN)
batch.cluster.labels <- unifiedClusterLabelling(data_SMNN$batch1.mat, data_iSMNN$batch2.mat, features.use = markers, cluster.labels = cluster.info, min.perc = 0.3)
# Add cell ID to the harmonized cluster/cell type labels
names(batch.cluster.labels[[1]]) <- colnames(data_iSMNN$batch1.mat)
names(batch.cluster.labels[[2]]) <- colnames(data_iSMNN$batch2.mat)
batch.cluster.labels is a list where each element represents the harmonized cluster/cell type labels of each batch.
head(batch.cluster.labels[[1]])
## batch1_CAACTAGTCGCGCCAA batch1_CATTATCTCGGTGTTA batch1_CTTCTCTTCTTAGAGC
## "fibroblast" "fibroblast" "fibroblast"
## batch1_TGACTTTGTCAACTGT batch1_AGTTGGTAGAAGGGTA batch1_TCTGGAAGTCTCTTTA
## "fibroblast" "fibroblast" "endothelial cells"
head(batch.cluster.labels[[2]])
## batch2_GCGCGATAGCGGATCA batch2_ATAACGCTCCAGTAGT batch2_TGTGTTTAGCGTAGTG
## "macrophage" "fibroblast" "fibroblast"
## batch2_TGACTAGAGCTCAACT batch2_TGGTTAGCAAGGTGTG batch2_GGGTTGCCACTCAGGC
## "fibroblast" "fibroblast" "fibroblast"
With harmonized cluster label information for single cells across batches, we implement batch effect correction using iSMNN.
Input object is first constructed following the instruction of Seurat v3 package. See the tutorial from the Seurat website for details.
library(Seurat)
merge <- CreateSeuratObject(counts = cbind(data_iSMNN$batch1.mat, data_iSMNN$batch2.mat), min.cells = 0, min.features = 0)
batch_id <- c(rep("batch1", ncol(data_iSMNN$batch1.mat)),
rep("batch2", ncol(data_iSMNN$batch2.mat)))
names(batch_id) <- colnames(merge)
merge <- AddMetaData(object = merge, metadata = batch_id, col.name = "batch_id")
merge.list <- SplitObject(merge, split.by = "batch_id")
merge.list <- lapply(X = merge.list, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
corrected.results <- iSMNN(object.list = merge.list, batch.cluster.labels = batch.cluster.labels, matched.clusters = c("endothelial cells", "macrophage", "fibroblast"), strategy = "Short.run", iterations = 5, dims = 1:20, npcs = 30, k.filter = 30)
iSMNN function will return a Seurat object that contains the batch-corrected expression matrix for batches.
# Output after correction for batch
corrected.results
## An object of class Seurat
## 4624 features across 900 samples within 2 assays
## Active assay: integrated (1624 features)
## 1 other assay present: RNA
## 2 dimensional reductions calculated: pca, umap
Yang, Y., Li, G., Xie, Y., Wang, L, Yang, Y., Liu, J., Qian, L., Li., Y. (2020) iSMNN: Batch Effect Correction for Single-cell RNA-seq data via Iterative Supervised Mutual Nearest Neighbor Refinement. biorxiv, 2020.
Some functions are borrowed from or executed according to the Seurat v3 package (Sturat et al., 2019).