The current implementation of SMNN encompasses two major steps: one optional clustering step and the other batch effect correction step. In the first step, SMNN takes the expression matrix as input, and performs clustering using Seurat v. 3.0 (Butler et al., 2018). Corresponding clusters/cell types are then matched across batches based on marker genes specified by the user. This entire clustering step can be by-passed by feeding SMNN cell cluster labels. With cell cluster label information, SMNN searches mutual nearest neighbors within each cell type, and performs batch effect correction using SMNN function.
In this tutorial, we will perform batch effect correction using our SMNN package on a toy example of two batches. The first batch contains 400 cells from three cell types, namely fibroblasts, macrophages and endothelial cells. And the second batches has 500 cells from the same three cell types. Both two batches contain expression information of the same 3000 genes.
library("SMNN")
data("data_SMNN")
dim(data_SMNN$batch1.mat)
## [1] 3000 400
data_SMNN$batch1.mat[1:5, 1:5]
## batch1_CAACTAGTCGCGCCAA batch1_CATTATCTCGGTGTTA
## Smr3a 0 0
## Kif18a 0 0
## Arhgap25 0 0
## Msl3l2 0 0
## Gm11060 0 0
## batch1_CTTCTCTTCTTAGAGC batch1_TGACTTTGTCAACTGT
## Smr3a 0 0
## Kif18a 0 0
## Arhgap25 0 0
## Msl3l2 0 0
## Gm11060 0 0
## batch1_AGTTGGTAGAAGGGTA
## Smr3a 0
## Kif18a 0
## Arhgap25 0
## Msl3l2 0
## Gm11060 0
dim(data_SMNN$batch2.mat)
## [1] 3000 500
Two pieces of information are needed: - Marker genes - Corresponding cell type labels for each marker gene
# Maker genes
markers <- c("Col1a1", "Pdgfra", "Ptprc", "Pecam1")
# Corresponding cell type labels for each marker gene
cluster.info <- c(1, 1, 2, 3)
The function unifiedClusterLabelling is used to match/harmonize the clusters/cell type labels across multiple scRNA-seq batches. It takes as input raw expression matrices from two or more batches, a list of marker genes and their corresponding cluster labels, and outputs harmonized cluster label for every single cells across all batches.
matched_clusters <- unifiedClusterLabelling(data_SMNN$batch1.mat, data_SMNN$batch2.mat, features.use = markers, cluster.labels = cluster.info, min.perc = 0.3)
In the SMNNcorrect function, we call several python functions from mnnpy package at chriscainx/mnnpy to speed up the computation.
Package mnnpy can be installed with pip install mnnpy,
or
git clone https://github.com/chriscainx/mnnpy.git
cd mnnpy
pip install .
To ensure successful execution of these python functions, we need to first set the python version (we recommand python3 for SMNN version 0.99).
library(reticulate)
# please change the path below to your local path for python
use_python("/nas/longleaf/apps/python/3.5.1/bin/python3")
With harmonized cluster label information for single cells across batches, we perform batch effect correction. Specifically, we apply cosine normalization on both input and output data and set the number of mutual nearest neighbors at 20.
corrected.results <- SMNNcorrect(data_SMNN$batch1.mat, data_SMNN$batch2.mat, batch1.cluster.labels = matched_clusters[[1]], batch2.cluster.labels = matched_clusters[[2]], num.defined.clusters = 3, k=20, sigma=1, cos.norm.in=TRUE, cos.norm.out=TRUE)
## [1] "Data preparation ..."
## [1] "Finding the mutual nearest neighbors for cell type 1 ..."
## [1] "Finding the mutual nearest neighbors for cell type 2 ..."
## [1] "Finding the mutual nearest neighbors for cell type 3 ..."
## [1] "Computing the correction vector ..."
## [1] "Adjusting the shift variance ..."
## [1] "Done!"
SMNNcorrect function will output the following information: (1) the batch-corrected expression matrix for each batch; and (2) information regarding mutual nearest neighbors.
In the example below, we treat batch 1 as the reference batch and batch 2 as the batch to be corrected (such that batch 2 will be corrected towards the reference batch 1). Note that the reference batch (i.e., batch 1 in our example) will only applied cosine normalization.
# Output after correction for batch
## Output (1): the batch-corrected expression matrix
corrected.results$corrected[[2]][1:10,1:10]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [2,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [3,] 1.638888e-04 2.107030e-04 1.205100e-04 1.458903e-04 1.525649e-04
## [4,] 7.548719e-04 1.119266e-03 7.507278e-04 8.083787e-04 8.227947e-04
## [5,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [6,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [7,] -4.434681e-05 -4.541523e-05 -4.175551e-05 -3.607593e-05 -3.597941e-05
## [8,] -4.795335e-05 1.071849e-02 -6.286728e-05 -5.533809e-05 -5.295780e-05
## [9,] 1.823547e-03 2.457465e-03 1.684354e-03 1.770569e-03 1.816802e-03
## [10,] -9.260156e-05 -1.307756e-04 -1.003363e-04 -9.577188e-05 -9.565618e-05
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [2,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [3,] 1.324284e-04 1.584435e-04 1.349946e-04 1.399951e-04 1.502947e-04
## [4,] 7.764454e-04 8.197435e-04 7.654365e-04 8.055797e-04 8.274734e-04
## [5,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [6,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [7,] -4.323060e-05 -2.805579e-05 -3.423085e-05 -4.693237e-05 -3.872993e-05
## [8,] -6.188545e-05 -4.968855e-05 -5.709170e-05 -5.778073e-05 -5.260472e-05
## [9,] 1.723383e-03 1.067772e-02 1.695001e-03 8.738493e-03 1.840756e-03
## [10,] -9.931511e-05 -9.317486e-05 -9.538336e-05 -9.881046e-05 -9.669906e-05
## Output (2): mutual nearest neighbor information
corrected.results$pairs[[2]]
## DataFrame with 2453 rows and 3 columns
## current.cell other.cell other.batch
## <integer> <Rle> <Rle>
## 1 2 64 1
## 2 2 98 1
## 3 3 65 1
## 4 4 146 1
## 5 5 118 1
## ... ... ... ...
## 2449 498 274 1
## 2450 498 262 1
## 2451 498 344 1
## 2452 498 170 1
## 2453 499 367 1