SAFE-clustering: Single-cell Aggregated (From Ensemble) Clustering for Single-cell RNA-seq Data

What You Need

Installation

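After downloading and unpacking a release, for example SAFE_2.1_MacOS.tar.gz, into a chosen local folder "local_path", make the gpmetis and shmetis binaries executable from a terminal:

chmod 740 local_path/gpmetis

chmod 740 local_path/shmetis

Then load the two functions in an R session:

source("local_path/individual_clustering.R")

source("local_path/SAFE.R")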
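Before sourcing the scripts, it can help to confirm that the expected files are in place. The following is a minimal sketch, assuming the four files sit directly under local_path as in the commands above; check_safe_files is a hypothetical helper, not part of the SAFE-clustering package.

```r
# Hypothetical helper (not part of SAFE-clustering): report whether the files
# named in the installation steps exist under local_path, and whether they
# are executable. Only gpmetis and shmetis need the execute permission.
check_safe_files <- function(local_path) {
  files <- c("gpmetis", "shmetis", "individual_clustering.R", "SAFE.R")
  paths <- file.path(local_path, files)
  data.frame(
    file = files,
    exists = file.exists(paths),
    # file.access() returns 0 when the requested access (mode 1 = execute) is allowed
    executable = unname(file.access(paths, mode = 1) == 0),
    stringsAsFactors = FALSE
  )
}

# Replace "local_path" with the folder chosen during installation.
print(check_safe_files("local_path"))
```

If gpmetis or shmetis is listed as existing but not executable, rerunning the chmod commands should fix it.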

How to Run

Our SAFE-clustering package contains two functions, individual_clustering.R and SAFE.R. individual_clustering.R performs individual clustering using four state-of-the-art methods (SC3, CIDR, Seurat and t-SNE + k-means), and SAFE.R is the main function, which combines the individual clustering results into one consensus clustering using three hypergraph-based partitioning algorithms: HGPA, MCLA and CSPA. gpmetis (Karypis and Kumar 1998) is used for MCLA and CSPA partitioning, and shmetis (Karypis et al. 1997) is used for HGPA partitioning.

Usage of individual_clustering.R

individual_clustering(inputTags, datatype = "count", SC3 = TRUE, gene_filter = FALSE, svm_max = 5000, CIDR = TRUE, Seurat = TRUE, nPC = NULL, resolution = 0.9, seurat_min_cell = 200, resolution_min = 1.2, tSNE = TRUE, var_genes = NULL, SEED = 1)

Input of individual_clustering.R

inputTags: input data matrix, where rows correspond to genes and columns correspond to cells.
datatype: Type of input data, which could be "count", "CPM", "RPKM" and "FPKM". Default is "count".
SC3: A boolean parameter that defines whether to cluster cells using SC3 method. Default is "TRUE".
gene_filter: Whether to perform gene filtering before SC3 clustering, when SC3 = TRUE. Default is "FALSE".
svm_max: Minimum number of cells above which SVM will be run, when SC3 = TRUE. Default is 5000.
CIDR: A boolean parameter that defines whether to cluster cells using CIDR method. Default is "TRUE".
Seurat: A boolean parameter that defines whether to cluster cells using Seurat method. Default is "TRUE".
nPC: Number of principal components used in Seurat clustering, when Seurat = TRUE. The default value is estimated by the nPC function of the CIDR package.
resolution: Value of resolution used in Seurat clustering, when Seurat = TRUE. Default is 0.9.
seurat_min_cell: Minimum number of cells in the input dataset below which resolution is set to resolution_min, when Seurat = TRUE. Default is 200.
resolution_min: Resolution used in Seurat clustering for small datasets, when Seurat = TRUE and the number of cells in the input is < seurat_min_cell. Default is 1.2.
tSNE: A boolean parameter that defines whether to cluster cells using t-SNE + k-means method. Default is "TRUE".
var_genes: Number of variable genes used by tSNE analysis, when tSNE = TRUE.
SEED: Seed of the random number generator. Setting the seed to a fixed value can produce reproducible clustering results.
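The interaction between resolution, seurat_min_cell and resolution_min can be sketched as follows. This is an illustration of the rule as documented above, using the default values from the usage line; choose_resolution and the toy matrix are hypothetical and are not the package's internal code.

```r
# Hypothetical illustration of the documented rule: datasets with fewer than
# seurat_min_cell cells use resolution_min, all others use resolution.
choose_resolution <- function(n_cells, resolution = 0.9,
                              seurat_min_cell = 200, resolution_min = 1.2) {
  if (n_cells < seurat_min_cell) resolution_min else resolution
}

# Toy count matrix in the expected layout: genes in rows, cells in columns.
set.seed(1)
toy_counts <- matrix(rpois(50 * 150, lambda = 2), nrow = 50,
                     dimnames = list(paste0("gene", 1:50), paste0("cell", 1:150)))

print(choose_resolution(ncol(toy_counts)))  # 150 cells: below the 200-cell threshold
print(choose_resolution(500))               # 500 cells: at or above the threshold
```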

Output of individual_clustering.R

Returns a matrix of individual clustering results, where rows correspond to individual methods and columns correspond to cells.

Usage of SAFE.R

SAFE(cluster_results, k_min = NULL, k_max = NULL, MCLA = TRUE, HGPA = FALSE, CSPA = FALSE, cspc_cell_max = NULL, SEED = 1)

Input of SAFE.R

cluster_results: input data matrix of individual clustering results, where rows correspond to individual methods and columns correspond to cells.
k_min: Minimum number of clusters used for ensemble clustering. Default is 2.
k_max: Maximum number of clusters used for ensemble clustering. Default is the maximum cluster number estimated by all the single solutions.
MCLA: A boolean parameter that defines whether to use MCLA algorithm for ensemble clustering. Default is "TRUE".
HGPA: A boolean parameter that defines whether to use HGPA algorithm for ensemble clustering. Default is "FALSE".
CSPA: A boolean parameter that defines whether to use CSPA algorithm for ensemble clustering. Default is "FALSE".
cspc_cell_max: Maximum number of cells above which CSPA is not run, when CSPA = TRUE.
SEED: Seed of the random number generator. Setting the seed to a fixed value can produce reproducible clustering results for MCLA and CSPA algorithms.
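To make the expected shape of cluster_results and the default range of cluster numbers concrete, here is a small sketch with made-up labels. The row names and label values are illustrative assumptions, and the k_max computation reflects one reading of the default described above (the largest number of clusters found by any single method); it is not taken from the package source.

```r
# Toy cluster_results matrix: one row per individual method, one column per cell.
# Row names and labels are made up for illustration.
cluster_results <- rbind(
  SC3    = c(1, 1, 2, 2, 3, 3),
  CIDR   = c(1, 1, 1, 2, 2, 2),
  Seurat = c(1, 1, 2, 2, 2, 2),
  tSNE   = c(1, 2, 2, 3, 3, 4)
)

# Number of clusters each individual solution produced.
k_per_method <- apply(cluster_results, 1, function(labels) length(unique(labels)))

# Documented defaults: k_min is 2; k_max is assumed here to be the largest
# cluster number among the individual solutions.
k_min <- 2
k_max <- max(k_per_method)

print(k_per_method)
print(c(k_min = k_min, k_max = k_max))
```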

Output of SAFE.R

Returns a list containing the overall optimal ensemble clustering and cluster number, as well as the optimal results for each algorithm:

HGPA/MCLA/CSPA_ANMI: optimal ANMI (average normalized mutual information) for the HGPA/MCLA/CSPA algorithm
HGPA/MCLA/CSPA: optimal ensemble clustering result for the HGPA/MCLA/CSPA algorithm, determined by ANMI
HGPA/MCLA/CSPA_optimal_k: optimal number of clusters for the HGPA/MCLA/CSPA algorithm
optimal_clustering: optimal ensemble clustering result determined by ANMI
optimal_k: optimal number of clusters
Summary: A summary statement of the ensemble clustering result
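Since the optimal results are selected by ANMI, a rough sketch of how such a score can be computed may help when interpreting the output. The code below assumes the common definition of normalized mutual information, NMI = I(A;B) / sqrt(H(A) H(B)), averaged over the individual solutions; the function names are hypothetical and this is not the SAFE.R implementation.

```r
# Shannon entropy of a probability vector (zero entries are skipped).
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Normalized mutual information between two label vectors of equal length,
# assuming the normalization I(A;B) / sqrt(H(A) * H(B)).
nmi <- function(a, b) {
  joint <- table(a, b) / length(a)
  h_a <- entropy(rowSums(joint))
  h_b <- entropy(colSums(joint))
  if (h_a == 0 || h_b == 0) return(0)  # a single-cluster labeling carries no information
  mi <- h_a + h_b - entropy(as.vector(joint))
  mi / sqrt(h_a * h_b)
}

# Average NMI between a consensus labeling and each row of a
# methods-by-cells matrix of individual clustering results.
anmi <- function(consensus, cluster_results) {
  mean(apply(cluster_results, 1, function(labels) nmi(consensus, labels)))
}

# Toy example with made-up labels.
individual <- rbind(c(1, 1, 2, 2), c(2, 2, 1, 1), c(1, 2, 1, 2))
print(anmi(c(1, 1, 2, 2), individual))
```

Under this definition, a consensus labeling that agrees with most individual solutions scores close to 1, and one that is unrelated to them scores close to 0.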

Example

Download expression_matrix.csv (32738 genes and 500 cells of three cell types, CD56+ NK cells, B cells and regulatory T cells, from 10X Genomics (Zheng et al. 2016)), and celltype.info.csv (list of cell type and cluster information for the 500 cells).
Prepare the input for the individual_clustering.R function using the following R code:
expr.mat <- read.csv("local_path/expression_matrix.csv", header = TRUE)
rownames(expr.mat) <- expr.mat[,1]
expr.mat <- expr.mat[,-1]
Run the individual clustering methods:
cluster.result <- individual_clustering(inputTags=expr.mat, SEED=123)
Pass the individual_clustering.R output to SAFE for ensemble clustering:

cluster.ensemble <- SAFE(cluster_results=cluster.result, SEED=123)
Calculate the ARI (Adjusted Rand Index) for both the ensemble clustering and the individual clustering results (SC3, for example) using the adjustedRandIndex function from the CIDR package (Lin et al. 2017):

celltype.info <- read.csv("local_path/celltype.info.csv")
ARI.SAFE = adjustedRandIndex(cluster.ensemble$optimal_clustering, celltype.info$cluster)
ARI.SC3 = adjustedRandIndex(cluster.result[1,], celltype.info$cluster)
A more detailed example, which includes producing the partially sequenced data, can be found in the R documentation of SAFE.R.
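For readers who want to see what the ARI measures, the standard formula can be written in a few lines of base R. This is a minimal sketch for illustration only; ari_sketch is a hypothetical name, and the adjustedRandIndex function used above remains the one to rely on in practice.

```r
# Adjusted Rand Index from the contingency table of two label vectors.
# Illustrative re-implementation of the standard formula.
ari_sketch <- function(x, y) {
  tab <- table(x, y)
  if (all(dim(tab) == c(1, 1))) return(1)  # both labelings put every cell in one cluster
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))           # pairs grouped together in both labelings
  sum_rows  <- sum(choose(rowSums(tab), 2))  # pairs grouped together in x
  sum_cols  <- sum(choose(colSums(tab), 2))  # pairs grouped together in y
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Cluster labels are arbitrary: relabeling the same partition still gives full agreement.
print(ari_sketch(c(1, 1, 2, 2), c(2, 2, 1, 1)))
```

Because the index is adjusted for chance, it is 1 for identical partitions and can fall to 0 or below for partitions that agree no better than random assignment.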

The University of North Carolina at Chapel Hill
