SAFE-clustering: Single-cell Aggregated (From Ensemble) Clustering for Single-cell RNA-seq Data

What You Need


Installation

After downloading, for example, SAFE_2.1_MacOS.tar.gz into a chosen local folder "local_path":
    1. Start Terminal and use the following two commands to make the two programs, gpmetis and shmetis, executable:
       chmod 740 local_path/gpmetis
       and
       chmod 740 local_path/shmetis
    2. Start the R environment.
    3. Use the following R commands to load individual_clustering.R and SAFE.R:
       source("local_path/individual_clustering.R")
       and
       source("local_path/SAFE.R")
    Note that the scater, SC3, cidr, e1071, Seurat, dplyr, Matrix, Rtsne and ADPclust packages need to be installed before using individual_clustering.R, and the bit64, stringr and Matrix packages need to be installed before using SAFE.R. Our implementation used the following package versions: scater 1.6.1, SC3 1.7.2, cidr 0.1.5, Seurat 2.1.0, Rtsne 0.13 and ADPclust 0.7. In our experience, SC3 and Seurat are updated periodically, which can cause some functions and options to change.
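
Before sourcing the two scripts, it can help to confirm that the required packages are present. The following is a minimal sketch in base R; the helper name missing_packages is hypothetical and not part of SAFE, and it checks only whether a package is installed, not whether its version matches the ones listed above.

```r
# Hypothetical helper (not part of SAFE): return the packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages named in the note above.
individual_pkgs <- c("scater", "SC3", "cidr", "e1071", "Seurat",
                     "dplyr", "Matrix", "Rtsne", "ADPclust")
safe_pkgs <- c("bit64", "stringr", "Matrix")

not_installed <- missing_packages(unique(c(individual_pkgs, safe_pkgs)))
if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
}
```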

How to Run

Our SAFE-clustering package contains two functions, individual_clustering.R and SAFE.R. individual_clustering.R performs individual clustering using four state-of-the-art methods (SC3, CIDR, Seurat and t-SNE + k-means), and SAFE.R is the main function, which combines the individual clustering results into one consensus clustering using three hypergraph-based partitioning algorithms: HGPA, MCLA and CSPA. gpmetis (Karypis and Kumar 1998) is employed for MCLA and CSPA partitioning, and shmetis (Karypis et al. 1997) is used for HGPA partitioning.
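
To make the hypergraph idea concrete: in the cluster-ensemble formulation that HGPA, MCLA and CSPA come from, each cluster of each individual solution is treated as a hyperedge over the cells. The sketch below builds that binary cell-by-hyperedge indicator matrix for a toy input in base R. It is an illustration only, not the code in SAFE.R; the function name hyperedge_matrix and the toy labels are hypothetical.

```r
# Hypothetical illustration (not SAFE.R itself): one indicator column
# (hyperedge) per cluster of each individual solution.
hyperedge_matrix <- function(cluster_results) {
  blocks <- lapply(seq_len(nrow(cluster_results)), function(i) {
    labels <- cluster_results[i, ]
    ids <- sort(unique(labels))
    # Cells in rows, clusters of method i in columns (1 = cell is in cluster).
    block <- outer(labels, ids, "==") * 1L
    colnames(block) <- paste0("method", i, "_cluster", ids)
    block
  })
  do.call(cbind, blocks)
}

# Toy input: 2 methods (rows) x 6 cells (columns).
toy <- rbind(c(1, 1, 1, 2, 2, 2),
             c(1, 1, 2, 2, 3, 3))
H <- hyperedge_matrix(toy)

# Co-association between cells: fraction of methods that place two cells
# in the same cluster (the kind of similarity a CSPA-style partitioning uses).
S <- (H %*% t(H)) / nrow(toy)
```

In this toy case, cells 1 and 2 are grouped together by both methods, while cells 3 and 4 are grouped together by only one of them, which is what the entries of S reflect.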

Usage of individual_clustering.R

  • individual_clustering(inputTags, datatype = "count", SC3 = TRUE, gene_filter = FALSE, svm_max = 5000, CIDR = TRUE, Seurat = TRUE, nPC = NULL, resolution = 0.9, seurat_min_cell = 200, resolution_min = 1.2, tSNE = TRUE, var_genes = NULL, SEED = 1)

Input of individual_clustering.R

  • inputTags: input data matrix, where rows correspond to genes and columns correspond to cells.
  • datatype: Type of input data, one of "count", "CPM", "RPKM" or "FPKM". Default is "count".
  • SC3: A boolean parameter that defines whether to cluster cells using SC3 method. Default is "TRUE".
  • gene_filter: Whether to perform gene filtering before SC3 clustering, when SC3 = TRUE. Default is "FALSE".
  • svm_max: Minimum number of cells above which SVM will be run, when SC3 = TRUE. Default is 5000.
  • CIDR: A boolean parameter that defines whether to cluster cells using CIDR method. Default is "TRUE".
  • Seurat: A boolean parameter that defines whether to cluster cells using Seurat method. Default is "TRUE".
  • nPC: Number of principal components used in Seurat clustering, when Seurat = TRUE. Default value is estimated by the nPC function of the CIDR package.
  • resolution: Value of resolution used in Seurat clustering, when Seurat = TRUE. Default is 0.9.
  • seurat_min_cell: Minimum number of cells in the input dataset below which resolution is set to resolution_min, when Seurat = TRUE. Default is 200.
  • resolution_min: Resolution used in Seurat clustering for small datasets, when Seurat = TRUE and the number of cells in the input is < seurat_min_cell. Default is 1.2.
  • tSNE: A boolean parameter that defines whether to cluster cells using t-SNE + k-means method. Default is "TRUE".
  • var_genes: Number of variable genes used by tSNE analysis, when tSNE = TRUE.
  • SEED: Seed of the random number generator. Setting the seed to a fixed value can produce reproducible clustering results.
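
The interplay of resolution, seurat_min_cell and resolution_min described above can be sketched as follows. The helper pick_resolution is hypothetical and only mirrors the documented behaviour with the default values from the usage line; the actual logic lives inside individual_clustering.R.

```r
# Hypothetical helper mirroring the parameter descriptions above.
pick_resolution <- function(n_cells, resolution = 0.9,
                            seurat_min_cell = 200, resolution_min = 1.2) {
  # Small datasets (fewer cells than seurat_min_cell) use resolution_min.
  if (n_cells < seurat_min_cell) resolution_min else resolution
}

pick_resolution(150)  # small dataset: resolution_min is used
pick_resolution(500)  # larger dataset: resolution is used
```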

Output of individual_clustering.R

Returns a matrix of individual clustering results, where rows correspond to individual methods and columns correspond to cells.


Usage of SAFE.R

  • SAFE(cluster_results, k_min = NULL, k_max = NULL, MCLA = TRUE, HGPA = FALSE, CSPA = FALSE, cspc_cell_max = NULL, SEED = 1)

Input of SAFE.R

  • cluster_results: input data matrix of individual clustering results, where rows correspond to individual methods and columns correspond to cells.
  • k_min: The minimum cluster number estimated by all the single solutions. Default is 2.
  • k_max: Maximum number of clusters used for ensemble clustering. Default is the maximum cluster number estimated by all the single solutions.
  • MCLA: A boolean parameter that defines whether to use MCLA algorithm for ensemble clustering. Default is "TRUE".
  • HGPA: A boolean parameter that defines whether to use HGPA algorithm for ensemble clustering. Default is "FALSE".
  • CSPA: A boolean parameter that defines whether to use CSPA algorithm for ensemble clustering. Default is "FALSE".
  • cspc_cell_max: Maximum number of cells above which CSPA is not run, when CSPA = TRUE.
  • SEED: Seed of the random number generator. Setting the seed to a fixed value can produce reproducible clustering results for MCLA and CSPA algorithms.
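
For orientation, a cluster_results matrix of the expected shape can be built by hand. The sketch below uses made-up labels for four methods and eight cells and shows how the documented default for k_max (the largest cluster number among the single solutions) could be derived. The row names and the row order are assumptions for readability, not requirements stated here.

```r
# Made-up labels: 4 methods (rows) x 8 cells (columns).
cluster_results <- rbind(
  SC3    = c(1, 1, 1, 2, 2, 2, 3, 3),
  CIDR   = c(1, 1, 2, 2, 2, 2, 3, 3),
  Seurat = c(1, 1, 1, 1, 2, 2, 2, 2),
  tSNE   = c(1, 1, 1, 2, 2, 3, 4, 4)
)

# Number of clusters found by each individual method.
k_per_method <- apply(cluster_results, 1, function(x) length(unique(x)))

# Documented default for k_max: the largest of these counts.
k_max_default <- max(k_per_method)
```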

Output of SAFE.R

Returns a list of the overall optimal ensemble clustering and cluster number, as well as the optimal results for each algorithm:

  • HGPA/MCLA/CSPA_ANMI: optimal ANMI for the HGPA/MCLA/CSPA algorithm
  • HGPA/MCLA/CSPA: optimal ensemble clustering result for HGPA/MCLA/CSPA algorithm determined by ANMI
  • HGPA/MCLA/CSPA_optimal_k: optimal number of clusters for HGPA/MCLA/CSPA algorithm
  • optimal_clustering: optimal ensemble clustering result determined by ANMI
  • optimal_k: optimal number of clusters
  • Summary: A summary statement of the ensemble clustering result
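
ANMI (average normalized mutual information) scores how well one consensus labelling agrees with the individual solutions. The sketch below is a base-R illustration that assumes normalization by the geometric mean of the two entropies; SAFE.R's own computation may differ in detail, and the function names entropy, nmi and anmi are hypothetical.

```r
# Shannon entropy of a probability vector (zero entries contribute nothing).
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Normalized mutual information between two labelings (assumed normalization:
# mutual information divided by sqrt(H(x) * H(y))).
nmi <- function(x, y) {
  joint <- table(x, y) / length(x)
  hx <- entropy(rowSums(joint))
  hy <- entropy(colSums(joint))
  if (hx == 0 || hy == 0) return(0)  # a single-cluster labeling carries no information
  mi <- hx + hy - entropy(as.vector(joint))
  mi / sqrt(hx * hy)
}

# Average NMI between a consensus labeling and every individual solution (rows).
anmi <- function(consensus, cluster_results) {
  mean(apply(cluster_results, 1, function(row) nmi(consensus, row)))
}

# Made-up example: both solutions agree with the consensus up to label names.
toy <- rbind(c(1, 1, 1, 2, 2, 2),
             c(2, 2, 2, 1, 1, 1))
anmi(c(1, 1, 1, 2, 2, 2), toy)
```

Because the score depends only on how cells are grouped, not on the label values, relabelled but identical partitions receive the highest score.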

Example

  • Download expression_matrix.csv (32738 genes and 500 cells of three cell types, CD56 NK cells, B cells and regulatory T cells, from 10X Genomics (Zheng et al. 2016)), and celltype.info.csv (list of cell type and cluster information for the 500 cells).
  • Prepare input for the individual_clustering.R function using the following R code:
    expr.mat <- read.csv("local_path/expression_matrix.csv", header = T)
    rownames(expr.mat) <- expr.mat[,1]
    expr.mat <- expr.mat[,-1]
  • Run individual clustering and store the output: cluster.result <- individual_clustering(inputTags=expr.mat, SEED=123)
  • Take individual_clustering.R output for SAFE clustering:

    cluster.ensemble <- SAFE(cluster_results=cluster.result, SEED=123)
  • Calculate the ARI (Adjusted Rand Index) for both the ensemble clustering and the individual clustering results, for example SC3, using the adjustedRandIndex function from the CIDR package (Lin et al. 2017):

    celltype.info <- read.csv("local_path/celltype.info.csv")
    ARI.SAFE = adjustedRandIndex(cluster.ensemble$optimal_clustering, celltype.info$cluster)
    ARI.SC3 = adjustedRandIndex(cluster.result[1,], celltype.info$cluster)
  • A more detailed example, which includes producing the partially sequenced data, can be found in the R documentation of SAFE.R.
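
To show what the ARI comparison measures, the sketch below is a base-R version of the standard pair-counting formula, applied to every row of a made-up result matrix. It is for illustration only: the function name ari_sketch and the toy labels are hypothetical, and the example above relies on adjustedRandIndex rather than on this code.

```r
# Hypothetical base-R version of the Adjusted Rand Index (pair-counting form).
ari_sketch <- function(x, y) {
  tab <- table(x, y)
  pairs_same_both <- sum(choose(tab, 2))       # pairs grouped together in x and y
  pairs_x <- sum(choose(rowSums(tab), 2))      # pairs grouped together in x
  pairs_y <- sum(choose(colSums(tab), 2))      # pairs grouped together in y
  expected <- pairs_x * pairs_y / choose(length(x), 2)
  max_index <- (pairs_x + pairs_y) / 2
  if (max_index == expected) return(1)         # degenerate case, e.g. one cluster in both
  (pairs_same_both - expected) / (max_index - expected)
}

# Made-up labels: score every individual method (rows) against a reference.
truth <- c(1, 1, 1, 2, 2, 2)
toy <- rbind(method_a = c(2, 2, 2, 1, 1, 1),
             method_b = c(1, 1, 2, 2, 1, 2))
ari_by_method <- apply(toy, 1, ari_sketch, y = truth)
```

Here method_a reproduces the reference partition under different label names, so it scores higher than method_b, which mixes the two groups.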