SAME-clustering: Single-cell RNA-seq Aggregated clustering via Mixture model Ensemble

What You Need


Installation

After downloading the package (for example, SAME_1.0.tar.gz) into a chosen local folder "local_path":
    1. Start the R environment.
    2. Use the following R commands to load individual_clustering.R and SAME.R:
       setwd("local_path")
       source("individual_clustering.R")

       and
       source("SAME.R")
    Note that the scater, SC3, cidr, Seurat, Rtsne, and ADPclust packages need to be installed before using individual_clustering.R, and the inline and Rcpp packages need to be installed before using SAME.R. Our implementation used the following package versions: scater 1.6.1, SC3 1.7.2, cidr 0.1.5, Seurat 2.1.0, Rtsne 0.13, and ADPclust 0.7. In our experience, SC3 and Seurat are updated periodically, which can cause some functions and options to change.
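
Before sourcing the two scripts, it can help to check which of the packages named above are already available. The following is a minimal sketch using only base R; the variable names (required, is_installed, missing_pkgs) are illustrative and not part of SAME-clustering.

```r
# Packages named in the installation note above.
required <- c("scater", "SC3", "cidr", "Seurat", "Rtsne", "ADPclust",
              "inline", "Rcpp")

# requireNamespace() returns FALSE (instead of raising an error)
# when a package cannot be loaded.
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  # Print installed versions so they can be compared with the versions above.
  for (pkg in required) {
    message(pkg, " ", as.character(packageVersion(pkg)))
  }
}
```

If the installed versions differ from those listed above, some SC3 or Seurat options may behave differently.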

How to Run

Our SAME-clustering package contains two functions, individual_clustering.R and SAME.R. individual_clustering.R performs individual clustering using four state-of-the-art methods, and SAME.R is the main function, which combines the individual clustering results into one consensus clustering using a multinomial mixture model fitted with an EM algorithm written in Rcpp (EM.cpp).

Usage of individual_clustering.R

  • individual_clustering(inputTags, datatype = "count", SC3 = TRUE, gene_filter = FALSE, svm_max = 5000, CIDR = TRUE, Seurat = TRUE, nPC = NULL, resolution = 0.9, seurat_min_cell = 200, resolution_min = 1.2, tSNE = TRUE, var_genes = NULL, SEED = 1)

Input of individual_clustering.R

  • inputTags: input data matrix, where rows correspond to genes and columns correspond to cells.
  • datatype: Type of input data, which could be "count", "CPM", "RPKM" and "FPKM". Default is "count".
  • SC3: A boolean parameter that defines whether to cluster cells using SC3 method. Default is "TRUE".
  • gene_filter: Whether to perform gene filtering before SC3 clustering, when SC3 = TRUE. Default is "FALSE".
  • svm_max: Minimum number of cells above which SVM will be run, when SC3 = TRUE. Default is 5000.
  • CIDR: A boolean parameter that defines whether to cluster cells using CIDR method. Default is "TRUE".
  • Seurat: A boolean parameter that defines whether to cluster cells using Seurat method. Default is "TRUE".
  • nPC: Number of principal components used in Seurat clustering, when Seurat = TRUE. By default (NULL), the value is estimated by the nPC function of the CIDR package.
  • resolution: Value of resolution used in Seurat clustering, when Seurat = TRUE. Default is 0.9.
  • seurat_min_cell: Minimum number of cells in the input dataset; below this number, resolution is set to resolution_min (1.2 by default), when Seurat = TRUE. Default is 200.
  • resolution_min: Resolution used in Seurat clustering for small datasets, when Seurat = TRUE and the number of cells in the input is < seurat_min_cell. Default is 1.2.
  • tSNE: A boolean parameter that defines whether to cluster cells using t-SNE + k-means method. Default is "TRUE".
  • var_genes: Number of variable genes used by the tSNE analysis, when tSNE = TRUE. Default is NULL.
  • SEED: Seed of the random number generator. Setting the seed to a fixed value makes the clustering results reproducible. Default is 1.
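
The interaction between resolution, seurat_min_cell, and resolution_min described above can be sketched as a small rule. This is only an illustration of the documented behaviour; pick_resolution is a hypothetical helper, not a function in individual_clustering.R.

```r
# Hypothetical helper: choose the Seurat resolution from the number of cells.
# Default values are copied from the usage line of individual_clustering().
pick_resolution <- function(n_cells, resolution = 0.9,
                            seurat_min_cell = 200, resolution_min = 1.2) {
  if (n_cells < seurat_min_cell) {
    resolution_min   # small dataset: use the alternative resolution
  } else {
    resolution       # otherwise keep the regular resolution
  }
}

pick_resolution(150)  # fewer than 200 cells, so resolution_min applies
pick_resolution(500)  # 200 cells or more, so resolution applies
```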

Output of individual_clustering.R

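Returns a matrix of individual clustering results, with one row per clustering method and one column per cell (in the Example section, rows 1 to 4 are SC3, CIDR, Seurat, and tSNE).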


Usage of SAME.R

  • SAME(Y, MAX, seed = 1, rep = 3)

Input of SAME.R

  • Y: Input data matrix of individual clustering results, where rows correspond to single cells and columns correspond to the cluster results of individual methods.
  • MAX: Maximum number of clusters used for ensemble clustering. Default is the maximum cluster number estimated by all the single solutions. Clustering is done for 2 to MAX clusters.
  • seed: Seed of the random number generator, which makes the results reproducible. The usage above lists a default of 1; the Example section sets it to 123.
  • rep: Number of EM repetitions carried out for each number of clusters in the range 2:MAX. The result with the maximum likelihood is carried forward. Default is 3.

Output of SAME.R

Returns a list containing the overall optimal ensemble clustering and the number of clusters, as selected by both the AIC and BIC criteria.

  • BICcluster: cluster results chosen by BIC criterion
  • AICcluster: cluster results chosen by AIC criterion
  • final_k_BIC: number of clusters determined by BIC
  • final_k_AIC: number of clusters determined by AIC
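
To make the idea behind SAME.R more concrete, the sketch below fits a multinomial mixture model by EM in plain R and chooses the number of clusters by BIC. It is an illustrative re-implementation under stated assumptions, not the package's EM.cpp: the function names (em_multinomial_mixture, select_k), the random initialisation, the smoothing constant, the fixed iteration count, and the parameter count used for AIC/BIC are all assumptions. Labels in each column of Y are assumed to be integers 1, 2, ..., L.

```r
# EM for a multinomial mixture on a cells x methods matrix of cluster labels.
em_multinomial_mixture <- function(Y, K, n_iter = 100, seed = 1) {
  set.seed(seed)
  N <- nrow(Y); M <- ncol(Y)
  L <- apply(Y, 2, max)                      # number of labels per method
  resp <- matrix(runif(N * K), N, K)         # random soft assignments
  resp <- resp / rowSums(resp)
  for (iter in seq_len(n_iter)) {
    # M-step: mixing weights and per-method label probabilities.
    pi_k <- colMeans(resp)
    log_lik <- matrix(log(pi_k), N, K, byrow = TRUE)
    for (m in seq_len(M)) {
      ind <- outer(Y[, m], seq_len(L[m]), "==") * 1   # N x L_m indicators
      theta <- t(resp) %*% ind + 1e-6                 # K x L_m, smoothed
      theta <- theta / rowSums(theta)
      log_lik <- log_lik + t(log(theta))[Y[, m], , drop = FALSE]
    }
    # E-step: responsibilities via a numerically stable softmax.
    mx <- apply(log_lik, 1, max)
    w <- exp(log_lik - mx)
    resp <- w / rowSums(w)
  }
  loglik <- sum(mx + log(rowSums(w)))
  n_par <- (K - 1) + K * sum(L - 1)          # assumed free-parameter count
  list(K = K, cluster = max.col(resp, ties.method = "first"),
       loglik = loglik,
       AIC = -2 * loglik + 2 * n_par,
       BIC = -2 * loglik + log(N) * n_par)
}

# For each K in 2:MAX keep the best of `rep` runs, then pick K by BIC.
select_k <- function(Y, MAX, rep = 3, seed = 1) {
  stopifnot(MAX >= 2)
  fits <- lapply(2:MAX, function(K) {
    runs <- lapply(seq_len(rep), function(r)
      em_multinomial_mixture(Y, K, seed = seed + r))
    runs[[which.max(vapply(runs, function(f) f$loglik, numeric(1)))]]
  })
  fits[[which.min(vapply(fits, function(f) f$BIC, numeric(1)))]]
}

# Toy input: 60 cells, 4 methods that mostly agree on 3 groups.
set.seed(1)
truth <- rep(1:3, each = 20)
Y_toy <- sapply(1:4, function(m) {
  lab <- truth
  lab[sample(60, 3)] <- sample(1:3, 3, replace = TRUE)
  lab
})
fit <- select_k(Y_toy, MAX = 4)
table(fit$cluster, truth)
```

Replacing which.min over BIC with which.min over AIC gives the AIC-based choice, mirroring the BICcluster/AICcluster pair in the output list.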

Example

  • Use expression_matrix.csv (32738 genes and 500 cells of three cell types: cd56_NK cells, B cells, and regulatory T cells from 10X Genomics (Zheng et al. 2016)) and celltype.info.csv (a list of cell type and cluster information for the 500 cells).
  • Prepare the input for the individual_clustering.R function using the following R code:
    setwd("local_path")
    expr.mat <- read.csv("expression_matrix.csv", header = TRUE)
    rownames(expr.mat) <- expr.mat[,1]
    expr.mat <- expr.mat[,-1]
  • Run the individual clustering:

    cluster.result <- individual_clustering(inputTags=expr.mat, SEED=123)
  • Use the individual_clustering.R output as input for SAME clustering:

    cluster.ensemble <- SAME(Y = t(cluster.result), MAX = max(cluster.result), seed=123, rep = 3)
  • Calculate the ARI (Adjusted Rand Index) for both the ensemble clustering and the individual clustering results (for example, SC3), using the adjustedRandIndex function from the CIDR package (Lin et al. 2017):

    celltype.info <- read.csv("celltype.info.csv")
    ARI_SAME = adjustedRandIndex(cluster.ensemble$BICcluster, celltype.info$cluster)
    ARI_SC3 = adjustedRandIndex(cluster.result[1,], celltype.info$cluster)
    ARI_CIDR = adjustedRandIndex(cluster.result[2,], celltype.info$cluster)
    ARI_Seurat = adjustedRandIndex(cluster.result[3,], celltype.info$cluster)
    ARI_tsne = adjustedRandIndex(cluster.result[4,], celltype.info$cluster)

    > ARI_SAME
    [1] 1
    > ARI_SC3
    [1] 0.8478856
    > ARI_CIDR
    [1] 0.963799
    > ARI_Seurat
    [1] 0.8497503
    > ARI_tsne
    [1] 0.9883755
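
For readers who want to see what the ARI values above measure, here is a minimal base-R sketch of the Adjusted Rand Index computed from a contingency table. The function name ari is hypothetical and this is not the CIDR implementation; it assumes at least one of the two partitions contains more than one cluster, otherwise the denominator is zero.

```r
# Adjusted Rand Index between two label vectors of equal length.
ari <- function(x, y) {
  stopifnot(length(x) == length(y))
  comb2 <- function(n) n * (n - 1) / 2        # number of pairs among n items
  tab <- table(x, y)                          # contingency table of labels
  sum_ij <- sum(comb2(tab))                   # pairs together in both
  sum_a <- sum(comb2(rowSums(tab)))           # pairs together in x
  sum_b <- sum(comb2(colSums(tab)))           # pairs together in y
  expected <- sum_a * sum_b / comb2(length(x))
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

# Identical partitions with permuted label names.
ari(c(1, 1, 2, 2), c(2, 2, 1, 1))
# Partitions that disagree on every pair.
ari(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

The index depends only on which cells are grouped together, not on the label names, which is why cluster numbers from different methods can be compared directly with celltype.info$cluster.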