If you use vmrseq in published research, please cite:
Shen, Ning, and Keegan Korthauer. “vmrseq: Probabilistic Modeling of Single-Cell Methylation Heterogeneity.” bioRxiv, November 21, 2023. https://doi.org/10.1101/2023.11.20.567911.
This package will be submitted to Bioconductor. For now, you can install the development version from Bioconductor (recommended) or our GitHub repository through running this:
### Install stable version from Bioconductor
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("vmrseq")
## Or development version from Github
# install.packages("remotes")
remotes::install_github("nshen7/vmrseq")
After installation, load the vmrseq package:
How to get help for vmrseq: Feel free to raise questions by opening issue(s) in the vmrseq GitHub repository
High-throughput single-cell measurements of DNA methylation allows studying inter-cellular epigenetic heterogeneity, but this task faces the challenges of sparsity and noise. We present vmrseq
, a statistical method that overcomes these challenges to accurately and robustly identify de novo variably methylated regions (VMR) in single-cell DNA methylation data. To the best of our knowledge, vmrseq has been the only method that flexibly pinpoints biologically relevant regions at single-base pair resolution without prior knowledge of size or location. In addition, vmrseq is the only approach capable of detecting VMRs that are sensitive to the level of heterogeneity present in input datasets so far.
vmrseq takes processed and filtered single-cell bisulfite sequencing binary methylation values as input. After processing and filtering, each sequenced CpG takes a value of methylated or unmethylated in each cell.
In the experiments of the vmrseq manuscript, the authors did not carry out the pre-processing but instead, adopted the processed .bam files provided by the original publications of the sequencing protocol and datasets. However, by summarizing the processing steps described in those publications as sources, we offer a flexible guideline intended to serve as a general reference rather than a strict requirement:
We have implemented a poolData
function to process individual-cell read files into a format that is suitable to input to the vmrseq framework. Following is an example of how to use this function.
The poolData
function pools individual-cell CpG read files into a SummarizedExperiment object (with NA-dropped sparseMatrix representation by default). Store the file paths of individual cells to a list, for example, called cell_list
, and input it to poolData
function. poolData
does not give any output but directly writes the SummarizedExperiment object into the writeDir
specified by the user, here is an example: (this step might take long in practice since input datasets usually contain many cells; suggest to parallelize the computation with chromosomes)
Note that in each cell, sites with hemimethylation or intermediate methylation levels (i.e., 0 < meth_read/total_read < 1) will be removed.
Each cell file should be in BED-like format, where the first 5 columns in each file must be: chr, pos, strand, meth_read, total_read, in strict order, for example:
data("cell_1")
head(cell_1)
## chr pos strand mc_count total
## <char> <int> <char> <int> <int>
## 1: chr1 3003885 * 1 1
## 2: chr1 3003898 * 1 1
## 3: chr1 3011550 * 1 1
## 4: chr1 3019920 * 1 1
## 5: chr1 3020794 * 2 2
## 6: chr1 3020814 * 2 2
The SummarizedExperiment
object stores CpG sites as rows and cells as columns. We adopted the NA-dropped sparseMatrix representation to save storage space for large single-cell datasets (see following section for specifics of how this representation look like). This option can be turned off but we recommend to keep it on.
Further, poolData
saves the SummarizedExperiment
object into disk using HDF5
format, which is a backend extension of DelayedArray
. The HDF5/DelayedArray format allows one to perform common array operations on it without loading the object in memory. See packages HDF5Array and DelayedArray for details.
In this vignette, we use a formatted example dataset built in the package that’s ready to be input to model fitting function (which is not accessible to users due to format limitation of internal data of R packages). In usual cases users of this package would use HDF5::loadHDF5SummarizedExperiment
to load saved HDF5SummarizedExperiment objects into R.
toy.se <- HDF5Array::loadHDF5SummarizedExperiment(system.file("extdata", "toy", package = "vmrseq"))
It is a SummarizedExperiment
object with one assays
slot called M_mat
that contains the binary methylation values of individual cells at CpG sites (sites as rows and cells as columns):
toy.se
## class: RangedSummarizedExperiment
## dim: 30000 200
## metadata(0):
## assays(1): M_mat
## rownames: NULL
## rowData names(0):
## colnames(200): V1 V1 ... V1 V1
## colData names(0):
GenomicRanges::granges(toy.se)
## GRanges object with 30000 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] pseudo 3000827 *
## [2] pseudo 3001007 *
## [3] pseudo 3001018 *
## [4] pseudo 3001277 *
## [5] pseudo 3001629 *
## ... ... ... ...
## [29996] pseudo 8523533 *
## [29997] pseudo 8523719 *
## [29998] pseudo 8523783 *
## [29999] pseudo 8523864 *
## [30000] pseudo 8524231 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Moreover, the assay matrix is in NA-dropped sparseMatrix representation (implemented using the recommenderlab::dropNA
function):
SummarizedExperiment::assays(toy.se)$M_mat[5:10, 5:10]
## <6 x 6> sparse DelayedMatrix object of type "double":
## V1 V1 V1 V1 V1
## [1,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [2,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [3,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [4,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [5,] 0.000000e+00 0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00
## [6,] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## V1
## [1,] 0.000000e+00
## [2,] 0.000000e+00
## [3,] 0.000000e+00
## [4,] 0.000000e+00
## [5,] 0.000000e+00
## [6,] 2.225074e-308
In particular, the NAs in the dataset are represented by 0 (or 0.000000e+00 as shown above), and the 0’s are represented in a very small positive real number (2.225074e-308 as shown above). This allows the matrix to be stored in a sparseMatrix
format so that it takes less disk storage. One can use the vmrseq::HDF5NAdrop2matrix
function to convert the assay from HDF5SummarizedExpriment object saved by vmrseq::poolData
back to regular matrices.
Prior to applying vmrseq, we filter out CpG sites with total number of covered cells < 3. Note that the assays(toy.se)$M_mat
, which stores the binary methyation values per site and per cell, is in NA-dropped sparseMatrix representation (see above section). Therefore, to count the covered cells per CpG, we count the number of non-zero values per row.
vmrseq is a two-stage approach that first constructs candidate regions and then determines whether a VMR is present and its location if applicable. The first stage of vmrseq scans the genome for regions containing consecutive CpGs that show evidence of potential cell-to-cell variation. vmrseq first applies smoothing to mitigate the influence of limited coverage and counteract the reduction in statistical power caused by the inherent noise in single-cell data. The candidate regions are defined as groups of consecutive loci that exceed some threshold on the inter-cellular variance of smoothed methylation levels.
The second stage of vmrseq optimizes a hidden Markov model that models methylation states of individual CpG sites for each candidate region. The estimation of parameters and hidden states in the HMM determines whether groups of cell subpopulations show distinct epigenetic signals in each region and solves for the precise genomic range of VMRs.
For more details, refer to the vmrseq paper (Shen et al. 2023).
The use of parallelization can sufficiently lower computation time of vmrseq. To use more cores, use the register
function of BiocParallel. For example, the following chunk (not evaluated here), would register 8 cores, and then the functions above would split computation over these cores.
Generally, the computation time of vmrseq depends on not only the number of cores, but also the level of heterogeneity presents in input data. Thus we suggest users to first run vmrseq on a small subset of the cells and CpG sites of the input dataset to test out the computational burden before apply the method on the complete dataset.
The standard VMR analysis steps are wrapped into two functions, vmrseqSmooth
and vmrseqFit
. The estimation steps performed by them are described briefly below, as well as in more detail in the vmrseq paper. Here we run the results for a subset of 30,000 CpGs in 200 cells in the interest of computation time.
The vmrseqSmooth
function performs kernel smoothing on the single-cell methylation and output a GRanges
object with the CpG coordinate information extracted from the input toy.se
and metadata columns indicating the methylated cell count (i.e.,
toy.gr <- vmrseqSmooth(toy.se)
## Parallel: Parallelizing using 4 workers/cores (backend: BiocParallel:MulticoreParam).
## Smoothing in progress...
## ...Chromosome pseudo: Variance computed (0.07 min).
head(toy.gr)
## GRanges object with 6 ranges and 3 metadata columns:
## seqnames ranges strand | meth total var
## <Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
## [1] pseudo 3000827 * | 17 17 0.00479461
## [2] pseudo 3001007 * | 19 19 0.00479461
## [3] pseudo 3001018 * | 16 16 0.00479461
## [4] pseudo 3001277 * | 14 15 0.00479461
## [5] pseudo 3001629 * | 16 16 0.00479461
## [6] pseudo 3003226 * | 3 3 0.00814370
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Next, the vmrseqFit
function performs rest steps of the method, i.e., defining the candidate regions based on the variance and fitting the hidden Markov model to detect the VMRs.
toy.results <- vmrseqFit(toy.gr)
## Parallel: Parallelizing using 4 workers/cores (backend: BiocParallel:MulticoreParam).
## Step 1: Detecting candidate regions...
## ...Calling candidate regions with cutoff of 0.117 on variance.
## ...Finished calling candidate regions - found 72 candidate regions in total.
## ...3.23% sites are called to be in candidate regions.
## Step 2: Detecting VMRs...
## ...Finished detecting VMRs - took 0.22 min and 63 VMRs found in total.
## ...2.87% sites are called to be in VMRs.
For users who need a preliminary check or prioritize speed over high precision in region detection and the ranking of regions, we recommend running only the first stage of the methodology. This allows for a faster initial analysis before applying the full methodology. This can be achieved by specifying the argument stage1only = TRUE
in the vmrseqFit
function.
toy.results_s1 <- vmrseqFit(toy.gr, stage1only = TRUE)
## Parallel: Parallelizing using 4 workers/cores (backend: BiocParallel:MulticoreParam).
## Step 1: Detecting candidate regions...
## ...Calling candidate regions with cutoff of 0.117 on variance.
## ...Finished calling candidate regions - found 73 candidate regions in total.
## ...3.25% sites are called to be in candidate regions.
toy.results
## $gr
## GRanges object with 29313 ranges and 5 metadata columns:
## seqnames ranges strand | meth total var cr_index
## <Rle> <IRanges> <Rle> | <integer> <numeric> <numeric> <integer>
## [1] pseudo 3000827 * | 17 17 0.00479461 <NA>
## [2] pseudo 3001007 * | 19 19 0.00479461 <NA>
## [3] pseudo 3001018 * | 16 16 0.00479461 <NA>
## [4] pseudo 3001277 * | 14 15 0.00479461 <NA>
## [5] pseudo 3001629 * | 16 16 0.00479461 <NA>
## ... ... ... ... . ... ... ... ...
## [29309] pseudo 8523533 * | 15 15 0.0149416 <NA>
## [29310] pseudo 8523719 * | 11 11 0.0158558 <NA>
## [29311] pseudo 8523783 * | 6 6 0.0158558 <NA>
## [29312] pseudo 8523864 * | 7 7 0.0158558 <NA>
## [29313] pseudo 8524231 * | 2 4 0.0199032 <NA>
## vmr_index
## <integer>
## [1] <NA>
## [2] <NA>
## [3] <NA>
## [4] <NA>
## [5] <NA>
## ... ...
## [29309] <NA>
## [29310] <NA>
## [29311] <NA>
## [29312] <NA>
## [29313] <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $vmr.ranges
## GRanges object with 63 ranges and 5 metadata columns:
## seqnames ranges strand | num_cpg start_ind end_ind
## <Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
## [1] pseudo 3121587-3122009 * | 12 675 686
## [2] pseudo 3196918-3198634 * | 11 1066 1076
## [3] pseudo 3217393-3221314 * | 21 1145 1165
## [4] pseudo 3234201-3236719 * | 15 1210 1224
## [5] pseudo 3252227-3259375 * | 33 1285 1317
## ... ... ... ... . ... ... ...
## [59] pseudo 7429380-7430593 * | 7 24062 24068
## [60] pseudo 7505596-7507387 * | 7 24365 24371
## [61] pseudo 7671976-7672660 * | 9 25171 25179
## [62] pseudo 8399348-8400680 * | 10 28758 28767
## [63] pseudo 8410430-8411660 * | 32 28803 28834
## pi loglik_diff
## <numeric> <numeric>
## [1] 0.505781 26.67067
## [2] 0.398940 15.66478
## [3] 0.817459 7.27884
## [4] 0.772000 10.02922
## [5] 0.423506 42.29019
## ... ... ...
## [59] 0.451126 8.20103
## [60] 0.362805 8.75095
## [61] 0.698666 9.35440
## [62] 0.224853 7.65353
## [63] 0.426745 41.88692
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $cr.ranges
## GRanges object with 72 ranges and 1 metadata column:
## seqnames ranges strand | num_cpg
## <Rle> <IRanges> <Rle> | <integer>
## [1] pseudo 3121587-3122009 * | 12
## [2] pseudo 3196918-3198634 * | 11
## [3] pseudo 3217393-3221314 * | 21
## [4] pseudo 3234201-3236719 * | 15
## [5] pseudo 3252227-3259375 * | 33
## ... ... ... ... . ...
## [68] pseudo 7671976-7672711 * | 12
## [69] pseudo 7800327-7801784 * | 5
## [70] pseudo 8399348-8400680 * | 10
## [71] pseudo 8410068-8411660 * | 33
## [72] pseudo 8495540-8497541 * | 5
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $alpha
## [1] 0.05
##
## $var_cutoff
## 95%
## 0.117234
##
## $bb_params
## $bb_params$pars_u
## nu mu sigma
## 0.4131979 0.1871845 0.1205215
##
## $bb_params$pars_m
## mu sigma
## 0.91401348 0.09909785
The toy.results object is a list of 6 elements that contains the following information:
gr
: The Granges
object that has been input to vmrseqFit
with two added metadata columns:
cr_index
= Index in reference to rows of cr.ranges
, denoting row number of the candidate region to which the CpG site belongs.vmr_index
= Index in reference to rows of vmr.ranges
, denoting row number of the variably methylated region to which the CpG site belongs.vmr.ranges
: A Granges
object with the coordinates of each detected variably methylated region (each row is a VMR), with metadata columns:
num_cpg
= Number of observed CpG sites in the VMR.start_ind
= Index of the starting CpG sites in reference to rows of gr
.end_ind
= Index of the ending CpG sites in reference to rows of gr
.pi
= Prevalence of the methylated grouping (see manuscript for details)loglik_diff
= Difference in log-likelihood of two-grouping and one-grouping HMM fitted to the VMR; can be used to rank the VMRs.cr.ranges
: A Granges
object with the coordinates of each candidate region (each row is a candidate region), with metadata column:
num_cpg
= Number of observed CpG sites in the candidate region.alpha
: Designated significance level (default 0.05, can be changed by user with function argument). It is used for determining the threshold on variance used for constructing candidate. The threshold is computed by taking the \(1-\alpha\) quantile of an approximate null distribution of variance (see manuscript for details).var_cutoff
: Variance cutoff computed from alpha
.bb_params
: Beta-binomial parameter used in emission probability of the HMM model; they are determined by the magnitude of the input dataset (see manuscript for details).In summary, vmrseq found 72 candidate regions, among which 63 contains VMR.
If the argument stage1only = TRUE
has been turned on, the output only contains four of the above-mentioned elements: gr
, cr.ranges
, alpha
, var_cutoff
.
We also provide a function called regionSummary
to compute regional methylation information for downstream analysis. As an example, we use this function to summarize the regional info of the detected VMRs from toy.se
:
regions.se <- regionSummary(SE = toy.se, region_ranges = toy.results$vmr.ranges)
## Parallel: Parallelizing using 4 workers/cores (backend: BiocParallel:MulticoreParam).
regions.se
## class: RangedSummarizedExperiment
## dim: 63 200
## metadata(0):
## assays(3): M Cov MF
## rownames: NULL
## rowData names(5): num_cpg start_ind end_ind pi loglik_diff
## colnames: NULL
## colData names(0):
The function output a SummarizedExperiment object of dimension (# regions x # cells). In total the object contains thress assays. Specifically, M
and Cov
represent the number of methylated CpGs and the number of covered CpGs in each region per cell; and MF
(stands for methylation fraction) represents the regional average methylation level computed by M/Cov
. Note that MF
contains NAs due to zero counts in the Cov
matrix.
We have created the GitHub repo vmrseq-workflow-vignette as a demo of complete workflow of scBS-seq analysis using vmrseq. This repo includes scripts that run vmseq and apply downstream analysis such as gene/CpG annotation and clustering analysis. Feel free to check it out!
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] vmrseq_1.1.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4
## [3] farver_2.1.2 blob_1.2.4
## [5] Biostrings_2.77.0 bitops_1.0-9
## [7] fastmap_1.2.0 RCurl_1.98-1.17
## [9] recosystem_0.5.1 GenomicAlignments_1.45.0
## [11] XML_3.99-0.18 digest_0.6.37
## [13] lifecycle_1.0.4 gamlss.dist_6.1-1
## [15] KEGGREST_1.49.0 RSQLite_2.3.9
## [17] magrittr_2.0.3 recommenderlab_1.0.6
## [19] compiler_4.5.0 rlang_1.1.6
## [21] sass_0.4.10 rngtools_1.5.2
## [23] tools_4.5.0 yaml_2.3.10
## [25] data.table_1.17.0 rtracklayer_1.69.0
## [27] knitr_1.50 S4Arrays_1.9.0
## [29] doRNG_1.8.6.2 bit_4.6.0
## [31] curl_6.2.2 DelayedArray_0.35.1
## [33] RColorBrewer_1.1-3 abind_1.4-8
## [35] BiocParallel_1.43.0 registry_0.5-1
## [37] HDF5Array_1.37.0 purrr_1.0.4
## [39] BiocGenerics_0.55.0 grid_4.5.0
## [41] stats4_4.5.0 Rhdf5lib_1.31.0
## [43] ggplot2_3.5.2 scales_1.4.0
## [45] iterators_1.0.14 MASS_7.3-65
## [47] dichromat_2.0-0.1 SummarizedExperiment_1.39.0
## [49] cli_3.6.5 rmarkdown_2.29
## [51] crayon_1.5.3 generics_0.1.3
## [53] httr_1.4.7 rjson_0.2.23
## [55] proxy_0.4-27 DBI_1.2.3
## [57] cachem_1.1.0 rhdf5_2.53.0
## [59] parallel_4.5.0 AnnotationDbi_1.71.0
## [61] XVector_0.49.0 restfulr_0.0.15
## [63] matrixStats_1.5.0 vctrs_0.6.5
## [65] Matrix_1.7-3 jsonlite_2.0.0
## [67] IRanges_2.43.0 S4Vectors_0.47.0
## [69] bit64_4.6.0-1 irlba_2.3.5.1
## [71] GenomicFeatures_1.61.0 h5mread_1.1.0
## [73] locfit_1.5-9.12 foreach_1.5.2
## [75] tidyr_1.3.1 jquerylib_0.1.4
## [77] glue_1.8.0 codetools_0.2-20
## [79] gtable_0.3.6 GenomeInfoDb_1.45.3
## [81] GenomicRanges_1.61.0 BiocIO_1.19.0
## [83] UCSC.utils_1.5.0 tibble_3.2.1
## [85] pillar_1.10.2 htmltools_0.5.8.1
## [87] rhdf5filters_1.21.0 float_0.3-3
## [89] R6_2.6.1 arules_1.7-10
## [91] evaluate_1.0.3 lattice_0.22-7
## [93] Biobase_2.69.0 png_0.1-8
## [95] Rsamtools_2.25.0 memoise_2.0.1
## [97] bslib_0.9.0 Rcpp_1.0.14
## [99] SparseArray_1.9.0 bumphunter_1.51.0
## [101] xfun_0.52 MatrixGenerics_1.21.0
## [103] pkgconfig_2.0.3