Citation

If you use vmrseq in published research, please cite:

Shen, Ning, and Keegan Korthauer. “vmrseq: Probabilistic Modeling of Single-Cell Methylation Heterogeneity.” bioRxiv, November 21, 2023. https://doi.org/10.1101/2023.11.20.567911.

Installation

This package will be submitted to Bioconductor. For now, you can install the development version from Bioconductor (recommended) or our GitHub repository through running this:

### Install stable version from Bioconductor
 if (!require("BiocManager", quietly = TRUE))
     install.packages("BiocManager")
BiocManager::install("vmrseq")

## Or development version from Github
# install.packages("remotes")
remotes::install_github("nshen7/vmrseq")

After installation, load the vmrseq package:

library(vmrseq)

How to get help for vmrseq: Feel free to raise questions by opening issue(s) in the vmrseq GitHub repository

Background

High-throughput single-cell measurements of DNA methylation allows studying inter-cellular epigenetic heterogeneity, but this task faces the challenges of sparsity and noise. We present vmrseq, a statistical method that overcomes these challenges to accurately and robustly identify de novo variably methylated regions (VMR) in single-cell DNA methylation data. To the best of our knowledge, vmrseq has been the only method that flexibly pinpoints biologically relevant regions at single-base pair resolution without prior knowledge of size or location. In addition, vmrseq is the only approach capable of detecting VMRs that are sensitive to the level of heterogeneity present in input datasets so far.

Input data

vmrseq takes processed and filtered single-cell bisulfite sequencing binary methylation values as input. After processing and filtering, each sequenced CpG takes a value of methylated or unmethylated in each cell.

On the note of pre-processing

In the experiments of the vmrseq manuscript, the authors did not carry out the pre-processing but instead, adopted the processed .bam files provided by the original publications of the sequencing protocol and datasets. However, by summarizing the processing steps described in those publications as sources, we offer a flexible guideline intended to serve as a general reference rather than a strict requirement:

Remove cells with a low non-conversion rate (e.g., $\leq1\%$ in mouse and $\leq2\%$ in human).
To eliminate contaminated samples, remove cells with a low number of non-clonal mapped reads (e.g., 400K in mouse; 500K in human).
To protect against wells with multiple cells, remove cells exceeding an upper limit on coverage ($\leq15\%$ of cytosines).
Remove cells with low CpG site coverage (e.g., $\leq50,000$ CpG sites)

Process individual-cell read files

We have implemented a poolData function to process individual-cell read files into a format that is suitable to input to the vmrseq framework. Following is an example of how to use this function.

The poolData function pools individual-cell CpG read files into a SummarizedExperiment object (with NA-dropped sparseMatrix representation by default). Store the file paths of individual cells to a list, for example, called cell_list, and input it to poolData function. poolData does not give any output but directly writes the SummarizedExperiment object into the writeDir specified by the user, here is an example: (this step might take long in practice since input datasets usually contain many cells; suggest to parallelize the computation with chromosomes)

poolData(cellFiles = cell_list, sep = ",", chrNames = "chr1", writeDir = "your/write/path")

Note that in each cell, sites with hemimethylation or intermediate methylation levels (i.e., 0 < meth_read/total_read < 1) will be removed.

Each cell file should be in BED-like format, where the first 5 columns in each file must be: chr, pos, strand, meth_read, total_read, in strict order, for example:

data("cell_1")
head(cell_1)
##       chr     pos strand mc_count total
##    <char>   <int> <char>    <int> <int>
## 1:   chr1 3003885      *        1     1
## 2:   chr1 3003898      *        1     1
## 3:   chr1 3011550      *        1     1
## 4:   chr1 3019920      *        1     1
## 5:   chr1 3020794      *        2     2
## 6:   chr1 3020814      *        2     2

The SummarizedExperiment object stores CpG sites as rows and cells as columns. We adopted the NA-dropped sparseMatrix representation to save storage space for large single-cell datasets (see following section for specifics of how this representation look like). This option can be turned off but we recommend to keep it on.

Further, poolData saves the SummarizedExperiment object into disk using HDF5 format, which is a backend extension of DelayedArray. The HDF5/DelayedArray format allows one to perform common array operations on it without loading the object in memory. See packages HDF5Array and DelayedArray for details.

Load example data

In this vignette, we use a formatted example dataset built in the package that’s ready to be input to model fitting function (which is not accessible to users due to format limitation of internal data of R packages). In usual cases users of this package would use HDF5::loadHDF5SummarizedExperiment to load saved HDF5SummarizedExperiment objects into R.

toy.se <- HDF5Array::loadHDF5SummarizedExperiment(system.file("extdata", "toy", package = "vmrseq"))

It is a SummarizedExperiment object with one assays slot called M_mat that contains the binary methylation values of individual cells at CpG sites (sites as rows and cells as columns):

toy.se
## class: RangedSummarizedExperiment 
## dim: 30000 200 
## metadata(0):
## assays(1): M_mat
## rownames: NULL
## rowData names(0):
## colnames(200): V1 V1 ... V1 V1
## colData names(0):

dim(toy.se)
## [1] 30000   200

GenomicRanges::granges(toy.se)
## GRanges object with 30000 ranges and 0 metadata columns:
##           seqnames    ranges strand
##              <Rle> <IRanges>  <Rle>
##       [1]   pseudo   3000827      *
##       [2]   pseudo   3001007      *
##       [3]   pseudo   3001018      *
##       [4]   pseudo   3001277      *
##       [5]   pseudo   3001629      *
##       ...      ...       ...    ...
##   [29996]   pseudo   8523533      *
##   [29997]   pseudo   8523719      *
##   [29998]   pseudo   8523783      *
##   [29999]   pseudo   8523864      *
##   [30000]   pseudo   8524231      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

SummarizedExperiment::assays(toy.se)
## List of length 1
## names(1): M_mat

Moreover, the assay matrix is in NA-dropped sparseMatrix representation (implemented using the recommenderlab::dropNA function):

SummarizedExperiment::assays(toy.se)$M_mat[5:10, 5:10]
## <6 x 6> sparse DelayedMatrix object of type "double":
##                 V1            V1            V1            V1            V1
## [1,]  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
## [2,]  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
## [3,]  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
## [4,]  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
## [5,]  0.000000e+00  0.000000e+00  1.000000e+00  0.000000e+00  0.000000e+00
## [6,]  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
##                 V1
## [1,]  0.000000e+00
## [2,]  0.000000e+00
## [3,]  0.000000e+00
## [4,]  0.000000e+00
## [5,]  0.000000e+00
## [6,] 2.225074e-308

In particular, the NAs in the dataset are represented by 0 (or 0.000000e+00 as shown above), and the 0’s are represented in a very small positive real number (2.225074e-308 as shown above). This allows the matrix to be stored in a sparseMatrix format so that it takes less disk storage. One can use the vmrseq::HDF5NAdrop2matrix function to convert the assay from HDF5SummarizedExpriment object saved by vmrseq::poolData back to regular matrices.

Filter out low-cell-coverage sites

Prior to applying vmrseq, we filter out CpG sites with total number of covered cells < 3. Note that the assays(toy.se)$M_mat, which stores the binary methyation values per site and per cell, is in NA-dropped sparseMatrix representation (see above section). Therefore, to count the covered cells per CpG, we count the number of non-zero values per row.

total <- DelayedArray::rowSums(SummarizedExperiment::assays(toy.se)$M_mat > 0)
toy.se <- subset(toy.se, total >= 3)

dim(toy.se)
## [1] 29313   200

Detect variably methylated regions (VMR)

Brief intro on the vmrseq method

vmrseq is a two-stage approach that first constructs candidate regions and then determines whether a VMR is present and its location if applicable. The first stage of vmrseq scans the genome for regions containing consecutive CpGs that show evidence of potential cell-to-cell variation. vmrseq first applies smoothing to mitigate the influence of limited coverage and counteract the reduction in statistical power caused by the inherent noise in single-cell data. The candidate regions are defined as groups of consecutive loci that exceed some threshold on the inter-cellular variance of smoothed methylation levels.

The second stage of vmrseq optimizes a hidden Markov model that models methylation states of individual CpG sites for each candidate region. The estimation of parameters and hidden states in the HMM determines whether groups of cell subpopulations show distinct epigenetic signals in each region and solves for the precise genomic range of VMRs.

For more details, refer to the vmrseq paper (Shen et al. 2023).

On the note of computational time

The use of parallelization can sufficiently lower computation time of vmrseq. To use more cores, use the register function of BiocParallel. For example, the following chunk (not evaluated here), would register 8 cores, and then the functions above would split computation over these cores.

library("BiocParallel")
register(MulticoreParam(8))

Generally, the computation time of vmrseq depends on not only the number of cores, but also the level of heterogeneity presents in input data. Thus we suggest users to first run vmrseq on a small subset of the cells and CpG sites of the input dataset to test out the computational burden before apply the method on the complete dataset.

Run vmrseq method

The standard VMR analysis steps are wrapped into two functions, vmrseqSmooth and vmrseqFit. The estimation steps performed by them are described briefly below, as well as in more detail in the vmrseq paper. Here we run the results for a subset of 30,000 CpGs in 200 cells in the interest of computation time.

The vmrseqSmooth function performs kernel smoothing on the single-cell methylation and output a GRanges object with the CpG coordinate information extracted from the input toy.se and metadata columns indicating the methylated cell count (i.e., ), the total cell coverage (i.e., ) and the inter-cellular variance on methylation level.

toy.gr <- vmrseqSmooth(toy.se)
## Parallel: Parallelizing using 4 workers/cores (backend: BiocParallel:MulticoreParam).
## Smoothing in progress...
## ...Chromosome pseudo: Variance computed (0.07 min).

head(toy.gr)
## GRanges object with 6 ranges and 3 metadata columns:
##       seqnames    ranges strand |      meth     total        var
##          <Rle> <IRanges>  <Rle> | <integer> <numeric>  <numeric>
##   [1]   pseudo   3000827      * |        17        17 0.00479461
##   [2]   pseudo   3001007      * |        19        19 0.00479461
##   [3]   pseudo   3001018      * |        16        16 0.00479461
##   [4]   pseudo   3001277      * |        14        15 0.00479461
##   [5]   pseudo   3001629      * |        16        16 0.00479461
##   [6]   pseudo   3003226      * |         3         3 0.00814370
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Next, the vmrseqFit function performs rest steps of the method, i.e., defining the candidate regions based on the variance and fitting the hidden Markov model to detect the VMRs.

toy.results <- vmrseqFit(toy.gr)
## Parallel: Parallelizing using 4 workers/cores (backend: BiocParallel:MulticoreParam).
## Step 1: Detecting candidate regions...
## ...Calling candidate regions with cutoff of 0.117 on variance.
## ...Finished calling candidate regions - found 72 candidate regions in total.
##   ...3.23% sites are called to be in candidate regions.
## Step 2: Detecting VMRs...
## ...Finished detecting VMRs - took 0.22 min and 63 VMRs found in total.
##   ...2.87% sites are called to be in VMRs.

For users who need a preliminary check or prioritize speed over high precision in region detection and the ranking of regions, we recommend running only the first stage of the methodology. This allows for a faster initial analysis before applying the full methodology. This can be achieved by specifying the argument stage1only = TRUE in the vmrseqFit function.

toy.results_s1 <- vmrseqFit(toy.gr, stage1only = TRUE)
## Parallel: Parallelizing using 4 workers/cores (backend: BiocParallel:MulticoreParam).
## Step 1: Detecting candidate regions...
## ...Calling candidate regions with cutoff of 0.117 on variance.
## ...Finished calling candidate regions - found 73 candidate regions in total.
##   ...3.25% sites are called to be in candidate regions.

Method output

toy.results
## $gr
## GRanges object with 29313 ranges and 5 metadata columns:
##           seqnames    ranges strand |      meth     total        var  cr_index
##              <Rle> <IRanges>  <Rle> | <integer> <numeric>  <numeric> <integer>
##       [1]   pseudo   3000827      * |        17        17 0.00479461      <NA>
##       [2]   pseudo   3001007      * |        19        19 0.00479461      <NA>
##       [3]   pseudo   3001018      * |        16        16 0.00479461      <NA>
##       [4]   pseudo   3001277      * |        14        15 0.00479461      <NA>
##       [5]   pseudo   3001629      * |        16        16 0.00479461      <NA>
##       ...      ...       ...    ... .       ...       ...        ...       ...
##   [29309]   pseudo   8523533      * |        15        15  0.0149416      <NA>
##   [29310]   pseudo   8523719      * |        11        11  0.0158558      <NA>
##   [29311]   pseudo   8523783      * |         6         6  0.0158558      <NA>
##   [29312]   pseudo   8523864      * |         7         7  0.0158558      <NA>
##   [29313]   pseudo   8524231      * |         2         4  0.0199032      <NA>
##           vmr_index
##           <integer>
##       [1]      <NA>
##       [2]      <NA>
##       [3]      <NA>
##       [4]      <NA>
##       [5]      <NA>
##       ...       ...
##   [29309]      <NA>
##   [29310]      <NA>
##   [29311]      <NA>
##   [29312]      <NA>
##   [29313]      <NA>
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $vmr.ranges
## GRanges object with 63 ranges and 5 metadata columns:
##        seqnames          ranges strand |   num_cpg start_ind   end_ind
##           <Rle>       <IRanges>  <Rle> | <integer> <numeric> <numeric>
##    [1]   pseudo 3121587-3122009      * |        12       675       686
##    [2]   pseudo 3196918-3198634      * |        11      1066      1076
##    [3]   pseudo 3217393-3221314      * |        21      1145      1165
##    [4]   pseudo 3234201-3236719      * |        15      1210      1224
##    [5]   pseudo 3252227-3259375      * |        33      1285      1317
##    ...      ...             ...    ... .       ...       ...       ...
##   [59]   pseudo 7429380-7430593      * |         7     24062     24068
##   [60]   pseudo 7505596-7507387      * |         7     24365     24371
##   [61]   pseudo 7671976-7672660      * |         9     25171     25179
##   [62]   pseudo 8399348-8400680      * |        10     28758     28767
##   [63]   pseudo 8410430-8411660      * |        32     28803     28834
##               pi loglik_diff
##        <numeric>   <numeric>
##    [1]  0.505781    26.67067
##    [2]  0.398940    15.66478
##    [3]  0.817459     7.27884
##    [4]  0.772000    10.02922
##    [5]  0.423506    42.29019
##    ...       ...         ...
##   [59]  0.451126     8.20103
##   [60]  0.362805     8.75095
##   [61]  0.698666     9.35440
##   [62]  0.224853     7.65353
##   [63]  0.426745    41.88692
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $cr.ranges
## GRanges object with 72 ranges and 1 metadata column:
##        seqnames          ranges strand |   num_cpg
##           <Rle>       <IRanges>  <Rle> | <integer>
##    [1]   pseudo 3121587-3122009      * |        12
##    [2]   pseudo 3196918-3198634      * |        11
##    [3]   pseudo 3217393-3221314      * |        21
##    [4]   pseudo 3234201-3236719      * |        15
##    [5]   pseudo 3252227-3259375      * |        33
##    ...      ...             ...    ... .       ...
##   [68]   pseudo 7671976-7672711      * |        12
##   [69]   pseudo 7800327-7801784      * |         5
##   [70]   pseudo 8399348-8400680      * |        10
##   [71]   pseudo 8410068-8411660      * |        33
##   [72]   pseudo 8495540-8497541      * |         5
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $alpha
## [1] 0.05
## 
## $var_cutoff
##      95% 
## 0.117234 
## 
## $bb_params
## $bb_params$pars_u
##        nu        mu     sigma 
## 0.4131979 0.1871845 0.1205215 
## 
## $bb_params$pars_m
##         mu      sigma 
## 0.91401348 0.09909785

The toy.results object is a list of 6 elements that contains the following information:

gr: The Granges object that has been input to vmrseqFit with two added metadata columns:
- cr_index = Index in reference to rows of cr.ranges, denoting row number of the candidate region to which the CpG site belongs.
- vmr_index = Index in reference to rows of vmr.ranges, denoting row number of the variably methylated region to which the CpG site belongs.
vmr.ranges: A Granges object with the coordinates of each detected variably methylated region (each row is a VMR), with metadata columns:
- num_cpg = Number of observed CpG sites in the VMR.
- start_ind = Index of the starting CpG sites in reference to rows of gr.
- end_ind = Index of the ending CpG sites in reference to rows of gr.
- pi = Prevalence of the methylated grouping (see manuscript for details)
- loglik_diff = Difference in log-likelihood of two-grouping and one-grouping HMM fitted to the VMR; can be used to rank the VMRs.
cr.ranges: A Granges object with the coordinates of each candidate region (each row is a candidate region), with metadata column:
- num_cpg = Number of observed CpG sites in the candidate region.
alpha: Designated significance level (default 0.05, can be changed by user with function argument). It is used for determining the threshold on variance used for constructing candidate. The threshold is computed by taking the $1-\alpha$ quantile of an approximate null distribution of variance (see manuscript for details).
var_cutoff: Variance cutoff computed from alpha.
bb_params: Beta-binomial parameter used in emission probability of the HMM model; they are determined by the magnitude of the input dataset (see manuscript for details).

In summary, vmrseq found 72 candidate regions, among which 63 contains VMR.

If the argument stage1only = TRUE has been turned on, the output only contains four of the above-mentioned elements: gr, cr.ranges, alpha, var_cutoff.

names(toy.results_s1)
## [1] "gr"         "cr.ranges"  "alpha"      "var_cutoff"

Compute region-level methylation

We also provide a function called regionSummary to compute regional methylation information for downstream analysis. As an example, we use this function to summarize the regional info of the detected VMRs from toy.se:

regions.se <- regionSummary(SE = toy.se, region_ranges = toy.results$vmr.ranges)
## Parallel: Parallelizing using 4 workers/cores (backend: BiocParallel:MulticoreParam).
regions.se
## class: RangedSummarizedExperiment 
## dim: 63 200 
## metadata(0):
## assays(3): M Cov MF
## rownames: NULL
## rowData names(5): num_cpg start_ind end_ind pi loglik_diff
## colnames: NULL
## colData names(0):

The function output a SummarizedExperiment object of dimension (# regions x # cells). In total the object contains thress assays. Specifically, M and Cov represent the number of methylated CpGs and the number of covered CpGs in each region per cell; and MF (stands for methylation fraction) represents the regional average methylation level computed by M/Cov. Note that MF contains NAs due to zero counts in the Cov matrix.

SummarizedExperiment::assays(regions.se)$MF[1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  NaN 0.75  NaN  NaN    1
## [2,]    0  NaN  NaN  NaN    0
## [3,]    1  NaN    0    1    0
## [4,]    1 0.20    1  NaN  NaN
## [5,]    0 0.00    0    1    0

Downstream analysis

We have created the GitHub repo vmrseq-workflow-vignette as a demo of complete workflow of scBS-seq analysis using vmrseq. This repo includes scripts that run vmseq and apply downstream analysis such as gene/CpG annotation and clustering analysis. Feel free to check it out!

Session info

sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] vmrseq_1.1.1
## 
## loaded via a namespace (and not attached):
##   [1] tidyselect_1.2.1            dplyr_1.1.4                
##   [3] farver_2.1.2                blob_1.2.4                 
##   [5] Biostrings_2.77.0           bitops_1.0-9               
##   [7] fastmap_1.2.0               RCurl_1.98-1.17            
##   [9] recosystem_0.5.1            GenomicAlignments_1.45.0   
##  [11] XML_3.99-0.18               digest_0.6.37              
##  [13] lifecycle_1.0.4             gamlss.dist_6.1-1          
##  [15] KEGGREST_1.49.0             RSQLite_2.3.9              
##  [17] magrittr_2.0.3              recommenderlab_1.0.6       
##  [19] compiler_4.5.0              rlang_1.1.6                
##  [21] sass_0.4.10                 rngtools_1.5.2             
##  [23] tools_4.5.0                 yaml_2.3.10                
##  [25] data.table_1.17.0           rtracklayer_1.69.0         
##  [27] knitr_1.50                  S4Arrays_1.9.0             
##  [29] doRNG_1.8.6.2               bit_4.6.0                  
##  [31] curl_6.2.2                  DelayedArray_0.35.1        
##  [33] RColorBrewer_1.1-3          abind_1.4-8                
##  [35] BiocParallel_1.43.0         registry_0.5-1             
##  [37] HDF5Array_1.37.0            purrr_1.0.4                
##  [39] BiocGenerics_0.55.0         grid_4.5.0                 
##  [41] stats4_4.5.0                Rhdf5lib_1.31.0            
##  [43] ggplot2_3.5.2               scales_1.4.0               
##  [45] iterators_1.0.14            MASS_7.3-65                
##  [47] dichromat_2.0-0.1           SummarizedExperiment_1.39.0
##  [49] cli_3.6.5                   rmarkdown_2.29             
##  [51] crayon_1.5.3                generics_0.1.3             
##  [53] httr_1.4.7                  rjson_0.2.23               
##  [55] proxy_0.4-27                DBI_1.2.3                  
##  [57] cachem_1.1.0                rhdf5_2.53.0               
##  [59] parallel_4.5.0              AnnotationDbi_1.71.0       
##  [61] XVector_0.49.0              restfulr_0.0.15            
##  [63] matrixStats_1.5.0           vctrs_0.6.5                
##  [65] Matrix_1.7-3                jsonlite_2.0.0             
##  [67] IRanges_2.43.0              S4Vectors_0.47.0           
##  [69] bit64_4.6.0-1               irlba_2.3.5.1              
##  [71] GenomicFeatures_1.61.0      h5mread_1.1.0              
##  [73] locfit_1.5-9.12             foreach_1.5.2              
##  [75] tidyr_1.3.1                 jquerylib_0.1.4            
##  [77] glue_1.8.0                  codetools_0.2-20           
##  [79] gtable_0.3.6                GenomeInfoDb_1.45.3        
##  [81] GenomicRanges_1.61.0        BiocIO_1.19.0              
##  [83] UCSC.utils_1.5.0            tibble_3.2.1               
##  [85] pillar_1.10.2               htmltools_0.5.8.1          
##  [87] rhdf5filters_1.21.0         float_0.3-3                
##  [89] R6_2.6.1                    arules_1.7-10              
##  [91] evaluate_1.0.3              lattice_0.22-7             
##  [93] Biobase_2.69.0              png_0.1-8                  
##  [95] Rsamtools_2.25.0            memoise_2.0.1              
##  [97] bslib_0.9.0                 Rcpp_1.0.14                
##  [99] SparseArray_1.9.0           bumphunter_1.51.0          
## [101] xfun_0.52                   MatrixGenerics_1.21.0      
## [103] pkgconfig_2.0.3

Analyzing single-cell bisulfite sequencing data with vmrseq

Ning Shen