Flow and mass cytometry are important tools in modern immunology for measuring the expression levels of multiple proteins on single cells. The goal is to better understand response mechanisms at the single-cell level by studying the differential expression of proteins. Most current data analysis tools compare expression across many computationally discovered cell types. Our goal is to focus on just one cell type. Differential analysis of marker expression can be difficult because of marker correlations and inter-subject heterogeneity, particularly in studies of human immunology. We address these challenges with a bootstrapped generalized linear model (GLM). Here, we illustrate the CytoGLMM R package and workflow on simulated mass cytometry data.
We construct our simulated datasets by sampling from a Poisson GLM. We confirmed with posterior predictive checks that Poisson GLMs with mixed effects provide a good fit to mass cytometry data (Seiler et al. 2019). We consider one underlying data-generating mechanism, described by a hierarchical model for the \(i\)th cell and \(j\)th donor:
\[ \begin{aligned} \boldsymbol{X}_{ij} & \sim \text{Poisson}(\boldsymbol{\lambda}_{ij}) \\ \log(\boldsymbol{\lambda}_{ij}) & = \boldsymbol{B}_{ij} + \boldsymbol{U}_j \\ \boldsymbol{B}_{ij} & \sim \begin{cases} \text{Normal}(\boldsymbol{\delta}^{(0)}, \boldsymbol{\Sigma}_B) & \text{if } Y_{ij} = 0, \text{ cell unstimulated} \\ \text{Normal}(\boldsymbol{\delta}^{(1)}, \boldsymbol{\Sigma}_B) & \text{if } Y_{ij} = 1, \text{ cell stimulated} \end{cases} \\ \boldsymbol{U}_j & \sim \text{Normal}(\boldsymbol{0}, \boldsymbol{\Sigma}_U). \end{aligned} \]
The following graphic shows a representation of the hierarchical model.
The stimulus activates proteins and induces a difference in marker expression. We define the effect size to be the difference between the expected expression levels of stimulated and unstimulated cells on the \(\log\)-scale. All markers that belong to the activation set \(C\) have a non-zero effect size, whereas all markers outside it have a zero effect size:
\[ \begin{cases} \delta^{(1)}_p - \delta^{(0)}_p > 0 & \text{if protein } p \text{ is in activation set } p \in C \\ \delta^{(1)}_{p'} - \delta^{(0)}_{p'} = 0 & \text{if protein } p' \text{ is not in activation set } p' \notin C. \end{cases} \]
Both covariance matrices have an autoregressive structure,
\[ \begin{aligned} \Omega_{rs} & = \rho^{|r-s|} \\ \boldsymbol{\Sigma} & = \operatorname{diag}(\boldsymbol{\sigma}) \, \boldsymbol{\Omega} \, \operatorname{diag}(\boldsymbol{\sigma}), \end{aligned} \]
where \(\Omega_{rs}\) is the entry in the \(r\)th row and \(s\)th column of the correlation matrix \(\boldsymbol{\Omega}\). We control two separate correlation parameters: a cell-level coefficient \(\rho_B\) and a donor-level coefficient \(\rho_U\). A non-zero \(\rho_B\) or \(\rho_U\) induces a correlation between condition and marker expression even for markers with a zero effect size.
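The generative model above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code behind generate_data(): the helper names ar_cov, rmvnorm_chol, and simulate_donor are hypothetical, and all parameter values (number of markers, standard deviations, correlations, effect size, activation set) are chosen only for the example.

```r
# Illustrative sketch only: helper names and parameter values are assumptions,
# not the internals of CytoGLMM::generate_data().

# Autoregressive covariance: Sigma = diag(sigma) %*% Omega %*% diag(sigma),
# with Omega[r, s] = rho^|r - s|.
ar_cov <- function(n_markers, sigma, rho) {
  idx <- seq_len(n_markers)
  Omega <- rho^abs(outer(idx, idx, "-"))
  diag(sigma, n_markers) %*% Omega %*% diag(sigma, n_markers)
}

# Draw n rows from Normal(mu, Sigma) using the Cholesky factor of Sigma.
rmvnorm_chol <- function(n, mu, Sigma) {
  Z <- matrix(rnorm(n * length(mu)), nrow = n)
  sweep(Z %*% chol(Sigma), 2, mu, "+")
}

set.seed(1)
n_markers <- 10
n_cells <- 500                           # cells per donor and condition
delta0 <- rep(log(20), n_markers)        # unstimulated means on the log scale
delta1 <- delta0
delta1[1:3] <- delta1[1:3] + log(1.5)    # assumed activation set C = {1, 2, 3}

Sigma_B <- ar_cov(n_markers, sigma = 1, rho = 0.1)     # cell level
Sigma_U <- ar_cov(n_markers, sigma = 1, rho = 0.05)    # donor level

simulate_donor <- function(donor) {
  # One donor-level random effect shared by all cells of this donor.
  U <- rmvnorm_chol(1, rep(0, n_markers), Sigma_U)
  one_condition <- function(delta, label) {
    B <- rmvnorm_chol(n_cells, delta, Sigma_B)          # cell-level effects
    log_lambda <- sweep(B, 2, as.vector(U), "+")
    X <- matrix(rpois(length(log_lambda), exp(log_lambda)), nrow = n_cells)
    colnames(X) <- sprintf("m%02d", seq_len(n_markers))
    data.frame(donor = donor, condition = label, X)
  }
  rbind(one_condition(delta0, "control"),
        one_condition(delta1, "treatment"))
}

sim <- do.call(rbind, lapply(1:4, simulate_donor))
dim(sim)
```

Each row of sim is one cell. The donor effect is drawn once per donor and added to every cell of that donor in both conditions, which is what makes the design paired.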
library("CytoGLMM")
set.seed(23)
df <- generate_data()
df[1:5,1:5]
## # A tibble: 5 × 5
## donor condition m01 m02 m03
## <int> <fct> <int> <int> <int>
## 1 1 treatment 9 6 326
## 2 1 treatment 23 125 104
## 3 1 treatment 58 72 49
## 4 1 treatment 68 11 91
## 5 1 treatment 100 47 16
We define the marker names that we will focus on in our analysis by extracting them from the simulated data frame.
protein_names <- names(df)[3:12]
We recommend that marker expressions be corrected for batch effects (Nowicka et al. 2017; Chevrier et al. 2018; Schuyler et al. 2019; Van Gassen et al. 2020; Trussart et al. 2020) and transformed with a variance-stabilizing transformation to account for heteroskedasticity, for instance an inverse hyperbolic sine transformation with the cofactor set to 150 for flow cytometry and 5 for mass cytometry (Bendall et al. 2011). This transformation assumes a two-component model for the measurement error (Rocke and Lorenzato 1995; Huber et al. 2003) in which small counts are less noisy than large counts. Intuitively, this corresponds to a noise model with additive and multiplicative noise, depending on the magnitude of the marker expression; see Holmes and Huber (2019) for details.
df <- dplyr::mutate_at(df, protein_names, function(x) asinh(x/5))
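To see why the inverse hyperbolic sine behaves like a compromise between a linear and a logarithmic scale, the short sketch below compares asinh(x/5) with its two limiting approximations. The counts are made-up values for illustration, not taken from the simulated data.

```r
# Illustrative counts (assumed values, not taken from the simulated data).
counts <- c(0, 1, 2, 50, 500, 5000)
cofactor <- 5    # value recommended in the text for mass cytometry

z <- counts / cofactor
comparison <- data.frame(
  count = counts,
  asinh = asinh(z),          # defined at zero, unlike the logarithm
  linear_approx = z,         # close to asinh when count << cofactor
  log_approx = log(2 * z)    # close to asinh when count >> cofactor
)
comparison
```

For counts well below the cofactor the transformation is nearly linear, and for counts well above it the transformation is nearly logarithmic, which matches the additive-plus-multiplicative noise intuition described above.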
The goal of the CytoGLMM::cytoglm function is to find protein expression patterns that are associated with the condition of interest, such as a response to a stimulus. We set up the GLM to predict the experimental condition (condition) from protein marker expressions (protein_names); thus, the experimental conditions are the response variables and the marker expressions are the explanatory variables.
glm_fit <- CytoGLMM::cytoglm(df,
                             protein_names = protein_names,
                             condition = "condition",
                             group = "donor",
                             num_cores = 1,
                             num_boot = 1000)
glm_fit
##
## #######################
## ## paired analysis ####
## #######################
##
## number of bootstrap samples: 1000
##
## number of cells per group and condition:
## control treatment
## 1 1000 1000
## 2 1000 1000
## 3 1000 1000
## 4 1000 1000
## 5 1000 1000
## 6 1000 1000
## 7 1000 1000
## 8 1000 1000
##
## proteins included in the analysis:
## m01 m02 m03 m04 m05 m06 m07 m08 m09 m10
##
## condition compared: condition
## grouping variable: donor
We plot the maximum likelihood estimates with 95% confidence intervals for the fixed effects \(\boldsymbol{\beta}\). The estimates are on the \(\log\)-odds scale. We see that markers m01, m02, and m03 are potential predictors of the treatment. This means that a one-unit increase in the transformed expression of one of these markers makes it more likely that a cell comes from the treatment group, holding the other markers constant.
plot(glm_fit)
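As a rough illustration of how log-odds coefficients and bootstrap confidence intervals of this kind can arise, the sketch below fits a logistic regression with base R's glm and resamples whole donors with replacement to form percentile intervals. It is a simplified stand-in under stated assumptions, not the algorithm implemented in CytoGLMM::cytoglm: the toy data, the helper fit_once, the marker effect of 0.5, and the choice of 200 replicates are all hypothetical.

```r
# Simplified stand-in for a donor-level bootstrap of a logistic regression.
# NOT the CytoGLMM implementation; data and helper names are hypothetical.
set.seed(1)
n_donors <- 8
n_cells <- 200   # cells per donor and condition

toy <- do.call(rbind, lapply(seq_len(n_donors), function(d) {
  shift <- rnorm(1, sd = 0.3)              # donor-specific offset
  y <- rep(c(0, 1), each = n_cells)        # 0 = control, 1 = treatment
  data.frame(
    donor = d,
    y = y,
    m01 = rnorm(2 * n_cells, mean = shift + 0.5 * y),   # shifted by treatment
    m02 = rnorm(2 * n_cells, mean = shift)              # no treatment effect
  )
}))

# Regress condition on markers and return the marker coefficients (log-odds).
fit_once <- function(data) {
  coef(glm(y ~ m01 + m02, family = binomial(), data = data))[c("m01", "m02")]
}

beta_hat <- fit_once(toy)

# Resample whole donors with replacement so cells from one donor stay together.
donors <- unique(toy$donor)
boot <- replicate(200, {
  picked <- sample(donors, replace = TRUE)
  rows <- unlist(lapply(picked, function(d) which(toy$donor == d)))
  fit_once(toy[rows, ])
})

# Percentile intervals on the log-odds scale, plus odds ratios.
ci <- apply(boot, 1, quantile, probs = c(0.025, 0.975))
rbind(estimate = beta_hat, ci, odds_ratio = exp(beta_hat))
```

Exponentiating a coefficient gives an odds ratio: the multiplicative change in the odds that a cell belongs to the treatment group for a one-unit increase in that transformed marker, with the other markers held fixed.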
The summary function returns a table about the model fit with unadjusted and Benjamini-Hochberg (BH) adjusted \(p\)-values.
summary(glm_fit)
## # A tibble: 10 × 3
## protein_name pvalues_unadj pvalues_adj
## <chr> <dbl> <dbl>
## 1 m03 0.012 0.12
## 2 m01 0.07 0.35
## 3 m02 0.198 0.66
## 4 m06 0.386 0.678
## 5 m05 0.438 0.678
## 6 m09 0.488 0.678
## 7 m07 0.526 0.678
## 8 m10 0.542 0.678
## 9 m08 0.71 0.789
## 10 m04 0.896 0.896
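The adjusted column can be reproduced from the unadjusted one with base R's p.adjust, which helps clarify what the BH adjustment does. The sketch below copies the unadjusted values from the table above and also recomputes the adjustment by hand. Whether summary calls p.adjust internally is an assumption, not something stated here.

```r
# Unadjusted p-values copied from the summary table, in the same order.
p_unadj <- c(m03 = 0.012, m01 = 0.07, m02 = 0.198, m06 = 0.386, m05 = 0.438,
             m09 = 0.488, m07 = 0.526, m10 = 0.542, m08 = 0.71, m04 = 0.896)

# Benjamini-Hochberg by hand: scale the k-th smallest p-value by n / k, then
# enforce monotonicity with running minima taken from the largest p-value down.
n <- length(p_unadj)
p_sorted <- sort(p_unadj)
by_hand <- pmin(1, rev(cummin(rev(p_sorted * n / seq_len(n)))))

# The same adjustment with the built-in function.
p_adj <- p.adjust(p_unadj, method = "BH")

all.equal(unname(by_hand), unname(p_adj[names(p_sorted)]))
signif(p_adj, 3)
```

The running minimum explains why several proteins share the same adjusted value: the adjusted \(p\)-value of a protein can never exceed that of a protein with a larger unadjusted \(p\)-value.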
We can extract the proteins below a false discovery rate (FDR) of \(0.1\) by filtering the \(p\)-value table.
summary(glm_fit) |> dplyr::filter(pvalues_adj < 0.1)
## # A tibble: 0 × 3
## # ℹ 3 variables: protein_name <chr>, pvalues_unadj <dbl>, pvalues_adj <dbl>
sessionInfo()
## R version 4.5.0 Patched (2025-04-21 r88169)
## Platform: aarch64-apple-darwin20
## Running under: macOS Ventura 13.7.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] CytoGLMM_1.17.0 BiocStyle_2.37.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4041.110 dplyr_1.1.4
## [4] farver_2.1.2 fastmap_1.2.0 pROC_1.18.5
## [7] caret_7.0-1 digest_0.6.37 rpart_4.1.24
## [10] timechange_0.3.0 lifecycle_1.0.4 factoextra_1.0.7
## [13] survival_3.8-3 magrittr_2.0.3 compiler_4.5.0
## [16] rlang_1.1.6 sass_0.4.10 tools_4.5.0
## [19] utf8_1.2.4 yaml_2.3.10 data.table_1.17.0
## [22] knitr_1.50 labeling_0.4.3 plyr_1.8.9
## [25] RColorBrewer_1.1-3 BiocParallel_1.43.0 withr_3.0.2
## [28] purrr_1.0.4 nnet_7.3-20 grid_4.5.0
## [31] stats4_4.5.0 future_1.40.0 ggplot2_3.5.2
## [34] globals_0.17.0 scales_1.4.0 iterators_1.0.14
## [37] MASS_7.3-65 tinytex_0.57 dichromat_2.0-0.1
## [40] cli_3.6.5 rmarkdown_2.29 generics_0.1.3
## [43] future.apply_1.11.3 reshape2_1.4.4 cachem_1.1.0
## [46] stringr_1.5.1 modeltools_0.2-23 splines_4.5.0
## [49] parallel_4.5.0 BiocManager_1.30.25 vctrs_0.6.5
## [52] hardhat_1.4.1 Matrix_1.7-3 sandwich_3.1-1
## [55] jsonlite_2.0.0 bookdown_0.43 ggrepel_0.9.6
## [58] listenv_0.9.1 magick_2.8.6 foreach_1.5.2
## [61] tidyr_1.3.1 gower_1.0.2 jquerylib_0.1.4
## [64] recipes_1.3.0 glue_1.8.0 parallelly_1.43.0
## [67] codetools_0.2-20 cowplot_1.1.3 lubridate_1.9.4
## [70] stringi_1.8.7 strucchange_1.5-4 gtable_0.3.6
## [73] tibble_3.2.1 pillar_1.10.2 htmltools_0.5.8.1
## [76] ipred_0.9-15 lava_1.8.1 R6_2.6.1
## [79] evaluate_1.0.3 lattice_0.22-7 pheatmap_1.0.12
## [82] bslib_0.9.0 class_7.3-23 Rcpp_1.0.14
## [85] flexmix_2.3-20 nlme_3.1-168 prodlim_2024.06.25
## [88] xfun_0.52 zoo_1.8-14 pkgconfig_2.0.3
## [91] ModelMetrics_1.2.2.2