How to use viewmastR
2024-01-23
HowTo.Rmd
ViewmastR is a tool designed to predict cell type assignments in a query dataset based on reference data. In this tutorial, you’ll learn how to install and use viewmastR, load data, and evaluate its predictions.
Prerequisites
Before we begin, ensure you have an updated Rust installation, as it’s a core dependency. You can follow the instructions provided on the official Rust installation page.
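Before installing, it can help to confirm from within R that the Rust toolchain is visible. The following is a minimal sketch using only base R; it assumes that having rustc and cargo on the PATH is what the build needs, which is an assumption here rather than something stated by the package.

```r
# Look for the Rust toolchain binaries on the PATH.
# Sys.which() returns "" for any program it cannot find.
rust_tools <- Sys.which(c("rustc", "cargo"))
has_rust <- all(nzchar(rust_tools))

if (has_rust) {
  message("Rust toolchain found: ", paste(rust_tools, collapse = ", "))
} else {
  message("rustc and/or cargo not found on PATH; install Rust before installing viewmastR.")
}
```

If the check fails inside RStudio but succeeds in a terminal, the two environments probably have different PATH settings.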
Installing viewmastR
First, ensure you have the devtools R package installed, which allows you to install packages from GitHub. If devtools is installed, you can easily install viewmastR using the following command:
devtools::install_github("furlan-lab/viewmastR")
This will fetch the latest version of viewmastR from GitHub and install it.
Running viewmastR
In this section, we’ll load two Seurat objects:
- Query dataset (seu): Contains the data you want to classify.
- Reference dataset (seur): Contains known cell type labels used to train the model.
ViewmastR predicts the cell types of your query dataset by leveraging the features associated with cell type labels in the reference data.
# Load required packages
suppressPackageStartupMessages({
library(viewmastR)
library(Seurat)
library(ggplot2)
library(scCustomize)
})
# Load query and reference datasets
# ROOT_DIR1 and ROOT_DIR2 must be defined beforehand as the directories containing these files
seu <- readRDS(file.path(ROOT_DIR1, "240813_final_object.RDS"))
seur <- readRDS(file.path(ROOT_DIR2, "230329_rnaAugmented_seurat.RDS"))
Defining “Ground Truth” in the Query Dataset
Although we don’t know the cell type labels for the query dataset a priori, we can approximate the ground truth by using cluster-based cell type assignments. This approximation will help us evaluate the accuracy of viewmastR’s predictions. We can visualize the query dataset with its ground truth labels to get an initial idea of the cell types we’re working with.
DimPlot(seu, group.by = "ground_truth", cols = seur@misc$colors)
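The tutorial does not show how the ground_truth column was built. One common way to turn cluster assignments into per-cell labels is a named lookup vector, sketched below with base R. The cluster IDs, label names, and mapping are all hypothetical and are not taken from the dataset used here.

```r
# Toy cluster IDs for six cells (hypothetical values).
clusters <- c("0", "1", "1", "2", "0", "2")

# Hypothetical mapping from cluster ID to a cell type label.
cluster_to_label <- c("0" = "B", "1" = "CD8.N", "2" = "Mono")

# Index the named vector by cluster ID to get one label per cell.
ground_truth <- unname(cluster_to_label[clusters])
ground_truth
```

With a Seurat object, a vector built this way could be stored as a metadata column (for example, seu$ground_truth), provided it has one entry per cell in the same order as the cells.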
Finding Common Features
The performance of viewmastR is enhanced when the features (genes) are consistent between the query and reference datasets. We’ll now identify and select highly variable genes in both datasets and find the common genes to use for training the model.
# Calculate and plot gene dispersion in query dataset
seu <- calculate_gene_dispersion(seu)
plot_gene_dispersion(seu)
seu <- select_genes(seu, top_n = 10000, logmean_ul = -1, logmean_ll = -8)
plot_gene_dispersion(seu)
vgq <- get_selected_genes(seu)
# Repeat the process for the reference dataset
seur <- calculate_gene_dispersion(seur)
plot_gene_dispersion(seur)
seur <- select_genes(seur, top_n = 10000, logmean_ul = -1, logmean_ll = -8)
plot_gene_dispersion(seur)
vgr <- get_selected_genes(seur)
# Find common genes
vg <- intersect(vgq, vgr)
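To make the intersection step concrete, here is a small base R sketch on toy gene vectors. The gene lists are hypothetical stand-ins for vgq and vgr; the point is only to show what intersect() keeps and how to check how much of each selection survives.

```r
# Hypothetical gene selections standing in for vgq and vgr.
vgq_toy <- c("CD3E", "CD19", "MS4A1", "LYZ", "GNLY")
vgr_toy <- c("CD19", "LYZ", "CD3E", "NKG7")

# intersect() keeps the genes present in both vectors,
# in the order they appear in the first argument.
vg_toy <- intersect(vgq_toy, vgr_toy)

# Fraction of each selection that survives the intersection.
frac_query <- length(vg_toy) / length(vgq_toy)
frac_ref <- length(vg_toy) / length(vgr_toy)
```

On real data, a very small overlap may be a sign that the two objects use different gene naming conventions, so it is worth checking length(vg) before training.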
Visualizing Reference Cell Types
Next, we visualize the reference dataset to see the known cell type classifications that viewmastR will use to train its model.
DimPlot(seur, group.by = "SFClassification", cols = seur@misc$colors)
Running viewmastR
Now we run viewmastR to predict cell types in the query dataset. This function will learn from the reference dataset’s cell type annotations and apply its knowledge to classify the query cells.
seu <- viewmastR(seu, seur, ref_celldata_col = "SFClassification", selected_genes = vg, max_epochs = 4)
Visualizing Predictions
After running viewmastR, we can visualize the predicted cell types for the query dataset.
DimPlot(seu, group.by = "viewmastR_pred", cols = seur@misc$colors)
Evaluating Model Accuracy with a Confusion Matrix
We can further evaluate the accuracy of viewmastR’s predictions by comparing them to the ground truth labels (approximated earlier) using a confusion matrix.
confusion_matrix(pred = factor(seu$viewmastR_pred), gt = factor(seu$ground_truth), cols = seur@misc$colors)
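The comparison behind a confusion matrix can also be summarized numerically. The sketch below uses base R's table() on toy label vectors; it is not the implementation of viewmastR's confusion_matrix() function, and the labels are hypothetical.

```r
# Toy predicted and ground-truth labels for six cells (hypothetical).
pred <- c("B", "B", "T", "T", "T", "Mono")
gt <- c("B", "T", "T", "T", "Mono", "Mono")

# Use a shared set of levels so the table is square.
lv <- sort(union(pred, gt))
cm <- table(predicted = factor(pred, levels = lv),
            truth = factor(gt, levels = lv))

# Overall accuracy: correctly classified cells over all cells.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: correct predictions over true members of each class.
per_class_recall <- diag(cm) / colSums(cm)
```

Per-class recall is often more informative than overall accuracy when some cell types are rare, since a few large classes can dominate the overall figure.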
Analyzing Training Performance
ViewmastR can also return a detailed training history, including metrics like training loss and validation loss over time. This helps diagnose overfitting or underfitting during model training.
To access these metrics, you need to set the return_type parameter to "list". Here’s an example of how to retrieve and plot the training data:
# Run viewmastR with return_type = "list"
output_list <- viewmastR(seu, seur, ref_celldata_col = "SFClassification", selected_genes = vg, return_type = "list")
# Plot training data
plot_training_data(output_list)
We can now visualize how the training and validation losses change over the epochs. If the training loss keeps decreasing while the validation loss plateaus or increases, it may indicate overfitting.
plt <- plot_training_data(output_list)
plt
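The overfitting pattern described above can also be checked numerically. The following base R sketch uses made-up loss values; how the loss history is stored inside output_list is not shown in this tutorial, so the two vectors below are hypothetical stand-ins.

```r
# Hypothetical per-epoch losses (not taken from a real run).
train_loss <- c(1.20, 0.80, 0.55, 0.40, 0.30, 0.22)
val_loss <- c(1.25, 0.90, 0.70, 0.68, 0.72, 0.80)

# Epoch with the lowest validation loss.
best_epoch <- which.min(val_loss)

# Epochs where training loss fell while validation loss rose
# relative to the previous epoch (+1 because diff() drops the first epoch).
overfit_epochs <- which(diff(train_loss) < 0 & diff(val_loss) > 0) + 1
```

In this toy history, validation loss bottoms out at epoch 4 and rises afterwards while training loss keeps falling, which suggests that a max_epochs value near 4 would be a reasonable choice for data behaving like this.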
Probabilities
Finally, we can also look at prediction probabilities using the return_probs argument. Doing so adds one metadata column to the object per prediction class, each prefixed with the string “prob_” (for example, prob_14_B). The values are the model’s log-odds predictions converted to probabilities with R’s plogis function.
seu <- viewmastR(seu, seur, ref_celldata_col = "SFClassification", selected_genes = vg, backend = "candle", max_epochs = 4, return_probs = TRUE)
FeaturePlot_scCustom(seu, features = "prob_14_B")
FeaturePlot_scCustom(seu, features = "prob_16_CD8.N")
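For readers unfamiliar with plogis, the short base R sketch below shows the log-odds to probability conversion on a few illustrative values. The input numbers are made up and are not model output.

```r
# Illustrative log-odds values (hypothetical, not model output).
log_odds <- c(-2, 0, 3)

# plogis() is the logistic function: 1 / (1 + exp(-x)).
probs <- plogis(log_odds)

# The same transformation written out by hand.
manual <- 1 / (1 + exp(-log_odds))

# qlogis() is the inverse and recovers the log-odds.
recovered <- qlogis(probs)
```

A log-odds of 0 maps to a probability of 0.5, and larger positive values approach 1. If each class is transformed independently in this way, the values for a single cell need not sum to 1 across classes.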
Appendix
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Ventura 13.6.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scCustomize_2.1.2 ggplot2_3.5.1 Seurat_5.1.0 SeuratObject_5.0.2
## [5] sp_2.1-4 viewmastR_0.2.3
##
## loaded via a namespace (and not attached):
## [1] fs_1.6.4 matrixStats_1.3.0
## [3] spatstat.sparse_3.0-3 RcppMsgPack_0.2.3
## [5] lubridate_1.9.3 httr_1.4.7
## [7] RColorBrewer_1.1-3 doParallel_1.0.17
## [9] tools_4.4.0 sctransform_0.4.1
## [11] backports_1.5.0 utf8_1.2.4
## [13] R6_2.5.1 lazyeval_0.2.2
## [15] uwot_0.2.2 GetoptLong_1.0.5
## [17] withr_3.0.0 gridExtra_2.3
## [19] progressr_0.14.0 cli_3.6.2
## [21] Biobase_2.64.0 textshaping_0.4.0
## [23] spatstat.explore_3.2-7 fastDummies_1.7.3
## [25] labeling_0.4.3 sass_0.4.9
## [27] spatstat.data_3.0-4 proxy_0.4-27
## [29] ggridges_0.5.6 pbapply_1.7-2
## [31] pkgdown_2.0.9 systemfonts_1.1.0
## [33] foreign_0.8-86 R.utils_2.12.3
## [35] parallelly_1.37.1 rstudioapi_0.16.0
## [37] generics_0.1.3 shape_1.4.6.1
## [39] crosstalk_1.2.1 ica_1.0-3
## [41] spatstat.random_3.2-3 dplyr_1.1.4
## [43] Matrix_1.7-0 ggbeeswarm_0.7.2
## [45] fansi_1.0.6 S4Vectors_0.42.0
## [47] abind_1.4-5 R.methodsS3_1.8.2
## [49] lifecycle_1.0.4 yaml_2.3.8
## [51] snakecase_0.11.1 SummarizedExperiment_1.34.0
## [53] recipes_1.1.0 SparseArray_1.4.8
## [55] Rtsne_0.17 paletteer_1.6.0
## [57] grid_4.4.0 promises_1.3.0
## [59] crayon_1.5.2 miniUI_0.1.1.1
## [61] lattice_0.22-6 cowplot_1.1.3
## [63] pillar_1.9.0 knitr_1.46
## [65] ComplexHeatmap_2.20.0 GenomicRanges_1.56.0
## [67] rjson_0.2.21 boot_1.3-30
## [69] future.apply_1.11.2 codetools_0.2-20
## [71] leiden_0.4.3.1 glue_1.7.0
## [73] data.table_1.15.4 vctrs_0.6.5
## [75] png_0.1-8 spam_2.10-0
## [77] gtable_0.3.5 rematch2_2.1.2
## [79] assertthat_0.2.1 cachem_1.1.0
## [81] gower_1.0.1 xfun_0.44
## [83] S4Arrays_1.4.1 mime_0.12
## [85] prodlim_2024.06.25 survival_3.6-4
## [87] timeDate_4041.110 SingleCellExperiment_1.26.0
## [89] iterators_1.0.14 pbmcapply_1.5.1
## [91] hardhat_1.4.0 lava_1.8.0
## [93] fitdistrplus_1.1-11 ROCR_1.0-11
## [95] ipred_0.9-15 nlme_3.1-164
## [97] RcppAnnoy_0.0.22 GenomeInfoDb_1.40.1
## [99] bslib_0.7.0 irlba_2.3.5.1
## [101] vipor_0.4.7 KernSmooth_2.23-24
## [103] rpart_4.1.23 colorspace_2.1-0
## [105] BiocGenerics_0.50.0 Hmisc_5.1-2
## [107] nnet_7.3-19 ggrastr_1.0.2
## [109] tidyselect_1.2.1 compiler_4.4.0
## [111] htmlTable_2.4.2 desc_1.4.3
## [113] DelayedArray_0.30.1 plotly_4.10.4
## [115] checkmate_2.3.1 scales_1.3.0
## [117] lmtest_0.9-40 stringr_1.5.1
## [119] digest_0.6.35 goftest_1.2-3
## [121] spatstat.utils_3.1-0 minqa_1.2.7
## [123] rmarkdown_2.27 XVector_0.44.0
## [125] htmltools_0.5.8.1 pkgconfig_2.0.3
## [127] base64enc_0.1-3 lme4_1.1-35.3
## [129] sparseMatrixStats_1.16.0 MatrixGenerics_1.16.0
## [131] highr_0.10 fastmap_1.2.0
## [133] rlang_1.1.4 GlobalOptions_0.1.2
## [135] htmlwidgets_1.6.4 UCSC.utils_1.0.0
## [137] shiny_1.8.1.1 DelayedMatrixStats_1.26.0
## [139] farver_2.1.2 jquerylib_0.1.4
## [141] zoo_1.8-12 jsonlite_1.8.8
## [143] ModelMetrics_1.2.2.2 R.oo_1.26.0
## [145] magrittr_2.0.3 Formula_1.2-5
## [147] GenomeInfoDbData_1.2.12 dotCall64_1.1-1
## [149] patchwork_1.2.0 munsell_0.5.1
## [151] Rcpp_1.0.12 reticulate_1.37.0
## [153] stringi_1.8.4 pROC_1.18.5
## [155] zlibbioc_1.50.0 MASS_7.3-60.2
## [157] plyr_1.8.9 parallel_4.4.0
## [159] listenv_0.9.1 ggrepel_0.9.5
## [161] forcats_1.0.0 deldir_2.0-4
## [163] splines_4.4.0 tensor_1.5
## [165] circlize_0.4.16 igraph_2.0.3
## [167] spatstat.geom_3.2-9 RcppHNSW_0.6.0
## [169] reshape2_1.4.4 stats4_4.4.0
## [171] evaluate_0.23 ggprism_1.0.5
## [173] nloptr_2.0.3 foreach_1.5.2
## [175] httpuv_1.6.15 RANN_2.6.1
## [177] tidyr_1.3.1 purrr_1.0.2
## [179] polyclip_1.10-6 future_1.33.2
## [181] clue_0.3-65 scattermore_1.2
## [183] janitor_2.2.0 xtable_1.8-4
## [185] monocle3_1.3.7 e1071_1.7-16
## [187] RSpectra_0.16-1 later_1.3.2
## [189] viridisLite_0.4.2 class_7.3-22
## [191] ragg_1.3.2 tibble_3.2.1
## [193] memoise_2.0.1 beeswarm_0.4.0
## [195] IRanges_2.38.0 cluster_2.1.6
## [197] timechange_0.3.0 globals_0.16.3
## [199] caret_6.0-94
getwd()
## [1] "/Users/sfurla/develop/viewmastR/vignettes"