Skip to contents

The viewmastR function preprocesses one or two single-cell datasets (a reference and an optional query), splits the reference data into training and test sets, and optionally includes the ability to run inference on a query dataset for downstream analysis. It then applies specified modeling functions (e.g., MLR, NN, NB) to train and optionally predict on the query data.

Usage

viewmastR(
  query_cds,
  ref_cds,
  ref_celldata_col,
  query_celldata_col = NULL,
  FUNC = c("mlr", "nn", "nb"),
  norm_method = c("log", "binary", "size_only", "none"),
  selected_features = NULL,
  train_frac = 0.8,
  tf_idf = FALSE,
  scale = FALSE,
  hidden_layers = c(as.integer(500), as.integer(100)),
  learning_rate = 0.001,
  max_epochs = 5,
  LSImethod = 1,
  verbose = TRUE,
  backend = c("auto", "wgpu", "nd", "candle"),
  threshold = NULL,
  keras_model = NULL,
  model_dir = "/tmp/sc_local",
  return_probs = FALSE,
  return_type = c("object", "list"),
  debug = FALSE,
  train_only = FALSE,
  addbias = FALSE,
  ...
)

Arguments

query_cds

A Seurat or cell_data_set object representing the query dataset. If NULL, the function will operate in "reference-only" mode, using the reference dataset for training and testing only.

ref_cds

A Seurat or cell_data_set object representing the reference dataset. This is required.

ref_celldata_col

A character string specifying the metadata column in ref_cds that contains the cell labels.

query_celldata_col

A character string specifying a metadata column name in query_cds (or reference in reference-only mode) where predicted labels should be stored. If NULL, defaults to "viewmastR_pred".

FUNC

A character string specifying the modeling function to apply. One of "mlr", "nn", or "nb".

norm_method

Character string indicating the normalization method. One of "log", "binary", "size_only", or "none".

selected_features

A character vector specifying genes to subset. If NULL, uses the set of common features (if query is provided) or selected genes directly (if reference-only).

train_frac

A numeric value between 0 and 1 specifying the fraction of reference cells to use for training. The remainder are used for testing.

tf_idf

Logical, whether to apply TF-IDF transformation after normalization.

scale

Logical, whether to scale the data. If both tf_idf and scale are TRUE, TF-IDF takes precedence.

hidden_layers

A numeric vector indicating the size of hidden layers (for the NN model). Only 1 or 2 layers are allowed.

learning_rate

Numeric, learning rate for model training.

max_epochs

Integer, the maximum number of epochs for model training.

LSImethod

Integer, specifying the TF-IDF method variant if using TF-IDF.

verbose

Logical, whether to print progress messages.

backend

A character string specifying the backend to use. One of "wgpu", "nd", "candle".

threshold

Currently unused. Can be NULL.

keras_model

Currently unused. Can be NULL.

model_dir

A character string specifying the directory to store model artifacts.

return_probs

Logical, whether to return predicted probabilities in the object's metadata.

return_type

A character string specifying the return type. One of "object" or "list". If "object", returns the updated query_cds. If "list", returns a list containing object and training_output.

debug

Logical, whether to print debugging messages and dimension checks.

train_only

Logical, if TRUE, only the reference data is processed and no query data is included.

addbias

Logical, whether to add a bias term (a row of ones) to the data.

...

Additional arguments passed to setup_training.

Value

Depending on return_type, returns either:

  • return_type = "object": the input query_cds (or ref_cds if query_cds = NULL) with predicted labels (and optionally probabilities) appended.

  • return_type = "list": a list containing:

    object

    The updated query_cds (or ref_cds).

    training_output

    The output from the model training process, including probabilities if applicable.

Details

The function first calls setup_training to preprocess and split the data into training, testing, and optionally query subsets. Then, based on the selected FUNC, it calls one of the model training and prediction functions (process_learning_obj_mlr, process_learning_obj_ann, process_learning_obj_nb). If train_only = TRUE, the query portion is skipped and no query predictions are made.

For "mlr" and "nn" functions, predicted log odds are converted to probabilities using the logistic function. Predicted cell labels are assigned to the query_cds (or ref_cds if query is not provided).

Examples

if (FALSE) { # \dontrun{
# Training and predicting with reference and query data:
res <- viewmastR(
  query_cds = query_seurat_obj,
  ref_cds = ref_seurat_obj,
  ref_celldata_col = "cell_type",
  FUNC = "mlr",
  norm_method = "log",
  train_frac = 0.8,
  backend = "wgpu",
  verbose = TRUE,
  return_type = "object"
)

# Reference-only scenario:
res_ref <- viewmastR(
  query_cds = NULL,
  ref_cds = ref_cds_obj,
  ref_celldata_col = "cell_type",
  FUNC = "nn",
  norm_method = "none",
  train_frac = 0.7,
  scale = TRUE,
  train_only = TRUE,
  return_type = "list"
)
} # }