scgpt package

Modules

scgpt.data_collator

class scgpt.data_collator.DataCollator(do_padding: bool = True, pad_token_id: int | None = None, pad_value: int = 0, do_mlm: bool = True, do_binning: bool = True, mlm_probability: float = 0.15, mask_value: int = -1, max_length: int | None = None, sampling: bool = True, reserve_keys: ~typing.List[str] = <factory>, keep_first_n_tokens: int = 1, data_style: str = 'pcpt')[source]

Bases: object

Data collator for the mask value learning task. It pads the sequences to the maximum length in the batch and masks the gene expression values.

Parameters:
  • do_padding (bool) – whether to pad the sequences to the max length.

  • pad_token_id (int, optional) – the token id to use for padding. This is required if do_padding is True.

  • pad_value (int) – the value to use for padding the expression values to the max length.

  • do_mlm (bool) – whether to do masking with MLM.

  • do_binning (bool) – whether to bin the expression values.

  • mlm_probability (float) – the probability of masking with MLM.

  • mask_value (int) – the value to fill at the expression positions that are masked.

  • max_length (int, optional) – the maximum length of the sequences. This is required if do_padding is True.

  • sampling (bool) – whether to do sampling instead of truncation if length > max_length.

  • reserve_keys (List[str], optional) – a list of keys in the examples to reserve in the output dictionary. Default to []. These fields will be kept unchanged in the output.

  • keep_first_n_tokens (int) – the number of tokens in the beginning of the sequence to keep unchanged from sampling. This is useful when special tokens have been added to the beginning of the sequence. Default to 1.

  • data_style (str) – the style of the data. If “pcpt”, the data is masked and padded for perception training. If “gen”, only the gene tokens are provided, without expression values, for the pure generative training setting. If “both”, the output contains the fields for both styles. Choices: “pcpt”, “gen”, “both”. Default to “pcpt”.

data_style: str = 'pcpt'
do_binning: bool = True
do_mlm: bool = True
do_padding: bool = True
get_mlm_probability() → float[source]

Get the MLM probability for the current step.

keep_first_n_tokens: int = 1
mask_value: int = -1
max_length: int | None = None
mlm_probability: float = 0.15
pad_token_id: int | None = None
pad_value: int = 0
reserve_keys: List[str]
sampling: bool = True
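
Example (a minimal usage sketch; the “genes”/“expressions” field names and the token ids below are illustrative assumptions, not part of the documented API):

  import torch
  from scgpt.data_collator import DataCollator

  # Hypothetical tokenized examples: gene-token ids plus raw expression values.
  examples = [
      {"genes": torch.tensor([1, 3, 17]), "expressions": torch.tensor([0.0, 2.0, 5.0])},
      {"genes": torch.tensor([1, 8]), "expressions": torch.tensor([0.0, 1.0])},
  ]

  collator = DataCollator(
      do_padding=True,
      pad_token_id=0,       # required when do_padding is True
      pad_value=0,
      do_mlm=True,
      mlm_probability=0.15,
      mask_value=-1,
      max_length=1200,      # also required when do_padding is True
      data_style="pcpt",
  )
  batch = collator(examples)  # dict of padded, binned, and masked tensors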

scgpt.data_sampler

class scgpt.data_sampler.SubsetSequentialSampler(indices: Sequence[int])[source]

Bases: Sampler

Samples elements sequentially from a given list of indices, without replacement.

Parameters:

indices (Sequence[int]) – the sequence of indices to sample from
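
Example (a minimal sketch; `dataset` stands for any map-style PyTorch Dataset defined elsewhere):

  from torch.utils.data import DataLoader
  from scgpt.data_sampler import SubsetSequentialSampler

  # Iterate deterministically over a fixed subset of the dataset,
  # e.g. a held-out validation split.
  sampler = SubsetSequentialSampler(indices=[4, 8, 15, 16, 23, 42])
  loader = DataLoader(dataset, batch_size=2, sampler=sampler)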

class scgpt.data_sampler.SubsetsBatchSampler(subsets: List[Sequence[int]], batch_size: int, intra_subset_shuffle: bool = True, inter_subset_shuffle: bool = True, drop_last: bool = False)[source]

Bases: Sampler[List[int]]

Samples batches of indices from a list of index subsets. Each subset represents a data subset and is sampled without replacement, randomly or sequentially. In particular, each batch contains indices from a single subset only. This sampler is for the scenario where samples need to be drawn from multiple subsets separately.

Parameters:
  • subsets (List[Sequence[int]]) – A list of subsets of indices.

  • batch_size (int) – Size of mini-batch.

  • intra_subset_shuffle (bool) – If True, the sampler will shuffle the indices within each subset.

  • inter_subset_shuffle (bool) – If True, the sampler will shuffle the order of subsets.

  • drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size.
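
Example (a minimal sketch; `dataset` is again an assumed map-style Dataset):

  from torch.utils.data import DataLoader
  from scgpt.data_sampler import SubsetsBatchSampler

  # Two data subsets, e.g. two sequencing batches. Every mini-batch is drawn
  # from exactly one subset; the two are never mixed within a batch.
  batch_sampler = SubsetsBatchSampler(
      subsets=[list(range(0, 100)), list(range(100, 250))],
      batch_size=16,
      intra_subset_shuffle=True,
      inter_subset_shuffle=True,
      drop_last=False,
  )
  loader = DataLoader(dataset, batch_sampler=batch_sampler)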

scgpt.loss

scgpt.loss.criterion_neg_log_bernoulli(input: Tensor, target: Tensor, mask: Tensor) → Tensor[source]

Compute the negative log-likelihood of a Bernoulli distribution.

scgpt.loss.masked_mse_loss(input: Tensor, target: Tensor, mask: Tensor) → Tensor[source]

Compute the masked MSE loss between input and target.

scgpt.loss.masked_relative_error(input: Tensor, target: Tensor, mask: LongTensor) → Tensor[source]

Compute the masked relative error between input and target.
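
Example (a call-pattern sketch; the shapes, the random inputs, and treating nonzero expression as the Bernoulli outcome are illustrative assumptions):

  import torch
  from scgpt.loss import (
      criterion_neg_log_bernoulli,
      masked_mse_loss,
      masked_relative_error,
  )

  pred   = torch.randn(4, 10)           # model outputs
  target = torch.randn(4, 10)           # ground-truth expression values
  mask   = torch.rand(4, 10) < 0.15     # positions that were masked for MLM

  mse = masked_mse_loss(pred, target, mask)               # MSE over masked positions only
  rel = masked_relative_error(pred, target, mask.long())  # relative error, same positions

  probs = torch.sigmoid(pred)           # predicted Bernoulli probabilities
  nll = criterion_neg_log_bernoulli(probs, (target > 0).float(), mask)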

scgpt.preprocess

class scgpt.preprocess.Preprocessor(use_key: str | None = None, filter_gene_by_counts: int | bool = False, filter_cell_by_counts: int | bool = False, normalize_total: float | bool = 10000.0, result_normed_key: str | None = 'X_normed', log1p: bool = False, result_log1p_key: str = 'X_log1p', subset_hvg: int | bool = False, hvg_use_key: str | None = None, hvg_flavor: str = 'seurat_v3', binning: int | None = None, result_binned_key: str = 'X_binned')[source]

Bases: object

Prepare data for the training, validation, and test splits. Normalize raw expression values, bin them, or apply other transforms to produce the preset model input format.

Set up the preprocessor; the arguments configure the workflow steps.

Args:

use_key (str, optional):

The key of AnnData to use for preprocessing.

filter_gene_by_counts (int or bool, default: False):

Whether to filter genes by counts. If an int, genes with fewer counts than this value are filtered out.

filter_cell_by_counts (int or bool, default: False):

Whether to filter cells by counts. If an int, cells with fewer counts than this value are filtered out.

normalize_total (float or bool, default: 1e4):

Whether to normalize the total counts of each cell to a specific value.

result_normed_key (str, default: "X_normed"):

The key of AnnData to store the normalized data. If None, the normalized data will replace the data at use_key.

log1p (bool, default: False):

Whether to apply log1p transform to the normalized data.

result_log1p_key (str, default: "X_log1p"):

The key of AnnData to store the log1p transformed data.

subset_hvg (int or bool, default: False):

Whether to subset highly variable genes.

hvg_use_key (str, optional):

The key of AnnData to use for calculating highly variable genes. If None, will use adata.X.

hvg_flavor (str, default: "seurat_v3"):

The flavor of highly variable genes selection. See scanpy.pp.highly_variable_genes() for more details.

binning (int, optional):

Whether to bin the data into the provided number of discrete bins.

result_binned_key (str, default: "X_binned"):

The key of AnnData to store the binned data.
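
Example (a sketch following the scGPT tutorials; the input file and step values are illustrative, and applying the steps by calling the instance on an AnnData object is assumed):

  import scanpy as sc
  from scgpt.preprocess import Preprocessor

  adata = sc.read_h5ad("my_dataset.h5ad")  # hypothetical input file

  preprocessor = Preprocessor(
      use_key="X",                  # use adata.X (raw counts) as the source layer
      filter_gene_by_counts=3,      # drop genes with fewer than 3 counts
      normalize_total=1e4,          # library-size normalize each cell to 10k
      result_normed_key="X_normed",
      log1p=True,
      result_log1p_key="X_log1p",
      subset_hvg=1200,              # keep the 1200 most variable genes
      hvg_flavor="seurat_v3",
      binning=51,                   # discretize into 51 bins
      result_binned_key="X_binned",
  )
  preprocessor(adata)               # runs the configured steps on adata in place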

check_logged(adata: AnnData, obs_key: str | None = None) → bool[source]

Check if the data is already log1p transformed.

Args:

adata (AnnData):

The AnnData object to preprocess.

obs_key (str, optional):

The key of AnnData.obs to use for batch information. This arg is used in the highly variable gene selection step.
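
Example (a minimal sketch; `adata` is assumed to be loaded elsewhere):

  pp = Preprocessor(use_key="X", log1p=False)
  if pp.check_logged(adata):
      print("data already log1p transformed; skipping the log1p step")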

scgpt.preprocess.binning(row: ndarray | Tensor, n_bins: int) → ndarray | Tensor[source]

Bin the row into n_bins discrete values.
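
Example, plus a conceptual sketch of quantile binning (the sketch is an assumption for illustration, not the library's code; scGPT's exact handling of zeros and ties may differ):

  import numpy as np
  from scgpt.preprocess import binning

  row = np.array([0.0, 1.0, 3.0, 3.0, 10.0, 0.0])
  binned = binning(row, n_bins=5)   # discrete bin ids, same shape as `row`

  # Conceptually: nonzero values are digitized against their own quantiles,
  # so the bin edges adapt to each row's expression distribution.
  edges = np.quantile(row[row > 0], np.linspace(0, 1, 5 - 1))
  approx = np.digitize(row, edges)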

scgpt.trainer

class scgpt.trainer.SeqDataset(data: Dict[str, Tensor])[source]

Bases: Dataset

scgpt.trainer.define_wandb_metrcis()[source]
scgpt.trainer.eval_testdata(model: Module, adata_t: AnnData, gene_ids, vocab, config, logger, include_types: List[str] = ['cls']) → Dict | None[source]

Evaluate the model on the test data in adata_t.

scgpt.trainer.evaluate(model: Module, loader: DataLoader, vocab, criterion_gep_gepc, criterion_dab, criterion_cls, device, config, epoch) → float[source]

Evaluate the model on the evaluation data.

scgpt.trainer.predict(model: Module, loader: DataLoader, vocab, config, device) → float[source]

Evaluate the model on the evaluation data.

scgpt.trainer.prepare_data(tokenized_train, tokenized_valid, train_batch_labels, valid_batch_labels, config, epoch, train_celltype_labels=None, valid_celltype_labels=None, sort_seq_batch=False) → Tuple[Dict[str, Tensor]][source]
scgpt.trainer.prepare_dataloader(data_pt: Dict[str, Tensor], batch_size: int, shuffle: bool = False, intra_domain_shuffle: bool = False, drop_last: bool = False, num_workers: int = 0, per_seq_batch_sample: bool = False) → DataLoader[source]
scgpt.trainer.test(model: Module, adata: DataLoader, gene_ids, vocab, config, device, logger) → float[source]
scgpt.trainer.train(model: Module, loader: DataLoader, vocab, criterion_gep_gepc, criterion_dab, criterion_cls, scaler, optimizer, scheduler, device, config, logger, epoch) → None[source]

Train the model for one epoch.
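
Example (a hypothetical single-epoch flow wiring these helpers together; model, config, vocab, the criteria, optimizer state, logger, and the tokenized inputs are assumed to be set up beforehand as in the scGPT fine-tuning tutorials, and the two-element return of prepare_data is an assumption from its Tuple annotation):

  from scgpt.trainer import evaluate, prepare_data, prepare_dataloader, train

  # Assumed to exist already: model, tokenized_train, tokenized_valid,
  # train_batch_labels, valid_batch_labels, config, vocab, the three criteria,
  # scaler, optimizer, scheduler, device, logger, epoch.
  train_pt, valid_pt = prepare_data(
      tokenized_train, tokenized_valid,
      train_batch_labels, valid_batch_labels,
      config, epoch,
  )
  train_loader = prepare_dataloader(train_pt, batch_size=config.batch_size, shuffle=True)
  valid_loader = prepare_dataloader(valid_pt, batch_size=config.batch_size)

  train(model, train_loader, vocab, criterion_gep_gepc, criterion_dab,
        criterion_cls, scaler, optimizer, scheduler, device, config, logger, epoch)
  val_loss = evaluate(model, valid_loader, vocab, criterion_gep_gepc,
                      criterion_dab, criterion_cls, device, config, epoch)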