scgpt package
Modules
- scgpt.model
- Submodules
- scgpt.model.dsbn
- scgpt.model.flash_layers
- scgpt.model.generation_model
- scgpt.model.grad_reverse
- scgpt.model.layers
- scgpt.model.model
- scgpt.scbank package
- scgpt.scbank.data
- scgpt.scbank.databank
DataBank
DataBank.append_study()
DataBank.batch_from_anndata()
DataBank.custom_filter()
DataBank.data_tables
DataBank.delete_study()
DataBank.filter()
DataBank.from_anndata()
DataBank.from_path()
DataBank.gene_vocab
DataBank.link()
DataBank.load()
DataBank.load_all()
DataBank.load_anndata()
DataBank.load_table()
DataBank.main_data
DataBank.main_table_key
DataBank.meta_info
DataBank.save()
DataBank.settings
DataBank.sync()
DataBank.track()
DataBank.update_datatables()
- scgpt.scbank.monitor
- scgpt.scbank.setting
- Module contents
- scgpt.tasks package
- scgpt.tasks.grn
GeneEmbedding
GeneEmbedding.average_vector_results()
GeneEmbedding.cluster_definitions_as_df()
GeneEmbedding.compute_similarities()
GeneEmbedding.generate_network()
GeneEmbedding.generate_vector()
GeneEmbedding.generate_weighted_vector()
GeneEmbedding.get_adata()
GeneEmbedding.get_metagenes()
GeneEmbedding.get_similar_genes()
GeneEmbedding.plot_metagene()
GeneEmbedding.plot_metagenes_scores()
GeneEmbedding.plot_similarities()
GeneEmbedding.read_embedding()
GeneEmbedding.read_vector()
GeneEmbedding.score_metagenes()
- Module contents
- scgpt.tokenizer package
- scgpt.utils package
scgpt.data_collator
- class scgpt.data_collator.DataCollator(do_padding: bool = True, pad_token_id: int | None = None, pad_value: int = 0, do_mlm: bool = True, do_binning: bool = True, mlm_probability: float = 0.15, mask_value: int = -1, max_length: int | None = None, sampling: bool = True, reserve_keys: List[str] = <factory>, keep_first_n_tokens: int = 1, data_style: str = 'pcpt')[source]
Bases: object
Data collator for the mask value learning task. It pads the sequences to the maximum length in the batch and masks the gene expression values.
- Parameters:
  - do_padding (bool) – whether to pad the sequences to the max length.
  - pad_token_id (int, optional) – the token id to use for padding. This is required if do_padding is True.
  - pad_value (int) – the value to use for padding the expression values to the max length.
  - do_mlm (bool) – whether to do masking with MLM.
  - do_binning (bool) – whether to bin the expression values.
  - mlm_probability (float) – the probability of masking with MLM.
  - mask_value (int) – the value to fill at the expression positions that are masked.
  - max_length (int, optional) – the maximum length of the sequences. This is required if do_padding is True.
  - sampling (bool) – whether to do sampling instead of truncation if length > max_length.
  - reserve_keys (List[str], optional) – a list of keys in the examples to reserve in the output dictionary. Defaults to []. These fields will be kept unchanged in the output.
  - keep_first_n_tokens (int) – the number of tokens at the beginning of the sequence to keep unchanged from sampling. This is useful when special tokens have been added to the beginning of the sequence. Defaults to 1.
  - data_style (str) – the style of the data. If "pcpt", the data is masked and padded for perception training. If "gen", only the gene tokens are provided, but not the expression values, for the pure generative training setting. If "both", the output will contain both fields above. Choices: "pcpt", "gen", "both". Defaults to "pcpt".
- data_style: str = 'pcpt'
- do_binning: bool = True
- do_mlm: bool = True
- do_padding: bool = True
- keep_first_n_tokens: int = 1
- mask_value: int = -1
- max_length: int | None = None
- mlm_probability: float = 0.15
- pad_token_id: int | None = None
- pad_value: int = 0
- reserve_keys: List[str]
- sampling: bool = True
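A minimal usage sketch: the collator is a callable, so it can be invoked directly or passed as collate_fn to a DataLoader. The "genes"/"expressions" keys below follow the example format described in scGPT's data pipeline; the exact keys of your dataset may differ.

```python
import torch
from scgpt.data_collator import DataCollator

# Two toy "cells": gene-token ids plus their raw expression values.
examples = [
    {"genes": torch.tensor([1, 17, 42, 60]),
     "expressions": torch.tensor([0.0, 1.2, 3.4, 0.7])},
    {"genes": torch.tensor([1, 5, 9]),
     "expressions": torch.tensor([0.0, 2.1, 0.3])},
]

collator = DataCollator(
    do_padding=True,
    pad_token_id=0,        # required because do_padding=True
    pad_value=0,
    do_mlm=True,
    mlm_probability=0.15,
    mask_value=-1,
    max_length=6,
    data_style="pcpt",     # mask + pad for perception training
)
batch = collator(examples)  # dict of padded and masked batch tensors
```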
scgpt.data_sampler
- class scgpt.data_sampler.SubsetSequentialSampler(indices: Sequence[int])[source]
Bases: Sampler
Samples elements sequentially from a given list of indices, without replacement.
- Parameters:
  - indices (sequence) – a sequence of indices.
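As a usage sketch, the sampler plugs into a standard PyTorch DataLoader (the toy TensorDataset is only for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from scgpt.data_sampler import SubsetSequentialSampler

dataset = TensorDataset(torch.arange(100))
# Visit only these indices, in exactly this order.
sampler = SubsetSequentialSampler([4, 8, 15, 16, 23, 42])
loader = DataLoader(dataset, batch_size=2, sampler=sampler)
for (batch,) in loader:
    print(batch)  # tensor([ 4,  8]), tensor([15, 16]), tensor([23, 42])
```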
- class scgpt.data_sampler.SubsetsBatchSampler(subsets: List[Sequence[int]], batch_size: int, intra_subset_shuffle: bool = True, inter_subset_shuffle: bool = True, drop_last: bool = False)[source]
Bases: Sampler[List[int]]
Samples batches of indices from a list of subsets of indices. Each subset of indices represents a data subset and is sampled without replacement, randomly or sequentially. In particular, each batch only contains indices from a single subset. This sampler is for the scenario where samples need to be drawn from multiple subsets separately.
- Parameters:
  - subsets (List[Sequence[int]]) – A list of subsets of indices.
  - batch_size (int) – Size of mini-batch.
  - intra_subset_shuffle (bool) – If True, the sampler will shuffle the indices within each subset.
  - inter_subset_shuffle (bool) – If True, the sampler will shuffle the order of subsets.
  - drop_last (bool) – If True, the sampler will drop the last batch if its size would be less than batch_size.
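Because every batch is drawn from a single subset, the class is used as a DataLoader batch_sampler. A sketch with a toy dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from scgpt.data_sampler import SubsetsBatchSampler

dataset = TensorDataset(torch.arange(10))
# Two data subsets (e.g. two studies); a batch never mixes subsets.
batch_sampler = SubsetsBatchSampler(
    [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9]],
    batch_size=2,
    intra_subset_shuffle=True,   # shuffle indices inside each subset
    inter_subset_shuffle=True,   # shuffle the order of the subsets
)
loader = DataLoader(dataset, batch_sampler=batch_sampler)
for (batch,) in loader:
    print(batch)  # each batch contains indices from one subset only
```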
scgpt.loss
- scgpt.loss.criterion_neg_log_bernoulli(input: Tensor, target: Tensor, mask: Tensor) → Tensor [source]
Compute the negative log-likelihood of the Bernoulli distribution.
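As a reference for the formula, a masked Bernoulli negative log-likelihood can be sketched as below. This is an illustration of the standard computation, not necessarily the library's exact code; it assumes input holds predicted probabilities of a value being non-zero and mask marks the positions that count toward the loss:

```python
import torch

def masked_bernoulli_nll(probs: torch.Tensor,
                         target: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
    # Log-likelihood of observing a non-zero target under Bernoulli(probs).
    bernoulli = torch.distributions.Bernoulli(probs=probs)
    log_prob = bernoulli.log_prob((target > 0).float())
    mask = mask.float()
    # Average the negative log-likelihood over the masked positions only.
    return -(log_prob * mask).sum() / mask.sum()
```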
scgpt.preprocess
- class scgpt.preprocess.Preprocessor(use_key: str | None = None, filter_gene_by_counts: int | bool = False, filter_cell_by_counts: int | bool = False, normalize_total: float | bool = 10000.0, result_normed_key: str | None = 'X_normed', log1p: bool = False, result_log1p_key: str = 'X_log1p', subset_hvg: int | bool = False, hvg_use_key: str | None = None, hvg_flavor: str = 'seurat_v3', binning: int | None = None, result_binned_key: str = 'X_binned')[source]
Bases: object
Prepare data into training, validation, and test splits. Normalize raw expression values and bin or otherwise transform them into the preset model input format.
Set up the preprocessor; use the args to configure the workflow steps.
Args:
- use_key (str, optional): The key of AnnData to use for preprocessing.
- filter_gene_by_counts (int or bool, default: False): Whether to filter genes by counts; if int, filter out genes with counts below the given value.
- filter_cell_by_counts (int or bool, default: False): Whether to filter cells by counts; if int, filter out cells with counts below the given value.
- normalize_total (float or bool, default: 1e4): Whether to normalize the total counts of each cell to a specific value.
- result_normed_key (str, default: "X_normed"): The key of AnnData to store the normalized data. If None, will use the normalized data to replace the use_key.
- log1p (bool, default: False): Whether to apply log1p transform to the normalized data.
- result_log1p_key (str, default: "X_log1p"): The key of AnnData to store the log1p transformed data.
- subset_hvg (int or bool, default: False): Whether to subset highly variable genes.
- hvg_use_key (str, optional): The key of AnnData to use for calculating highly variable genes. If None, will use adata.X.
- hvg_flavor (str, default: "seurat_v3"): The flavor of highly variable gene selection. See scanpy.pp.highly_variable_genes() for more details.
- binning (int, optional): Whether to bin the data into the given number of discrete values.
- result_binned_key (str, default: "X_binned"): The key of AnnData to store the binned data.
- check_logged(adata: AnnData, obs_key: str | None = None) → bool [source]
Check if the data is already log1p transformed.
Args:
- adata (AnnData): The AnnData object to preprocess.
- obs_key (str, optional): The key of AnnData.obs to use for batch information. This arg is used in the highly variable gene selection step.
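A typical configuration, applied by calling the preprocessor on an AnnData object (the dataset and step values below are illustrative, loosely following the scGPT tutorials):

```python
import scanpy as sc
from scgpt.preprocess import Preprocessor

adata = sc.datasets.pbmc3k()  # any AnnData works; pbmc3k is a demo set

preprocessor = Preprocessor(
    use_key="X",                    # read raw counts from adata.X
    filter_gene_by_counts=3,        # drop genes with low counts
    normalize_total=1e4,            # normalize each cell to 10k counts
    result_normed_key="X_normed",
    log1p=True,
    result_log1p_key="X_log1p",
    subset_hvg=1200,                # keep 1200 highly variable genes
    hvg_flavor="seurat_v3",
    binning=51,                     # discretize into 51 expression bins
    result_binned_key="X_binned",
)
preprocessor(adata)  # runs the configured steps; results go into layers
```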
scgpt.trainer
- scgpt.trainer.eval_testdata(model: Module, adata_t: AnnData, gene_ids, vocab, config, logger, include_types: List[str] = ['cls']) → Dict | None [source]
Evaluate the model on the test dataset of adata_t.
- scgpt.trainer.evaluate(model: Module, loader: DataLoader, vocab, criterion_gep_gepc, criterion_dab, criterion_cls, device, config, epoch) → float [source]
Evaluate the model on the evaluation data.
- scgpt.trainer.predict(model: Module, loader: DataLoader, vocab, config, device) → float [source]
Evaluate the model on the evaluation data.
- scgpt.trainer.prepare_data(tokenized_train, tokenized_valid, train_batch_labels, valid_batch_labels, config, epoch, train_celltype_labels=None, valid_celltype_labels=None, sort_seq_batch=False) → Tuple[Dict[str, Tensor]] [source]
- scgpt.trainer.prepare_dataloader(data_pt: Dict[str, Tensor], batch_size: int, shuffle: bool = False, intra_domain_shuffle: bool = False, drop_last: bool = False, num_workers: int = 0, per_seq_batch_sample: bool = False) → DataLoader [source]
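As a sketch, prepare_dataloader wraps a dictionary of equal-length tensors into a DataLoader. The keys below are hypothetical; in practice data_pt comes from prepare_data() and its keys depend on the training config:

```python
import torch
from scgpt.trainer import prepare_dataloader

# Hypothetical tokenized data: 128 cells, 64 gene tokens each.
data_pt = {
    "gene_ids": torch.randint(0, 1000, (128, 64)),
    "values": torch.rand(128, 64),
    "target_values": torch.rand(128, 64),
    "batch_labels": torch.zeros(128, dtype=torch.long),
}
loader = prepare_dataloader(
    data_pt,
    batch_size=16,
    shuffle=True,
    drop_last=False,
    num_workers=0,
)
```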