scgpt.tokenizer package

Submodules

scgpt.tokenizer.gene_tokenizer

class scgpt.tokenizer.gene_tokenizer.GeneTokenizer(**kwargs)[source]

Bases: PreTrainedTokenizer

class scgpt.tokenizer.gene_tokenizer.GeneVocab(gene_list_or_vocab: List[str] | Vocab, specials: List[str] | None = None, special_first: bool = True, default_token: str | None = '<pad>')[source]

Bases: Vocab

Vocabulary for genes.

Initialize the vocabulary. Note: special tokens are added only when initializing from a gene list, not from an existing Vocab object.

Parameters:
  • gene_list_or_vocab (List[str] or Vocab) – List of gene names or a Vocab object.

  • specials (List[str]) – List of special tokens.

  • special_first (bool) – Whether to add special tokens to the beginning of the vocabulary.

  • default_token (str) – Default token. Set to "<pad>" by default, provided "<pad>" is in the vocabulary.
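As a rough illustration of how specials and special_first could shape the index layout, the following standalone sketch builds a plain token-to-index dictionary. It does not call scGPT: build_token2idx is a hypothetical helper, and keeping the genes in input order is an assumption that the library may not share.

```python
from typing import Dict, List, Optional


def build_token2idx(
    genes: List[str],
    specials: Optional[List[str]] = None,
    special_first: bool = True,
) -> Dict[str, int]:
    """Hypothetical helper: assign consecutive indices to genes and special tokens."""
    specials = specials or []
    # Remove duplicates (keeping first occurrence) and genes that collide
    # with a special token, so every token receives exactly one index.
    genes = [g for g in dict.fromkeys(genes) if g not in specials]
    ordered = specials + genes if special_first else genes + specials
    return {token: idx for idx, token in enumerate(ordered)}


mapping = build_token2idx(["TP53", "BRCA1"], specials=["<pad>", "<cls>"])
print(mapping)
```

With special_first=True the special tokens occupy the lowest indices. With special_first=False they are appended after the genes.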

classmethod from_dict(token2idx: Dict[str, int], default_token: str | None = '<pad>') → Self[source]

Load the vocabulary from a dictionary.

Parameters:

token2idx (Dict[str, int]) – Dictionary mapping tokens to indices.

classmethod from_file(file_path: Path | str) → Self[source]

Load the vocabulary from a file. The file should be either a pickle or a JSON file containing a token-to-index mapping.

property pad_token: str | None

Get the pad token.

save_json(file_path: Path | str) → None[source]

Save the vocabulary to a JSON file.
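The token-to-index mapping that from_file expects can be pictured with the standard library alone. This is a minimal sketch, assuming the JSON file is a flat object of the form {"token": index}. The file name vocab.json and the example tokens are made up for illustration, and the exact layout that save_json writes is not confirmed here.

```python
import json
import tempfile
from pathlib import Path

# Example mapping; the tokens and indices are illustrative only.
token2idx = {"<pad>": 0, "TP53": 1, "BRCA1": 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = Path(tmp_dir) / "vocab.json"
    # Write the mapping as a flat JSON object: {"token": index, ...}.
    vocab_path.write_text(json.dumps(token2idx, indent=2))
    # Read it back; JSON preserves the string keys and integer values.
    loaded = json.loads(vocab_path.read_text())

print(loaded == token2idx)
```

A dictionary loaded this way has the Dict[str, int] shape that from_dict takes as token2idx.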

set_default_token(default_token: str) → None[source]

Set the default token.

Parameters:

default_token (str) – Default token.

scgpt.tokenizer.gene_tokenizer.get_default_gene_vocab() → GeneVocab[source]

Get the default gene vocabulary, consisting of gene symbols and ids.

scgpt.tokenizer.gene_tokenizer.pad_batch(batch: List[Tuple], max_len: int, vocab: Vocab, pad_token: str = '<pad>', pad_value: int = 0, cls_appended: bool = True) → Dict[str, Tensor][source]

Pad a batch of data. Returns a dictionary of padded gene ids and counts.

Parameters:
  • batch (list) – A list of tuple (gene_id, count).

  • max_len (int) – The maximum length of the batch.

  • vocab (Vocab) – The vocabulary containing the pad token.

  • pad_token (str) – The token to pad with.

Returns:

A dictionary of padded gene ids and counts.

Return type:

Dict[str, torch.Tensor]
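The padding step can be sketched without torch by working on plain lists. This is a simplified illustration, not the library's implementation: pad_rows is a hypothetical helper, the output keys "genes" and "values" are assumed names, and rows longer than max_len are simply truncated, which may differ from how pad_batch handles them.

```python
from typing import Dict, List, Tuple


def pad_rows(
    batch: List[Tuple[List[int], List[float]]],
    max_len: int,
    pad_id: int,
    pad_value: float = 0,
) -> Dict[str, List[list]]:
    """Hypothetical helper: pad or truncate each (gene_ids, counts) pair to max_len."""
    genes, values = [], []
    for gene_ids, counts in batch:
        # Simplifying assumption: keep the first max_len entries of long rows.
        gene_ids, counts = gene_ids[:max_len], counts[:max_len]
        n_pad = max_len - len(gene_ids)
        genes.append(gene_ids + [pad_id] * n_pad)      # pad ids with the pad token id
        values.append(counts + [pad_value] * n_pad)    # pad counts with pad_value
    return {"genes": genes, "values": values}


padded = pad_rows([([5, 7], [1.0, 3.0]), ([2], [4.0])], max_len=3, pad_id=0)
print(padded)
```

Here pad_id stands in for the index that the vocabulary assigns to pad_token, while pad_value fills the count positions.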

scgpt.tokenizer.gene_tokenizer.random_mask_value(values: Tensor | ndarray, mask_ratio: float = 0.15, mask_value: int = -1, pad_value: int = 0) → Tensor[source]

Randomly mask a batch of data.

Parameters:
  • values (array-like) – A batch of tokenized data, with shape (batch_size, n_features).

  • mask_ratio (float) – The ratio of genes to mask. Defaults to 0.15.

  • mask_value (int) – The value to mask with. Defaults to -1.

  • pad_value (int) – The value used for padding in values. Padded entries are kept unchanged.

Returns:

A tensor of masked data.

Return type:

torch.Tensor
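The masking behavior described above can be sketched on plain lists. This is an illustration under stated assumptions rather than the library's code: mask_rows is a hypothetical helper, the seed parameter is added only to make the example reproducible, and rounding the number of masked entries down per row is an assumption.

```python
import random
from typing import List


def mask_rows(
    values: List[List[float]],
    mask_ratio: float = 0.15,
    mask_value: float = -1,
    pad_value: float = 0,
    seed: int = 0,
) -> List[List[float]]:
    """Hypothetical helper: mask a fraction of the non-padding entries in each row."""
    rng = random.Random(seed)
    masked = [list(row) for row in values]  # copy so the input stays unchanged
    for row in masked:
        # Only positions that are not padding are eligible for masking.
        candidates = [i for i, v in enumerate(row) if v != pad_value]
        n_mask = int(len(candidates) * mask_ratio)  # assumption: round down
        for i in rng.sample(candidates, n_mask):
            row[i] = mask_value
    return masked


batch = [[3.0, 1.0, 2.0, 5.0, 0, 0], [4.0, 0, 0, 0, 0, 0]]
masked = mask_rows(batch, mask_ratio=0.5)
print(masked)
```

In this example the first row has four non-padding entries, so two of them are masked. The second row has one, so rounding down leaves it untouched.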

scgpt.tokenizer.gene_tokenizer.tokenize_and_pad_batch(data: ndarray, gene_ids: ndarray, max_len: int, vocab: Vocab, pad_token: str, pad_value: int, append_cls: bool = True, include_zero_gene: bool = False, cls_token: str = '<cls>', return_pt: bool = True) → Dict[str, Tensor][source]

Tokenize and pad a batch of data. Returns a dictionary of padded gene ids and counts.

scgpt.tokenizer.gene_tokenizer.tokenize_batch(data: ndarray, gene_ids: ndarray, return_pt: bool = True, append_cls: bool = True, include_zero_gene: bool = False, cls_id: int = '<cls>') → List[Tuple[Tensor | ndarray]][source]

Tokenize a batch of data. Returns a list of tuples (gene_id, count).

Parameters:
  • data (array-like) – A batch of data, with shape (batch_size, n_features). n_features equals the number of all genes.

  • gene_ids (array-like) – A batch of gene ids, with shape (n_features,).

  • return_pt (bool) – Whether to return torch tensors of gene_ids and counts. Defaults to True.

Returns:

A list of tuples (gene_id, count) containing the nonzero gene expressions.

Return type:

list
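The core of the tokenization step, keeping only genes with nonzero expression in each row, can be sketched on plain lists. This is a simplified illustration, not the library's implementation: nonzero_tokens is a hypothetical helper, and it leaves out the append_cls and include_zero_gene options as well as the conversion to tensors.

```python
from typing import List, Tuple


def nonzero_tokens(
    data: List[List[float]], gene_ids: List[int]
) -> List[Tuple[List[int], List[float]]]:
    """Hypothetical helper: keep only the genes with nonzero expression in each row."""
    tokenized = []
    for row in data:
        # Pair every gene id with its count and drop the zero counts.
        kept = [(g, v) for g, v in zip(gene_ids, row) if v != 0]
        tokenized.append(([g for g, _ in kept], [v for _, v in kept]))
    return tokenized


data = [[0.0, 2.0, 1.0], [3.0, 0.0, 0.0]]  # shape (batch_size=2, n_features=3)
gene_ids = [10, 11, 12]                     # shape (n_features,)
print(nonzero_tokens(data, gene_ids))
```

Because each row keeps a different number of genes, the resulting tuples have varying lengths, which is why a padding step is needed before the rows can be stacked into a batch.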

Module contents