Split

_split module

This file defines the base class for splitters.

class scalr.data.split._split.SplitterBase[source]

Bases: object

Base class for splitters that create train, validation, and test splits.

check_splits(datapath: str, data_splits: dict, target: str)[source]

This function performs certain checks regarding splits and logs the distribution of target classes in each split.

Parameters:
  • datapath (str) – Path to full data.

  • data_splits (dict) – Split of ‘train’, ‘val’ and ‘test’ indices.

  • target (str) – Classification target column name in obs.
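
The kind of check described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `summarize_splits` and the in-memory `labels` list (standing in for the target column in obs) are assumptions made for the example.

```python
from collections import Counter


def summarize_splits(labels, data_splits):
    """Check that splits do not overlap and count target classes per split.

    `labels` is a hypothetical list of target values, one per sample,
    standing in for the target column in obs.
    """
    seen = set()
    distribution = {}
    for name, indices in data_splits.items():
        overlap = seen.intersection(indices)
        if overlap:
            raise ValueError(
                f"Indices {sorted(overlap)} appear in more than one split")
        seen.update(indices)
        distribution[name] = Counter(labels[i] for i in indices)
    return distribution


labels = ["B", "T", "B", "T", "B", "T"]
data_splits = {"train": [0, 1, 2, 3], "val": [4], "test": [5]}
distribution = summarize_splits(labels, data_splits)
for name, counts in distribution.items():
    print(name, dict(counts))
```

A real check would read the labels from the file at datapath and send the counts to a logger instead of printing them.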

generate_train_val_test_split_indices(target: str, **kwargs) dict[source]

Generate a list of indices for train/val/test split of whole dataset.

Parameters:
  • datapath (str) – Path to full data.

  • target (str) – Target for classification present in obs.

  • **kwargs – Any other params needed for splitting.

Returns:

‘train’, ‘val’ and ‘test’ indices list.

Return type:

dict

classmethod get_default_params() dict[source]

Class method to get default params for splitter_config.

write_splits(full_data: AnnData | AnnCollection, data_split_indices: dict, sample_chunksize: int, dirpath: int)[source]

This function writes the train, validation, and test splits to disk.

Parameters:
  • full_data (Union[AnnData, AnnCollection]) – Full data to be split.

  • data_split_indices (dict) – Indices of each split.

  • sample_chunksize (int) – Number of samples to be written in one file.

  • dirpath (int) – Path to write data into.

Returns:

Path of each split.

Return type:

dict
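
To illustrate the role of sample_chunksize, the sketch below shows only how the indices of each split could be cut into chunks of at most sample_chunksize samples, one chunk per file. The helper `chunk_indices` is hypothetical, and the on-disk format and file naming used by write_splits are not covered here.

```python
def chunk_indices(indices, sample_chunksize):
    """Yield consecutive slices of `indices` with at most `sample_chunksize` items."""
    for start in range(0, len(indices), sample_chunksize):
        yield indices[start:start + sample_chunksize]


data_split_indices = {"train": list(range(7)), "val": [7, 8], "test": [9]}
sample_chunksize = 3

# Map each split to the chunks that would each go into one file.
chunks_per_split = {
    split: list(chunk_indices(indices, sample_chunksize))
    for split, indices in data_split_indices.items()
}
print(chunks_per_split)
```

Writing in chunks keeps each file small enough to load on its own, which matters when the full data does not fit in memory.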

scalr.data.split._split.build_splitter(splitter_config: dict) tuple[SplitterBase, dict][source]

Builder function that returns the splitter and the updated splitter_config.

group_splitter

This file is an implementation of group splitter.

class scalr.data.split.group_splitter.GroupSplitter(split_ratio: list[float], stratify: str)[source]

Bases: StratifiedSplitter

Class for splitting data based on the provided group.

Generate a split of data into train, validation, and test sets. Grouping ensures that samples with the same value in the stratify column cannot belong to different sets.

_split_data_with_stratification(metadata: DataFrame, target: str, test_ratio: float) tuple[list[int], list[int]][source]

A function to split given metadata into a training and testing set.

Parameters:
  • metadata (DataFrame) – Dataframe containing all samples to be split.

  • target (str) – Target for classification present in obs.

  • test_ratio (float) – Ratio of samples belonging to the test split.

Returns:

Two lists consisting of train and test indices.

Return type:

(list(int), list(int))
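
A group-wise train/test split can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the method used by GroupSplitter: the function name, the `seed` argument, and the `donors` list (standing in for the stratify column) are all hypothetical.

```python
import random


def group_train_test_split(groups, test_ratio, seed=0):
    """Split sample indices so that each group lands entirely in one set.

    `groups` holds one group label per sample. Groups are shuffled, then
    assigned to the test set until it holds roughly `test_ratio` of the
    samples; the remaining groups go to the training set.
    """
    indices_by_group = {}
    for index, group in enumerate(groups):
        indices_by_group.setdefault(group, []).append(index)

    group_names = sorted(indices_by_group)
    random.Random(seed).shuffle(group_names)

    target_test_size = test_ratio * len(groups)
    train_indices, test_indices = [], []
    for name in group_names:
        if len(test_indices) < target_test_size:
            test_indices.extend(indices_by_group[name])
        else:
            train_indices.extend(indices_by_group[name])
    return sorted(train_indices), sorted(test_indices)


donors = ["d1", "d1", "d2", "d2", "d3", "d3", "d4", "d4", "d5", "d5"]
train_indices, test_indices = group_train_test_split(donors, test_ratio=0.2)
print(train_indices, test_indices)
```

Because whole groups are assigned at once, the realized test fraction only approximates test_ratio when group sizes are uneven.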

classmethod get_default_params() dict[source]

Class method to get default params for splitter_config.

stratified_group_splitter

This file is an implementation of stratified group splitter.

class scalr.data.split.stratified_group_splitter.StratifiedGroupSplitter(split_ratio: list[float], stratify: str)[source]

Bases: SplitterBase

Class for stratified group splitter.

Generates a split of data into train, validation, and test sets. Grouping ensures that samples with the same value in the stratify column cannot belong to different sets. Stratification additionally ensures that every split contains samples from each class available in the data.
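
The two guarantees can be verified on any proposed split with a short check. The sketch below is illustrative only: the function name and the `groups` and `labels` lists (standing in for the stratify and target columns) are assumptions, not part of the library.

```python
def check_group_and_class_coverage(groups, labels, data_splits):
    """Return (leaked_groups, missing_classes) for a proposed split.

    leaked_groups: groups whose samples appear in more than one split.
    missing_classes: for each split, the classes with no sample in it.
    """
    splits_by_group = {}
    for split, indices in data_splits.items():
        for i in indices:
            splits_by_group.setdefault(groups[i], set()).add(split)
    leaked_groups = sorted(
        group for group, splits in splits_by_group.items() if len(splits) > 1)

    all_classes = set(labels)
    missing_classes = {
        split: sorted(all_classes - {labels[i] for i in indices})
        for split, indices in data_splits.items()
    }
    return leaked_groups, missing_classes


groups = ["d1", "d1", "d2", "d2", "d3", "d3"]
labels = ["B", "T", "B", "T", "B", "T"]
good = {"train": [0, 1], "val": [2, 3], "test": [4, 5]}
bad = {"train": [0, 2], "val": [1, 3], "test": [4, 5]}

print(check_group_and_class_coverage(groups, labels, good))
print(check_group_and_class_coverage(groups, labels, bad))
```

In the `bad` split, donors d1 and d2 are spread over train and val, and each of those two splits is missing one class.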

_split_data_with_stratification(metadata: DataFrame, target: str, test_ratio: float) tuple[list[int], list[int]][source]

A function to split given metadata into a training and testing set.

Parameters:
  • metadata (DataFrame) – Dataframe containing all samples to be split.

  • target (str) – Target for classification present in obs.

  • test_ratio (float) – Ratio of samples belonging to the test split.

Returns:

Two lists consisting of train and test indices.

Return type:

(list(int), list(int))

generate_train_val_test_split_indices(datapath: str, target: str) dict[source]

A function to generate a list of indices for train/val/test split of the whole dataset.

Parameters:
  • datapath (str) – Path to full data.

  • target (str) – Target for classification present in obs.

Returns:

‘train’, ‘val’ and ‘test’ indices list.

Return type:

dict
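
One way to obtain three sets from a helper that only produces a train/test split is to apply it twice: first separate the test set from everything else, then separate validation from training. The sketch below shows the ratio arithmetic for that approach. It assumes split_ratio is ordered as [train, val, test], which this page does not state, and the helper name `two_stage_ratios` is hypothetical.

```python
def two_stage_ratios(split_ratio):
    """Convert a [train, val, test] ratio into two test_ratio values.

    Assumes the order train, val, test. The first value is used to split
    off the test set from the full data; the second is used to split the
    validation set off from the remaining (train + val) samples.
    """
    train, val, test = split_ratio
    total = train + val + test
    test_ratio = test / total        # first pass: (train + val) vs. test
    val_ratio = val / (train + val)  # second pass: train vs. val
    return test_ratio, val_ratio


test_ratio, val_ratio = two_stage_ratios([7, 1, 2])
print(test_ratio, val_ratio)
```

The second ratio is taken relative to the samples left after the first pass, so a 7:1:2 split needs 1/8 rather than 1/10 for validation.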

classmethod get_default_params() dict[source]

Class method to get default params for splitter_config.

stratified_splitter

This file is an implementation of the stratified splitter.

class scalr.data.split.stratified_splitter.StratifiedSplitter(split_ratio: list[float])[source]

Bases: SplitterBase

Generate Stratified split of data into train, validation, and test sets.

Stratification preserves the percentage of samples of each class in every split. It also ensures that every split contains samples from each class available in the data.

_split_data_with_stratification(metadata: DataFrame, target: str, test_ratio: float) tuple[list[int], list[int]][source]

A function to split the given metadata into a training and testing set.

Parameters:
  • metadata (DataFrame) – Dataframe containing all samples to be split.

  • target (str) – Target for classification present in obs.

  • test_ratio (float) – Ratio of samples belonging to the test split.

Returns:

Two lists consisting of train and test indices.

Return type:

(list(int), list(int))
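
A stratified train/test split can be sketched as a per-class shuffle followed by a per-class cut. This is a minimal standard-library illustration, not the library's implementation: the function name, the `seed` argument, and the rule for very small classes are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_train_test_split(labels, test_ratio, seed=0):
    """Split indices so each class sends about `test_ratio` of its samples to test."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    rng = random.Random(seed)
    train_indices, test_indices = [], []
    for label in sorted(indices_by_class):
        indices = indices_by_class[label]
        if len(indices) < 2:
            # A single sample cannot be shared; keep it for training.
            train_indices.extend(indices)
            continue
        rng.shuffle(indices)
        # At least one sample of the class on each side of the split.
        n_test = min(max(1, round(test_ratio * len(indices))), len(indices) - 1)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)


labels = ["B"] * 8 + ["T"] * 2
train_indices, test_indices = stratified_train_test_split(labels, test_ratio=0.25)
print(train_indices, test_indices)
```

Cutting each class separately is what keeps rare classes, such as "T" here, represented on both sides of the split.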

generate_train_val_test_split_indices(datapath: str, target: str) dict[source]

A function to generate a list of indices for train/val/test split of the whole dataset.

Parameters:
  • datapath (str) – Path to full data.

  • target (str) – Target for classification present in obs.

Returns:

‘train’, ‘val’ and ‘test’ indices list.

Return type:

dict

classmethod get_default_params() dict[source]

Class method to get default params for splitter_config.