Data ingestion

This file defines the class for data ingestion into the pipeline.

class scalr.data_ingestion_pipeline.DataIngestionPipeline(data_config: dict, dirpath: str = '.')[source]

Bases: object

Class for Data Ingestion into the pipeline

generate_mappings()[source]

A function to generate integer mappings to and from the values of the target columns.
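The kind of mapping described here can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper name `build_label_mappings`, the `id2label` key, the zero-based ids and the example column `cell_type` are all assumptions; only the `mappings[column_name][label2id]` layout comes from the parameter descriptions in this reference.

```python
def build_label_mappings(labels):
    """Hypothetical helper: build label->id and id->label dicts for one column."""
    # Sort the distinct labels so the assigned ids are deterministic.
    unique_labels = sorted(set(labels))
    label2id = {label: idx for idx, label in enumerate(unique_labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return {"label2id": label2id, "id2label": id2label}


# Illustrative target column values (not real data).
cell_types = ["B", "T", "B", "NK", "T"]

# One entry per target column, keyed by the column name.
mappings = {"cell_type": build_label_mappings(cell_types)}
```

With this layout, `mappings["cell_type"]["label2id"]["T"]` gives the integer id for a label, and the `id2label` dict converts predictions back to labels.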

generate_train_val_test_split()[source]

A function to split data into train, validation and test sets.
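As a rough illustration of a three-way split, the sketch below shuffles sample indices and partitions them. It is not the splitter used by the pipeline; the function name `split_indices`, the fraction parameters and the seed are assumptions made for the example.

```python
import random


def split_indices(n_samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Hypothetical helper: shuffle indices and partition into train/val/test."""
    indices = list(range(n_samples))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)

    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
```

The three index lists are disjoint and together cover every sample, so each sample lands in exactly one split.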

get_updated_config()[source]

This function returns updated configs.

preprocess_data()[source]

A function to apply preprocessing on data splits.

Eval and analysis

This file contains the implementation of model evaluation and downstream analysis tasks.

class scalr.eval_and_analysis_pipeline.EvalAndAnalysisPipeline(analysis_config, dirpath, device)[source]

Bases: object

Class for evaluation and analysis of the trained model.

_perform_downstream_analysis(samples: str)[source]

A function to perform all downstream analysis tasks on model and data.

Parameters:

samples (['test' | 'full']) – Indicates the samples on which to perform downstream analysis.

build_dataloaders()[source]

A function to build dataloaders for train, validation, and test data.

evaluation_and_classification_report()[source]

A function to evaluate the trained model and generate a classification report on test data.

full_samples_downstream_anlaysis()[source]

A function to perform downstream analysis tasks on the data of all samples.

Note: The model and DataLoader are not passed, because the model is assumed to have been trained on the train data, so model-based analysis should not be run on the full sample data.

gene_analysis()[source]

A function to perform analysis on trained model to get top genes and biomarkers.

get_updated_config() dict[source]

A function to return updated configs.

load_data_and_targets_from_config(data_config: dict)[source]

A function to load data and targets from data config.

Parameters:

data_config – Data config.

set_data_and_targets(train_data: AnnData | AnnCollection, val_data: AnnData | AnnCollection, test_data: AnnData | AnnCollection, target: str | list[str], mappings: dict)[source]

A function to set data when it does not come directly from the config but from another source, such as feature subsetting.

Parameters:
  • train_data (Union[AnnData, AnnCollection]) – Training data.

  • val_data (Union[AnnData, AnnCollection]) – Validation data.

  • test_data (Union[AnnData, AnnCollection]) – Test data.

  • target (Union[str, list[str]]) – Target column name(s).

  • mappings (dict) – Mapping of column values to ids, e.g. mappings[column_name][label2id] = {A: 1, B: 2, …}.

test_samples_downstream_anlaysis()[source]

A function to perform downstream analysis tasks on model and test samples data.

Feature extraction

This file contains the implementation of feature subsetting and model training, followed by top feature extraction.

class scalr.feature_extraction_pipeline.FeatureExtractionPipeline(feature_selection_config, dirpath, device)[source]

Bases: object

feature_scoring() DataFrame[source]

A function to generate scores of each feature for each class using a scorer and chunked models.

feature_subsetted_model_training() list[Module][source]

This function trains models on subsetted data containing feature_subsetsize genes.

get_updated_config() dict[source]

This function returns updated configs.

load_data_and_targets_from_config(data_config: dict)[source]

A function to load data and targets from data config.

Parameters:

data_config – Data config.

set_data_and_targets(train_data: AnnData | AnnCollection, val_data: AnnData | AnnCollection, target: str | list[str], mappings: dict, sample_chunksize: int = None)[source]

A function to set data when it does not come directly from the config but from another source, such as feature subsetting.

Parameters:
  • train_data (Union[AnnData, AnnCollection]) – Training data.

  • val_data (Union[AnnData, AnnCollection]) – Validation data.

  • target (Union[str, list[str]]) – Target column name(s).

  • mappings (dict) – Mapping of column values to ids, e.g. mappings[column_name][label2id] = {A: 1, B: 2, …}.

  • sample_chunksize (int) – Chunks of samples to be loaded in memory at once.
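To illustrate what loading samples in chunks can look like, here is a minimal sketch that yields index ranges of at most `sample_chunksize` samples. The generator name `iter_sample_chunks` is hypothetical, and treating `None` as "one chunk covering everything" is an assumption, not documented behavior.

```python
def iter_sample_chunks(n_samples, sample_chunksize=None):
    """Hypothetical helper: yield (start, end) index ranges, one per chunk."""
    # Assumption: no chunk size means all samples are handled in one chunk.
    if sample_chunksize is None:
        sample_chunksize = max(n_samples, 1)
    for start in range(0, n_samples, sample_chunksize):
        # The last chunk may be shorter than sample_chunksize.
        yield start, min(start + sample_chunksize, n_samples)


chunks = list(iter_sample_chunks(10, sample_chunksize=4))
single_chunk = list(iter_sample_chunks(10))
```

Each range can then be used to slice the data, so that only one chunk of samples is held in memory at a time.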

set_model(models: list[Module])[source]

A function to set the trained models for downstream feature tasks.

set_score_matrix(score_matrix: DataFrame)[source]

A function to set score_matrix for feature extraction.

top_feature_extraction() list[str] | dict[source]

A function to get top features using Selector.
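One simple selection strategy is to take the k highest-scoring features for each class. The sketch below shows that idea only; it is not the Selector used by the library. A nested dict stands in for the score matrix DataFrame, and the function name, the gene names and the scores are made up for illustration.

```python
def top_k_features_per_class(score_matrix, k):
    """Hypothetical helper: return the k highest-scoring features for each class."""
    top_features = {}
    for class_name, scores in score_matrix.items():
        # Rank features by score, highest first, and keep the first k names.
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        top_features[class_name] = [feature for feature, _ in ranked[:k]]
    return top_features


# Illustrative scores: class -> {feature: score} (not real results).
score_matrix = {
    "B": {"CD19": 0.9, "CD3E": 0.1, "MS4A1": 0.8},
    "T": {"CD19": 0.05, "CD3E": 0.95, "MS4A1": 0.1},
}

top_features = top_k_features_per_class(score_matrix, k=2)
```

Returning a dict keyed by class is one of the two return shapes listed in the signature; a selector that pools scores across classes would return a flat list of feature names instead.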

write_top_features_subset_data(data_config: dict) dict[source]

A function to write top features subset data onto disk and return updated data_config.

Parameters:

data_config – Data config.

Model training

This file contains an implementation for the model training pipeline.

class scalr.model_training_pipeline.ModelTrainingPipeline(model_config: dict, train_config: dict, dirpath: str = None, device: str = 'cpu')[source]

Bases: object

Class for Model training pipeline.

build_model_training_artifacts()[source]

This function configures the model, optimizer, and loss function required for model training.

build_optimizer(opt_config: dict = None)[source]

A function to build optimizer.

Parameters:

opt_config (dict) – Optimizer config.

get_updated_config()[source]

This function returns updated configs.

load_data_and_targets_from_config(data_config: dict)[source]

A function to load data and targets from data config.

Parameters:

data_config – Data config.

set_data_and_targets(train_data: AnnData | AnnCollection, val_data: AnnData | AnnCollection, target: str | list[str], mappings: dict)[source]

A function to set data when it does not come directly from the config but from another source, such as feature subsetting.

Parameters:
  • train_data (Union[AnnData, AnnCollection]) – Training data.

  • val_data (Union[AnnData, AnnCollection]) – Validation data.

  • target (Union[str, list[str]]) – Target column name(s).

  • mappings (dict) – Mapping of column values to ids, e.g. mappings[column_name][label2id] = {A: 1, B: 2, …}.

train()[source]

This function trains the model.