Analysis

_analyser module

This file defines the base class for the analysis module.

class scalr.analysis._analyser.AnalysisBase[source]

Bases: object

A base class for downstream analysis tasks.

This class provides common attributes and methods for all the analysis tasks. It is intended to be subclassed to create task-specific analysis.

generate_analysis(model: Module, test_data: AnnData | AnnCollection, test_dl: DataLoader, dirpath: str, **kwargs)[source]

A function to generate analysis; it should be overridden by all subclasses.

Parameters:
  • model (nn.Module) – final trained model.

  • test_data (Union[AnnData, AnnCollection]) – test data to run analysis on.

  • test_dl (DataLoader) – DataLoader object to prepare inputs for the model.

  • dirpath (str) – dirpath to store analysis.

  • **kwargs – results of all previously run analyses, available for later use.

classmethod get_default_params() → dict[source]

Class method to get default params for analysis_config.
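The subclassing pattern described above can be sketched roughly as follows. This sketch does not import scalr: `AnalysisBaseSketch` is a stand-in for the documented base class, and `LabelCountAnalysis`, its `top_n` parameter, and the output file name are hypothetical. Only the method names `generate_analysis` and `get_default_params` come from the documented interface.

```python
import json
import os
import tempfile
from collections import Counter


class AnalysisBaseSketch:
    """Stand-in for the documented base class (hypothetical, not scalr code)."""

    def generate_analysis(self, model, test_data, test_dl, dirpath, **kwargs):
        # Subclasses are expected to override this method.
        raise NotImplementedError

    @classmethod
    def get_default_params(cls) -> dict:
        return {}


class LabelCountAnalysis(AnalysisBaseSketch):
    """Hypothetical task: count the most common labels and store them."""

    def __init__(self, top_n: int = 2):
        self.top_n = top_n

    def generate_analysis(self, model, test_data, test_dl, dirpath, **kwargs):
        # Here test_data is a plain list of labels, not an AnnData object.
        counts = Counter(test_data).most_common(self.top_n)
        out_path = os.path.join(dirpath, "label_counts.json")
        with open(out_path, "w") as fh:
            json.dump(dict(counts), fh)
        return out_path

    @classmethod
    def get_default_params(cls) -> dict:
        return {"top_n": 2}


with tempfile.TemporaryDirectory() as tmp:
    analyser = LabelCountAnalysis(**LabelCountAnalysis.get_default_params())
    path = analyser.generate_analysis(None, ["B", "T", "B", "NK", "B", "T"], None, tmp)
    with open(path) as fh:
        label_counts = json.load(fh)
```

The task-specific work lives entirely in the overridden `generate_analysis`, while `get_default_params` supplies the values used when a config omits them.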

scalr.analysis._analyser.build_analyser(analysis_config: dict) → tuple[AnalysisBase, dict][source]

Builder function that returns the analyser together with the updated analyser config.

dge_lmem

This file generates differential gene expression analysis using a linear mixed-effects model approach and stores the results.

class scalr.analysis.dge_lmem.DgeLMEM(fixed_effect_column: str, fixed_effect_factors: list[str], group: str, celltype_column: str = None, cell_subsets: list[str] = None, min_cell_threshold: int = 10, n_cpu: int = 6, gene_batch_size: int = 1000, coef_threshold: float | int = 0, p_val: float | int = 0.05, y_lim_tuple: Tuple[float, ...] | None = None, save_plot: bool = True, stdout: bool = False)[source]

Bases: AnalysisBase

Class to perform differential gene expression analysis using Linear mixed effects model.

generate_analysis(test_data: AnnData | AnnCollection, dirpath: str, **kwargs)[source]

This method calls methods to run multiple linear mixed effects models and generate DGE results and plots.

Parameters:
  • test_data – Anndata.

  • dirpath – Path to save results.

classmethod get_default_params() → dict[source]

Class method to get default params for DgeLMEM_config.

get_genes_n_fixed_val_subset_df(batch_adata: AnnData, cell_type: str = None)[source]

This method converts Anndata into a pandas DataFrame with gene expression data, ‘fixed_effect_column’, and ‘group’ params.

Parameters:
  • batch_adata – Anndata.

  • cell_type – Cell type in the ‘celltype_column’ used to subset the anndata; the whole anndata is processed if ‘cell_type’ is None.

Returns:

A list of gene names in the anndata, and a pandas dataframe with count matrix.

get_multiproc_mxeffect_model_batch_res(gene_names: list[str], ad_subset_to_df: DataFrame)[source]

This method parallelizes the Linear mixed effects models for a list of genes.

Parameters:
  • gene_names – List of gene names.

  • ad_subset_to_df – A pandas dataframe with gene expression, ‘fixed_effect_column’, and ‘group’ params.

Returns:

A list of dictionaries with model stats for ‘gene_names’.

get_result_mxmodel_per_gene(gene: str, ad_subset_to_df: DataFrame)[source]

This method produces the Linear mixed-effects model statistics for a single gene.

Parameters:
  • gene – Gene name.

  • ad_subset_to_df – A pandas dataframe with gene expression, ‘fixed_effect_column’, and ‘group’ params.

Returns:

A dictionary with model statistics.

plot_lmem_dge_result(lmem_res_df: DataFrame, dirpath: str, cell_type: str = None)[source]

This method produces a volcano plot for the model results for a data subset with a cell type, or for the whole dataset.

Parameters:
  • lmem_res_df – A pandas DataFrame with model results (p-values, coefficients, standard errors, etc.).

  • dirpath – Path to save the plot.

  • cell_type – Cell type used to subset input anndata.

replace_spec_char_get_dict(var_names: Index)[source]

This method replaces any special character in gene names.

Parameters:
  • var_names – var_names in the Anndata.

Returns:

var_names with special characters replaced with ‘_’, and a dictionary mapping of old and new names.
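A minimal sketch of this kind of renaming is shown below. It is not the scalr implementation: the function name `replace_special_chars`, the rule that anything outside letters, digits, and underscores counts as a special character, and the direction of the mapping (new name to old name) are all assumptions made for illustration. One plausible motivation, also an assumption, is that formula-style model interfaces tend to read characters such as `-` as operators.

```python
import re


def replace_special_chars(var_names):
    """Replace every character outside [A-Za-z0-9_] with '_'.

    Returns the cleaned names and a dict mapping each cleaned name
    back to its original name (assumed direction).
    """
    cleaned = [re.sub(r"[^A-Za-z0-9_]", "_", name) for name in var_names]
    mapping = dict(zip(cleaned, var_names))
    return cleaned, mapping


cleaned_names, name_map = replace_special_chars(["HLA-DRA", "MT.CO1", "CD3E"])
```

Note that two different original names can collapse to the same cleaned name (for example `A-B` and `A.B`), in which case a mapping like this one keeps only the last original.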

dge_pseudobulk

This file generates differential gene expression analysis using a pseudobulk approach and stores the results.

class scalr.analysis.dge_pseudobulk.DgePseudoBulk(celltype_column: str, design_factor: str, factor_categories: list[str], sum_column: str, cell_subsets: list[str] = None, min_cell_threshold: int = 1, fold_change: float | int = 1.5, p_val: float | int = 0.05, y_lim_tuple: Tuple[float, ...] | None = None, save_plot: bool = True, stdout: bool = False)[source]

Bases: AnalysisBase

Class to perform differential gene expression analysis using Pseudobulk approach.

_make_design_matrix(adata: AnnData, cell_type: str)[source]

Method to subset an AnnData object by cell type and build the design matrix from the factor levels in design_factor.

Parameters:
  • adata – AnnData.

  • cell_type – Cell type to subset data on, belonging to celltype_column.

Returns:

AnnData object of the design matrix.
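The general idea of pseudobulk aggregation can be illustrated with plain Python dictionaries. This is a sketch under stated assumptions, not the scalr code: it assumes that `sum_column` identifies the unit (for example a donor) whose cells are summed together, and it represents each cell as a dict instead of using AnnData. The function name `pseudobulk` and the `"counts"` key are hypothetical.

```python
from collections import defaultdict


def pseudobulk(cells, cell_type, celltype_column, design_factor, sum_column):
    """Sum per-gene counts over cells of one cell type.

    Cells are grouped by (sum_column value, design_factor value).
    """
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        if cell[celltype_column] != cell_type:
            continue  # keep only the requested cell type
        key = (cell[sum_column], cell[design_factor])
        for gene, count in cell["counts"].items():
            totals[key][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}


cells = [
    {"cell_type": "B", "donor": "d1", "condition": "ctrl", "counts": {"CD19": 2, "MS4A1": 1}},
    {"cell_type": "B", "donor": "d1", "condition": "ctrl", "counts": {"CD19": 3, "MS4A1": 0}},
    {"cell_type": "B", "donor": "d2", "condition": "treated", "counts": {"CD19": 1, "MS4A1": 4}},
    {"cell_type": "T", "donor": "d1", "condition": "ctrl", "counts": {"CD19": 0, "MS4A1": 0}},
]
bulk = pseudobulk(cells, "B", "cell_type", "condition", "donor")
```

Each key of `bulk` is one pseudobulk sample, and its design-factor level is carried along so that the samples can later be contrasted.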

generate_analysis(test_data: AnnData | AnnCollection, dirpath: str, **kwargs)[source]

This method calls methods to perform differential gene expression analysis on data.

Parameters:
  • test_data – AnnData.

  • dirpath – Path to save the result.

Returns:

Pandas DataFrame object containing differential gene expression stats.

classmethod get_default_params() → dict[source]

Class method to get default params for DgePseudoBulk_config.

get_differential_expression_results(design_matrix: AnnData, cell_type: str, dirpath: str)[source]

Method to get differential gene expression analysis results.

Parameters:
  • design_matrix – AnnData generated using ‘_make_design_matrix’.

  • cell_type – Cell type used to subset the input anndata.

  • dirpath – Path to save the result.

Returns:

A pandas DataFrame object containing differential gene expression results.

plot_volcano(dge_results_df: DataFrame, cell_type: str, dirpath: str)[source]

Method to generate volcano plot of differential gene expression results and store it on disk.

Parameters:
  • dge_results_df – Differential gene expression results in dataframe.

  • cell_type – Cell type used to subset the input anndata.

  • dirpath – Path to save the result.
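To make the role of the `fold_change` and `p_val` thresholds in a volcano plot concrete, the sketch below labels a gene from its log2 fold change and p-value. It is an illustration only: this page does not state whether scalr compares against `fold_change` directly or against its log2, so the log2 cutoff used here is an assumption, and `classify_gene` and `volcano_point` are hypothetical helper names.

```python
import math


def classify_gene(log2_fold_change, p_value, fold_change=1.5, p_val=0.05):
    """Label a gene as 'up', 'down', or 'not significant'."""
    # Assumed rule: compare |log2FC| against log2 of the fold-change threshold.
    lfc_cutoff = math.log2(fold_change)
    if p_value >= p_val or abs(log2_fold_change) < lfc_cutoff:
        return "not significant"
    return "up" if log2_fold_change > 0 else "down"


def volcano_point(log2_fold_change, p_value):
    """Return the (x, y) position of a gene on a volcano plot."""
    return log2_fold_change, -math.log10(p_value)


labels = [
    classify_gene(1.2, 0.001),
    classify_gene(-0.9, 0.01),
    classify_gene(0.3, 0.001),
    classify_gene(2.0, 0.2),
]
x, y = volcano_point(1.2, 0.001)
```

With these conventions, genes that pass both thresholds end up in the upper left and upper right regions of the plot.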

evaluation

This file generates the accuracy and classification report and stores them.

scalr.analysis.evaluation.generate_and_save_classification_report(test_labels: list[int], pred_labels: list[int], dirpath: str, mapping: dict = None) → DataFrame[source]

A function to generate a classification report from the actual and predicted labels and store it at dirpath.

Parameters:
  • test_labels – True labels from the test set.

  • pred_labels – Predicted labels from the trained model.

  • dirpath – Path to store classification_report.

  • mapping[optional] – Mapping of label_id to true label_names (id2label).

Returns:

A Pandas DataFrame with the classification report.
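The quantities that a classification report typically contains can be sketched in plain Python as follows. This is not the scalr implementation, which returns a pandas DataFrame and may rely on a library routine; the function name `classification_rows` and the returned dict layout are hypothetical. The `mapping` argument plays the documented id2label role.

```python
def classification_rows(test_labels, pred_labels, mapping=None):
    """Compute per-class precision, recall, F1 score, and support."""
    rows = {}
    pairs = list(zip(test_labels, pred_labels))
    for label in sorted(set(test_labels) | set(pred_labels)):
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in pred_labels)
        support = sum(t == label for t in test_labels)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        name = mapping[label] if mapping else label  # id2label lookup
        rows[name] = {
            "precision": precision,
            "recall": recall,
            "f1-score": f1,
            "support": support,
        }
    return rows


report = classification_rows(
    [0, 0, 1, 1], [0, 1, 1, 1], mapping={0: "B cell", 1: "T cell"}
)
```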

scalr.analysis.evaluation.get_accuracy(test_labels: list[int], pred_labels: list[int]) → float[source]

A function to get accuracy for the predicted labels.

Parameters:
  • test_labels (list[int]) – True labels from the test set.

  • pred_labels (list[int]) – Predicted labels from the trained model.

Returns:

accuracy score

Return type:

float
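As a small worked example of what an accuracy score measures, the sketch below computes the fraction of predictions that match the true labels. It is a plain-Python illustration with the hypothetical name `accuracy`, not the scalr function itself, whose internals are not shown on this page.

```python
def accuracy(test_labels, pred_labels):
    """Return the fraction of predicted labels equal to the true labels."""
    if len(test_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    if not test_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(test_labels, pred_labels))
    return correct / len(test_labels)


# Four of the five predictions match the true labels.
score = accuracy([0, 1, 2, 2, 1], [0, 1, 1, 2, 1])
```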

gene_recall_curve

This file generates gene recall curves for reference genes in the provided models' ranked genes.

class scalr.analysis.gene_recall_curve.GeneRecallCurve(reference_genes_path: str, ranked_genes_path_dict: dict = None, top_K: int = 150, plots_per_row: int = 3, save_plots: bool = True, features_selector: dict = None)[source]

Bases: AnalysisBase

Class to generate gene recall curve.

generate_analysis(score_matrix: DataFrame = None, dirpath: str = '.', **kwargs)[source]

This function sets a few parameters and then calls the function that generates the gene recall curve.

Parameters:
  • score_matrix – Matrix that contains a score of each gene for each category.

  • dirpath – Path to store gene recall curve if applicable.

classmethod get_default_params() → dict[source]

Class method to get default params for the gene recall curve analysis config.

plot_gene_recall(ranked_genes_df_dict: dict, reference_genes_df: DataFrame, dirpath: str = '.', title: str = '')[source]

This function plots & stores the gene recall curve for reference genes in provided ranked genes.

It also stores the reference genes along with their ranks for each model in a JSON file for further analysis by the user.

Parameters:
  • ranked_genes_df_dict – Pipeline generated ranked genes dataframe.

  • reference_genes_df – Reference genes dataframe.

  • top_K – The top K-ranked genes in which reference genes are to be looked for.

  • dirpath – Path to store gene recall plot and json.

  • plot_type – Type of gene recall - per category or aggregated across all categories.
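One common way to define a gene recall curve is sketched below: at each rank k up to `top_k`, recall is the share of reference genes found among the top k ranked genes. Whether scalr uses exactly this definition is an assumption, and the function name `gene_recall_curve` and the gene lists are made up for illustration.

```python
def gene_recall_curve(ranked_genes, reference_genes, top_k):
    """Return recall of reference genes at every rank from 1 to top_k."""
    reference = set(reference_genes)
    if not reference:
        raise ValueError("reference_genes must not be empty")
    found = 0
    curve = []
    for gene in ranked_genes[:top_k]:
        if gene in reference:
            found += 1
        # Recall so far: reference genes recovered / all reference genes.
        curve.append(found / len(reference))
    return curve


curve = gene_recall_curve(
    ["CD3E", "ACTB", "CD19", "GAPDH", "MS4A1"],
    ["CD19", "MS4A1", "CD79A"],
    top_k=4,
)
```

A curve that rises quickly means the model ranks the reference genes near the top of its list.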

heatmap

This file generates heatmaps for the top genes of a particular class with respect to the same genes in the other classes.

class scalr.analysis.heatmap.Heatmap(top_n_genes: int = 100, save_plot: bool = True, score_matrix_path: str = None, top_features_path: str = None, *args, **kwargs)[source]

Bases: AnalysisBase

Class to generate a heatmap of top genes classwise.

generate_analysis(dirpath: str, score_matrix: DataFrame = None, top_features: dict | list = None, **kwargs) → None[source]

A function to generate heatmap for top features.

Parameters:
  • score_matrix – Matrix(class * genes) that contains a score of each gene per class.

  • top_features – Class-wise top genes or list of top features.

  • dirpath – Path to store the heatmap image.

plot_heatmap(class_genes_weights: DataFrame, dirpath: str, filename: str) → None[source]

A function to plot a heatmap for top n genes across all classes.

Parameters:
  • class_genes_weights – Matrix(genes * classes) which contains shap_value/weights of each gene to class.

  • dirpath – Path to store the heatmap image.

  • filename – Heatmap image name.
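The data preparation behind such a heatmap can be sketched as follows: pick the top n genes of one class from a class-by-gene score matrix, then collect the scores of those same genes in every class. This is an assumption-laden illustration using nested dicts instead of a pandas DataFrame; `top_gene_weights` is a hypothetical name, and ranking by raw score (rather than, say, absolute value) is also an assumption.

```python
def top_gene_weights(score_matrix, target_class, top_n):
    """Return {gene: {class: score}} for the top_n genes of target_class."""
    class_scores = score_matrix[target_class]
    # Rank genes of the target class by descending score.
    ranked = sorted(class_scores, key=class_scores.get, reverse=True)
    top_genes = ranked[:top_n]
    return {
        gene: {cls: scores[gene] for cls, scores in score_matrix.items()}
        for gene in top_genes
    }


score_matrix = {
    "B cell": {"CD19": 0.9, "CD3E": -0.2, "MS4A1": 0.7},
    "T cell": {"CD19": -0.1, "CD3E": 0.8, "MS4A1": -0.3},
}
weights = top_gene_weights(score_matrix, "B cell", top_n=2)
```

The result has genes as rows and classes as columns, which matches the genes * classes orientation that `plot_heatmap` documents for its input.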

roc_auc

This file generates the ROC-AUC plot and stores it.

class scalr.analysis.roc_auc.RocAucCurve[source]

Bases: AnalysisBase

Class to generate ROC-AUC curve.

generate_analysis(test_labels: list[int], pred_probabilities: list[list[float]], dirpath: str, mapping: list, **kwargs) → None[source]

A function to calculate ROC-AUC and save the plot.

Parameters:
  • test_labels – True labels from the test dataset.

  • pred_probabilities – Predicted probabilities of each sample for all the classes.

  • dirpath – Path to store the ROC-AUC plot.

  • mapping – List of class names.
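For intuition about the number behind each ROC curve, the sketch below computes a one-vs-rest AUC per class from the documented inputs. It is not the scalr implementation: the one-vs-rest scheme is an assumption, the helper names `binary_auc` and `one_vs_rest_auc` are hypothetical, and the AUC is obtained from the pairwise ranking definition (the probability that a positive sample scores higher than a negative one, counting ties as one half) rather than by tracing the curve.

```python
def binary_auc(is_positive, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(test_labels, pred_probabilities, mapping):
    """Return {class name: AUC of that class against all other classes}."""
    result = {}
    for class_id, class_name in enumerate(mapping):
        flags = [label == class_id for label in test_labels]
        scores = [row[class_id] for row in pred_probabilities]
        result[class_name] = binary_auc(flags, scores)
    return result


aucs = one_vs_rest_auc(
    [0, 0, 1, 1],
    [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5], [0.2, 0.8]],
    mapping=["B cell", "T cell"],
)
```

In this toy input, one sample of each class is ranked below a sample of the other class, so three of the four positive-negative pairs are ordered correctly for both classes.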