Analysis
_analyser module
This file defines the base class for the analysis module.
- class scalr.analysis._analyser.AnalysisBase[source]
Bases:
object
A base class for downstream analysis tasks.
This class provides common attributes and methods for all the analysis tasks. It is intended to be subclassed to create task-specific analysis.
- generate_analysis(model: Module, test_data: AnnData | AnnCollection, test_dl: DataLoader, dirpath: str, **kwargs)[source]
A function to generate the analysis; it should be overridden by all subclasses.
- Parameters:
model (nn.Module) – final trained model.
test_data (Union[AnnData, AnnCollection]) – test data to run analysis on.
test_dl (DataLoader) – DataLoader object to prepare inputs for the model.
dirpath (str) – dirpath to store analysis.
**kwargs – contains all previous analysis done to be used later.
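The subclassing pattern described above can be sketched as follows. This is a minimal illustration, not scalr code: `AnalysisBase` here is a local stand-in so the example runs without scalr installed, and `LabelCountAnalysis`, its `pred_labels` argument, and the `label_counts.json` file name are hypothetical. As with the subclasses documented on this page, the override only names the arguments it needs and accepts the rest through `**kwargs`.

```python
import json
import os
from typing import Any


class AnalysisBase:
    """Local stand-in for scalr.analysis._analyser.AnalysisBase (illustrative only)."""

    def generate_analysis(self, model, test_data, test_dl, dirpath, **kwargs):
        # Assumption: subclasses are expected to replace this method.
        raise NotImplementedError


class LabelCountAnalysis(AnalysisBase):
    """Hypothetical task-specific analysis: counts how often each label was predicted."""

    def generate_analysis(self, dirpath: str, pred_labels: list[int],
                          **kwargs: Any) -> dict[int, int]:
        counts: dict[int, int] = {}
        for label in pred_labels:
            counts[label] = counts.get(label, 0) + 1

        # Store the result under dirpath, as the base-class docstring describes.
        os.makedirs(dirpath, exist_ok=True)
        with open(os.path.join(dirpath, "label_counts.json"), "w") as fh:
            json.dump(counts, fh)
        return counts
```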
- scalr.analysis._analyser.build_analyser(analysis_config: dict) → tuple[AnalysisBase, dict] [source]
Builder function that returns the analyser and the updated analyser_config.
dge_lmem
This file generates differential gene expression analysis using a linear mixed-effects model approach and stores the results.
- class scalr.analysis.dge_lmem.DgeLMEM(fixed_effect_column: str, fixed_effect_factors: list[str], group: str, celltype_column: str = None, cell_subsets: list[str] = None, min_cell_threshold: int = 10, n_cpu: int = 6, gene_batch_size: int = 1000, coef_threshold: float | int = 0, p_val: float | int = 0.05, y_lim_tuple: Tuple[float, ...] | None = None, save_plot: bool = True, stdout: bool = False)[source]
Bases:
AnalysisBase
Class to perform differential gene expression analysis using Linear mixed effects model.
- generate_analysis(test_data: AnnData | AnnCollection, dirpath: str, **kwargs)[source]
This method calls methods to run multiple linear mixed effects models and generate DGE results and plots.
- Parameters:
test_data – Anndata.
dirpath – Path to save results.
- classmethod get_default_params() → dict [source]
Class method to get default params for DgeLMEM_config.
- get_genes_n_fixed_val_subset_df(batch_adata: AnnData, cell_type: str = None)[source]
This method converts Anndata into a pandas DataFrame with gene expression data, ‘fixed_effect_column’, and ‘group’ params.
- Parameters:
batch_adata – Anndata.
cell_type – Cell type in the ‘celltype_column’ to subset the anndata, the whole anndata will be processed if ‘cell_type’ is None.
- Returns:
A list of gene names in the anndata, and a pandas dataframe with count matrix.
- get_multiproc_mxeffect_model_batch_res(gene_names: list[str], ad_subset_to_df: DataFrame)[source]
This method parallelizes the Linear mixed effects models for a list of genes.
- Parameters:
gene_names – List of gene names.
ad_subset_to_df – A pandas dataframe with gene expression, ‘fixed_effect_column’, and ‘group’ params.
- Returns:
A list of dictionaries with model stats for ‘gene_names’.
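The batching and parallel mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the DgeLMEM implementation: `fit_one_gene` is a hypothetical placeholder for the per-gene model fit, `batched` and `run_in_batches` are hypothetical helpers, and a thread pool is used in place of a process pool to keep the example self-contained. The defaults mirror `gene_batch_size=1000` and `n_cpu=6` from the class signature.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_one_gene(gene: str) -> dict:
    """Hypothetical per-gene worker; a real one would fit a mixed-effects model."""
    return {"gene": gene, "name_length": len(gene)}


def batched(items: list[str], batch_size: int):
    """Yield consecutive slices holding at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(gene_names: list[str], batch_size: int = 1000,
                   n_workers: int = 6) -> list[dict]:
    """Process genes batch by batch, fitting the genes of each batch concurrently."""
    results: list[dict] = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for batch in batched(gene_names, batch_size):
            # pool.map returns results in the same order as the input batch.
            results.extend(pool.map(fit_one_gene, batch))
    return results
```

Working in batches bounds how much expression data has to be held as a DataFrame at once, which is presumably the motivation for `gene_batch_size`.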
- get_result_mxmodel_per_gene(gene: str, ad_subset_to_df: DataFrame)[source]
This method produces the Linear mixed-effects model statistics for a single gene.
- Parameters:
gene – Gene name.
ad_subset_to_df – A pandas dataframe with gene expression, ‘fixed_effect_column’, and ‘group’ params.
- Returns:
A dictionary with model statistics.
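For orientation, a linear mixed-effects model for a single gene is commonly written in the random-intercept form below. This is the textbook formulation, shown on the assumption that ‘fixed_effect_column’ enters as the fixed effect and ‘group’ as a random intercept; the exact formula used by DgeLMEM is not stated here.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{ij}$ is the expression of the gene in cell $i$ of group $j$, $x_{ij}$ encodes the fixed-effect factor, $\beta_1$ is the coefficient reported per gene, and $u_j$ is the group-level random intercept.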
- plot_lmem_dge_result(lmem_res_df: DataFrame, dirpath: str, cell_type: str = None)[source]
This method produces a volcano plot for the model results for a data subset with a cell type, or for the whole dataset.
- Parameters:
lmem_res_df – A pandas DataFrame with model results (p-values, coefficients, standard errors, etc.).
dirpath – Path to save the plot.
cell_type – Cell type used to subset input anndata.
dge_pseudobulk
This file generates differential gene expression analysis using a pseudobulk approach and stores the results.
- class scalr.analysis.dge_pseudobulk.DgePseudoBulk(celltype_column: str, design_factor: str, factor_categories: list[str], sum_column: str, cell_subsets: list[str] = None, min_cell_threshold: int = 1, fold_change: float | int = 1.5, p_val: float | int = 0.05, y_lim_tuple: Tuple[float, ...] | None = None, save_plot: bool = True, stdout: bool = False)[source]
Bases:
AnalysisBase
Class to perform differential gene expression analysis using Pseudobulk approach.
- _make_design_matrix(adata: AnnData, cell_type: str)[source]
Method to subset an AnnData by cell type and build a design matrix based on the factor levels in design_factor.
- Parameters:
adata – AnnData.
cell_type – Cell type to subset data on, belonging to celltype_column.
- Returns:
AnnData object of the design matrix.
- generate_analysis(test_data: AnnData | AnnCollection, dirpath: str, **kwargs)[source]
This method calls methods to perform differential gene expression analysis on data.
- Parameters:
test_data – AnnData.
dirpath – Path to save the result.
- Returns:
Pandas DataFrame object containing differential gene expression stats.
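The core idea of a pseudobulk approach, summing per-cell counts into one profile per sample before testing, can be sketched as below. This is a generic illustration and not the DgePseudoBulk code: `pseudobulk_sum` is a hypothetical helper, and plain lists stand in for the AnnData count matrix and for the column named by `sum_column`.

```python
from collections import defaultdict


def pseudobulk_sum(counts: list[list[int]], sample_ids: list[str],
                   genes: list[str]) -> dict[str, dict[str, int]]:
    """Sum per-cell count rows into one gene profile per sample.

    counts[i] is the count row of cell i and sample_ids[i] is the sample
    that cell belongs to.
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0] * len(genes))
    for row, sample in zip(counts, sample_ids):
        profile = totals[sample]
        for j, value in enumerate(row):
            profile[j] += value
    return {sample: dict(zip(genes, profile)) for sample, profile in totals.items()}
```

For example, three cells from two samples collapse into two profiles:

```python
profiles = pseudobulk_sum(
    counts=[[1, 0], [2, 3], [4, 1]],
    sample_ids=["s1", "s1", "s2"],
    genes=["G1", "G2"],
)
# profiles["s1"] is {"G1": 3, "G2": 3}; profiles["s2"] is {"G1": 4, "G2": 1}
```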
- classmethod get_default_params() → dict [source]
Class method to get default params for DgePseudoBulk_config.
- get_differential_expression_results(design_matrix: AnnData, cell_type: str, dirpath: str)[source]
Method to get differential gene expression analysis results.
- Parameters:
design_matrix – AnnData generated using ‘_make_design_matrix’.
cell_type – Cell type used to subset the input anndata.
dirpath – Path to save the result.
- Returns:
A pandas DataFrame object containing differential gene expression results.
- plot_volcano(dge_results_df: DataFrame, cell_type: str, dirpath: str)[source]
Method to generate volcano plot of differential gene expression results and store it on disk.
- Parameters:
dge_results_df – Differential gene expression results in dataframe.
cell_type – Cell type used to subset the input anndata.
dirpath – Path to save the result.
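A volcano plot places each gene at (log2 fold change, -log10 p-value) and highlights genes that pass both thresholds. The sketch below shows one common way to apply such thresholds. It assumes that `fold_change` is given on the linear scale and compared against the absolute log2 fold change, which may differ from what `plot_volcano` does; `volcano_point` and `classify_gene` are hypothetical helpers.

```python
import math


def volcano_point(log2_fc: float, p_value: float) -> tuple[float, float]:
    """Return the (x, y) coordinates of a gene on a volcano plot."""
    return log2_fc, -math.log10(p_value)


def classify_gene(log2_fc: float, p_value: float,
                  fold_change: float = 1.5, p_val: float = 0.05) -> str:
    """Label a gene as up, down, or not significant."""
    # Assumption: the linear fold-change cutoff is converted to the log2 scale.
    threshold = math.log2(fold_change)
    if p_value < p_val and log2_fc >= threshold:
        return "up"
    if p_value < p_val and log2_fc <= -threshold:
        return "down"
    return "not significant"
```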
evaluation
This file generates the accuracy and the classification report and stores them.
- scalr.analysis.evaluation.generate_and_save_classification_report(test_labels: list[int], pred_labels: list[int], dirpath: str, mapping: dict = None) → DataFrame [source]
A function to generate a classification report from the actual and predicted labels and store it at dirpath.
- Parameters:
test_labels – True labels from the test set.
pred_labels – Predicted labels from the trained model.
dirpath – Path to store classification_report.
mapping (optional) – Mapping of label_id to true label_names (id2label).
- Returns:
A Pandas DataFrame with the classification report.
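To make the contents of such a report concrete, the sketch below computes per-class precision, recall, F1 and support from the two label lists and applies an id2label mapping to the row names. It is a plain-Python illustration; `classification_report_rows` is a hypothetical helper, and the documented function may well delegate to a library instead.

```python
def classification_report_rows(test_labels: list[int], pred_labels: list[int],
                               mapping: dict = None) -> dict[str, dict[str, float]]:
    """Compute per-class precision, recall, F1 and support."""
    pairs = list(zip(test_labels, pred_labels))
    rows: dict[str, dict[str, float]] = {}
    for label in sorted(set(test_labels) | set(pred_labels)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Use the readable class name when an id2label mapping is supplied.
        name = mapping[label] if mapping else str(label)
        rows[name] = {"precision": precision, "recall": recall,
                      "f1-score": f1, "support": tp + fn}
    return rows
```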
- scalr.analysis.evaluation.get_accuracy(test_labels: list[int], pred_labels: list[int]) → float [source]
A function to get accuracy for the predicted labels.
- Parameters:
test_labels (list[int]) – True labels from the test set.
pred_labels (list[int]) – Predicted labels from the trained model.
- Returns:
accuracy score
- Return type:
float
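Accuracy is the fraction of samples whose predicted label equals the true label. A minimal sketch of that definition follows; `accuracy` is a hypothetical helper written for illustration, not the body of `get_accuracy`.

```python
def accuracy(test_labels: list[int], pred_labels: list[int]) -> float:
    """Return the fraction of predictions that match the true labels."""
    if len(test_labels) != len(pred_labels):
        raise ValueError("test_labels and pred_labels must have the same length")
    if not test_labels:
        raise ValueError("at least one label is required")
    correct = sum(1 for t, p in zip(test_labels, pred_labels) if t == p)
    return correct / len(test_labels)
```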
gene_recall_curve
This file generates gene recall curves for reference genes in the provided models’ ranked genes.
- class scalr.analysis.gene_recall_curve.GeneRecallCurve(reference_genes_path: str, ranked_genes_path_dict: dict = None, top_K: int = 150, plots_per_row: int = 3, save_plots: bool = True, features_selector: dict = None)[source]
Bases:
AnalysisBase
Class to generate gene recall curve.
- generate_analysis(score_matrix: DataFrame = None, dirpath: str = '.', **kwargs)[source]
This function sets a few parameters and then calls the function that generates the gene recall curve.
- Parameters:
score_matrix – Matrix that contains a score of each gene for each category.
dirpath – Path to store gene recall curve if applicable.
- classmethod get_default_params() → dict [source]
Class method to get default params for the GeneRecallCurve config.
- plot_gene_recall(ranked_genes_df_dict: dict, reference_genes_df: DataFrame, dirpath: str = '.', title: str = '')[source]
This function plots & stores the gene recall curve for reference genes in provided ranked genes.
It also stores the reference genes along with their ranks for each model in a JSON file for further analysis by the user.
- Parameters:
ranked_genes_df_dict – Pipeline generated ranked genes dataframe.
reference_genes_df – Reference genes dataframe.
top_K – The top K-ranked genes in which reference genes are to be looked for.
dirpath – Path to store gene recall plot and json.
plot_type – Type of gene recall - per category or aggregated across all categories.
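A gene recall curve tracks, for each rank k up to top_K, the fraction of reference genes found among the k highest-ranked genes. The sketch below computes that curve for one ranked list. It is an illustration of the general idea; `gene_recall_curve` is a hypothetical helper that takes plain lists rather than the DataFrames used by `plot_gene_recall`.

```python
def gene_recall_curve(ranked_genes: list[str], reference_genes: list[str],
                      top_k: int = 150) -> list[float]:
    """Return recall of the reference genes at each rank 1..top_k."""
    reference = set(reference_genes)
    if not reference:
        raise ValueError("reference_genes must not be empty")
    found = 0
    recall: list[float] = []
    for gene in ranked_genes[:top_k]:
        if gene in reference:
            found += 1
        # Recall at this rank: reference genes seen so far / all reference genes.
        recall.append(found / len(reference))
    return recall
```

A curve that rises early means the model ranks the reference genes near the top of its list.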
heatmap
This file generates heatmaps of the top genes of a particular class against the same genes in the other classes.
- class scalr.analysis.heatmap.Heatmap(top_n_genes: int = 100, save_plot: bool = True, score_matrix_path: str = None, top_features_path: str = None, *args, **kwargs)[source]
Bases:
AnalysisBase
Class to generate a heatmap of top genes classwise.
- generate_analysis(dirpath: str, score_matrix: DataFrame = None, top_features: dict | list = None, **kwargs) → None [source]
A function to generate heatmap for top features.
- Parameters:
score_matrix – Matrix(class * genes) that contains a score of each gene per class.
top_features – Class-wise top genes or list of top features.
dirpath – Path to store the heatmap image.
- plot_heatmap(class_genes_weights: DataFrame, dirpath: str, filename: str) → None [source]
A function to plot a heatmap for top n genes across all classes.
- Parameters:
class_genes_weights – Matrix(genes * classes) which contains shap_value/weights of each gene to class.
dirpath – Path to store the heatmap image.
filename – Heatmap image name.
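The data preparation behind such a heatmap can be sketched as follows: pick the top n genes of each class by score, then collect the scores of those genes across all classes. This is an illustration under stated assumptions, not the Heatmap implementation; nested dictionaries stand in for the score matrix DataFrame, and `top_genes_per_class` and `heatmap_rows` are hypothetical helpers.

```python
def top_genes_per_class(score_matrix: dict[str, dict[str, float]],
                        top_n: int = 100) -> dict[str, list[str]]:
    """For each class, return its top_n genes ordered by descending score."""
    return {
        cls: [gene for gene, _ in
              sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]]
        for cls, scores in score_matrix.items()
    }


def heatmap_rows(score_matrix: dict[str, dict[str, float]],
                 genes: list[str]) -> dict[str, list[float]]:
    """Build genes * classes rows: one score per class for each selected gene."""
    classes = list(score_matrix)
    return {gene: [score_matrix[cls][gene] for cls in classes] for gene in genes}
```

For example:

```python
scores = {
    "A": {"g1": 0.9, "g2": 0.1, "g3": 0.5},
    "B": {"g1": 0.2, "g2": 0.8, "g3": 0.4},
}
top = top_genes_per_class(scores, top_n=2)
rows_for_a = heatmap_rows(scores, top["A"])  # rows for g1 and g3, columns A and B
```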
roc_auc
This file generates ROC-AUC plot and stores it.
- class scalr.analysis.roc_auc.RocAucCurve[source]
Bases:
AnalysisBase
Class to generate ROC-AUC curve.
- generate_analysis(test_labels: list[int], pred_probabilities: list[list[float]], dirpath: str, mapping: list, **kwargs) → None [source]
A function to calculate ROC-AUC and save the plot.
- Parameters:
test_labels – True labels from the test dataset.
pred_probabilities – Predicted probabilities of each sample for all the classes.
dirpath – Path to store the ROC-AUC plot.
mapping – List of class names.
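For a multi-class model, ROC-AUC is often computed one class at a time (one-vs-rest), treating the probability of that class as the score. The sketch below does this in plain Python using the rank interpretation of AUC: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as half. It assumes that index i of `mapping` names class i; `binary_auc` and `one_vs_rest_auc` are hypothetical helpers, not part of RocAucCurve.

```python
def binary_auc(is_positive: list[bool], scores: list[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(is_positive, scores) if y]
    neg = [s for y, s in zip(is_positive, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(test_labels: list[int], pred_probabilities: list[list[float]],
                    mapping: list) -> dict:
    """Return the one-vs-rest AUC of every class, keyed by class name."""
    return {
        name: binary_auc([t == idx for t in test_labels],
                         [row[idx] for row in pred_probabilities])
        for idx, name in enumerate(mapping)  # assumption: mapping[idx] names class idx
    }
```

The pairwise comparison is quadratic in the number of samples, so it suits small illustrations rather than large test sets.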