Analyst: Jennifer Fouquier


EXPLANA Description and Analysis Summary


EXPLANA is a feature selection tool that finds numerical or categorical variables that relate to outcomes of interest. EXPLANA ranks statistically significant variables by importance and estimates their relationship to the outcome to streamline hypothesis generation.

Some sections or figures are complex and may be easier to understand with training or additional reading. For more information, see the figure legends in this report, the documentation at explana.io (to be expanded as time allows), or the preprint. Browsing the report sections will give you an idea of what a report provides. Essentially, EXPLANA documents complex analytics to improve research efficiency.

EXPLANA can be used for data that is cross-sectional (using original values only) or longitudinal (using original values and differences, or deltas, in variables over time for each subject/study identifier that has repeated measurements).

For longitudinal datasets, feature changes are calculated using different reference points to obtain delta datasets (First, Previous and Pairwise delta datasets are indicated below in different report sections). This is important because features in longitudinal studies can carry varying degrees of importance between models built using different reference points. For longitudinal studies, several feature selection analyses are performed and included in one report.
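
As a minimal sketch of the three reference points, the function below computes deltas for one subject's ordered measurements of a single feature. It assumes First means change from the first timepoint, Previous means change from the immediately preceding timepoint, and Pairwise means change between every pair of timepoints; the sign convention (later value minus reference value), the function name, and the sample values are illustrative assumptions, not taken from EXPLANA.

```python
from itertools import combinations


def delta_datasets(values):
    """Return First, Previous and Pairwise deltas for one subject's ordered values.

    Assumption: each delta is (later value - reference value).
    """
    # First: every later timepoint compared with the first timepoint.
    first = [v - values[0] for v in values[1:]]
    # Previous: every timepoint compared with the one just before it.
    previous = [b - a for a, b in zip(values, values[1:])]
    # Pairwise: every (earlier, later) pair of timepoints.
    pairwise = [values[j] - values[i]
                for i, j in combinations(range(len(values)), 2)]
    return {"first": first, "previous": previous, "pairwise": pairwise}


# Hypothetical measurements of one feature at three timepoints for one subject.
deltas = delta_datasets([10.0, 12.0, 9.0])
print(deltas)
```

The same feature yields a different set of values under each reference point, which is why models built on the different delta datasets can rank it differently.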

When using data without prior hypotheses, you are performing exploratory analysis and should make this clear when communicating results.


Outcome Variable: diagnosis

Analysis Notes: Classification of breast cancer tumors as benign or malignant in women. Data obtained from https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset


Methods

This section explains the methods, software, and statistical decisions. Feel free to use or expand the following text:

EXPLANA was used for exploratory analysis to identify important features related to the outcome variable, diagnosis. A random effect of subjectID was used to adjust for non-independence (repeated measurements) if needed. Each Random Forest model used 1000 trees, with a maximum feature fraction of 0.7 of the input features considered at each split of each decision tree in the forest. If mixed effects Random Forests were needed, 1 iteration was performed. BorutaSHAP was used to find features that repeatedly perform better than shuffled versions of all input features. Features were considered important if they performed better than 100% of the SHAP importance score of the best shuffled feature, using 100 trials (p = 0.05). Categorical variables were binary encoded, and low-occurring categorical values were not removed. These methods are from an EXPLANA feature selection report (version: 2024.08.21) created on 2025-01-03. Additional information can be found at https://github.com/JTFouquier/explana/ or at explana.io.
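
The Boruta-style decision described above can be sketched as follows. This is a simplified illustration, not the BorutaSHAP implementation: it assumes a feature scores a "hit" in a trial when its importance exceeds the threshold percentage of the best shuffled (shadow) feature's importance, and that hits are then compared against chance (0.5) with one-sided binomial tests. The function names are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def is_hit(importance, shadow_importances, threshold_percent=100):
    """True if a feature beats threshold_percent of the best shadow importance."""
    return importance > (threshold_percent / 100) * max(shadow_importances)


def binomial_prob(lo, hi, trials, p=0.5):
    """P(lo <= X <= hi) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(lo, hi + 1))


def boruta_decision(hits, trials=100, alpha=0.05):
    """Classify a feature from its hit count over all trials."""
    # Significantly more hits than chance: keep the feature.
    if binomial_prob(hits, trials, trials) < alpha:
        return "accepted"
    # Significantly fewer hits than chance: drop the feature.
    if binomial_prob(0, hits, trials) < alpha:
        return "rejected"
    # Otherwise the evidence is inconclusive.
    return "tentative"


print(boruta_decision(hits=95), boruta_decision(hits=50), boruta_decision(hits=5))
```

Under these assumptions, features with hit counts close to half the trials remain tentative, which corresponds to the Tentative Features row in the summary tables.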


Workflow Plan (Configuration File)

This is a copy of the configuration file used to perform this analysis.

analyst: Jennifer Fouquier
response_var: diagnosis
include_time: 'yes'
random_effect: subjectID
sample_id: sampleID
timepoint: timepoint
out: workflow-results/EXPLANA-breast-cancer-classification/
iterations: '1'
n_estimators: '1000'
max_features: '0.7'
borutashap_trials: '100'
borutashap_threshold: '100'
borutashap_p: '0.05'
analyze_original: 'yes'
analyze_first: 'no'
analyze_previous: 'no'
analyze_pairwise: 'no'
absolute_values: 'no'
include_reference_values: 'no'
analysis_notes: Classification of breast cancer tumors as benign or malignant in women.
  Data obtained from https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset
enc_percent_threshold: '0'
distance_matrices: list()
df_mod: ''
delta_df_mod: |
  delta_df <- delta_df
input_datasets:
  metadata:
    file_path: data/breast-cancer/breast-cancer.csv
    df_mod: ''
    dim_method: ''
    dim_param_dict:
      method: none
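
Most values in the configuration above are stored as quoted strings ('1000', '0.7', 'yes'). As a hedged illustration of how such values could be converted to native types before use, the sketch below casts a hand-copied subset of the keys. The dictionary stands in for the parsed YAML, and `cast_value` is a hypothetical helper, not part of EXPLANA.

```python
# Stand-in for the parsed configuration; only a few keys are copied here.
raw_config = {
    "iterations": "1",
    "n_estimators": "1000",
    "max_features": "0.7",
    "borutashap_trials": "100",
    "borutashap_threshold": "100",
    "borutashap_p": "0.05",
    "analyze_original": "yes",
    "analyze_first": "no",
    "random_effect": "subjectID",
}


def cast_value(value):
    """Convert 'yes'/'no' to bool and numeric strings to int or float."""
    if value in ("yes", "no"):
        return value == "yes"
    for caster in (int, float):
        try:
            return caster(value)
        except ValueError:
            continue
    # Anything else (for example a column name) stays a string.
    return value


config = {key: cast_value(value) for key, value in raw_config.items()}
print(config)
```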

Results Summary

Selected Features by Dataset

Data                    Original       First          Previous       Pairwise
% Variance Explained    86.7% (86.8%)  NA             NA             NA
N Trees                 1000           NA             NA             NA
Feature fraction/split  0.7            NA             NA             NA
Max Depth               7              NA             NA             NA
MERF Iters.             NA             NA             NA             NA
BorutaSHAP Trials       100            NA             NA             NA
BorutaSHAP Threshold    100            NA             NA             NA
P-value                 0.05           NA             NA             NA
N Study IDs             569            NA             NA             NA
N Samples               569            NA             NA             NA
Input Features          31             NA             NA             NA
Accepted Features       15             NA             NA             NA
Tentative Features      2              NA             NA             NA
Rejected Features       13             NA             NA             NA
Model Type (Pass/Fail)  RF Pass        Not performed  Not performed  Not performed

Selected feature ranks from models built using Original and, for longitudinal analyses, First, Previous and Pairwise delta datasets. Selected features are shown in black and labeled with feature rank. For true/positive instances of categorical variables (indicated with “ENC” after encoding), average impact on response/outcome is shown after the rank. For numerical features, impact is not shown because the feature relationship to response can be complex, requiring further post-hoc tests or inspection of SHAP dependence plots for additional insight. Empty grey boxes indicate features included in the model for a dataset, but not selected. Long feature names may be truncated and indicated with ellipses. A comprehensive list of input features can be found in model details below.


Original Dataset



Summary

Selected features may be ranked differently here than in the SHAP beeswarm plots because the two rankings are produced by different methods.

Data Original
Model Type (Pass/Fail) RF Pass
% Variance Explained 86.7% (86.8%)
N Trees 1000
Feature fraction/split 0.7
Max Depth 7
MERF Iters. NA
BorutaSHAP Trials 100
BorutaSHAP Threshold 100
P-value 0.05
N Study IDs 569
N Samples 569
Input Features 31
Accepted Features 15
Tentative Features 2
Rejected Features 13

SHAP summary beeswarm plots of feature influence on the machine learning prediction of outcome values. Each point represents one sample, and the horizontal position indicates impact on the outcome as indicated on the x-axis. Points to the left indicate a negative impact, and points to the right indicate a positive impact. The colors represent the selected feature values, where purple is larger and green is smaller. For binary encoded features (‘ENC’) purple is yes[1] and green is no[0]. SHAP is generally an improvement upon other importance scores because it provides information about both rank (how helpful the feature was compared to other features [y-axis]) and impact (a positive or negative impact on outcome values [x-axis]). If multiple figures are shown, scales may vary with a maximum of ten features per plot.
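
To illustrate how SHAP values carry both rank and impact, the sketch below orders features by mean absolute SHAP value, a common default ordering for beeswarm plots (whether this report's plots use exactly that ordering is an assumption). The per-sample signs give the direction of impact. The SHAP values and feature names are made-up toy numbers, not results from this analysis.

```python
from statistics import mean

# Toy SHAP values, one list entry per sample; invented for illustration only.
shap_values = {
    "feature_a": [0.30, -0.20, 0.25, -0.15],
    "feature_b": [0.05, -0.02, 0.04, -0.03],
}


def rank_by_mean_abs_shap(shap_by_feature):
    """Rank features by mean |SHAP|, largest first (the y-axis order)."""
    scores = {name: mean(abs(v) for v in values)
              for name, values in shap_by_feature.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_mean_abs_shap(shap_values)
for name, score in ranking:
    # Count samples pushed toward higher vs. lower predictions (the x-axis side).
    positive = sum(v > 0 for v in shap_values[name])
    negative = sum(v < 0 for v in shap_values[name])
    print(f"{name}: mean |SHAP| = {score:.3f}, +{positive} / -{negative} samples")
```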

Selected Features

important_features    decoded_features      feature_importance_vals
concave points_worst  concave points_worst  0.0956
perimeter_worst       perimeter_worst       0.0898
area_worst            area_worst            0.0730
radius_worst          radius_worst          0.0599
concave points_mean   concave points_mean   0.0538
concavity_worst       concavity_worst       0.0237
texture_worst         texture_worst         0.0225
area_se               area_se               0.0195
texture_mean          texture_mean          0.0166
concavity_mean        concavity_mean        0.0109
smoothness_worst      smoothness_worst      0.0079
area_mean             area_mean             0.0064
radius_se             radius_se             0.0062
compactness_worst     compactness_worst     0.0039
perimeter_se          perimeter_se          0.0036

Stats

important_features    feature_importance_vals  mean      std       min       25%       50%       75%       max
concave points_worst  0.0956                   0.115     0.066     0.000     0.065     0.100     0.161     0.291
perimeter_worst       0.0898                   107.261   33.603    50.410    84.110    97.660    125.400   251.200
area_worst            0.0730                   880.583   569.357   185.200   515.300   686.500   1084.000  4254.000
radius_worst          0.0599                   16.269    4.833     7.930     13.010    14.970    18.790    36.040
concave points_mean   0.0538                   0.049     0.039     0.000     0.020     0.034     0.074     0.201
concavity_worst       0.0237                   0.272     0.209     0.000     0.114     0.227     0.383     1.252
texture_worst         0.0225                   25.677    6.146     12.020    21.080    25.410    29.720    49.540
area_se               0.0195                   40.337    45.491    6.802     17.850    24.530    45.190    542.200
texture_mean          0.0166                   19.290    4.301     9.710     16.170    18.840    21.800    39.280
concavity_mean        0.0109                   0.089     0.080     0.000     0.030     0.062     0.131     0.427
smoothness_worst      0.0079                   0.132     0.023     0.071     0.117     0.131     0.146     0.223
area_mean             0.0064                   654.889   351.914   143.500   420.300   551.100   782.700   2501.000
radius_se             0.0062                   0.405     0.277     0.112     0.232     0.324     0.479     2.873
compactness_worst     0.0039                   0.254     0.157     0.027     0.147     0.212     0.339     1.058
perimeter_se          0.0036                   2.866     2.022     0.757     1.606     2.287     3.357     21.980

Input Features

input_features was_selected
radius_mean no
texture_mean yes
perimeter_mean no
area_mean yes
smoothness_mean no
compactness_mean no
concavity_mean yes
concave points_mean yes
symmetry_mean no
fractal_dimension_mean no
radius_se yes
texture_se no
perimeter_se yes
area_se yes
smoothness_se no
compactness_se no
concavity_se no
concave points_se no
symmetry_se no
fractal_dimension_se no
radius_worst yes
texture_worst yes
perimeter_worst yes
area_worst yes
smoothness_worst yes
compactness_worst yes
concavity_worst yes
concave points_worst yes
symmetry_worst no
fractal_dimension_worst no
timepoint_explana no

Log


Binary encoded columns created for categorical input variables:

Numeric mapping was created for the categorical response variable, diagnosis (sorted alphabetically and factorized):
B maps to 0
M maps to 1
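
The mapping logged above (labels sorted alphabetically, then assigned consecutive integer codes) can be reproduced with a short sketch. The function name and the sample label list are illustrative assumptions; the actual encoding code in EXPLANA may differ.

```python
def factorize_sorted(labels):
    """Map each distinct label to an integer code in alphabetical order."""
    categories = sorted(set(labels))
    mapping = {category: code for code, category in enumerate(categories)}
    return mapping, [mapping[label] for label in labels]


# Hypothetical diagnosis column: B = benign, M = malignant.
mapping, codes = factorize_sorted(["M", "B", "B", "M"])
print(mapping)  # {'B': 0, 'M': 1}
print(codes)    # [1, 0, 0, 1]
```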


*** Percent variation explained is not optimal for categorical response variables. 
*** Use caution with interpretation.



First Delta Dataset



Results

Summary


Data First
Model Type (Pass/Fail) Not performed
% Variance Explained NA
N Trees NA
Feature fraction/split NA
Max Depth NA
MERF Iters. NA
BorutaSHAP Trials NA
BorutaSHAP Threshold NA
P-value NA
N Study IDs NA
N Samples NA
Input Features NA
Accepted Features NA
Tentative Features NA
Rejected Features NA

Selected Features

important_features decoded_features feature_importance_vals
no_selected_features NA -100

Stats

Analysis not completed

Input Features

Analysis not completed

Interpretation/Literature Search

The following links can help with hypothesis generation. Variable names will likely need to be modified to produce useful search results.

important_features url
no_selected_features https://pubmed.ncbi.nlm.nih.gov/?term=no_selected_features%20AND%20diagnosis
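
The search links in this section follow the pattern "feature AND outcome" with spaces percent-encoded. A minimal sketch of building such a link is shown below; the function name is hypothetical, and the example feature is taken from the Original dataset results rather than from this section.

```python
from urllib.parse import quote


def pubmed_search_url(feature, outcome):
    """Build a PubMed search URL for '<feature> AND <outcome>'."""
    # quote() percent-encodes spaces as %20 and leaves underscores unchanged.
    term = quote(f"{feature} AND {outcome}")
    return f"https://pubmed.ncbi.nlm.nih.gov/?term={term}"


url = pubmed_search_url("concave points_worst", "diagnosis")
print(url)
```

Because raw column names such as "concave points_worst" rarely match the terms used in the literature, editing the feature text before searching is usually worthwhile.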

Log



Previous Delta Dataset



Summary


Data Previous
Model Type (Pass/Fail) Not performed
% Variance Explained NA
N Trees NA
Feature fraction/split NA
Max Depth NA
MERF Iters. NA
BorutaSHAP Trials NA
BorutaSHAP Threshold NA
P-value NA
N Study IDs NA
N Samples NA
Input Features NA
Accepted Features NA
Tentative Features NA
Rejected Features NA

Selected Features

important_features decoded_features feature_importance_vals
no_selected_features NA -100

Stats

Analysis not completed

Input Features

Analysis not completed

Interpretation/Literature Search

The following links can help with hypothesis generation. Variable names will likely need to be modified to produce useful search results.

important_features url
no_selected_features https://pubmed.ncbi.nlm.nih.gov/?term=no_selected_features%20AND%20diagnosis

Log



Pairwise Delta Dataset



Summary


Data Pairwise
Model Type (Pass/Fail) Not performed
% Variance Explained NA
N Trees NA
Feature fraction/split NA
Max Depth NA
MERF Iters. NA
BorutaSHAP Trials NA
BorutaSHAP Threshold NA
P-value NA
N Study IDs NA
N Samples NA
Input Features NA
Accepted Features NA
Tentative Features NA
Rejected Features NA

Selected Features

important_features decoded_features feature_importance_vals
no_selected_features NA -100

Stats

Analysis not completed

Input Features

Analysis not completed

Interpretation/Literature Search

The following links can help with hypothesis generation. Variable names will likely need to be modified to produce useful search results.

important_features url
no_selected_features https://pubmed.ncbi.nlm.nih.gov/?term=no_selected_features%20AND%20diagnosis

Log



See the GitHub repository (https://github.com/JTFouquier/explana/) for more information.