EXPLANA is a feature selection tool that finds numerical or categorical variables that relate to outcomes of interest. EXPLANA ranks statistically significant variables by importance and estimates their relationship to the outcome to streamline hypothesis generation.
Some sections or figures are complex and may be easier to understand after training or additional reading. For more information, see the figure legends in this report, the documentation at explana.io (which will be expanded as time allows), or the preprint. Browsing the report sections will give you an idea of what a report provides. Essentially, EXPLANA documents complex analytics to improve research efficiency.
EXPLANA can be used for cross-sectional data (using original values only) or longitudinal data (using original values and differences, or deltas, in variables over time for each subject/study identifier that has repeated measurements).
For longitudinal datasets, feature changes are calculated using different reference points to obtain delta datasets (First, Previous and Pairwise delta datasets are indicated below in different report sections). This is important because features in longitudinal studies can carry varying degrees of importance between models built using different reference points. For longitudinal studies, several feature selection analyses are performed and included in one report.
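The three reference points can be illustrated with a small sketch. It assumes that a First delta compares each later timepoint with a subject's first timepoint, a Previous delta compares each timepoint with the one immediately before it, and a Pairwise delta covers every earlier/later pair of timepoints. These definitions, the function name `delta_datasets`, and the toy values are assumptions for illustration, not EXPLANA's implementation.

```python
from itertools import combinations


def delta_datasets(measurements):
    """Compute First, Previous and Pairwise deltas for one subject and feature.

    `measurements` is a list of (timepoint, value) tuples.
    Returns a dict mapping reference type to a list of
    (reference_timepoint, timepoint, delta) tuples.
    """
    ordered = sorted(measurements)  # order by timepoint
    first_t, first_v = ordered[0]

    # First: every later timepoint compared with the first one
    first = [(first_t, t, v - first_v) for t, v in ordered[1:]]

    # Previous: each timepoint compared with the one just before it
    previous = [
        (t0, t1, v1 - v0)
        for (t0, v0), (t1, v1) in zip(ordered, ordered[1:])
    ]

    # Pairwise: every earlier/later pair of timepoints
    pairwise = [
        (t0, t1, v1 - v0)
        for (t0, v0), (t1, v1) in combinations(ordered, 2)
    ]
    return {"first": first, "previous": previous, "pairwise": pairwise}


# Hypothetical subject with three visits
deltas = delta_datasets([(1, 10.0), (2, 13.0), (3, 11.0)])
```

With three visits, the First and Previous datasets each contain two rows while the Pairwise dataset contains three, which is one reason the same feature can carry different importance depending on the reference point.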
When using data without prior hypotheses, you are performing exploratory analysis and should make this clear when communicating results.
This section explains the methods, software, and statistical decisions. Feel free to use or expand the following text:
EXPLANA was used for exploratory analysis to identify important features related to the outcome variable, diagnosis. A random effect of subjectID was used to adjust for non-independence (repeated measurements) if needed. Each Random Forest model used 1000 trees, with a max feature fraction of 0.7 of the input features for each split per decision tree in the forest. If mixed effects Random Forests were needed, 1 iteration was performed. BorutaSHAP was used to find features that repeatedly perform better than shuffled versions of all input features. Features were considered important if they performed better than 100% of the SHAP importance score of the best shuffled feature using 100 trials, p=0.05. Categorical variables were binary encoded and low-occurring categorical values were not removed. These methods are from an EXPLANA feature selection report (version: 2024.08.21) created on 2025-01-03. Additional information can be found at https://github.com/JTFouquier/explana/ or at explana.io.
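The role of the trial count and p-value in the methods text can be sketched as follows. This is a simplified illustration of the general Boruta idea, assuming each trial records a "hit" when a feature's importance beats the best shuffled feature, and that hit counts are compared against chance (probability 0.5) with one-sided binomial tests. The function names are hypothetical, and the BorutaSHAP package may differ in details such as multiple-testing correction, which this sketch omits.

```python
from math import comb


def binom_pmf(k, trials, prob=0.5):
    """P(X == k) for X ~ Binomial(trials, prob)."""
    return comb(trials, k) * prob**k * (1 - prob) ** (trials - k)


def boruta_decision(hits, trials=100, alpha=0.05):
    """Classify a feature from its hit count over all trials.

    accepted:  significantly more hits than expected by chance
    rejected:  significantly fewer hits than expected by chance
    tentative: neither test is significant
    """
    p_upper = sum(binom_pmf(k, trials) for k in range(hits, trials + 1))
    p_lower = sum(binom_pmf(k, trials) for k in range(0, hits + 1))
    if p_upper < alpha:
        return "accepted"
    if p_lower < alpha:
        return "rejected"
    return "tentative"


# Hypothetical hit counts out of 100 trials
decisions = {hits: boruta_decision(hits) for hits in (0, 10, 50, 90, 100)}
```

A feature that beats the best shuffled feature in about half of the trials is indistinguishable from chance and stays tentative, which is how a report can list accepted, tentative, and rejected features side by side.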
This is a copy of the script used to perform this analysis.
analyst: Jennifer Fouquier
response_var: diagnosis
include_time: 'yes'
random_effect: subjectID
sample_id: sampleID
timepoint: timepoint
out: workflow-results/EXPLANA-breast-cancer-classification/
iterations: '1'
n_estimators: '1000'
max_features: '0.7'
borutashap_trials: '100'
borutashap_threshold: '100'
borutashap_p: '0.05'
analyze_original: 'yes'
analyze_first: 'no'
analyze_previous: 'no'
analyze_pairwise: 'no'
absolute_values: 'no'
include_reference_values: 'no'
analysis_notes: Classification of breast cancer tumors as benign or malignant in women.
  Data obtained from https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset
enc_percent_threshold: '0'
distance_matrices: list()
df_mod: ''
delta_df_mod: |
  delta_df <- delta_df
input_datasets:
  metadata:
    file_path: data/breast-cancer/breast-cancer.csv
    df_mod: ''
    dim_method: ''
    dim_param_dict:
      method: none
| Data | Original | First | Previous | Pairwise |
|---|---|---|---|---|
| % Variance Explained | 86.7% (86.8%) | NA | NA | NA |
| N Trees | 1000 | NA | NA | NA |
| Feature fraction/split | 0.7 | NA | NA | NA |
| Max Depth | 7 | NA | NA | NA |
| MERF Iters. | NA | NA | NA | NA |
| BorutaSHAP Trials | 100 | NA | NA | NA |
| BorutaSHAP Threshold | 100 | NA | NA | NA |
| P-value | 0.05 | NA | NA | NA |
| N Study IDs | 569 | NA | NA | NA |
| N Samples | 569 | NA | NA | NA |
| Input Features | 31 | NA | NA | NA |
| Accepted Features | 15 | NA | NA | NA |
| Tentative Features | 2 | NA | NA | NA |
| Rejected Features | 13 | NA | NA | NA |
| Model Type (Pass/Fail) | RF Pass | Not performed | Not performed | Not performed |
Selected feature ranks from models built using Original and, for longitudinal analyses, First, Previous and Pairwise delta datasets. Selected features are shown in black and labeled with feature rank. For true/positive instances of categorical variables (indicated with “ENC” after encoding), average impact on response/outcome is shown after the rank. For numerical features, impact is not shown because the feature relationship to response can be complex, requiring further post-hoc tests or inspection of SHAP dependence plots for additional insight. Empty grey boxes indicate features included in the model for a dataset, but not selected. Long feature names may be truncated and indicated with ellipses. A comprehensive list of input features can be found in model details below.
The following links can help with hypothesis generation. Names of variables likely need modification.
Selected features may have different ranks compared to the order shown in SHAP Beeswarm plots. Features are essentially ranked using different methods.
| Data | Original |
|---|---|
| Model Type (Pass/Fail) | RF Pass |
| % Variance Explained | 86.7% (86.8%) |
| N Trees | 1000 |
| Feature fraction/split | 0.7 |
| Max Depth | 7 |
| MERF Iters. | NA |
| BorutaSHAP Trials | 100 |
| BorutaSHAP Threshold | 100 |
| P-value | 0.05 |
| N Study IDs | 569 |
| N Samples | 569 |
| Input Features | 31 |
| Accepted Features | 15 |
| Tentative Features | 2 |
| Rejected Features | 13 |
SHAP summary beeswarm plots of feature influence on the machine learning prediction of outcome values. Each point represents one sample, and the horizontal position indicates impact on the outcome as indicated on the x-axis. Points to the left indicate a negative impact, and points to the right indicate a positive impact. The colors represent the selected feature values, where purple is larger and green is smaller. For binary encoded features (‘ENC’) purple is yes[1] and green is no[0]. SHAP is generally an improvement upon other importance scores because it provides information about both rank (how helpful the feature was compared to other features [y-axis]) and impact (a positive or negative impact on outcome values [x-axis]). If multiple figures are shown, scales may vary with a maximum of ten features per plot.
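The distinction between rank and impact can be sketched with a toy set of SHAP values. The sketch assumes, as is common for SHAP summaries, that features are ordered by mean absolute SHAP value while direction is read from the sign of each value. The feature names and numbers are hypothetical and are not taken from this report.

```python
def rank_by_mean_abs_shap(shap_values):
    """Rank features by mean absolute SHAP value, largest first.

    `shap_values` maps feature name -> list of per-sample SHAP values.
    Returns a list of (feature, mean_abs_shap) tuples.
    """
    scores = {
        name: sum(abs(v) for v in values) / len(values)
        for name, values in shap_values.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


toy_shap = {
    # Large effects in both directions: high rank, mixed impact
    "feature_a": [0.25, -0.25, 0.5, -0.5],
    # Small effects, always positive: low rank, consistent impact
    "feature_b": [0.125, 0.125, 0.125, 0.125],
}
ranking = rank_by_mean_abs_shap(toy_shap)
```

Here `feature_a` ranks first even though its positive and negative contributions cancel on average, which is why rank (y-axis) and impact (x-axis) are read separately in the beeswarm plot.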
| important_features | decoded_features | feature_importance_vals |
|---|---|---|
| concave points_worst | concave points_worst | 0.0956 |
| perimeter_worst | perimeter_worst | 0.0898 |
| area_worst | area_worst | 0.0730 |
| radius_worst | radius_worst | 0.0599 |
| concave points_mean | concave points_mean | 0.0538 |
| concavity_worst | concavity_worst | 0.0237 |
| texture_worst | texture_worst | 0.0225 |
| area_se | area_se | 0.0195 |
| texture_mean | texture_mean | 0.0166 |
| concavity_mean | concavity_mean | 0.0109 |
| smoothness_worst | smoothness_worst | 0.0079 |
| area_mean | area_mean | 0.0064 |
| radius_se | radius_se | 0.0062 |
| compactness_worst | compactness_worst | 0.0039 |
| perimeter_se | perimeter_se | 0.0036 |
| important_features | feature_importance_vals | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| concave points_worst | 0.0956 | 0.115 | 0.066 | 0.000 | 0.065 | 0.100 | 0.161 | 0.291 |
| perimeter_worst | 0.0898 | 107.261 | 33.603 | 50.410 | 84.110 | 97.660 | 125.400 | 251.200 |
| area_worst | 0.0730 | 880.583 | 569.357 | 185.200 | 515.300 | 686.500 | 1084.000 | 4254.000 |
| radius_worst | 0.0599 | 16.269 | 4.833 | 7.930 | 13.010 | 14.970 | 18.790 | 36.040 |
| concave points_mean | 0.0538 | 0.049 | 0.039 | 0.000 | 0.020 | 0.034 | 0.074 | 0.201 |
| concavity_worst | 0.0237 | 0.272 | 0.209 | 0.000 | 0.114 | 0.227 | 0.383 | 1.252 |
| texture_worst | 0.0225 | 25.677 | 6.146 | 12.020 | 21.080 | 25.410 | 29.720 | 49.540 |
| area_se | 0.0195 | 40.337 | 45.491 | 6.802 | 17.850 | 24.530 | 45.190 | 542.200 |
| texture_mean | 0.0166 | 19.290 | 4.301 | 9.710 | 16.170 | 18.840 | 21.800 | 39.280 |
| concavity_mean | 0.0109 | 0.089 | 0.080 | 0.000 | 0.030 | 0.062 | 0.131 | 0.427 |
| smoothness_worst | 0.0079 | 0.132 | 0.023 | 0.071 | 0.117 | 0.131 | 0.146 | 0.223 |
| area_mean | 0.0064 | 654.889 | 351.914 | 143.500 | 420.300 | 551.100 | 782.700 | 2501.000 |
| radius_se | 0.0062 | 0.405 | 0.277 | 0.112 | 0.232 | 0.324 | 0.479 | 2.873 |
| compactness_worst | 0.0039 | 0.254 | 0.157 | 0.027 | 0.147 | 0.212 | 0.339 | 1.058 |
| perimeter_se | 0.0036 | 2.866 | 2.022 | 0.757 | 1.606 | 2.287 | 3.357 | 21.980 |
| input_features | was_selected |
|---|---|
| radius_mean | no |
| texture_mean | yes |
| perimeter_mean | no |
| area_mean | yes |
| smoothness_mean | no |
| compactness_mean | no |
| concavity_mean | yes |
| concave points_mean | yes |
| symmetry_mean | no |
| fractal_dimension_mean | no |
| radius_se | yes |
| texture_se | no |
| perimeter_se | yes |
| area_se | yes |
| smoothness_se | no |
| compactness_se | no |
| concavity_se | no |
| concave points_se | no |
| symmetry_se | no |
| fractal_dimension_se | no |
| radius_worst | yes |
| texture_worst | yes |
| perimeter_worst | yes |
| area_worst | yes |
| smoothness_worst | yes |
| compactness_worst | yes |
| concavity_worst | yes |
| concave points_worst | yes |
| symmetry_worst | no |
| fractal_dimension_worst | no |
| timepoint_explana | no |
The following links can help with hypothesis generation. Names of variables likely need modification.
Binary encoded columns created for categorical input variables:
Numeric mapping was created for the categorical response variable, diagnosis (sorted alphabetically and factorized):
B maps to 0
M maps to 1
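A mapping of this kind can be sketched in a few lines. The helper name `factorize_sorted` and the example labels are hypothetical; the sketch only assumes what is stated above, namely that categories are sorted alphabetically and then numbered from 0.

```python
def factorize_sorted(labels):
    """Map categorical labels to integer codes in alphabetical order."""
    categories = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(categories)}
    codes = [mapping[label] for label in labels]
    return mapping, codes


# Hypothetical diagnosis column
mapping, codes = factorize_sorted(["M", "B", "B", "M"])
```

Because the codes follow alphabetical order, positive SHAP values push predictions toward the category coded 1 (here, M).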
*** Percent variation explained is not optimal for categorical response variables.
*** Use caution with interpretation.
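One way to see why this caution applies is to compare variance explained with classification accuracy on a 0/1 outcome. The report does not state exactly how its percentage is computed, so this sketch assumes the usual coefficient of determination; the helper names and toy predictions are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(y_true, y_pred, cutoff=0.5):
    """Fraction of samples on the correct side of the cutoff."""
    correct = sum((p >= cutoff) == bool(t) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical 0/1 outcome with continuous predictions
y_true = [0, 0, 1, 1]
y_pred = [0.25, 0.25, 0.75, 0.75]

variance_explained = r_squared(y_true, y_pred)
fraction_correct = accuracy(y_true, y_pred)
```

In this toy case every sample falls on the correct side of the cutoff, yet the variance explained is only 0.75, so the two measures should not be read interchangeably.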
| Data | First |
|---|---|
| Model Type (Pass/Fail) | Not performed |
| % Variance Explained | NA |
| N Trees | NA |
| Feature fraction/split | NA |
| Max Depth | NA |
| MERF Iters. | NA |
| BorutaSHAP Trials | NA |
| BorutaSHAP Threshold | NA |
| P-value | NA |
| N Study IDs | NA |
| N Samples | NA |
| Input Features | NA |
| Accepted Features | NA |
| Tentative Features | NA |
| Rejected Features | NA |
| important_features | decoded_features | feature_importance_vals |
|---|---|---|
| no_selected_features | NA | -100 |
Analysis not completed
The following links can help with hypothesis generation. Names of variables likely need modification.
| important_features | url |
|---|---|
| no_selected_features | https://pubmed.ncbi.nlm.nih.gov/?term=no_selected_features%20AND%20diagnosis |
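The link pattern in the table above combines a feature name and the response variable into a PubMed search term. A minimal sketch of that pattern, using a hypothetical helper name, shows where variable names would need editing before the search is useful:

```python
from urllib.parse import quote


def pubmed_search_url(feature, response):
    """Build a PubMed search URL for '<feature> AND <response>'."""
    term = f"{feature} AND {response}"
    # quote() percent-encodes spaces as %20, matching the links in the report
    return "https://pubmed.ncbi.nlm.nih.gov/?term=" + quote(term)


placeholder_url = pubmed_search_url("no_selected_features", "diagnosis")
feature_url = pubmed_search_url("concave points_worst", "diagnosis")
```

Raw column names such as `concave points_worst` are unlikely to match published terminology, so replacing them with a plain-language phrase before searching is usually necessary.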
| Data | Previous |
|---|---|
| Model Type (Pass/Fail) | Not performed |
| % Variance Explained | NA |
| N Trees | NA |
| Feature fraction/split | NA |
| Max Depth | NA |
| MERF Iters. | NA |
| BorutaSHAP Trials | NA |
| BorutaSHAP Threshold | NA |
| P-value | NA |
| N Study IDs | NA |
| N Samples | NA |
| Input Features | NA |
| Accepted Features | NA |
| Tentative Features | NA |
| Rejected Features | NA |
| important_features | decoded_features | feature_importance_vals |
|---|---|---|
| no_selected_features | NA | -100 |
Analysis not completed
The following links can help with hypothesis generation. Names of variables likely need modification.
| important_features | url |
|---|---|
| no_selected_features | https://pubmed.ncbi.nlm.nih.gov/?term=no_selected_features%20AND%20diagnosis |
| Data | Pairwise |
|---|---|
| Model Type (Pass/Fail) | Not performed |
| % Variance Explained | NA |
| N Trees | NA |
| Feature fraction/split | NA |
| Max Depth | NA |
| MERF Iters. | NA |
| BorutaSHAP Trials | NA |
| BorutaSHAP Threshold | NA |
| P-value | NA |
| N Study IDs | NA |
| N Samples | NA |
| Input Features | NA |
| Accepted Features | NA |
| Tentative Features | NA |
| Rejected Features | NA |
| important_features | decoded_features | feature_importance_vals |
|---|---|---|
| no_selected_features | NA | -100 |
Analysis not completed
The following links can help with hypothesis generation. Names of variables likely need modification.
| important_features | url |
|---|---|
| no_selected_features | https://pubmed.ncbi.nlm.nih.gov/?term=no_selected_features%20AND%20diagnosis |