This is a simplified version of a more complex report for demonstration purposes. Some sections or figures are complex and may be better understood through training or additional reading. See the figure legends in the report, the documentation at explana.io (to be expanded as time allows), or the preprint for more information. Clicking through the report sections will give an idea of what a report provides. Essentially, these reports document complex analytics and results in a standardized way to streamline hypothesis generation.
EXPLANA is an EXPLoratory ANAlysis and feature selection tool that helps find variables (numerical or categorical/text data) that relate to outcomes. EXPLANA ranks statistically significant features by importance and estimates the relationship between the features and outcome.
It can be used for cross-sectional data (using original values only, as indicated in the first report section), or for longitudinal studies (using original values and differences/deltas in variables over time for each individual or study identifier that has repeated measurements, e.g., sample sites or individuals).
For longitudinal datasets, changes in features are calculated using different reference points to obtain delta datasets (First, Previous and Pairwise delta datasets as marked in the report sections). This is important because features in longitudinal studies can carry varying degrees of importance between models built using different reference points. For longitudinal studies, several feature selection analyses are performed and included in one report.
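The three reference points can be illustrated with a minimal sketch. It assumes that "First" compares each later timepoint to a subject's first timepoint, "Previous" compares each timepoint to the one immediately before it, and "Pairwise" compares every pair of timepoints within a subject. These definitions, the function name `delta_datasets`, and the feature name `score` are assumptions made for illustration; they are not taken from the EXPLANA source code. The `StudyID` and `Timepoint` column names follow the configuration shown later in this report.

```python
from itertools import combinations


def delta_datasets(rows, feature):
    """Compute illustrative First, Previous and Pairwise deltas for one feature.

    rows: list of dicts with 'StudyID', 'Timepoint' and the feature value.
    Returns a dict of lists of (study_id, reference_time, later_time, delta).
    """
    by_subject = {}
    for row in rows:
        by_subject.setdefault(row["StudyID"], []).append(row)

    first, previous, pairwise = [], [], []
    for study_id, samples in by_subject.items():
        samples = sorted(samples, key=lambda r: r["Timepoint"])

        def delta(ref, later):
            return (study_id, ref["Timepoint"], later["Timepoint"],
                    later[feature] - ref[feature])

        # First: every later timepoint minus the subject's first timepoint.
        first.extend(delta(samples[0], later) for later in samples[1:])
        # Previous: every timepoint minus the one immediately before it.
        previous.extend(delta(ref, later)
                        for ref, later in zip(samples, samples[1:]))
        # Pairwise: every later timepoint minus every earlier timepoint.
        pairwise.extend(delta(ref, later)
                        for ref, later in combinations(samples, 2))

    return {"first": first, "previous": previous, "pairwise": pairwise}


# Hypothetical subject measured at three timepoints.
example_rows = [
    {"StudyID": "A", "Timepoint": 1, "score": 10},
    {"StudyID": "A", "Timepoint": 2, "score": 13},
    {"StudyID": "A", "Timepoint": 3, "score": 19},
]
example_deltas = delta_datasets(example_rows, "score")
```

In this toy example the change from timepoint 2 to timepoint 3 appears only in the Previous and Pairwise datasets, which is one way the same feature can carry different importance depending on the reference point.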
When you are using data without prior hypotheses, you are performing exploratory analysis and should make this clear when communicating results.
This section explains the methods, software, and statistical decisions used, to improve transparency in communication, reproducibility, and interpretation. Feel free to use, modify, or expand the following text:
EXPLANA was used for exploratory analysis to identify important features related to the outcome variable, Divorce. A random effect of StudyID was used to adjust for non-independence (repeated measurements) if needed. There were 300 trees used per Random Forest model with a max feature fraction of 0.2 of the input features for each split per decision tree in the forest. If mixed effects Random Forests were needed, 1 iteration was performed. BorutaSHAP was used to find features that repeatedly perform better than shuffled versions of all input features. Features were considered important if they performed better than 100% of the SHAP importance score of the best shuffled feature using 100 trials, p=0.05. Categorical variables were binary encoded and low-occurring categorical values were not removed. These methods are from an EXPLANA feature selection report (version: 2024.08.21) created on 2024-10-15. Additional information can be found at https://github.com/JTFouquier/explana/ or at explana.io.
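The selection rule described above can be illustrated with a simplified Boruta-style tally. This sketch assumes that a feature scores a "hit" in a trial when its importance exceeds the threshold percentage of the best shuffled feature's importance, and that the hit count over all trials is compared to chance (0.5) with one-sided binomial tests. It is not the BorutaSHAP package's implementation: multiple-testing corrections and the SHAP computation itself are omitted, and the function names are hypothetical.

```python
from math import comb


def binom_tail_ge(hits, trials, p=0.5):
    """Return P(X >= hits) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p) ** (trials - k)
               for k in range(hits, trials + 1))


def count_hits(feature_scores, best_shadow_scores, threshold_percent=100):
    """Count trials where the feature beats threshold% of the best shuffled score."""
    return sum(f > s * threshold_percent / 100
               for f, s in zip(feature_scores, best_shadow_scores))


def decide(hits, trials=100, alpha=0.05):
    """Classify a feature from its hit count (simplified, no corrections)."""
    # Evidence that the feature beats the shuffled features more often than chance.
    p_accept = binom_tail_ge(hits, trials)
    # Evidence that it beats them less often than chance; at p=0.5 the
    # lower tail P(X <= hits) equals P(X >= trials - hits) by symmetry.
    p_reject = binom_tail_ge(trials - hits, trials)
    if p_accept < alpha:
        return "accepted"
    if p_reject < alpha:
        return "rejected"
    return "tentative"


# Hypothetical importance scores over three trials.
example_hits = count_hits([0.5, 0.2, 0.9], [0.3, 0.3, 0.3])
```

Under these assumptions, a feature that wins in every trial is accepted, one that never wins is rejected, and one that wins about half the time stays tentative.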
This is a copy of the script used to perform this analysis.
analyst: Jennifer Fouquier
response_var: Divorce
include_time: 'yes'
random_effect: StudyID
sample_id: SampleID
timepoint: Timepoint
out: workflow-results/EXPLANA-kaggle-divorce-prediction/
iterations: '1'
n_estimators: '300'
max_features: '0.2'
borutashap_trials: '100'
borutashap_threshold: '100'
borutashap_p: '0.05'
analyze_original: 'yes'
analyze_first: 'no'
analyze_previous: 'no'
analyze_pairwise: 'no'
absolute_values: 'no'
include_reference_values: 'no'
analysis_notes: ''
enc_percent_threshold: '0'
distance_matrices: list()
df_mod: ''
input_datasets:
  metadata:
    file_path: data/kaggle/divorce_data.txt
    df_mod: ''
    dim_method: preprocess
    dim_param_dict:
      method: none
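Most numeric settings in this configuration are stored as quoted strings. The sketch below shows one way to convert them to numbers and sanity-check them before a run. The helper names, the choice of which keys are integers, and the validation rules are assumptions made for illustration and are not part of EXPLANA. The `features_per_split` helper assumes the fraction is applied the way scikit-learn applies a float `max_features`, which may differ from what EXPLANA does internally.

```python
# String-valued settings copied from the configuration above.
RAW_CONFIG = {
    "iterations": "1",
    "n_estimators": "300",
    "max_features": "0.2",
    "borutashap_trials": "100",
    "borutashap_threshold": "100",
    "borutashap_p": "0.05",
    "enc_percent_threshold": "0",
}

# Assumption: these settings are counts, the rest are real-valued.
INT_KEYS = {"iterations", "n_estimators", "borutashap_trials"}


def typed_config(raw):
    """Convert quoted numeric strings to int or float."""
    return {key: int(value) if key in INT_KEYS else float(value)
            for key, value in raw.items()}


def validate(cfg):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if cfg["n_estimators"] < 1:
        problems.append("n_estimators must be at least 1")
    if not 0 < cfg["max_features"] <= 1:
        problems.append("max_features must be a fraction in (0, 1]")
    if not 0 < cfg["borutashap_p"] < 1:
        problems.append("borutashap_p must be in (0, 1)")
    if not 0 <= cfg["borutashap_threshold"] <= 100:
        problems.append("borutashap_threshold must be a percentage")
    return problems


def features_per_split(max_features, n_input_features):
    """Features considered per split, assuming scikit-learn's rule for floats."""
    return max(1, int(max_features * n_input_features))


config = typed_config(RAW_CONFIG)
config_problems = validate(config)
```

With the 55 input features reported in the summary table, `features_per_split(config["max_features"], 55)` gives 11 candidate features per split under this assumption.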
| Data | Original | First | Previous | Pairwise |
|---|---|---|---|---|
| % Variance Explained | 93.3% (93.2%) | NA | NA | NA |
| N Trees | 300 | NA | NA | NA |
| Feature fraction/split | 0.2 | NA | NA | NA |
| Max Depth | 7 | NA | NA | NA |
| MERF Iters. | NA | NA | NA | NA |
| BorutaSHAP Trials | 100 | NA | NA | NA |
| BorutaSHAP Threshold | 100 | NA | NA | NA |
| P-value | 0.05 | NA | NA | NA |
| N Study IDs | 170 | NA | NA | NA |
| N Samples | 170 | NA | NA | NA |
| Input Features | 55 | NA | NA | NA |
| Accepted Features | 37 | NA | NA | NA |
| Tentative Features | 4 | NA | NA | NA |
| Rejected Features | 13 | NA | NA | NA |
| Model Type (Pass/Fail) | RF Pass | Not performed | Not performed | Not performed |
Selected feature ranks from models built using Original and, for longitudinal analyses, First, Previous and Pairwise delta datasets. Selected features are shown in black and labeled with feature rank. For true/positive instances of categorical variables (indicated with “ENC” after encoding), average impact on response/outcome is shown after the rank. For numerical features, impact is not shown because the feature relationship to response can be complex, requiring further post-hoc tests or inspection of SHAP dependence plots for additional insight. Empty grey boxes indicate features included in the model for a dataset, but not selected. Long feature names may be truncated and indicated with ellipses. A comprehensive list of input features can be found in model details below.
The following links can help with hypothesis generation. Names of variables likely need modification.
Selected features may have different ranks compared to the order shown in SHAP Beeswarm plots. Features are essentially ranked using different methods.
| Data | Original |
|---|---|
| Model Type (Pass/Fail) | RF Pass |
| % Variance Explained | 93.3% (93.2%) |
| N Trees | 300 |
| Feature fraction/split | 0.2 |
| Max Depth | 7 |
| MERF Iters. | NA |
| BorutaSHAP Trials | 100 |
| BorutaSHAP Threshold | 100 |
| P-value | 0.05 |
| N Study IDs | 170 |
| N Samples | 170 |
| Input Features | 55 |
| Accepted Features | 37 |
| Tentative Features | 4 |
| Rejected Features | 13 |
SHAP summary beeswarm plots of feature influence on the machine learning prediction of outcome values. Each point represents one sample, and the horizontal position indicates impact on the outcome as indicated on the x-axis. Points to the left indicate a negative impact, and points to the right indicate a positive impact. The colors represent the selected feature values, where red is larger and blue is smaller. For binary encoded features (‘ENC’) red is yes[1] and blue is no[0]. SHAP is generally an improvement upon other importance scores because it provides information about both rank (how helpful the feature was compared to other features [y-axis]) and impact (a positive or negative impact on outcome values [x-axis]). If multiple figures are shown, scales may vary with a maximum of ten features per plot.
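SHAP summary plots commonly order features by their mean absolute SHAP value across samples. The sketch below shows that calculation on a small, made-up matrix of SHAP values. Whether the importance values listed in this report are computed in exactly this way is an assumption, and the function name and numbers are hypothetical.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value, largest first.

    shap_values: list of rows (one per sample), one SHAP value per feature.
    Returns a list of (feature_name, mean_abs_shap) tuples.
    """
    n_samples = len(shap_values)
    scores = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for two samples and two features.
example_shap = [
    [0.2, -0.1],
    [-0.4, 0.1],
]
example_ranking = rank_by_mean_abs_shap(example_shap, ["feature_a", "feature_b"])
```

Taking absolute values captures the rank dimension only. The sign of each individual SHAP value, shown as the horizontal position in the beeswarm plot, carries the direction of impact.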
| important_features | decoded_features | feature_importance_vals |
|---|---|---|
| Q18 | Q18 | 0.0537 |
| Q19 | Q19 | 0.0472 |
| Q11 | Q11 | 0.0470 |
| Q40 | Q40 | 0.0435 |
| Q20 | Q20 | 0.0392 |
| Q17 | Q17 | 0.0375 |
| Q26 | Q26 | 0.0341 |
| Q9 | Q9 | 0.0292 |
| Q16 | Q16 | 0.0168 |
| Q39 | Q39 | 0.0162 |
| Q36 | Q36 | 0.0156 |
| Q12 | Q12 | 0.0128 |
| Q25 | Q25 | 0.0127 |
| Q28 | Q28 | 0.0111 |
| Q30 | Q30 | 0.0095 |
| Q15 | Q15 | 0.0091 |
| Q29 | Q29 | 0.0068 |
| Q41 | Q41 | 0.0062 |
| Q14 | Q14 | 0.0062 |
| Q5 | Q5 | 0.0057 |
| Q4 | Q4 | 0.0055 |
| Q2 | Q2 | 0.0047 |
| Q10 | Q10 | 0.0043 |
| Q1 | Q1 | 0.0042 |
| Q38 | Q38 | 0.0035 |
| Q44 | Q44 | 0.0033 |
| Q8 | Q8 | 0.0031 |
| Q3 | Q3 | 0.0029 |
| Q27 | Q27 | 0.0027 |
| Q33 | Q33 | 0.0025 |
| Q31 | Q31 | 0.0021 |
| Q49 | Q49 | 0.0020 |
| Q34 | Q34 | 0.0019 |
| Q21 | Q21 | 0.0019 |
| Q54 | Q54 | 0.0015 |
| Q32 | Q32 | 0.0010 |
| Q37 | Q37 | 0.0009 |
| important_features | feature_importance_vals | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Q18 | 0.0537 | 1.518 | 1.566 | 0 | 0 | 1.0 | 3 | 4 |
| Q19 | 0.0472 | 1.641 | 1.641 | 0 | 0 | 1.0 | 3 | 4 |
| Q11 | 0.0470 | 1.688 | 1.647 | 0 | 0 | 1.0 | 3 | 4 |
| Q40 | 0.0435 | 1.871 | 1.796 | 0 | 0 | 1.5 | 4 | 4 |
| Q20 | 0.0392 | 1.459 | 1.554 | 0 | 0 | 1.0 | 3 | 4 |
| Q17 | 0.0375 | 1.653 | 1.615 | 0 | 0 | 1.0 | 3 | 4 |
| Q26 | 0.0341 | 1.488 | 1.500 | 0 | 0 | 1.0 | 3 | 4 |
| Q9 | 0.0292 | 1.459 | 1.558 | 0 | 0 | 1.0 | 3 | 4 |
| Q16 | 0.0168 | 1.476 | 1.504 | 0 | 0 | 1.0 | 3 | 4 |
| Q39 | 0.0162 | 2.088 | 1.719 | 0 | 0 | 2.0 | 4 | 4 |
| Q36 | 0.0156 | 1.606 | 1.798 | 0 | 0 | 0.0 | 4 | 4 |
| Q12 | 0.0128 | 1.653 | 1.469 | 0 | 0 | 1.5 | 3 | 4 |
| Q25 | 0.0127 | 1.629 | 1.530 | 0 | 0 | 1.0 | 3 | 4 |
| Q28 | 0.0111 | 1.306 | 1.468 | 0 | 0 | 0.5 | 3 | 4 |
| Q30 | 0.0095 | 1.494 | 1.504 | 0 | 0 | 1.0 | 3 | 4 |
| Q15 | 0.0091 | 1.571 | 1.507 | 0 | 0 | 1.0 | 3 | 4 |
| Q29 | 0.0068 | 1.494 | 1.592 | 0 | 0 | 1.0 | 3 | 4 |
| Q41 | 0.0062 | 1.994 | 1.722 | 0 | 0 | 2.0 | 4 | 4 |
| Q14 | 0.0062 | 1.571 | 1.503 | 0 | 0 | 1.0 | 3 | 4 |
| Q5 | 0.0057 | 1.541 | 1.632 | 0 | 0 | 1.0 | 3 | 4 |
| Q4 | 0.0055 | 1.482 | 1.504 | 0 | 0 | 1.0 | 3 | 4 |
| Q2 | 0.0047 | 1.653 | 1.469 | 0 | 0 | 2.0 | 3 | 4 |
| Q10 | 0.0043 | 1.576 | 1.422 | 0 | 0 | 2.0 | 3 | 4 |
| Q1 | 0.0042 | 1.776 | 1.627 | 0 | 0 | 2.0 | 3 | 4 |
| Q38 | 0.0035 | 1.859 | 1.735 | 0 | 0 | 1.0 | 4 | 4 |
| Q44 | 0.0033 | 1.941 | 1.684 | 0 | 0 | 2.0 | 4 | 4 |
| Q8 | 0.0031 | 1.453 | 1.546 | 0 | 0 | 1.0 | 3 | 4 |
| Q3 | 0.0029 | 1.765 | 1.415 | 0 | 0 | 2.0 | 3 | 4 |
| Q27 | 0.0027 | 1.400 | 1.457 | 0 | 0 | 1.0 | 3 | 4 |
| Q33 | 0.0025 | 1.806 | 1.785 | 0 | 0 | 1.0 | 4 | 4 |
| Q31 | 0.0021 | 2.124 | 1.647 | 0 | 0 | 2.0 | 4 | 4 |
| Q49 | 0.0020 | 2.382 | 1.512 | 0 | 1 | 3.0 | 4 | 4 |
| Q34 | 0.0019 | 1.900 | 1.631 | 0 | 0 | 1.0 | 4 | 4 |
| Q21 | 0.0019 | 1.388 | 1.452 | 0 | 0 | 1.0 | 3 | 4 |
| Q54 | 0.0015 | 2.012 | 1.668 | 0 | 0 | 2.0 | 4 | 4 |
| Q32 | 0.0010 | 2.059 | 1.623 | 0 | 0 | 2.0 | 4 | 4 |
| Q37 | 0.0009 | 2.088 | 1.716 | 0 | 0 | 2.0 | 4 | 4 |
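The summary columns in the table above (mean, std, min, quartiles, max) can be reproduced for a single feature with the Python standard library. This sketch assumes the standard deviation is the sample standard deviation and that quartiles use linear interpolation between data points, which is how pandas `describe` behaves by default. The report does not state which conventions were used, and the example answers are made up.

```python
import statistics


def describe(values):
    """Summary statistics for one numeric feature."""
    # method="inclusive" interpolates linearly between the observed
    # minimum and maximum, treating the data as the full range.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


# Hypothetical answers on the 0-4 scale used by the questionnaire items.
example_summary = describe([0, 0, 1, 3, 4])
```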
| input_features | was_selected |
|---|---|
| Q1 | yes |
| Q2 | yes |
| Q3 | yes |
| Q4 | yes |
| Q5 | yes |
| Q6 | no |
| Q7 | no |
| Q8 | yes |
| Q9 | yes |
| Q10 | yes |
| Q11 | yes |
| Q12 | yes |
| Q13 | no |
| Q14 | yes |
| Q15 | yes |
| Q16 | yes |
| Q17 | yes |
| Q18 | yes |
| Q19 | yes |
| Q20 | yes |
| Q21 | yes |
| Q22 | no |
| Q23 | no |
| Q24 | no |
| Q25 | yes |
| Q26 | yes |
| Q27 | yes |
| Q28 | yes |
| Q29 | yes |
| Q30 | yes |
| Q31 | yes |
| Q32 | yes |
| Q33 | yes |
| Q34 | yes |
| Q35 | no |
| Q36 | yes |
| Q37 | yes |
| Q38 | yes |
| Q39 | yes |
| Q40 | yes |
| Q41 | yes |
| Q42 | no |
| Q43 | no |
| Q44 | yes |
| Q45 | no |
| Q46 | no |
| Q47 | no |
| Q48 | no |
| Q49 | yes |
| Q50 | no |
| Q51 | no |
| Q52 | no |
| Q53 | no |
| Q54 | yes |
| timepoint_explana | no |
The following links can help with hypothesis generation. Names of variables likely need modification.
Binary encoded columns created for categorical input variables:
Selected features may have different ranks compared to the order shown in SHAP Beeswarm plots. Features are essentially ranked using different methods.
| Data | First |
|---|---|
| Model Type (Pass/Fail) | Not performed |
| % Variance Explained | NA |
| N Trees | NA |
| Feature fraction/split | NA |
| Max Depth | NA |
| MERF Iters. | NA |
| BorutaSHAP Trials | NA |
| BorutaSHAP Threshold | NA |
| P-value | NA |
| N Study IDs | NA |
| N Samples | NA |
| Input Features | NA |
| Accepted Features | NA |
| Tentative Features | NA |
| Rejected Features | NA |
| important_features | decoded_features | feature_importance_vals |
|---|---|---|
| no_selected_features | NA | -100 |
Analysis not completed
The following links can help with hypothesis generation. Names of variables likely need modification.
| important_features | url |
|---|---|
| no_selected_features | https://pubmed.ncbi.nlm.nih.gov/?term=no_selected_features%20AND%20Divorce |
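The search links in this report follow a simple pattern: the feature name and the outcome are joined with `AND` and percent-encoded into a PubMed query. A minimal sketch of that pattern is shown below. The function name is hypothetical, and the URL format is inferred from the link in the table above.

```python
from urllib.parse import quote

PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov/?term="


def pubmed_search_url(feature, outcome):
    """Build a PubMed search URL for 'feature AND outcome'."""
    # quote() percent-encodes spaces as %20 and leaves letters,
    # digits and underscores unchanged.
    return PUBMED_BASE + quote(f"{feature} AND {outcome}")


example_url = pubmed_search_url("no_selected_features", "Divorce")
```

Because raw variable names such as `Q18` or the placeholder `no_selected_features` are unlikely to match the literature, the feature argument usually needs to be replaced with a descriptive term before the search is useful.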
Selected features may have different ranks compared to the order shown in SHAP Beeswarm plots. Features are essentially ranked using different methods.
| Data | Previous |
|---|---|
| Model Type (Pass/Fail) | Not performed |
| % Variance Explained | NA |
| N Trees | NA |
| Feature fraction/split | NA |
| Max Depth | NA |
| MERF Iters. | NA |
| BorutaSHAP Trials | NA |
| BorutaSHAP Threshold | NA |
| P-value | NA |
| N Study IDs | NA |
| N Samples | NA |
| Input Features | NA |
| Accepted Features | NA |
| Tentative Features | NA |
| Rejected Features | NA |
| important_features | decoded_features | feature_importance_vals |
|---|---|---|
| no_selected_features | NA | -100 |
Analysis not completed
The following links can help with hypothesis generation. Names of variables likely need modification.
| important_features | url |
|---|---|
| no_selected_features | https://pubmed.ncbi.nlm.nih.gov/?term=no_selected_features%20AND%20Divorce |
Selected features may have different ranks compared to the order shown in SHAP Beeswarm plots. Features are essentially ranked using different methods.
| Data | Pairwise |
|---|---|
| Model Type (Pass/Fail) | Not performed |
| % Variance Explained | NA |
| N Trees | NA |
| Feature fraction/split | NA |
| Max Depth | NA |
| MERF Iters. | NA |
| BorutaSHAP Trials | NA |
| BorutaSHAP Threshold | NA |
| P-value | NA |
| N Study IDs | NA |
| N Samples | NA |
| Input Features | NA |
| Accepted Features | NA |
| Tentative Features | NA |
| Rejected Features | NA |
| important_features | decoded_features | feature_importance_vals |
|---|---|---|
| no_selected_features | NA | -100 |
Analysis not completed
The following links can help with hypothesis generation. Names of variables likely need modification.
| important_features | url |
|---|---|
| no_selected_features | https://pubmed.ncbi.nlm.nih.gov/?term=no_selected_features%20AND%20Divorce |