Analyst: Jennifer Fouquier


EXPLANA Description and Analysis Summary


EXPLANA Description

This is a simplified version of a more complex report for demonstration purposes. Some sections or figures are complex and are better understood through training or additional reading. See the figure legends in the report, the documentation at explana.io (to be expanded as time allows), or the preprint for more information. Clicking through the report sections will give an idea of what a report provides. Essentially, these reports document complex analytics and results in a standardized way to streamline hypothesis generation.

EXPLANA is an EXPLoratory ANAlysis and feature selection tool that helps find variables (numerical or categorical/text data) that relate to outcomes. EXPLANA ranks statistically significant features by importance and estimates the relationship between the features and outcome.

It can be used for cross-sectional data (using original values only, as indicated in the first report section) or for longitudinal studies (using original values and differences, or deltas, in variables over time for each individual or study identifier that has repeated measurements, e.g., sample sites or individuals).

For longitudinal datasets, changes in features are calculated using different reference points to obtain delta datasets (First, Previous and Pairwise delta datasets as marked in the report sections). This is important because features in longitudinal studies can carry varying degrees of importance between models built using different reference points. For longitudinal studies, several feature selection analyses are performed and included in one report.
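
The three reference points can be illustrated with a small sketch. It assumes each delta is the later value minus the reference value, and that Pairwise covers every earlier/later pair of timepoints. The function name `delta_datasets` and the toy measurements are hypothetical, and EXPLANA's own implementation may differ in its details.

```python
from itertools import combinations


def delta_datasets(measurements):
    """Compute First, Previous and Pairwise deltas for one study identifier.

    `measurements` is a list of (timepoint, value) tuples for one feature.
    Assumption: delta = later value - reference value.
    Each result row is (reference_timepoint, timepoint, delta).
    """
    ordered = sorted(measurements)  # order by timepoint
    t0, v0 = ordered[0]

    # First: every later timepoint compared with the first timepoint.
    first = [(t0, t, v - v0) for t, v in ordered[1:]]

    # Previous: every timepoint compared with the one just before it.
    previous = [
        (tp, t, v - vp)
        for (tp, vp), (t, v) in zip(ordered, ordered[1:])
    ]

    # Pairwise: every earlier/later combination of timepoints.
    pairwise = [
        (ta, tb, vb - va)
        for (ta, va), (tb, vb) in combinations(ordered, 2)
    ]
    return {"first": first, "previous": previous, "pairwise": pairwise}


# Toy example: one individual measured at three timepoints.
toy = [(1, 2.0), (2, 3.5), (3, 3.0)]
deltas = delta_datasets(toy)
for name, rows in deltas.items():
    print(name, rows)
```

With three timepoints, First and Previous each yield two deltas while Pairwise yields three, which is one reason the same feature can rank differently across the delta datasets.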

When you are using data without prior hypotheses, you are performing exploratory analysis and should make this clear when communicating results.


Outcome Variable: Divorce

Analysis Notes:


Methods

This section explains the methods, software, and statistical decisions used, to improve transparency in communication, reproducibility, and interpretation. Feel free to use, modify or expand the following text:

EXPLANA was used for exploratory analysis to identify important features related to the outcome variable, Divorce. A random effect of StudyID was used to adjust for non-independence (repeated measurements) if needed. Each Random Forest model used 300 trees, with a maximum feature fraction of 0.2 of the input features considered at each split of each decision tree. If mixed effects Random Forests were needed, 1 iteration was performed. BorutaSHAP was used to find features that repeatedly perform better than shuffled versions of all input features. Features were considered important if they performed better than 100% of the SHAP importance score of the best shuffled feature using 100 trials, p=0.05. Categorical variables were binary encoded, and low-occurring categorical values were not removed. These methods are from an EXPLANA feature selection report (version: 2024.08.21) created on 2024-10-15. Additional information can be found at https://github.com/JTFouquier/explana/ or at explana.io.
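
The selection rule described above can be sketched as follows. This is a simplified illustration of the general Boruta idea, not the BorutaSHAP library's code: a feature scores a "hit" in a trial when its importance exceeds the stated percentage of the best shuffled feature's importance, and the hit count over all trials is compared against a Binomial(trials, 0.5) null. The function names are hypothetical, and multiple-testing corrections that library implementations typically apply are omitted.

```python
from math import comb


def is_hit(importance, best_shadow_importance, threshold_percent=100):
    """True if a feature beats the given percentage of the best shuffled feature."""
    return importance > best_shadow_importance * threshold_percent / 100


def binomial_tail(hits, trials, upper):
    """P(X >= hits) if upper else P(X <= hits), for X ~ Binomial(trials, 0.5)."""
    ks = range(hits, trials + 1) if upper else range(0, hits + 1)
    return sum(comb(trials, k) for k in ks) / 2 ** trials


def boruta_decision(hits, trials=100, alpha=0.05):
    """Classify a feature from its hit count (simplified, no correction)."""
    if binomial_tail(hits, trials, upper=True) < alpha:
        return "accepted"   # hits far more often than chance
    if binomial_tail(hits, trials, upper=False) < alpha:
        return "rejected"   # hits far less often than chance
    return "tentative"      # not distinguishable from chance


print(is_hit(0.05, 0.04))    # feature importance above best shuffled feature
print(boruta_decision(100))  # hit in every trial
print(boruta_decision(50))   # hit in half of the trials
print(boruta_decision(0))    # never hit
```

Under this sketch, the Accepted, Tentative and Rejected counts in the results tables correspond to the three possible return values.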


Workflow Plan (Configuration File)

This is a copy of the configuration file used to perform this analysis.

analyst: Jennifer Fouquier
response_var: Divorce
include_time: 'yes'
random_effect: StudyID
sample_id: SampleID
timepoint: Timepoint
out: workflow-results/EXPLANA-kaggle-divorce-prediction/
iterations: '1'
n_estimators: '300'
max_features: '0.2'
borutashap_trials: '100'
borutashap_threshold: '100'
borutashap_p: '0.05'
analyze_original: 'yes'
analyze_first: 'no'
analyze_previous: 'no'
analyze_pairwise: 'no'
absolute_values: 'no'
include_reference_values: 'no'
analysis_notes: ''
enc_percent_threshold: '0'
distance_matrices: list()
df_mod: ''
input_datasets:
  metadata:
    file_path: data/kaggle/divorce_data.txt
    df_mod: ''
    dim_method: preprocess
    dim_param_dict:
      method: none
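
Most values in the configuration above are quoted strings, so anything consuming it has to convert them before use. The sketch below shows one way to do that on a hand-copied subset of the keys; the helper `typed` and the key groupings are hypothetical. The last step assumes the forest follows scikit-learn's rule for a fractional `max_features`, namely `max(1, int(max_features * n_features))`, which is an assumption about the underlying model rather than something the report states.

```python
# Hand-copied subset of the configuration shown above (values are strings).
config = {
    "iterations": "1",
    "n_estimators": "300",
    "max_features": "0.2",
    "borutashap_trials": "100",
    "borutashap_threshold": "100",
    "borutashap_p": "0.05",
    "analyze_original": "yes",
    "analyze_first": "no",
}

# Hypothetical groupings used only for this illustration.
INT_KEYS = {"iterations", "n_estimators", "borutashap_trials", "borutashap_threshold"}
FLOAT_KEYS = {"max_features", "borutashap_p"}


def typed(raw):
    """Convert quoted config strings into ints, floats and booleans."""
    out = {}
    for key, value in raw.items():
        if key in INT_KEYS:
            out[key] = int(value)
        elif key in FLOAT_KEYS:
            out[key] = float(value)
        elif value in ("yes", "no"):
            out[key] = value == "yes"
        else:
            out[key] = value
    return out


params = typed(config)

# 55 input features are reported in the Results Summary table.
n_input_features = 55
features_per_split = max(1, int(params["max_features"] * n_input_features))
print(params)
print(features_per_split)  # 11 under the assumed scikit-learn rule
```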

Results Summary

Selected Features by Dataset

Data                    Original       First          Previous       Pairwise
% Variance Explained    93.3% (93.2%)  NA             NA             NA
N Trees                 300            NA             NA             NA
Feature fraction/split  0.2            NA             NA             NA
Max Depth               7              NA             NA             NA
MERF Iters.             NA             NA             NA             NA
BorutaSHAP Trials       100            NA             NA             NA
BorutaSHAP Threshold    100            NA             NA             NA
P-value                 0.05           NA             NA             NA
N Study IDs             170            NA             NA             NA
N Samples               170            NA             NA             NA
Input Features          55             NA             NA             NA
Accepted Features       37             NA             NA             NA
Tentative Features      4              NA             NA             NA
Rejected Features       13             NA             NA             NA
Model Type (Pass/Fail)  RF Pass        Not performed  Not performed  Not performed

Selected feature ranks from models built using Original and, for longitudinal analyses, First, Previous and Pairwise delta datasets. Selected features are shown in black and labeled with feature rank. For true/positive instances of categorical variables (indicated with “ENC” after encoding), average impact on response/outcome is shown after the rank. For numerical features, impact is not shown because the feature relationship to response can be complex, requiring further post-hoc tests or inspection of SHAP dependence plots for additional insight. Empty grey boxes indicate features included in the model for a dataset, but not selected. Long feature names may be truncated and indicated with ellipses. A comprehensive list of input features can be found in model details below.


Original Dataset


Model Summary

Selected features may be ranked differently here than in the SHAP beeswarm plots, because the two rankings are produced by different methods.

Data                    Original
Model Type (Pass/Fail)  RF Pass
% Variance Explained    93.3% (93.2%)
N Trees                 300
Feature fraction/split  0.2
Max Depth               7
MERF Iters.             NA
BorutaSHAP Trials       100
BorutaSHAP Threshold    100
P-value                 0.05
N Study IDs             170
N Samples               170
Input Features          55
Accepted Features       37
Tentative Features      4
Rejected Features       13

SHAP summary beeswarm plots of feature influence on the machine learning prediction of outcome values. Each point represents one sample, and the horizontal position indicates impact on the outcome as indicated on the x-axis. Points to the left indicate a negative impact, and points to the right indicate a positive impact. The colors represent the selected feature values, where red is larger and blue is smaller. For binary encoded features (‘ENC’) red is yes[1] and blue is no[0]. SHAP is generally an improvement upon other importance scores because it provides information about both rank (how helpful the feature was compared to other features [y-axis]) and impact (a positive or negative impact on outcome values [x-axis]). If multiple figures are shown, scales may vary with a maximum of ten features per plot.
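
The rank dimension described above can be made concrete with a short sketch. SHAP summary plots commonly order features by their mean absolute SHAP value across samples; whether the importance values in this report are computed the same way is an assumption. The function name and the toy SHAP matrix are made up for illustration.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value across samples.

    `shap_values` is a list of rows (one per sample), each holding one
    SHAP value per feature, in the same order as `feature_names`.
    """
    n_samples = len(shap_values)
    scores = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    # Largest mean |SHAP| first, i.e. the top of the beeswarm y-axis.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy SHAP matrix: 3 samples x 2 features (made-up numbers).
toy_shap = [
    [0.30, -0.05],
    [-0.20, 0.10],
    [0.10, -0.15],
]
ranking = rank_by_mean_abs_shap(toy_shap, ["feature_a", "feature_b"])
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The sign of each individual SHAP value, which this ranking discards, is what the horizontal position of each point conveys.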

Selected Features

important_features decoded_features feature_importance_vals
Q18 Q18 0.0537
Q19 Q19 0.0472
Q11 Q11 0.0470
Q40 Q40 0.0435
Q20 Q20 0.0392
Q17 Q17 0.0375
Q26 Q26 0.0341
Q9 Q9 0.0292
Q16 Q16 0.0168
Q39 Q39 0.0162
Q36 Q36 0.0156
Q12 Q12 0.0128
Q25 Q25 0.0127
Q28 Q28 0.0111
Q30 Q30 0.0095
Q15 Q15 0.0091
Q29 Q29 0.0068
Q41 Q41 0.0062
Q14 Q14 0.0062
Q5 Q5 0.0057
Q4 Q4 0.0055
Q2 Q2 0.0047
Q10 Q10 0.0043
Q1 Q1 0.0042
Q38 Q38 0.0035
Q44 Q44 0.0033
Q8 Q8 0.0031
Q3 Q3 0.0029
Q27 Q27 0.0027
Q33 Q33 0.0025
Q31 Q31 0.0021
Q49 Q49 0.0020
Q34 Q34 0.0019
Q21 Q21 0.0019
Q54 Q54 0.0015
Q32 Q32 0.0010
Q37 Q37 0.0009

Feature Stats

important_features feature_importance_vals mean std min 25% 50% 75% max
Q18 0.0537 1.518 1.566 0 0 1.0 3 4
Q19 0.0472 1.641 1.641 0 0 1.0 3 4
Q11 0.0470 1.688 1.647 0 0 1.0 3 4
Q40 0.0435 1.871 1.796 0 0 1.5 4 4
Q20 0.0392 1.459 1.554 0 0 1.0 3 4
Q17 0.0375 1.653 1.615 0 0 1.0 3 4
Q26 0.0341 1.488 1.500 0 0 1.0 3 4
Q9 0.0292 1.459 1.558 0 0 1.0 3 4
Q16 0.0168 1.476 1.504 0 0 1.0 3 4
Q39 0.0162 2.088 1.719 0 0 2.0 4 4
Q36 0.0156 1.606 1.798 0 0 0.0 4 4
Q12 0.0128 1.653 1.469 0 0 1.5 3 4
Q25 0.0127 1.629 1.530 0 0 1.0 3 4
Q28 0.0111 1.306 1.468 0 0 0.5 3 4
Q30 0.0095 1.494 1.504 0 0 1.0 3 4
Q15 0.0091 1.571 1.507 0 0 1.0 3 4
Q29 0.0068 1.494 1.592 0 0 1.0 3 4
Q41 0.0062 1.994 1.722 0 0 2.0 4 4
Q14 0.0062 1.571 1.503 0 0 1.0 3 4
Q5 0.0057 1.541 1.632 0 0 1.0 3 4
Q4 0.0055 1.482 1.504 0 0 1.0 3 4
Q2 0.0047 1.653 1.469 0 0 2.0 3 4
Q10 0.0043 1.576 1.422 0 0 2.0 3 4
Q1 0.0042 1.776 1.627 0 0 2.0 3 4
Q38 0.0035 1.859 1.735 0 0 1.0 4 4
Q44 0.0033 1.941 1.684 0 0 2.0 4 4
Q8 0.0031 1.453 1.546 0 0 1.0 3 4
Q3 0.0029 1.765 1.415 0 0 2.0 3 4
Q27 0.0027 1.400 1.457 0 0 1.0 3 4
Q33 0.0025 1.806 1.785 0 0 1.0 4 4
Q31 0.0021 2.124 1.647 0 0 2.0 4 4
Q49 0.0020 2.382 1.512 0 1 3.0 4 4
Q34 0.0019 1.900 1.631 0 0 1.0 4 4
Q21 0.0019 1.388 1.452 0 0 1.0 3 4
Q54 0.0015 2.012 1.668 0 0 2.0 4 4
Q32 0.0010 2.059 1.623 0 0 2.0 4 4
Q37 0.0009 2.088 1.716 0 0 2.0 4 4
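
The summary columns in the table above (mean, std, min, quartiles, max) resemble the output of a pandas-style `describe`. A standard-library sketch of the same statistics is given below, assuming a sample standard deviation (n - 1 denominator) and linearly interpolated quartiles; the helper name and the toy responses on a 0 to 4 scale are hypothetical.

```python
import statistics


def describe(values):
    """Summary statistics for one feature's values."""
    # method="inclusive" interpolates linearly between order statistics.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


# Toy responses on the same 0-4 scale as the features above.
toy_responses = [0, 0, 1, 3, 4]
stats = describe(toy_responses)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```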

Input Features

input_features was_selected
Q1 yes
Q2 yes
Q3 yes
Q4 yes
Q5 yes
Q6 no
Q7 no
Q8 yes
Q9 yes
Q10 yes
Q11 yes
Q12 yes
Q13 no
Q14 yes
Q15 yes
Q16 yes
Q17 yes
Q18 yes
Q19 yes
Q20 yes
Q21 yes
Q22 no
Q23 no
Q24 no
Q25 yes
Q26 yes
Q27 yes
Q28 yes
Q29 yes
Q30 yes
Q31 yes
Q32 yes
Q33 yes
Q34 yes
Q35 no
Q36 yes
Q37 yes
Q38 yes
Q39 yes
Q40 yes
Q41 yes
Q42 no
Q43 no
Q44 yes
Q45 no
Q46 no
Q47 no
Q48 no
Q49 yes
Q50 no
Q51 no
Q52 no
Q53 no
Q54 yes
timepoint_explana no

Interpretation/Literature Search

The following links can help with hypothesis generation. Names of variables likely need modification.

important_features url
Q18 https://pubmed.ncbi.nlm.nih.gov/?term=Q18%20AND%20Divorce
Q19 https://pubmed.ncbi.nlm.nih.gov/?term=Q19%20AND%20Divorce
Q11 https://pubmed.ncbi.nlm.nih.gov/?term=Q11%20AND%20Divorce
Q40 https://pubmed.ncbi.nlm.nih.gov/?term=Q40%20AND%20Divorce
Q20 https://pubmed.ncbi.nlm.nih.gov/?term=Q20%20AND%20Divorce
Q17 https://pubmed.ncbi.nlm.nih.gov/?term=Q17%20AND%20Divorce
Q26 https://pubmed.ncbi.nlm.nih.gov/?term=Q26%20AND%20Divorce
Q9 https://pubmed.ncbi.nlm.nih.gov/?term=Q9%20AND%20Divorce
Q16 https://pubmed.ncbi.nlm.nih.gov/?term=Q16%20AND%20Divorce
Q39 https://pubmed.ncbi.nlm.nih.gov/?term=Q39%20AND%20Divorce
Q36 https://pubmed.ncbi.nlm.nih.gov/?term=Q36%20AND%20Divorce
Q12 https://pubmed.ncbi.nlm.nih.gov/?term=Q12%20AND%20Divorce
Q25 https://pubmed.ncbi.nlm.nih.gov/?term=Q25%20AND%20Divorce
Q28 https://pubmed.ncbi.nlm.nih.gov/?term=Q28%20AND%20Divorce
Q30 https://pubmed.ncbi.nlm.nih.gov/?term=Q30%20AND%20Divorce
Q15 https://pubmed.ncbi.nlm.nih.gov/?term=Q15%20AND%20Divorce
Q29 https://pubmed.ncbi.nlm.nih.gov/?term=Q29%20AND%20Divorce
Q41 https://pubmed.ncbi.nlm.nih.gov/?term=Q41%20AND%20Divorce
Q14 https://pubmed.ncbi.nlm.nih.gov/?term=Q14%20AND%20Divorce
Q5 https://pubmed.ncbi.nlm.nih.gov/?term=Q5%20AND%20Divorce
Q4 https://pubmed.ncbi.nlm.nih.gov/?term=Q4%20AND%20Divorce
Q2 https://pubmed.ncbi.nlm.nih.gov/?term=Q2%20AND%20Divorce
Q10 https://pubmed.ncbi.nlm.nih.gov/?term=Q10%20AND%20Divorce
Q1 https://pubmed.ncbi.nlm.nih.gov/?term=Q1%20AND%20Divorce
Q38 https://pubmed.ncbi.nlm.nih.gov/?term=Q38%20AND%20Divorce
Q44 https://pubmed.ncbi.nlm.nih.gov/?term=Q44%20AND%20Divorce
Q8 https://pubmed.ncbi.nlm.nih.gov/?term=Q8%20AND%20Divorce
Q3 https://pubmed.ncbi.nlm.nih.gov/?term=Q3%20AND%20Divorce
Q27 https://pubmed.ncbi.nlm.nih.gov/?term=Q27%20AND%20Divorce
Q33 https://pubmed.ncbi.nlm.nih.gov/?term=Q33%20AND%20Divorce
Q31 https://pubmed.ncbi.nlm.nih.gov/?term=Q31%20AND%20Divorce
Q49 https://pubmed.ncbi.nlm.nih.gov/?term=Q49%20AND%20Divorce
Q34 https://pubmed.ncbi.nlm.nih.gov/?term=Q34%20AND%20Divorce
Q21 https://pubmed.ncbi.nlm.nih.gov/?term=Q21%20AND%20Divorce
Q54 https://pubmed.ncbi.nlm.nih.gov/?term=Q54%20AND%20Divorce
Q32 https://pubmed.ncbi.nlm.nih.gov/?term=Q32%20AND%20Divorce
Q37 https://pubmed.ncbi.nlm.nih.gov/?term=Q37%20AND%20Divorce
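
Because codes such as Q18 are unlikely to match anything in the literature, the search links above usually need a descriptive term substituted for the variable name. The sketch below rebuilds a link in the same format and accepts an optional replacement label; the function `pubmed_url` and the example label are hypothetical.

```python
from urllib.parse import quote

PUBMED_SEARCH = "https://pubmed.ncbi.nlm.nih.gov/?term="


def pubmed_url(feature, outcome, label=None):
    """Build a PubMed search link for a feature and an outcome.

    `label` optionally replaces the raw feature name with a more
    searchable phrase.
    """
    term = f"{label or feature} AND {outcome}"
    return PUBMED_SEARCH + quote(term)  # spaces become %20


url = pubmed_url("Q18", "Divorce")
labelled_url = pubmed_url("Q18", "Divorce", label="example phrase")
print(url)
print(labelled_url)
```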

Log


Binary encoded columns created for categorical input variables:

BorutaSHAP Figures


First Delta Dataset


Model Summary

Data First
Model Type (Pass/Fail) Not performed
% Variance Explained NA
N Trees NA
Feature fraction/split NA
Max Depth NA
MERF Iters. NA
BorutaSHAP Trials NA
BorutaSHAP Threshold NA
P-value NA
N Study IDs NA
N Samples NA
Input Features NA
Accepted Features NA
Tentative Features NA
Rejected Features NA

Selected Features

important_features decoded_features feature_importance_vals
no_selected_features NA -100

Feature Stats

Analysis not completed

Input Features

Analysis not completed

Interpretation/Literature Search

Analysis not completed

Log

BorutaSHAP Figures


Previous Delta Dataset


Model Summary

Data Previous
Model Type (Pass/Fail) Not performed
% Variance Explained NA
N Trees NA
Feature fraction/split NA
Max Depth NA
MERF Iters. NA
BorutaSHAP Trials NA
BorutaSHAP Threshold NA
P-value NA
N Study IDs NA
N Samples NA
Input Features NA
Accepted Features NA
Tentative Features NA
Rejected Features NA

Selected Features

important_features decoded_features feature_importance_vals
no_selected_features NA -100

Feature Stats

Analysis not completed

Input Features

Analysis not completed

Interpretation/Literature Search

Analysis not completed

Log

BorutaSHAP Figures


Pairwise Delta Dataset


Model Summary

Data Pairwise
Model Type (Pass/Fail) Not performed
% Variance Explained NA
N Trees NA
Feature fraction/split NA
Max Depth NA
MERF Iters. NA
BorutaSHAP Trials NA
BorutaSHAP Threshold NA
P-value NA
N Study IDs NA
N Samples NA
Input Features NA
Accepted Features NA
Tentative Features NA
Rejected Features NA

Selected Features

important_features decoded_features feature_importance_vals
no_selected_features NA -100

Feature Stats

Analysis not completed

Input Features

Analysis not completed

Interpretation/Literature Search

Analysis not completed

Log

BorutaSHAP Figures


See the GitHub repository (https://github.com/JTFouquier/explana/) for more information.