Binary Classifier Notebook

30 Day Readmission Risk for Patients with Diabetes

The main objective of this notebook is to provide an example of how to use the seismometer package to analyze a binary classification predictive model.

This notebook helps evaluate a binary classification model trained on the Diabetes Dataset. The model predicts the risk of readmission within 30 days for patients with diabetes. It is a simple LightGBM model, used only to illustrate how the seismometer package can be set up and used.

Basic preprocessing steps have already been completed on the dataset. The prepared data is used in training the model and model performance analysis.

Documentation

To find out more about seismometer, see the documentation on GitHub.

Usage

Explore data from your organization’s model including predictions, outcomes, interventions, and sensitive cohorts. Use sm.show_info() to explore what is available.

[1]:
# Download dataset
import urllib.request
from pathlib import Path

SOURCE_REPO = "epic-open-source/seismometer-data"
BRANCH_NAME = "main"
DATASET_SOURCE = f"https://raw.githubusercontent.com/{SOURCE_REPO}/refs/heads/{BRANCH_NAME}/diabetes-v2"
files = [
    "config.yml",
    "usage_config.yml",
    "data_dictionary.yml",
    "data/predictions.parquet",
    "data/events.parquet",
    "data/metadata.json",
]
Path('data').mkdir(parents=True, exist_ok=True)
for file in files:
    _ = urllib.request.urlretrieve(f"{DATASET_SOURCE}/{file}", file)
[2]:
import seismometer as sm
sm.run_startup(config_path='.')
[3]:
sm.show_info(plot_help=True)
[3]:

Summary

The preloaded data covers 99340 predictions over 69987 entities from the dates 2024-10-08 to 2024-10-21.
Dataframe Name  Rows   Columns  Content
predictions     99340  3        Scores, features, configured demographics, and merged events for each prediction

Plot Functions

  • sm.ExploreModelEvaluation() - Overall performance across thresholds
  • sm.ExploreCohortEvaluation() - Performance split by specified cohort
  • sm.ExploreCohortOutcomeInterventionTimes() - Compare trends of interventions to outcomes

Overview

ℹ Info

A LightGBM model trained on the Diabetes Dataset predicts whether a diabetes patient will be readmitted within 30 days of discharge. The first step is to provide the required information: configuration files, predictions data, and events data (interventions, outcomes, or target events). Datasets should be stored in parquet format.

The seismometer package pulls configuration from the config.yml file. This file stores:

  1. the filepath to the predictions dataframe, in parquet format,

  2. the filepath to the events dataframe, in parquet format,

  3. the filepath to the usage configuration that describes how to interpret the data during a run,

  4. the filepath to the event definitions, which specify the events,

  5. the filepath to the prediction definitions, which specify the cohorts, scores, and features to consider.
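For orientation, the snippet below sketches the kind of information config.yml carries, mirroring the five items above. The key names are illustrative only, not the actual seismometer schema; consult the seismometer documentation for the real configuration keys.

```yaml
# Hypothetical config.yml sketch -- key names are illustrative,
# not the actual seismometer schema.
other_info:
  prediction_path: "data/predictions.parquet"   # 1. predictions dataframe
  event_path: "data/events.parquet"             # 2. events dataframe
  usage_config: "usage_config.yml"              # 3. how to interpret data during a run
  event_definition: "data_dictionary.yml"       # 4. event definitions
  prediction_definition: "data_dictionary.yml"  # 5. cohorts, scores, and features
```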

We have created:

  1. the predictions dataset, where each row is a patient/encounter and the columns are the input features, a patient identifier, the time of the prediction, and a score column corresponding to the output of the trained LightGBM model,

  2. the events dataset, where each row corresponds to a target, intervention, or outcome. Here there is only one event defined for the model: the training target (y). The dataset also includes the patient identifier, the time of the event, the type of the event (relevant when there are multiple events), and the event value (in this example, a 1 indicates a readmission occurred within 30 days),

  3. the usage_config.yml file, whose data_usage node specifies:

    1. age, race, and gender as the analysis cohort attributes,

    2. the LightGBM model output as the primary output (score),

    3. the 30-day readmission event (readmitted column) as the primary target,

    4. admission_type_id, num_medications, and num_procedures as the only extra features to consider in feature analysis.
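As a minimal sketch of the two datasets described above (column names follow the diabetes example but are otherwise illustrative), the predictions and events tables can be built with pandas and then written to parquet:

```python
import pandas as pd

# Predictions: one row per patient/encounter with input features, an
# identifier, the prediction time, and the model score.
predictions = pd.DataFrame({
    "id": [1001, 1002, 1003],
    "PredictTime": pd.to_datetime(
        ["2024-10-08 09:00", "2024-10-08 11:30", "2024-10-09 14:15"]),
    "num_medications": [12, 5, 20],
    "num_procedures": [2, 0, 4],
    "Risk30DayReadmission": [0.07, 0.31, 0.18],  # model output (score)
})

# Events: one row per target/intervention/outcome; a value of 1 indicates
# a readmission occurred within 30 days.
events = pd.DataFrame({
    "id": [1002, 1003],
    "Time": pd.to_datetime(["2024-10-25 08:00", "2024-11-01 16:40"]),
    "Type": ["Readmitted within 30 Days"] * 2,
    "Value": [1, 0],
})

# Both tables would then be written to parquet for seismometer, e.g.:
# predictions.to_parquet("data/predictions.parquet")  # needs pyarrow/fastparquet
# events.to_parquet("data/events.parquet")
```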

Selection

You can specify the sensitive cohorts for a more detailed study in usage_config.yml via the cohorts keyword. As mentioned above, there are three cohort attributes:

  1. age: the age group of the patient. Age groups are [0,10), [10,20), [20,50), [50,70) and 70+.

  2. race: the self-reported race of the patient. Race cohorts are ‘Caucasian’, ‘AfricanAmerican’, ‘Hispanic’, ‘Asian’, ‘Other’, ‘Unknown’.

  3. gender: the self-reported gender of the patient. Gender cohorts are ‘Female’, ‘Male’.
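The age groups above are half-open bins. As a quick illustration (purely a sketch, not part of the seismometer configuration), pandas can reproduce this binning:

```python
import pandas as pd

ages = pd.Series([4, 15, 43, 67, 82])

# Half-open bins [0,10), [10,20), [20,50), [50,70), then 70+;
# the 200 upper edge is an arbitrary cap for the open-ended group.
groups = pd.cut(ages, bins=[0, 10, 20, 50, 70, 200], right=False,
                labels=["[0-10)", "[10-20)", "[20-50)", "[50-70)", "70+"])

print(groups.tolist())  # ['[0-10)', '[10-20)', '[20-50)', '[50-70)', '70+']
```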

[4]:
sm.ExploreSubgroups()
[4]:

Feature Monitor

ℹ Info

This section is useful for digging into any of the potential data quality alerts identified for the diabetes dataset. Click the links below to open one of the reports.

Tips:

  • See feature monitoring for more details.

  • This section provides insight into model inputs, demographics, and the set of interventions and outcomes. During early stages this helps validate configuration; later it assists with detecting feature and population drift. Read through the alerts identified for your data, then dig deeper using the feature, demographic, and event summaries, or compare across targets or demographics.

  • Other Warnings: The variable profiles below will identify any concerning trends in feature distributions. Depending on the model, you may want to do additional configuration to silence these alerts until certain thresholds are met.

  • Run the sm.feature_summary()/sm.cohort_comparison_report()/sm.target_feature_summary() functions in the cells below to get a report for the corresponding dataset.

Reports

Feature Alerts

View automatically identified data quality issues for the model inputs in your dataset

[5]:
sm.feature_alerts()

Feature Summary Statistics and Plots

View the summary statistics and distributions for the model inputs in your dataset.

[6]:
sm.feature_summary()

Summarize Features by Cohort Subgroup

Run sm.cohort_comparison_report(), select two different groups to compare, and hit Generate Report to generate a comparative feature report.

[7]:
sm.cohort_comparison_report()

Summarize Features by Target

Run sm.target_feature_summary() to get a link to a breakdown of your features stratified by the different target values.

In this example, there is a single target of interest: the ‘readmitted’ column from the original dataset.

[8]:
sm.target_feature_summary()

Model Performance

Overall

ℹ Info

Model Performance Plots

See model performance plots for more details.

Tips:

  • Thresholds configured for the model are highlighted on the graphs.

  • Use sm.ExploreModelEvaluation() to get model evaluation plots for your model.
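To make the role of thresholds concrete, here is a small, self-contained sketch (synthetic scores, not the notebook's data) of how sensitivity and PPV change at the two configured thresholds, 0.10 and 0.20:

```python
# Sensitivity (recall) and PPV (precision) at a decision threshold,
# computed on a tiny synthetic sample -- not the notebook's data.
scores = [0.05, 0.08, 0.12, 0.15, 0.22, 0.35, 0.40, 0.90]
labels = [0,    0,    0,    1,    0,    1,    1,    1]

def metrics_at(threshold):
    # A score at or above the threshold counts as a positive prediction.
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return sensitivity, ppv

for t in (0.10, 0.20):
    sens, ppv = metrics_at(t)
    print(f"threshold={t:.2f}  sensitivity={sens:.2f}  PPV={ppv:.2f}")
```

Raising the threshold trades sensitivity for PPV, which is why the plots highlight the configured operating points.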

Visuals

[9]:
sm.ExploreModelEvaluation()
[9]:

ℹ Info

Exploration controls let you see the code used to generate a plot, so you can reproduce it programmatically.

[10]:
sm.plot_model_evaluation({}, 'Readmitted within 30 Days', 'Risk30DayReadmission', (0.10, 0.20), per_context=False)
[10]:

Overall Performance for Readmitted within 30 Days (Per Observation)


Fairness Overview

ℹ Info

See fairness audit for more details.

This section is useful for investigating the ‘fairness’ of the LightGBM model trained on the diabetes dataset.

[11]:
sm.ExploreFairnessAudit()
[11]:

Cohort Analysis

[12]:
sm.show_cohort_summaries(by_target=False, by_score=False)
[12]:

Cohort Summaries

Counts by Age

Cohort    Predictions  Entities
[0-10)            160       153
[10-20)           690       527
[20-50)         15020     10559
[50-70)         39118     27978
70+             44352     30770

Counts by Race

Cohort           Predictions  Entities
AfricanAmerican        18772     12643
Asian                    628       491
Caucasian              74220     52328
Hispanic                2017      1496
Other                   1471      1140
Unknown                 2232      1889

Counts by Gender

Cohort  Predictions  Entities
Female        53454     37238
Male          45886     32749

Counts by A1C

Cohort  Predictions  Entities
>7             3775      2881
>8             8137      6206
None          82506     57084
Norm           4922      3816

Counts by Taking Insulin

Cohort  Predictions  Entities
Down          11908      7717
No            46376     33331
Steady        30069     21639
Up            10987      7300

Counts by Taking Metformin

Cohort  Predictions  Entities
Down            574       455
No            79497     55089
Steady        18206     13581
Up             1063       862

ℹ Info

Cohort Performance Plots

See cohort comparisons for more details.

Tips:

  • Thresholds configured for the model are highlighted on the graphs.

  • Use sm.ExploreCohortEvaluation() to get model evaluation plots for your model split by cohort subgroups.

Visuals

[13]:
sm.ExploreCohortEvaluation()
[13]:

Outcomes

Success of integrating a predictive model depends on more than just the model’s performance. Often, it can be determined by how well the model is integrated and how effectively (and equitably) interventions are applied. This section is intended to help analyze interventions and outcomes across sensitive groups or risk categories. See analyzing outcomes for more details.

Lead-time Analysis

ℹ Info

Lead-time analysis reveals how much advance warning a high prediction gives before an event of interest. These analyses implicitly restrict the data to the positive cohort, as that is where the event is expected to occur. The visualization uses violin plots, where each subpopulation's distribution is represented as a vertical, mirrored density plot. The inner box within the violin highlights the interquartile range, while the central line indicates the median. When the distributions overlap significantly, the model is providing equal opportunity for action to be taken based on the scores across the cohort groups.
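As a minimal illustration of the underlying quantity (column names here are hypothetical, for illustration only), lead time is the gap between an event and the prediction that preceded it, computed on the positive cohort:

```python
import pandas as pd

# Toy positive-cohort data: one prediction time and one event time per
# entity. Column names are hypothetical, for illustration only.
df = pd.DataFrame({
    "entity": [1, 2, 3],
    "predict_time": pd.to_datetime(
        ["2024-10-08 09:00", "2024-10-09 12:00", "2024-10-10 07:30"]),
    "event_time": pd.to_datetime(
        ["2024-10-10 09:00", "2024-10-12 12:00", "2024-10-11 07:30"]),
})

# Lead time: how far in advance of the event the prediction was made.
df["lead_hours"] = (df["event_time"] - df["predict_time"]).dt.total_seconds() / 3600

# The violin plots visualize this distribution per cohort group.
print(df["lead_hours"].median())  # 48.0 hours of warning at the median
```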

Visuals

[14]:
sm.ExploreCohortLeadTime()
[14]:
[15]:
sm.ExploreCohortOutcomeInterventionTimes()
[15]:

Add Your Own Analysis

You can also incorporate other packages to create your own analyses. This example uses the seaborn package to create a heatmap of average score across different age groups and procedure counts.

sm.Seismogram().dataframe is a pandas DataFrame with merged predictions and events data.

[16]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import SVG

from seismometer.controls.explore import ExplorationModelSubgroupEvaluationWidget
from seismometer.plot.mpl.decorators import render_as_svg
from seismometer.data.filter import FilterRule

@render_as_svg
def plot_heat_map(
        cohort_dict: dict[str,tuple],
        target_col: str,
        score_col: str,
        thresholds: tuple[float],
        *, per_context: bool) -> plt.Figure:
    xcol = "age"
    ycol = "num_procedures"
    hue = score_col

    sg = sm.Seismogram()
    cohort_filter = FilterRule.from_cohort_dictionary(cohort_dict)
    data = cohort_filter.filter(sg.dataframe)[[xcol, ycol, hue]]
    data = data.groupby([xcol, ycol], observed=False)[[hue]].agg('mean').reset_index()
    data = data.pivot(index=ycol, columns=xcol, values=hue)

    ax = plt.axes()
    sns.heatmap(data=data, cbar_kws={'label': hue}, ax=ax, vmin=min(thresholds), vmax=max(thresholds), cmap="crest")
    ax.set_title(f"Heatmap of {hue} for {cohort_filter}", wrap=True, fontsize=10)
    plt.tight_layout()
    return plt.gcf()

ExplorationModelSubgroupEvaluationWidget("Heatmap", plot_heat_map)
[16]: