seismometer.data.cohorts.get_cohort_performance_data

seismometer.data.cohorts.get_cohort_performance_data(df, cohort_feature, *, proba, true='TARGET', splits=None, censor_threshold=10)

Generates a dataframe of performance metrics for each cohort at particular threshold values: accuracy, sensitivity, specificity, ppv, npv, and flag rate (the predicted positive condition rate).
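As a rough illustration of what the metrics listed above mean at a single threshold, the sketch below computes them from confusion-matrix counts with NumPy. The helper name `threshold_metrics`, the key `flag_rate`, and the choice to flag scores at or above the threshold are assumptions made for this example. They are not taken from seismometer's implementation, which may differ in naming, scaling, and edge-case handling.

```python
import numpy as np

def threshold_metrics(y_true, y_proba, threshold):
    """Confusion-matrix metrics at one threshold (illustrative helper, not part of seismometer)."""
    y_true = np.asarray(y_true).astype(bool)
    # Assumption: a score at or above the threshold counts as a positive prediction (a "flag").
    y_pred = np.asarray(y_proba) >= threshold

    tp = int(np.sum(y_pred & y_true))    # flagged, truly positive
    fp = int(np.sum(y_pred & ~y_true))   # flagged, truly negative
    tn = int(np.sum(~y_pred & ~y_true))  # not flagged, truly negative
    fn = int(np.sum(~y_pred & y_true))   # not flagged, truly positive
    n = tp + fp + tn + fn

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, n),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "flag_rate": ratio(tp + fp, n),
    }

# Made-up labels and scores, purely for demonstration.
metrics = threshold_metrics([1, 0, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.2, 0.1], threshold=0.5)
```

Repeating this calculation over a grid of thresholds, once per cohort, would give a table of the general shape described here.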

Parameters:
  • df (pd.DataFrame) – Dataframe of observations to use. Must contain the column specified in cohort_feature, and must also contain the columns specified by proba and true when those are given as strings rather than arrays.

  • cohort_feature (str) – Name of the dataframe column to split into cohorts. Currently supports numeric and categorical columns.

  • proba (Union[str, SeriesOrArray]) –

    The predictions made by the model under review.

    • If string - must be a column in the dataframe.

    • If series or array - must be the same length as the dataframe.

  • true (Union[str, SeriesOrArray], default="TARGET") –

    The true label being predicted.

    • If string - must be a column in the dataframe.

    • If series or array - must be the same length as the dataframe and contain int values.

  • splits (Optional[List], default=None) – The numeric values at which to split cohorts, or the category values to include; each category value is treated as its own split. If None, numeric columns are dichotomized at the mean.

  • censor_threshold (int, default=10) – Minimum number of observations in a cohort to calculate performance metrics.
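The interaction between splits and censor_threshold can be sketched with plain pandas. This is a minimal approximation under stated assumptions, not seismometer's code: the helper `split_and_censor` and the `cohort` column name are hypothetical, the mean is placed in the upper cohort, and a cohort is kept when its size is at least censor_threshold. The library may treat either boundary differently.

```python
import pandas as pd

def split_and_censor(df, cohort_feature, splits=None, censor_threshold=10):
    """Label rows with a cohort and drop cohorts that are too small (hypothetical helper)."""
    col = df[cohort_feature]
    if pd.api.types.is_numeric_dtype(col):
        # With no splits given, dichotomize at the mean, as the splits parameter describes.
        cuts = sorted(splits) if splits is not None else [col.mean()]
        bins = [float("-inf"), *cuts, float("inf")]
        # Assumption: intervals are closed on the left, so a cut value joins the upper cohort.
        cohort = pd.cut(col, bins=bins, right=False)
    else:
        # Categorical: each listed value is its own cohort; unlisted values become NaN.
        cohort = col.where(col.isin(splits)) if splits is not None else col

    labelled = (
        df.assign(cohort=cohort)
        .dropna(subset=["cohort"])       # drop rows that fall outside the requested cohorts
        .astype({"cohort": str})         # plain string labels are easy to count and display
    )
    # Assumption: keep a cohort when it has at least censor_threshold observations.
    sizes = labelled["cohort"].map(labelled["cohort"].value_counts())
    return labelled[sizes >= censor_threshold]

# Made-up data: mean age is about 44.2, giving 7 rows below the mean and 5 at or above it.
demo = pd.DataFrame({"age": [20] * 3 + [30] * 4 + [70] * 5})
kept = split_and_censor(demo, "age", censor_threshold=6)
```

In this example only the lower-age cohort survives, because the upper cohort has five rows and falls under the threshold of six.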

Returns:

Performance statistics for particular threshold values by cohort.

Return type:

pd.DataFrame