seismometer.data.cohorts.get_cohort_data

seismometer.data.cohorts.get_cohort_data(df, cohort_feature, *, proba, true='TARGET', splits=None)

Convenience function to create and format data for use in the cohort plots. Takes the cohort feature, the predictions, and the true labels, and outputs a dataset with the corresponding cohort labels.

When multiple columns are used, predictions from each column are appended to the result, so that rows sharing a cohort group are disjoint, while rows with different cohort columns may overlap.

Currently supports cohort_features of type Categorical (splits all categories) and Numeric (splits on specified values or at mean).
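
The two splitting modes described above can be illustrated with plain pandas. This is a minimal sketch of the idea, not seismometer's implementation: the helper name `label_cohorts`, the interval labels it produces, and the choice to put values equal to a split point in the upper bin are all assumptions made for illustration.

```python
import pandas as pd


def label_cohorts(series: pd.Series, splits=None) -> pd.Series:
    """Hypothetical helper: assign each row a cohort label."""
    if not pd.api.types.is_numeric_dtype(series):
        # Categorical: every category value is its own cohort.
        if splits is None:
            return series.astype("category")
        # Keep only the requested categories; other rows become missing.
        return series.where(series.isin(splits)).astype("category")

    # Numeric: split at the given values, or at the mean when none are given.
    edges = sorted(splits) if splits else [series.mean()]
    bins = [float("-inf"), *edges, float("inf")]
    # right=False is an assumption: a value equal to a split point
    # lands in the upper bin.
    return pd.cut(series, bins=bins, right=False)


ages = pd.Series([20, 30, 40, 70])
at_mean = label_cohorts(ages)                     # two cohorts, split at 40.0
at_values = label_cohorts(ages, splits=[30, 50])  # three cohorts
```

With no `splits`, the numeric column is cut into a below-mean and an at-or-above-mean cohort; with explicit values, each value adds one more boundary.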

Parameters:

df (pd.DataFrame) – Dataframe of observations to use for plotting. Must contain the column specified in cohort_feature, and must also contain the columns specified by proba and true when those are given as strings rather than arrays.
cohort_feature (str) – Name of the dataframe column to split. Currently supports numeric and categorical columns.
proba (Union[str, SeriesOrArray]) –
The predictions made by the model under review.
- If string - must be a column in the dataframe.
- If series or array - must be the same length as the dataframe.
true (Union[str, SeriesOrArray]) –
The true label associated with a prediction, by default “TARGET”.
- If string - must be a column in the dataframe.
- If series or array - must be the same length as the dataframe and contain int values.
splits (Optional[List]) – For numeric columns, the values at which to split cohorts; for categorical columns, the category values to include, each treated as its own split. By default None, in which case numeric columns are split into a dichotomy at the mean.
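
The string-or-array rule for proba and true can be sketched as follows. The helper `resolve_column` is hypothetical and only mirrors the rule as documented above; it is not taken from the seismometer source.

```python
import numpy as np
import pandas as pd


def resolve_column(df: pd.DataFrame, col) -> pd.Series:
    """Hypothetical helper: accept a column name or an array-like."""
    if isinstance(col, str):
        # String input must name an existing dataframe column.
        if col not in df.columns:
            raise KeyError(f"Column {col!r} not found in dataframe")
        return df[col]
    # Series or array input must line up row-for-row with the dataframe.
    values = np.asarray(col)
    if len(values) != len(df):
        raise ValueError("Array input must be the same length as the dataframe")
    return pd.Series(values, index=df.index)


frame = pd.DataFrame({"score": [0.1, 0.9]})
from_name = resolve_column(frame, "score")   # looked up by column name
from_array = resolve_column(frame, [1, 0])   # passed directly as values
```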

Returns:

Data - ingestible by plot_cohort_* functions.

Return type:

pd.DataFrame
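
A call might look like the sketch below, based only on the signature documented above. The example data and the column names `age`, `site`, and `score` are made up, and the columns of the returned dataframe are not shown because this page does not specify them. The import is guarded so the snippet still runs where seismometer is not installed.

```python
import pandas as pd

# Made-up example data; column names are illustrative only.
df = pd.DataFrame(
    {
        "age": [34, 51, 67, 45, 72, 29],
        "site": pd.Categorical(["north", "south", "north", "east", "south", "east"]),
        "score": [0.2, 0.7, 0.9, 0.4, 0.8, 0.1],
        "TARGET": [0, 1, 1, 0, 1, 0],
    }
)

try:
    from seismometer.data.cohorts import get_cohort_data
except ImportError:
    get_cohort_data = None  # seismometer is not installed

if get_cohort_data is not None:
    # Numeric feature with explicit split values; true defaults to "TARGET".
    by_age = get_cohort_data(df, "age", proba="score", splits=[40, 65])
    # Categorical feature; the true labels are passed as a series.
    by_site = get_cohort_data(df, "site", proba="score", true=df["TARGET"])
```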