seismometer.data.cohorts.get_cohort_data
- seismometer.data.cohorts.get_cohort_data(df, cohort_feature, *, proba, true='TARGET', splits=None)
Convenience function to create and format data for use in the cohort plots. Takes in information about the class, predictions, and true labels to output a dataset and corresponding labels.
When multiple columns are used, the predictions from each column are appended to the result, so rows sharing a cohort group are disjoint, while rows with different cohort columns may overlap.
Currently supports cohort_features of type Categorical (splits all categories) and Numeric (splits on specified values or at mean).
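The splitting behavior described above can be illustrated with a small pandas sketch. This is not seismometer's implementation: `sketch_split` is a hypothetical helper, and the left-closed bins and the label format are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd


def sketch_split(series: pd.Series, splits=None) -> pd.Series:
    """Hypothetical helper: assign each row to a cohort group."""
    if isinstance(series.dtype, pd.CategoricalDtype):
        # Categorical: each category is its own group. When `splits` is
        # given, only the listed categories are kept; other rows become NaN.
        if splits is None:
            return series
        return series.cat.set_categories(list(splits))

    # Numeric: cut at the given values, or at the mean when none are given.
    cuts = [series.mean()] if splits is None else sorted(splits)
    edges = [-np.inf, *cuts, np.inf]
    # Assumed label format, one label per bin.
    labels = (
        [f"<{cuts[0]}"]
        + [f"{lo}-{hi}" for lo, hi in zip(cuts, cuts[1:])]
        + [f">={cuts[-1]}"]
    )
    # right=False makes each bin include its lower edge (an assumption).
    return pd.cut(series, bins=edges, labels=labels, right=False)


ages = pd.Series([20, 30, 40, 70])
by_mean = sketch_split(ages)              # two groups, split at the mean
by_values = sketch_split(ages, [30, 60])  # three groups

sex = pd.Series(["F", "M", "F", "U"], dtype="category")
only_f_m = sketch_split(sex, ["F", "M"])  # the "U" row is dropped to NaN
```

In this sketch a numeric column with no `splits` always yields exactly two groups, which matches the "dichotomy at the mean" default described for `splits`.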
- Parameters:
df (pd.DataFrame) – Dataframe of observations to use for plotting. Must contain the column specified in cohort_feature, and must also contain the columns specified by proba and true when those are given as strings rather than arrays.
cohort_feature (str) – Name of the dataframe column to split on. Currently supports numeric and categorical columns.
proba (Union[str, SeriesOrArray]) –
The predictions made by the model under review.
If string - must be a column in the dataframe.
If series or array - must be the same length as the dataframe.
true (Union[str, SeriesOrArray]) –
The true label associated with a prediction, by default “TARGET”.
If string - must be a column in the dataframe.
If series or array - must be the same length as the dataframe and contain int values.
splits (Optional[List]) – For numeric columns, the values at which to split cohorts. For categorical columns, the category values to include, with each category value treated as its own split. By default None. If None, numeric columns are split into a dichotomy at the mean.
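The rules for `proba` and `true` (a string names a dataframe column, while a series or array must match the dataframe's length) can be sketched as below. `resolve_column` is a hypothetical helper written for this illustration, not a seismometer function, and the positional alignment of array inputs to the dataframe index is an assumption.

```python
import numpy as np
import pandas as pd


def resolve_column(df: pd.DataFrame, value, name: str) -> pd.Series:
    """Hypothetical helper: turn a column name or array into a Series."""
    if isinstance(value, str):
        # String input must name an existing column.
        if value not in df.columns:
            raise KeyError(f"{name}={value!r} is not a column in the dataframe")
        return df[value]

    # Series or array input must have one entry per dataframe row.
    values = np.asarray(value)
    if len(values) != len(df):
        raise ValueError(f"{name} has length {len(values)}, expected {len(df)}")
    # Assumption: arrays are aligned to the dataframe by position.
    return pd.Series(values, index=df.index, name=name)


df = pd.DataFrame({"score": [0.1, 0.9], "TARGET": [0, 1]})
proba = resolve_column(df, "score", "proba")  # looked up by column name
true = resolve_column(df, [1, 0], "true")     # passed directly as a list
```

Checking the length up front gives a clear error instead of a silent misalignment between predictions and labels.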
- Returns:
Data ingestible by the plot_cohort_* functions.
- Return type:
pd.DataFrame