seismometer.data.cohorts.get_cohort_data

seismometer.data.cohorts.get_cohort_data(df, cohort_feature, *, proba, true='TARGET', splits=None)

Convenience function that creates and formats data for use in the cohort plots. It takes the cohort feature, the predictions, and the true labels, and outputs a dataset with the corresponding labels.

When multiple columns are used, the predictions from each column are appended to the result, so rows sharing a cohort group are disjoint while rows from different cohort columns may overlap.

Currently supports cohort features of type Categorical (one split per category) and Numeric (split at the specified values, or at the mean when none are given).

Parameters:
  • df (pd.DataFrame) – Dataframe of observations to use for plotting. Must contain the column specified in cohort_feature, and must also contain the columns specified by proba and true when those are given as strings rather than arrays.

  • cohort_feature (str) – Name of the dataframe column to split on. Currently supports numeric and categorical columns.

  • proba (Union[str, SeriesOrArray]) –

    The predictions made by the model under review.

    • If string - must be a column in the dataframe.

    • If series or array - must be the same length as the dataframe.

  • true (Union[str, SeriesOrArray]) –

    The true label associated with a prediction, by default “TARGET”.

    • If string - must be a column in the dataframe.

    • If series or array - must be the same length as the dataframe and contain int values.

  • splits (Optional[List]) – For numeric columns, the values at which to split cohorts; for categorical columns, the category values to include, each treated as its own split. By default None. If None, numeric columns are split into two groups at the mean and categorical columns are split on all categories.

Returns:

Data ingestible by the plot_cohort_* functions.

Return type:

pd.DataFrame
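
The splitting rules described for the splits parameter can be illustrated with a minimal sketch in plain Python. This is not the seismometer implementation: the helper names numeric_cohort_bins and categorical_cohort_values are hypothetical, and the handling of values that fall exactly on a split point (placed in the upper bin here) is an assumption that may differ from the library.

```python
from bisect import bisect_right
from statistics import mean


def numeric_cohort_bins(values, splits=None):
    """Return a bin index per value (hypothetical helper, not seismometer API)."""
    # With no splits given, fall back to a two-way split at the mean.
    edges = sorted(splits) if splits else [mean(values)]
    # bisect_right counts the edges <= value, which serves as the bin index.
    # Assumption: a value equal to an edge lands in the upper bin.
    return [bisect_right(edges, v) for v in values]


def categorical_cohort_values(values, splits=None):
    """Keep only the listed categories; None keeps every category."""
    keep = set(values) if splits is None else set(splits)
    # Values outside the requested categories are marked None (excluded).
    return [v if v in keep else None for v in values]


ages = [20, 30, 40, 70]
default_bins = numeric_cohort_bins(ages)           # dichotomy at the mean (40)
custom_bins = numeric_cohort_bins(ages, [25, 50])  # three bins from two split points

sites = ["A", "B", "C", "A"]
all_sites = categorical_cohort_values(sites)               # every category kept
some_sites = categorical_cohort_values(sites, ["A", "C"])  # only A and C kept
```

In this sketch each distinct bin index or retained category value plays the role of one cohort, which mirrors the "each category value as its own split" behavior described above.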