mealy.error_analyzer

Classes

class mealy.error_analyzer.ErrorAnalyzer(primary_model, feature_names=None, param_grid=None, probability_threshold=0.5, random_state=65537)

ErrorAnalyzer analyzes the errors of a prediction model on a test set.

It uses the model predictions and the ground truth target to compute the model errors on the test set. It then trains a Decision Tree, called an Error Analyzer Tree, on the same test set, using the model error as the target. The nodes of the decision tree are distinct segments of errors that can be studied individually.
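The idea described above can be sketched with scikit-learn alone. This is an illustration of the technique, not mealy's implementation: the synthetic data, the RandomForestClassifier used as primary model, and the fixed max_depth=3 are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a primary model (both assumed for illustration).
X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
primary_model = RandomForestClassifier(n_estimators=20, random_state=0)
primary_model.fit(X_train, y_train)

# Model error on the test set: 1 = wrongly predicted, 0 = correctly predicted.
error_target = (primary_model.predict(X_test) != y_test).astype(int)

# Train a decision tree on the same test set with the error as target.
error_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
error_tree.fit(X_test, error_target)

# Each test sample falls into one leaf; a leaf is one segment of the data.
leaf_ids = error_tree.apply(X_test)
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    print(f"leaf {leaf}: {in_leaf.sum()} samples, "
          f"{error_target[in_leaf].sum()} errors")
```

Leaves with a high share of errors point to regions of the feature space where the primary model performs poorly.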

Parameters
  • primary_model (sklearn.base.BaseEstimator or sklearn.pipeline.Pipeline) – a sklearn model to analyze. Either an estimator or a Pipeline containing a ColumnTransformer with the preprocessing steps and an estimator as last step.

  • feature_names (list of str) – list of feature names. Defaults to None.

  • param_grid (dict) – hyper-parameter values of the decision tree (see _error_tree) to explore by grid search.

  • random_state (int) – random seed.

_error_tree

the estimator used to train the Error Analyzer Tree

Type

DecisionTreeClassifier

evaluate(X, y, output_format='str')

Evaluate performance of ErrorAnalyzer on the given test data and labels. Return ErrorAnalyzer summary metrics regarding the Error Tree.

Parameters
  • X (numpy.ndarray or pandas.DataFrame) – feature data from a test set, used to evaluate the primary predictor and train an Error Analyzer Tree.

  • y (numpy.ndarray or pandas.DataFrame) – target data from a test set, used to evaluate the primary predictor and train an Error Analyzer Tree.

  • output_format (string) – Return format used for the report. Valid values are ‘dict’ or ‘str’. Defaults to ‘str’.

Returns

dictionary or string report storing different metrics regarding the Error Decision Tree.

Return type

dict or str
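A minimal usage sketch based only on the signatures documented on this page. The helper name run_error_analysis is hypothetical, and the sketch assumes that primary_model has already been fitted and that mealy is installed.

```python
def run_error_analysis(primary_model, X_test, y_test, feature_names=None):
    """Fit an ErrorAnalyzer on a test set and return its evaluation report.

    Hypothetical helper: it only chains the documented constructor,
    fit() and evaluate() calls.
    """
    # Imported lazily so this sketch can be loaded without mealy installed.
    from mealy.error_analyzer import ErrorAnalyzer

    analyzer = ErrorAnalyzer(primary_model, feature_names=feature_names)
    # Train the Error Analyzer Tree on the test data.
    analyzer.fit(X_test, y_test)
    # 'dict' and 'str' are the documented output formats.
    return analyzer.evaluate(X_test, y_test, output_format="dict")
```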

fit(X, y)

Fit the Error Analyzer Tree.

Trains the Error Analyzer Tree, a Decision Tree that discriminates between samples that are correctly predicted and samples that are wrongly predicted (errors) by the primary model.

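Parameters
  • X (numpy.ndarray or pandas.DataFrame) – feature data from a test set, used to evaluate the primary predictor and train an Error Analyzer Tree.

  • y (numpy.ndarray or pandas.DataFrame) – target data from a test set, used to evaluate the primary predictor and train an Error Analyzer Tree.
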
get_error_leaf_summary(leaf_selector=None, add_path_to_leaves=False, output_format='dict', rank_by='total_error_fraction')

Return summary information regarding leaves.

Parameters
  • leaf_selector (None, int or array-like) – The leaves whose information will be returned:
      * int: only return information on the leaf with the corresponding id
      * array-like: only return information on the leaves corresponding to these ids
      * None (default): return information on all the leaves

  • add_path_to_leaves (bool) – Whether to add the path through the tree from the root to each selected leaf. Defaults to False.

  • output_format (string) – Return format used for the report. Valid values are ‘dict’ or ‘str’. Defaults to ‘dict’.

  • rank_by (str) – Ranking criterion for the leaves. Valid values are:
      * ‘total_error_fraction’ (default): rank by the fraction of total error in the node
      * ‘purity’: rank by the purity (ratio of wrongly predicted samples over the total number of node samples)
      * ‘class_difference’: rank by the difference between the number of wrongly and correctly predicted samples in a node

Returns

list of reports (as dictionary or string) with different information on each selected leaf.

Return type

dict or str
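To make the three rank_by criteria concrete, the sketch below computes them by hand for a few made-up leaves. The leaf counts, the field names n_errors and n_corrects, and the descending sort order are assumptions for illustration. They do not describe the structure of the report returned by get_error_leaf_summary.

```python
# Hypothetical leaves: counts of wrongly and correctly predicted samples.
leaves = [
    {"id": 3, "n_errors": 40, "n_corrects": 10},
    {"id": 5, "n_errors": 25, "n_corrects": 75},
    {"id": 6, "n_errors": 5, "n_corrects": 0},
]
total_errors = sum(leaf["n_errors"] for leaf in leaves)


def criteria(leaf):
    """Compute the three ranking criteria for one leaf."""
    n_samples = leaf["n_errors"] + leaf["n_corrects"]
    return {
        # Share of all errors that fall into this leaf.
        "total_error_fraction": leaf["n_errors"] / total_errors,
        # Wrongly predicted samples over all samples in the leaf.
        "purity": leaf["n_errors"] / n_samples,
        # Wrongly predicted minus correctly predicted samples.
        "class_difference": leaf["n_errors"] - leaf["n_corrects"],
    }


def rank(leaves, rank_by):
    """Return leaf ids sorted by the chosen criterion, highest first (assumed)."""
    ordered = sorted(leaves, key=lambda leaf: criteria(leaf)[rank_by], reverse=True)
    return [leaf["id"] for leaf in ordered]


for criterion in ("total_error_fraction", "purity", "class_difference"):
    print(criterion, rank(leaves, criterion))
```

The three criteria can order the same leaves differently: a small leaf containing only errors ranks first by purity but last by total error fraction.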