Custom refit strategy of a grid search with cross-validation#

This example shows how a classifier is optimized by cross-validation, which is done using the GridSearchCV object on a development set that comprises only half of the available labeled data.

The performance of the selected hyper-parameters and trained model is then measured on a dedicated evaluation set that was not used during the model selection step.

More details on tools available for model selection can be found in the sections on Cross-validation: evaluating estimator performance and Tuning the hyper-parameters of an estimator.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

The dataset#

We will work with the digits dataset. The goal is to classify handwritten digit images. We transform the problem into a binary classification for easier understanding: the goal is to identify whether a digit is 8 or not.

from sklearn import datasets

digits = datasets.load_digits()

In order to train a classifier on images, we need to flatten them into vectors. Each image of 8 by 8 pixels needs to be transformed into a vector of 64 pixels. Thus, we will get a final data array of shape (n_images, n_pixels).

n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target == 8
print(
    f"The number of images is {X.shape[0]} and each image contains {X.shape[1]} pixels"
)
The number of images is 1797 and each image contains 64 pixels

As presented in the introduction, the data will be split into a training and a testing set of equal size.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
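
As an aside, the positive class (the digit 8) represents only about 10% of the samples; if one wanted to preserve that class ratio in both halves, a stratified split could be used. The following is only a sketch of that variant (the names with the _s suffix are hypothetical; the example itself keeps the plain split above):

# Hypothetical stratified variant: preserve the class ratio in both halves.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)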

Defining our grid-search strategy#

We will select a classifier by searching the best hyper-parameters on folds of the training set. To do this, we need to define the scores used to select the best candidate.

scores = ["precision", "recall"]
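
As a side note, the scoring parameter also accepts scorer objects instead of metric names. A minimal sketch, assuming we wanted the same two metrics built explicitly with make_scorer (the name scores_as_scorers is hypothetical; this form is useful when a metric needs non-default arguments):

from sklearn.metrics import make_scorer, precision_score, recall_score

# Equivalent to ["precision", "recall"], spelled out as scorer objects.
scores_as_scorers = {
    "precision": make_scorer(precision_score),
    "recall": make_scorer(recall_score),
}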

We can also define a function to be passed to the refit parameter of the GridSearchCV instance. It will implement the custom strategy to select the best candidate from the cv_results_ attribute of the GridSearchCV. Once the candidate is selected, it is automatically refitted by the GridSearchCV instance.

Here, the strategy is to short-list the models which are the best in terms of precision and recall. From the selected models, we finally pick the fastest one at predicting. Notice that these custom choices are completely arbitrary.

import pandas as pd


def print_dataframe(filtered_cv_results):
    """Pretty print for filtered dataframe"""
    for mean_precision, std_precision, mean_recall, std_recall, params in zip(
        filtered_cv_results["mean_test_precision"],
        filtered_cv_results["std_test_precision"],
        filtered_cv_results["mean_test_recall"],
        filtered_cv_results["std_test_recall"],
        filtered_cv_results["params"],
    ):
        print(
            f"precision: {mean_precision:0.3f}{std_precision:0.03f}),"
            f" recall: {mean_recall:0.3f}{std_recall:0.03f}),"
            f" for {params}"
        )
    print()


def refit_strategy(cv_results):
    """Define the strategy to select the best estimator.

    The strategy defined here is to filter out all results below a precision
    threshold of 0.98, rank the remaining by recall and keep all models within
    one standard deviation of the best by recall. Once these models are
    selected, we pick the fastest model at predicting.

    Parameters
    ----------
    cv_results : dict of numpy (masked) ndarrays
        CV results as returned by the `GridSearchCV`.

    Returns
    -------
    best_index : int
        The index of the best estimator as it appears in `cv_results`.
    """
    # print the info about the grid-search for the different scores
    precision_threshold = 0.98

    cv_results_ = pd.DataFrame(cv_results)
    print("All grid-search results:")
    print_dataframe(cv_results_)

    # Filter out all results below the threshold
    high_precision_cv_results = cv_results_[
        cv_results_["mean_test_precision"] > precision_threshold
    ]

    print(f"Models with a precision higher than {precision_threshold}:")
    print_dataframe(high_precision_cv_results)

    high_precision_cv_results = high_precision_cv_results[
        [
            "mean_score_time",
            "mean_test_recall",
            "std_test_recall",
            "mean_test_precision",
            "std_test_precision",
            "rank_test_recall",
            "rank_test_precision",
            "params",
        ]
    ]

    # Select the most performant models in terms of recall
    # (within 1 sigma from the best)
    best_recall_std = high_precision_cv_results["mean_test_recall"].std()
    best_recall = high_precision_cv_results["mean_test_recall"].max()
    best_recall_threshold = best_recall - best_recall_std

    high_recall_cv_results = high_precision_cv_results[
        high_precision_cv_results["mean_test_recall"] > best_recall_threshold
    ]
    print(
        "Out of the previously selected high precision models, we keep all\n"
        "the models within one standard deviation of the highest recall model:"
    )
    print_dataframe(high_recall_cv_results)

    # From the best candidates, select the fastest model to predict
    fastest_top_recall_high_precision_index = high_recall_cv_results[
        "mean_score_time"
    ].idxmin()

    print(
        "\nThe selected final model is the fastest to predict out of the previously\n"
        "selected subset of best models based on precision and recall.\n"
        "Its scoring time is:\n\n"
        f"{high_recall_cv_results.loc[fastest_top_recall_high_precision_index]}"
    )

    return fastest_top_recall_high_precision_index
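
Before wiring this function into the grid search, we can sanity-check the contract a refit callable must satisfy: it receives the cv_results_ dict and returns an integer index of the best candidate. The toy cv_results below is hypothetical, constructed only to exercise the function:

import numpy as np

# Hypothetical two-candidate cv_results with the keys refit_strategy reads.
toy_cv_results = {
    "mean_score_time": np.array([0.010, 0.002]),
    "mean_test_precision": np.array([0.990, 0.985]),
    "std_test_precision": np.array([0.010, 0.010]),
    "rank_test_precision": np.array([1, 2]),
    "mean_test_recall": np.array([0.90, 0.89]),
    "std_test_recall": np.array([0.05, 0.04]),
    "rank_test_recall": np.array([1, 2]),
    "params": [{"C": 1, "kernel": "linear"}, {"C": 10, "kernel": "linear"}],
}
best_index = refit_strategy(toy_cv_results)
assert best_index in (0, 1)  # a valid integer index into the candidates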

Tuning hyper-parameters#

Once we have defined our strategy to select the best model, we define the values of the hyper-parameters and create the grid-search instance:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

tuned_parameters = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

grid_search = GridSearchCV(
    SVC(), tuned_parameters, scoring=scores, refit=refit_strategy
)
grid_search.fit(X_train, y_train)
All grid-search results:
precision: 1.000 (±0.000), recall: 0.854 (±0.063), for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 1.000 (±0.000), recall: 0.257 (±0.061), for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
precision: 1.000 (±0.000), recall: 0.877 (±0.069), for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 0.968 (±0.039), recall: 0.780 (±0.083), for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
precision: 1.000 (±0.000), recall: 0.877 (±0.069), for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 0.905 (±0.058), recall: 0.889 (±0.074), for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
precision: 1.000 (±0.000), recall: 0.877 (±0.069), for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 0.904 (±0.058), recall: 0.890 (±0.073), for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
precision: 0.695 (±0.073), recall: 0.743 (±0.065), for {'C': 1, 'kernel': 'linear'}
precision: 0.643 (±0.066), recall: 0.757 (±0.066), for {'C': 10, 'kernel': 'linear'}
precision: 0.611 (±0.028), recall: 0.744 (±0.044), for {'C': 100, 'kernel': 'linear'}
precision: 0.618 (±0.039), recall: 0.744 (±0.044), for {'C': 1000, 'kernel': 'linear'}

Models with a precision higher than 0.98:
precision: 1.000 (±0.000), recall: 0.854 (±0.063), for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 1.000 (±0.000), recall: 0.257 (±0.061), for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
precision: 1.000 (±0.000), recall: 0.877 (±0.069), for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 1.000 (±0.000), recall: 0.877 (±0.069), for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 1.000 (±0.000), recall: 0.877 (±0.069), for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}

Out of the previously selected high precision models, we keep all
the models within one standard deviation of the highest recall model:
precision: 1.000 (±0.000), recall: 0.854 (±0.063), for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 1.000 (±0.000), recall: 0.877 (±0.069), for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 1.000 (±0.000), recall: 0.877 (±0.069), for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 1.000 (±0.000), recall: 0.877 (±0.069), for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}


The selected final model is the fastest to predict out of the previously
selected subset of best models based on precision and recall.
Its scoring time is:

mean_score_time                                         0.006393
mean_test_recall                                        0.853676
std_test_recall                                         0.063184
mean_test_precision                                          1.0
std_test_precision                                           0.0
rank_test_recall                                               6
rank_test_precision                                            1
params                 {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
Name: 0, dtype: object
GridSearchCV(estimator=SVC(),
             param_grid=[{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],
                          'kernel': ['rbf']},
                         {'C': [1, 10, 100, 1000], 'kernel': ['linear']}],
             refit=<function refit_strategy at 0x74b49ed164c0>,
             scoring=['precision', 'recall'])


The parameters selected by the grid-search with our custom strategy are:

grid_search.best_params_
{'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
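
When refit is a callable, GridSearchCV still exposes the refitted model as best_estimator_ and the index returned by the callable as best_index_; best_score_, however, is not available, since there is no single refit metric. A quick check:

# The model refitted on the full training set with the selected parameters.
print(grid_search.best_estimator_)
# The index our callable returned into cv_results_.
print(grid_search.best_index_)
# Accessing grid_search.best_score_ would raise, as refit is a callable.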

Finally, we evaluate the fine-tuned model on the left-out evaluation set: the grid_search object has automatically been refit on the full training set with the parameters selected by our custom refit strategy.

We can use the classification report to compute standard classification metrics on the left-out set:

from sklearn.metrics import classification_report

y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       False       0.98      1.00      0.99       807
        True       1.00      0.85      0.92        92

    accuracy                           0.98       899
   macro avg       0.99      0.92      0.95       899
weighted avg       0.98      0.98      0.98       899
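
To tie the evaluation back to the two metrics the search optimized, we can also compute them directly on the held-out set; a short cross-check matching the True row of the report above:

from sklearn.metrics import precision_score, recall_score

# Precision and recall for the positive class ("is the digit an 8?").
print(f"precision: {precision_score(y_test, y_pred):.3f}")
print(f"recall: {recall_score(y_test, y_pred):.3f}")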

Note

The problem is too easy: the hyperparameter plateau is too flat and the output model is the same for precision and recall, with ties in quality.

Total running time of the script: (0 minutes 10.408 seconds)

Related examples

Precision-Recall

Balance model complexity and cross-validated score

Comparing randomized search and grid search for hyper parameter estimation

Recursive feature elimination with cross-validation

Gallery generated by Sphinx-Gallery