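Sample pipeline for text feature extraction and evaluation

The dataset used in this example is the 20 newsgroups text dataset, which is automatically downloaded, cached, and reused for the document classification example.

In this example, we tune the hyperparameters of a particular classifier using RandomizedSearchCV. For a demonstration of the performance of some other classifiers, see the Classification of text documents using sparse features notebook.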

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

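Data loading

We load two categories from the training set. You can adjust the number of categories by adding their names to the list, or by setting categories=None when calling the dataset loader fetch_20newsgroups to get all 20 of them.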

from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "talk.religion.misc",
]

data_train = fetch_20newsgroups(
    subset="train",
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=("headers", "footers", "quotes"),
)

data_test = fetch_20newsgroups(
    subset="test",
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=("headers", "footers", "quotes"),
)

print(f"Loading 20 newsgroups dataset for {len(data_train.target_names)} categories:")
print(data_train.target_names)
print(f"{len(data_train.data)} documents")

Loading 20 newsgroups dataset for 2 categories:
['alt.atheism', 'talk.religion.misc']
857 documents

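Pipeline with hyperparameter tuning

We define a pipeline that combines a text feature vectorizer with a classifier that is simple yet effective for text classification.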

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline(
    [
        ("vect", TfidfVectorizer()),
        ("clf", ComplementNB()),
    ]
)
pipeline

Pipeline(steps=[('vect', TfidfVectorizer()), ('clf', ComplementNB())])
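To see how the two steps work together before any tuning, the following minimal sketch fits the same kind of pipeline on a tiny corpus. The documents and labels (`toy_docs`, `toy_labels`) are made up for illustration and are not part of the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 1 = religion-like text, label 0 = atheism-like text.
toy_docs = [
    "god faith church prayer",
    "church bible faith",
    "atheism reason evidence",
    "evidence science reason",
]
toy_labels = [1, 1, 0, 0]

toy_pipeline = Pipeline(
    [
        ("vect", TfidfVectorizer()),  # raw text -> sparse TF-IDF features
        ("clf", ComplementNB()),  # classifier trained on those features
    ]
)
toy_pipeline.fit(toy_docs, toy_labels)

# The pipeline accepts raw strings at prediction time as well.
pred = toy_pipeline.predict(["faith and prayer", "science and evidence"])
print(pred)
```

Because the vectorizer lives inside the pipeline, it is refit on each training fold during cross-validation, so no vocabulary statistics leak from the validation fold.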


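We define a grid of hyperparameters to be explored by RandomizedSearchCV. Using GridSearchCV instead would explore all the possible combinations on the grid, which can be costly to compute, whereas the parameter n_iter of RandomizedSearchCV controls the number of different random combinations that are evaluated. Notice that setting n_iter larger than the number of possible combinations in the grid would lead to repeating already-explored combinations. We search for the best parameter combination for both the feature extraction (vect__) and the classifier (clf__).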

import numpy as np

parameter_grid = {
    "vect__max_df": (0.2, 0.4, 0.6, 0.8, 1.0),
    "vect__min_df": (1, 3, 5, 10),
    "vect__ngram_range": ((1, 1), (1, 2)),  # unigrams or bigrams
    "vect__norm": ("l1", "l2"),
    "clf__alpha": np.logspace(-6, 6, 13),
}

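In this case, n_iter=40 is not an exhaustive search of the hyperparameter grid. In practice it would be interesting to increase n_iter to get a more informative analysis, at the cost of a longer computing time. We can reduce that time by parallelizing the evaluation of the parameter combinations, which is done by raising the number of CPUs used through the parameter n_jobs.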
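A quick count shows how small a share of the grid 40 iterations cover. The sketch below rebuilds the grid with plain Python sequences (a list of powers of ten stands in for `np.logspace(-6, 6, 13)`) and multiplies the number of values per hyperparameter.

```python
import math

# Same grid as in this example, with a plain list in place of np.logspace.
grid = {
    "vect__max_df": (0.2, 0.4, 0.6, 0.8, 1.0),
    "vect__min_df": (1, 3, 5, 10),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "vect__norm": ("l1", "l2"),
    "clf__alpha": [10.0**k for k in range(-6, 7)],
}

# The number of combinations is the product of the number of values per key.
n_combinations = math.prod(len(values) for values in grid.values())
n_iter = 40
print(f"{n_combinations} combinations, {n_iter / n_combinations:.1%} explored")
```

With 5 x 4 x 2 x 2 x 13 = 1040 combinations, the random search evaluates fewer than 4% of them.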

from pprint import pprint

from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=parameter_grid,
    n_iter=40,
    random_state=0,
    n_jobs=2,
    verbose=1,
)

print("Performing grid search...")
print("Hyperparameters to be evaluated:")
pprint(parameter_grid)

Performing grid search...
Hyperparameters to be evaluated:
{'clf__alpha': array([1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01,
       1.e+02, 1.e+03, 1.e+04, 1.e+05, 1.e+06]),
 'vect__max_df': (0.2, 0.4, 0.6, 0.8, 1.0),
 'vect__min_df': (1, 3, 5, 10),
 'vect__ngram_range': ((1, 1), (1, 2)),
 'vect__norm': ('l1', 'l2')}
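To get a feel for what a single candidate looks like, one can draw a few combinations directly with `ParameterSampler` from `sklearn.model_selection`. This is only an illustrative sketch: the choice of `n_iter=3` is arbitrary, and the candidates drawn here are not claimed to match those evaluated by the search in this example.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

parameter_grid = {
    "vect__max_df": (0.2, 0.4, 0.6, 0.8, 1.0),
    "vect__min_df": (1, 3, 5, 10),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "vect__norm": ("l1", "l2"),
    "clf__alpha": np.logspace(-6, 6, 13),
}

# Each sampled candidate is a dict with one value per hyperparameter.
sampled = list(ParameterSampler(parameter_grid, n_iter=3, random_state=0))
for candidate in sampled:
    print(candidate)
```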

from time import time

t0 = time()
random_search.fit(data_train.data, data_train.target)
print(f"Done in {time() - t0:.3f}s")

Fitting 5 folds for each of 40 candidates, totalling 200 fits
Done in 29.171s

print("Best parameters combination found:")
best_parameters = random_search.best_estimator_.get_params()
for param_name in sorted(parameter_grid.keys()):
    print(f"{param_name}: {best_parameters[param_name]}")

Best parameters combination found:
clf__alpha: 0.01
vect__max_df: 0.2
vect__min_df: 1
vect__ngram_range: (1, 1)
vect__norm: l1

test_accuracy = random_search.score(data_test.data, data_test.target)
print(
    "Accuracy of the best parameters using the inner CV of "
    f"the random search: {random_search.best_score_:.3f}"
)
print(f"Accuracy on test set: {test_accuracy:.3f}")

Accuracy of the best parameters using the inner CV of the random search: 0.816
Accuracy on test set: 0.709

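The prefixes vect and clf are required to avoid possible ambiguities in the pipeline, but they are not necessary for visualizing the results. For that reason, we define a function that renames the tuned hyperparameters to improve readability.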

import pandas as pd


def shorten_param(param_name):
    """Remove components' prefixes in param_name."""
    if "__" in param_name:
        return param_name.rsplit("__", 1)[1]
    return param_name


cv_results = pd.DataFrame(random_search.cv_results_)
cv_results = cv_results.rename(shorten_param, axis=1)
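A few sample names show what the renaming does. The function is repeated here so the snippet runs on its own; the input strings are examples of the kinds of names found in the grid and in the columns of `cv_results_`.

```python
def shorten_param(param_name):
    """Remove components' prefixes in param_name."""
    if "__" in param_name:
        return param_name.rsplit("__", 1)[1]
    return param_name


# Grid keys, a cv_results_ column with the "param_" prefix, and a plain column.
examples = ["vect__ngram_range", "clf__alpha", "param_vect__max_df", "mean_test_score"]
shortened = [shorten_param(name) for name in examples]
for before, after in zip(examples, shortened):
    print(f"{before} -> {after}")
```

Since everything up to the last double underscore is dropped, a column such as `param_vect__max_df` ends up with the same short name as the grid key `vect__max_df`, which is why the shortened grid keys can later be used to look up columns of the renamed frame.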

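We can use plotly.express.scatter to visualize the trade-off between scoring time and mean test score (i.e. the "CV score"). Hovering over a given point displays the corresponding parameters. The error bars correspond to one standard deviation, as computed over the different folds of the cross-validation.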

import plotly.express as px

param_names = [shorten_param(name) for name in parameter_grid.keys()]
labels = {
    "mean_score_time": "CV Score time (s)",
    "mean_test_score": "CV score (accuracy)",
}
fig = px.scatter(
    cv_results,
    x="mean_score_time",
    y="mean_test_score",
    error_x="std_score_time",
    error_y="std_test_score",
    hover_data=param_names,
    labels=labels,
)
fig.update_layout(
    title={
        "text": "trade-off between scoring time and mean test score",
        "y": 0.95,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    }
)
fig

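Notice that the cluster of models in the upper-left corner of the plot has the best trade-off between accuracy and scoring time. In this case, using bigrams increases the required scoring time without considerably improving the accuracy of the pipeline.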
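One plausible reason for the extra scoring time is that bigrams enlarge the feature space. The sketch below counts the vocabulary learned with and without bigrams on two made-up sentences; it illustrates the mechanism only and is not a measurement on the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus.
toy_docs = ["the quick brown fox", "the lazy brown dog"]

# Unigrams only.
n_unigram_features = len(TfidfVectorizer(ngram_range=(1, 1)).fit(toy_docs).vocabulary_)
# Unigrams plus bigrams.
n_bigram_features = len(TfidfVectorizer(ngram_range=(1, 2)).fit(toy_docs).vocabulary_)

print(n_unigram_features, n_bigram_features)
```

On real text the growth is usually much larger than in this toy case, since most word pairs are rare and each distinct pair becomes its own feature.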

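Note

For more information on how to customize automated tuning to maximize the score and minimize the scoring time, see the example notebook Custom refit strategy of a grid search with cross-validation.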
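As a rough idea of what such a strategy can look like, the search classes accept a callable for `refit` that receives `cv_results_` and returns the index of the chosen candidate. The function below is a hypothetical sketch, not the strategy used in the referenced notebook: it picks the fastest candidate whose mean score lies within one standard deviation of the best one. The numbers in `toy_results` are made up.

```python
import numpy as np


def fastest_within_one_std(cv_results):
    """Return the index of the fastest candidate close enough to the best score."""
    mean_score = np.asarray(cv_results["mean_test_score"])
    std_score = np.asarray(cv_results["std_test_score"])
    score_time = np.asarray(cv_results["mean_score_time"])

    best = np.argmax(mean_score)
    # Accept any candidate scoring at least (best mean - best std).
    threshold = mean_score[best] - std_score[best]
    candidates = np.flatnonzero(mean_score >= threshold)
    # Among the accepted candidates, keep the one that scores fastest.
    return int(candidates[np.argmin(score_time[candidates])])


# Made-up results for four candidates.
toy_results = {
    "mean_test_score": [0.80, 0.82, 0.81, 0.70],
    "std_test_score": [0.02, 0.03, 0.03, 0.01],
    "mean_score_time": [0.05, 0.20, 0.10, 0.01],
}
chosen = fastest_within_one_std(toy_results)
print(chosen)
```

Such a function would be passed as `refit=fastest_within_one_std` when constructing the search object.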

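We can also use plotly.express.parallel_coordinates to further visualize the mean test score as a function of the tuned hyperparameters. This helps to find interactions between more than two hyperparameters and gives intuition about their relevance for improving the performance of the pipeline.

We apply a math.log10 transformation on the alpha axis to spread the active range and improve the readability of the plot. A value \(x\) on that axis is to be understood as \(10^x\).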

import math

column_results = param_names + ["mean_test_score", "mean_score_time"]

transform_funcs = dict.fromkeys(column_results, lambda x: x)
# Using a logarithmic scale for alpha
transform_funcs["alpha"] = math.log10
# L1 norms are mapped to index 1, and L2 norms to index 2
transform_funcs["norm"] = lambda x: 2 if x == "l2" else 1
# Unigrams are mapped to index 1 and bigrams to index 2
transform_funcs["ngram_range"] = lambda x: x[1]

fig = px.parallel_coordinates(
    cv_results[column_results].apply(transform_funcs),
    color="mean_test_score",
    color_continuous_scale=px.colors.sequential.Viridis_r,
    labels=labels,
)
fig.update_layout(
    title={
        "text": "Parallel coordinates plot of text classifier pipeline",
        "y": 0.99,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    }
)
fig
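The per-column transformations used for the parallel coordinates plot can be checked in isolation on a few values. The helper names `norm_to_index` and `ngram_to_index` are introduced here for illustration only; they mirror the lambdas stored in `transform_funcs`.

```python
import math

# alpha is shown on a log10 axis: a tick at x stands for alpha = 10**x.
alphas = [1e-6, 1e-2, 1.0, 1e6]
alpha_ticks = [round(math.log10(a)) for a in alphas]
print(alpha_ticks)


def norm_to_index(norm):
    """Map the categorical norm to a numeric axis position."""
    return 2 if norm == "l2" else 1


def ngram_to_index(ngram_range):
    """Use the upper bound of the n-gram range as the axis position."""
    return ngram_range[1]


print(norm_to_index("l1"), norm_to_index("l2"))
print(ngram_to_index((1, 1)), ngram_to_index((1, 2)))
```

Mapping the categorical values to numbers is needed because every axis of a parallel coordinates plot is numeric.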

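The parallel coordinates plot displays the values of the hyperparameters in different columns, while the performance metric is color-coded. You can select a range of results by clicking and holding on any axis of the plot. You can then slide (move) the range selection, and combine selections on two axes to see their intersection. Clicking once more on the same axis undoes a selection.

In particular for this hyperparameter search, it is interesting to notice that the top-performing models do not seem to depend on the regularization norm, but they do depend on a trade-off between max_df, min_df, and the regularization strength alpha. The reason is that including noisy features (i.e. max_df close to \(1.0\) or min_df close to \(0\)) tends to overfit and therefore requires stronger regularization to compensate. Having fewer features requires less regularization and less scoring time.

The best accuracy scores are obtained when alpha is between \(10^{-6}\) and \(10^0\), regardless of the hyperparameter norm.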
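The way max_df and min_df prune the feature set can be seen on a small made-up corpus. In this sketch, "apple" appears in every document and three of the other terms appear only once; the corpus and thresholds are chosen purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: document frequencies are apple=4, banana=2, others=1.
toy_docs = ["apple banana", "apple cherry", "apple banana date", "apple egg"]


def learned_vocabulary(**vectorizer_params):
    """Fit a TfidfVectorizer and return its vocabulary as a sorted list."""
    return sorted(TfidfVectorizer(**vectorizer_params).fit(toy_docs).vocabulary_)


vocab_default = learned_vocabulary()
# min_df=2 drops terms that occur in fewer than 2 documents.
vocab_min_df = learned_vocabulary(min_df=2)
# max_df=0.8 drops terms that occur in more than 80% of the documents.
vocab_max_df = learned_vocabulary(max_df=0.8)

print(vocab_default)
print(vocab_min_df)
print(vocab_max_df)
```

Raising min_df removes rare terms and lowering max_df removes near-ubiquitous ones, so both shrink the feature set the classifier has to regularize.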

Total running time of the script: (0 minutes 31.252 seconds)