Partial Dependence and Individual Conditional Expectation Plots#

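Partial dependence plots show the dependence between the target function [2] and a set of features of interest, marginalizing over the values of all other features (the complement features). Because of the limits of human perception, the set of features of interest must be small (usually one or two), so they are typically chosen among the most important features.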
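The marginalization described above can be sketched with plain NumPy. This is a minimal brute-force illustration, not the routine used by scikit-learn: `model_predict` is a hypothetical stand-in for a fitted model, and `partial_dependence_1d` is a helper name introduced here for illustration only.

```python
import numpy as np


def model_predict(X):
    # Hypothetical stand-in for the predict method of a fitted model.
    return X[:, 0] ** 2 + 3.0 * X[:, 1]


def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averaged = []
    for value in grid:
        X_modified = X.copy()
        X_modified[:, feature] = value  # overwrite the feature of interest
        averaged.append(predict(X_modified).mean())  # average out the other features
    return np.array(averaged)


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
grid = np.linspace(-2.0, 2.0, 5)
pd_values = partial_dependence_1d(model_predict, X, feature=0, grid=grid)
print(pd_values)
```

For this toy model, the partial dependence on feature 0 is the curve `grid ** 2` shifted by the average contribution of the complement feature.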

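Similarly, an individual conditional expectation (ICE) plot [3] shows the dependence between the target function and a feature of interest. However, unlike partial dependence plots, which show the average effect of the features of interest, ICE plots visualize the dependence of the prediction on a feature for each sample separately, with one line per sample. ICE plots support only one feature of interest.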
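As a rough sketch of the relationship between the two plots, the snippet below builds one ICE line per sample and averages them to recover the partial dependence curve. The model function and the helper `ice_curves` are hypothetical names used only for this illustration.

```python
import numpy as np


def model_predict(X):
    # Hypothetical model with an interaction between its two features.
    return X[:, 0] * X[:, 1]


def ice_curves(predict, X, feature, grid):
    """Return an array of shape (n_samples, n_grid): one ICE line per sample."""
    lines = np.empty((X.shape[0], grid.shape[0]))
    for j, value in enumerate(grid):
        X_modified = X.copy()
        X_modified[:, feature] = value
        lines[:, j] = predict(X_modified)
    return lines


rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
grid = np.linspace(-1.0, 1.0, 11)

ice = ice_curves(model_predict, X, feature=0, grid=grid)
pdp = ice.mean(axis=0)  # the PDP is the average of the ICE lines
slopes = (ice[:, -1] - ice[:, 0]) / (grid[-1] - grid[0])  # one slope per sample
```

Because of the interaction, each sample's line has a different slope (equal to its value of the second feature), which is exactly the information that the averaged curve hides.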

This example shows how to obtain partial dependence and ICE plots from an MLPRegressor and a HistGradientBoostingRegressor trained on the bike sharing dataset. The example is inspired by [1].

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

1-way partial dependence with different models#

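In this section, we will compute 1-way partial dependence with two different machine-learning models: (i) a multi-layer perceptron and (ii) a gradient-boosting model. With these two models, we illustrate how to compute and interpret both partial dependence plots (PDP), for numerical and categorical features, and individual conditional expectation (ICE) plots.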

Multi-layer perceptron#

Let's fit an MLPRegressor and compute single-variable partial dependence plots.

from time import time

from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

print("Training MLPRegressor...")
tic = time()
mlp_model = make_pipeline(
    mlp_preprocessor,
    MLPRegressor(
        hidden_layer_sizes=(30, 15),
        learning_rate_init=0.01,
        early_stopping=True,
        random_state=0,
    ),
)
mlp_model.fit(X_train, y_train)
print(f"done in {time() - tic:.3f}s")
print(f"Test R2 score: {mlp_model.score(X_test, y_test):.2f}")

Training MLPRegressor...
done in 0.679s
Test R2 score: 0.61

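We configured a pipeline using the preprocessor created specifically for the neural network, and tuned the neural network size and learning rate to get a reasonable compromise between training time and predictive performance on a test set.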

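Importantly, the features of this tabular dataset have very different dynamic ranges. Neural networks tend to be very sensitive to features with varying scales, and forgetting to preprocess the numerical features would lead to a very poor model.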
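To make the scaling issue concrete, here is a small sketch on synthetic data. It does not reproduce `mlp_preprocessor`, which is defined outside this section; using `StandardScaler` here is an assumption made for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic numerical features with very different dynamic ranges.
X = np.column_stack(
    [
        rng.uniform(0.0, 1.0, size=200),
        rng.uniform(0.0, 1000.0, size=200),
    ]
)

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

print("std before scaling:", X.std(axis=0))
print("std after scaling: ", X_scaled.std(axis=0))
```

After scaling, both columns have zero mean and unit standard deviation, so neither dominates the gradient updates of the network merely because of its units.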

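It would be possible to get even higher predictive performance with a larger neural network, but training would also be significantly more expensive.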

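Note that it is important to check that the model is accurate enough on a test set before plotting the partial dependence, since there would be little use in explaining the impact of a given feature on the prediction function of a model with poor predictive performance. In this regard, our MLP model works reasonably well (test R2 of 0.61).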
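The score printed above is the coefficient of determination R2, which is what `score` returns for scikit-learn regressors. As a reminder of what it measures, here is a minimal hand-written version; `r2_score_manual` is a name introduced for this sketch and the data are made up.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_true = np.array([1.0, 2.0, 3.0, 4.0])
r2_good = r2_score_manual(y_true, [1.1, 1.9, 3.2, 3.8])
# A model that always predicts the mean of the targets scores exactly 0.
r2_baseline = r2_score_manual(y_true, np.full(4, y_true.mean()))
print(r2_good, r2_baseline)
```

A value of 1 means perfect predictions and 0 means the model does no better than predicting the mean, which gives a scale for judging whether a model is worth interpreting.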

We will plot the averaged partial dependence.

import matplotlib.pyplot as plt

from sklearn.inspection import PartialDependenceDisplay

common_params = {
    "subsample": 50,
    "n_jobs": 2,
    "grid_resolution": 20,
    "random_state": 0,
}

print("Computing partial dependence plots...")
features_info = {
    # features of interest
    "features": ["temp", "humidity", "windspeed", "season", "weather", "hour"],
    # type of partial dependence plot
    "kind": "average",
    # information regarding categorical features
    "categorical_features": categorical_features,
}
tic = time()
_, ax = plt.subplots(ncols=3, nrows=2, figsize=(9, 8), constrained_layout=True)
display = PartialDependenceDisplay.from_estimator(
    mlp_model,
    X_train,
    **features_info,
    ax=ax,
    **common_params,
)
print(f"done in {time() - tic:.3f}s")
_ = display.figure_.suptitle(
    (
        "Partial dependence of the number of bike rentals\n"
        "for the bike rental dataset with an MLPRegressor"
    ),
    fontsize=16,
)

[Figure: Partial dependence of the number of bike rentals for the bike rental dataset with an MLPRegressor]

Computing partial dependence plots...
done in 0.649s

Gradient boosting#

Let's now fit a HistGradientBoostingRegressor and compute the partial dependence on the same features. We also use the specific preprocessor created for this model.

from sklearn.ensemble import HistGradientBoostingRegressor

print("Training HistGradientBoostingRegressor...")
tic = time()
hgbdt_model = make_pipeline(
    hgbdt_preprocessor,
    HistGradientBoostingRegressor(
        categorical_features=categorical_features,
        random_state=0,
        max_iter=50,
    ),
)
hgbdt_model.fit(X_train, y_train)
print(f"done in {time() - tic:.3f}s")
print(f"Test R2 score: {hgbdt_model.score(X_test, y_test):.2f}")

Training HistGradientBoostingRegressor...
done in 0.128s
Test R2 score: 0.62

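Here, we used the gradient boosting model mostly with its default hyperparameters (only max_iter is set, to 50) and without any rescaling of the numerical features, since tree-based models are naturally robust to monotonic transformations of numerical features.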
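The robustness to monotonic transformations can be checked on a toy problem. The sketch below uses a single `DecisionTreeRegressor` on synthetic data rather than the gradient boosting pipeline of this example; it is an illustration of the property, not a reproduction of the model above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0.1, 3.0, 60).reshape(-1, 1)
y = np.sin(2.0 * X[:, 0])
X_transformed = np.exp(X)  # strictly increasing transformation of the feature

tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_transformed = DecisionTreeRegressor(max_depth=3, random_state=0).fit(
    X_transformed, y
)

pred_raw = tree_raw.predict(X)
pred_transformed = tree_transformed.predict(X_transformed)
print(np.allclose(pred_raw, pred_transformed))
```

A strictly increasing transformation preserves the ordering of the samples, so the tree finds the same partitions of the training set and makes the same predictions for them.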

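Note that on this tabular dataset, gradient boosting machines are both faster to train (0.128 s versus 0.679 s above) and slightly more accurate (test R2 of 0.62 versus 0.61) than the neural network. They are also significantly cheaper to tune: the default hyperparameters tend to work well, which is often not the case for neural networks.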

We will plot the partial dependence for some of the numerical and categorical features.

print("Computing partial dependence plots...")
tic = time()
_, ax = plt.subplots(ncols=3, nrows=2, figsize=(9, 8), constrained_layout=True)
display = PartialDependenceDisplay.from_estimator(
    hgbdt_model,
    X_train,
    **features_info,
    ax=ax,
    **common_params,
)
print(f"done in {time() - tic:.3f}s")
_ = display.figure_.suptitle(
    (
        "Partial dependence of the number of bike rentals\n"
        "for the bike rental dataset with a gradient boosting"
    ),
    fontsize=16,
)

[Figure: Partial dependence of the number of bike rentals for the bike rental dataset with a gradient boosting]

Computing partial dependence plots...
done in 1.218s

Analysis of the plots#

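We will first look at the PDPs for the numerical features. For both models, the general trend of the temperature PDP is that the number of bike rentals increases with temperature. We can make a similar analysis for the humidity feature, but with the opposite trend: the number of bike rentals decreases as humidity increases. Finally, we see the same decreasing trend for the wind speed feature: for both models, the number of bike rentals decreases as wind speed increases. We also observe that the MLPRegressor has much smoother predictions than the HistGradientBoostingRegressor.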

Now, we will look at the partial dependence plots for the categorical features.

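We observe that spring is the lowest bar for the season feature. For the weather feature, the rain category is the lowest bar. Regarding the hour feature, we see two peaks, around 7 am and 6 pm. These findings are in line with the observations we made earlier on the dataset.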

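However, it is worth noting that we may create meaningless synthetic samples if the features are correlated: computing the partial dependence overwrites the feature of interest while leaving the other features unchanged, which can produce combinations of values that never occur in the data.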
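The following sketch illustrates this caveat on synthetic data with two nearly identical features. All names and numbers are made up for the illustration; the point is only that forcing one feature to a grid value yields rows unlike anything observed.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 1.0, size=200)
x1 = x0 + rng.normal(0.0, 0.01, size=200)  # x1 is almost a copy of x0
X = np.column_stack([x0, x1])

# In the observed data the two features never differ by much.
largest_observed_gap = np.abs(X[:, 0] - X[:, 1]).max()

# Partial dependence forces feature 0 to a grid value for every sample.
X_synthetic = X.copy()
X_synthetic[:, 0] = 1.0
synthetic_gap = np.abs(X_synthetic[:, 0] - X_synthetic[:, 1])

fraction_unrealistic = np.mean(synthetic_gap > largest_observed_gap)
print(f"{fraction_unrealistic:.0%} of the synthetic rows lie outside the observed range")
```

The model is then queried in regions where it has seen no training data, so the averaged predictions there should be interpreted with care.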

ICE vs. PDP#

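A PDP is an average of the marginal effects of the features: we average the response over all samples of the provided set. Thus, some effects could be hidden. In this regard, it is possible to plot each individual response. This representation is called an individual conditional expectation (ICE) plot. In the plot below, we show 50 randomly selected ICE lines for the temperature and humidity features.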

print("Computing partial dependence plots and individual conditional expectation...")
tic = time()
_, ax = plt.subplots(ncols=2, figsize=(6, 4), sharey=True, constrained_layout=True)

features_info = {
    "features": ["temp", "humidity"],
    "kind": "both",
    "centered": True,
}

display = PartialDependenceDisplay.from_estimator(
    hgbdt_model,
    X_train,
    **features_info,
    ax=ax,
    **common_params,
)
print(f"done in {time() - tic:.3f}s")
_ = display.figure_.suptitle("ICE and PDP representations", fontsize=16)

Computing partial dependence plots and individual conditional expectation...
done in 0.507s

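We see that the ICE for the temperature feature gives us some additional information: some of the ICE lines are flat, while others show a decrease of the dependence for temperatures above 35 degrees Celsius. We observe a similar pattern for the humidity feature: some of the ICE lines show a sharp decrease when the humidity is above 80%.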
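The call above passes `centered=True`, which makes the ICE and PD lines start at the origin of the y-axis. A minimal sketch of that idea, assuming centering amounts to subtracting each line's value at the first grid point, with made-up ICE values:

```python
import numpy as np

# Hypothetical ICE lines: 3 samples evaluated on a 4-point grid.
ice = np.array(
    [
        [10.0, 12.0, 15.0, 14.0],
        [50.0, 52.0, 55.0, 54.0],
        [30.0, 30.0, 30.0, 30.0],
    ]
)

# Shift every line so that it starts at zero.
ice_centered = ice - ice[:, [0]]
pdp_centered = ice_centered.mean(axis=0)
print(ice_centered)
print(pdp_centered)
```

Once centered, the first two lines coincide even though their raw levels differ by 40, while the flat third line stands out. This makes differences in shape easier to compare than differences in offset.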

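Not all ICE lines are parallel, which indicates that the model finds interactions between features. We can repeat the experiment by constraining the gradient boosting model to not use any interactions between features, using the parameter interaction_cst: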

from sklearn.base import clone

interaction_cst = [[i] for i in range(X_train.shape[1])]
hgbdt_model_without_interactions = (
    clone(hgbdt_model)
    .set_params(histgradientboostingregressor__interaction_cst=interaction_cst)
    .fit(X_train, y_train)
)
print(f"Test R2 score: {hgbdt_model_without_interactions.score(X_test, y_test):.2f}")

Test R2 score: 0.38

_, ax = plt.subplots(ncols=2, figsize=(6, 4), sharey=True, constrained_layout=True)

features_info["centered"] = False
display = PartialDependenceDisplay.from_estimator(
    hgbdt_model_without_interactions,
    X_train,
    **features_info,
    ax=ax,
    **common_params,
)
_ = display.figure_.suptitle("ICE and PDP representations", fontsize=16)

2D interaction plots#

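PDPs with two features of interest enable us to visualize interactions between them. However, ICEs cannot be plotted, and thus interpreted, in an easy manner for two features. We will show the representation available in from_estimator, which is a 2D heatmap.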
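As a rough sketch of what a two-way partial dependence contains, the brute-force version below forces a pair of features to every combination of grid values and averages the predictions. The model function and the helper `partial_dependence_2d` are hypothetical names used only here.

```python
import numpy as np


def model_predict(X):
    # Hypothetical model: interaction between features 0 and 1, plus feature 2.
    return X[:, 0] * X[:, 1] + X[:, 2]


def partial_dependence_2d(predict, X, features, grids):
    """Average prediction for every pair of grid values of the two features."""
    f0, f1 = features
    g0, g1 = grids
    surface = np.empty((len(g0), len(g1)))
    for i, v0 in enumerate(g0):
        for j, v1 in enumerate(g1):
            X_modified = X.copy()
            X_modified[:, f0] = v0
            X_modified[:, f1] = v1
            surface[i, j] = predict(X_modified).mean()
    return surface


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
g0 = np.linspace(-1.0, 1.0, 4)
g1 = np.linspace(-1.0, 1.0, 5)
surface = partial_dependence_2d(model_predict, X, (0, 1), (g0, g1))
print(surface.shape)
```

The number of model evaluations grows with the product of the two grid sizes, so two-way plots are typically more expensive to compute than one-way plots.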

print("Computing partial dependence plots...")
features_info = {
    "features": ["temp", "humidity", ("temp", "humidity")],
    "kind": "average",
}
_, ax = plt.subplots(ncols=3, figsize=(10, 4), constrained_layout=True)
tic = time()
display = PartialDependenceDisplay.from_estimator(
    hgbdt_model,
    X_train,
    **features_info,
    ax=ax,
    **common_params,
)
print(f"done in {time() - tic:.3f}s")
_ = display.figure_.suptitle(
    "1-way vs 2-way of numerical PDP using gradient boosting", fontsize=16
)

[Figure: 1-way vs 2-way of numerical PDP using gradient boosting]

Computing partial dependence plots...
done in 8.160s

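The two-way partial dependence plot shows the dependence of the number of bike rentals on joint values of temperature and humidity. We clearly see an interaction between the two features. For temperatures higher than 20 degrees Celsius, the impact of humidity on the number of bike rentals seems to be independent of the temperature.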

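On the other hand, for temperatures lower than 20 degrees Celsius, both the temperature and the humidity continuously impact the number of bike rentals.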

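Furthermore, the slope of the impact ridge at the 20 degrees Celsius threshold depends strongly on the humidity level: the ridge is steep under dry conditions but much smoother under wetter conditions, above 70% humidity.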
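One way to make "seeing an interaction" on a 2D plot more concrete is to compare the two-way surface with the sum of the two one-way curves: for a purely additive prediction function the two agree up to a constant. The sketch below is an illustrative check on toy functions, not something computed by this example; all helper names are introduced here.

```python
import numpy as np


def pd_curve(predict, X, feature, grid):
    """One-way partial dependence of `feature` on `grid`."""
    out = np.empty(len(grid))
    for i, v in enumerate(grid):
        Xm = X.copy()
        Xm[:, feature] = v
        out[i] = predict(Xm).mean()
    return out


def pd_surface(predict, X, g0, g1):
    """Two-way partial dependence of features 0 and 1 on the grid g0 x g1."""
    out = np.empty((len(g0), len(g1)))
    for i, v0 in enumerate(g0):
        for j, v1 in enumerate(g1):
            Xm = X.copy()
            Xm[:, 0], Xm[:, 1] = v0, v1
            out[i, j] = predict(Xm).mean()
    return out


def interaction_residual(predict, X, g0, g1):
    """Two-way surface minus what the one-way curves alone would predict."""
    additive = (
        pd_curve(predict, X, 0, g0)[:, None]
        + pd_curve(predict, X, 1, g1)[None, :]
        - predict(X).mean()
    )
    return pd_surface(predict, X, g0, g1) - additive


rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
grid = np.linspace(-1.0, 1.0, 5)

res_additive = interaction_residual(lambda X: X[:, 0] ** 2 + np.sin(X[:, 1]), X, grid, grid)
res_interacting = interaction_residual(lambda X: X[:, 0] * X[:, 1], X, grid, grid)
print(np.abs(res_additive).max(), np.abs(res_interacting).max())
```

The residual vanishes for the additive toy function and is clearly non-zero for the product, which mirrors the visual impression of a surface that cannot be obtained by stacking two independent curves.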

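We now contrast those results with the same plots computed for the model constrained to learn a prediction function that does not depend on such non-linear feature interactions.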

print("Computing partial dependence plots...")
features_info = {
    "features": ["temp", "humidity", ("temp", "humidity")],
    "kind": "average",
}
_, ax = plt.subplots(ncols=3, figsize=(10, 4), constrained_layout=True)
tic = time()
display = PartialDependenceDisplay.from_estimator(
    hgbdt_model_without_interactions,
    X_train,
    **features_info,
    ax=ax,
    **common_params,
)
print(f"done in {time() - tic:.3f}s")
_ = display.figure_.suptitle(
    "1-way vs 2-way of numerical PDP using gradient boosting", fontsize=16
)

Computing partial dependence plots...
done in 7.703s

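The 1D partial dependence plots for the model constrained to not model feature interactions show local spikes for each feature individually, in particular for the "humidity" feature. Those spikes might reflect a degraded behavior of the model, which attempts to compensate for the forbidden interactions by overfitting particular training points. Note that the predictive performance of this model, as measured on the test set, is significantly worse than that of the original, unconstrained model (test R2 of 0.38 versus 0.62).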

Also note that the number of local spikes visible on those plots depends on the grid resolution parameter of the PD plot itself.
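The effect of the grid resolution can be illustrated on a made-up response with a very narrow bump. The function and the grids below are hypothetical and only show that a coarse, evenly spaced grid can step over a feature that a finer grid reveals.

```python
import numpy as np


def narrow_peak(x):
    # Hypothetical response with a very narrow bump centred at 0.503.
    return np.exp(-(((x - 0.503) / 0.005) ** 2))


coarse_grid = np.linspace(0.0, 1.0, 20)  # same size as grid_resolution=20
fine_grid = np.linspace(0.0, 1.0, 1001)

coarse_max = narrow_peak(coarse_grid).max()
fine_max = narrow_peak(fine_grid).max()
print(f"max on coarse grid: {coarse_max:.2e}, max on fine grid: {fine_max:.2f}")
```

With 20 points the bump falls between two grid values and is essentially invisible, whereas the fine grid samples it near its top. Spikes seen or missed in a PD plot should therefore be read with the grid resolution in mind.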

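Those local spikes result in a noisily gridded 2D PD plot. It is quite challenging to tell whether there are interactions between those features because of the high-frequency oscillations in the humidity feature. However, it can clearly be seen that the simple interaction effect observed when the temperature crosses the 20 degrees boundary is no longer visible for this model.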

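The partial dependence between categorical features provides a discrete representation that can be shown as a heatmap. For instance, the interaction between the season, the weather, and the target is shown below: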

print("Computing partial dependence plots...")
features_info = {
    "features": ["season", "weather", ("season", "weather")],
    "kind": "average",
    "categorical_features": categorical_features,
}
_, ax = plt.subplots(ncols=3, figsize=(14, 6), constrained_layout=True)
tic = time()
display = PartialDependenceDisplay.from_estimator(
    hgbdt_model,
    X_train,
    **features_info,
    ax=ax,
    **common_params,
)

print(f"done in {time() - tic:.3f}s")
_ = display.figure_.suptitle(
    "1-way vs 2-way PDP of categorical features using gradient boosting", fontsize=16
)

[Figure: 1-way vs 2-way PDP of categorical features using gradient boosting]

Computing partial dependence plots...
done in 0.732s

3D representation#

Let's make the same partial dependence plot for the interaction of the 2 features, this time in 3 dimensions.

# unused but required import for doing 3d projections with matplotlib < 3.2
import mpl_toolkits.mplot3d  # noqa: F401
import numpy as np

from sklearn.inspection import partial_dependence

fig = plt.figure(figsize=(5.5, 5))

features = ("temp", "humidity")
pdp = partial_dependence(
    hgbdt_model, X_train, features=features, kind="average", grid_resolution=10
)
XX, YY = np.meshgrid(pdp["grid_values"][0], pdp["grid_values"][1])
Z = pdp.average[0].T
ax = fig.add_subplot(projection="3d")
fig.add_axes(ax)

surf = ax.plot_surface(XX, YY, Z, rstride=1, cstride=1, cmap=plt.cm.BuPu, edgecolor="k")
ax.set_xlabel(features[0])
ax.set_ylabel(features[1])
fig.suptitle(
    "PD of number of bike rentals on\nthe temperature and humidity GBDT model",
    fontsize=16,
)
# pretty init view
ax.view_init(elev=22, azim=122)
clb = plt.colorbar(surf, pad=0.08, shrink=0.6, aspect=10)
clb.ax.set_title("Partial\ndependence")
plt.show()

[Figure: PD of number of bike rentals on the temperature and humidity GBDT model; colorbar: Partial dependence]

Total running time of the script: (0 minutes 25.018 seconds)
