Comparing Random Forests and Histogram Gradient Boosting models#

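In this example we compare the performance of Random Forest (RF) and Histogram Gradient Boosting (HGBT) models on a regression dataset, in terms of both score and computation time. All the concepts presented here apply to classification as well.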

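The comparison is made by varying the parameter that controls the number of trees in each estimator:

n_estimators controls the number of trees in the forest. It is a fixed number.
max_iter is the maximum number of iterations in a gradient boosting based model. For regression and binary classification problems, the number of iterations corresponds to the number of trees. The actual number of trees the model ends up with also depends on the stopping criteria.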

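HGBT uses gradient boosting to iteratively improve the model's performance by fitting each tree to the negative gradient of the loss function with respect to the predicted value. RFs, on the other hand, are based on bagging and aggregate the predictions of independently trained trees: a majority vote for classification and an average for regression.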
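The boosting idea can be sketched by hand for the squared-error loss, where the negative gradient with respect to the prediction is simply the residual. The loop below is a simplified illustration under that assumption, using `DecisionTreeRegressor` as the base learner on made-up data; it is not how `HistGradientBoostingRegressor` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1D regression problem (illustrative data only).
rng = np.random.RandomState(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y_toy, y_toy.mean())
initial_mse = np.mean((y_toy - prediction) ** 2)

for _ in range(50):
    # For the loss 0.5 * (y - f)^2, the negative gradient w.r.t. f is y - f.
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    # Move the prediction a small step in the direction fitted by the tree.
    prediction += learning_rate * tree.predict(X_toy)

final_mse = np.mean((y_toy - prediction) ** 2)
print(f"Training MSE: {initial_mse:.3f} -> {final_mse:.3f}")
```

Each tree here depends on the predictions of all previous trees, whereas the trees of a random forest are trained independently of one another on bootstrap samples.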

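See the User Guide for more information on ensemble models, or see Features in Histogram Gradient Boosting Trees for an example showcasing some other features of HGBT models.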

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Load dataset#

from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
n_samples, n_features = X.shape

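HGBT uses a histogram-based algorithm on binned feature values that can efficiently handle large datasets (tens of thousands of samples or more) with a high number of features (see Why it's faster). The scikit-learn implementation of RF does not use binning and relies on exact splitting, which can be computationally expensive.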
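To give an intuition for why binning helps, the sketch below maps a continuous feature to at most 255 integer bins (255 is the default value of the `max_bins` parameter of HGBT) using quantile-based edges. This is a simplified NumPy illustration on made-up data, not scikit-learn's internal binning code: the point is only that the number of candidate split thresholds drops from thousands of distinct values to a few hundred bin boundaries.

```python
import numpy as np

# Illustrative continuous feature with many distinct values.
rng = np.random.RandomState(0)
feature = rng.lognormal(size=10_000)

n_bins = 255
# Interior quantile edges: n_bins - 1 thresholds define n_bins bins.
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
# Each value is replaced by the index of the bin it falls into (0..254),
# which fits in a single unsigned byte.
binned = np.searchsorted(edges, feature, side="right").astype(np.uint8)

n_unique_raw = np.unique(feature).size
n_unique_binned = np.unique(binned).size
print(f"Distinct values: {n_unique_raw} raw vs {n_unique_binned} binned")
```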

print(f"The dataset consists of {n_samples} samples and {n_features} features")

The dataset consists of 20640 samples and 8 features

Compute score and computation times#

Notice that many parts of the implementation of HistGradientBoostingClassifier and HistGradientBoostingRegressor are parallelized by default.

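The implementation of RandomForestRegressor and RandomForestClassifier can also be run on multiple cores by using the n_jobs parameter, here set to match the number of physical cores on the host machine. See Parallelism for more information.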

import joblib

N_CORES = joblib.cpu_count(only_physical_cores=True)
print(f"Number of physical cores: {N_CORES}")

Number of physical cores: 2
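As a small sanity check on `n_jobs`, the sketch below fits the same forest with one and with two workers on a synthetic dataset (the data and parameter values are illustrative assumptions). With a fixed `random_state`, changing `n_jobs` is expected to affect only the wall-clock time, not the fitted model, so both forests should give matching predictions up to floating-point rounding.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic dataset, used only for illustration.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=5.0, random_state=0
)

# Same seed, different number of workers.
rf_serial = RandomForestRegressor(
    n_estimators=20, random_state=0, n_jobs=1
).fit(X_demo, y_demo)
rf_parallel = RandomForestRegressor(
    n_estimators=20, random_state=0, n_jobs=2
).fit(X_demo, y_demo)

# Compare predictions with a tolerance, since the order of floating-point
# summation may differ between runs.
same_predictions = np.allclose(
    rf_serial.predict(X_demo), rf_parallel.predict(X_demo)
)
print(f"Predictions match: {same_predictions}")
```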

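Unlike RF, HGBT models offer an early-stopping option (see Early stopping in Gradient Boosting) to avoid adding unnecessary new trees. Internally, the algorithm uses an out-of-sample set to compute the generalization performance of the model each time a tree is added. If the generalization performance has not improved for more than n_iter_no_change iterations, it stops adding trees.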
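The stopping rule can be sketched in a few lines of plain Python. The function `should_stop` below is a hypothetical helper written for illustration, and the tolerance `tol` and the example score lists are assumptions; it captures the idea of the criterion rather than reproducing scikit-learn's code. Boosting stops when none of the last `n_iter_no_change` validation scores beats the score recorded just before them.

```python
def should_stop(scores, n_iter_no_change=10, tol=1e-7):
    """Return True if the recent validation scores show no improvement.

    `scores` holds one validation score per boosting iteration
    (higher is better). Hypothetical helper, for illustration only.
    """
    # Not enough history yet to decide.
    if len(scores) < n_iter_no_change + 1:
        return False
    # Score recorded just before the last n_iter_no_change iterations.
    reference = scores[-(n_iter_no_change + 1)] + tol
    recent = scores[-n_iter_no_change:]
    # Stop only if none of the recent scores improves on the reference.
    return not any(score > reference for score in recent)


still_improving = should_stop([0.1, 0.2, 0.3, 0.4], n_iter_no_change=2)
plateaued = should_stop([0.5, 0.6, 0.6, 0.6], n_iter_no_change=2)
print(still_improving, plateaued)
```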

The other parameters of both models were tuned, but the procedure is not shown here to keep the example simple.

import pandas as pd

from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

models = {
    "Random Forest": RandomForestRegressor(
        min_samples_leaf=5, random_state=0, n_jobs=N_CORES
    ),
    "Hist Gradient Boosting": HistGradientBoostingRegressor(
        max_leaf_nodes=15, random_state=0, early_stopping=False
    ),
}
param_grids = {
    "Random Forest": {"n_estimators": [10, 20, 50, 100]},
    "Hist Gradient Boosting": {"max_iter": [10, 20, 50, 100, 300, 500]},
}
cv = KFold(n_splits=4, shuffle=True, random_state=0)

results = []
for name, model in models.items():
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=param_grids[name],
        return_train_score=True,
        cv=cv,
    ).fit(X, y)
    result = {"model": name, "cv_results": pd.DataFrame(grid_search.cv_results_)}
    results.append(result)

Note

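Tuning n_estimators for RF generally results in a waste of computing power. In practice, one just needs to ensure that it is large enough that doubling its value does not lead to a significant improvement of the test score.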
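One possible way to apply this doubling rule is sketched below on a small synthetic dataset. The helper `mean_cv_score`, the threshold `min_gain`, and the cap `max_estimators` are all hypothetical choices made for illustration; what counts as a significant improvement depends on the problem.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Small synthetic dataset, used only for illustration.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=10.0, random_state=0
)


def mean_cv_score(n_estimators):
    """Mean 3-fold cross-validated R2 score of a forest of the given size."""
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    return cross_val_score(model, X_demo, y_demo, cv=3).mean()


min_gain = 0.005  # hypothetical threshold for a "significant" improvement
max_estimators = 160  # hypothetical cap on the forest size

n_estimators = 10
score = mean_cv_score(n_estimators)
while n_estimators * 2 <= max_estimators:
    doubled_score = mean_cv_score(n_estimators * 2)
    # Stop as soon as doubling the forest no longer pays off.
    if doubled_score - score < min_gain:
        break
    n_estimators, score = n_estimators * 2, doubled_score

print(f"Selected n_estimators={n_estimators} (mean CV R2={score:.3f})")
```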

Plot results#

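We can use plotly.express.scatter to visualize the trade-off between elapsed computing time and mean test score. Hovering the cursor over a given point displays the corresponding parameters. Error bars correspond to one standard deviation as computed across the different folds of the cross-validation.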

import plotly.colors as colors
import plotly.express as px
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=1,
    cols=2,
    shared_yaxes=True,
    subplot_titles=["Train time vs score", "Predict time vs score"],
)
model_names = [result["model"] for result in results]
colors_list = colors.qualitative.Plotly * (
    len(model_names) // len(colors.qualitative.Plotly) + 1
)

for idx, result in enumerate(results):
    cv_results = result["cv_results"].round(3)
    model_name = result["model"]
    param_name = list(param_grids[model_name].keys())[0]
    cv_results[param_name] = cv_results["param_" + param_name]
    cv_results["model"] = model_name

    scatter_fig = px.scatter(
        cv_results,
        x="mean_fit_time",
        y="mean_test_score",
        error_x="std_fit_time",
        error_y="std_test_score",
        hover_data=param_name,
        color="model",
    )
    line_fig = px.line(
        cv_results,
        x="mean_fit_time",
        y="mean_test_score",
    )

    scatter_trace = scatter_fig["data"][0]
    line_trace = line_fig["data"][0]
    scatter_trace.update(marker=dict(color=colors_list[idx]))
    line_trace.update(line=dict(color=colors_list[idx]))
    fig.add_trace(scatter_trace, row=1, col=1)
    fig.add_trace(line_trace, row=1, col=1)

    scatter_fig = px.scatter(
        cv_results,
        x="mean_score_time",
        y="mean_test_score",
        error_x="std_score_time",
        error_y="std_test_score",
        hover_data=param_name,
    )
    line_fig = px.line(
        cv_results,
        x="mean_score_time",
        y="mean_test_score",
    )

    scatter_trace = scatter_fig["data"][0]
    line_trace = line_fig["data"][0]
    scatter_trace.update(marker=dict(color=colors_list[idx]))
    line_trace.update(line=dict(color=colors_list[idx]))
    fig.add_trace(scatter_trace, row=1, col=2)
    fig.add_trace(line_trace, row=1, col=2)

fig.update_layout(
    xaxis=dict(title="Train time (s) - lower is better"),
    yaxis=dict(title="Test R2 score - higher is better"),
    xaxis2=dict(title="Predict time (s) - lower is better"),
    legend=dict(x=0.72, y=0.05, traceorder="normal", borderwidth=1),
    title=dict(x=0.5, text="Speed-score trade-off of tree-based ensembles"),
)

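Both HGBT and RF models improve as the number of trees in the ensemble increases. However, the scores reach a plateau where adding new trees only makes fitting and scoring slower. The RF model reaches this plateau earlier and never reaches the test score of the largest HGBT model.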

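Note that the results shown in the plot above can change slightly across runs, and even more significantly when running on other machines: try running this example on your own local machine.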

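Overall, one should often observe that the histogram-based gradient boosting models uniformly dominate the random forest models in the "test score vs training speed" trade-off (the HGBT curve should lie to the top left of the RF curve, without ever crossing it). The "test score vs prediction speed" trade-off is more disputed, but it is most often favorable to HGBT. It is always a good idea to check both kinds of model (with hyperparameter tuning) and compare their performance on your specific problem to determine which one is the best fit. Still, HGBT almost always offers a more favorable speed-accuracy trade-off than RF, either with the default hyperparameters or when the cost of hyperparameter tuning is included.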

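There is one exception to this rule of thumb, though: when training a multiclass classification model with a large number of possible classes, HGBT internally fits one tree per class at each boosting iteration, while the trees used by RF models are naturally multiclass. This should improve the speed-accuracy trade-off of the RF models in that case.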

Total running time of the script: (0 minutes 58.300 seconds)