Nearest Neighbors Classification

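This example shows how to use KNeighborsClassifier. We train such a classifier on the iris dataset and observe how the decision boundaries differ depending on the parameter weights.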

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

Load the data

In this example, we use the iris dataset. We split the data into a train and a test dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X = iris.data[["sepal length (cm)", "sepal width (cm)"]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
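As a quick sanity check on the split above, the sketch below prints the sizes of the two subsets and the class proportions in each. It assumes the default test_size of train_test_split (25% of the samples) and relies on stratify=y to keep the three classes roughly balanced in both subsets.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X = iris.data[["sepal length (cm)", "sepal width (cm)"]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# With 150 samples and the default 25% test size, we expect 112 / 38 rows.
print(X_train.shape, X_test.shape)

# Stratification keeps each class close to one third of each subset.
train_proportions = y_train.value_counts(normalize=True).sort_index()
test_proportions = y_test.value_counts(normalize=True).sort_index()
print(train_proportions)
print(test_proportions)
```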

K-nearest neighbors classifier

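We want to use a k-nearest neighbors classifier that considers a neighborhood of 11 data points. Since our k-nearest neighbors model uses Euclidean distance to find the nearest neighbors, it is important to scale the data beforehand. For more details, refer to the example entitled Importance of Feature Scaling.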

Thus, we use a Pipeline to chain a scaler before our classifier.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

clf = Pipeline(
    steps=[("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=11))]
)
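To illustrate why scaling matters for a Euclidean-distance model, here is a minimal sketch on made-up data (X_toy, query, and the helper nearest_index are hypothetical and not part of the original example). One feature lives on a much larger scale than the other, so it dominates the raw distance; after standardization the nearest neighbor of the query changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is on a small scale, column 1 on a large one.
X_toy = np.array([[5.0, 1020.0], [1.1, 1100.0], [3.0, 2000.0], [3.0, 0.0]])
query = np.array([[1.0, 1000.0]])


def nearest_index(points, q):
    """Return the row index of the point closest to q in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))


# On raw data, the large-scale column decides which point is "closest".
raw_nearest = nearest_index(X_toy, query)

# After standardization, both columns contribute on a comparable scale.
scaler = StandardScaler().fit(X_toy)
scaled_nearest = nearest_index(scaler.transform(X_toy), scaler.transform(query))

print(raw_nearest, scaled_nearest)  # 0 1
```

On the raw data the closest point is row 0, which is far away along the first feature; once both features are standardized, row 1 becomes the nearest neighbor.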

Decision boundary

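Now, we fit two classifiers with different values of the parameter weights. We plot the decision boundary of each classifier as well as the original dataset to observe the difference.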

import matplotlib.pyplot as plt

from sklearn.inspection import DecisionBoundaryDisplay

_, axs = plt.subplots(ncols=2, figsize=(12, 5))

for ax, weights in zip(axs, ("uniform", "distance")):
    clf.set_params(knn__weights=weights).fit(X_train, y_train)
    disp = DecisionBoundaryDisplay.from_estimator(
        clf,
        X_test,
        response_method="predict",
        plot_method="pcolormesh",
        xlabel=iris.feature_names[0],
        ylabel=iris.feature_names[1],
        shading="auto",
        alpha=0.5,
        ax=ax,
    )
    scatter = disp.ax_.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, edgecolors="k")
    disp.ax_.legend(
        scatter.legend_elements()[0],
        iris.target_names,
        loc="lower left",
        title="Classes",
    )
    _ = disp.ax_.set_title(
        f"3-Class classification\n(k={clf[-1].n_neighbors}, weights={weights!r})"
    )

plt.show()

[Figure: two panels titled "3-Class classification (k=11, weights='uniform')" and "3-Class classification (k=11, weights='distance')"]

Conclusion

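We observe that the parameter weights has an impact on the decision boundary. When weights="uniform", all nearest neighbors have the same impact on the decision, whereas when weights="distance", the weight given to each neighbor is inversely proportional to the distance from that neighbor to the query point.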
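The difference between the two weighting schemes can be seen on a tiny one-dimensional sketch (the data X_toy, y_toy, and query are made up for illustration). With k=3, two distant neighbors of class 1 outvote one close neighbor of class 0 under uniform weights, while inverse-distance weights give the close neighbor the larger share: 1/0.5 = 2.0 for class 0 against 1/1.5 + 1/2.0 ≈ 1.17 for class 1.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1D data: one class-0 point near the query, two class-1 points farther away.
X_toy = np.array([[0.0], [2.0], [2.5]])
y_toy = np.array([0, 1, 1])
query = np.array([[0.5]])

predictions = {}
for weights in ("uniform", "distance"):
    knn = KNeighborsClassifier(n_neighbors=3, weights=weights).fit(X_toy, y_toy)
    predictions[weights] = int(knn.predict(query)[0])

print(predictions)  # {'uniform': 1, 'distance': 0}
```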

In some cases, taking the distance into account might improve the model.
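One way to check whether distance weighting helps on a given dataset is to compare held-out accuracy for both settings. The sketch below reuses the same data split and pipeline as this example; the dictionary test_accuracy is a name introduced here for illustration. The result depends on the data and the split, so treat it as a single measurement, not a general conclusion.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
X = iris.data[["sepal length (cm)", "sepal width (cm)"]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = Pipeline(
    steps=[("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=11))]
)

# Fit the same pipeline with each weighting scheme and score it on the test set.
test_accuracy = {}
for weights in ("uniform", "distance"):
    clf.set_params(knn__weights=weights).fit(X_train, y_train)
    test_accuracy[weights] = clf.score(X_test, y_test)

for weights, accuracy in test_accuracy.items():
    print(f"weights={weights!r}: test accuracy = {accuracy:.3f}")
```

For a more reliable comparison, cross-validation (for example with cross_val_score from sklearn.model_selection) would average over several splits instead of relying on one.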

Total running time of the script: (0 minutes 0.541 seconds)
