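Introducing the `set_output` API

This example demonstrates the set_output API, which configures transformers to output pandas DataFrames. The output format can be configured per estimator by calling the set_output method, or globally by calling set_config(transform_output="pandas"). For details, see SLEP018.

First, we load the iris dataset as a DataFrame to demonstrate the set_output API.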

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_train.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
60	5.0	2.0	3.5	1.0
1	4.9	3.0	1.4	0.2
8	4.4	2.9	1.4	0.2
93	5.0	2.3	3.3	1.0
106	4.9	2.5	4.5	1.7

To configure an estimator such as preprocessing.StandardScaler to return DataFrames, call set_output. This feature requires pandas to be installed.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")

scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
X_test_scaled.head()

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
39	-0.894264	0.798301	-1.271411	-1.327605
12	-1.244466	-0.086944	-1.327407	-1.459074
48	-0.660797	1.462234	-1.271411	-1.327605
23	-0.894264	0.576989	-1.159419	-0.933197
81	-0.427329	-1.414810	-0.039497	-0.275851

set_output can be called after fit to configure transform after the fact.

scaler2 = StandardScaler()

scaler2.fit(X_train)
X_test_np = scaler2.transform(X_test)
print(f"Default output type: {type(X_test_np).__name__}")

scaler2.set_output(transform="pandas")
X_test_df = scaler2.transform(X_test)
print(f"Configured pandas output type: {type(X_test_df).__name__}")

Default output type: ndarray
Configured pandas output type: DataFrame

In a pipeline.Pipeline, set_output configures all steps to output DataFrames.

from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    StandardScaler(), SelectPercentile(percentile=75), LogisticRegression()
)
clf.set_output(transform="pandas")
clf.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('selectpercentile', SelectPercentile(percentile=75)),
                ('logisticregression', LogisticRegression())])

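Each transformer in the pipeline is configured to return DataFrames. This means that the final logistic regression step receives the feature names of its input.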

clf[-1].feature_names_in_

array(['sepal length (cm)', 'petal length (cm)', 'petal width (cm)'],
      dtype=object)

Note

If set_params is used to replace a transformer, the new transformer has the default output format.

clf.set_params(standardscaler=StandardScaler())
clf.fit(X_train, y_train)
clf[-1].feature_names_in_

array(['x0', 'x2', 'x3'], dtype=object)

To keep the intended behaviour, call set_output on the new transformer beforehand:

scaler = StandardScaler().set_output(transform="pandas")
clf.set_params(standardscaler=scaler)
clf.fit(X_train, y_train)
clf[-1].feature_names_in_

array(['sepal length (cm)', 'petal length (cm)', 'petal width (cm)'],
      dtype=object)

Next, we load the titanic dataset to demonstrate set_output with compose.ColumnTransformer and heterogeneous data.

from sklearn.datasets import fetch_openml

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

The set_output API can be configured globally by calling set_config with transform_output set to "pandas".

from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

set_config(transform_output="pandas")

num_pipe = make_pipeline(SimpleImputer(), StandardScaler())
num_cols = ["age", "fare"]
ct = ColumnTransformer(
    (
        ("numerical", num_pipe, num_cols),
        (
            "categorical",
            OneHotEncoder(
                sparse_output=False, drop="if_binary", handle_unknown="ignore"
            ),
            ["embarked", "sex", "pclass"],
        ),
    ),
    verbose_feature_names_out=False,
)
clf = make_pipeline(ct, SelectPercentile(percentile=50), LogisticRegression())
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.7621951219512195

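With the global configuration, all transformers output DataFrames. This allows us to easily plot the logistic regression coefficients with the corresponding feature names.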

import pandas as pd

log_reg = clf[-1]
coef = pd.Series(log_reg.coef_.ravel(), index=log_reg.feature_names_in_)
_ = coef.sort_values().plot.barh()

To demonstrate the config_context functionality below, let us first reset transform_output to its default value.

set_config(transform_output="default")

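When configuring the output type with config_context, the configuration in effect when transform or fit_transform is called is the one that counts. Setting it only while constructing or fitting the transformer has no effect.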

from sklearn import config_context

scaler = StandardScaler()
scaler.fit(X_train[num_cols])

StandardScaler()

with config_context(transform_output="pandas"):
    # the output of transform will be a Pandas DataFrame
    X_test_scaled = scaler.transform(X_test[num_cols])
X_test_scaled.head()

	age	fare
1088	0.151101	-0.479229
1001	NaN	-0.188153
660	-0.393297	-0.263234
657	-1.975455	-0.263234
285	2.532843	3.546068

Outside of the context manager, the output will be a NumPy array:

X_test_scaled = scaler.transform(X_test[num_cols])
X_test_scaled[:5]

array([[ 0.1511007 , -0.47922861],
       [        nan, -0.18815268],
       [-0.39329747, -0.26323428],
       [-1.97545464, -0.26323428],
       [ 2.53284267,  3.54606834]])
