Release Highlights for scikit-learn 0.23

We are pleased to announce the release of scikit-learn 0.23! Many bug fixes and improvements were added, as well as some new key features. We detail below a few of the major features of this release. For an exhaustive list of all the changes, please refer to the release notes.

To install the latest version (with pip):

pip install --upgrade scikit-learn

or with conda:

conda install -c conda-forge scikit-learn

Generalized Linear Models, and Poisson loss for gradient boosting

Long-awaited Generalized Linear Models with non-normal loss functions are now available. In particular, three new regressors were implemented: PoissonRegressor, GammaRegressor, and TweedieRegressor. The Poisson regressor can be used to model positive integer counts, or relative frequencies. Read more in the User Guide. Additionally, HistGradientBoostingRegressor supports a new 'poisson' loss as well.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor
from sklearn.ensemble import HistGradientBoostingRegressor

n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X = rng.randn(n_samples, n_features)
# positive integer target correlated with X[:, 5] with many zeros:
y = rng.poisson(lam=np.exp(X[:, 5]) / 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
glm = PoissonRegressor()
gbdt = HistGradientBoostingRegressor(loss="poisson", learning_rate=0.01)
glm.fit(X_train, y_train)
gbdt.fit(X_train, y_train)
print(glm.score(X_test, y_test))
print(gbdt.score(X_test, y_test))
0.35776189065725783
0.42425183539869415
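Beyond the regressors' built-in score, Poisson models can also be evaluated with the mean Poisson deviance metric from sklearn.metrics, which penalizes predictions on count data more appropriately than squared error (lower is better). A minimal sketch on the same kind of synthetic data as above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_poisson_deviance

rng = np.random.RandomState(0)
n_samples, n_features = 1000, 20
X = rng.randn(n_samples, n_features)
# positive integer target correlated with X[:, 5]:
y = rng.poisson(lam=np.exp(X[:, 5]) / 2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

glm = PoissonRegressor().fit(X_train, y_train)
# mean_poisson_deviance takes (y_true, y_pred); predictions from the
# log-link model are strictly positive, as the metric requires
dev = mean_poisson_deviance(y_test, glm.predict(X_test))
print(dev)
```

Note that `score` on these regressors returns D², a deviance-based analogue of R², so the two views are consistent.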

Rich visual representation of estimators

Estimators can now be visualized in notebooks by enabling the display='diagram' option. This is particularly useful to summarize the structure of pipelines and other composite estimators, with interactivity to provide detail. Click on the example image below to expand Pipeline elements. See Visualizing Composite Estimators for how you can use this feature.

from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression

set_config(display="diagram")

num_proc = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

cat_proc = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor = make_column_transformer(
    (num_proc, ("feat1", "feat3")), (cat_proc, ("feat0", "feat2"))
)

clf = make_pipeline(preprocessor, LogisticRegression())
clf
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ('feat1', 'feat3')),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ('feat0', 'feat2'))])),
                ('logisticregression', LogisticRegression())])


Scalability and stability improvements to KMeans

The KMeans estimator was entirely re-worked, and it is now significantly faster and more stable. In addition, the Elkan algorithm is now compatible with sparse matrices. The estimator uses OpenMP based parallelism instead of relying on joblib, so the n_jobs parameter has no effect anymore. For more details on how to control the number of threads, please refer to our Parallelism notes.

import scipy
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score

rng = np.random.RandomState(0)
X, y = make_blobs(random_state=rng)
X = scipy.sparse.csr_matrix(X)
X_train, X_test, _, y_test = train_test_split(X, y, random_state=rng)
kmeans = KMeans(n_init="auto").fit(X_train)
print(completeness_score(kmeans.predict(X_test), y_test))
0.6684259852425617
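Since the OpenMP thread count is no longer governed by n_jobs, it has to be controlled at a lower level. One way, sketched below, uses the threadpoolctl package (a scikit-learn dependency); setting the OMP_NUM_THREADS environment variable before starting Python is the other common option:

```python
import numpy as np
from threadpoolctl import threadpool_limits
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(random_state=0)

# Cap the number of OpenMP threads for calls made inside this block only;
# code outside the context manager is unaffected.
with threadpool_limits(limits=2, user_api="openmp"):
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.inertia_)
```

This scoped approach is handy when KMeans runs inside an outer joblib loop, where letting every worker spawn a full set of OpenMP threads would oversubscribe the CPU.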

Improvements to the histogram-based Gradient Boosting estimators

Various improvements were made to HistGradientBoostingClassifier and HistGradientBoostingRegressor. On top of the Poisson loss mentioned above, these estimators now support sample weights. Also, an automatic early-stopping criterion was added: early-stopping is enabled by default when the number of samples exceeds 10k. Finally, users can now define monotonic constraints to constrain the predictions based on the variations of specific features. In the following example, we construct a target that is generally positively correlated with the first feature, with some noise. Applying monotonic constraints allows the prediction to capture the global effect of the first feature, instead of fitting the noise. For a usecase example, see Features in Histogram Gradient Boosting Trees.

import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# from sklearn.inspection import plot_partial_dependence
from sklearn.inspection import PartialDependenceDisplay
from sklearn.ensemble import HistGradientBoostingRegressor

n_samples = 500
rng = np.random.RandomState(0)
X = rng.randn(n_samples, 2)
noise = rng.normal(loc=0.0, scale=0.01, size=n_samples)
y = 5 * X[:, 0] + np.sin(10 * np.pi * X[:, 0]) - noise

gbdt_no_cst = HistGradientBoostingRegressor().fit(X, y)
gbdt_cst = HistGradientBoostingRegressor(monotonic_cst=[1, 0]).fit(X, y)

# plot_partial_dependence has been removed in version 1.2. From 1.2, use
# PartialDependenceDisplay instead.
# disp = plot_partial_dependence(
disp = PartialDependenceDisplay.from_estimator(
    gbdt_no_cst,
    X,
    features=[0],
    feature_names=["feature 0"],
    line_kw={"linewidth": 4, "label": "unconstrained", "color": "tab:blue"},
)
# plot_partial_dependence(
PartialDependenceDisplay.from_estimator(
    gbdt_cst,
    X,
    features=[0],
    line_kw={"linewidth": 4, "label": "constrained", "color": "tab:orange"},
    ax=disp.axes_,
)
disp.axes_[0, 0].plot(
    X[:, 0], y, "o", alpha=0.5, zorder=-1, label="samples", color="tab:green"
)
disp.axes_[0, 0].set_ylim(-3, 3)
disp.axes_[0, 0].set_xlim(-1, 1)
plt.legend()
plt.show()
[Figure: partial dependence of feature 0 for the unconstrained and constrained models, with the training samples overlaid]

Sample-weight support for Lasso and ElasticNet

The two linear regressors Lasso and ElasticNet now support sample weights.

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
import numpy as np

n_samples, n_features = 1000, 20
rng = np.random.RandomState(0)
X, y = make_regression(n_samples, n_features, random_state=rng)
sample_weight = rng.rand(n_samples)
X_train, X_test, y_train, y_test, sw_train, sw_test = train_test_split(
    X, y, sample_weight, random_state=rng
)
reg = Lasso()
reg.fit(X_train, y_train, sample_weight=sw_train)
print(reg.score(X_test, y_test, sw_test))
0.999791942438998
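ElasticNet accepts sample weights through the same fit signature; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=200, n_features=10, random_state=rng)
sample_weight = rng.rand(200)

# sample_weight is forwarded to the coordinate-descent solver, so heavily
# weighted samples contribute more to the penalized least-squares fit
reg = ElasticNet(alpha=0.1).fit(X, y, sample_weight=sample_weight)
print(reg.score(X, y))
```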

Total running time of the script: (0 minutes 0.621 seconds)

Related examples

Release Highlights for scikit-learn 1.4

Release Highlights for scikit-learn 0.24

Monotonic Constraints

Release Highlights for scikit-learn 1.1

Gallery generated by Sphinx-Gallery