Lasso on dense and sparse data

We show that linear_model.Lasso gives the same results on dense and sparse data, and that it runs faster when sparse data is stored in a sparse format.

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

from time import time

from scipy import linalg, sparse

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

Comparing the two Lasso implementations on dense data

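We create a linear regression problem that is suitable for the Lasso, that is, one with more features than samples. We then store the data matrix in both dense (the usual) and sparse format, and train a Lasso on each. We time both fits and check that they learned the same model by computing the Euclidean norm of the difference between their learned coefficients. Because the data is dense, we expect the dense data format to give the better runtime.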
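To give a rough sense of why a sparse format is expected to bring no benefit on dense data, the following minimal sketch compares storage sizes. It is not part of the original example: `X_demo` is a hypothetical stand-in matrix of the same shape, and CSC is picked here only as one common column-oriented sparse format (the example itself builds its sparse copy with `coo_matrix`).

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# Hypothetical stand-in for a fully dense design matrix (200 samples, 5000 features).
X_demo = rng.standard_normal((200, 5000))
X_demo_csc = sparse.csc_matrix(X_demo)

dense_bytes = X_demo.nbytes
# A CSC matrix stores the non-zero values plus two integer index arrays.
sparse_bytes = (
    X_demo_csc.data.nbytes
    + X_demo_csc.indices.nbytes
    + X_demo_csc.indptr.nbytes
)

print(f"Stored non-zeros : {X_demo_csc.nnz} of {X_demo.size} entries")
print(f"Dense storage    : {dense_bytes / 1e6:.1f} MB")
print(f"CSC storage      : {sparse_bytes / 1e6:.1f} MB")
```

When almost every entry is non-zero, the sparse copy has to store every value and an index for each one, so it can only add overhead compared with the plain array.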

X, y = make_regression(n_samples=200, n_features=5000, random_state=0)
# create a copy of X in sparse format
X_sp = sparse.coo_matrix(X)

alpha = 1
sparse_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=1000)
dense_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=1000)

t0 = time()
sparse_lasso.fit(X_sp, y)
print(f"Sparse Lasso done in {(time() - t0):.3f}s")

t0 = time()
dense_lasso.fit(X, y)
print(f"Dense Lasso done in {(time() - t0):.3f}s")

# compare the regression coefficients
coeff_diff = linalg.norm(sparse_lasso.coef_ - dense_lasso.coef_)
print(f"Distance between coefficients : {coeff_diff:.2e}")

Sparse Lasso done in 0.110s
Dense Lasso done in 0.038s
Distance between coefficients : 1.01e-13
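Timings taken from a single run, as above, can vary noticeably from one execution to the next. One possible way to make the comparison more stable is to repeat each fit and keep the fastest run. The helper `best_of` below is hypothetical (it is not part of scikit-learn or of this example), and the workload passed to it is only a placeholder.

```python
from time import perf_counter


def best_of(fit_callable, n_repeats=5):
    """Return the smallest wall-clock time over n_repeats calls (hypothetical helper)."""
    timings = []
    for _ in range(n_repeats):
        t0 = perf_counter()
        fit_callable()
        timings.append(perf_counter() - t0)
    return min(timings)


# Placeholder workload; in the example above one could instead pass
# lambda: dense_lasso.fit(X, y) or lambda: sparse_lasso.fit(X_sp, y).
elapsed = best_of(lambda: sum(i * i for i in range(100_000)))
print(f"Best of 5 runs: {elapsed:.4f}s")
```

Taking the minimum rather than the mean is a design choice: it reduces the influence of unrelated system activity on the reported time.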

Comparing the two Lasso implementations on sparse data

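We make the previous problem sparse by replacing all small values (here, every entry below 2.5) with 0, and run the same comparison as above. Because the data is now sparse, we expect the implementation that uses the sparse data format to be faster.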

# make a copy of the previous data
Xs = X.copy()
# make Xs sparse by replacing the values lower than 2.5 with 0s
Xs[Xs < 2.5] = 0.0
# create a copy of Xs in sparse format
Xs_sp = sparse.coo_matrix(Xs)
Xs_sp = Xs_sp.tocsc()

# compute the proportion of non-zero entries in the data matrix
print(f"Matrix density : {(Xs_sp.nnz / float(X.size) * 100):.3f}%")

alpha = 0.1
sparse_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
dense_lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)

t0 = time()
sparse_lasso.fit(Xs_sp, y)
print(f"Sparse Lasso done in {(time() - t0):.3f}s")

t0 = time()
dense_lasso.fit(Xs, y)
print(f"Dense Lasso done in  {(time() - t0):.3f}s")

# compare the regression coefficients
coeff_diff = linalg.norm(sparse_lasso.coef_ - dense_lasso.coef_)
print(f"Distance between coefficients : {coeff_diff:.2e}")

Matrix density : 0.626%
Sparse Lasso done in 0.200s
Dense Lasso done in  0.742s
Distance between coefficients : 8.65e-12
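The effect of the thresholding step can be illustrated on a smaller matrix. The sketch below is an illustration under stated assumptions, not part of the original example: `X_small` and `w` are hypothetical, randomly generated stand-ins. It applies the same `< 2.5` threshold, measures the resulting density, compares storage sizes, and checks that a matrix-vector product gives the same result in both formats.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# Hypothetical small matrix, thresholded the same way as Xs above.
X_small = rng.standard_normal((50, 200))
X_small[X_small < 2.5] = 0.0

X_small_csc = sparse.csc_matrix(X_small)
density = X_small_csc.nnz / X_small.size

# Only the non-zero values and their indices are stored in CSC format.
csc_bytes = (
    X_small_csc.data.nbytes
    + X_small_csc.indices.nbytes
    + X_small_csc.indptr.nbytes
)

# The same product computed from both representations should agree.
w = rng.standard_normal(X_small.shape[1])
same_product = np.allclose(X_small @ w, X_small_csc @ w)

print(f"Density       : {density * 100:.3f}%")
print(f"Dense storage : {X_small.nbytes} bytes")
print(f"CSC storage   : {csc_bytes} bytes")
print(f"Products agree: {same_product}")
```

With so few non-zero entries, operations on the sparse copy can skip the zeros entirely, which is consistent with the faster sparse fit reported above.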

Total running time of the script: (0 minutes 1.159 seconds)

Related examples

L1-based models for Sparse Signals

Joint feature selection with multi-task Lasso

Lasso, Lasso-LARS, and Elastic Net paths

Lasso model selection: AIC-BIC / cross-validation
