注意
前往結尾下載完整的範例程式碼。或透過 JupyterLite 或 Binder 在您的瀏覽器中執行此範例
TSNE 中的近似最近鄰#
此範例展示如何在管道中鏈接 KNeighborsTransformer 和 TSNE。它還展示了如何封裝 nmslib
和 pynndescent
套件以取代 KNeighborsTransformer 並執行近似最近鄰。這些套件可以使用 pip install nmslib pynndescent
安裝。
注意:在 KNeighborsTransformer 中,我們使用一個定義,其中每個訓練點都將其自身包含在 n_neighbors
的計數中作為自己的鄰居,並且為了相容性原因,當 mode == 'distance'
時,會計算一個額外的鄰居。請注意,我們在建議的 nmslib
包裝器中也執行相同的操作。
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
首先,我們嘗試匯入套件,並在遺失套件時警告使用者。
我們定義一個包裝類別,以將 scikit-learn API 實作到 nmslib
,以及一個載入函數。
import joblib
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_openml
from sklearn.utils import shuffle
class NMSlibTransformer(TransformerMixin, BaseEstimator):
"""Wrapper for using nmslib as sklearn's KNeighborsTransformer"""
def __init__(self, n_neighbors=5, metric="euclidean", method="sw-graph", n_jobs=-1):
self.n_neighbors = n_neighbors
self.method = method
self.metric = metric
self.n_jobs = n_jobs
def fit(self, X):
self.n_samples_fit_ = X.shape[0]
# see more metric in the manual
# https://github.com/nmslib/nmslib/tree/master/manual
space = {
"euclidean": "l2",
"cosine": "cosinesimil",
"l1": "l1",
"l2": "l2",
}[self.metric]
self.nmslib_ = nmslib.init(method=self.method, space=space)
self.nmslib_.addDataPointBatch(X.copy())
self.nmslib_.createIndex()
return self
def transform(self, X):
n_samples_transform = X.shape[0]
# For compatibility reasons, as each sample is considered as its own
# neighbor, one extra neighbor will be computed.
n_neighbors = self.n_neighbors + 1
if self.n_jobs < 0:
# Same handling as done in joblib for negative values of n_jobs:
# in particular, `n_jobs == -1` means "as many threads as CPUs".
num_threads = joblib.cpu_count() + self.n_jobs + 1
else:
num_threads = self.n_jobs
results = self.nmslib_.knnQueryBatch(
X.copy(), k=n_neighbors, num_threads=num_threads
)
indices, distances = zip(*results)
indices, distances = np.vstack(indices), np.vstack(distances)
indptr = np.arange(0, n_samples_transform * n_neighbors + 1, n_neighbors)
kneighbors_graph = csr_matrix(
(distances.ravel(), indices.ravel(), indptr),
shape=(n_samples_transform, self.n_samples_fit_),
)
return kneighbors_graph
def load_mnist(n_samples):
"""Load MNIST, shuffle the data, and return only n_samples."""
mnist = fetch_openml("mnist_784", as_frame=False)
X, y = shuffle(mnist.data, mnist.target, random_state=2)
return X[:n_samples] / 255, y[:n_samples]
我們對不同的精確/近似最近鄰轉換器進行基準測試。
import time
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsTransformer
from sklearn.pipeline import make_pipeline
datasets = [
("MNIST_10000", load_mnist(n_samples=10_000)),
("MNIST_20000", load_mnist(n_samples=20_000)),
]
n_iter = 500
perplexity = 30
metric = "euclidean"
# TSNE requires a certain number of neighbors which depends on the
# perplexity parameter.
# Add one since we include each sample as its own neighbor.
n_neighbors = int(3.0 * perplexity + 1) + 1
tsne_params = dict(
init="random", # pca not supported for sparse matrices
perplexity=perplexity,
method="barnes_hut",
random_state=42,
n_iter=n_iter,
learning_rate="auto",
)
transformers = [
(
"KNeighborsTransformer",
KNeighborsTransformer(n_neighbors=n_neighbors, mode="distance", metric=metric),
),
(
"NMSlibTransformer",
NMSlibTransformer(n_neighbors=n_neighbors, metric=metric),
),
(
"PyNNDescentTransformer",
PyNNDescentTransformer(
n_neighbors=n_neighbors, metric=metric, parallel_batch_queries=True
),
),
]
for dataset_name, (X, y) in datasets:
msg = f"Benchmarking on {dataset_name}:"
print(f"\n{msg}\n" + str("-" * len(msg)))
for transformer_name, transformer in transformers:
longest = np.max([len(name) for name, model in transformers])
start = time.time()
transformer.fit(X)
fit_duration = time.time() - start
print(f"{transformer_name:<{longest}} {fit_duration:.3f} sec (fit)")
start = time.time()
Xt = transformer.transform(X)
transform_duration = time.time() - start
print(f"{transformer_name:<{longest}} {transform_duration:.3f} sec (transform)")
if transformer_name == "PyNNDescentTransformer":
start = time.time()
Xt = transformer.transform(X)
transform_duration = time.time() - start
print(
f"{transformer_name:<{longest}} {transform_duration:.3f} sec"
" (transform)"
)
範例輸出
Benchmarking on MNIST_10000:
----------------------------
KNeighborsTransformer 0.007 sec (fit)
KNeighborsTransformer 1.139 sec (transform)
NMSlibTransformer 0.208 sec (fit)
NMSlibTransformer 0.315 sec (transform)
PyNNDescentTransformer 4.823 sec (fit)
PyNNDescentTransformer 4.884 sec (transform)
PyNNDescentTransformer 0.744 sec (transform)
Benchmarking on MNIST_20000:
----------------------------
KNeighborsTransformer 0.011 sec (fit)
KNeighborsTransformer 5.769 sec (transform)
NMSlibTransformer 0.733 sec (fit)
NMSlibTransformer 1.077 sec (transform)
PyNNDescentTransformer 14.448 sec (fit)
PyNNDescentTransformer 7.103 sec (transform)
PyNNDescentTransformer 1.759 sec (transform)
請注意,由於 numba 即時編譯器的開銷,PyNNDescentTransformer
在第一次 fit
和第一次 transform
期間需要更多時間。但在第一次呼叫之後,編譯的 Python 程式碼會由 numba 保存在快取中,後續呼叫不會受到此初始開銷的影響。 KNeighborsTransformer
和 NMSlibTransformer
在此僅執行一次,因為它們會顯示更穩定的 fit
和 transform
時間(它們沒有 PyNNDescentTransformer 的冷啟動問題)。
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
transformers = [
("TSNE with internal NearestNeighbors", TSNE(metric=metric, **tsne_params)),
(
"TSNE with KNeighborsTransformer",
make_pipeline(
KNeighborsTransformer(
n_neighbors=n_neighbors, mode="distance", metric=metric
),
TSNE(metric="precomputed", **tsne_params),
),
),
(
"TSNE with NMSlibTransformer",
make_pipeline(
NMSlibTransformer(n_neighbors=n_neighbors, metric=metric),
TSNE(metric="precomputed", **tsne_params),
),
),
]
# init the plot
nrows = len(datasets)
ncols = np.sum([1 for name, model in transformers if "TSNE" in name])
fig, axes = plt.subplots(
nrows=nrows, ncols=ncols, squeeze=False, figsize=(5 * ncols, 4 * nrows)
)
axes = axes.ravel()
i_ax = 0
for dataset_name, (X, y) in datasets:
msg = f"Benchmarking on {dataset_name}:"
print(f"\n{msg}\n" + str("-" * len(msg)))
for transformer_name, transformer in transformers:
longest = np.max([len(name) for name, model in transformers])
start = time.time()
Xt = transformer.fit_transform(X)
transform_duration = time.time() - start
print(
f"{transformer_name:<{longest}} {transform_duration:.3f} sec"
" (fit_transform)"
)
# plot TSNE embedding which should be very similar across methods
axes[i_ax].set_title(transformer_name + "\non " + dataset_name)
axes[i_ax].scatter(
Xt[:, 0],
Xt[:, 1],
c=y.astype(np.int32),
alpha=0.2,
cmap=plt.cm.viridis,
)
axes[i_ax].xaxis.set_major_formatter(NullFormatter())
axes[i_ax].yaxis.set_major_formatter(NullFormatter())
axes[i_ax].axis("tight")
i_ax += 1
fig.tight_layout()
plt.show()
範例輸出
Benchmarking on MNIST_10000:
----------------------------
TSNE with internal NearestNeighbors 24.828 sec (fit_transform)
TSNE with KNeighborsTransformer 20.111 sec (fit_transform)
TSNE with NMSlibTransformer 21.757 sec (fit_transform)
Benchmarking on MNIST_20000:
----------------------------
TSNE with internal NearestNeighbors 51.955 sec (fit_transform)
TSNE with KNeighborsTransformer 50.994 sec (fit_transform)
TSNE with NMSlibTransformer 43.536 sec (fit_transform)
我們可以觀察到,預設的 TSNE
估計器及其內部的 NearestNeighbors
實作,在效能方面大致等同於具有 TSNE
和 KNeighborsTransformer
的管道。這是預期的,因為兩個管道在內部都依賴於相同的 NearestNeighbors
實作,該實作執行精確的鄰居搜尋。近似 NMSlibTransformer
在最小的資料集上已經比精確搜尋快一點,但這種速度差異預期會在具有大量樣本的資料集上變得更加顯著。
然而,請注意,並非所有近似搜尋方法都能保證提高預設精確搜尋方法的速度:事實上,自 scikit-learn 1.1 以來,精確搜尋實作已顯著改善。此外,暴力破解精確搜尋方法不需要在 fit
時間建置索引。因此,為了在 TSNE
管道的內容中獲得整體效能提升,近似搜尋在 transform
時的增益需要大於在 fit
時間建置近似搜尋索引所花費的額外時間。
最後,無論最近鄰搜尋如何,TSNE 演算法本身也計算密集型。因此,將最近鄰搜尋步驟的速度提高 5 倍並不會使整個管道的速度提高 5 倍。
相關範例