時間序列分割#

class sklearn.model_selection.TimeSeriesSplit(n_splits=5, *, max_train_size=None, test_size=None, gap=0)[原始碼]#

時間序列交叉驗證器。

提供訓練/測試索引，以將在固定時間間隔觀察到的時間序列資料樣本分割成訓練/測試集。在每次分割中，測試索引必須比之前的高，因此在交叉驗證器中進行洗牌是不適當的。

此交叉驗證物件是 KFold 的變體。在第 k 次分割中，它會將前 k 個折疊返回為訓練集，並將第 (k+1) 個折疊返回為測試集。

請注意，與標準交叉驗證方法不同，連續的訓練集是先前訓練集的超集。

在使用者指南中閱讀更多內容。

有關交叉驗證行為的可視化以及常見 scikit-learn 分割方法之間的比較，請參閱在 scikit-learn 中視覺化交叉驗證行為。

在 0.18 版本中新增。

參數:

n_splitsint，預設值=5: 分割次數。必須至少為 2。

在 0.22 版本中變更： n_splits 的預設值從 3 變更為 5。
max_train_sizeint，預設值=None: 單個訓練集的最大大小。
test_sizeint，預設值=None: 用於限制測試集的大小。預設值為 n_samples // (n_splits + 1)，這是 gap=0 時允許的最大值。

在 0.24 版本中新增。
gapint，預設值=0: 在測試集之前，從每個訓練集的末尾排除的樣本數。

在 0.24 版本中新增。

注意事項

訓練集的大小為 i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)，其中 i 為第 i 次分割，測試集的大小預設為 n_samples//(n_splits + 1)，其中 n_samples 是樣本數。請注意，此公式僅在 test_size 和 max_train_size 保留為其預設值時才有效。

範例

>>> import numpy as np
>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit()
>>> print(tscv)
TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)
>>> for i, (train_index, test_index) in enumerate(tscv.split(X)):
...     print(f"Fold {i}:")
...     print(f"  Train: index={train_index}")
...     print(f"  Test:  index={test_index}")
Fold 0:
  Train: index=[0]
  Test:  index=[1]
Fold 1:
  Train: index=[0 1]
  Test:  index=[2]
Fold 2:
  Train: index=[0 1 2]
  Test:  index=[3]
Fold 3:
  Train: index=[0 1 2 3]
  Test:  index=[4]
Fold 4:
  Train: index=[0 1 2 3 4]
  Test:  index=[5]
>>> # Fix test_size to 2 with 12 samples
>>> X = np.random.randn(12, 2)
>>> y = np.random.randint(0, 2, 12)
>>> tscv = TimeSeriesSplit(n_splits=3, test_size=2)
>>> for i, (train_index, test_index) in enumerate(tscv.split(X)):
...     print(f"Fold {i}:")
...     print(f"  Train: index={train_index}")
...     print(f"  Test:  index={test_index}")
Fold 0:
  Train: index=[0 1 2 3 4 5]
  Test:  index=[6 7]
Fold 1:
  Train: index=[0 1 2 3 4 5 6 7]
  Test:  index=[8 9]
Fold 2:
  Train: index=[0 1 2 3 4 5 6 7 8 9]
  Test:  index=[10 11]
>>> # Add in a 2 period gap
>>> tscv = TimeSeriesSplit(n_splits=3, test_size=2, gap=2)
>>> for i, (train_index, test_index) in enumerate(tscv.split(X)):
...     print(f"Fold {i}:")
...     print(f"  Train: index={train_index}")
...     print(f"  Test:  index={test_index}")
Fold 0:
  Train: index=[0 1 2 3]
  Test:  index=[6 7]
Fold 1:
  Train: index=[0 1 2 3 4 5]
  Test:  index=[8 9]
Fold 2:
  Train: index=[0 1 2 3 4 5 6 7]
  Test:  index=[10 11]

有關更詳細的範例，請參閱時間相關的特徵工程。

get_metadata_routing()[原始碼]#

取得此物件的中繼資料路由。

請查看使用者指南，了解路由機制如何運作。

回傳值:

routingMetadataRequest: 一個封裝路由資訊的 MetadataRequest。

get_n_splits(X=None, y=None, groups=None)[原始碼]#

傳回交叉驗證器中的分割迭代次數。

參數:

Xobject: 始終忽略，為了相容性而存在。
yobject: 始終忽略，為了相容性而存在。
groupsobject: 始終忽略，為了相容性而存在。

回傳值:

n_splitsint: 傳回交叉驗證器中的分割迭代次數。

split(X, y=None, groups=None)[原始碼]#

產生將資料分割為訓練集和測試集的索引。

參數:

Xarray-like，形狀為 (n_samples, n_features): 訓練資料，其中 n_samples 是樣本數，而 n_features 是特徵數。
yarray-like，形狀為 (n_samples,): 始終忽略，為了相容性而存在。
groupsarray-like，形狀為 (n_samples,): 始終忽略，為了相容性而存在。

產生:

trainndarray: 該分割的訓練集索引。
testndarray: 該分割的測試集索引。

範例展示#

在 scikit-learn 中視覺化交叉驗證行為