How to access the group value for each cv iteration in sklearn LeaveOneGroupOut

Question

I have some code to validate a model where I want to use each year in my data as a holdout set. To do that, I'm using sklearn's LeaveOneGroupOut:

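import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import LeaveOneGroupOut

# One score per held-out group, appended in iteration order
log_loss_data = []
acc_data = []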

years = np.arange(df.year.min(),df.year.max()+1)[::-1]

groups = df['year']

X = df[[__my_features__]]
y = df[__my_target__]

logo = LeaveOneGroupOut()
logo.get_n_splits(X, y, groups)

logo.get_n_splits(groups=groups)

for year, (train_index, test_index) in zip(years, logo.split(X, y, groups)):
  print(f'Leaving out {year}...')

  X_train, X_test = X.iloc[train_index].copy(), X.iloc[test_index].copy()
  y_train, y_test = y.iloc[train_index].copy(), y.iloc[test_index].copy()

  model = LGBMClassifier()
  model.fit(X_train, y_train)

  # Probability of the positive class (second column of predict_proba)
  X_test["pred"] = model.predict_proba(X_test)[:, 1]

  log_loss_data.append(log_loss(y_test, X_test["pred"]))
  acc_data.append(accuracy_score(y_test, np.round(X_test["pred"])))

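Once this finishes, I have lists of log loss and accuracy scores, one entry per group. The code above assumes the groups are iterated from largest to smallest, but I'm not sure that is the case. I'd like to associate my CV scores with their corresponding group year, to see whether any years (or groups of years/seasonality) lead to different or worse scores. In the docs, there seem to be only two methods, .get_n_splits() and .split(). I figure there must be a way to access the group value on each cv iteration... am I wrong in that assumption?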

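Edit: I ran some tests, and it turns out that numeric groups appear to be iterated from smallest to largest. To check this, I built two separate models: one using the earliest year in my data as the test set, the other using the latest. Their scores matched the scores from the first and last group cv iterations, respectively. Although no official documentation (that I've come across) confirms this, based on this test I'm fairly confident the groups are indeed iterated in order from smallest to largest.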

Answer 1

Yes, as you found, the splits are produced in order of the group identifiers.

You can see this in the source: the group array is passed through numpy.unique, which returns the unique items in sorted order, and those are then looped over.
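If you would rather not rely on the iteration order at all, one option is to read the held-out group straight from the test indices inside the loop. The following is a minimal sketch on toy data; X, y, and groups here are hypothetical stand-ins for the arrays in the question, and held_out is a name introduced only for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data (assumed for illustration): six rows, three "years", deliberately unsorted
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([2021, 2019, 2020, 2019, 2021, 2020])

logo = LeaveOneGroupOut()

held_out = []
for train_index, test_index in logo.split(X, y, groups):
    # All test rows share one group label, so look it up from the data itself
    test_groups = np.unique(groups[test_index])
    assert len(test_groups) == 1
    held_out.append(int(test_groups[0]))

print(held_out)  # [2019, 2020, 2021]
```

When groups is a pandas Series, as in the question (df['year']), use groups.iloc[test_index] instead of groups[test_index], since the indices returned by split are positional. Storing the recovered year alongside each score (for example as a dict keyed by year) keeps the association explicit even if the iteration order were to change.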


python

scikit-learn