How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

Question

I'm working on a sentiment analysis problem. The data looks like this:

label instances
    5    1190
    4     838
    3     239
    1     204
    2     127

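So my data is unbalanced, since 1190 instances are labeled with 5. For the classification I'm using scikit's SVC. The problem is that I do not know how to balance my data in the right way in order to accurately compute the precision, recall, accuracy and f1-score for the multiclass case. So I tried the following approaches:

First: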

wclf = SVC(kernel='linear', C= 1, class_weight={1: 10})
wclf.fit(X, y)
weighted_prediction = wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction,average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction,
                              average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction,
                                    average='weighted')
print '\n clasification report:\n', classification_report(y_test, weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, weighted_prediction)

Second:

auto_wclf = SVC(kernel='linear', C= 1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)

print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)

print 'F1 score:', f1_score(y_test, auto_weighted_prediction,
                            average='weighted')

print 'Recall:', recall_score(y_test, auto_weighted_prediction,
                              average='weighted')

print 'Precision:', precision_score(y_test, auto_weighted_prediction,
                                    average='weighted')

print '\n clasification report:\n', classification_report(y_test,auto_weighted_prediction)

print '\n confussion matrix:\n',confusion_matrix(y_test, auto_weighted_prediction)

Third:

clf = SVC(kernel='linear', C= 1)
clf.fit(X, y)
prediction = clf.predict(X_test)


from sklearn.metrics import precision_score, \
    recall_score, confusion_matrix, classification_report, \
    accuracy_score, f1_score

print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n clasification report:\n', classification_report(y_test,prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, prediction)


F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
 0.930416613529

However, I'm getting warnings like this:

/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with 
multiclass or multilabel data or pos_label=None will result in an 
exception. Please set an explicit value for `average`, one of (None, 
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for 
instance, scoring="f1_weighted" instead of scoring="f1"

How can I deal correctly with my unbalanced data in order to compute the classifier's metrics in the right way?

Answer 1

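First of all, it is hard to tell whether your data is unbalanced from label counts alone. For example: is 1 positive observation in 1000 just noise, an error, or a scientific breakthrough? You never know.
So it is always better to use all the knowledge you have and decide on its status wisely.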

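Okay, what if it really is unbalanced?
Once again, look at your data. Sometimes you will find one or two observations duplicated a hundred times. Sometimes it is useful to create such fake single-class observations yourself.
If all the data is clean, the next step is to use class weights in the prediction model.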

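So what about multiclass metrics?
In my experience, none of the metrics you list is the one usually used. There are two main reasons.
First: it is always better to work with probabilities than with hard predictions (how else could you tell apart two models that predict 0.9 and 0.6 if they both give you the same class?).
Second: it is much easier to compare your prediction models and build new ones when you rely on a single good metric.
From my experience I can recommend logloss or MSE (mean squared error).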
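To make the first reason concrete, here is a minimal sketch in plain Python. The two probability tables are invented for illustration, and `log_loss` below is a small hand-written helper, not the scikit-learn function. Both hypothetical models predict the same class for every sample, yet their log loss differs.

```python
import math

labels = [1, 2, 3]
y_true = [1, 2, 3]

# Invented probabilities: both models pick the correct class every time,
# but model_a is far more confident than model_b.
model_a = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
model_b = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]]


def hard_predictions(probas):
    """Label with the highest probability in each row."""
    return [labels[row.index(max(row))] for row in probas]


def log_loss(y_true, probas):
    """Mean negative log-probability assigned to the true class."""
    losses = [-math.log(row[labels.index(t)]) for t, row in zip(y_true, probas)]
    return sum(losses) / len(losses)


same_classes = hard_predictions(model_a) == hard_predictions(model_b)
loss_a = log_loss(y_true, model_a)
loss_b = log_loss(y_true, model_b)

print('same hard predictions:', same_classes)
print('log loss A:', loss_a)
print('log loss B:', loss_b)
```

Accuracy or F1 would rate the two models identically here; log loss is lower for the more confident one.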

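How to fix the sklearn warnings?
Simply (as yangjie noticed) set the average parameter to one of these values: 'micro' (calculate metrics globally), 'macro' (calculate metrics for each label and take their unweighted mean) or 'weighted' (same as macro, but each label is weighted by its support).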

f1_score(y_test, prediction, average='weighted')

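All your warnings came after calling the metric functions with the default average value 'binary', which is inappropriate for multiclass prediction. As the warning text says, that release falls back to weighted averaging for now, and from version 0.18 the same call raises an exception.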
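As a rough sketch of what each value returns, the snippet below calls `f1_score` on small toy labels (made up for the example, five classes with a single mistake). `average=None` is also accepted and returns one score per label instead of a single number.

```python
from sklearn.metrics import f1_score

# Toy labels for illustration: the last sample (true class 1) is predicted as 5.
y_test = [1, 2, 3, 4, 5, 1, 2, 1, 1, 4, 1]
prediction = [1, 2, 3, 4, 5, 1, 2, 1, 1, 4, 5]

per_class = f1_score(y_test, prediction, average=None)      # array, one F1 per label
micro = f1_score(y_test, prediction, average='micro')       # from pooled counts
macro = f1_score(y_test, prediction, average='macro')       # plain mean of per_class
weighted = f1_score(y_test, prediction, average='weighted')  # mean weighted by support

print('per class:', per_class)
print('micro:', micro)
print('macro:', macro)
print('weighted:', weighted)
```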
Good luck and have fun with machine learning!

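Edit:
I saw another answerer's recommendation to switch to regression approaches (e.g. SVR), with which I cannot agree. As far as I remember, there is not even such a thing as multiclass regression. Yes, there is multilabel regression, which is very different, and yes, in some cases it is possible to switch between regression and classification (if the classes are somehow ordered), but that is pretty rare.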

What I would recommend (within the scope of scikit-learn) is to try other very powerful classification tools: gradient boosting, random forest (my favorite), KNeighbors and many more.

After that you can take the arithmetic or geometric mean of the predictions, and most of the time you will get an even better result.

final_prediction = (KNNprediction * RFprediction) ** 0.5
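A minimal sketch of that blending step, under the assumption that `KNNprediction` and `RFprediction` are per-class probability arrays (for instance the output of each model's `predict_proba`). The numbers below are invented so the snippet runs on its own.

```python
import numpy as np

# Hypothetical class probabilities for 3 samples and 3 classes,
# standing in for the predict_proba output of two fitted models.
KNNprediction = np.array([[0.6, 0.3, 0.1],
                          [0.2, 0.5, 0.3],
                          [0.4, 0.4, 0.2]])
RFprediction = np.array([[0.8, 0.1, 0.1],
                         [0.1, 0.3, 0.6],
                         [0.2, 0.7, 0.1]])

# Geometric mean, renormalized so that every row sums to 1 again.
geometric = (KNNprediction * RFprediction) ** 0.5
geometric /= geometric.sum(axis=1, keepdims=True)

# Arithmetic mean; rows already sum to 1.
arithmetic = (KNNprediction + RFprediction) / 2

# Column index of the most probable class for each sample.
final_classes = geometric.argmax(axis=1)
print(final_classes)
```

Averaging hard class labels in this way would not be meaningful, which is one more reason to work with probabilities.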

Answer 2

Posed question

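Responding to the question 'what metric should be used for multi-class classification with imbalanced data': the Macro-F1-measure. Macro Precision and Macro Recall can also be used, but they are not as easily interpretable as in binary classification, they are already incorporated into the F-measure, and excess metrics complicate method comparison, parameter tuning, and so on.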

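Micro averaging is sensitive to class imbalance: if your method, for example, works well on the most common labels and completely messes up the others, micro-averaged metrics still show good results.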
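A small sketch of that effect, with invented labels: the hypothetical model below is right on every sample of the frequent class and wrong on every sample of the two rare ones.

```python
from sklearn.metrics import f1_score

# Invented data: 90 samples of the frequent class 5, five each of classes 1 and 2.
y_true = [5] * 90 + [1] * 5 + [2] * 5
# Class 5 is always predicted correctly; classes 1 and 2 never are.
y_pred = [5] * 90 + [5, 5, 5, 5, 2] + [5, 5, 5, 5, 1]

micro_f1 = f1_score(y_true, y_pred, average='micro')
macro_f1 = f1_score(y_true, y_pred, average='macro')

print('micro F1:', micro_f1)  # dominated by the frequent class
print('macro F1:', macro_f1)  # pulled down by the two failed classes
```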

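Weighted averaging isn't well suited to imbalanced data, because it weights by label counts. Moreover, it is hard to interpret and unpopular: for instance, there is no mention of such averaging in the following very detailed survey, which I strongly recommend looking through: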

Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427-437.

Application-specific question

However, returning to your task, I'd research 2 topics:

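metrics commonly used for your specific task: this lets you (a) compare your method with others and understand whether you are doing something wrong, and (b) reuse someone else's findings instead of exploring this yourself;
the cost of the different errors of your method: for example, the use case of your application may rely on 4- and 5-star reviews only, in which case a good metric should count only these 2 labels.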
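If only some labels matter, scikit-learn's metric functions accept a `labels` argument that restricts the average to those labels. A rough sketch with invented ratings:

```python
from sklearn.metrics import f1_score

# Invented star ratings for illustration.
y_true = [5, 5, 5, 4, 4, 3, 2, 1, 1, 5]
y_pred = [5, 5, 4, 4, 4, 3, 1, 1, 2, 5]

# Macro F1 over the 4- and 5-star labels only. Mistakes among the other
# classes do not enter the average unless they involve label 4 or 5.
f1_top = f1_score(y_true, y_pred, labels=[4, 5], average='macro')
print('macro F1 on labels 4 and 5:', f1_top)
```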

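Commonly used metrics. As far as I can infer from looking through the literature, there are 2 main evaluation metrics:

Accuracy, which is used, e.g., in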

Yu, April, and Daryl Chang. "Multiclass Sentiment Prediction using Yelp Business."

(link) - note that the authors work with almost the same distribution of ratings, see Figure 5.

Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005.

(link)

MSE (or, less often, Mean Absolute Error - MAE) - see, for example,

Lee, Moontae, and R. Grafe. "Multiclass sentiment analysis with restaurant reviews." Final Projects from CS N 224 (2010).

(link) - they explore both accuracy and MSE, and consider the latter to be better

Pappas, Nikolaos, Rue Marconi, and Andrei Popescu-Belis. "Explaining the Stars: Weighted Multiple-Instance Learning for Aspect-Based Sentiment Analysis." Proceedings of the 2014 Conference on Empirical Methods In Natural Language Processing. No. EPFL-CONF-200899. 2014.

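(link) - they use scikit-learn for evaluation and baseline approaches and state that their code is available; however, I can't find it, so if you need it, write to the authors. The work is pretty new and seems to be written in Python.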

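Cost of different errors. If you care more about avoiding gross blunders, e.g. assigning 1 star to a 5-star review or something like that, look at MSE; if the size of the difference matters, but not so much, try MAE, since it doesn't square the difference; otherwise stay with Accuracy.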
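A tiny worked example in plain Python (ratings invented for illustration) shows how the three metrics treat the same number of mistakes differently:

```python
y_true = [5, 5, 5, 5]
near_misses = [5, 5, 4, 4]   # wrong by one star, twice
gross_misses = [5, 5, 1, 1]  # wrong by four stars, twice


def accuracy(t, p):
    return sum(a == b for a, b in zip(t, p)) / float(len(t))


def mae(t, p):
    return sum(abs(a - b) for a, b in zip(t, p)) / float(len(t))


def mse(t, p):
    return sum((a - b) ** 2 for a, b in zip(t, p)) / float(len(t))


for name, pred in [('near', near_misses), ('gross', gross_misses)]:
    print(name, accuracy(y_true, pred), mae(y_true, pred), mse(y_true, pred))
```

Accuracy is 0.5 for both sets of predictions. MAE grows from 0.5 to 2.0 and MSE from 0.5 to 8.0, so MSE punishes the gross misses the hardest.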

About approaches, not metrics

Try regression approaches, e.g. SVR, since they generally outperform multiclass classifiers like SVC or OVA SVM.
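A minimal sketch of what that could look like, assuming the star ratings are treated as an ordered numeric target. The one-feature data is invented, and rounding plus clipping is just one possible way to map the continuous output back to labels.

```python
import numpy as np
from sklearn.svm import SVR

# Invented one-feature data in which the rating grows with the feature.
rng = np.random.RandomState(0)
y = np.repeat([1, 2, 3, 4, 5], 20)
X = (y + rng.normal(scale=0.3, size=y.size)).reshape(-1, 1)

reg = SVR(kernel='linear', C=1)
reg.fit(X, y)

# Map the continuous predictions back onto the 1..5 star scale.
# (Predicting on X itself only demonstrates the mapping; it is not an evaluation.)
stars = np.clip(np.rint(reg.predict(X)), 1, 5).astype(int)
print(stars[:10])
```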

Answer 3

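I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you, so I am going to cover different topics, bear with me ;).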

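Class weights

The weights from the class_weight parameter are used to train the classifier. They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will differ simply because the classifier is different.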

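Basically, in every scikit-learn classifier the class weights tell your model how important a class is. That means that during training, the classifier will make extra effort to correctly classify the classes with high weights.
How they do that is algorithm-specific. If you want details about how it works for SVC and the documentation does not make sense to you, feel free to mention it.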
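As a rough illustration, the sketch below derives weights from the label counts in the question using the `n_samples / (n_classes * count)` heuristic that scikit-learn documents for `class_weight='balanced'`. As far as I know, 'balanced' is the value to use in newer releases in place of the 'auto' shown in the question. The variable names are only for this example.

```python
from sklearn.svm import SVC

# Label counts taken from the question.
counts = {5: 1190, 4: 838, 3: 239, 1: 204, 2: 127}
n_samples = sum(counts.values())
n_classes = len(counts)

# Rare classes get a weight above 1, frequent classes a weight below 1.
weights = {label: n_samples / float(n_classes * count)
           for label, count in counts.items()}

for label in sorted(weights):
    print(label, round(weights[label], 3))

# Either pass the dict explicitly, or let scikit-learn derive it from y at fit time.
wclf = SVC(kernel='linear', C=1, class_weight=weights)
balanced_clf = SVC(kernel='linear', C=1, class_weight='balanced')
```

These weights only change how the classifier is fitted; the metric functions never see them.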

The metrics

Once you have a classifier, you want to know how well it is performing. Here you can use the metrics you mentioned: accuracy, recall_score, f1_score...

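Usually, when the class distribution is unbalanced, accuracy is considered a poor choice, as it gives high scores to models that just predict the most frequent class.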

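I will not detail all these metrics, but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this printout of a classification report, they are defined for each class. They rely on concepts such as true positives or false negatives, which require defining which class is the positive one.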

             precision    recall  f1-score   support

          0       0.65      1.00      0.79        17
          1       0.57      0.75      0.65        16
          2       0.33      0.06      0.10        17
avg / total       0.52      0.60      0.51        50

The warning

F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The 
default `weighted` averaging is deprecated, and from version 0.18, 
use of precision, recall or F-score with multiclass or multilabel data  
or pos_label=None will result in an exception. Please set an explicit 
value for `average`, one of (None, 'micro', 'macro', 'weighted', 
'samples'). In cross validation use, for instance, 
scoring="f1_weighted" instead of scoring="f1".

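You get this warning because you are using the f1-score, recall and precision without defining how they should be computed! The question could be rephrased: from the above classification report, how do you output one global number for the f1-score? You could:

Take the average of the f1-score for each class. This is also called macro averaging. (The avg / total row above is very close to it here only because the three supports are nearly equal; that row is weighted by support.)
Compute the f1-score using the global count of true positives / false negatives, etc. (you sum the number of true positives / false negatives over all classes). Also known as micro averaging.
Compute a weighted average of the f1-score. Using 'weighted' in scikit-learn weighs the f1-score by the support of the class: the more elements a class has, the more important the f1-score for this class is in the computation.

These are 3 of the options in scikit-learn, and the warning is there to say you have to pick one. So you have to specify an average argument for the score method.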
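The three options can be reproduced by hand from the per-class rows of the classification report shown earlier. This is only a sketch in plain Python: the report is rounded to two decimals, so the true-positive counts are recovered by rounding `recall * support`.

```python
# Per-class rows copied from the classification report:
# label -> (precision, recall, f1, support)
report = {
    0: (0.65, 1.00, 0.79, 17),
    1: (0.57, 0.75, 0.65, 16),
    2: (0.33, 0.06, 0.10, 17),
}
total = sum(row[3] for row in report.values())

# Macro: plain mean of the per-class F1 scores.
macro_f1 = sum(row[2] for row in report.values()) / len(report)

# Weighted: mean of the per-class F1 scores, weighted by support.
weighted_f1 = sum(row[2] * row[3] for row in report.values()) / float(total)

# Micro: pool the counts over all classes first.
true_positives = sum(int(round(row[1] * row[3])) for row in report.values())
micro_f1 = true_positives / float(total)

print('macro:', macro_f1)
print('weighted:', weighted_f1)
print('micro:', micro_f1)
```

For single-label multiclass data, every false positive for one class is a false negative for another, so micro-averaged precision, recall and F1 all equal the accuracy (30 correct out of 50 here).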

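Which one you choose depends on how you want to measure the performance of the classifier: for instance, macro averaging does not take class imbalance into account, so the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging, however, class 5 will carry more importance.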

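The whole argument specification for these metrics is not super clear in scikit-learn right now; according to the docs, it will get better in version 0.18. They are removing some non-obvious default behavior, and they are issuing warnings so that developers notice it.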

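Computing scores

The last thing I want to mention (feel free to skip it if you are aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen. This is extremely important, as any score you get on data that was used to fit the classifier is completely irrelevant.

Here is a way to do it using StratifiedShuffleSplit, which gives you random splits of your data (after shuffling) that preserve the label distribution.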

from sklearn.datasets import make_classification
# StratifiedShuffleSplit lives in sklearn.model_selection since 0.18
# (older releases: sklearn.cross_validation, with a different signature).
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.svm import SVC

# We use a utility to generate artificial classification data.
X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)
# The original snippet never defined svc; a linear SVC as in the question is assumed.
svc = SVC(kernel='linear', C=1)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    print(f1_score(y_test, y_pred, average="macro"))
    print(precision_score(y_test, y_pred, average="macro"))
    print(recall_score(y_test, y_pred, average="macro"))
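To see why a score on the fitting data says so little, here is a small sketch with purely random features and labels (so there is nothing to learn) and a 1-nearest-neighbour classifier, which simply memorizes its training set. The data and the choice of classifier are assumptions made for the illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Random features with random labels: no real signal.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = rng.randint(0, 3, size=200)
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

# Each training point is its own nearest neighbour, so it is recalled perfectly.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = accuracy_score(y_train, knn.predict(X_train))
test_acc = accuracy_score(y_test, knn.predict(X_test))

print('train accuracy:', train_acc)  # perfect, by memorization
print('test accuracy:', test_acc)    # should hover around chance level
```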

Hope this helps.

Answer 4

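I see a lot of very detailed answers here, but I don't think you are answering the right questions. As I understand the question, there are two concerns:

How do I score a multiclass problem?
How do I deal with unbalanced data?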

1.

You can use most of the scoring functions in scikit-learn with multiclass problems as well as with single-class problems. For example:

from sklearn.metrics import precision_recall_fscore_support as score

predicted = [1,2,3,4,5,1,2,1,1,4,5] 
y_test = [1,2,3,4,5,1,2,1,1,4,1]

precision, recall, fscore, support = score(y_test, predicted)

print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))

This way you end up with tangible and interpretable numbers for each of the classes.

| Label | Precision | Recall | FScore | Support |
|-------|-----------|--------|--------|---------|
| 1     | 94%       | 83%    | 0.88   | 204     |
| 2     | 71%       | 50%    | 0.54   | 127     |
| ...   | ...       | ...    | ...    | ...     |
| 4     | 80%       | 98%    | 0.89   | 838     |
| 5     | 93%       | 81%    | 0.91   | 1190    |

Then...

2.

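... you can tell whether the unbalanced data is even a problem. If the scores for the less represented classes (class 1 and 2) are lower than for the classes with more training samples (class 4 and 5), then you know that the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread. However, if the same class distribution is present in the data you want to predict on, your unbalanced training data is a good representation of that data, and hence the imbalance is a good thing.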
