How does Scikit Learn compute f1_macro for multiclass classification?
I assumed that f1_macro for a multiclass problem in scikit-learn would be computed as:
2 * Macro_precision * Macro_recall / (Macro_precision + Macro_recall)
However, a manual check shows that this is not the case: the value it gives is slightly higher than the one scikit computes. I went through the documentation but could not find the formula.
For example, the iris dataset produces results like this:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
data=pd.DataFrame({
'sepal length':iris.data[:,0],
'sepal width':iris.data[:,1],
'petal length':iris.data[:,2],
'petal width':iris.data[:,3],
'species':iris.target
})
X=data[['sepal length', 'sepal width', 'petal length', 'petal width']]
y=data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
#Compute metrics using scikit
from sklearn import metrics
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))
pre_macro = metrics.precision_score(y_test, y_pred, average="macro")
recall_macro = metrics.recall_score(y_test, y_pred, average="macro")
f1_macro_scikit = metrics.f1_score(y_test, y_pred, average="macro")
print ("Prec_macro_scikit:", pre_macro)
print ("Rec_macro_scikit:", recall_macro)
print ("f1_macro_scikit:", f1_macro_scikit)
Output:
Prec_macro_scikit: 0.9555555555555556
Rec_macro_scikit: 0.9666666666666667
f1_macro_scikit: 0.9586466165413534
However, a manual calculation using:
f1_macro_manual = 2 * pre_macro * recall_macro / (pre_macro + recall_macro )
yields:
f1_macro_manual: 0.9610789980732178
I am trying to figure out where the difference comes from.
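For reference, plugging the printed macro precision and recall back into the harmonic-mean formula reproduces the manual number exactly, so the gap is not an arithmetic slip:
pre_macro = 0.9555555555555556
recall_macro = 0.9666666666666667
print(2 * pre_macro * recall_macro / (pre_macro + recall_macro))  # 0.9610789980732178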
Final update:
Thanks to user2357112's very valuable comments (see also his/her answer below), and after reading a fair amount of misconception and misinformation on the web, I finally had to do some investigation of my own into the formula behind the macro-type f1-score.
As user2357112 revealed below (and actually first), the algorithm behind f1_macro is slightly different from the one used in the manual calculation above.
In the end I found a reliable source proving that sklearn uses this formula. Here is a snippet from the precision_recall_fscore_support() function in sklearn's classification.py module:
precision = _prf_divide(tp_sum, pred_sum,
                        'precision', 'predicted', average, warn_for)
recall = _prf_divide(tp_sum, true_sum,
                     'recall', 'true', average, warn_for)
# Don't need to warn for F: either P or R warned, or tp == 0 where pos
# and true are nonzero, in which case, F is well-defined and zero
f_score = ((1 + beta2) * precision * recall /
           (beta2 * precision + recall))
f_score[tp_sum == 0] = 0.0

# Average the results
if average == 'weighted':
    weights = true_sum
    if weights.sum() == 0:
        return 0, 0, 0, None
elif average == 'samples':
    weights = sample_weight
else:
    weights = None

if average is not None:
    assert average != 'binary' or len(precision) == 1
    precision = np.average(precision, weights=weights)
    recall = np.average(recall, weights=weights)
    f_score = np.average(f_score, weights=weights)
    true_sum = None  # return no support

return precision, recall, f_score, true_sum
As we can see, sklearn averages the already computed per-class f-scores at the very end; the macro-averaged precision and recall are never plugged back into the f-score formula:
precision = np.average(precision, weights=weights)
recall = np.average(recall, weights=weights)
f_score = np.average(f_score, weights=weights)
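To see concretely why the order of averaging matters, here is a minimal sketch with made-up two-class precision and recall values (hypothetical numbers, not taken from the iris run above):
import numpy as np

precision = np.array([1.0, 0.5])  # per-class precision (hypothetical)
recall = np.array([0.5, 1.0])     # per-class recall (hypothetical)

# sklearn's way: compute the per-class f1 first, then take the unweighted mean
f1_per_class = 2 * precision * recall / (precision + recall)
print(f1_per_class.mean())        # 0.666...

# harmonic mean of the macro-averaged precision and recall (the "manual" formula)
p_macro, r_macro = precision.mean(), recall.mean()
print(2 * p_macro * r_macro / (p_macro + r_macro))  # 0.75 -- a different number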
Finally, here is your code, slightly modified:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
data=pd.DataFrame({
'sepal length':iris.data[:,0],
'sepal width':iris.data[:,1],
'petal length':iris.data[:,2],
'petal width':iris.data[:,3],
'species':iris.target
})
X=data[['sepal length', 'sepal width', 'petal length', 'petal width']]
y=data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
#Compute metrics using scikit
from sklearn import metrics
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))
pre_macro = metrics.precision_score(y_test, y_pred, average="macro")
recall_macro = metrics.recall_score(y_test, y_pred, average="macro")
f1_macro_scikit = metrics.f1_score(y_test, y_pred, average="macro")
f1_score_raw = metrics.f1_score(y_test, y_pred, average=None)
f1_macro_manual = f1_score_raw.mean()
print ("Prec_macro_scikit:", pre_macro)
print ("Rec_macro_scikit:", recall_macro)
print ("f1_macro_scikit:", f1_macro_scikit)
print("f1_score_raw:", f1_score_raw)
print("f1_macro_manual:", f1_macro_manual)
Output:
[[16  0  0]
 [ 0 15  0]
 [ 0  6  8]]
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        16
          1       0.71      1.00      0.83        15
          2       1.00      0.57      0.73        14

avg / total       0.90      0.87      0.86        45
Prec_macro_scikit: 0.9047619047619048
Rec_macro_scikit: 0.8571428571428571
f1_macro_scikit: 0.8535353535353535
f1_score_raw: [1. 0.83333333 0.72727273]
f1_macro_manual: 0.8535353535353535
Or you can do a "manual calculation" the way you did:
import numpy as np
pre = metrics.precision_score(y_test, y_pred, average=None)
recall = metrics.recall_score(y_test, y_pred, average=None)
f1_macro_manual = 2 * pre * recall / (pre + recall )
f1_macro_manual = np.average(f1_macro_manual)
print("f1_macro_manual_2:", f1_macro_manual)
Output:
f1_macro_manual_2: 0.8535353535353535
Macro-averaging does not work that way. The macro-averaged f1 score is not computed from the macro-averaged precision and recall values.
Macro-averaging computes the value of a metric for each class and returns an unweighted average of the individual values. Thus, f1_score with average='macro' computes the f1 score for each class and returns the average of those scores.
If you want to compute the macro average yourself, specify average=None to get an array of per-class binary f1 scores, then take that array's mean():
binary_scores = metrics.f1_score(y_test, y_pred, average=None)
manual_f1_macro = binary_scores.mean()
Runnable demo here.
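As an extra sanity check, here is a sketch that recomputes the macro f1 directly from the confusion matrix. It assumes the y_test and y_pred from the code above are still in scope and that every class has at least one true and one predicted sample (otherwise the divisions below would hit zero):
import numpy as np
from sklearn import metrics

cm = metrics.confusion_matrix(y_test, y_pred)
tp = np.diag(cm).astype(float)                # true positives per class
precision_per_class = tp / cm.sum(axis=0)     # column sums = predicted counts
recall_per_class = tp / cm.sum(axis=1)        # row sums = true counts
f1_per_class = 2 * precision_per_class * recall_per_class / (precision_per_class + recall_per_class)
print(f1_per_class.mean())
print(metrics.f1_score(y_test, y_pred, average='macro'))  # should match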