My sklearn_crfsuite model does not learn anything
I am trying to build an annotation-prediction model following the tutorial here, but my model does not learn anything.
Here is a sample of my training data and labels:
```python
[{'bias': 1.0,
  'word.lower()': '\nreference\nissue\ndate\ndgt86620\n4\n \n19-dec-05\nfalcon\n7x\ntype\ncertification\n27_4-100\nthis\ndocument\nis\nthe\nintellectual\nprop...nairbrakes\nhandle\nposition\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n0\ntable\n1\n:\nairbrake\ncas\nmessages\n',
  'word[-3:]': 'es\n', 'word[-2:]': 's\n',
  'word.isupper()': False, 'word.istitle()': False, 'word.isdigit()': False,
  'postag': 'POS', 'postag[:2]': 'PO',
  'w_emb_0': 0.03418987928976114, 'w_emb_1': 0.6173382811066742,
  'w_emb_2': 0.004420982990809508, 'w_emb_3': 0.08293022662242588,
  'w_emb_4': 0.22162269482070363, 'w_emb_5': 0.4334545347397811,
  'w_emb_6': 0.7844891779932379, 'w_emb_7': 0.028043262790094503,
  'w_emb_8': 0.5233847386564157, 'w_emb_9': 0.9685677133128328,
  'w_emb_10': 0.19379126558708126, 'w_emb_11': 0.2809608896964926,
  'w_emb_12': 0.384759230815804, 'w_emb_13': 0.15385904662767336,
  'w_emb_14': 0.5206500040610533, 'w_emb_15': 0.009148526006733215,
  'w_emb_16': 0.5894118695171416, 'w_emb_17': 0.7356989708459056,
  'w_emb_18': 0.5576774100159024, 'w_emb_19': 0.2185294430010376,
  'BOS': True,
  '+1:word.lower()': 'reference', '+1:word.istitle()': False,
  '+1:word.isupper()': True, '+1:postag': 'POS', '+1:postag[:2]': 'PO'},
 {'bias': 1.0, 'word.lower()': 'reference',
  'word[-3:]': 'NCE', 'word[-2:]': 'CE',
  'word.isupper()': True, 'word.istitle()': False, 'word.isdigit()': False,
  'postag': 'POS', 'postag[:2]': 'PO',
  'w_emb_0': -0.390038, 'w_emb_1': 0.30677223, 'w_emb_2': -1.010975,
  'w_emb_3': 0.3656154, 'w_emb_4': 0.5319459, 'w_emb_5': 0.45572615,
  'w_emb_6': -0.46090943, 'w_emb_7': 0.87250936, 'w_emb_8': 0.036648277,
  'w_emb_9': -0.3057043, 'w_emb_10': 0.33427167, 'w_emb_11': -0.19664396,
  'w_emb_12': -0.64899784, 'w_emb_13': -0.1785065, 'w_emb_14': -0.117423356,
  'w_emb_15': 0.16247013, 'w_emb_16': 0.11694676, 'w_emb_17': -0.30693895,
  'w_emb_18': -1.0026807, 'w_emb_19': 0.9946743,
  '-1:word.lower()': '\nreference...n \n \n \n \n \n \n \n \n0\ntable\n1\n:\nairbrake\ncas\nmessages\n',
  '-1:word.istitle()': False, '-1:word.isupper()': False,
  '-1:postag': 'POS', '-1:postag[:2]': 'PO',
  '+1:word.lower()': 'issue', '+1:word.istitle()': False,
  '+1:word.isupper()': True, '+1:postag': 'POS', '+1:postag[:2]': 'PO'},
 {'bias': 1.0, 'word.lower()': 'issue',
  'word[-3:]': 'SUE', 'word[-2:]': 'UE',
  'word.isupper()': True, 'word.istitle()': False, 'word.isdigit()': False,
  'postag': 'POS', 'postag[:2]': 'PO',
  'w_emb_0': -1.2204882, 'w_emb_1': 0.8920707, 'w_emb_2': -3.8380668,
  'w_emb_3': 1.5641377, 'w_emb_4': 2.1918254, 'w_emb_5': 1.8509868,
  'w_emb_6': -2.0664182, 'w_emb_7': 3.1591077, 'w_emb_8': -0.33126026,
  'w_emb_9': -1.4278139, 'w_emb_10': 0.9291533, 'w_emb_11': -0.6761407,
  'w_emb_12': -2.9582167, 'w_emb_13': -0.5395561, 'w_emb_14': -0.8363763,
  'w_emb_15': 0.25568742, 'w_emb_16': 0.4932978, 'w_emb_17': -1.6198335,
  'w_emb_18': -4.183924, 'w_emb_19': 4.281094,
  '-1:word.lower()': 'reference', '-1:word.istitle()': False,
  '-1:word.isupper()': True, '-1:postag': 'POS', '-1:postag[:2]': 'PO',
  '+1:word.lower()': 'date', '+1:word.istitle()': False,
  '+1:word.isupper()': True, '+1:postag': 'POS', '+1:postag[:2]': 'PO'}, ...]

y_train = ['O', 'O', 'O', ..., 'I-data-c-a-s_message-type', ..., 'B-data-c-a-s_message-type']
```
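These dicts follow the tutorial's word2features pattern. For context, here is a minimal sketch of how such features are typically built; `emb()` is a hypothetical stand-in for whatever embedding lookup produced the `w_emb_*` values:

```python
def word2features(sent, i):
    # sent is a list of (word, postag) pairs; emb(word) is a hypothetical
    # lookup returning the 20-dimensional embedding behind w_emb_0..w_emb_19.
    word, postag = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    features.update({f'w_emb_{k}': v for k, v in enumerate(emb(word))})
    if i > 0:
        prev_word, prev_postag = sent[i - 1][0], sent[i - 1][1]
        features.update({
            '-1:word.lower()': prev_word.lower(),
            '-1:word.istitle()': prev_word.istitle(),
            '-1:word.isupper()': prev_word.isupper(),
            '-1:postag': prev_postag,
            '-1:postag[:2]': prev_postag[:2],
        })
    else:
        features['BOS'] = True  # beginning of sequence, as in the first dict above
    if i < len(sent) - 1:
        next_word, next_postag = sent[i + 1][0], sent[i + 1][1]
        features.update({
            '+1:word.lower()': next_word.lower(),
            '+1:word.istitle()': next_word.istitle(),
            '+1:word.isupper()': next_word.isupper(),
            '+1:postag': next_postag,
            '+1:postag[:2]': next_postag[:2],
        })
    else:
        features['EOS'] = True
    return features
```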
Here is the model definition and training:
```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics

# Note: sklearn_crfsuite expects X_train to be a list of sequences
# (each a list of feature dicts) and y_train a matching list of label lists.
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,   # L1 regularization strength
    c2=0.1,   # L2 regularization strength
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
y_pred = crf.predict(X_test)

# Group B-/I- variants of the same entity type together in the report.
sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
msg = metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels, digits=4)
print(msg)
```
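`labels` is not defined in the snippet above; in the sklearn-crfsuite tutorial it is typically derived from the fitted model, dropping the dominant 'O' tag so it does not swamp the entity metrics:

```python
# Take the label set from the fitted CRF and exclude 'O',
# as done in the sklearn-crfsuite tutorial.
labels = list(crf.classes_)
labels.remove('O')
```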
Unfortunately, my model does not learn anything:
```
                           precision    recall  f1-score   support

B-data-c-a-s_message-type     0.0000    0.0000    0.0000        23
I-data-c-a-s_message-type     0.0000    0.0000    0.0000        90

                micro avg     0.0000    0.0000    0.0000       113
                macro avg     0.0000    0.0000    0.0000       113
             weighted avg     0.0000    0.0000    0.0000       113
```
Problem solved.
As you can see above, the support (the number of evaluation samples) totals 113. However, the training set contained only about 14 samples, which is far too small. I simply had not noticed the discrepancy: my training and test sets had been swapped. After swapping them back, the performance looks like this:
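In hindsight, a quick size check before fitting would have caught the swap immediately; a minimal sketch, assuming y_train and y_test are lists of label sequences as sklearn_crfsuite expects:

```python
# Count tokens on each side of the split; a test set roughly 8x larger
# than the training set is a strong hint the two were swapped.
n_train = sum(len(seq) for seq in y_train)
n_test = sum(len(seq) for seq in y_test)
print(f'training tokens: {n_train}, test tokens: {n_test}')
```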
```
                           precision    recall  f1-score   support

B-data-c-a-s_message-type     0.0000    0.0000    0.0000         0
I-data-c-a-s_message-type     0.6364    1.0000    0.7778        14

                micro avg     0.6364    1.0000    0.7778        14
                macro avg     0.3182    0.5000    0.3889        14
             weighted avg     0.6364    1.0000    0.7778        14
```