How to load unlabelled data for sentiment classification after training an SVM model?

Question

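I am trying to do sentiment classification with an sklearn SVM model. I trained the model on labeled data and got 89% accuracy. Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? And once the unlabeled data has been classified, how do I see whether each item was classified as positive or negative?

I am using Python 3.7. The code is below.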

import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)

train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics


clf = Pipeline([
    ('vectorizer', CountVectorizer(analyzer="word",
                                   tokenizer=word_tokenize,
                                   preprocessor=lambda text: text.replace("<br />", " "),
                                   max_features=None)),
    ('classifier', LinearSVC())
])

clf.fit(train_x, train_y)
pred_y = clf.predict(test_x)
print("Accuracy : ", metrics.accuracy_score(test_y, pred_y))
print("Precision : ", metrics.precision_score(test_y, pred_y))
print("Recall : ", metrics.recall_score(test_y, pred_y))

When I run this code, I get the output:

ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
Accuracy : 0.8977272727272727
Precision : 0.8604651162790697
Recall : 0.925

What is the meaning of ConvergenceWarning?

Thanks in advance!

Answer 1

Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?

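Basically, you assemble the unlabeled data the same way you built train_x or test_x. Here that means a one-dimensional sequence of n_samples raw text strings, because the CountVectorizer step of the pipeline does the vectorizing; you then pass it to clf.predict to get the predictions. clf.predict outputs the most likely class. In your case 0 is probably the negative class and 1 the positive class, but that is hard to tell without the dataset.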
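
As a rough illustration of the point above, here is a minimal sketch that fits the same kind of pipeline and predicts on unlabeled text. The toy sentences and the 1 = positive / 0 = negative encoding are assumptions made for the example, and the default CountVectorizer tokenizer is used instead of word_tokenize to avoid the nltk dependency.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data; 1 = positive, 0 = negative (an assumption).
train_x = [
    "great movie, loved it",
    "wonderful and great acting",
    "loved the wonderful story",
    "terrible movie, hated it",
    "awful and terrible acting",
    "hated the awful story",
]
train_y = [1, 1, 1, 0, 0, 0]

clf = Pipeline([
    ("vectorizer", CountVectorizer()),  # default tokenizer, not word_tokenize
    ("classifier", LinearSVC()),
])
clf.fit(train_x, train_y)

# Unlabeled data: a plain 1-D sequence of raw strings, the same form as train_x.
unlabeled_x = ["great wonderful story", "awful terrible movie"]
pred_y = clf.predict(unlabeled_x)

# classes_ lists the label values the model learned from train_y.
print("known classes:", clf.classes_)
for text, label in zip(unlabeled_x, pred_y):
    print(label, "<-", text)
```

The predicted values are always drawn from clf.classes_, so whatever the labels meant in the training file is what they mean in the predictions.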

What is the meaning of ConvergenceWarning?

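The LinearSVC model is optimized with an iterative algorithm. The max_iter parameter (1000 by default) caps the number of iterations. If the stopping criterion is not met within that limit, you get a ConvergenceWarning. As long as your accuracy or other metrics are acceptable, it shouldn't bother you too much.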
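
If the warning does bother you, raising max_iter is the usual first thing to try. The sketch below uses synthetic data from make_classification as a stand-in (an assumption, not the asker's articles), and fit_and_report is a hypothetical helper that only checks whether a ConvergenceWarning was raised for a given max_iter.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; not the asker's dataset).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)


def fit_and_report(max_iter):
    """Fit LinearSVC and report whether a ConvergenceWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is swallowed
        model = LinearSVC(max_iter=max_iter, random_state=0).fit(X, y)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, warned


# A deliberately tiny iteration budget versus a generous one.
_, warned_low = fit_and_report(max_iter=1)
model, warned_high = fit_and_report(max_iter=100000)

print("ConvergenceWarning with max_iter=1:", warned_low)
print("ConvergenceWarning with max_iter=100000:", warned_high)
```

Scaling the features can also help the solver converge in fewer iterations, though with raw word counts that is a judgment call.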

Answer 2

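Check the documentation on model persistence. Then you just load the model and call its predict method. The model will return the predicted labels. If you used any encoder (LabelEncoder, OneHotEncoder), you need to dump and load it separately.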
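
A minimal sketch of that dump-and-load round trip with joblib follows. The toy data, the file name sentiment_clf.joblib and the helper strip_br are assumptions for illustration. One thing to watch for: the pipeline in the question uses a lambda as preprocessor, and lambdas cannot be pickled, so the sketch swaps in a named module-level function.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def strip_br(text):
    """Named replacement for the question's lambda, so the pipeline can be pickled."""
    return text.replace("<br />", " ")


# Hypothetical toy training data; 1 = positive, 0 = negative (an assumption).
train_x = [
    "great movie, loved it",
    "wonderful and great acting",
    "terrible movie, hated it",
    "awful and terrible acting",
]
train_y = [1, 1, 0, 0]

clf = Pipeline([
    ("vectorizer", CountVectorizer(preprocessor=strip_br)),
    ("classifier", LinearSVC()),
]).fit(train_x, train_y)

# Save the whole fitted pipeline (vectorizer vocabulary included) to disk.
model_path = os.path.join(tempfile.mkdtemp(), "sentiment_clf.joblib")
joblib.dump(clf, model_path)

# Later: load it back and predict on new raw text.
loaded_clf = joblib.load(model_path)
new_texts = ["wonderful movie", "awful acting"]
print(loaded_clf.predict(new_texts))
```

When loading in a different script, strip_br has to be importable under the same module path it had when the model was saved, because pickle stores functions by reference.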

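If I were you, I would rather take a fully data-driven approach and use a pretrained embedder. It also works out of the box for dozens of languages and is quite neat.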

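There's LASER from Facebook. There's also a PyPI package for it, though it is unofficial. It works just fine. There are lots of pretrained models nowadays, so it shouldn't be hard to reach scores close to the state of the art.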

Answer 3

What is the meaning of ConvergenceWarning?

As Pavel already mentioned, ConvergenceWarning means that max_iter was hit. You can suppress the warning as described here: How to disable ConvergenceWarning using sklearn?
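
For reference, one way to silence it is a warnings filter on sklearn.exceptions.ConvergenceWarning. This is only a sketch on synthetic stand-in data (an assumption), with max_iter=1 chosen purely to provoke the warning; the catch_warnings block is there so the example can count what got through.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; not the asker's dataset).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning by default ...
    # ... except ConvergenceWarning, which this filter drops.
    warnings.filterwarnings("ignore", category=ConvergenceWarning)
    LinearSVC(max_iter=1, random_state=0).fit(X, y)

n_convergence = sum(issubclass(w.category, ConvergenceWarning) for w in caught)
print("ConvergenceWarnings that got through:", n_convergence)
```

In a normal script the single filterwarnings call near the top is enough. Keep in mind that hiding the warning does not make the solver converge.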

Now I want to use the model to predict the sentiment of unlabeled data. How can I do that?

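You do it with the same command: pred_y = clf.predict(test_x). The only things you change are pred_y (the name is your free choice) and test_x, which should be your new, unseen data. It must have the same form as your test_x and train_x data; because the pipeline does the vectorizing, that means raw article text, which the fitted vectorizer turns into the same number of features.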

In your case:

sentiment_data = list(zip(data['Articles'], data['Sentiment']))

you are forming a list of tuples, then shuffling it and unzipping the first 350 rows:

train_x, train_y = zip(*sentiment_data[:350])

Here your train_x is the column data['Articles'], so if you have new data, all you have to do is:

new_data = pd.read_csv("new_data.csv", header=0)
new_y = clf.predict(new_data['Articles'])

how to see whether it is classified as positive or negative?

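You can then simply inspect the predictions (pred_y, or new_y in the snippet above); each result will be 1 or 0. Usually 0 is the negative class, but that depends on your dataset.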
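
To make the results easier to read, the predictions can be written back next to the articles. The sketch below assumes an in-memory DataFrame in place of new_data.csv, toy training sentences, and a 0 = negative / 1 = positive encoding; the column names Predicted and PredictedLabel are made up for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data; 1 = positive, 0 = negative (an assumption).
train_x = [
    "great movie, loved it",
    "wonderful and great acting",
    "loved the wonderful story",
    "terrible movie, hated it",
    "awful and terrible acting",
    "hated the awful story",
]
train_y = [1, 1, 1, 0, 0, 0]

clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LinearSVC()),
]).fit(train_x, train_y)

# Stand-in for pd.read_csv("new_data.csv", header=0).
new_data = pd.DataFrame({
    "Articles": ["loved the great acting", "hated the terrible story"],
})

# Store the raw prediction and a readable label beside each article.
new_data["Predicted"] = clf.predict(new_data["Articles"])
label_names = {0: "negative", 1: "positive"}  # assumed encoding, check your data
new_data["PredictedLabel"] = new_data["Predicted"].map(label_names)

print(new_data)
print(new_data["PredictedLabel"].value_counts())
```

The label_names mapping has to match how the Sentiment column was encoded in the training file.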

machine-learning

svm

sentiment-analysis

sklearn-pandas

python-3.7