Classification of data where attribute values are strings

Question

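I have a labeled data set with 7 attributes and about 80,000 rows. However, 3 of the attributes have more than 50% missing data. I filtered the data to drop every row containing a null value, which left me with about 30,000 complete rows. Each attribute value is a string of the form "this is the value of an instance of attribute i." The desired output (label) is binary (0 or 1), and every instance has an associated label. I want to train a classifier to predict the desired output on a test set. I am using Python and sklearn, and have been trying to work out how to extract features from this data set. Any advice would be greatly appreciated. Thanks.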

Answer 1

Scikit-learn has several tools designed specifically for extracting features from text input; see the Text Feature Extraction section of the documentation.

Here is an example of a classifier built from lists of strings:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

data = [['this is about dogs', 'dogs are really great'],
        ['this is about cats', 'cats are evil']]
labels = ['dogs',
          'cats']

vec = CountVectorizer()  # count word occurrences
X = vec.fit_transform([' '.join(row) for row in data])

clf = MultinomialNB()  # very simple model for word counts
clf.fit(X, labels)

new_data = ['this is about cats too', 'I think cats are awesome']
new_X = vec.transform([' '.join(new_data)])

print(clf.predict(new_X))
# ['cats']
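
The example above joins all strings in a row into one document. For a table like the one in the question, where each attribute is its own string column, one possible variation is to vectorize every column separately and concatenate the results. The following is only a sketch under assumptions: the column names (attr1, attr2), the toy rows, and the choice of TfidfVectorizer with LogisticRegression are illustrative and not taken from the question. It also fills missing strings with "" instead of dropping incomplete rows, which may or may not suit the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy frame standing in for the real data set
df = pd.DataFrame({
    "attr1": ["this is about dogs", "this is about cats", "dogs again", None],
    "attr2": ["dogs are really great", None, "dogs bark loudly", "cats are evil"],
    "label": [1, 0, 1, 0],
})

text_cols = ["attr1", "attr2"]

# Assumption: treat a missing string as an empty document so the row is kept
features = df[text_cols].fillna("")

# One vectorizer per column. Passing the column name as a plain string
# (not a list) hands the vectorizer a 1-D sequence of strings.
per_column = ColumnTransformer(
    [(col, TfidfVectorizer(), col) for col in text_cols]
)

model = make_pipeline(per_column, LogisticRegression())
model.fit(features, df["label"])

# New rows need the same columns; an empty string stands in for a missing value
new_rows = pd.DataFrame({"attr1": ["this is about cats"], "attr2": [""]})
pred = model.predict(new_rows)
print(pred)
```

Keeping the columns separate lets the same word carry a different weight depending on which attribute it appears in, at the cost of a larger feature space. With only a handful of toy rows the prediction itself is not meaningful; on the real data it would be worth comparing this against the simpler joined-string approach with cross-validation.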
