Approach where features are a combination of text (labels) and numerical data

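I'm trying to figure out a good approach for a dataset that contains text which is really more like labels, along with numerical data. For example, the dataset has city, state, and lat/lon, and I want to classify it. This is supervised; I have labels (y) for the data.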

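So in this case the text really isn't a bag of words or anything like that. It is really just a label, more like 0, 1, ... However, I don't think I want the algorithm to treat these as real numeric values. I have tried a few different algorithms, including svm.SVC, LinearSVC, and DecisionTree. For the SVMs, I converted city and state to numeric values using a few different methods, including LabelEncoder. But that seems intuitively wrong, and I'm not happy with the scores.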

Any thoughts or comments would be much appreciated.

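You seem to be looking for OneHotEncoder. For an explanation, take a look at the Encoding categorical features section of the docs. The idea is that you make a column for each city, holding a 0/1 value that indicates whether the sample belongs to that city. This way no artificial ordering between cities is implied, unlike with integer codes from LabelEncoder. You might also be interested in DictVectorizer.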
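As a rough sketch of that idea, something like the following should work. The city/state values, coordinates, and labels are made up for illustration, and scaling lat/lon with StandardScaler is my own assumption rather than something the question requires. The categorical columns are one-hot encoded and stacked next to the numeric columns before fitting LinearSVC; the DictVectorizer variant at the end builds a comparable matrix from a list of dicts.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical toy data: categorical columns (city, state) and numeric columns (lat, lon).
cats = np.array([
    ["Austin", "TX"], ["Dallas", "TX"], ["Austin", "TX"],
    ["Portland", "OR"], ["Salem", "OR"], ["Portland", "OR"],
])
nums = np.array([
    [30.27, -97.74], [32.78, -96.80], [30.25, -97.75],
    [45.52, -122.68], [44.94, -123.04], [45.50, -122.65],
])
y = np.array([0, 0, 0, 1, 1, 1])  # made-up class labels

# One 0/1 column per distinct city and per distinct state.
# handle_unknown="ignore" encodes unseen categories as all zeros at predict time.
encoder = OneHotEncoder(handle_unknown="ignore")
cat_onehot = encoder.fit_transform(cats).toarray()

# Scale the numeric columns (an assumption; SVMs tend to prefer scaled inputs).
scaler = StandardScaler()
num_scaled = scaler.fit_transform(nums)

# Columns: 4 city indicators + 2 state indicators + 2 scaled numeric features.
X = np.hstack([cat_onehot, num_scaled])
clf = LinearSVC().fit(X, y)

# New samples must go through the same fitted encoder and scaler.
new_cats = np.array([["Eugene", "OR"]])  # "Eugene" was not seen during fit
new_nums = np.array([[44.05, -123.09]])
X_new = np.hstack([
    encoder.transform(new_cats).toarray(),
    scaler.transform(new_nums),
])
prediction = clf.predict(X_new)

# Alternative: DictVectorizer one-hot encodes string values and
# passes numeric values through unchanged.
records = [
    {"city": c, "state": s, "lat": la, "lon": lo}
    for (c, s), (la, lo) in zip(cats.tolist(), nums.tolist())
]
vec = DictVectorizer(sparse=False)
X_dict = vec.fit_transform(records)
```

With many distinct cities the one-hot matrix gets wide, so in that case you would probably keep it sparse instead of calling `.toarray()`.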

python

scikit-learn