How to handle unknown number of values for a categorical feature?

Question

I have a pandas DataFrame that looks like this:

Text                  | Label

Some text             |   0
hellow bye what       |   1
...

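Each row is a data point. Label is a binary 0/1 value. The only feature is Text, which contains a set of words. I want to use the presence or absence of each word as a feature. For example, the features could be contains_some, contains_what, contains_hello, contains_bye, and so on. This is typical one-hot encoding.

However, I don't want to create that many features by hand, one for each word in the vocabulary (the vocabulary is small, so I am not worried about the feature set exploding). I just want to pass the list of words to TensorFlow as a single column and have it create a binary feature for each word in the vocabulary.

Does TensorFlow/Keras have an API for this?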

Answer 1

You can use scikit-learn for this. Try the following:

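import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# binary=True records presence (1) or absence (0) instead of counts
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(old_df['Text'])

# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
new_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
new_df['Label'] = old_df['Label']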

This should give you:

bye  hellow  some  text  what  Label
0    0       1     1     0     0
1    1       0     0     1     1

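CountVectorizer converts a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix. If binary=True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.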
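To make the effect of binary=True concrete, here is a minimal sketch on a made-up two-document corpus (the sample strings and variable names are assumptions, not taken from the question). It compares raw counts with the binary output and checks that both results are SciPy sparse matrices.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; "bye" appears twice in the first document.
docs = ["bye bye what", "some text"]

counts = CountVectorizer().fit_transform(docs)
flags = CountVectorizer(binary=True).fit_transform(docs)

# Both results are SciPy sparse matrices, one row per document.
is_sparse = sparse.issparse(counts) and sparse.issparse(flags)

# Raw counts keep the repetition; binary=True clips every non-zero count to 1.
max_count = int(counts.max())
max_flag = int(flags.max())
print(is_sparse, max_count, max_flag)
```

Calling .toarray() on either matrix gives a dense NumPy array, which is fine for a small vocabulary like the one described in the question.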

Answer 2

What you are looking for is a (binary) bag of words, which you can get from scikit-learn with its CountVectorizer.

You can do it like this:

from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(ngram_range=(1, 1), binary=True)

X_train = bow.fit_transform(df_train['text'].values)

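This creates a matrix of binary values indicating whether each word is present in each text. With binary=True, the output is 1 if the word is present and 0 otherwise. Without this parameter you get the number of occurrences of each word instead. Either approach works.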

To inspect the counts, you can use the following:

import pandas as pd

# Build a DataFrame from the BoW output of the first training row
count_vect_df = pd.DataFrame(X_train[:1].toarray(), columns=bow.get_feature_names_out())

# Sort columns by count, descending, and keep the top 10 for demo purposes
count_vect_df = count_vect_df[count_vect_df.iloc[-1, :].sort_values(ascending=False).index[:10]]

# Show the original text next to its BoW counts
pd.concat([df_train['text'][:1].reset_index(drop=True), count_vect_df], axis=1)

Update

If your feature contains categorical data, you can try to_categorical from tf.keras (tf.keras.utils.to_categorical). See the docs for details.
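As a small illustration of to_categorical, here is a sketch with made-up class ids (class_ids is a hypothetical array, not data from the question). Note that the function expects integer class indices, so string categories would first need to be mapped to integers, and that it produces one 1 per row rather than the multi-word encoding shown in the answers above.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# Hypothetical integer class ids for four samples drawn from three categories.
class_ids = np.array([0, 2, 1, 2])

# Each id becomes a row with a single 1 in the column for that class.
one_hot = to_categorical(class_ids, num_classes=3)
print(one_hot.shape)
print(one_hot)
```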

pandas

keras

tensorflow

tensorflow2.0