How can I one hot encode a list of strings with Keras?
I have a list:
code = ['<s>', 'are', 'defined', 'in', 'the', '"editable', 'parameters"', '\n', 'section.', '\n', 'A', 'larger', '`tsteps`', 'value', 'means', 'that', 'the', 'LSTM', 'will', 'need', 'more', 'memory', '\n', 'to', 'figure', 'out']
I want to convert it to a one-hot encoding. I tried:
to_categorical(code)
I get an error: ValueError: invalid literal for int() with base 10: '<s>'
What am I doing wrong?
Try converting it to a numpy array first:
from numpy import array
Then:
to_categorical(array(code))
Keras only supports one-hot encoding data that has already been integer-encoded. You can integer-encode your strings manually like this:
Manual encoding
# this integer encoding is purely based on position, you can do this in other ways
integer_mapping = {x: i for i,x in enumerate(code)}
vec = [integer_mapping[word] for word in code]
# vec is
# [0, 1, 2, 3, 16, 5, 6, 22, 8, 22, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
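The position-based mapping above leaves gaps in the index range (for example, no token maps to 4, 7, or 9, because repeated tokens keep the index of their last occurrence). As a minimal sketch, assuming a compact vocabulary is preferred, the mapping can instead be built from unique tokens in order of first appearance. The names `vocab`, `token_to_index`, and `one_hot` are illustrative and not from the original answer, and `np.eye` stands in here for `to_categorical`:

```python
import numpy as np

code = ['<s>', 'are', 'defined', 'in', 'the', '"editable', 'parameters"', '\n',
        'section.', '\n', 'A', 'larger', '`tsteps`', 'value', 'means', 'that',
        'the', 'LSTM', 'will', 'need', 'more', 'memory', '\n', 'to', 'figure', 'out']

# dict.fromkeys keeps first-appearance order and drops repeated tokens.
vocab = list(dict.fromkeys(code))
token_to_index = {token: i for i, token in enumerate(vocab)}

# Integer-encode every token with a gap-free index in range(len(vocab)).
indices = [token_to_index[token] for token in code]

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(vocab), dtype=np.float32)[indices]
```

With this mapping the one-hot matrix has one column per distinct token, rather than one column per position in the list.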
Using scikit-learn
from sklearn.preprocessing import LabelEncoder
import numpy as np
code = np.array(code)
label_encoder = LabelEncoder()
vec = label_encoder.fit_transform(code)
# array([ 2, 6, 7, 9, 19, 1, 16, 0, 17, 0, 3, 10, 5, 21, 11, 18, 19,
# 4, 22, 14, 13, 12, 0, 20, 8, 15])
You can now feed this into keras.utils.to_categorical:
from keras.utils import to_categorical
to_categorical(vec)
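Once the data is one-hot encoded, it is often useful to map rows back to the original strings. The following is a rough, NumPy-only sketch of that round trip. It is an illustration rather than the answer's method: `np.unique` stands in for `LabelEncoder` (both assign indices to the sorted unique values) and `np.eye` stands in for `to_categorical`.

```python
import numpy as np

code = ['<s>', 'are', 'defined', 'in', 'the', '"editable', 'parameters"', '\n',
        'section.', '\n', 'A', 'larger', '`tsteps`', 'value', 'means', 'that',
        'the', 'LSTM', 'will', 'need', 'more', 'memory', '\n', 'to', 'figure', 'out']

# classes: sorted unique tokens; vec: index of each token within classes.
classes, vec = np.unique(np.array(code), return_inverse=True)

# One-hot encode the integer labels (one row per token, one column per class).
one_hot = np.eye(len(classes), dtype=np.float32)[vec]

# Decode: the position of the 1 in each row is the class index.
decoded = classes[one_hot.argmax(axis=1)]
```

With scikit-learn, the equivalent decoding step would be `label_encoder.inverse_transform` applied to the `argmax` of each row.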
Use this instead:
pandas.get_dummies(y_train)
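As a small sketch of how this could look for the list from the question (here `code` takes the place of `y_train`, and the variable names `dummies` and `one_hot` are illustrative), `pandas.get_dummies` accepts the strings directly, so no separate integer-encoding step is needed:

```python
import pandas as pd

code = ['<s>', 'are', 'defined', 'in', 'the', '"editable', 'parameters"', '\n',
        'section.', '\n', 'A', 'larger', '`tsteps`', 'value', 'means', 'that',
        'the', 'LSTM', 'will', 'need', 'more', 'memory', '\n', 'to', 'figure', 'out']

# One indicator column per distinct token, one row per list element.
dummies = pd.get_dummies(pd.Series(code))

# Convert to a float array if the result is going to be fed to a model.
one_hot = dummies.to_numpy(dtype="float32")
```

The column labels of `dummies` are the tokens themselves, which makes it straightforward to see which column corresponds to which string.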
tf.keras.layers.CategoryEncoding
In TF 2.6.0, you can use tf.keras.layers.CategoryEncoding, tf.keras.layers.StringLookup, and tf.keras.layers.IntegerLookup to implement One Hot Encoding (OHE) or Multi Hot Encoding (MHE).
I don't think this approach is available in TF 2.4.x, so it must have been implemented later.
See "Classify structured data using Keras preprocessing layers" for an actual implementation.
from tensorflow.keras import layers

def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
    # Create a layer that turns strings into integer indices.
    if dtype == 'string':
        index = layers.StringLookup(max_tokens=max_tokens)
    # Otherwise, create a layer that turns integer values into integer indices.
    else:
        index = layers.IntegerLookup(max_tokens=max_tokens)
    # Prepare a `tf.data.Dataset` that only yields the feature.
    feature_ds = dataset.map(lambda x, y: x[name])
    # Learn the set of possible values and assign them a fixed integer index.
    index.adapt(feature_ds)
    # Encode the integer indices.
    encoder = layers.CategoryEncoding(num_tokens=index.vocabulary_size())
    # Apply multi-hot encoding to the indices. The lambda function captures the
    # layer, so you can use them, or include them in the Keras Functional model later.
    return lambda feature: encoder(index(feature))
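For the plain list of strings from the question, the same two layers can be applied directly, without a `tf.data.Dataset` of features. This is a minimal sketch assuming a TensorFlow version that provides these layers with `output_mode="one_hot"` (not checked against every release); the variable names are illustrative. By default `StringLookup` reserves index 0 for out-of-vocabulary tokens, so the encoded width is one more than the number of distinct tokens.

```python
import tensorflow as tf

code = ['<s>', 'are', 'defined', 'in', 'the', '"editable', 'parameters"', '\n',
        'section.', '\n', 'A', 'larger', '`tsteps`', 'value', 'means', 'that',
        'the', 'LSTM', 'will', 'need', 'more', 'memory', '\n', 'to', 'figure', 'out']
tokens = tf.constant(code)

# Learn the vocabulary and map each string to an integer index.
lookup = tf.keras.layers.StringLookup()
lookup.adapt(tokens)

# Turn each integer index into a one-hot row of width vocabulary_size().
encoder = tf.keras.layers.CategoryEncoding(
    num_tokens=lookup.vocabulary_size(), output_mode="one_hot")
one_hot = encoder(lookup(tokens))
```

Because both objects are Keras layers, they can also be placed at the front of a model so that the model accepts raw strings.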