How to train an ML model on two columns to solve a classification problem?
I have three columns in my dataset, on which I am doing sentiment analysis (classes 0, 1, 2):

text thing sentiment

The problem is that I can only train my model on either text or thing and predict sentiment. Is there a way to train the model on both text and thing together and then predict sentiment?
Problem case (say):

    |  text  thing   sentiment
 0  |  t1    thing1  0
 .  |
 .  |
 54 |  t1    thing2  2
This example tells us that sentiment should also depend on thing. If I simply concatenate the two columns one after the other and train on that, it is not correct, because we would not be giving the model any relationship between the two columns.

My test set contains the two columns text and thing, and I have to predict the sentiment column with a model trained on both of them.
Right now I am using a tokenizer followed by the model below:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Any pointers on how to proceed, or on which model or encoding to use?
You may want to move to the Keras functional API and train a multi-input model.
According to François Chollet, the creator of Keras, in his book Deep Learning with Python [Manning, 2017] (chapter 7, section 1):
Some tasks require multimodal inputs: they merge data coming from different input sources, processing each type of data using different kinds of neural layers. Imagine a deep-learning model trying to predict the most likely market price of a second-hand piece of clothing, using the following inputs: user-provided metadata (such as the item’s brand, age, and so on), a user-provided text description, and a picture of the item. If you had only the metadata available, you could one-hot encode it and use a densely connected network to predict the price. If you had only the text description available, you could use an RNN or a 1D convnet. If you had only the picture, you could use a 2D convnet. But how can you use all three at the same time? A naive approach would be to train three separate models and then do a weighted average of their predictions. But this may be suboptimal, because the information extracted by the models may be redundant. A better way is to jointly learn a more accurate model of the data by using a model that can see all available input modalities simultaneously: a model with three input branches.
I think the Concatenate layer is the way to handle this situation; the general idea is below. Please adapt it to your use case.
from tensorflow.keras.layers import Input, Concatenate, Dense
from tensorflow.keras.models import Model

### whatever preprocessing you may want to do
text_input = Input(shape=(1,))
thing_input = Input(shape=(1,))

### now bring them together
merged_inputs = Concatenate(axis=1)([text_input, thing_input])

### sample output layer (3 classes, matching your softmax setup)
output = Dense(3, activation='softmax')(merged_inputs)

### pass your inputs and outputs to the model
model = Model(inputs=[text_input, thing_input], outputs=output)
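In practice, each text column usually gets its own Embedding/LSTM branch before the Concatenate, mirroring your Sequential model. A minimal sketch of that idea follows; note the vocabulary size, embedding dimension, and padded sequence length are placeholder values, not taken from your data:

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Concatenate, Dense
from tensorflow.keras.models import Model

MAX_NB_WORDS = 5000   # placeholder vocabulary size
EMBEDDING_DIM = 100   # placeholder embedding dimension
MAX_LEN = 50          # placeholder padded sequence length

# one branch per column: embed and encode each token sequence separately
text_input = Input(shape=(MAX_LEN,), name='text')
text_branch = Embedding(MAX_NB_WORDS, EMBEDDING_DIM)(text_input)
text_branch = LSTM(100)(text_branch)

thing_input = Input(shape=(MAX_LEN,), name='thing')
thing_branch = Embedding(MAX_NB_WORDS, EMBEDDING_DIM)(thing_input)
thing_branch = LSTM(100)(thing_branch)

# merge the two learned representations, then classify into 3 classes
merged = Concatenate()([text_branch, thing_branch])
output = Dense(3, activation='softmax')(merged)

model = Model(inputs=[text_input, thing_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```

Training then takes one array per input branch, e.g. `model.fit([X_text, X_thing], y, ...)`, which lets the network learn the relationship between the two columns instead of treating them as one concatenated string.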
You have to take the multiple columns as a list, then merge them for training after embedding and preprocessing the raw data.

Example:
import pandas as pd

train = pd.read_csv('COVID19 multifeature Emotion - 50 data.csv', nrows=49)
# This dataset has two text columns and different class labels
X_train_doctor_opinion = train["doctor-opinion"].str.lower()
X_train_patient_opinion = train["patient-opinion"].str.lower()
X_train = list(X_train_doctor_opinion) + list(X_train_patient_opinion)
Then preprocess and embed.
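A sketch of that preprocessing step, assuming the standard Keras Tokenizer/pad_sequences workflow: the two in-line sentence lists stand in for the doctor-opinion and patient-opinion columns, and MAX_NB_WORDS/MAX_LEN are placeholder values.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_NB_WORDS = 5000  # placeholder vocabulary size
MAX_LEN = 50         # placeholder padded sequence length

# hypothetical raw columns standing in for doctor-opinion / patient-opinion
doctor = ["the patient is recovering well", "symptoms are getting worse"]
patient = ["i feel much better today", "i am still coughing a lot"]

# fit ONE tokenizer on the combined corpus so both columns share a vocabulary
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(doctor + patient)

# turn each column into its own padded integer matrix
X_doctor = pad_sequences(tokenizer.texts_to_sequences(doctor), maxlen=MAX_LEN)
X_patient = pad_sequences(tokenizer.texts_to_sequences(patient), maxlen=MAX_LEN)
```

Each padded matrix is then passed as a separate input to a multi-input model, e.g. `model.fit([X_doctor, X_patient], y)`.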