How to train an ML model on two columns to solve a classification problem?
I have three columns in my dataset, on which I am doing sentiment analysis (classes 0, 1, 2):

text thing sentiment

The problem is that I can only train my model on either text or thing and predict sentiment. Is there a way to train the model on both text and thing together and then predict sentiment?
Problem case (say):

    |  text  thing   sentiment
 0  |  t1    thing1  0
 .  |
 .  |
 54 |  t1    thing2  2
This example tells us that sentiment should also depend on thing. If I simply concatenate the two columns one after the other and train on that, it is not correct, because we would not be giving the model any relationship between the two columns.

My test set contains the two columns text and thing, and I have to predict the sentiment column with a model trained on both of them.
Right now I am using a tokenizer followed by the model below:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
Any pointers on how to proceed, or on which model or encoding to use?
You may want to move to the Keras functional API and train a multi-input model.
According to François Chollet, the creator of Keras, in his book Deep Learning with Python [Manning, 2017] (chapter 7, section 1):
Some tasks require multimodal inputs: they merge data coming from different input sources, processing each type of data using different kinds of neural layers. Imagine a deep-learning model trying to predict the most likely market price of a second-hand piece of clothing, using the following inputs: user-provided metadata (such as the item’s brand, age, and so on), a user-provided text description, and a picture of the item. If you had only the metadata available, you could one-hot encode it and use a densely connected network to predict the price. If you had only the text description available, you could use an RNN or a 1D convnet. If you had only the picture, you could use a 2D convnet. But how can you use all three at the same time? A naive approach would be to train three separate models and then do a weighted average of their predictions. But this may be suboptimal, because the information extracted by the models may be redundant. A better way is to jointly learn a more accurate model of the data by using a model that can see all available input modalities simultaneously: a model with three input branches.
I think the Concatenate layer is the way to handle this situation; the general idea is below. Please adapt it to your use case.
from tensorflow.keras.layers import Input, Concatenate, Dense
from tensorflow.keras.models import Model

### whatever preprocessing you may want to do
text_input = Input(shape=(1,))
thing_input = Input(shape=(1,))

### now bring them together
merged_inputs = Concatenate(axis=1)([text_input, thing_input])

### sample output layer (3 classes, matching your softmax setup)
output = Dense(3, activation='softmax')(merged_inputs)

### pass your inputs and outputs to the model
model = Model(inputs=[text_input, thing_input], outputs=output)
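In practice, each text column usually gets its own Embedding/LSTM branch before the Concatenate, mirroring your Sequential model. A minimal sketch of that idea follows; note the vocabulary size, embedding dimension, and padded sequence length are placeholder values, not taken from your data:

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Concatenate, Dense
from tensorflow.keras.models import Model

MAX_NB_WORDS = 5000   # placeholder vocabulary size
EMBEDDING_DIM = 100   # placeholder embedding dimension
MAX_LEN = 50          # placeholder padded sequence length

# one branch per column: embed and encode each token sequence separately
text_input = Input(shape=(MAX_LEN,), name='text')
text_branch = Embedding(MAX_NB_WORDS, EMBEDDING_DIM)(text_input)
text_branch = LSTM(100)(text_branch)

thing_input = Input(shape=(MAX_LEN,), name='thing')
thing_branch = Embedding(MAX_NB_WORDS, EMBEDDING_DIM)(thing_input)
thing_branch = LSTM(100)(thing_branch)

# merge the two learned representations, then classify into 3 classes
merged = Concatenate()([text_branch, thing_branch])
output = Dense(3, activation='softmax')(merged)

model = Model(inputs=[text_input, thing_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```

Training then takes one array per input branch, e.g. `model.fit([X_text, X_thing], y, ...)`, which lets the network learn the relationship between the two columns instead of treating them as one concatenated string.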
You have to take the multiple columns as a list, then merge them for training after embedding and preprocessing the raw data.

Example:
import pandas as pd

train = pd.read_csv('COVID19 multifeature Emotion - 50 data.csv', nrows=49)
# This dataset has two text columns and different class labels
X_train_doctor_opinion = train["doctor-opinion"].str.lower()
X_train_patient_opinion = train["patient-opinion"].str.lower()
X_train = list(X_train_doctor_opinion) + list(X_train_patient_opinion)
Then preprocess and embed.
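A sketch of that preprocessing step, assuming the standard Keras Tokenizer/pad_sequences workflow: the two in-line sentence lists stand in for the doctor-opinion and patient-opinion columns, and MAX_NB_WORDS/MAX_LEN are placeholder values.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_NB_WORDS = 5000  # placeholder vocabulary size
MAX_LEN = 50         # placeholder padded sequence length

# hypothetical raw columns standing in for doctor-opinion / patient-opinion
doctor = ["the patient is recovering well", "symptoms are getting worse"]
patient = ["i feel much better today", "i am still coughing a lot"]

# fit ONE tokenizer on the combined corpus so both columns share a vocabulary
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(doctor + patient)

# turn each column into its own padded integer matrix
X_doctor = pad_sequences(tokenizer.texts_to_sequences(doctor), maxlen=MAX_LEN)
X_patient = pad_sequences(tokenizer.texts_to_sequences(patient), maxlen=MAX_LEN)
```

Each padded matrix is then passed as a separate input to a multi-input model, e.g. `model.fit([X_doctor, X_patient], y)`.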