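How to train NLP classification using keras library?
This is my training data. I want to predict 'y' from X_data using the keras library. I keep running into errors; I know they are related to the shape of the data, but I have been stuck for a while. I hope you can help.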
X_data =
0 [construction, materials, labour, charges, con...
1 [catering, catering, lunch]
2 [passenger, transport, local, transport, passe...
3 [goods, transport, road, transport, goods, inl...
4 [rental, rental, aircrafts]
5 [supporting, transport, cargo, handling, agenc...
6 [postal, courier, postal, courier, local, deli...
7 [electricity, charges, reimbursement, electric...
8 [facility, management, facility, management, p...
9 [leasing, leasing, aircrafts]
10 [professional, technical, business, selling, s...
11 [telecommunications, broadcasting, information...
12 [support, personnel, search, contract, tempora...
13 [maintenance, repair, installation, maintenanc...
14 [manufacturing, physical, inputs, owned, other...
15 [accommodation, hotel, accommodation, hotel, i...
16 [leasing, rental, leasing, renting, motor, veh...
17 [real, estate, rental, leasing, involving, pro...
18 [rental, transport, vehicles, rental, road, ve...
19 [cleaning, sanitary, pad, vending, machine]
20 [royalty, transfer, use, ip, intellectual, pro...
21 [legal, accounting, legal, accounting, legal, ...
22 [veterinary, clinic, health, care, relation, a...
23 [human, health, social, care, inpatient, medic...
Name: Data, dtype: object
This is my training target:
y =
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 10
16 2
17 10
18 2
19 2
20 10
21 10
22 10
23 10
I am using this model:
top_words = 5000
length= len(X_data)
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(embedding_vecor_length, top_words, input_length=length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_data, y, epochs=3, batch_size=32)
ValueError: Error when checking input: expected embedding_8_input to have shape (None, 24) but got array with shape (24, 1)
What is wrong with using this data in this model? I want to use X_data as input to predict 'y'.
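You need to convert the pandas dataframe to numpy arrays. The arrays will be ragged, so you need to pad them. You also need to set up a dictionary of word vectors, because you cannot pass words directly into a neural network. Some examples are here, here, and here. You will need to do your own research on this; not much can be done with the data sample you provided.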
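The conversion described above can be sketched in plain Python, with no Keras dependency. The helper names `build_word_index` and `encode_and_pad`, the reserved id 0 for padding, and the choice of three sample rows are assumptions made for illustration. Keras ships its own `pad_sequences` utility for the padding step; it pads at the front by default, whereas this sketch pads at the end for readability.

```python
def build_word_index(samples):
    """Map each distinct word to an integer id, reserving 0 for padding."""
    word_index = {}
    for words in samples:
        for word in words:
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def encode_and_pad(samples, word_index, pad_id=0):
    """Replace words by ids and right-pad every row to the longest row."""
    encoded = [[word_index[w] for w in words] for words in samples]
    max_len = max(len(row) for row in encoded)
    return [row + [pad_id] * (max_len - len(row)) for row in encoded]


# Three short rows copied from the question's X_data.
samples = [
    ["catering", "catering", "lunch"],
    ["rental", "rental", "aircrafts"],
    ["cleaning", "sanitary", "pad", "vending", "machine"],
]

word_index = build_word_index(samples)
padded = encode_and_pad(samples, word_index)

for row in padded:
    print(row)
```

Wrapping `padded` in `numpy.array` would give a rectangular 2-D array whose second dimension is the number of words per sample.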
length = len(X_data)
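is the number of data samples you have. keras does not care about that; it wants to know how many words each sample has as input (this must be the same for every sample, which is why padding was mentioned above).
So the input length for your network is the number of columns you have: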
# assuming you converted X_data correctly to numpy arrays and word vectors
# Embedding takes the vocabulary size first, then the embedding vector length
model.add(Embedding(top_words, embedding_vecor_length, input_length=X_data.shape[1]))
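To see why the vocabulary size and the vector length play different roles, it can help to picture an embedding layer as a lookup table with one row per word id. The following is only a plain-Python illustration of that idea, not how Keras implements it. `make_table` and `embed` are hypothetical helpers, and the table values are arbitrary placeholders rather than trained weights.

```python
def make_table(vocab_size, vector_length):
    """One row per word id, one column per embedding dimension."""
    return [[float(word_id)] * vector_length for word_id in range(vocab_size)]


def embed(batch, table):
    """Replace every word id in a (samples, words) batch by its table row."""
    return [[table[word_id] for word_id in row] for row in batch]


vocab_size = 10      # plays the role of top_words
vector_length = 4    # plays the role of embedding_vecor_length

table = make_table(vocab_size, vector_length)

# Two padded samples of five word ids each: shape (2, 5).
batch = [[1, 1, 2, 0, 0], [5, 6, 7, 8, 9]]
output = embed(batch, table)

# Resulting shape: (samples, words per sample, vector_length).
shape = (len(output), len(output[0]), len(output[0][0]))
print(shape)
```

Every word id must be smaller than the vocabulary size, otherwise the lookup has no row to return.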
Your class labels need to be one-hot (binary) encoded.
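from keras.utils import to_categorical
# to_categorical expects class indices starting at 0; this assumes the labels run from 1 to 10
y = to_categorical(y - 1, num_classes=10)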
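As an illustration of what one-hot encoding produces, here is a small plain-Python sketch. `one_hot` is a hypothetical stand-in written for this example, not the Keras function. It follows the same convention that class indices start at 0, and it assumes the labels run from 1 to 10, so each label is shifted down by one before encoding.

```python
def one_hot(class_index, num_classes):
    """Return a list with a single 1.0 at position class_index."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class index out of range")
    row = [0.0] * num_classes
    row[class_index] = 1.0
    return row


labels = [1, 2, 10]   # the three distinct values seen in the question's y
num_classes = 10      # assumption: labels run from 1 to 10

# Shift labels from 1..10 to 0..9 so they can be used as column positions.
encoded = [one_hot(label - 1, num_classes) for label in labels]

for label, row in zip(labels, encoded):
    print(label, row)
```

Without the shift, a label of 10 would need an eleventh column, and the targets would no longer match a 10-unit output layer.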
Your last dense layer now has 10 units, assuming you have 10 classes, and the correct activation for a multi-class problem is softmax:
model.add(Dense(10, activation='softmax'))
Your loss must now be categorical_crossentropy, because this is a multi-class problem:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
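The pairing of softmax with categorical cross-entropy can be checked by hand on a tiny example. This sketch uses only the standard library. The function names and the raw scores are made up for illustration, and three classes are used instead of ten to keep the numbers readable.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log of the probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))


scores = [2.0, 1.0, 0.1]     # hypothetical raw outputs of the last layer
target = [1.0, 0.0, 0.0]     # one-hot target: the first class is correct

probabilities = softmax(scores)
loss = categorical_crossentropy(target, probabilities)

print(probabilities, loss)
```

A single sigmoid output with binary cross-entropy, as in the question, only distinguishes two classes, whereas this combination yields one probability per class.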