How can I make one function for 8 separate classifiers?
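I want to compute the F1 score for the label "negative" for 8 different models. My code for the first 3 models, and the dataframe with the results, is below. How can I create a function so that I don't have to write separate code for each model?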
# Train model with vectorizer and classifier
# Model training
from sklearn.model_selection import train_test_split
Independent_var = reviews_english['tokenized']
Dependent_var = reviews_english['sentiment']
IV_train, IV_test, DV_train, DV_test = train_test_split(Independent_var, Dependent_var, test_size=0.2, random_state=500)
print('IV_train :', len(IV_train))
print('IV_test :', len(IV_test))
print('DV_train :', len(DV_train))
print('DV_test :', len(DV_test))
#Calculate f1 score for all 8 models
#RandomForestClassifier
model = Pipeline([('vectorizer', tvec),('classifier', RandomForestClassifier)])
# Model learning
model.fit(IV_train, DV_train)
# Model prediction on training and test data
pred_train = model.predict(IV_train)
pred_test = model.predict(IV_test)
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
f1_rand = f1_score(DV_test, pred_test, pos_label='negative', average='binary')
#Multinominal NB
model = Pipeline([('vectorizer', tvec),('classifier', MultinominalNB)])
# Model learning
model.fit(IV_train, DV_train)
# Model prediction on training and test data
pred_train = model.predict(IV_train)
pred_test = model.predict(IV_test)
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
f1_multi = f1_score(DV_test, pred_test, pos_label='negative', average='binary')
#BernoulliNB
model = Pipeline([('vectorizer', tvec),('classifier', BernoulliNB)])
# Model learning
model.fit(IV_train, DV_train)
# Model prediction on training and test data
pred_train = model.predict(IV_train)
pred_test = model.predict(IV_test)
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
f1_bern = f1_score(DV_test, pred_test, pos_label='negative', average='binary')
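IFF your code is identical for all models, you can iterate over the models. You have already created a list of classifiers, clf_list, so just pass each classifier to a function that performs all the common steps. Note that if some steps are unique to a particular model, you will need to write separate functions for them (usually) or add if...else blocks where needed.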
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
RandomForestClassifier = RandomForestClassifier()
MultinominalNB = MultinomialNB()
... # the lines above
KNeighborsClassifier = KNeighborsClassifier(n_neighbors=5)
clf_list = [RandomForestClassifier, MultinominalNB, BernoulliNB,
            XGBClassifier, GradientBoostingClassifier, LogisticRegression,
            LinearSVC, KNeighborsClassifier]
# only the names, for your dataframe, order must match:
clf_names = ['RandomForestClassifier',
             'MultinominalNB',
             ... # add the rest
             'KNeighborsClassifier']
def do_something_with_classifier(clf):
    tvec = TfidfVectorizer()
    model = Pipeline([('vectorizer', tvec), ('classifier', clf)])
    # Model learning
    model.fit(IV_train, DV_train) # where are these variables from?
    # Model prediction on training and test data
    pred_train = model.predict(IV_train)
    pred_test = model.predict(IV_test)
    return f1_score(DV_test, pred_test, pos_label='negative', average='binary')
data = []
for clf in clf_list:
    data.append(do_something_with_classifier(clf))
# or the above as a list comprehension:
data = [do_something_with_classifier(clf) for clf in clf_list]
# pair each name with its score so every row has the two values the columns expect
model_comparison = pd.DataFrame(list(zip(clf_names, data)), columns=['model', 'f1 score "negative"'])
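If one step really does differ between models, the if...else route mentioned above might look like the following sketch. The helper name build_pipeline and the choice of giving BernoulliNB binary word counts are assumptions made purely for illustration; they are not part of the original code and not a tuning recommendation.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline


def build_pipeline(clf):
    """Hypothetical helper: pick a vectorizer based on the classifier type."""
    if isinstance(clf, BernoulliNB):
        # Illustrative model-specific step: BernoulliNB models word
        # presence/absence, so feed it binary counts instead of TF-IDF weights.
        vectorizer = CountVectorizer(binary=True)
    else:
        # Common path shared by every other classifier.
        vectorizer = TfidfVectorizer()
    return Pipeline([('vectorizer', vectorizer), ('classifier', clf)])


# Build one (unfitted) pipeline per classifier, keyed by its class name.
pipelines = {type(clf).__name__: build_pipeline(clf)
             for clf in (MultinomialNB(), BernoulliNB())}
```

The rest of the loop stays the same: fit each pipeline and score it exactly as in the common function. Once more than one or two models need special handling, separate functions tend to be easier to read than a growing if...else chain.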
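By the way, rather than creating a variable that points to each classifier instance just so you can add it to a list, build the list directly from the instances and skip the separate variables. Better still, since you need a text label for each model in the dataframe, create a dictionary whose keys are your label/text and whose values are the classifier instances, then build each row by passing the instance to the generic function: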
classifiers = {
    'RandomForestClassifier': RandomForestClassifier(),
    'MultinomialNB': MultinomialNB(),
    'BernoulliNB': BernoulliNB(),
    ... # add the rest here
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=5),
}
data = [[name, do_something_with_classifier(clf)] for name, clf in classifiers.items()]
model_comparison = pd.DataFrame(data, columns=['model', 'f1 score "negative"'])
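For reference, here is a minimal end-to-end sketch of the dictionary approach that runs on its own. It rests on several assumptions: the toy sentences stand in for reviews_english, only three of the eight classifiers are included, and the function name f1_for_negative and its explicit train/test parameters are hypothetical. Passing the splits in as arguments is one way to address the "where are these variables from?" comment above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for reviews_english['tokenized'] / ['sentiment'].
positive = ["great product", "love it", "works well", "excellent quality",
            "very happy", "great value", "love the quality", "works great",
            "happy with it", "excellent product"]
negative = ["terrible product", "hate it", "broke quickly", "poor quality",
            "very disappointed", "terrible value", "hate the quality",
            "broke fast", "disappointed with it", "poor product"]
texts = positive + negative
labels = ['positive'] * len(positive) + ['negative'] * len(negative)

# stratify keeps both labels in the small test split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=500, stratify=labels)


def f1_for_negative(clf, X_train, X_test, y_train, y_test):
    """Fit a TF-IDF + classifier pipeline and return the test F1 for 'negative'."""
    model = Pipeline([('vectorizer', TfidfVectorizer()), ('classifier', clf)])
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    # zero_division=0 returns 0 instead of warning if 'negative' is never predicted.
    return f1_score(y_test, pred_test, pos_label='negative',
                    average='binary', zero_division=0)


# Label -> unfitted classifier instance; add the remaining models the same way.
classifiers = {
    'MultinomialNB': MultinomialNB(),
    'BernoulliNB': BernoulliNB(),
    'LogisticRegression': LogisticRegression(),
}

rows = [[name, f1_for_negative(clf, X_train, X_test, y_train, y_test)]
        for name, clf in classifiers.items()]
model_comparison = pd.DataFrame(rows, columns=['model', 'f1 score "negative"'])
print(model_comparison)
```

Because each call builds a fresh TfidfVectorizer inside the function, no fitted state is shared between models.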