KNN for Text Classification using TF-IDF scores
I have a CSV file (corpus.csv) containing the graded abstracts (text) of a corpus, in the following format:
Institute, Score, Abstract
----------------------------------------------------------------------
UoM, 3.0, Hello, this is abstract one
UoM, 3.2, Hello, this is abstract two and yet counting.
UoE, 3.1, Hello, yet another abstract but this is a unique one.
UoE, 2.2, Hello, please no more abstract.
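I am trying to create a KNN classification program in Python that takes a user-supplied abstract, e.g. "This is a new unique abstract", finds the abstracts in the corpus (CSV) that are closest to it, and returns the predicted score/grade for it. How can I achieve that?
I have the following code: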
from csv import reader
import string

from sklearn.feature_extraction.text import TfidfVectorizer

# Read data from corpus
abstract_list = []
score_list = []
institute_list = []
row_count = 0
# Translation table that deletes every punctuation character
remove_punctuation = str.maketrans('', '', string.punctuation)
with open('corpus.csv', 'r', newline='') as f:
    for row in list(reader(f))[1:]:
        # Abstracts may contain commas, so rejoin everything after the score
        institute, score, abstract = row[0], row[1], ','.join(row[2:])
        if len(abstract.split()) > 0:
            institute_list.append(institute.strip())
            score_list.append(float(score))
            abstract_list.append(abstract.translate(remove_punctuation).lower())
            row_count = row_count + 1
print("Total processed data: ", row_count)

# Vectorize (TF-IDF, ngrams 1-4, no stop words) using sklearn
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 4),
                             min_df=1, stop_words='english', sublinear_tf=True)
response = vectorizer.fit_transform(abstract_list)
feature_names = vectorizer.get_feature_names_out()
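In the code above, how can I use the features computed by TF-IDF for the KNN classification described above? (Probably using the sklearn.neighbors.KNeighborsClassifier framework.)
P.S. The classes for this use case are the scores/grades of the corresponding abstracts.
I have a background in visual deep learning, but I lack a lot of knowledge in text classification, especially using KNN. Any help would be much appreciated. Thank you in advance.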
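KNN is a classification algorithm, which means you must have a class attribute. KNN can use the output of TF-IDF as its input matrix (TrainX), but you still need TrainY, the class of each row in your data. Since your scores are continuous values rather than discrete labels, you could use a KNN regressor instead.
Using your scores as the class variable: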
from csv import reader
import string

from sklearn import neighbors
from sklearn.feature_extraction.text import TfidfVectorizer

# Read data from corpus
abstract_list = []
score_list = []
institute_list = []
row_count = 0
# Translation table that deletes every punctuation character
remove_punctuation = str.maketrans('', '', string.punctuation)
with open('corpus.csv', 'r', newline='') as f:
    for row in list(reader(f))[1:]:
        # Abstracts may contain commas, so rejoin everything after the score
        institute, score, abstract = row[0], row[1], ','.join(row[2:])
        if len(abstract.split()) > 0:
            institute_list.append(institute.strip())
            score_list.append(float(score))
            abstract_list.append(abstract.translate(remove_punctuation).lower())
            row_count = row_count + 1
print("Total processed data: ", row_count)

# Vectorize (TF-IDF, ngrams 1-4, no stop words) using sklearn
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 4),
                             min_df=1, stop_words='english', sublinear_tf=True)
response = vectorizer.fit_transform(abstract_list)
classes = score_list
feature_names = vectorizer.get_feature_names_out()

# Fit a 1-nearest-neighbour regressor on the TF-IDF matrix and the scores
clf = neighbors.KNeighborsRegressor(n_neighbors=1)
clf.fit(response, classes)
clf.predict(response)
"predict" 将预测每个实例的分数。