Using Word2Vec in scikit-learn pipeline
I am trying to run w2v on this data sample:
Statement Label
Says the Annies List political group supports third-trimester abortions on demand. FALSE
When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration. TRUE
"Hillary Clinton agrees with John McCain ""by voting to give George Bush the benefit of the doubt on Iran.""" TRUE
Health care reform legislation is likely to mandate free sex change surgeries. FALSE
The economic turnaround started at the end of my term. TRUE
The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades. TRUE
Jim Dunnam has not lived in the district he represents for years now. FALSE
Using the code provided in this GitHub folder (FeatureSelection.py):
https://github.com/nishitpatel01/Fake_News_Detection
I would like to include the word2vec features in my Naive Bayes model.
First I defined X and y and used train_test_split:
X = df['Statement']
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
dataset = pd.concat([X_train, y_train], axis=1)
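(Here df is the DataFrame holding the sample shown above; a minimal loading sketch, assuming the sample was saved as a hypothetical tab-separated file sample.tsv — the repo's DataPrep module does its own loading:)

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical: load the tab-separated sample (with its header row) into the
# DataFrame df used in the split above; only needed for a standalone reproduction.
df = pd.read_csv("sample.tsv", sep="\t")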
This is the code I am currently using:
#Using Word2Vec
with open("glove.6B.50d.txt", "rb") as lines:
    w2v = {line.split()[0]: np.array(map(float, line.split()[1:]))
           for line in lines}

training_sentences = DataPrep.train_news['Statement']
model = gensim.models.Word2Vec(training_sentences, size=100)  # x be tokenized text
w2v = dict(zip(model.wv.index2word, model.wv.syn0))
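(Note: gensim's Word2Vec expects an iterable of token lists; passing the raw Statement strings makes it iterate over characters. A minimal tokenization sketch, using gensim's simple_preprocess and the same gensim 3.x API as above, that could replace the two model-training lines:)

from gensim.utils import simple_preprocess

# Sketch: tokenize each statement into a list of words before training Word2Vec.
tokenized_sentences = [simple_preprocess(s) for s in DataPrep.train_news['Statement']]
model = gensim.models.Word2Vec(tokenized_sentences, size=100, min_count=1)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))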
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(word2vec.itervalues().next())

    def fit(self, X, y):  # what are X and y?
        return self

    def transform(self, X):  # should it be training_sentences?
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
"""
class TfidfEmbeddingVectorizer(object):
def __init__(self, word2vec):
self.word2vec = word2vec
self.word2weight = None
self.dim = len(word2vec.itervalues().next())
def fit(self, X, y):
tfidf = TfidfVectorizer(analyzer=lambda x: x)
tfidf.fit(X)
# if a word was never seen - it must be at least as infrequent
# as any of the known words - so the default idf is the max of
# known idf's
max_idf = max(tfidf.idf_)
self.word2weight = defaultdict(
lambda: max_idf,
[(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
return self
def transform(self, X):
return np.array([
np.mean([self.word2vec[w] * self.word2weight[w]
for w in words if w in self.word2vec] or
[np.zeros(self.dim)], axis=0)
for words in X
])
"""
In classifier.py, I am running:
nb_pipeline = Pipeline([
    ('NBCV', FeaturesSelection.w2v),
    ('nb_clf', MultinomialNB())])
But this does not work, and I get this error:
TypeError Traceback (most recent call last)
<ipython-input-14-07045943a69c> in <module>
2 nb_pipeline = Pipeline([
3 ('NBCV',FeaturesSelection.w2v),
----> 4 ('nb_clf',MultinomialNB())])
/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
71 FutureWarning)
72 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73 return f(**kwargs)
74 return inner_f
75
/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in __init__(self, steps, memory, verbose)
112 self.memory = memory
113 self.verbose = verbose
--> 114 self._validate_steps()
115
116 def get_params(self, deep=True):
/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _validate_steps(self)
160 "transformers and implement fit and transform "
161 "or be the string 'passthrough' "
--> 162 "'%s' (type %s) doesn't" % (t, type(t)))
163
164 # We allow last estimator to be None as an identity transformation
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '{' ': array([-0.17019527, 0.32363772, -0.0770281 , -0.0278154 , -0.05182227, ....
I am using all the programs from that folder, so the code is reproducible if you use them.
It would be great if you could explain how to fix this and what other changes the code needs. My goal is to compare models (Naive Bayes, Random Forest, etc.) using BoW, TF-IDF, and Word2Vec features.
UPDATE:
Following the answer below (from Ismail), I updated the code as follows:
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size
and
#building Linear SVM classifier
svm_pipeline = Pipeline([
    ('svmCV', FeaturesSelection_W2V.MeanEmbeddingVectorizer(FeaturesSelection_W2V.w2v)),
    ('svm_clf', svm.LinearSVC())
])
svm_pipeline.fit(DataPrep.train_news['Statement'], DataPrep.train_news['Label'])
predicted_svm = svm_pipeline.predict(DataPrep.test_news['Statement'])
np.mean(predicted_svm == DataPrep.test_news['Label'])
However, I still get an error.
Answer (from Ismail):
Step 1. MultinomialNB
FeaturesSelection.w2v is a dict, so it has no fit or fit_transform method. In addition, MultinomialNB requires non-negative values, so it will not work as-is. I therefore decided to add a preprocessing stage to normalize the negative values.
from sklearn.preprocessing import MinMaxScaler

nb_pipeline = Pipeline([
    ('NBCV', MeanEmbeddingVectorizer(FeatureSelection.w2v)),
    ('nb_norm', MinMaxScaler()),
    ('nb_clf', MultinomialNB())
])
...instead of
nb_pipeline = Pipeline([
    ('NBCV', FeatureSelection.w2v),
    ('nb_clf', MultinomialNB())
])
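(With this change and the size fix in Step 2 below, the Naive Bayes pipeline can be fitted and evaluated the same way as the SVM pipeline in the update above; a sketch reusing the repo's DataPrep splits:)

# Sketch: fit and evaluate the Naive Bayes pipeline, mirroring the SVM example.
# As with the SVM pipeline, the Statement column should ideally be tokenized
# before reaching the vectorizer.
nb_pipeline.fit(DataPrep.train_news['Statement'], DataPrep.train_news['Label'])
predicted_nb = nb_pipeline.predict(DataPrep.test_news['Statement'])
print(np.mean(predicted_nb == DataPrep.test_news['Label']))  # simple accuracy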
Step 2. I got an error on word2vec.itervalues().next(), which is a Python 2 idiom. So I decided to set the dimension from a predefined size argument that matches the Word2Vec vector size.
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size
...instead of
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(word2vec.itervalues().next())
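(For reference, instead of hard-coding the size, the Python 2 idiom could also be replaced by its Python 3 equivalent; a minimal sketch:)

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # Python 3 equivalent of word2vec.itervalues().next():
        # infer the dimensionality from any one vector in the dict.
        self.dim = len(next(iter(word2vec.values())))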