Multiclass SVM fails on the 20 Newsgroups dataset
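I'm trying to use the multiclass SVM code from Mblondel's Multiclass SVM. I read his paper, and he used the 20newsgroup dataset from sklearn, but when I try to use it, the code doesn't work.
I tried changing the code to fit the 20newsgroup dataset, but I'm stuck on this error: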
Traceback (most recent call last):
  File "F:\env\chatbotstripped\CSSVM.py", line 157, in <module>
    clf.fit(X, y)
  File "F:\env\chatbotstripped\CSSVM.py", line 106, in fit
    v = self._violation(g, y, i)
  File "F:\env\chatbotstripped\CSSVM.py", line 50, in _violation
    elif k != y[i] and self.dual_coef_[k, i] >= 0:
IndexError: index 20 is out of bounds for axis 0 with size 20
Here is the main code:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

news_train = fetch_20newsgroups(subset='train')
X, y = news_train.data[:100], news_train.target[:100]
clf = MulticlassSVM(C=0.1, tol=0.01, max_iter=100, random_state=0, verbose=1)
X = TfidfVectorizer().fit_transform(X)  # returns a SciPy sparse matrix
clf.fit(X, y)
print(clf.score(X, y))
Here is the fit code:
def fit(self, X, y):
    n_samples, n_features = X.shape
    self._label_encoder = LabelEncoder()
    y = self._label_encoder.fit_transform(y)
    n_classes = len(self._label_encoder.classes_)
    self.dual_coef_ = np.zeros((n_classes, n_samples), dtype=np.float64)
    self.coef_ = np.zeros((n_classes, n_features))
    norms = np.sqrt(np.sum(X.power(2), axis=1))  # I changed this line
    rs = check_random_state(self.random_state)
    ind = np.arange(n_samples)
    rs.shuffle(ind)
    # I added this sparse handling
    sparse = sp.isspmatrix(X)
    if sparse:
        X = np.asarray(X.data, dtype=np.float64, order='C')
    for it in range(self.max_iter):
        violation_sum = 0
        for ii in range(n_samples):
            i = ind[ii]
            if norms[i] == 0:
                continue
            g = self._partial_gradient(X, y, i)
            v = self._violation(g, y, i)
            violation_sum += v
            if v < 1e-12:
                continue
            delta = self._solve_subproblem(g, y, norms, i)
            self.coef_ += (delta * X[i][:, np.newaxis]).T
            self.dual_coef_[:, i] += delta
        if it == 0:
            violation_init = violation_sum
        vratio = violation_sum / violation_init
        if self.verbose >= 1:
            print("iter", it + 1, "violation", vratio)
        if vratio < self.tol:
            if self.verbose >= 1:
                print("Converged")
            break
    return self
And the _violation code:
def _violation(self, g, y, i):
    smallest = np.inf
    for k in range(g.shape[0]):
        if k == y[i] and self.dual_coef_[k, i] >= self.C:
            continue
        elif k != y[i] and self.dual_coef_[k, i] >= 0:
            continue
        smallest = min(smallest, g[k].all())  # and I added .all()
    return g.max() - smallest
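I know something is wrong with the indexing, but I'm not sure how to fix it, and I don't want to break the code because I don't really understand how it works.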
You have to convert the sparse matrix output of the TfidfVectorizer to a dense matrix and then turn that into a regular 2D NumPy array. Try this!
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

news_train = fetch_20newsgroups(subset='train')
text, y = news_train.data[:1000], news_train.target[:1000]
# MulticlassSVM is the class from the question, with its original dense-input fit()
clf = MulticlassSVM(C=0.1, tol=0.01, max_iter=100, random_state=0, verbose=1)
vectorizer = TfidfVectorizer(min_df=20, stop_words='english')
# sparse matrix -> dense np.matrix -> plain 2D ndarray
X = np.asarray(vectorizer.fit_transform(text).todense())
clf.fit(X, y)
print(clf.score(X, y))
Output:
iter 1 violation 1.0
iter 2 violation 0.07075102408683964
iter 3 violation 0.018288133735158228
iter 4 violation 0.009149083942255389
Converged
0.953
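To see why the dense conversion matters, here is a small sketch with a toy matrix standing in for the TF-IDF output. It compares the `.data` attribute used in the modified fit code with `.toarray()`. The gradient line is an assumption: the post does not show `_partial_gradient`, so the sketch assumes it is a dot product of `X[i]` with `coef_.T`. Under that assumption, a scalar `X[i]` produces a 2D gradient whose first axis is n_features rather than n_classes, which would explain an index running past the number of classes.

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-in for the TF-IDF matrix: 3 samples, 4 features, 5 non-zero entries.
X_sparse = sp.csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.5],
    [1.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.3],
]))

# .data holds only the stored non-zero values as a flat 1-D array,
# so the (n_samples, n_features) layout is lost and X[i] becomes a scalar.
flat = np.asarray(X_sparse.data, dtype=np.float64)

# .toarray() keeps the 2-D layout, so X[i] is the i-th sample's feature vector.
dense = X_sparse.toarray()

n_classes, n_features = 2, dense.shape[1]
coef = np.zeros((n_classes, n_features))

# ASSUMPTION: the gradient is computed roughly like this (not shown in the post).
g_bad = np.dot(flat[0], coef.T) + 1    # scalar times matrix -> 2-D result
g_good = np.dot(dense[0], coef.T) + 1  # vector times matrix -> one value per class

print(flat.shape, dense.shape)    # (5,) (3, 4)
print(g_bad.shape, g_good.shape)  # (4, 2) (2,)
```

With `g_bad`, a loop over `range(g.shape[0])` would visit n_features indices instead of n_classes. Note that `X.toarray()` gives the same 2D ndarray as `np.asarray(X.todense())`, and either one needs enough memory for the full dense matrix.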