Multiclass SVM fails on the 20 Newsgroups dataset

I am trying to use the multiclass SVM code from Mblondel's Multiclass SVM. I read his paper, in which he uses the 20newsgroup dataset from sklearn, but when I try the code on that dataset it does not work.

I tried to change the code to work with the 20newsgroup dataset, but I am stuck on this error:

Traceback (most recent call last):
  File "F:\env\chatbotstripped\CSSVM.py", line 157, in <module>
    clf.fit(X, y)
  File "F:\env\chatbotstripped\CSSVM.py", line 106, in fit
    v = self._violation(g, y, i)
  File "F:\env\chatbotstripped\CSSVM.py", line 50, in _violation
    elif k != y[i] and self.dual_coef_[k, i] >= 0:
IndexError: index 20 is out of bounds for axis 0 with size 20

Here is the main code:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
news_train = fetch_20newsgroups(subset='train')
X, y = news_train.data[:100], news_train.target[:100]

clf = MulticlassSVM(C=0.1, tol=0.01, max_iter=100, random_state=0, verbose=1)
X = TfidfVectorizer().fit_transform(X)
clf.fit(X, y)
print(clf.score(X, y))

Here is the fit code:

def fit(self, X, y):
    n_samples, n_features = X.shape

    self._label_encoder = LabelEncoder()
    y = self._label_encoder.fit_transform(y)

    n_classes = len(self._label_encoder.classes_)
    self.dual_coef_ = np.zeros((n_classes, n_samples), dtype=np.float64)
    self.coef_ = np.zeros((n_classes, n_features))

    norms = np.sqrt(np.sum(X.power(2), axis=1))  # I changed this line

    rs = check_random_state(self.random_state)
    ind = np.arange(n_samples)
    rs.shuffle(ind)

    # I added this sparse handling
    sparse = sp.isspmatrix(X)
    if sparse:
        X = np.asarray(X.data, dtype=np.float64, order='C')

    for it in range(self.max_iter):
        violation_sum = 0
        for ii in range(n_samples):
            i = ind[ii]
        
            if norms[i] == 0:
                continue
        
            g = self._partial_gradient(X, y, i)
            v = self._violation(g, y, i)
            violation_sum += v
         
            if v < 1e-12:
                continue

            delta = self._solve_subproblem(g, y, norms, i)
            self.coef_ += (delta * X[i][:, np.newaxis]).T
            self.dual_coef_[:, i] += delta

        if it == 0:
            violation_init = violation_sum

        vratio = violation_sum / violation_init

        if self.verbose >= 1:
            print("iter", it + 1, "violation", vratio)

        if vratio < self.tol:
            if self.verbose >= 1:
                print("Converged")
            break
    return self

And the _violation code:

def _violation(self, g, y, i):
    smallest = np.inf
    for k in range(g.shape[0]):
        if k == y[i] and self.dual_coef_[k, i] >= self.C:
            continue
        elif k != y[i] and self.dual_coef_[k, i] >= 0:
            continue

        smallest = min(smallest, g[k].all())  # and I added .all()
    return g.max() - smallest

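I know the problem is with the indexing, but I am not sure how to fix it, and I do not want to break the code because I do not really understand how it works.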

You have to convert the sparse matrix output by the TfidfVectorizer to a dense matrix and then turn it into a 2-D array. Try this!

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# MulticlassSVM is the class from the question's script (CSSVM.py).
news_train = fetch_20newsgroups(subset='train')
text, y = news_train.data[:1000], news_train.target[:1000]

clf = MulticlassSVM(C=0.1, tol=0.01, max_iter=100, random_state=0, verbose=1)
# min_df=20 drops rare terms, which keeps the dense matrix small.
vectorizer = TfidfVectorizer(min_df=20, stop_words='english')
# Densify the sparse TF-IDF matrix into a 2-D NumPy array.
X = np.asarray(vectorizer.fit_transform(text).todense())
clf.fit(X, y)
print(clf.score(X, y))

Output:

iter 1 violation 1.0
iter 2 violation 0.07075102408683964
iter 3 violation 0.018288133735158228
iter 4 violation 0.009149083942255389
Converged
0.953
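
As for why the attempt in the question raises the IndexError, here is a likely explanation. It assumes that _partial_gradient (not shown in the question) computes something like np.dot(X[i], self.coef_.T) + 1. The line X = np.asarray(X.data, ...) replaces the sparse matrix with a flat 1-D array of its non-zero values, so X[i] becomes a single number. The gradient g then takes the shape of coef_.T, that is (n_features, n_classes), and the loop over range(g.shape[0]) in _violation runs past the 20 rows of dual_coef_. The sketch below reproduces only the shapes, with small made-up sizes and an identity-like matrix standing in for the TF-IDF output:

```python
import numpy as np
import scipy.sparse as sp

# Made-up sizes for illustration; only n_classes = 20 matches the question.
n_samples, n_features, n_classes = 5, 30, 20

# Stand-in for the TfidfVectorizer output (hypothetical data).
X_sparse = sp.csr_matrix(np.eye(n_samples, n_features))
coef = np.zeros((n_classes, n_features))

# What the modified fit() does: keep only the flat array of non-zero values.
X_flat = np.asarray(X_sparse.data, dtype=np.float64)
g_flat = np.dot(X_flat[0], coef.T) + 1  # scalar times matrix
print(g_flat.shape)  # (30, 20): g.shape[0] is n_features, not n_classes

# With a dense 2-D array, X[i] is a row of length n_features.
X_dense = X_sparse.toarray()
g_dense = np.dot(X_dense[0], coef.T) + 1
print(g_dense.shape)  # (20,): one gradient entry per class
```

If this is indeed the cause, the .all() added in _violation only hides the shape mismatch and can be reverted once X is dense. Note also that X.power(2) exists only on SciPy sparse matrices, so with a dense NumPy array the norms line needs an element-wise square such as X ** 2 instead.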