Error with PCA and randomized lasso

I have two .csv files containing tweets and a classification for each tweet (pos, neg, or neutral): class is the classification and text is one tweet.

Here is my code:

def prediction():
    print("Reading files...")

    #Will learn from this data set.
    train = file2SentencesArray('twitter-sanders-apple3')

    #Test dataset.
    test = file2SentencesArray('twitter-sanders-apple2')
    print("Complete!")

    print("Cleaning sentences...")
    #cleanSentences will remove HTML, stop words and some characters.
    cleanTrainSentences = cleanSentences(train["text"])
    cleanTestSentences = cleanSentences(test["text"])
    print("Complete!...")

    print("Fitting sentences...")
    vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)
    trainDataFeatures = vectorizer.fit_transform(cleanTrainSentences)
    np.asarray(trainDataFeatures)

    testDataFeatures = vectorizer.transform(cleanTestSentences)
    np.asarray(testDataFeatures)

    #Getting error here.
    randomized_lasso = RandomizedLasso()
    randomized_lasso.fit_transform(trainDataFeatures, testDataFeatures)
    trainDataFeatures = randomized_lasso.transform(trainDataFeatures)

    #and here.
    #pca = decomposition.PCA(n_components=2)
    #pca.fit_transform(trainDataFeatures)
    #trainDataFeatures = pca.transform(trainDataFeatures)
    print("Complete!")

    print("Predicting...")
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(trainDataFeatures, train["class"])
    result = forest.predict(testDataFeatures)
    print("Complete...")

    return result

Both randomized lasso and PCA throw exceptions:

PCA – PCA does not support sparse input.

Randomized lasso – bad input shape

My trainDataFeatures looks like this:

(0, 573)   1
(0, 1411)  2
(0, 2748)  1
(0, 1073)  1
(1, 126)   1
(2, 1203)  1

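The input is in the wrong format for PCA and randomized lasso. CountVectorizer returns a SciPy sparse matrix, and np.asarray does not convert it to a dense array; its return value is also discarded in your code, so trainDataFeatures and testDataFeatures stay sparse. For randomized lasso, note as well that the second argument to fit_transform is expected to be the training labels (here train["class"]), not testDataFeatures; passing a 2-D feature matrix there is a likely source of the bad input shape error. Replace the following two lines and try again.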

np.asarray(trainDataFeatures)
np.asarray(testDataFeatures)
# replace the above two lines with these
trainDataFeatures = trainDataFeatures.toarray()
testDataFeatures = testDataFeatures.toarray()
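
To put the fix in context, here is a minimal sketch of the corrected flow using PCA. The sentences and labels are made-up stand-ins for the cleaned tweets and train["class"] in the question, and random_state is added only so the run is repeatable. The sketch also fits PCA on the training features and applies the same transform to the test features, which the code in the question does not do; without that step the forest would be trained and asked to predict on different numbers of columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the cleaned tweets and labels in the question.
train_sentences = [
    "love the new apple phone",
    "apple battery is terrible",
    "apple announces a new phone",
    "terrible screen love nothing",
]
train_labels = ["pos", "neg", "neutral", "neg"]
test_sentences = ["love the battery", "new screen is terrible"]

vectorizer = CountVectorizer(analyzer="word", max_features=5000)
train_sparse = vectorizer.fit_transform(train_sentences)  # SciPy sparse matrix
test_sparse = vectorizer.transform(test_sentences)

# np.asarray only wraps the sparse matrix in an object array; nothing is densified.
wrapped = np.asarray(train_sparse)
print(wrapped.dtype)

# toarray() returns a regular dense 2-D NumPy array that PCA accepts.
train_dense = train_sparse.toarray()
test_dense = test_sparse.toarray()

# Fit PCA on the training data only, then reuse the fitted projection for the test data.
pca = PCA(n_components=2)
train_reduced = pca.fit_transform(train_dense)
test_reduced = pca.transform(test_dense)

# The labels (not the test features) are the second argument to fit.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train_reduced, train_labels)
result = forest.predict(test_reduced)
print(result)
```

With a full 5000-feature vocabulary the dense arrays can get large. TruncatedSVD from sklearn.decomposition accepts sparse input directly and may be worth trying in place of PCA if memory becomes a problem.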