Unable to evaluate score using decision_function() in Logistic Regression
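I am working through a University of Washington assignment in which I have to use decision_function() in LogisticRegression to predict scores for sample_test_matrix (the last few lines of the code). But I get this error: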
ValueError: X has 145 features per sample; expecting 113092
Here is the code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
products = pd.read_csv('amazon_baby.csv')
def remove_punct(text):
    import string
    text = str(text)
    for i in string.punctuation:
        text = text.replace(i, "")
    return text
products['review_clean'] = products['review'].apply(remove_punct)
products = products[products.rating != 3]
products['sentiment'] = products['rating'].apply(lambda x : +1 if x > 3 else -1 )
train_data_index = pd.read_json('module-2-assignment-train-idx.json')
test_data_index = pd.read_json('module-2-assignment-test-idx.json')
train_data = products.loc[train_data_index[0], :]
test_data = products.loc[test_data_index[0], :]
train_data = train_data.dropna()
test_data = test_data.dropna()
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()  # definition not shown in the post; default settings assumed
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
test_matrix = vectorizer.fit_transform(test_data['review_clean'])
sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix, train_data['sentiment'])
print (sentiment_model.coef_)
sample_data = test_data[10:13]
print (sample_data)
sample_test_matrix = vectorizer.transform(sample_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print (scores)
Here is the product data:
   Name                                               Review                                             Rating
0  Planetwise Flannel Wipes                           These flannel wipes are OK, but in my opinion ...  3
1  Planetwise Wipe Pouch                              it came early and was not disappointed. i love...  5
2  Annas Dream Full Quilt with 2 Shams                Very soft and comfortable and warmer than it l...  5
3  Stop Pacifier Sucking without tears with Thumb...  This is a product well worth the purchase. I ...   5
4  Stop Pacifier Sucking without tears with Thumb...  All of my kids have cried non-stop when I trie...  5
This line causes the error in the lines that follow:
test_matrix = vectorizer.fit_transform(test_data['review_clean'])
Change it to this:
test_matrix = vectorizer.transform(test_data['review_clean'])
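Explanation: Calling fit_transform() refits the CountVectorizer on the test data. Everything it learned from the training data is lost, and the vocabulary is rebuilt from the test data alone.

You then use that same vectorizer object to transform sample_data['review_clean'], so the resulting features are only the ones learned from test_data. But sentiment_model was trained on the vocabulary of train_data, so the two feature sets differ, and the model rejects the input because the number of columns does not match.

Always use transform() on test data, never fit_transform().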
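The point above can be illustrated with a minimal sketch. The review strings and labels below are made-up toy data standing in for train_data['review_clean'] and test_data['review_clean'], and CountVectorizer is used with its default settings, which may differ from what the assignment uses. It is only meant to show that fitting once on the training text and then calling transform() keeps the column count consistent.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, not taken from amazon_baby.csv.
train_reviews = [
    "love this product",
    "great soft blanket",
    "terrible broke quickly",
    "awful waste of money",
]
train_labels = [1, 1, -1, -1]
test_reviews = ["love this blanket", "terrible waste"]

vectorizer = CountVectorizer()
# Learn the vocabulary once, from the training text only.
train_matrix = vectorizer.fit_transform(train_reviews)
# Reuse that vocabulary for the test text; words unseen in training are ignored.
test_matrix = vectorizer.transform(test_reviews)

model = LogisticRegression()
model.fit(train_matrix, train_labels)

# Both matrices have one column per training-vocabulary word, so this succeeds.
scores = model.decision_function(test_matrix)
print(train_matrix.shape[1], test_matrix.shape[1])
print(scores)

# For comparison: a vectorizer fitted on the test text builds a different,
# smaller vocabulary, which is what triggers the ValueError in the question.
refit_matrix = CountVectorizer().fit_transform(test_reviews)
print(refit_matrix.shape[1])
```

If the same vectorizer and model are always used together, wrapping them in sklearn.pipeline.Pipeline is one way to avoid calling fit_transform() on test data by accident, since the pipeline's predict-time methods only ever call transform().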