Get values from K-Means clusters using dataframe
I have this dataframe (text_df):
There are 10 different authors and 13834 rows of text.
I then created a bag of words and used TfidfVectorizer like this:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True
                          )
X = tfidf_v.fit_transform(corpus).toarray() # corpus --> bagofwords
y = text_df.iloc[:,1].values
The shape of X is (13834, 2701).
I decided to use 7 clusters for KMeans:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=7,random_state=42)
I want to extract the authors of the texts in each cluster to see whether the same authors consistently end up in the same cluster. I'm not sure of the best way to approach this. Thanks!
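Roughly, what I'm after is a per-cluster tally of authors, something along these lines (only a sketch of the idea, assuming X and y are as above and pandas is available; I haven't run this):
import pandas as pd

# Sketch: pair each row's author with its predicted cluster,
# then count the authors inside every cluster.
y_kmeans = km.fit_predict(X)  # one cluster label per row
pairs = pd.DataFrame({'cluster': y_kmeans, 'author': y})
print(pairs.groupby('cluster')['author'].value_counts())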
Update:
Tried using a nested dictionary to visualize the author counts per cluster:
import numpy as np

author_cluster = {}
for i in range(len(y_kmeans)):
    # pick a random row each pass (y_kmeans holds the cluster prediction per row);
    # randint's upper bound is exclusive, so use len(y) to keep the last row reachable
    j = np.random.randint(0, len(y), 1)[0]
    if y_kmeans[j] not in author_cluster:
        author_cluster[y_kmeans[j]] = {}
    if y[j] not in author_cluster[y_kmeans[j]]:
        author_cluster[y_kmeans[j]][y[j]] = 1
    else:
        author_cluster[y_kmeans[j]][y[j]] += 1
Output:
Each cluster should have far more counts, and each cluster may well contain more than one author. I'd like to use all of the predictions to get an accurate count rather than sampling a subset, but I'm open to alternative solutions.
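For example, something like this crosstab sketch is the direction I mean (it assumes y_kmeans holds a prediction for each of the 13834 rows):
import pandas as pd

# Sketch: tabulate author vs. cluster over all predictions at once
counts = pd.crosstab(pd.Series(y_kmeans, name='cluster'),
                     pd.Series(y, name='author'))
print(counts)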
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True
                          )
X = tfidf_v.fit_transform(corpus) # I removed .toarray() - not sure why it was there except maybe for print debugging?
y = text_df.iloc[:,1].values
km = KMeans(n_clusters=7,random_state=42)
model = km.fit(X)
result = model.predict(X)
for i in range(20):
    # check 20 random predictions (randint's high is exclusive, so 13834 covers every row)
    container = np.random.randint(low=0, high=13834, size=1)
    j = container[0]
    # note: X[j] is the tf-idf row for that text, not the original text itself
    print(f'Author {y[j]} wrote {X[j]} and was put in cluster {result[j]}')
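As a follow-up sketch (not part of the code above), the per-cluster author counts asked about can be tallied over every row with collections.Counter, since result already holds a prediction for each row:
from collections import Counter

# Sketch: count how many texts by each author landed in each cluster,
# using every prediction in `result` instead of a 20-row spot check.
author_cluster = {}
for cluster_label, author in zip(result, y):
    author_cluster.setdefault(cluster_label, Counter())[author] += 1

for cluster_label, counts in sorted(author_cluster.items()):
    print(f'Cluster {cluster_label}: {counts.most_common()}')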