Visualizing train and test data from train_test_split with seaborn
I have a dataset with 9583 rows, which I split with train_test_split. I want to visualize my training data and test data with a barplot, like this example:
import pandas as pd
df = pd.read_excel("Data/data_clean_spacy_for_implementation.xlsx")
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
X_array = X_train.toarray()
print(X_train.shape) #output (7666, 12222)
print(X_test.shape) #output (1917, 12222)
How can I do this?
My data: github
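You can use value_counts to count the unique values of each label, followed by sns.barplot, with index as the x-axis and values as the y-axis. If it makes sense for your analysis, you can use sharey='row' (plt.subplots(..., sharey='row')), so that the two columns in each row (train and test) share the same y-axis.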
...
...
print(X_train.shape) #output (7666, 12222)
print(X_test.shape) #output (1917, 12222)
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(1,2, figsize=(12,5))
for idx, group in enumerate([('Train', y_train), ('Test', y_test)]):
    # Count how many samples carry each label in this split
    data = group[1].value_counts()
    sns.barplot(ax=ax[idx], x=data.index, y=data.values)
    ax[idx].set_title(f'{group[0]} Label Count')
    ax[idx].set_xlabel(f'{group[0]} Labels')
    ax[idx].set_ylabel('Label Count')
plt.show()
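The sharey='row' variant mentioned in the answer is not shown in the code, so here is a minimal, self-contained sketch of it. The y_train and y_test series below are small made-up stand-ins (an assumption, not the asker's data), and the label names are hypothetical. With a shared y-axis the bar heights of the two panels are directly comparable, which also makes the size difference between the two splits obvious.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-ins for the y_train / y_test returned by train_test_split
y_train = pd.Series(["pos"] * 6 + ["neg"] * 3 + ["neu"] * 1)
y_test = pd.Series(["pos"] * 3 + ["neg"] * 2 + ["neu"] * 1)

# sharey='row' makes both panels in the single row use the same y-axis limits
fig, ax = plt.subplots(1, 2, figsize=(12, 5), sharey="row")
for idx, (name, labels) in enumerate([("Train", y_train), ("Test", y_test)]):
    counts = labels.value_counts()  # label -> number of samples in this split
    sns.barplot(ax=ax[idx], x=counts.index, y=counts.values)
    ax[idx].set_title(f"{name} Label Count")
    ax[idx].set_xlabel(f"{name} Labels")
    ax[idx].set_ylabel("Label Count")

fig.tight_layout()
```

Call plt.show() afterwards to display the figure. Because the test split is much smaller than the train split, its bars will look short on a shared axis; drop sharey if you only care about the label distribution within each split.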