Plotting dates from a pandas DataFrame

Question

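I have a pandas DataFrame with 27 columns of electricity consumption data. The first column holds the date and time over a two-year period, and the other columns record the hourly electricity consumption of 26 houses over those two years. I am clustering the data with k-means. Whenever I try to plot the dates on the x-axis and the consumption values on the y-axis, I get an error saying that x and y must be the same size. I tried reshaping the data, but that did not solve the problem.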

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import math
import datetime
data_consumption2 = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
data_consumption2['Timestamp'] = pd.to_datetime(data_consumption2['Timestamp'], unit='s')
X=data_consumption2.iloc[: , 1:26].values
X=np.nan_to_num(X)
np.concatenate(X)
date=data_consumption2.iloc[: , 0].values
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
C = kmeans.cluster_centers_
plt.scatter(X, R , s=40, c= kmeans.labels_.astype(float), alpha=0.7)
plt.scatter(C[:,0] , C[:,1] , marker='*' , c='r', s=100)

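I always get the same error message: x and y must be the same size, along with a suggestion to reshape the data. Reshaping does not work, because the size of the date column is always smaller than the size of the remaining columns.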

Answer 1

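I think what you are actually doing is time-series clustering across all the households, to find similar electricity usage patterns over time.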

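To do that, each timestamp becomes a 'feature' and each household's usage becomes a row of your data. This makes it much easier to apply sklearn clustering methods, which usually take the form method.fit(x), where x holds the features (the data is passed as a 2D array of shape (row, column)). So your data needs to be transposed.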
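To make the shapes concrete, here is a minimal sketch on made-up data. The frame `wide`, the `house_N` column names, and the sizes (48 hours, 4 houses) are assumptions for illustration only, not the real spreadsheet. It shows why the plot fails (one date per row against several readings per row) and what the transpose does to the layout.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the spreadsheet: 48 hourly readings for 4 houses.
rng = np.random.default_rng(0)
timestamps = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
wide = pd.DataFrame(
    rng.random((48, 4)),
    columns=[f"house_{i}" for i in range(1, 5)],
)
wide.insert(0, "Timestamp", timestamps)

# Why the scatter call complains: one date per row, but several readings per row.
dates = wide.iloc[:, 0].values
readings = wide.iloc[:, 1:].values
print(dates.shape, readings.shape)  # (48,) (48, 4)

# Transpose so that each house is a row (sample) and each timestamp a column (feature).
by_house = wide.set_index("Timestamp").transpose()
print(by_house.shape)  # (4, 48)
```

After the transpose, every row has exactly one value per timestamp, so a single house can be plotted against the column labels without any size mismatch.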

The refactored code looks like this:

# what you have done 
import pandas as pd
df = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')

# this is to fill all the NaN values with 0
df.fillna(0,inplace=True)

# transpose the dataframe accordingly
df = df.set_index('Timestamp').transpose()
df.rename(columns=lambda x : x.strftime('%D %H:%M:%S'), inplace=True)
df.reset_index(inplace=True)
df.rename(columns={'index':'house_no'}, inplace=True)
df.columns.rename(None, inplace=True)
df.head()

You should see something like this (never mind the values shown; I created some dummy data similar to yours).

Next, for the clustering, you can do this:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(df.iloc[:,1:])
y_kmeans = kmeans.predict(df.iloc[:,1:])
C = kmeans.cluster_centers_

# add a new column to your dataframe that contains the predicted clusters
df['cluster'] = y_kmeans
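
As a rough check of what the clustering step returns, the sketch below runs the same kind of call on made-up data. The `houses` frame, the three usage levels, and the `n_init` and `random_state` values are assumptions chosen only to make the example small and repeatable; they are not part of the code above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: 6 houses, 24 hourly readings, three clearly separated usage levels.
rng = np.random.default_rng(0)
levels = np.repeat([1.0, 5.0, 10.0], 2)  # two houses per usage level
profiles = levels[:, None] + rng.normal(0, 0.1, size=(6, 24))
houses = pd.DataFrame(profiles, index=[f"house_{i}" for i in range(1, 7)])

# n_init and random_state are set explicitly only to make the run repeatable.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(houses.to_numpy())

print(labels.shape)                  # (6,)    one label per house
print(model.cluster_centers_.shape)  # (3, 24) one centre profile per cluster
```

Each cluster centre has one value per timestamp, which is why the centres can later be drawn against the same timestamp columns as the individual houses.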

Finally, for plotting, you can use the following code to produce the scatter plot you want:

import matplotlib.pyplot as plt

color = ['red','green','blue']

plt.figure(figsize=(16,4))

for index, row in df.iterrows():
    plt.scatter(x=row.index[1:-1], y=row.iloc[1:-1], c=color[row.iloc[-1]], marker='x', alpha=0.7, s=40)

for index, cluster_center in enumerate(kmeans.cluster_centers_):
    plt.scatter(x=df.columns[1:-1], y=cluster_center, c=color[index], marker='o', s=100)

plt.xticks(rotation='vertical')
plt.ylabel('Electricity Consumption')
plt.title('All Clusters - Scatter', fontsize=20)
plt.show()

However, I would suggest drawing a line plot for each individual cluster, which is more visually appealing (to me):

plt.figure(figsize=(16,16))

for cluster_index in [0,1,2]:

    plt.subplot(3,1,cluster_index + 1)

    for index, row in df.iterrows():
        if row.iloc[-1] == cluster_index:
            plt.plot(row.iloc[1:-1], c=color[row.iloc[-1]], linestyle='--', marker='x', alpha=0.5)

    plt.plot(kmeans.cluster_centers_[cluster_index], c = color[cluster_index], marker='o', alpha=1)

    plt.xticks(rotation='vertical')
    plt.ylabel('Electricity Consumption')
    plt.title(f'Cluster {cluster_index}', fontsize=20)

plt.tight_layout()
plt.show()

Cheers!
