Plot Count of Pandas Dataframe with Start_Date and End_Date

Question

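I am trying to plot a daily follower count for various Twitter handles. The result should look something like what you see below, but filterable by more than one Twitter handle:

Usually, I would do this by simply appending each new dataset pulled from Twitter to the original table, together with the date the pull was logged. However, that would leave me with a million rows within just a few days, and it would not let me see clearly when a user dropped off.

As an alternative, after pulling my data from Twitter, I structured my pandas dataframe like this: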

Follower_ID          Handles    Start_Date  End_Date
100                  x          30/05/2017  NaN
101                  x          21/04/2017  29/05/2017
201                  y          14/06/2017  NaN
100                  y          16/06/2017  28/06/2017

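Where:

Handles: the accounts I am pulling followers for
Follower_ID: the user who is following a handle

So, for example, if I were Follower_ID 100, I could follow both handle x and handle y.

I am wondering what the best way would be to prepare the data (pivot, clean through a function, groupby) so that it can be plotted accordingly. Any ideas?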

Answer 1

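I ended up using iterrows in a naive way, so there may be a more efficient approach that takes advantage of pandas reshaping and so on. But my idea was to write a function that takes in your dataframe and the handle you want to plot, and returns another dataframe with that handle's daily follower counts. To do that, the function

filters the df to the desired handle only,
takes each date range (for example, 21/04/2017 to 29/05/2017),
turns that into a pandas date_range, and
puts all the dates in a single list.

At that point, collections.Counter on the single list is a simple way to tally the results by day.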
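The steps above can be sketched on their own before looking at the full function. This is only a minimal illustration: the two date ranges below are made-up values, not rows from the question's data.

```python
from collections import Counter
import pandas as pd

# Two made-up, overlapping follow periods (illustrative values only).
ranges = [('2017-06-14', '2017-06-17'), ('2017-06-16', '2017-06-18')]

all_dates = []
for start, end in ranges:
    # pd.date_range includes both the start and the end day
    all_dates.extend(pd.date_range(start, end).tolist())

# Each day appears once per period that covers it,
# so counting occurrences gives the number of followers on that day.
counts = Counter(all_dates)
for day in sorted(counts):
    print(day.date(), counts[day])
```

The days covered by both periods (16 and 17 June) are counted twice, and every other day once.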

One note is that the null End_Dates should be coalesced to whatever end date you want on your graph. I have called that max_date in the data wrangling. So altogether:

from io import StringIO
from collections import Counter
import pandas as pd

def get_counts(df, handle):
    """Inputs: your dataframe and the handle
    you want to plot.

    Returns a dataframe of daily follower counts.
    """

    # filters the df to the desired handle only
    df_handle = df[df['Handles'] == handle]

    all_dates = []

    for _, row in df_handle.iterrows():
        # Take each date range (for example, 21/04/2017 to 29/05/2017),
        # turn that into a pandas `date_range`, and
        # put all the dates in a single list
        all_dates.extend(pd.date_range(row['Start_Date'],
                                       row['End_Date']) \
                           .tolist())

    counts = pd.DataFrame.from_dict(Counter(all_dates), orient='index') \
                         .rename(columns={0: handle}) \
                         .sort_index()

    return counts

That's the function. Now reading in and munging your data ...

data = StringIO("""Follower_ID          Handles    Start_Date  End_Date
100                  x          30/05/2017  NaN
101                  x          21/04/2017  29/05/2017
201                  y          14/06/2017  NaN
100                  y          16/06/2017  28/06/2017""")

# sep=r'\s+' splits the columns on runs of whitespace
df = pd.read_csv(data, sep=r'\s+')

# pandas timestamps (so that we can use pd.date_range);
# the dates are day-first, and missing end dates become NaT
df['Start_Date'] = pd.to_datetime(df['Start_Date'], format='%d/%m/%Y')
df['End_Date'] = pd.to_datetime(df['End_Date'], format='%d/%m/%Y')

# fill in missing end dates
max_date = pd.Timestamp('2017-06-30')
df['End_Date'] = df['End_Date'].fillna(max_date)

print(get_counts(df, 'y'))

The last line prints the daily counts for handle y:

            y
2017-06-14  1
2017-06-15  1
2017-06-16  2
2017-06-17  2
2017-06-18  2
2017-06-19  2
2017-06-20  2
2017-06-21  2
2017-06-22  2
2017-06-23  2
2017-06-24  2
2017-06-25  2
2017-06-26  2
2017-06-27  2
2017-06-28  2
2017-06-29  1
2017-06-30  1

You can plot this dataframe with your preferred package.
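Since the question asks for a plot that can be filtered by more than one handle, one possible variation is to build a single table with one column per handle. The sketch below is an alternative to the iterrows approach, not part of the original answer: it expands every row into one (handle, day) pair per day and tallies them with pd.crosstab. The names long and daily are illustrative, and the data is the question's sample with open-ended End_Dates filled by an assumed max_date of 2017-06-30.

```python
import pandas as pd

# The question's sample data, with dates written in ISO form.
# None marks a follower who has not left yet.
df = pd.DataFrame({
    'Follower_ID': [100, 101, 201, 100],
    'Handles': ['x', 'x', 'y', 'y'],
    'Start_Date': pd.to_datetime(['2017-05-30', '2017-04-21',
                                  '2017-06-14', '2017-06-16']),
    'End_Date': pd.to_datetime([None, '2017-05-29', None, '2017-06-28']),
})

# Assumed last day to show on the chart.
max_date = pd.Timestamp('2017-06-30')
df['End_Date'] = df['End_Date'].fillna(max_date)

# One row per (handle, day) for every day a follower was following.
long = pd.DataFrame(
    [(handle, day)
     for handle, start, end in zip(df['Handles'], df['Start_Date'], df['End_Date'])
     for day in pd.date_range(start, end)],
    columns=['Handles', 'Date'],
)

# Rows are days, columns are handles, values are follower counts.
# asfreq fills in any day on which no handle had followers.
daily = pd.crosstab(long['Date'], long['Handles']).asfreq('D', fill_value=0)

print(daily.tail())
```

Selecting a subset of columns, for example daily[['x', 'y']], then gives the per-handle filter. With matplotlib installed, daily.plot() would draw one line per handle; any of the other plotting packages should work equally well on this table.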
