Create and save dataframes from another dataframe

Question

An excerpt of my df (35k rows in total):

stop_id                      time
7909    2022-04-06T03:47:00+03:00
7909    2022-04-06T04:07:00+03:00
1009413 2022-04-06T04:10:00+03:00
1002246 2022-04-06T04:19:00+03:00
1009896 2022-04-06T04:20:00+03:00

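I want to create a separate dataframe for each unique stop_id. Each dataframe should contain stop_id (constant for every row), a time field holding only unique values, and a number column that counts the rows sharing the same time value. So, assuming there are 50 unique stop_id values, I want to end up with 50 separate CSV files containing the data described above. How can I do this?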

I hope the explanation isn't too messy.

I have this line of code:

df.groupby(['time']).agg({'time':'size','stop_id': ", ".join})

but it doesn't preserve the values of stop_id.

Expected output: csv1

stop_id   time                      number   
7909      2022-04-06T03:47:00+03:00 1
7909      2022-04-06T04:07:00+03:00 1
7909      2022-04-06T05:00:00+03:00 2
...

csv2

stop_id      time                      number   
1009413      2022-04-06T04:10:00+03:00 1
1009413      2022-04-06T04:19:00+03:00 3
1009413      2022-04-06T04:30:00+03:00 5
...

Answer 1

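You can use groupby on stop_id and time together with the size() aggregation to get the data for the number column of each dataframe. After that, you can collect all unique stop_id values and iterate over each group to build the individual dataframes, as follows: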

import pandas as pd


data = {"stop_id": [...], "time": [...]} # Your Data

df = pd.DataFrame(data=data) # Create the DataFrame from the data

# size() returns a Series whose MultiIndex has the form (stop_id, time)
g = df.groupby(['stop_id', 'time']).size()

# Set of stop_ids, you can also use df.stop_id.unique()
stops = { i[0] for i in g.index }

# Iterate over every unique stop_id
for stop in stops:
    # Filter only the groups with the right stop_id
    times = filter(lambda x: x[0] == stop, g.index)
    
    # Prepare new DataFrame
    data = { "stop_id": [], "time": [], "number": []}
    
    # Iterate over each unique time for the specific stop_id
    for time in times:
        data["stop_id"].append(stop) # add the stop_id
        data["time"].append(time[1]) # add the current time
        data["number"].append(g[(stop, time[1])]) # add its count

    # Save the DataFrame as a CSV
    pd.DataFrame(data=data).to_csv(f"{stop}.csv", index=False)
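
The same result can also be reached without building the dictionaries by hand. The following is only a sketch under some assumptions: the sample rows are made up for illustration, and the helper names split_by_stop and save_frames are hypothetical, not part of pandas. It relies on reset_index(name=...) to turn the counts into a regular column and on the fact that iterating over a GroupBy yields (key, sub-DataFrame) pairs.

```python
from pathlib import Path

import pandas as pd

# Hypothetical sample rows mirroring the layout in the question
df = pd.DataFrame({
    "stop_id": [7909, 7909, 7909, 1009413],
    "time": [
        "2022-04-06T03:47:00+03:00",
        "2022-04-06T05:00:00+03:00",
        "2022-04-06T05:00:00+03:00",
        "2022-04-06T04:10:00+03:00",
    ],
})


def split_by_stop(df):
    """Return {stop_id: DataFrame with columns stop_id, time, number}."""
    # One row per (stop_id, time) pair; the group size goes into "number"
    counts = df.groupby(["stop_id", "time"]).size().reset_index(name="number")
    # Iterating a GroupBy yields (key, sub-DataFrame) pairs
    return {stop: sub for stop, sub in counts.groupby("stop_id")}


def save_frames(frames, out_dir):
    """Write each per-stop DataFrame to <out_dir>/<stop_id>.csv."""
    out_dir = Path(out_dir)
    for stop, sub in frames.items():
        sub.to_csv(out_dir / f"{stop}.csv", index=False)


frames = split_by_stop(df)
print(frames[7909])
```

Calling save_frames(frames, ".") would then write one CSV per stop_id into the current directory, which should match what the loop above produces.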

Edit to address the comments

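If I understood correctly, you now want the list of elements in each group rather than the count produced by the previous script. This is possible with the apply() method, used in the following way: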

import pandas as pd


data = {"stop_id": [...], "route_name": [...], "time": [...]}

df = pd.DataFrame(data=data)

# The result is a Series indexed by (stop_id, time) tuples
# Apply the list() function over the values of "route_name" in each group
g = df.groupby(['stop_id', 'time'])["route_name"].apply(list)
print(g)

# Set of stop_ids
stops = { i[0] for i in g.index }
for stop in stops:
    times = filter(lambda x: x[0] == stop, g.index)
    data = { "stop_id": [], "time": [], "route_names": []}
    for time in times:
        data["stop_id"].append(stop)
        data["time"].append(time[1])
        data["route_names"].append(g[(stop, time[1])])
    pd.DataFrame(data=data).to_csv(f"{stop}.csv", index=False)

Note that, done this way, if you read one of the generated CSVs back into a pandas DataFrame, the route_names field comes back as a string and has to be converted back into a list.
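
One possible way to do that conversion, shown here only as a sketch, is to pass ast.literal_eval as a converter to read_csv. The data below is made up, and an in-memory buffer stands in for one of the generated CSV files; this assumes the lists were written by to_csv as their plain Python representation, as in the script above.

```python
import ast
import io

import pandas as pd

# Hypothetical data shaped like one of the per-stop DataFrames
original = pd.DataFrame({
    "stop_id": [7909, 7909],
    "time": ["2022-04-06T03:47:00+03:00", "2022-04-06T04:07:00+03:00"],
    "route_names": [["12", "7A"], ["3"]],
})

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Without a converter, each list comes back as its string representation
plain = pd.read_csv(io.StringIO(buffer.getvalue()))
print(type(plain["route_names"][0]))

# With a converter, every cell of the column is parsed back into a list
restored = pd.read_csv(
    io.StringIO(buffer.getvalue()),
    converters={"route_names": ast.literal_eval},
)
print(type(restored["route_names"][0]))
```

ast.literal_eval only parses Python literals, so it is a safer choice than eval for this kind of round trip.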

