Create and save dataframes from another dataframe
An excerpt of my df (35k rows in total):
stop_id time
7909 2022-04-06T03:47:00+03:00
7909 2022-04-06T04:07:00+03:00
1009413 2022-04-06T04:10:00+03:00
1002246 2022-04-06T04:19:00+03:00
1009896 2022-04-06T04:20:00+03:00
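I want to create a separate dataframe for each unique stop_id. Each dataframe should contain stop_id (constant across all rows), a time column with only unique values, and a number column that counts the rows sharing the same time value. So, assuming there are 50 unique stop_id values, I would like to get 50 separate CSV files containing the data described above. How can I do this?
I hope the explanation is not too confusing.
I have this line of code: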
df.groupby(['time']).agg({'time':'size','stop_id': ", ".join})
but it does not preserve the stop_id values.
Expected output:
csv1
stop_id time number
7909 2022-04-06T03:47:00+03:00 1
7909 2022-04-06T04:07:00+03:00 1
7909 2022-04-06T05:00:00+03:00 2
...
csv2
stop_id time number
1009413 2022-04-06T04:10:00+03:00 1
1009413 2022-04-06T04:19:00+03:00 3
1009413 2022-04-06T04:30:00+03:00 5
...
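You can use groupby on stop_id and time with the size() aggregation to get the number column for each dataframe. After that, you can collect all the unique stop_id values and iterate over them to build the individual dataframes, as follows: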
import pandas as pd

data = {"stop_id": [...], "time": [...]}  # Your data
df = pd.DataFrame(data=data)  # Create the DataFrame from the data

# size() returns a Series whose MultiIndex has the form (stop_id, time)
g = df.groupby(['stop_id', 'time']).size()

# Set of stop_ids; you can also use df.stop_id.unique()
stops = {i[0] for i in g.index}

# Iterate over every unique stop_id
for stop in stops:
    # Keep only the index entries that belong to this stop_id
    times = filter(lambda x: x[0] == stop, g.index)
    # Prepare the data for the new DataFrame
    data = {"stop_id": [], "time": [], "number": []}
    # Iterate over each unique time for this stop_id
    for time in times:
        data["stop_id"].append(stop)  # add the stop_id
        data["time"].append(time[1])  # add the current time
        data["number"].append(g[(stop, time[1])])  # add its count
    # Save the DataFrame as a CSV
    pd.DataFrame(data=data).to_csv(f"{stop}.csv", index=False)
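As a possible alternative to the manual loop above, the same result can probably be reached by flattening the counts with reset_index and grouping a second time by stop_id. This is only a sketch: the helper name write_stop_counts, its out_dir parameter, and the toy data are assumptions made for illustration, and the files are written to a temporary directory so the example cleans up after itself.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_stop_counts(df: pd.DataFrame, out_dir: str = ".") -> list:
    """Write one CSV per stop_id with the columns stop_id, time, number.

    Hypothetical helper, not part of pandas.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Count rows per (stop_id, time) pair and turn the MultiIndex into columns.
    counts = df.groupby(["stop_id", "time"]).size().reset_index(name="number")
    written = []
    # Grouping again by stop_id yields one sub-frame per stop.
    for stop_id, sub in counts.groupby("stop_id"):
        path = out / f"{stop_id}.csv"
        sub.to_csv(path, index=False)
        written.append(path)
    return written


# Toy data, made up for illustration only.
sample = pd.DataFrame({
    "stop_id": [7909, 7909, 7909, 1009413],
    "time": [
        "2022-04-06T03:47:00+03:00",
        "2022-04-06T05:00:00+03:00",
        "2022-04-06T05:00:00+03:00",
        "2022-04-06T04:10:00+03:00",
    ],
})

with tempfile.TemporaryDirectory() as tmp:
    paths = write_stop_counts(sample, tmp)
    file_names = sorted(p.name for p in paths)
    first = pd.read_csv(Path(tmp) / "7909.csv")

print(file_names)
print(first)
```

With real data you would pass your own df and an output directory of your choice instead of the temporary one.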
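Edit to address the comment
If I understood correctly, instead of the count you now want the list of elements for each group, building on the previous script. This can be done with the apply() method, used as follows: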
import pandas as pd

data = {"stop_id": [...], "route_name": [...], "time": [...]}
df = pd.DataFrame(data=data)

# The grouped result is a Series indexed by (stop_id, time) tuples.
# Apply the list() function over the values of "route_name" in each group
g = df.groupby(['stop_id', 'time'])["route_name"].apply(list)
print(g)

# Set of stop_ids
stops = {i[0] for i in g.index}

for stop in stops:
    times = filter(lambda x: x[0] == stop, g.index)
    data = {"stop_id": [], "time": [], "route_names": []}
    for time in times:
        data["stop_id"].append(stop)
        data["time"].append(time[1])
        data["route_names"].append(g[(stop, time[1])])
    pd.DataFrame(data=data).to_csv(f"{stop}.csv", index=False)
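Note that with this approach, if you read one of the generated CSVs back into a pandas DataFrame, you have to convert the route_names field from a string to a list. There are a few ways to perform that conversion.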
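One way that should work for the route_names conversion is to pass ast.literal_eval as a converter to pd.read_csv. The sketch below is an assumption-laden illustration rather than the method the answer links to: the toy values are made up, and an in-memory buffer stands in for a CSV file on disk.

```python
import ast
import io

import pandas as pd

# Toy frame shaped like the per-stop output (values are made up).
out = pd.DataFrame({
    "stop_id": [7909, 7909],
    "time": ["2022-04-06T03:47:00+03:00", "2022-04-06T05:00:00+03:00"],
    "route_names": [["A"], ["A", "B"]],
})

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)

# Each list was stored as its text representation, e.g. "['A', 'B']".
# ast.literal_eval parses that text back into a Python list without using eval().
restored = pd.read_csv(buffer, converters={"route_names": ast.literal_eval})

print(restored["route_names"].tolist())
```

This relies on the column containing only Python literals; values that are not valid literals would make ast.literal_eval raise an error.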