Add values based on row value
I have code that creates CSV files after performing some operations on the original DataFrame:
import pandas as pd
timetable = pd.read_excel('timetable.xlsx')
data = {"stop_id": timetable['stop_id'], "arrival_time": timetable['arrival_time'], 'route_id': timetable['route_id']}
df = pd.DataFrame(data=data)  # Create the DataFrame from the data
g = df.groupby(['stop_id', 'arrival_time']).size()
stops = {i[0] for i in g.index}
for stop in stops:
    times = filter(lambda x: x[0] == stop, g.index)
    data = {"stop_id": [], "arrival_time": [], "number": []}
    for time in times:
        data["stop_id"].append(stop)  # add the stop_id
        data["arrival_time"].append(time[1])  # add the current time
        data["number"].append(g[(stop, time[1])])  # add its count
    pd.DataFrame(data=data).to_csv(f"{stop}.csv", index=False)
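How should I change the code so that it also appends the values of another column?
I have a column route_id, which has different values for each stop_id, and I want to list these route_id values for each arrival_time row. Context: a bus (route_id) arrives at a stop_id at a certain arrival_time, but several buses can arrive at the same arrival_time, so I want to know which route_id arrives at each time.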
Data: https://docs.google.com/spreadsheets/d/1O6QGWZh0Yp2EcJAnlvIJw0xiCH8T1AY_/edit#gid=640877265
Data excerpt:
route_id stop_id arrival_time
429 2179 4/6/22 19:40:00
429 2179 4/6/22 08:06:00
429 2179 4/6/22 09:20:00
429 2179 4/6/22 11:12:00
429 2179 4/6/22 12:25:00
429 2179 4/6/22 13:39:00
429 2179 4/6/22 17:56:00
429 2179 4/6/22 19:19:00
441 2179 4/6/22 07:16:00
441 2179 4/6/22 10:37:00
441 2179 4/6/22 14:33:00
Self-explanatory:
import pandas as pd
df = pd.read_excel('timetable.xlsx', converters={'stop_id': int, 'route_id': int})
# Group by stop_id and arrival_time; count the rows in each group and
# collect the group's route_id values into a sorted list.
# The result has a MultiIndex, so .reset_index() is applied to flatten it.
df_grouped = df.groupby(['stop_id', 'arrival_time'])\
    .agg(number=('arrival_time', 'size'), route_id=('route_id', sorted))\
    .reset_index()
# Write one .csv per unique stop_id in df_grouped
for stop in df_grouped.stop_id.unique():
    file_name = 'Stop_ID{0}.csv'.format(stop)
    df_grouped[df_grouped['stop_id'] == stop].to_csv(file_name, index=False)
Per the comments, a string option instead of a list:
import pandas as pd
df = pd.read_excel('timetable.xlsx', converters={'stop_id': int, 'route_id': int})
df.route_id = df.route_id.astype(str)  # changing dtype to string before grouping
df_grouped = df.groupby(['stop_id', 'arrival_time'])\
    .agg(number=('arrival_time', 'size'), route_id=('route_id', ', '.join))\
    .reset_index()
for stop in df_grouped.stop_id.unique():
    file_name = 'Stop_{0}.csv'.format(stop)
    df_grouped[df_grouped['stop_id'] == stop].to_csv(file_name, index=False)
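To try the string variant without the Excel file, here is a minimal sketch on an in-memory DataFrame. Three rows are taken from the excerpt; the fourth (route 441 at 08:06:00) is hypothetical, added only so that two routes share one arrival time. The names `grouped` and `csv_by_stop` are illustrative, and the times are kept as plain strings rather than parsed dates.

```python
import pandas as pd

# Rows modeled on the data excerpt; the last row is hypothetical and only
# there so that two routes arrive at the same time.
df = pd.DataFrame({
    "route_id": [429, 429, 441, 441],
    "stop_id": [2179, 2179, 2179, 2179],
    "arrival_time": [
        "4/6/22 08:06:00",
        "4/6/22 09:20:00",
        "4/6/22 07:16:00",
        "4/6/22 08:06:00",  # hypothetical row
    ],
})

# str.join needs strings, so convert route_id before grouping.
df["route_id"] = df["route_id"].astype(str)

# One row per (stop_id, arrival_time): the row count and the joined routes.
grouped = (
    df.groupby(["stop_id", "arrival_time"])
      .agg(number=("route_id", "size"), route_id=("route_id", ", ".join))
      .reset_index()
)
print(grouped)

# Iterating over a groupby yields one sub-frame per stop, as an alternative
# to filtering with a boolean mask. to_csv() without a path returns the CSV
# text, so nothing is written to disk here.
csv_by_stop = {
    stop: part.to_csv(index=False)
    for stop, part in grouped.groupby("stop_id")
}
```

Passing a file name to `to_csv`, as in the answer, writes the same content to one file per stop.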