Comparing data in previous row using shift() function
I am working with monthly Citi Bike trip data, which looks like this:
> head(data)
tripduration starttime stoptime start.station.id start.station.name end.station.id
1 732 2015-07-01 00:00:03 2015-07-01 00:12:16 489 10 Ave & W 28 St 368
2 322 2015-07-01 00:00:06 2015-07-01 00:05:29 304 Broadway & Battery Pl 3002
3 790 2015-07-01 00:00:17 2015-07-01 00:13:28 447 8 Ave & W 52 St 358
4 1228 2015-07-01 00:00:23 2015-07-01 00:20:51 490 8 Ave & W 33 St 250
5 1383 2015-07-01 00:00:44 2015-07-01 00:23:48 327 Vesey Pl & River Terrace 72
6 603 2015-07-01 00:01:00 2015-07-01 00:11:04 455 1 Ave & E 44 St 367
end.station.name bikeid usertype birth.year gender
1 Carmine St & 6 Ave 18669 Subscriber 1970 1
2 South End Ave & Liberty St 14618 Subscriber 1984 1
3 Christopher St & Greenwich St 18801 Subscriber 1992 1
4 Lafayette St & Jersey St 19137 Subscriber 1990 1
5 W 52 St & 11 Ave 15808 Subscriber 1988 1
6 E 53 St & Lexington Ave 17069 Subscriber 1953 1
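Each trip has its own unique record, and there were 1,085,676 trips in July alone. The raw data in .csv format can be found on the Citi Bike website. Usually, a bike starts a trip at the station where its previous trip ended. However, this is not always the case. Sometimes a bike's start station differs from its previous end station, which indicates that the bike was "rebalanced", or moved by truck from one station to another to meet station demand. I would like to filter out all "normal" trips and isolate every case where a bike starts at a station other than the one where it last ended (that is, start.station.id is not equal to the previous end.station.id). The unique identifier for a bike is bikeid, and it must be used. Below is a subset of one month of data for a single bikeid (the most-ridden bike):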
head(onebike)
tripduration starttime stoptime start.station.id start.station.name
1952 691 2015-07-01 07:23:24 2015-07-01 07:34:56 161 LaGuardia Pl & W 3 St
2369 332 2015-07-01 07:38:49 2015-07-01 07:44:22 379 W 31 St & 7 Ave
3879 259 2015-07-01 08:14:34 2015-07-01 08:18:54 472 E 32 St & Park Ave
4310 1112 2015-07-01 08:22:53 2015-07-01 08:41:25 498 Broadway & W 32 St
5795 1509 2015-07-01 08:47:18 2015-07-01 09:12:27 345 W 13 St & 6 Ave
7857 1361 2015-07-01 09:23:50 2015-07-01 09:46:32 348 W Broadway & Spring St
end.station.id end.station.name bikeid usertype birth.year gender
1952 379 W 31 St & 7 Ave 22075 Subscriber 1985 1
2369 472 E 32 St & Park Ave 22075 Subscriber 1986 1
3879 498 Broadway & W 32 St 22075 Subscriber 1986 1
4310 345 W 13 St & 6 Ave 22075 Customer NA 0
5795 348 W Broadway & Spring St 22075 Customer NA 0
7857 386 Centre St & Worth St 22075 Customer NA 0
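The task now is to select the rows where start.station.id is not equal to the end.station.id of the previous row.
The result should include bikeid, end.station.id, start.station.id, and the time difference between when the bike was dropped off and when it was next picked up (roughly indicating the move).
Is the best approach to use the shift() function?
How can I iterate over every bikeid in the first dataset (there are roughly 7,000) to reveal all the hidden moves?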
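I'm sure there are hundreds of ways to do this, but since the data is only around 100MB, a simple for loop is perfectly capable here and easy to modify and extend, so here it is (it completes in seconds):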
# Read the November 2015 trip file
raw_data <- read.csv("201511-citibike-tripdata.csv")
bikeid <- 22075
# Keep only the trips made by this bike
onebike <- raw_data[which(raw_data$bikeid == bikeid), ]
output <- data.frame("bikeid" = integer(0), "end.station.id" = integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0))
for (i in 2:nrow(onebike)) {
  # A move happened if this trip starts somewhere other than where the previous trip ended
  if (onebike[i-1, "end.station.id"] != onebike[i, "start.station.id"]) {
    # Minutes from the previous drop-off to this pick-up (later time first, so the value is positive)
    diff_time <- as.double(difftime(strptime(onebike[i, "starttime"], "%m/%d/%Y %H:%M:%S"),
                                    strptime(onebike[i-1, "stoptime"], "%m/%d/%Y %H:%M:%S"),
                                    units = "mins"))
    new_row <- c(bikeid, onebike[i-1, "end.station.id"], onebike[i, "start.station.id"], diff_time)
    output[nrow(output) + 1, ] <- new_row
  }
}
output
bikeid end.station.id start.station.id diff.time
1 22075 514 520 181.5667
2 22075 356 502 628.8833
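The same single-bike comparison can also be written without an explicit loop. The following is only a minimal sketch in base R: `toy` is a hypothetical data frame with made-up station IDs and timestamps in the same column layout, not the real file, and `start_time`, `stop_time`, `moved` and `result` are names introduced here for illustration. It compares each row's start.station.id with the end.station.id of the row before it and keeps the mismatches.

```r
# Hypothetical toy data for one bike, already sorted by starttime
toy <- data.frame(
  bikeid           = 22075L,
  starttime        = c("11/1/2015 08:00:00", "11/1/2015 09:00:00",
                       "11/1/2015 12:00:00", "11/1/2015 13:00:00"),
  stoptime         = c("11/1/2015 08:10:00", "11/1/2015 09:15:00",
                       "11/1/2015 12:20:00", "11/1/2015 13:05:00"),
  start.station.id = c(161L, 379L, 520L, 498L),
  end.station.id   = c(379L, 472L, 498L, 345L),
  stringsAsFactors = FALSE
)

fmt        <- "%m/%d/%Y %H:%M:%S"
start_time <- as.POSIXct(toy$starttime, format = fmt, tz = "UTC")
stop_time  <- as.POSIXct(toy$stoptime,  format = fmt, tz = "UTC")
n          <- nrow(toy)

# Compare rows 1..(n-1) (previous end) with rows 2..n (current start);
# adding 1 turns the position into the row index of the trip after the move
moved <- which(toy$end.station.id[-n] != toy$start.station.id[-1]) + 1

result <- data.frame(
  bikeid           = toy$bikeid[moved],
  end.station.id   = toy$end.station.id[moved - 1],
  start.station.id = toy$start.station.id[moved],
  # Minutes between the previous drop-off and the next pick-up
  diff.time        = as.numeric(difftime(start_time[moved],
                                         stop_time[moved - 1],
                                         units = "mins"))
)
result
```

Because the result is built in one step, the columns keep their own types instead of being coerced by row-by-row appends.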
Edit: this is to further address the question in the comments.
Here is a simple extension that covers all bikeids:
raw_data <- read.csv("201511-citibike-tripdata.csv")
unique_id <- unique(raw_data$bikeid)
#bikeid <- 22075
output <- data.frame("bikeid" = integer(0), "end.station.id" = integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0), "stoptime" = character(), "starttime" = character(), stringsAsFactors = FALSE)
for (bikeid in unique_id) {
  # All trips for this bike, in file order
  onebike <- raw_data[which(raw_data$bikeid == bikeid), ]
  # A bike needs at least two trips before a move can be detected
  if (nrow(onebike) >= 2) {
    for (i in 2:nrow(onebike)) {
      # Flag trips that start at a station other than the previous end station
      if (is.integer(onebike[i-1, "end.station.id"]) && is.integer(onebike[i, "start.station.id"]) &&
          onebike[i-1, "end.station.id"] != onebike[i, "start.station.id"]) {
        # Minutes between the previous drop-off and this pick-up
        diff_time <- as.double(difftime(strptime(onebike[i, "starttime"], "%m/%d/%Y %H:%M:%S"),
                                        strptime(onebike[i-1, "stoptime"], "%m/%d/%Y %H:%M:%S"),
                                        units = "mins"))
        # c() coerces the mixed values to character, so every output column is stored as text
        new_row <- c(bikeid, onebike[i-1, "end.station.id"], onebike[i, "start.station.id"], diff_time, as.character(onebike[i-1, "stoptime"]), as.character(onebike[i, "starttime"]))
        output[nrow(output) + 1, ] <- new_row
      }
    }
  }
}
dim(output)
[1] 32589 6
head(output)
bikeid end.station.id start.station.id diff.time stoptime starttime
1 22545 520 529 24.8166666666667 11/2/2015 08:38:22 11/2/2015 09:03:11
2 22545 520 517 537.483333333333 11/2/2015 09:39:19 11/2/2015 18:36:48
3 22545 2004 3230 563.066666666667 11/2/2015 22:06:27 11/3/2015 07:29:31
4 22545 296 3236 471.783333333333 11/4/2015 23:40:29 11/5/2015 07:32:16
5 22545 520 449 43.4166666666667 11/9/2015 08:24:06 11/9/2015 09:07:31
6 22545 359 519 30.7166666666667 11/9/2015 09:14:46 11/9/2015 09:45:29
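Since the question asks about shift(), here is a hedged sketch of how the all-bikes version might look with the data.table package, whose shift() lags a column by one row by default. It assumes data.table is installed; the `trips` table is hypothetical toy data with made-up IDs and times, and `prev.end.station.id`, `prev.stoptime` and `moves` are names introduced here, not part of the original answer.

```r
library(data.table)

# Hypothetical toy data for two bikes (made-up IDs and times)
trips <- data.table(
  bikeid           = c(1L, 1L, 1L, 2L, 2L),
  starttime        = c("11/1/2015 08:00:00", "11/1/2015 09:00:00",
                       "11/1/2015 12:00:00", "11/1/2015 08:30:00",
                       "11/1/2015 10:00:00"),
  stoptime         = c("11/1/2015 08:10:00", "11/1/2015 09:15:00",
                       "11/1/2015 12:20:00", "11/1/2015 08:45:00",
                       "11/1/2015 10:30:00"),
  start.station.id = c(161L, 379L, 520L, 300L, 301L),
  end.station.id   = c(379L, 472L, 498L, 301L, 305L)
)

# Parse the timestamps (same format string as the loop version)
fmt <- "%m/%d/%Y %H:%M:%S"
trips[, starttime := as.POSIXct(starttime, format = fmt, tz = "UTC")]
trips[, stoptime  := as.POSIXct(stoptime,  format = fmt, tz = "UTC")]

# Order trips chronologically within each bike
setorder(trips, bikeid, starttime)

# Lag the end station and stop time by one row; by = bikeid keeps the lag
# inside each bike, so a bike's first trip gets NA rather than another bike's value
trips[, `:=`(prev.end.station.id = shift(end.station.id),
             prev.stoptime       = shift(stoptime)),
      by = bikeid]

# Keep trips that start somewhere other than where the previous trip ended
moves <- trips[!is.na(prev.end.station.id) &
                 prev.end.station.id != start.station.id,
               .(bikeid           = bikeid,
                 end.station.id   = prev.end.station.id,
                 start.station.id = start.station.id,
                 diff.time        = as.numeric(difftime(starttime, prev.stoptime,
                                                        units = "mins")))]
moves
```

On the toy data only bike 1 has a mismatch (it ends at station 472 and next starts at 520). Grouping with by = bikeid replaces the outer loop over bike IDs, and the columns keep their numeric types.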