How to rank a column in an R data frame based on another column
Suppose I have an R data frame like this:
# sample data frame
df <- data.frame(
  customer_id = c(568468,568468,568468,568468,568468,568468),
  customer = c('paramount','paramount','paramount','paramount','paramount','paramount'),
  start_date = as.Date(c('2016-03-15','2016-03-15','2016-03-15','2016-03-15','2016-03-15','2016-03-15')),
  occured_on = as.POSIXct(c('2017-08-08 20:05:00','2017-08-08 20:30:00','2017-08-11 21:13:00','2017-08-11 21:30:00','2017-08-31 05:16:00','2017-08-31 05:30:00')),
  old_plan = c('a',NA,'b',NA,'b',NA),
  old_price = c(NA,29,NA,99,NA,82.5),
  old_recurrence = c('monthly',NA,'monthly',NA,'annually',NA),
  new_plan = c('b',NA,'b',NA,'c',NA),
  new_price = c(NA,99,NA,82.5,NA,349),
  new_recurrence = c('monthly',NA,'annually',NA,'monthly',NA)
)
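Task:
Rank old_plan, old_price, and old_recurrence as first in each group based on the minimum occured_on time...
and new_plan, new_price, and new_recurrence based on the maximum occured_on time...
so that the resulting data frame holds the first old plan, price, and recurrence, together with the last new plan, price, and recurrence.
NAs should be removed/not considered. The resulting data frame should look like this: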
customer_id customer start_date old_plan old_price old_recurrence new_plan new_price new_recurrence
568468 paramount 2016-03-15 a 29 monthly c 349 monthly
Or, if you would rather see it in code:
result_df <- data.frame(
  customer_id = 568468,
  customer = 'paramount',
  start_date = "2016-03-15",
  old_plan = 'a',
  old_price = 29,
  old_recurrence = 'monthly',
  new_plan = 'c',
  new_price = 349,
  new_recurrence = 'monthly'
)
I feel like I'm getting close with these functions...
df$old_plan_rank <- rank(df$old_plan, na.last = "keep", ties.method = "min")
df$new_recurrence_rank <- rank(df$new_recurrence, na.last = "keep", ties.method = "max")
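Except that this ranks the values by their own alphabetical/numerical order, not by the order in which they actually occurred according to the occured_on column. I don't know how to specify the column to rank by.
Help?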
A solution using dplyr.
library(dplyr)
df2 <- df %>%
  arrange(customer_id, start_date, occured_on) %>%
  group_by(customer_id, customer, start_date) %>%
  summarise(old_plan = first(old_plan[!is.na(old_plan)]),
            old_price = first(old_price[!is.na(old_price)]),
            old_recurrence = first(old_recurrence[!is.na(old_recurrence)]),
            new_plan = last(new_plan[!is.na(new_plan)]),
            new_price = last(new_price[!is.na(new_price)]),
            new_recurrence = last(new_recurrence[!is.na(new_recurrence)])) %>%
  ungroup() %>%
  as.data.frame()
df2
# customer_id customer start_date old_plan old_price old_recurrence new_plan new_price new_recurrence
# 1 568468 paramount 2016-03-15 a 29 monthly c 349 monthly
Explanation
arrange(customer_id, start_date, occured_on) sorts the rows: first by customer_id, then by start_date, and finally by occured_on.
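To connect this back to the ranking in the question: once the rows are sorted by time within a group, a row's position is its rank by occured_on. Here is a minimal sketch of that idea; the events data frame, the pos and time_rank column names, and the UTC time zone are made up for illustration and are not from the question.

```r
library(dplyr)

# Toy data (hypothetical): two customers, events deliberately out of order
events <- data.frame(
  customer_id = c(1, 1, 1, 2, 2),
  occured_on = as.POSIXct(c("2017-08-31 05:16:00", "2017-08-08 20:05:00",
                            "2017-08-11 21:13:00", "2017-09-02 10:00:00",
                            "2017-09-01 09:00:00"), tz = "UTC"),
  plan = c("c", "a", "b", "y", "x"),
  stringsAsFactors = FALSE
)

ranked <- events %>%
  arrange(customer_id, occured_on) %>%       # chronological order per customer
  group_by(customer_id) %>%
  mutate(pos = row_number(),                 # position after sorting
         time_rank = min_rank(occured_on)) %>%  # rank taken from the time column
  ungroup()

ranked
```

With no tied timestamps, pos and time_rank agree, which is why the solution above can rely on first() and last() after arrange() instead of calling rank() on the plan columns themselves.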
group_by(customer_id, customer, start_date) means that the operations that follow are carried out within each group defined by customer_id, customer, and start_date.
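summarise produces a single summary value per group for each variable.
Taking old_plan as an example, I use old_plan[!is.na(old_plan)] to extract the non-NA values of that column. first and last then pick the first or last element of those values, which correspond to the earliest and latest times because the rows have already been sorted by occured_on.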
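If you would rather not depend on the rows being sorted beforehand, first() and last() also accept an order_by argument. The following is only a sketch on toy vectors (plan, when, and keep are names I made up, and the UTC time zone is an assumption), not a replacement for the code above.

```r
library(dplyr)

# Toy vectors (hypothetical): values and the time each one was recorded,
# deliberately not in chronological order
plan <- c("b", NA, "a", NA, "c")
when <- as.POSIXct(c("2017-08-11 21:13:00", "2017-08-11 21:30:00",
                     "2017-08-08 20:05:00", "2017-08-08 20:30:00",
                     "2017-08-31 05:16:00"), tz = "UTC")

keep <- !is.na(plan)  # positions that hold a real value

# order_by sorts the kept values by time before one is picked
earliest <- first(plan[keep], order_by = when[keep])
latest   <- last(plan[keep], order_by = when[keep])

earliest
latest
```

Note that the same filter has to be applied to both the values and the ordering vector so that the two stay aligned.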
ungroup() removes the grouping. as.data.frame() is optional; it converts the tibble into a plain data.frame.
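Since the same expression is repeated for six columns, the summary can probably be shortened with across(), assuming dplyr 1.0.0 or later. This is a sketch rather than a tested drop-in: first_non_na and last_non_na are hypothetical helpers defined here (they are not dplyr functions), df3 is just a name for the result, and the explicit UTC time zone is an assumption to keep the example reproducible.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  customer_id = rep(568468, 6),
  customer = rep("paramount", 6),
  start_date = as.Date(rep("2016-03-15", 6)),
  occured_on = as.POSIXct(c("2017-08-08 20:05:00", "2017-08-08 20:30:00",
                            "2017-08-11 21:13:00", "2017-08-11 21:30:00",
                            "2017-08-31 05:16:00", "2017-08-31 05:30:00"),
                          tz = "UTC"),
  old_plan = c("a", NA, "b", NA, "b", NA),
  old_price = c(NA, 29, NA, 99, NA, 82.5),
  old_recurrence = c("monthly", NA, "monthly", NA, "annually", NA),
  new_plan = c("b", NA, "b", NA, "c", NA),
  new_price = c(NA, 99, NA, 82.5, NA, 349),
  new_recurrence = c("monthly", NA, "annually", NA, "monthly", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helpers: first/last element after dropping NAs
first_non_na <- function(x) first(x[!is.na(x)])
last_non_na  <- function(x) last(x[!is.na(x)])

df3 <- df %>%
  arrange(customer_id, start_date, occured_on) %>%
  group_by(customer_id, customer, start_date) %>%
  summarise(
    across(starts_with("old_"), first_non_na),  # earliest non-NA old_* values
    across(starts_with("new_"), last_non_na),   # latest non-NA new_* values
    .groups = "drop"                            # same effect as ungroup()
  ) %>%
  as.data.frame()

df3
```

The column selection relies on the old_/new_ naming prefixes, so it would need adjusting if the columns were named differently.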