How to produce the 6th worst value based on 2 criteria and insert the results into a separate column?

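Hoping someone can help.

I'm trying to add another column: 6th Worst. What I want is for it to hold the 6th worst y result according to the specified criterion: Date.

Here is an example of my df: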

Key     Date                     y   x1   x2   x3
   1    1/10/2018 12:00:00 AM    2   3    2    5
   1    1/11/2018 12:00:00 AM    3   5    7    2
   1    1/12/2018 12:00:00 AM    5   7    4    7 
   1    1/13/2018 12:00:00 AM    7   2    7    6
   2    1/10/2018 12:00:00 AM    2   6    3    8
   2    1/11/2018 12:00:00 AM    3   7    7    3
   2    1/12/2018 12:00:00 AM    3   2    3    4
   2    1/13/2018 12:00:00 AM    7   6    2    7
   3    1/10/2018 12:00:00 AM    2   3    2    5
   3    1/11/2018 12:00:00 AM    3   5    7    2
   3    1/12/2018 12:00:00 AM    5   7    4    7 
   3    1/13/2018 12:00:00 AM    7   2    7    6
   3    1/10/2018 12:00:00 AM    2   6    3    8
   3    1/11/2018 12:00:00 AM    3   7    7    3
   3    1/12/2018 12:00:00 AM    3   2    3    4
   3    1/13/2018 12:00:00 AM    7   6    2    7
   4    1/10/2018 12:00:00 AM    2   3    2    5
   4    1/11/2018 12:00:00 AM    3   5    7    2
   4    1/12/2018 12:00:00 AM    5   7    4    7 
   4    1/13/2018 12:00:00 AM    7   2    7    6
   4    1/10/2018 12:00:00 AM    2   6    3    8
   4    1/11/2018 12:00:00 AM    3   7    7    3
   5    1/12/2018 12:00:00 AM    3   2    3    4
   5    1/13/2018 12:00:00 AM    7   6    2    7
   5    1/10/2018 12:00:00 AM    2   3    2    5
   5    1/11/2018 12:00:00 AM    3   5    7    2
   5    1/12/2018 12:00:00 AM    5   7    4    7 
   5    1/13/2018 12:00:00 AM    7   2    7    6
   6    1/10/2018 12:00:00 AM    2   6    3    8
   6    1/11/2018 12:00:00 AM    3   7    7    3
   6    1/12/2018 12:00:00 AM    3   2    3    4
   6    1/13/2018 12:00:00 AM    7   6    2    7

So for 1/10/2018 the value would be 3. The dataset would therefore look like this:

 Key        Date                     y   x1   x2   x3 6th worst   
       1    1/10/2018 12:00:00 AM    2   3    2    5  3
       1    1/11/2018 12:00:00 AM    3   5    7    2  ... (would have values)
       1    1/12/2018 12:00:00 AM    5   7    4    7  ... (would have values)
       1    1/13/2018 12:00:00 AM    7   2    7    6  ... (would have values)
       2    1/10/2018 12:00:00 AM    2   6    3    8  3
       2    1/11/2018 12:00:00 AM    3   7    7    3  etc.
       2    1/12/2018 12:00:00 AM    3   2    3    4
       2    1/13/2018 12:00:00 AM    7   6    2    7
       3    1/10/2018 12:00:00 AM    2   3    2    5
       3    1/11/2018 12:00:00 AM    3   5    7    2
       3    1/12/2018 12:00:00 AM    5   7    4    7 
       3    1/13/2018 12:00:00 AM    7   2    7    6
       3    1/10/2018 12:00:00 AM    2   6    3    8
       3    1/11/2018 12:00:00 AM    3   7    7    3
       3    1/12/2018 12:00:00 AM    3   2    3    4
       3    1/13/2018 12:00:00 AM    7   6    2    7
       4    1/10/2018 12:00:00 AM    2   3    2    5
       4    1/11/2018 12:00:00 AM    3   5    7    2
       4    1/12/2018 12:00:00 AM    5   7    4    7 
       4    1/13/2018 12:00:00 AM    7   2    7    6
       4    1/10/2018 12:00:00 AM    2   6    3    8
       4    1/11/2018 12:00:00 AM    3   7    7    3
       5    1/12/2018 12:00:00 AM    3   2    3    4
       5    1/13/2018 12:00:00 AM    7   6    2    7
       5    1/10/2018 12:00:00 AM    2   3    2    5
       5    1/11/2018 12:00:00 AM    3   5    7    2
       5    1/12/2018 12:00:00 AM    5   7    4    7 
       5    1/13/2018 12:00:00 AM    7   2    7    6
       6    1/10/2018 12:00:00 AM    2   6    3    8
       6    1/11/2018 12:00:00 AM    3   7    7    3
       6    1/12/2018 12:00:00 AM    3   2    3    4
       6    1/13/2018 12:00:00 AM    7   6    2    7

This is what I have so far:

# Get the 6th worst value in the dataset

n=length(df$y)

df$`6th Worst`= df$`6th Worst`= "-"

df[1,3] = round(-sort(subset(df,c(unique(Date), "y")), partial=n-5)[n-5], digits = 2)

I get the following error:

    Error in subset.data.frame(reg_predict, unique(reg_predict2$Date)) : 
  'subset' must be logical

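Edit: This question differs from the one it was flagged as a duplicate of in several ways. In particular, I need a conditional 6th worst value, not just the worst/best value.

An option using the data.table package: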

library(data.table)

## Generate data
set.seed(1)
RowCount <- 100
DT <- data.table(Date = Sys.Date() + sample.int(3,RowCount,TRUE),
                 y = sample.int(100,RowCount,TRUE))

## Sort by y
setkey(DT,y)

## Too much to unpack here in inline comments, will expand further down
SixthWorst_DT <- DT[DT[,.I[6],by = .(Date)]$V1,.(Sixth_Worst = y), keyby = .(Date)]

print(SixthWorst_DT)

#    Date       Sixth_Worst
# 1: 2018-06-27          42
# 2: 2018-06-28          11
# 3: 2018-06-29          22

## Set DT Key to be date for update-join
setkey(DT,Date)
## Temporarily join `SixthWorst_DT` to `DT` (without making a full copy)
## and then create a column in `DT` based on the column `Sixth_Worst` in `SixthWorst_DT`
DT[SixthWorst_DT, Sixth_Worst := i.Sixth_Worst]

## Results
head(DT)

#    Date        y Sixth_Worst
# 1: 2018-06-27 18          42
# 2: 2018-06-27 18          42
# 3: 2018-06-27 19          42
# 4: 2018-06-27 19          42
# 5: 2018-06-27 39          42
# 6: 2018-06-27 42          42

The real meat of the operation is one line:

SixthWorst_DT <- DT[DT[,.I[6],by = .(Date)]$V1,.(Sixth_Worst = y), keyby = .(Date)]

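  • DT[,.I[6],by = .(Date)] uses the special symbol .I to extract the row number of the 6th row for each Date (because DT is sorted by y, this is the row holding the 6th-smallest y)
  • Appending $V1 extracts those row numbers as a vector
  • This vector is then used to subset DT
  • The subset is then keyed (and implicitly sorted) and grouped by Date to create a summary table with a new column, Sixth_Worst, based on y

To really understand what is happening, I suggest running the following statements one at a time.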

  • DT[,.I[6],by = .(Date)]
  • DT[,.I[6],by = .(Date)]$V1
  • DT[DT[,.I[6],by = .(Date)]$V1]
  • DT[DT[,.I[6],by = .(Date)]$V1,.(Sixth_Worst = y), keyby = .(Date)]
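
As a rough illustration of the steps listed above, here is a minimal sketch on a small deterministic table, so the intermediate results are easy to check by hand. The table `toy`, its values, and the column `Sixth_Worst_Direct` are made up for this example and are not part of the answer's code. The last statement shows a one-step alternative (grouped assignment with `sort(y)[6]`), which is a different technique from the `.I` lookup.

```r
library(data.table)

# Hypothetical toy table: two dates, seven rows each, y deliberately unsorted
toy <- data.table(
  Date = rep(as.Date(c("2018-01-10", "2018-01-11")), each = 7),
  y    = c(70, 10, 60, 20, 50, 30, 40,
            5, 35, 15, 25, 45, 65, 55)
)

# Physically sort toy by y (ascending), as setkey(DT, y) does in the answer
setkey(toy, y)

# Row number (within the sorted table) of the 6th row of each Date group
row_ids <- toy[, .I[6], by = .(Date)]$V1

# Subset those rows and build the per-Date summary table
sixth <- toy[row_ids, .(Sixth_Worst = y), keyby = .(Date)]
print(sixth)  # 60 for 2018-01-10, 55 for 2018-01-11

# Alternative (not the .I approach): assign the value by group in one step
toy[, Sixth_Worst_Direct := sort(y)[6], by = .(Date)]
```

Both routes assume that "worst" means smallest. A Date group with fewer than six rows gets NA in either case.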

An option using dplyr with sort could be:

Note: The Date column could be converted to POSIXct format before grouping, but I did not notice any advantage in doing so.

library(dplyr)

df %>% group_by(Date) %>% 
  mutate(Worst6th = sort(y)[6])

# A tibble: 32 x 7
# Groups: Date [4]
    Key Date                      y    x1    x2    x3 Worst6th
  <int> <chr>                 <int> <int> <int> <int>    <int>
1     1 1/10/2018 12:00:00 AM     2     3     2     5        2
2     1 1/11/2018 12:00:00 AM     3     5     7     2        3
3     1 1/12/2018 12:00:00 AM     5     7     4     7        5
4     1 1/13/2018 12:00:00 AM     7     2     7     6        7
5     2 1/10/2018 12:00:00 AM     2     6     3     8        2
6     2 1/11/2018 12:00:00 AM     3     7     7     3        3
7     2 1/12/2018 12:00:00 AM     3     2     3     4        5
8     2 1/13/2018 12:00:00 AM     7     6     2     7        7
9     3 1/10/2018 12:00:00 AM     2     3     2     5        2
10     3 1/11/2018 12:00:00 AM     3     5     7     2        3
# ... with 22 more rows      
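
For readers who do want the POSIXct conversion mentioned in the note, a minimal base R sketch might look like the following. The format string is inferred from the sample data, and the `tz = "UTC"` choice and the variable names are assumptions. The `%p` (AM/PM) specifier is locale-dependent, so this assumes an English or C locale.

```r
# Two sample strings in the same layout as the Date column
dates <- c("1/10/2018 12:00:00 AM", "1/13/2018 12:00:00 AM")

# %I is the 12-hour clock hour and %p the AM/PM marker (locale-dependent)
parsed <- as.POSIXct(dates, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Print the parsed values in an unambiguous layout
print(format(parsed, "%Y-%m-%d %H:%M"))
```

Since every timestamp in the sample is midnight, grouping on the original character column and on the parsed column gives the same groups.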

Data:

df <- read.table(text="
Key     Date                     y   x1   x2   x3
1    '1/10/2018 12:00:00 AM'    2   3    2    5
1    '1/11/2018 12:00:00 AM'    3   5    7    2
1    '1/12/2018 12:00:00 AM'    5   7    4    7 
1    '1/13/2018 12:00:00 AM'    7   2    7    6
2    '1/10/2018 12:00:00 AM'    2   6    3    8
2    '1/11/2018 12:00:00 AM'    3   7    7    3
2    '1/12/2018 12:00:00 AM'    3   2    3    4
2    '1/13/2018 12:00:00 AM'    7   6    2    7
3    '1/10/2018 12:00:00 AM'    2   3    2    5
3    '1/11/2018 12:00:00 AM'    3   5    7    2
3    '1/12/2018 12:00:00 AM'    5   7    4    7 
3    '1/13/2018 12:00:00 AM'    7   2    7    6
3    '1/10/2018 12:00:00 AM'    2   6    3    8
3    '1/11/2018 12:00:00 AM'    3   7    7    3
3    '1/12/2018 12:00:00 AM'    3   2    3    4
3    '1/13/2018 12:00:00 AM'    7   6    2    7
4    '1/10/2018 12:00:00 AM'    2   3    2    5
4    '1/11/2018 12:00:00 AM'    3   5    7    2
4    '1/12/2018 12:00:00 AM'    5   7    4    7 
4    '1/13/2018 12:00:00 AM'    7   2    7    6
4    '1/10/2018 12:00:00 AM'    2   6    3    8
4    '1/11/2018 12:00:00 AM'    3   7    7    3
5    '1/12/2018 12:00:00 AM'    3   2    3    4
5    '1/13/2018 12:00:00 AM'    7   6    2    7
5    '1/10/2018 12:00:00 AM'    2   3    2    5
5    '1/11/2018 12:00:00 AM'    3   5    7    2
5    '1/12/2018 12:00:00 AM'    5   7    4    7 
5    '1/13/2018 12:00:00 AM'    7   2    7    6
6    '1/10/2018 12:00:00 AM'    2   6    3    8
6    '1/11/2018 12:00:00 AM'    3   7    7    3
6    '1/12/2018 12:00:00 AM'    3   2    3    4
6    '1/13/2018 12:00:00 AM'    7   6    2    7",
header = TRUE, stringsAsFactors = FALSE)
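
If no extra packages are available, the same grouped calculation can be sketched in base R with `ave()`. The data frame `mini` and the column `Worst6thDesc` below are hypothetical and only serve to show two edge cases: a group with fewer than six rows, and reading "worst" as largest rather than smallest.

```r
# Hypothetical miniature of df: one date with seven rows, one with only three
mini <- data.frame(
  Date = c(rep("1/10/2018 12:00:00 AM", 7), rep("1/11/2018 12:00:00 AM", 3)),
  y    = c(9, 1, 8, 2, 7, 3, 6, 4, 5, 6),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each Date group and recycles the single result
# across all rows of that group
mini$Worst6th <- ave(mini$y, mini$Date, FUN = function(v) sort(v)[6])

# If "worst" means the largest values instead, sort in decreasing order
mini$Worst6thDesc <- ave(mini$y, mini$Date,
                         FUN = function(v) sort(v, decreasing = TRUE)[6])

print(mini)
```

The first date gets 8 (ascending) and 2 (descending). The second date has only three rows, so indexing position 6 yields NA.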