Data table has vector as an entry - how to find out in which column and then only take the second entry of vector as a single integer

I have a data.table tmp that looks like this (just a short example):

dput(tmp)
structure(list(`2020-03-29-00` = list(42.51, 0, 0, 0, 12.32), 
    `2020-03-29-01` = list(46.8, 0, 0, 0, 10.03), `2020-03-29-03` = list(
        c(46.8, 41.87), c(0, 0), c(0, 0), c(0, 0), c(10.03, 10.04
        )), `2020-03-29-04` = list(45.63, 0, 0, 0, 9.24), `2020-03-29-05` = list(
        40.86, 0, 0, 0, 9.06), `2020-03-29-06` = list(45.85, 
        0, 0, 0, 9.19), `2020-03-29-07` = list(43.68, 0, 0, 0, 
        10.39), `2020-03-29-08` = list(47.14, 0, 0, 0, 9.99), 
    `2020-03-29-09` = list(49.06, 0, 0, 0, 11.24)), row.names = c(NA, 
-5L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000015baf701ef0>)
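The .internal.selfref = <pointer: ...> part of the dput output is not valid R, so the structure cannot be pasted back as is. A minimal sketch for rebuilding the example, assuming it is fine to drop that attribute, build a plain data.frame first, and let setDT() recreate the pointer:

```r
library(data.table)

# Same list columns as in the dput output, minus the .internal.selfref pointer.
tmp <- structure(
  list(
    `2020-03-29-00` = list(42.51, 0, 0, 0, 12.32),
    `2020-03-29-01` = list(46.8, 0, 0, 0, 10.03),
    `2020-03-29-03` = list(c(46.8, 41.87), c(0, 0), c(0, 0), c(0, 0), c(10.03, 10.04)),
    `2020-03-29-04` = list(45.63, 0, 0, 0, 9.24),
    `2020-03-29-05` = list(40.86, 0, 0, 0, 9.06),
    `2020-03-29-06` = list(45.85, 0, 0, 0, 9.19),
    `2020-03-29-07` = list(43.68, 0, 0, 0, 10.39),
    `2020-03-29-08` = list(47.14, 0, 0, 0, 9.99),
    `2020-03-29-09` = list(49.06, 0, 0, 0, 11.24)
  ),
  row.names = c(NA, -5L),
  class = "data.frame"
)

# Convert to data.table by reference; this sets up the internal pointer again.
setDT(tmp)
```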

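Here we can see that the third column ("2020-03-29-03") has vector entries. What I want is to keep only the second entry of each vector as a single numeric value. The vector column (here the third one) is not always at the same column index. So first we need to find out where the entries are vectors, and then take only the second entry of each of those vectors.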

In the end my data.table should look like this:

structure(list(`2020-03-29-00` = list(42.51, 0, 0, 0, 12.32), 
    `2020-03-29-01` = list(46.8, 0, 0, 0, 10.03), `2020-03-29-03` = list(
        c(41.87), 0, 0, 0, c(10.04)), 
    `2020-03-29-04` = list(45.63, 0, 0, 0, 9.24), `2020-03-29-05` = list(
        40.86, 0, 0, 0, 9.06), `2020-03-29-06` = list(45.85, 
        0, 0, 0, 9.19), `2020-03-29-07` = list(43.68, 0, 0, 0, 
        10.39), `2020-03-29-08` = list(47.14, 0, 0, 0, 9.99), 
    `2020-03-29-09` = list(49.06, 0, 0, 0, 11.24)), row.names = c(NA, 
-5L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000015baf701ef0>)

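This solution should be robust enough to handle your problem. It automatically checks which columns need cleaning. If you want to target specific columns instead, just set cols_contain_vec to a vector of column indices.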

# Find the relevant cols which contain vectors
# which cols contain max lengths over 1?
cols_contain_vec <- which(apply(tmp, MARGIN = 2,function(x) max(lengths(x))) > 1)



tmp[,cols_contain_vec] <- apply(
  tmp[,cols_contain_vec, with = FALSE],
  # separate function call for every row (1) and column(2)
  MARGIN = c(1,2),
  function(x) { # Return second entry if possible, for some reason the vectors are saved
                # as lists, so we have to unlist them
    relevant_vec <- unlist(x)
    if(length(relevant_vec)>1){
      # if vector length over 1, return second element
      return(relevant_vec[[2]])
    } else {
      # if vector length is below 2 then return the first value
      return(relevant_vec[[1]])
    }
  }
)
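As a side note, the detection step can probably also be done column by column, without apply() first converting the table to a matrix. This is only a sketch on a toy table; the names toy, has_vec, a and b are made up for illustration and are not part of the answer above:

```r
library(data.table)

# Toy table (hypothetical): column b holds length-2 vectors, column a does not.
toy <- structure(
  list(a = list(1, 2, 3), b = list(c(1, 9), c(2, 8), c(3, 7))),
  row.names = c(NA, -3L),
  class = "data.frame"
)
setDT(toy)

# TRUE for every column that has at least one entry longer than 1
has_vec <- vapply(toy, function(col) any(lengths(col) > 1L), logical(1))

# Named integer vector of column positions, usable like cols_contain_vec above
cols_contain_vec <- which(has_vec)
cols_contain_vec
```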

The result looks like this:

> tmp
   2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06
1:         42.51          46.8         41.87         45.63         40.86         45.85
2:             0             0          0.00             0             0             0
3:             0             0          0.00             0             0             0
4:             0             0          0.00             0             0             0
5:         12.32         10.03         10.04          9.24          9.06          9.19
   2020-03-29-07 2020-03-29-08 2020-03-29-09
1:         43.68         47.14         49.06
2:             0             0             0
3:             0             0             0
4:             0             0             0
5:         10.39          9.99         11.24

Hope this helps.

Try apply in a loop over the columns:

for (col in colnames(tmp)) {
  tmp[,col] <- apply(tmp[,..col], 1, function(x) {
    # mean(unlist(x), na.rm = TRUE) ## if you want the mean instead of the second entry of this vector
    ifelse(is.na(unlist(x)[2]), unlist(x)[1], unlist(x)[2])
    }  
  )  
}

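Or use a single apply over both margins:

library(data.table)
library(magrittr)  # provides %>%

tmp <- apply(tmp, c(1,2), function(x) {
  # mean(unlist(x), na.rm = TRUE)
  # take the second entry if there is one, otherwise the first
  ifelse(is.na(unlist(x)[2]), unlist(x)[1], unlist(x)[2])
  } 
) %>% as.data.table() ## convert to data.table from matrix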

A quick and dirty way:

as.data.table(lapply(dt, \(x){
  if(length(x) == sum(lengths(x)))
    x
  else
    sapply(x, \(y)y[[2]])
}))
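Note that y[[2]] assumes every entry of a vector column has at least two elements. If a column could mix single values and vectors, something like the following sketch might be safer. The helper second_or_first and the toy data are hypothetical, and falling back to the first element is an assumption about what is wanted in that case:

```r
library(data.table)

# Toy table (hypothetical): column b mixes length-2 vectors and a single value.
dt <- structure(
  list(a = list(1, 2, 3), b = list(c(1, 9), 2, c(3, 7))),
  row.names = c(NA, -3L),
  class = "data.frame"
)
setDT(dt)

# Second element if present, otherwise the only element.
second_or_first <- function(y) if (length(y) >= 2L) y[[2L]] else y[[1L]]

res <- as.data.table(lapply(dt, function(x) {
  if (all(lengths(x) == 1L))
    x                                         # leave scalar-only columns untouched
  else
    vapply(x, second_or_first, numeric(1))    # flatten to a numeric column
}))
res
```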

An alternative that makes use of data.table's modification by reference:

for(i in names(dt)[sapply(dt, \(x)sum(lengths(x)) != length(x))]){
  set(dt, j = i, value = sapply(dt[[i]], \(y)y[[2]]))
}

Note that I used the new lambda function syntax introduced in R 4.1.0. On earlier versions you have to write function(x) / function(y) instead of \(x) / \(y).
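For reference, the in-place loop written with the older function() syntax would look roughly like this. The toy table is hypothetical and only there to make the snippet runnable on its own:

```r
library(data.table)

# Toy table (hypothetical): column b holds length-2 vectors.
dt <- structure(
  list(a = list(1, 2, 3), b = list(c(1, 9), c(2, 8), c(3, 7))),
  row.names = c(NA, -3L),
  class = "data.frame"
)
setDT(dt)

# Columns whose total number of elements exceeds the number of rows
vec_cols <- names(dt)[sapply(dt, function(x) sum(lengths(x)) != length(x))]

for (i in vec_cols) {
  # Replace the list column by the second element of each entry, by reference
  set(dt, j = i, value = sapply(dt[[i]], function(y) y[[2]]))
}
dt
```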

If you inspect tmp with str(tmp) or lapply(tmp, class), you will notice that all columns are list columns, even those whose entries contain only a single element.
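A small sketch of that check, on a hypothetical two-column toy table rather than the full tmp:

```r
library(data.table)

# Toy table (hypothetical): both columns are list columns.
toy <- structure(
  list(a = list(1, 2, 3), b = list(c(1, 9), c(2, 8), c(3, 7))),
  row.names = c(NA, -3L),
  class = "data.frame"
)
setDT(toy)

# Class of every column: both report "list"
col_classes <- sapply(toy, class)
col_classes

# Number of elements per entry: all 1 in column a, all 2 in column b
lengths(toy$a)
lengths(toy$b)
```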

This can also be made visible by setting the appropriate print option:

library(data.table)
options(datatable.print.class = TRUE)
tmp
   2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06 2020-03-29-07 2020-03-29-08 2020-03-29-09
          <list>        <list>        <list>        <list>        <list>        <list>        <list>        <list>        <list>
1:         42.51          46.8   46.80,41.87         45.63         40.86         45.85         43.68         47.14         49.06
2:             0             0           0,0             0             0             0             0             0             0
3:             0             0           0,0             0             0             0             0             0             0
4:             0             0           0,0             0             0             0             0             0             0
5:         12.32         10.03   10.03,10.04          9.24          9.06          9.19         10.39          9.99         11.24

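So, if all list columns need to be coerced to numeric, we can pick the last value in each vector (using the last() function), which happens to be the second vector entry in column 3: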

tmp[, lapply(.SD, \(x) sapply(x, last)), .SDcols = is.list]
   2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06 2020-03-29-07 2020-03-29-08 2020-03-29-09
           <num>         <num>         <num>         <num>         <num>         <num>         <num>         <num>         <num>
1:         42.51         46.80         41.87         45.63         40.86         45.85         43.68         47.14         49.06
2:          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
3:          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
4:          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
5:         12.32         10.03         10.04          9.24          9.06          9.19         10.39          9.99         11.24

Now all columns are numeric.
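The call above returns a new table and leaves tmp itself unchanged. If the original table should be updated by reference instead, a sketch along these lines might work (the toy data and the name cols are hypothetical; last() still only equals the second entry when the vectors have exactly two elements):

```r
library(data.table)

# Toy table (hypothetical): column b holds length-2 vectors.
tmp <- structure(
  list(a = list(1, 2, 3), b = list(c(1, 9), c(2, 8), c(3, 7))),
  row.names = c(NA, -3L),
  class = "data.frame"
)
setDT(tmp)

# Names of all list columns
cols <- names(tmp)[sapply(tmp, is.list)]

# Overwrite each list column with the last element of each entry, in place
tmp[, (cols) := lapply(.SD, function(x) sapply(x, last)), .SDcols = cols]

sapply(tmp, class)
```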