Data table has a vector as an entry: how to find out in which column, and then take only the second entry of the vector as a single number
I have a data.table tmp, which looks like this (just a short example):
dput(tmp)
structure(list(`2020-03-29-00` = list(42.51, 0, 0, 0, 12.32),
`2020-03-29-01` = list(46.8, 0, 0, 0, 10.03), `2020-03-29-03` = list(
c(46.8, 41.87), c(0, 0), c(0, 0), c(0, 0), c(10.03, 10.04
)), `2020-03-29-04` = list(45.63, 0, 0, 0, 9.24), `2020-03-29-05` = list(
40.86, 0, 0, 0, 9.06), `2020-03-29-06` = list(45.85,
0, 0, 0, 9.19), `2020-03-29-07` = list(43.68, 0, 0, 0,
10.39), `2020-03-29-08` = list(47.14, 0, 0, 0, 9.99),
`2020-03-29-09` = list(49.06, 0, 0, 0, 11.24)), row.names = c(NA,
-5L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000015baf701ef0>)
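Here we can see that the third column ("2020-03-29-03") has vector entries. What I want is to take the second entry of each such vector as a single numeric entry. The vector column (here the third one) is not always at the same column index. So first we need to find out where an entry is a vector, and then take only the second entry of that vector.
In the end, my data.table should look like this: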
structure(list(`2020-03-29-00` = list(42.51, 0, 0, 0, 12.32),
`2020-03-29-01` = list(46.8, 0, 0, 0, 10.03), `2020-03-29-03` = list(
c(41.87), 0, 0, 0, c(10.04)),
`2020-03-29-04` = list(45.63, 0, 0, 0, 9.24), `2020-03-29-05` = list(
40.86, 0, 0, 0, 9.06), `2020-03-29-06` = list(45.85,
0, 0, 0, 9.19), `2020-03-29-07` = list(43.68, 0, 0, 0,
10.39), `2020-03-29-08` = list(47.14, 0, 0, 0, 9.99),
`2020-03-29-09` = list(49.06, 0, 0, 0, 11.24)), row.names = c(NA,
-5L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000015baf701ef0>)
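This solution should be robust enough to handle your problem.
It automatically checks which columns need to be cleaned. If you want to specify the columns yourself, just set cols_contain_vec to a vector of column indices.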
# Find the columns that contain vectors:
# which columns have a maximum entry length above 1?
cols_contain_vec <- which(apply(tmp, MARGIN = 2, function(x) max(lengths(x))) > 1)
tmp[, cols_contain_vec] <- apply(
  tmp[, cols_contain_vec, with = FALSE],
  # separate function call for every cell: rows (1) and columns (2)
  MARGIN = c(1, 2),
  function(x) {
    # the cells are stored as lists, so unlist them first
    relevant_vec <- unlist(x)
    if (length(relevant_vec) > 1) {
      # more than one element: return the second
      return(relevant_vec[[2]])
    } else {
      # only one element: return it
      return(relevant_vec[[1]])
    }
  }
)
The result looks like this:
> tmp
2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06
1: 42.51 46.8 41.87 45.63 40.86 45.85
2: 0 0 0.00 0 0 0
3: 0 0 0.00 0 0 0
4: 0 0 0.00 0 0 0
5: 12.32 10.03 10.04 9.24 9.06 9.19
2020-03-29-07 2020-03-29-08 2020-03-29-09
1: 43.68 47.14 49.06
2: 0 0 0
3: 0 0 0
4: 0 0 0
5: 10.39 9.99 11.24
Hope this helps.
Try apply in a loop over the columns:
for (col in colnames(tmp)) {
  tmp[, col] <- apply(tmp[, ..col], 1, function(x) {
    # mean(unlist(x), na.rm = TRUE)  ## if you want the mean instead of the second entry of this vector
    # take the second entry if it exists, otherwise the first
    ifelse(is.na(unlist(x)[2]), unlist(x)[1], unlist(x)[2])
  })
}
Or apply over every cell at once:
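library(data.table)
library(magrittr)  # provides %>%
tmp <- apply(tmp, c(1, 2), function(x) {
  # mean(unlist(x), na.rm = TRUE)
  # take the second entry if it exists, otherwise the first
  ifelse(is.na(unlist(x)[2]), unlist(x)[1], unlist(x)[2])
}) %>% as.data.table()  ## apply() returns a matrix, so convert back to a data.table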
A quick and dirty way:
as.data.table(lapply(dt, \(x){
  if(length(x) == sum(lengths(x)))
    x
  else
    sapply(x, \(y)y[[2]])
}))
An alternative that makes use of data.table's in-place modification:
for(i in names(dt)[sapply(dt, \(x)sum(lengths(x)) != length(x))]){
set(dt, j = i, value = sapply(dt[[i]], \(y)y[[2]]))
}
Note that I used the new lambda function syntax introduced in R 4.1.0. In earlier versions you have to use function(x) and function(y) instead of \(x) and \(y).
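For readers on an R version before 4.1.0, the in-place loop can be written with function(x) as described. The following is only a minimal sketch under stated assumptions: dt is a small stand-in table rather than the data from the question, and second_or_first is a hypothetical helper added so that cells with a single value do not raise an error.

```r
library(data.table)

# Small stand-in for the table in the question (assumed data, not the original).
dt <- data.table(
  a = list(42.51, 0, 12.32),
  b = list(c(46.8, 41.87), c(0, 0), c(10.03, 10.04))
)

# Hypothetical helper: second element if present, otherwise the first.
second_or_first <- function(y) if (length(y) >= 2L) y[[2L]] else y[[1L]]

# Names of the columns in which at least one cell holds more than one value.
vec_cols <- names(dt)[sapply(dt, function(x) any(lengths(x) > 1L))]

# Replace those columns by reference with plain numeric vectors.
for (i in vec_cols) {
  set(dt, j = i, value = sapply(dt[[i]], second_or_first))
}
```

After the loop, column b is an ordinary numeric column, while column a is left untouched as a list column because none of its cells holds more than one value.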
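If you inspect tmp with str(tmp) or lapply(tmp, class), you will notice that all columns are list columns, even those in which every entry holds only a single element.
This can also be made visible by setting the appropriate print option: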
library(data.table)
options(datatable.print.class = TRUE)
tmp
2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06 2020-03-29-07 2020-03-29-08 2020-03-29-09
<list> <list> <list> <list> <list> <list> <list> <list> <list>
1: 42.51 46.8 46.80,41.87 45.63 40.86 45.85 43.68 47.14 49.06
2: 0 0 0,0 0 0 0 0 0 0
3: 0 0 0,0 0 0 0 0 0 0
4: 0 0 0,0 0 0 0 0 0 0
5: 12.32 10.03 10.03,10.04 9.24 9.06 9.19 10.39 9.99 11.24
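So, if all list columns need to be coerced to numeric, we can pick the last value in each vector (using the last() function), which happens to be the second vector entry in column 3: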
tmp[, lapply(.SD, \(x) sapply(x, last)), .SDcols = is.list]
2020-03-29-00 2020-03-29-01 2020-03-29-03 2020-03-29-04 2020-03-29-05 2020-03-29-06 2020-03-29-07 2020-03-29-08 2020-03-29-09
<num> <num> <num> <num> <num> <num> <num> <num> <num>
1: 42.51 46.80 41.87 45.63 40.86 45.85 43.68 47.14 49.06
2: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
4: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5: 12.32 10.03 10.04 9.24 9.06 9.19 10.39 9.99 11.24
Now all columns are numeric.
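The last() approach matches the request only because the vectors here have exactly two elements. As a hedged illustration of the difference, the sketch below uses assumed stand-in data with vectors of length 3 and a hypothetical helper pick that takes the k-th element explicitly; neither is part of the original answer.

```r
library(data.table)

# Assumed stand-in data: column b holds vectors of length 3.
dt <- data.table(
  a = list(1, 2),
  b = list(c(10, 20, 30), c(40, 50, 60))
)

# Hypothetical helper: k-th element if present, otherwise the last one.
pick <- function(y, k = 2L) if (length(y) >= k) y[[k]] else y[[length(y)]]

# Last element of every cell versus second element of every cell.
res_last   <- dt[, lapply(.SD, function(x) sapply(x, data.table::last))]
res_second <- dt[, lapply(.SD, function(x) sapply(x, pick))]
```

Here res_last$b contains 30 and 60, while res_second$b contains 20 and 50, so an explicit index is the safer choice if the vectors can be longer than two.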