Split a variable and insert NAs in between
I have a variable that looks like this:
Var
[1] 3, 4, 5 2, 4, 5 2, 4 1, 4, 5
I need to split it into a data frame that looks like this:
V1 V2 V3 V4 V5
NA NA 3 4 5
NA 2 NA 4 5
NA 2 NA 4 NA
1 NA NA 4 5
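Unfortunately, I couldn't find a post that solves my problem. Does anyone know how I can do this?
Thank you very much!
Edit: Based on your answers I found a solution and posted it below.
Edit2: I made my code more efficient using Ananda's solution.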
Using matrix indexing:
Var <- list(c(3,4,5),c(2,4,5),c(2,4),c(1,4,5))
unVar <- unlist(Var)
out <- matrix(NA, nrow=length(Var), ncol=max(unVar))
out[cbind(rep(seq_along(Var),sapply(Var,length)),unVar)] <- unVar
# In R >= 3.2.0, lengths(Var) can replace sapply(Var, length):
out[cbind(rep(seq_along(Var),lengths(Var)),unVar)] <- unVar
# [,1] [,2] [,3] [,4] [,5]
#[1,] NA NA 3 4 5
#[2,] NA 2 NA 4 5
#[3,] NA 2 NA 4 NA
#[4,] 1 NA NA 4 5
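Since the question asks for a data frame with columns V1 to V5, the matrix-indexing idea above can be wrapped in a small function and converted. This is only a sketch: `to_na_matrix` is a hypothetical helper name, and it assumes every value is a positive integer, because each value is used as its own column position.

```r
# Hypothetical helper: each value is written into the column with the same number.
# Assumes all values are positive integers (requires R >= 3.2.0 for lengths()).
to_na_matrix <- function(x) {
  vals <- unlist(x)                          # all values in one vector
  rows <- rep(seq_along(x), lengths(x))      # row number for every value
  out <- matrix(NA_real_, nrow = length(x), ncol = max(vals))
  out[cbind(rows, vals)] <- vals             # two-column matrix index: (row, column)
  out
}

Var <- list(c(3, 4, 5), c(2, 4, 5), c(2, 4), c(1, 4, 5))
DF <- as.data.frame(to_na_matrix(Var))       # columns are named V1, V2, ... by default
DF
```

The two-column index matrix addresses one cell per row of the index, so the whole assignment happens in a single vectorized step rather than in a loop.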
If Var is just a vector, I would do the following:
Var <- c(3, 4, 5, 2, 4, 5, 2, 4, 1, 4, 5)
RowIdx <- c(rep(1, 3), rep(2, 3), rep(3, 2), rep(4, 3))
DF <- matrix(NA, nrow = 4, ncol = 5)
for (idx in seq_along(Var)) {
  DF[RowIdx[idx], Var[idx]] <- Var[idx]
}
Of course, if you have more data, you will probably want a more automated way of generating the row index.
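One way the row index might be generated automatically is sketched below. Both variants rest on assumptions that are not stated in the question: the first assumes the group sizes are known (the vector `n` is made up for this example), the second assumes the values within each row are strictly increasing, so that a new row starts wherever the value does not increase.

```r
Var <- c(3, 4, 5, 2, 4, 5, 2, 4, 1, 4, 5)

# Variant 1: group sizes are known (assumed here).
n <- c(3, 3, 2, 3)
RowIdx <- rep(seq_along(n), times = n)

# Variant 2: no sizes known; assumes values increase strictly within a row,
# so a non-increase marks the start of the next row.
RowIdx2 <- cumsum(c(TRUE, diff(Var) <= 0))

# Fill the matrix in one step with a two-column (row, column) index.
DF <- matrix(NA_real_, nrow = max(RowIdx), ncol = max(Var))
DF[cbind(RowIdx, Var)] <- Var
DF
```

The second variant breaks down if a row can legitimately start with a value larger than the last value of the previous row, so it only suits data like the example.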
Var <- list(c(3, 4, 5), c(2, 4, 5), c(2, 4), c(1, 4, 5))
M <- matrix(NA, nrow=length(Var), ncol=max(sapply(Var,max)))
for (L in seq_along(Var)) {
  M[cbind(rep(L, length(Var[[L]])), Var[[L]])] <- Var[[L]]
}
M
[,1] [,2] [,3] [,4] [,5]
[1,] NA NA 3 4 5
[2,] NA 2 NA 4 5
[3,] NA 2 NA 4 NA
[4,] 1 NA NA 4 5
My personal vote goes to thelatemail's version, which is essentially isomorphic to this one.
Based on your replies, I managed to find a solution! My final solution looks like this:
# My variable was a factor, so I had to convert it to character and split it first.
df <- data.frame(Var)
Var <- lapply(strsplit(as.character(df$Var), ", "), as.numeric)
# Then I created a matrix based on thelatemail's and BondedDust's approach.
M <- matrix(NA, nrow = length(Var), ncol = max(sapply(Var, max)))
# Some rows contain a single -99, which marks the whole row as missing.
# The negative value caused problems, so I replaced it with NA first.
for (i in seq_along(Var)) {
  Var[[i]][Var[[i]] == -99] <- NA
}
# Final assignment as suggested by BondedDust.
for (L in seq_along(Var)) {
  M[cbind(rep(L, length(Var[[L]])), Var[[L]])] <- Var[[L]]
}
M
I'm not sure whether this is the fastest solution, but everything works now! Thank you very much for your quick and thorough answers!
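For comparison, the same steps can probably be written without explicit loops. The following is a sketch, not the OP's code: the sample factor is invented for illustration, and it assumes that -99 is the only missing-value code and that all other values are positive integers. Dropping the -99 entries, instead of replacing them with NA, leaves the affected row entirely NA.

```r
# Invented sample data: a factor whose third entry is the missing-value code -99.
Var <- factor(c("3, 4, 5", "2, 4, 5", "-99", "1, 4, 5"))

# Split on commas (with optional whitespace) and convert each piece to numeric.
vals <- lapply(strsplit(as.character(Var), ",\\s*"), as.numeric)

# Drop the -99 codes; a row that held only -99 becomes an empty vector.
vals <- lapply(vals, function(x) x[x != -99])

unVar <- unlist(vals)
M <- matrix(NA_real_, nrow = length(vals), ncol = max(unVar))

# lengths() needs R >= 3.2.0; empty rows contribute no index entries.
M[cbind(rep(seq_along(vals), lengths(vals)), unVar)] <- unVar
M
```

If every row were missing, `max(unVar)` would be called on an empty vector, so real data may need a guard for that case.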
Judging from the OP's answer, "var" is a character string, for example:
var <- c("3, 4, 5", "2, 4, 5", "2, 4", "1, 4, 5")
If that's the case, you can consider cSplit_e from my "splitstackshape" package:
library(splitstackshape)
cSplit_e(data.frame(var), "var", ",", mode = "value", drop = TRUE)
# var_1 var_2 var_3 var_4 var_5
# 1 NA NA 3 4 5
# 2 NA 2 NA 4 5
# 3 NA 2 NA 4 NA
# 4 1 NA NA 4 5
If it is a list, as the other answers assume, you can use the (non-exported) numMat function from "splitstackshape", which powers cSplit_e.
var <- list(c(3,4,5), c(2,4,5), c(2,4), c(1,4,5))
splitstackshape:::numMat(var, mode = "value")
# 1 2 3 4 5
# [1,] NA NA 3 4 5
# [2,] NA 2 NA 4 5
# [3,] NA 2 NA 4 NA
# [4,] 1 NA NA 4 5
Behind the scenes, numMat is very similar to the approach used in @thelatemail's answer.
If -99 stands for NA and you want to exclude those values, you can try:
var <- c("3, 4, 5", "2, -99, 4, 5", "2, 4", "1, 4, 5, -99")
splitstackshape:::numMat(
  lapply(strsplit(var, ","), function(x) as.numeric(x)[as.numeric(x) > 0]),
  mode = "value")
# 1 2 3 4 5
# [1,] NA NA 3 4 5
# [2,] NA 2 NA 4 5
# [3,] NA 2 NA 4 NA
# [4,] 1 NA NA 4 5