R table manipulation
I have a data.frame as follows:
PRODUCT=c(rep("A",4),rep("B",2))
ww1=c(201438,201440,201444,201446,201411,201412)
ww2=ww1-6
DIFF=rep(6,6)
DEMAND=rep(100,6)
df=data.frame(PRODUCT,ww1,ww2,DIFF,DEMAND)
df<- df[with(df,order(PRODUCT, ww1)),]
df
PRODUCT ww1 ww2 DIFF DEMAND
1 A 201438 201432 6 100
2 A 201440 201434 6 100
3 A 201444 201438 6 100
4 A 201446 201440 6 100
5 B 201411 201405 6 100
6 B 201412 201406 6 100
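I want to add rows to it based on the following condition.
For any row in the data, if the product on the previous row is the same as the product on the current row, but ww1 on the previous row is not equal to ww1 on the current row minus 1 (that is, the two ww1 values are not exactly 1 apart), then add a new row.
For the newly added row: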
Product will be the same as product on earlier row.
ww1 will be ww1 on the earlier row + 1
ww2 will be ww2 on the earlier row + 1
ww_diff will be 6
demand will be 0
The final output I need looks like this:
PRODUCT ww1 ww2 WW_DIFF DEMAND
A 201438 201432 6 100
A 201439 201433 6 0
A 201440 201434 6 100
A 201441 201435 6 0
A 201442 201436 6 100
A 201443 201437 6 0
A 201444 201438 6 100
A 201445 201439 6 0
A 201446 201440 6 100
B 201411 201405 6 100
B 201412 201406 6 100
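Right now I am thinking of writing a macro in Excel, but it would be very slow, so I would prefer an R solution.
Update 1===============================
How can I add a sequence column? For each product, the column should be 1 for the earliest ww1 entry and then increase by 1.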
PRODUCT ww1 ww2 WW_DIFF DEMAND seq
A 201438 201432 6 100 1
A 201439 201433 6 0 2
A 201440 201434 6 100 3
A 201441 201435 6 0 4
A 201442 201436 6 100 5
A 201443 201437 6 0 6
A 201444 201438 6 100 7
A 201445 201439 6 0 8
A 201446 201440 6 100 9
B 201411 201405 6 100 1
B 201412 201406 6 100 2
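Update 2===========================================================
I am posting the question again (I un-accepted the previously accepted answer from alistaire, because it does not work on my original data; it only works on the small sample data) :(
In the solution from user alistaire below, df3 <- right_join(df, data.frame(ww1=ww1big)) seems to be what causes the problem.
In the final solution I would also like the columns to be referenced by name, so that I am not forced to keep the columns in a particular order.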
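If you follow the instructions literally, there will still be gaps in ww1 wherever more than one consecutive value is missing. Nevertheless, you can follow the stated logic exactly like this: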
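library(dplyr)
df2 <- rbind(df,
  unique(do.call(rbind, lapply(seq_len(nrow(df)), function(x){
    # earlier rows of the same product whose ww1 is not the current ww1 - 1
    toAdd <- filter(df[seq_len(x - 1), ], PRODUCT == df[x, 'PRODUCT'], ww1 != df[x, 'ww1'] - 1)
    if(nrow(toAdd) > 0){
      toAdd$ww1 <- toAdd$ww1 + 1
      toAdd$ww2 <- toAdd$ww2 + 1
      toAdd$DEMAND <- 0
      toAdd
    }
  })))
)
# sort so that each added row follows the row it was derived from
df2 <- df2[order(df2$PRODUCT, df2$ww1), ]
row.names(df2) <- NULL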
which returns
> df2
PRODUCT ww1 ww2 DIFF DEMAND
1 A 201438 201432 6 100
2 A 201439 201433 6 0
3 A 201440 201434 6 100
4 A 201441 201435 6 0
5 A 201444 201438 6 100
6 A 201445 201439 6 0
7 A 201446 201440 6 100
8 B 201411 201405 6 100
9 B 201412 201406 6 100
On the other hand, if you want a row for every value of ww1 between each product's minimum and maximum, this will work:
require(dplyr)
df <- group_by(df, PRODUCT)
extremes <- summarise(df, maxw=max(ww1), minw=min(ww1))
# columns of `extremes` are PRODUCT, maxw, minw, so [[x, 3]] is the minimum and [[x, 2]] the maximum
ww1big <- do.call(c, lapply(seq(nrow(extremes)), function(x){
  seq(extremes[[x, 3]], extremes[[x, 2]])
}))
# note: with no `by` argument this joins on the only shared column, ww1
df3 <- right_join(df, data.frame(ww1=ww1big))
nullindex <- seq_along(df3$PRODUCT)[is.na(df3$PRODUCT)]
# the `for` and `while` loops will be slow if the dataset is REALLY huge, but they're pretty simple
nullreplace <- nullindex
for(i in seq_along(nullreplace)){
  while(is.na(df3[nullreplace[i], 1])){
    nullreplace[i] <- nullreplace[i]-1
  }
}
df3[nullindex, c(1, 4)] <- df3[nullreplace, c(1, 4)]
df3[nullindex, 5] <- 0
df3[nullindex, 3] <- df3[nullreplace, 3] + (nullindex-nullreplace)
which returns:
> df3
Source: local data frame [11 x 5]
Groups: PRODUCT
PRODUCT ww1 ww2 DIFF DEMAND
1 A 201438 201432 6 100
2 A 201439 201433 6 0
3 A 201440 201434 6 100
4 A 201441 201435 6 0
5 A 201442 201436 6 0
6 A 201443 201437 6 0
7 A 201444 201438 6 100
8 A 201445 201439 6 0
9 A 201446 201440 6 100
10 B 201411 201405 6 100
11 B 201412 201406 6 100
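Both solutions use the dplyr package, though neither is especially elegant. Still, both should be fast apart from the for/while loop in the second option, and that loop is relatively simple. If necessary, it could be rewritten with the *apply functions, although that would make it less readable. Both handle additional products easily.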
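For what it's worth, here is a minimal base R sketch of the second approach that avoids the loop and refers to every column by name. It is not the code from this answer. It assumes that each filler row should get DIFF = 6, ww2 = ww1 - DIFF and DEMAND = 0, as described in the question. The names `grid`, `filled` and `added` are made up for the illustration. Because the merge is keyed on both PRODUCT and ww1, overlapping ww1 values in different products should not get mixed up.

```r
# Sample data from the question
df <- data.frame(
  PRODUCT = c(rep("A", 4), rep("B", 2)),
  ww1     = c(201438, 201440, 201444, 201446, 201411, 201412),
  stringsAsFactors = FALSE
)
df$ww2    <- df$ww1 - 6
df$DIFF   <- 6
df$DEMAND <- 100

# Every (PRODUCT, ww1) pair between each product's minimum and maximum ww1
grid <- do.call(rbind, lapply(split(df, df$PRODUCT), function(g) {
  data.frame(PRODUCT = g$PRODUCT[1],
             ww1     = as.numeric(seq(min(g$ww1), max(g$ww1))),
             stringsAsFactors = FALSE)
}))

# Left join on both key columns, referenced by name
filled <- merge(grid, df, by = c("PRODUCT", "ww1"), all.x = TRUE)

# Rows that exist only in the grid have NA in the non-key columns
added <- is.na(filled$DEMAND)
filled$DIFF[added]   <- 6                                        # assumption from the question
filled$ww2[added]    <- filled$ww1[added] - filled$DIFF[added]
filled$DEMAND[added] <- 0

filled <- filled[order(filled$PRODUCT, filled$ww1), ]
rownames(filled) <- NULL
filled
```

On the sample data this should give 11 rows, with DEMAND equal to 0 on the filler rows only.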
Edit 1=========================
This is actually very simple: because the data.frame is already grouped by PRODUCT through dplyr, all you need is
df3 <- mutate(df3, seq=seq_along(PRODUCT))
which gives you
> df3
Source: local data frame [11 x 6]
Groups: PRODUCT
PRODUCT ww1 ww2 DIFF DEMAND seq
1 A 201438 201432 6 100 1
2 A 201439 201433 6 0 2
3 A 201440 201434 6 100 3
4 A 201441 201435 6 0 4
5 A 201442 201436 6 0 5
6 A 201443 201437 6 0 6
7 A 201444 201438 6 100 7
8 A 201445 201439 6 0 8
9 A 201446 201440 6 100 9
10 B 201411 201405 6 100 1
11 B 201412 201406 6 100 2
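If the data is no longer grouped (or you are not using dplyr at all), the same counter can be sketched in base R with `ave()`. This is only an illustration on a small made-up data frame, and it assumes the rows are already sorted by PRODUCT and ww1:

```r
# Small illustrative data frame, sorted by PRODUCT and ww1
df3 <- data.frame(
  PRODUCT = c(rep("A", 3), rep("B", 2)),
  ww1     = c(201438, 201439, 201440, 201411, 201412),
  stringsAsFactors = FALSE
)
df3 <- df3[order(df3$PRODUCT, df3$ww1), ]

# ave() applies seq_along() within each PRODUCT group,
# so the counter restarts at 1 for every product
df3$seq <- ave(seq_along(df3$PRODUCT), df3$PRODUCT, FUN = seq_along)
df3
```

Since the grouping column is passed explicitly, this does not depend on any grouping state carried by the data frame.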
# NEW SOLUTION
nrows = nrow(df)
newdf = df[1,]
myseq = 1
for(i in 2:nrows) {
  currentRow = df[i,]
  tmpRow = df[i-1,]
  # gaps are only filled between consecutive rows of the same product
  sameProduct = tmpRow$PRODUCT == currentRow$PRODUCT
  if(sameProduct && tmpRow$ww1 < (currentRow$ww1 - 1)) {
    # the missing ww1 values between the previous and the current row
    tmp = (tmpRow$ww1+1):(currentRow$ww1-1)
    tmp.length = length(tmp)
    tmp.last = ifelse(length(myseq)==0, 1, tail(myseq,1)+1)
    # counter values for the filler rows plus the current row
    myseq = c(myseq, tmp.last:(tmp.last + tmp.length))
    tmpdf = data.frame(PRODUCT=rep(tmpRow$PRODUCT, tmp.length),
                       ww1=tmp, ww2=tmp-6, DIFF=rep(6,tmp.length), DEMAND=rep(0,tmp.length))
    newdf = rbind(newdf,tmpdf,currentRow)
  } else {
    if(sameProduct && tmpRow$ww1==currentRow$ww1-1) {
      myseq = c(myseq, tail(myseq,1)+1)
    } else {
      # new product: restart the counter
      myseq = c(myseq,1)
    }
    newdf = rbind(newdf,currentRow)
  }
}
newdf = cbind(newdf, myseq)
row.names(newdf) = 1:nrow(newdf)
# OLD SOLUTION
nrows = nrow(df)
newdf = df[1,]
for(i in 2:nrows) {
  currentRow = df[i,]
  tmpRow = df[i-1,]
  # only fill gaps between consecutive rows of the same product
  if(tmpRow$PRODUCT == currentRow$PRODUCT && tmpRow$ww1 < currentRow$ww1) {
    while(tmpRow$ww1 + 1 != currentRow$ww1) {
      tmpRow$ww1 = tmpRow$ww1 + 1
      tmpRow$ww2 = tmpRow$ww2 + 1
      # diff doesn't change
      tmpRow$DEMAND = 0
      # rbind the filler row
      newdf=rbind(newdf,tmpRow)
    }
  }
  newdf=rbind(newdf,currentRow)
}
row.names(newdf) = 1:nrow(newdf)
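I recently had to work with large tables and have become a big fan of the data.table package (it is really fast and lets you add new variables by reference, without copying the table).
With it, the solution looks like this: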
library(data.table)
# convert to data.table
dtable = as.data.table(df)
# create the variables grouped by PRODUCT
dtransf <- dtable[, .(ww1 = seq(min(ww1), max(ww1), 1L),
                      ww2 = seq(min(ww2), max(ww2), 1L),
                      DIFF = 6L,
                      DEMAND = as.integer(seq(min(ww1), max(ww1), 1L) %in% unique(ww1)) * 100),
                  by = PRODUCT]
#add the incremental counter
dtransf[,seq := seq_len(.N), by = PRODUCT]
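This code is somewhat case-specific (especially the DEMAND calculation); in more complex cases you would probably need a join to bring in the correct demand values.
Also, keep in mind that the code will fail if the data set contains errors (for example, if the difference between ww1 and ww2 is not the same on every row).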
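A quick sanity check along these lines could be run before the expansion. This is only a sketch in base R on the sample data from the question. The names `diffs`, `rows_off` and `constant_diff` are made up for the illustration:

```r
# Sample data from the question
df <- data.frame(
  PRODUCT = c(rep("A", 4), rep("B", 2)),
  ww1     = c(201438, 201440, 201444, 201446, 201411, 201412)
)
df$ww2    <- df$ww1 - 6
df$DIFF   <- 6
df$DEMAND <- 100

# Difference between the two week columns on every row
diffs <- df$ww1 - df$ww2

# Rows where the difference disagrees with the DIFF column
rows_off <- which(diffs != df$DIFF)

# TRUE when the difference is the same on every row
constant_diff <- length(unique(diffs)) == 1

if (length(rows_off) > 0 || !constant_diff) {
  stop("ww1 - ww2 is not consistent; check rows: ",
       paste(rows_off, collapse = ", "))
}
```

If the check passes, the ww1 and ww2 sequences built per product have the same length, which is what the grouped expression above relies on.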
Here is a very similar data.table solution, which I think should be more efficient because it minimizes the per-group computation and uses a binary join instead.
library(data.table)
setkey(setDT(df), PRODUCT, ww1) ## Sorting by `PRODUCT` and `ww1`
indx <- setkey(df[, list(ww1 = seq.int(ww1[1L], ww1[.N], by = 1L)), by = PRODUCT]) ## running `seq.int` on `ww1` per group
res <- df[indx][is.na(ww2), `:=`(ww2 = ww1 - 6L, DIFF = 6L, DEMAND = 0L)] ## filling the missing values
res[, seq := seq_len(.N), by = PRODUCT] # Adding index
res
# PRODUCT ww1 ww2 DIFF DEMAND seq
# 1: A 201438 201432 6 100 1
# 2: A 201439 201433 6 0 2
# 3: A 201440 201434 6 100 3
# 4: A 201441 201435 6 0 4
# 5: A 201442 201436 6 0 5
# 6: A 201443 201437 6 0 6
# 7: A 201444 201438 6 100 7
# 8: A 201445 201439 6 0 8
# 9: A 201446 201440 6 100 9
# 10: B 201411 201405 6 100 1
# 11: B 201412 201406 6 100 2