R table manipulation

I have a data.frame as follows:

PRODUCT=c(rep("A",4),rep("B",2))
ww1=c(201438,201440,201444,201446,201411,201412)
ww2=ww1-6
DIFF=rep(6,6)
DEMAND=rep(100,6)

df=data.frame(PRODUCT,ww1,ww2,DIFF,DEMAND)
df<- df[with(df,order(PRODUCT, ww1)),]

df

  PRODUCT    ww1    ww2 DIFF DEMAND
1       A 201438 201432    6    100
2       A 201440 201434    6    100
3       A 201444 201438    6    100
4       A 201446 201440    6    100
5       B 201411 201405    6    100
6       B 201412 201406    6    100

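I want to add rows to it based on the following condition.

For any row in the data, if the product on the previous row is the same as the product on the current row, but ww1 on the previous row differs from ww1 - 1 on the current row (that is, consecutive ww1 values within a product should differ by exactly 1), then add a new row.

For the newly added row: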

Product will be the same as product on earlier row.
ww1 will be ww1 on the earlier row + 1
ww2 will be ww2 on the earlier row + 1
ww_diff will be 6
demand will be 0
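
To see where this rule fires on the sample data, here is a minimal base R sketch that compares each row's ww1 with the previous row's ww1 within the same product. The names `prev_ww1` and `gap` are illustrative only, and the week numbers are treated as plain numbers (no year rollover handling), which is an assumption.

```r
PRODUCT <- c(rep("A", 4), rep("B", 2))
ww1 <- c(201438, 201440, 201444, 201446, 201411, 201412)
df <- data.frame(PRODUCT, ww1, ww2 = ww1 - 6, DIFF = 6, DEMAND = 100)
df <- df[order(df$PRODUCT, df$ww1), ]

# previous row's ww1 within the same product (NA on each product's first row)
prev_ww1 <- ave(df$ww1, df$PRODUCT, FUN = function(x) c(NA, head(x, -1)))

# number of weeks missing between the previous row and this one
gap <- df$ww1 - prev_ww1 - 1
gap[is.na(gap)] <- 0

cbind(df[c("PRODUCT", "ww1")], gap)
```

Any row with `gap` greater than 0 is one where rows have to be inserted before it, and `sum(gap)` is the total number of rows to add.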

The final output I need looks like this:

PRODUCT ww1 ww2 WW_DIFF DEMAND
A   201438  201432  6   100
A   201439  201433  6   0
A   201440  201434  6   100
A   201441  201435  6   0
A   201442  201436  6   0
A   201443  201437  6   0
A   201444  201438  6   100
A   201445  201439  6   0
A   201446  201440  6   100
B   201411  201405  6   100
B   201412  201406  6   100

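Right now I am thinking of writing a macro in Excel, but it would be very slow, so I would prefer an R solution.

Update 1===============================

How do I add a sequence column? It should be 1 for the earliest ww1 entry of each product, then increase by 1.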

PRODUCT ww1 ww2 WW_DIFF DEMAND  seq
A   201438  201432  6   100 1
A   201439  201433  6   0   2
A   201440  201434  6   100 3
A   201441  201435  6   0   4
A   201442  201436  6   0   5
A   201443  201437  6   0   6
A   201444  201438  6   100 7
A   201445  201439  6   0   8
A   201446  201440  6   100 9
B   201411  201405  6   100 1
B   201412  201406  6   100 2
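
One possible way to get such a per-product counter in base R is `ave()` with `seq_along`, once the rows are sorted by product and week. This is only a sketch on a small made-up data frame, not the full data from above.

```r
# small illustrative data frame, deliberately out of order
df <- data.frame(PRODUCT = c("A", "A", "A", "B", "B"),
                 ww1 = c(201440, 201438, 201439, 201412, 201411))

# sort so that the earliest ww1 of each product comes first
df <- df[order(df$PRODUCT, df$ww1), ]

# count rows within each product: 1 for the earliest week, then 2, 3, ...
df$seq <- ave(df$ww1, df$PRODUCT, FUN = seq_along)
df
```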

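Update 2===========================================================

I am posting the question again (I un-accepted alistaire's previously accepted answer because it does not work on my original data; it only works on the small sample data) :(

In user alistaire's solution below, df3 <- right_join(df, data.frame(ww1=ww1big)) seems to be what causes the problem.

In the final solution, I would also like columns to be referred to by name, so that I am not forced to keep the columns in a particular order.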

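Following the instructions as stated, gaps in ww1 will remain wherever more than one consecutive value is missing. However, you can follow the stated logic exactly like this: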

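library(dplyr)

df2 <- rbind(df,
         unique(do.call(rbind, lapply(seq_len(nrow(df)), function(x){
             # earlier rows of the same product whose ww1 is not the week just before row x
             toAdd <- filter(df[seq_len(x - 1),], PRODUCT == df[x, 'PRODUCT'], ww1 != df[x,'ww1']-1)
             if(nrow(toAdd) > 0){
                 # turn each of them into the row for the following week, with no demand
                 toAdd$ww1 <- toAdd$ww1+1
                 toAdd$ww2 <- toAdd$ww2+1
                 toAdd$DEMAND <- 0
                 toAdd
             }
         })))
)
# move the added rows into place so the result is ordered by product and week
df2 <- arrange(df2, PRODUCT, ww1)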

which returns

> df2

  PRODUCT    ww1    ww2 DIFF DEMAND
1       A 201438 201432    6    100
2       A 201439 201433    6      0
3       A 201440 201434    6    100
4       A 201441 201435    6      0
5       A 201444 201438    6    100
6       A 201445 201439    6      0
7       A 201446 201440    6    100
8       B 201411 201405    6    100
9       B 201412 201406    6    100

On the other hand, if you want a row for every value of ww1 between each product's minimum and maximum, this will work:

library(dplyr)

df <- group_by(df, PRODUCT)
extremes <- summarise(df, maxw=max(ww1), minw=min(ww1))
# every ww1 value between each product's minimum (column 3) and maximum (column 2)
ww1big <- do.call(c, lapply(seq_len(nrow(extremes)), function(x){
    seq(extremes[[x, 3]], extremes[[x, 2]])
}))

df3 <- right_join(df, data.frame(ww1=ww1big))
# rows that came only from ww1big have no PRODUCT yet
nullindex <- which(is.na(df3$PRODUCT))

# the `for` and `while` loops will be slow if the dataset is REALLY huge, but they're pretty simple
# for each empty row, walk back to the nearest earlier row that has a PRODUCT
nullreplace <- nullindex
for(i in seq_along(nullreplace)){
    while(is.na(df3[nullreplace[i], 1])){
        nullreplace[i]<-nullreplace[i]-1
    }
}
df3[nullindex, c(1, 4)] <- df3[nullreplace, c(1, 4)]   # copy PRODUCT and DIFF
df3[nullindex, 5] <- 0                                 # DEMAND
df3[nullindex, 3] <- df3[nullreplace, 3] + (nullindex-nullreplace)   # ww2

which returns:

> df3
Source: local data frame [11 x 5]
Groups: PRODUCT

   PRODUCT    ww1    ww2 DIFF DEMAND
1        A 201438 201432    6    100
2        A 201439 201433    6      0
3        A 201440 201434    6    100
4        A 201441 201435    6      0
5        A 201442 201436    6      0
6        A 201443 201437    6      0
7        A 201444 201438    6    100
8        A 201445 201439    6      0
9        A 201446 201440    6    100
10       B 201411 201405    6    100
11       B 201412 201406    6    100

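Both solutions use the dplyr package, and neither is especially elegant. Still, apart from the for/while loop in the second option, which is relatively simple, both should be fast. If necessary, that loop could be rewritten with *apply functions, though it would be less readable. Both handle additional products without trouble.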
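
For data where the join on ww1 alone misbehaves, a variant of the second approach that joins on both PRODUCT and ww1 and addresses every column by name might look like the base R sketch below. It rests on two assumptions the thread does not confirm: that the trouble comes from ww1 values shared between products, and that ww1 - ww2 is constant within a product. The names `grid`, `filled`, `offset` and `added` are illustrative.

```r
PRODUCT <- c(rep("A", 4), rep("B", 2))
ww1 <- c(201438, 201440, 201444, 201446, 201411, 201412)
df <- data.frame(PRODUCT, ww1, ww2 = ww1 - 6, DIFF = 6, DEMAND = 100)

# every (PRODUCT, ww1) pair between each product's first and last week
grid <- do.call(rbind, lapply(split(df, df$PRODUCT), function(d) {
  data.frame(PRODUCT = d$PRODUCT[1], ww1 = seq(min(d$ww1), max(d$ww1)))
}))

# left join on both keys, so a week shared by two products cannot cross-match
filled <- merge(grid, df, by = c("PRODUCT", "ww1"), all.x = TRUE)
filled <- filled[order(filled$PRODUCT, filled$ww1), ]

# per-product offset between ww1 and ww2 (assumed constant within a product)
offset <- ave(filled$ww1 - filled$ww2, filled$PRODUCT,
              FUN = function(x) x[!is.na(x)][1])

# fill the inserted rows, referring to columns by name only
added <- is.na(filled$DEMAND)
filled$ww2[added] <- filled$ww1[added] - offset[added]
filled$DIFF[added] <- offset[added]
filled$DEMAND[added] <- 0

# per-product counter
filled$seq <- ave(filled$ww1, filled$PRODUCT, FUN = seq_along)
rownames(filled) <- NULL
filled
```

Because nothing here depends on column positions, reordering the columns of df should not change the result.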

Edit 1=========================

This is actually quite simple: because the data.frame is already grouped by product by dplyr, all you need is

df3 <- mutate(df3, seq=seq_along(PRODUCT))

and you get

> df3
Source: local data frame [11 x 6]
Groups: PRODUCT

   PRODUCT    ww1    ww2 DIFF DEMAND seq
1        A 201438 201432    6    100   1
2        A 201439 201433    6      0   2
3        A 201440 201434    6    100   3
4        A 201441 201435    6      0   4
5        A 201442 201436    6      0   5
6        A 201443 201437    6      0   6
7        A 201444 201438    6    100   7
8        A 201445 201439    6      0   8
9        A 201446 201440    6    100   9
10       B 201411 201405    6    100   1
11       B 201412 201406    6    100   2
# NEW SOLUTION
nrows = nrow(df)
newdf = df[1,]
myseq = 1
for(i in 2:nrows) {
  currentRow = df[i,]
  tmpRow = df[i-1,]
  sameProduct = tmpRow$PRODUCT == currentRow$PRODUCT

  if(sameProduct && tmpRow$ww1 < (currentRow$ww1 - 1)) {
    # gap within a product: fill in the missing weeks with DEMAND = 0
    tmp = (tmpRow$ww1+1):(currentRow$ww1-1)
    tmp.length = length(tmp)
    tmp.last = tail(myseq,1)+1
    # sequence numbers for the filler rows plus the current row
    myseq = c(myseq, tmp.last:(tmp.last + tmp.length))
    tmpdf = data.frame(PRODUCT=rep(tmpRow$PRODUCT, tmp.length),
     ww1=tmp, ww2=tmp-6, DIFF=rep(6,tmp.length),DEMAND=rep(0,tmp.length))
    newdf = rbind(newdf,tmpdf,currentRow)
  } else {
    if(sameProduct && tmpRow$ww1==currentRow$ww1-1) {
      # consecutive week of the same product: continue the count
      myseq = c(myseq, tail(myseq,1)+1)
    } else {
      # new product: restart the count
      myseq = c(myseq,1)
    }
    newdf = rbind(newdf,currentRow)
  }
}
newdf = cbind(newdf, myseq)
row.names(newdf) = 1:nrow(newdf)

# OLD SOLUTION
nrows = nrow(df)
newdf = df[1,]
for(i in 2:nrows) {
  currentRow = df[i,]
  tmpRow = df[i-1,]

  # only fill gaps between rows of the same product
  if(tmpRow$PRODUCT == currentRow$PRODUCT && tmpRow$ww1 < currentRow$ww1) {
    while(tmpRow$ww1 + 1 != currentRow$ww1) {
      tmpRow$ww1 = tmpRow$ww1 + 1
      tmpRow$ww2 = tmpRow$ww2 + 1
      # diff doesn't change
      tmpRow$DEMAND = 0
      # rbind the filler row
      newdf=rbind(newdf,tmpRow)
    }
  }
  newdf=rbind(newdf,currentRow)
}
row.names(newdf) = 1:nrow(newdf)

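I recently had to work with large tables and have become a big fan of the data.table package (it is really fast and lets you add new variables by reference, without copying the table).

With it, the solution looks like this: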

library(data.table)

# convert to data.table
dtable = as.data.table(df)
# create the variables grouped by PRODUCT 
dtransf <- dtable[, .(ww1 = seq(min(ww1), max(ww1), 1L), 
                      ww2 = seq(min(ww2), max(ww2), 1L), 
                     DIFF = 6L,
                   DEMAND = as.integer(seq(min(ww1), max(ww1),1L) %in% unique(ww1)) * 100), 
                 by = PRODUCT]
#add the incremental counter
dtransf[,seq := seq_len(.N), by = PRODUCT]

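The code is somewhat case-specific (especially the DEMAND calculation); in more complex cases you would probably need a join to bring in the correct demand. Also keep in mind that if the dataset contains errors (for example, rows where the difference between ww1 and ww2 is not the same), the code will fail.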
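
One way to catch that kind of inconsistency before running the transformation is a quick check in base R. This is only a sketch: `check_offsets` is a hypothetical helper, not part of data.table, and it assumes the columns are named ww1, ww2 and DIFF as in the question.

```r
# hypothetical helper: TRUE when ww1 - ww2 equals DIFF on every row
check_offsets <- function(d) {
  all(d$ww1 - d$ww2 == d$DIFF)
}

# made-up examples: the second one has a wrong ww2 on its last row
good <- data.frame(ww1 = c(201438, 201440), ww2 = c(201432, 201434), DIFF = 6)
bad  <- data.frame(ww1 = c(201438, 201440), ww2 = c(201432, 201435), DIFF = 6)

check_offsets(good)
check_offsets(bad)
```

Wrapping the call in `stopifnot()` would stop the script early instead of failing partway through the grouped computation.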

Here is a very similar data.table solution that I think should be more efficient, because it minimizes the per-group computation and uses a binary join instead.

library(data.table)
setkey(setDT(df), PRODUCT, ww1) ## Sorting by `PRODUCT` and `ww1`
indx <- setkey(df[, list(ww1 = seq.int(ww1[1L], ww1[.N], by = 1L)), by = PRODUCT]) ## running `seq.int` on `ww1` per group
res <- df[indx][is.na(ww2), `:=`(ww2 = ww1 - 6L, DIFF = 6L, DEMAND = 0L)] ## filling the missing values
res[, seq := seq_len(.N), by = PRODUCT] # Adding index
res
#     PRODUCT    ww1    ww2 DIFF DEMAND seq
#  1:       A 201438 201432    6    100   1
#  2:       A 201439 201433    6      0   2
#  3:       A 201440 201434    6    100   3
#  4:       A 201441 201435    6      0   4
#  5:       A 201442 201436    6      0   5
#  6:       A 201443 201437    6      0   6
#  7:       A 201444 201438    6    100   7
#  8:       A 201445 201439    6      0   8
#  9:       A 201446 201440    6    100   9
# 10:       B 201411 201405    6    100   1
# 11:       B 201412 201406    6    100   2