Estimate combinations in panel data in R?

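I am trying to maximize the number of data points in cross-sectional panel data. My matrix is structured as follows, with years on the y-axis and countries on the x-axis: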

        A     B    C     D 
2000   NA    50    NA    85
2001   110   75    76    86
2002   120   NA    78    87
2003   130   100   80    88

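So I am trying to find every possible combination of years and, for each combination, the largest set of countries that have data in all of those years. Using the example above, I am trying to produce a vector, list, or some other type of object that looks something like this: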

2000, 2001, 2002, 2003 = D
2000, 2001, 2003 = D and B
2001, 2002, 2003 = D, A and C
2000, 2001 = D and B
2001, 2002 = D, A and C
2002, 2003 = D, A and C
2000 = D and B
2001 = A, B, C and D
2002 = A, C and D
2003 = A, B, C and D

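This is an abstract problem that I can't wrap my head around. Any help would be appreciated.

Update

Here is a solution that is a good starting point but could probably be improved: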

library(RcppAlgos)
getCombs <- function(myMat, myCap = NULL, minYears = NULL) {

    numRows <- nrow(myMat)
    myColNames <- colnames(myMat)

    if (is.null(minYears))  ## set default
        repZero <- numRows - 1
    else if (minYears >= numRows || minYears < 1)  ## check for extreme cases
        repZero <- numRows - 1
    else
        repZero <- numRows - minYears

    combs <- comboGeneral(v = c(0,1:numRows),
                          m = numRows, freqs = c(repZero, 
                                                 rep(1, numRows)), rowCap = myCap)

    ## I think this part could be improved
    out <- lapply(1:nrow(combs), function(x) {
        myRows <- myMat[combs[x,],]

        if (is.null(nrow(myRows)))
            result <- !is.na(myRows)
        else
            result <- complete.cases(t(myRows))

        myColNames[result]
    })

    myRowNames <- rownames(myMat)
    names(out) <- lapply(1:nrow(combs), function(x) myRowNames[combs[x,combs[x,]>0]])
    out
}

Here is the output for the OP's example. (The OP's list is missing 5 of the results below):

testMat <- matrix(c(NA, 50, NA, 85, 110, 75, 76, 86, 120, NA, 78, 87, 130, 100, 80, 88), nrow = 4, byrow = TRUE)
row.names(testMat) <- 2000:2003
colnames(testMat) <- LETTERS[1:4]

getCombs(testMat)
$`2000`
[1] "B" "D"

$`2001`
[1] "A" "B" "C" "D"

$`2002`
[1] "A" "C" "D"

$`2003`
[1] "A" "B" "C" "D"

$`c(2000, 2001)`
[1] "B" "D"

$`c(2000, 2002)`
[1] "D"

$`c(2000, 2003)`
[1] "B" "D"

$`c(2001, 2002)`
[1] "A" "C" "D"

$`c(2001, 2003)`
[1] "A" "B" "C" "D"

$`c(2002, 2003)`
[1] "A" "C" "D"

$`c(2000, 2001, 2002)`
[1] "D"

$`c(2000, 2001, 2003)`
[1] "B" "D"

$`c(2000, 2002, 2003)`
[1] "D"

$`c(2001, 2002, 2003)`
[1] "A" "C" "D"

$`c(2000, 2001, 2002, 2003)`
[1] "D"

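However, since you have 144 countries and 47 years of data, neither this answer nor any future answer will be able to give you every combination. The number of results is very, very large. The set of all combinations of every length up to n is equivalent to the power set of the n years. The number of elements in a power set is 2^n. Since we are not counting the equivalent of the empty set, we need to subtract one, thus: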

library(gmp)
sub.bigz(pow.bigz(2, 47),1)
Big Integer ('bigz') :
[1] 140737488355327

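Yes, that is over one hundred trillion!!! You may need to rethink your approach, since there are simply too many results.

All is not lost! You can use the myCap argument to limit the number of results, so you can still explore the possible combinations. Observe: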
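For small n the same count can be checked in base R without gmp, since the number of non-empty year subsets is the sum of choose(n, k) for k = 1..n. A quick sketch (the helper name n_year_sets is made up for illustration and is not part of the answer's code):

```r
# Hypothetical helper: number of non-empty subsets of n years,
# computed as choose(n, 1) + choose(n, 2) + ... + choose(n, n).
n_year_sets <- function(n) sum(choose(n, seq_len(n)))

n_year_sets(4)             # the 4-year example (2000:2003)
n_year_sets(20)            # a 20-year panel
n_year_sets(20) == 2^20 - 1
```

This agrees with the 2^n - 1 formula, which is why each extra year doubles the amount of work.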

set.seed(11111)
biggerTest <- matrix(sample(100, 20*20, replace = TRUE), nrow = 20)

library(countrycode)
colnames(biggerTest) <- LETTERS[1:20]
rownames(biggerTest) <- 1988:2007

## set 10% of values to NA
myNAs <- sample(400, 400 / 10)
biggerTest[myNAs] <- NA

biggerTest[1:6, 1:10]
      A  B  C   D  E  F  G  H  I  J
1988 51 71 79  35 22 33 22 84 68  4
1989 NA 51 73  10 48 NA 62 44 29 60
1990 NA 21 NA  44 91 24 45 62 52 18
1991 91 91 58  79 65 34 36 87 54 32
1992 82  6 74  75 99 NA 20 28 64 30
1993 80 10 43 100 24 22 99 28 22 44

## Getting all 1,048,575 results takes a good bit of time
system.time(allResults <- getCombs(biggerTest))
  user  system elapsed 
49.449   0.726  50.191 

## Using myCap greatly reduces the amount of time
system.time(smallSampTest <- getCombs(biggerTest, myCap = 10000))
  user  system elapsed 
 0.252   0.003   0.257 

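Alternatively, you can use the minYears argument to return only the results for combinations with at least a given number of years. For example, per the OP's comment on @CPak's answer, if you only want to see results for combinations of 15 or more years, we have: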

system.time(minYearTest <- getCombs(biggerTest, minYears = 15))
 user  system elapsed 
1.408   0.018   1.428

set.seed(123)
minYearTest[sample(length(minYearTest), 5)]
$`c(1988, 1989, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2001, 2004, 2005, 2007)`
[1] "C" "E" "G" "T"

$`c(1988, 1989, 1990, 1991, 1993, 1994, 1996, 1997, 1998, 1999, 2000, 2002, 2003, 2004, 2005, 2007)`
[1] "G" "I" "T"

$`c(1988, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1999, 2000, 2001, 2003, 2004, 2005, 2007)`
[1] "D" "G" "K" "M" "T"

$`c(1988, 1990, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 2000, 2002, 2003, 2004, 2005, 2006, 2007)`
[1] "G" "J" "K" "T"

$`c(1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2003, 2004, 2005, 2006, 2007)`
[1] "E" "G" "T"

Or use both arguments together:

system.time(bothConstraintsTest <- getCombs(biggerTest, 10000, minYears = 10))
  user  system elapsed 
0.487   0.004   0.494

bothConstraintsTest[1:5]
$`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1997")`
[1] "E" "G" "H" "J" "M" "R" "T"

$`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1998")`
[1] "E" "G" "H" "J" "T"

$`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1999")`
[1] "D" "E" "G" "M" "T"

$`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "2000")`
[1] "D" "G" "J" "M" "R" "T"

$`c("1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "2001")`
[1] "D" "E" "G" "H" "J" "M" "R" "T"


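Explanation

The first thing we need to do is determine every combination of the n years. This boils down to finding all n-tuples of the multiset c(rep(0, n-1), 1:n) or, equivalently, the elements of the power set of an n-element set minus the empty set. For example, for the years 2000:2003 (a 4-year span), the possible combinations are: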

    comboGeneral(v = c(0,1:4), m = 4,
             freqs = c(3, rep(1, 4)))
      [,1] [,2] [,3] [,4]
 [1,]    0    0    0    1
 [2,]    0    0    0    2
 [3,]    0    0    0    3
 [4,]    0    0    0    4
 [5,]    0    0    1    2
 [6,]    0    0    1    3
 [7,]    0    0    1    4
 [8,]    0    0    2    3
 [9,]    0    0    2    4
[10,]    0    0    3    4
[11,]    0    1    2    3
[12,]    0    1    2    4
[13,]    0    1    3    4
[14,]    0    2    3    4
[15,]    1    2    3    4
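To make the equivalence concrete, here is a rough base R sketch (no RcppAlgos) that builds the non-empty subsets of 1:4 with combn and left-pads each one with zeros. The names subsets and padded are only illustrative. Under these assumptions it yields the same 15 zero-padded rows as the comboGeneral call above:

```r
n <- 4

# All non-empty subsets of 1:n, grouped by size 1, 2, ..., n.
subsets <- unlist(
    lapply(seq_len(n), function(k) combn(n, k, simplify = FALSE)),
    recursive = FALSE
)

# Left-pad each subset with zeros so every row has length n;
# a zero means "no year selected in this slot".
padded <- t(vapply(subsets,
                   function(s) c(rep(0, n - length(s)), s),
                   numeric(n)))

nrow(padded)
padded[c(1, 5, 15), ]
```

Dropping the zeros from a row recovers the subset of years, which is what the indexing in getCombs relies on.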

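Now we iterate over each row of combs, where each row tells us which combination of rows of the original matrix to test for NAs. If a particular combination contains only one row, we determine which indices are not NA, which is easily done with !is.na(. If we have more than one row, we use complete.cases(t( to get the columns that contain only numbers (i.e. no occurrence of NA).

After this, we simply use the indices to get the names for our result and voilà, we have the desired output.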
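The one-row special case can probably be avoided. Below is a minimal sketch, assuming the same testMat as above, of a helper that handles one or many years uniformly by subsetting with drop = FALSE and counting NAs per column. The name countries_for_years is hypothetical and not part of getCombs:

```r
testMat <- matrix(c(NA, 50, NA, 85,
                    110, 75, 76, 86,
                    120, NA, 78, 87,
                    130, 100, 80, 88), nrow = 4, byrow = TRUE)
rownames(testMat) <- as.character(2000:2003)
colnames(testMat) <- LETTERS[1:4]

# Hypothetical helper: countries with no NA in any of the given years.
countries_for_years <- function(mat, years) {
    # drop = FALSE keeps a matrix even when only one year is selected
    sub <- mat[as.character(years), , drop = FALSE]
    colnames(mat)[colSums(is.na(sub)) == 0]
}

countries_for_years(testMat, 2000)
countries_for_years(testMat, c(2000, 2001, 2003))
countries_for_years(testMat, 2001:2003)
```

Whether this is faster than complete.cases(t( on large inputs has not been measured here; the point is only that a single code path covers both cases.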

library(tidyverse)

I first 1) turn the row names (the years) into a column, 2) convert the data to long format, and 3) drop the rows where is.na(value) == TRUE.

df <- data %>%
        mutate(year = rownames(data)) %>%
        gather(countries, value, A:D) %>%
        filter(is.finite(value)) %>%
        arrange(year) %>%
        select(-value)

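valid_countries is a function that filters df to the years of interest (vec is one combination of years) and then keeps only the countries that appear in every one of those years. It returns a two-element vector: [years-of-interest as a comma-separated string, valid-countries as a comma-separated string].
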
valid_countries <- function(df, vec) {
    ans <- df %>%
            filter(year %in% vec) %>%
            count(countries) %>%
            filter(n == length(vec)) %>%
            pluck("countries")
    c(toString(vec), toString(unique(sort(ans))))
}

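The following lapply iterates over 1:N, where N is the number of years in the data. For each size in 1:N it draws the unique combinations of years and returns the valid countries for each one. I use as.data.frame(t(Reduce(...))) to format the result into something easier to read.
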
result <- lapply(
            seq_len(length(unique(df$year))),
            function(i) {
                apply(
                    combn(unique(df$year), i), 
                    2,
                    function(j) { valid_countries(df, as.numeric(j)) }
                )
            }
        )
as.data.frame(t(Reduce("cbind", result)))

Result

                       V1         V2
1                    2000       B, D
2                    2001 A, B, C, D
3                    2002    A, C, D
4                    2003 A, B, C, D
5              2000, 2001       B, D
6              2000, 2002          D
7              2000, 2003       B, D
8              2001, 2002    A, C, D
9              2001, 2003 A, B, C, D
10             2002, 2003    A, C, D
11       2000, 2001, 2002          D
12       2000, 2001, 2003       B, D
13       2000, 2002, 2003          D
14       2001, 2002, 2003    A, C, D
15 2000, 2001, 2002, 2003          D    
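Since the stated goal is to maximize the number of data points, one possible follow-up is to score each year combination by (number of years) x (number of complete countries) and pick the largest. This is only a sketch in base R on the 4-year example. The scoring rule and the names year_sets and score are my assumptions, not something taken from the question:

```r
m <- matrix(c(NA, 50, NA, 85,
              110, 75, 76, 86,
              120, NA, 78, 87,
              130, 100, 80, 88), nrow = 4, byrow = TRUE)
rownames(m) <- as.character(2000:2003)
colnames(m) <- LETTERS[1:4]

# Every non-empty combination of years, as character vectors of row names.
year_sets <- unlist(
    lapply(seq_len(nrow(m)), function(k) combn(rownames(m), k, simplify = FALSE)),
    recursive = FALSE
)

# Assumed score: years in the set times countries complete over those years.
score <- vapply(year_sets, function(y) {
    sub <- m[y, , drop = FALSE]
    length(y) * sum(colSums(is.na(sub)) == 0)
}, numeric(1))
names(score) <- vapply(year_sets, toString, character(1))

head(sort(score, decreasing = TRUE), 3)
names(which.max(score))
```

For a real panel with 47 years this exhaustive enumeration is not feasible, so it would have to be combined with a cap or a minimum-years filter as discussed above.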

Data

data <- read.table(text="A     B    C     D 
NA    50    NA    85
110   75    76    86
120   NA    78    87
130   100   80    88", header=TRUE, stringsAsFactors=FALSE)
rownames(data) <- 2000:2003