How to tidy data from different columns into one

I'm afraid my title doesn't accurately reflect my question, so let me explain it as clearly as I can. Here is part of my data.

pdpass begyr1 gvkey1 endyr1 begyr2 gvkey2 endyr2 begyr3 gvkey3 endyr3 begyr4 gvkey4 endyr4 begyr5 gvkey5 endyr5
-16348   1991  26005   1993   1994   8852   1996   1997   3708   2005     NA     NA     NA     NA     NA     NA
-16179   2006 174876   2006     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
-16161   1993  30964   2005     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
-15631   1996 111491   2006     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
-14668   1988  15854   2003     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
-13980   1986  13218   1995   1996   9921   2005     NA     NA     NA     NA     NA     NA     NA     NA     NA

Because the data is wide, I am also including the dput output.

#dput

structure(list(pdpass = c(-16348L, -16179L, -16161L, -15631L, -14668L, -13980L),
begyr1 = c(1991L, 2006L, 1993L, 1996L, 1988L,  1986L), gvkey1 = c(26005L, 174876L, 30964L, 111491L, 15854L, 13218L), endyr1 = c(1993L, 2006L, 2005L, 2006L, 2003L, 1995L), 
begyr2 = c(1994L, NA, NA, NA, NA, 1996L), gvkey2 = c(8852L, NA, NA, NA, NA, 9921L), endyr2 = c(1996L, NA, NA, NA, NA, 2005L), 
begyr3 = c(1997L, NA, NA, NA, NA, NA), gvkey3 = c(3708L, NA, NA, NA, NA, NA), endyr3 = c(2005L, NA, NA, NA, NA, NA),
begyr4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), gvkey4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), endyr4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_),
begyr5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), gvkey5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), endyr5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_)), row.names = c(NA, -6L), class = "data.frame", .Names = c("pdpass", "begyr1", "gvkey1", "endyr1", "begyr2", "gvkey2", "endyr2", "begyr3", "gvkey3", "endyr3", "begyr4", "gvkey4", "endyr4", "begyr5", "gvkey5", "endyr5"))

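I want to reshape it into long format.

The pdpass column is the ID variable. For each pdpass value I want to collect all of the begyr, endyr, and gvkey values. However, as you can see from the raw data, some pdpass values have only one set of gvkey, endyr, and begyr values, while others have several (stored as gvkey2, endyr2, begyr2, gvkey3, and so on).

Basically, for each pdpass I want one row per set of gvkey, begyr, and endyr columns whose names end in the same number.

Put simply, this is the output I want: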

   pdpass gvkeyN  gvkey begyearN begyear endyearN endyear
1  -16348 gvkey1  26005 begyear1    1991 endyear1    1993
2  -16348 gvkey2   8852 begyear2    1994 endyear2    1996
3  -16348 gvkey3   3708 begyear3    1997 endyear3    2005
4  -16179 gvkey1 174876 begyear1    2006 endyear1    2006
5  -16161 gvkey1  30964 begyear1    1993 endyear1    2005
6  -15631 gvkey1 111491 begyear1    1996 endyear1    2006
7  -14668 gvkey1  15854 begyear1    1988 endyear1    2003
8  -13980 gvkey1  13218 begyear1    1986 endyear1    1995
9  -13980 gvkey2   9921 begyear2    1996 endyear2    2005
10 -13956 gvkey1   8674 begyear1    1968 endyear1    1987

# code
desired <- data.frame(
  pdpass = c(rep(-16348, 3), -16179, -16161, -15631, -14668, rep(-13980, 2), -13956),
  gvkeyN = c("gvkey1", "gvkey2", "gvkey3", rep("gvkey1", 5), "gvkey2", "gvkey1"),
  gvkey = c(26005, 8852, 3708, 174876, 30964, 111491, 15854, 13218, 9921, 8674),
  begyearN = c("begyear1", "begyear2", "begyear3", rep("begyear1", 5), "begyear2", "begyear1"),
  begyear = c(1991, 1994, 1997, 2006, 1993, 1996, 1988, 1986, 1996, 1968),
  endyearN = c("endyear1", "endyear2", "endyear3", rep("endyear1", 5), "endyear2", "endyear1"),
  endyear = c(1993, 1996, 2005, 2006, 2005, 2006, 2003, 1995, 2005, 1987)
)

Here is what I have tried so far:

library(tidyr)
library(dplyr)  # arrange() comes from dplyr

# NBER is the wide data frame shown above
d1 <- gather(NBER, gvkeyN, gvkey, -pdpass, -endyr1, -begyr1, -endyr2, -begyr2, -endyr3, -begyr3, -endyr4, -begyr4, -endyr5, -begyr5, na.rm = TRUE)
d2 <- gather(d1, begyrN, begyear, -gvkeyN, -gvkey, -pdpass, -endyr1, -endyr2, -endyr3, -endyr4, -endyr5, na.rm = TRUE)
d3 <- gather(d2, endyrN, endyear, -gvkeyN, -gvkey, -pdpass, -begyrN, -begyear, na.rm = TRUE)
d4 <- arrange(d3, pdpass)
head(d4, 10)

   pdpass gvkeyN gvkey begyrN begyear endyrN endyear
1  -16348 gvkey1 26005 begyr1    1991 endyr1    1993
2  -16348 gvkey2  8852 begyr1    1991 endyr1    1993
3  -16348 gvkey3  3708 begyr1    1991 endyr1    1993
4  -16348 gvkey1 26005 begyr2    1994 endyr1    1993
5  -16348 gvkey2  8852 begyr2    1994 endyr1    1993
6  -16348 gvkey3  3708 begyr2    1994 endyr1    1993
7  -16348 gvkey1 26005 begyr3    1997 endyr1    1993
8  -16348 gvkey2  8852 begyr3    1997 endyr1    1993
9  -16348 gvkey3  3708 begyr3    1997 endyr1    1993
10 -16348 gvkey1 26005 begyr1    1991 endyr2    1996

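As you can see, I get unnecessary rows. I only need the rows where the trailing numbers of gvkey, begyr, and endyr all match (for example gvkey1 begyr1 endyr1, gvkey2 begyr2 endyr2, gvkey3 begyr3 endyr3, and so on); instead I get every combination of 1, 2, and 3. So rows 2:10 above are all unnecessary.

Perhaps I could manually delete all the unwanted rows after gathering. However, I suspect there is a better way to use the gather function. I have looked at similar questions but could not find a solution to this problem. Can anyone help me with this?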
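The manual clean-up mentioned above could look roughly like the following. This is only a sketch: `wide` is a hypothetical two-row stand-in for the real data, and `suffix()` is a helper introduced here for illustration. It repeats the three `gather` calls and then keeps the rows whose three labels end in the same digits.

```r
library(tidyr)

# Hypothetical two-row stand-in for the wide data (same naming scheme).
wide <- data.frame(
  pdpass = c(-16348L, -16179L),
  begyr1 = c(1991L, 2006L), gvkey1 = c(26005L, 174876L), endyr1 = c(1993L, 2006L),
  begyr2 = c(1994L, NA),    gvkey2 = c(8852L, NA),       endyr2 = c(1996L, NA)
)

# Three successive gathers, as in the attempt above. Each one multiplies the
# rows, so every combination of suffixes appears within a pdpass.
g1 <- gather(wide, gvkeyN, gvkey, starts_with("gvkey"), na.rm = TRUE)
g2 <- gather(g1, begyrN, begyear, starts_with("begyr"), na.rm = TRUE)
g3 <- gather(g2, endyrN, endyear, starts_with("endyr"), na.rm = TRUE)

# Illustrative helper: strip the leading non-digits, leaving the suffix.
suffix <- function(x) sub("^[^0-9]+", "", x)

# Keep only the rows where all three suffixes agree, then sort.
keep <- suffix(g3$gvkeyN) == suffix(g3$begyrN) &
  suffix(g3$gvkeyN) == suffix(g3$endyrN)
matched <- g3[keep, ]
matched <- matched[order(matched$pdpass, matched$gvkeyN), ]
matched
```

This works, but it builds the full cross product first and throws most of it away, which is why a reshape that pairs the columns directly is preferable.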

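We can use melt from data.table (df1 is the data frame from the question):

library(data.table)
# patterns() groups the measure columns by prefix, so gvkey1/begyr1/endyr1
# end up on the same row; the `variable` column holds the suffix index.
melt(setDT(df1), measure.vars = patterns("^gvkey", "^begyr", "^endyr"),
     na.rm = TRUE, value.name = c("gvkey", "begyear", "endyear"))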
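For readers who would rather stay within tidyr, a similar result can probably be had with `pivot_longer` and the special `".value"` name. This is a minimal sketch, assuming tidyr 1.0.0 or later; `wide` is a hypothetical two-row stand-in for the real data, and the key column name `N` is chosen here only for illustration.

```r
library(tidyr)

# Hypothetical two-row stand-in for the wide data (same naming scheme).
wide <- data.frame(
  pdpass = c(-16348L, -16179L),
  begyr1 = c(1991L, 2006L), gvkey1 = c(26005L, 174876L), endyr1 = c(1993L, 2006L),
  begyr2 = c(1994L, NA),    gvkey2 = c(8852L, NA),       endyr2 = c(1996L, NA)
)

# Split each column name into a prefix and a numeric suffix. The prefix
# (".value") becomes the output column; the suffix goes into column N.
long <- pivot_longer(
  wide,
  cols = -pdpass,
  names_to = c(".value", "N"),
  names_pattern = "^([a-z]+)([0-9]+)$",
  values_drop_na = TRUE  # drop suffix groups that are entirely NA
)
long
```

Because the suffix is carried in a single `N` column, gvkey, begyr, and endyr values with the same number always share a row, so no cross product is created and no filtering is needed afterwards.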