更改 example/fictitious 数据集中两个变量之间的相关性
Changing the correlation between two variables in an example/fictitious data set
我正在尝试创建一个示例数据集(大部分代码来自 this question)。这几乎就是我想要的样子。不过,还有两件事我还想做,但想不通。
我想在 y
和 year
之间创建更高的相关性,而不需要重新排列整个数据集(因此只需更改 y 的值)。
如果可能(我目前只是手动更改 set.seed()
直到我得到显着关系),我希望能够确定 event
之间的真实相关性和 y
。 (同样只能更改 y)。
有人可以帮我解释一下怎么做吗?
set.seed(2)
a <- 2 # structural parameter of interest
b <- 1 # strength of instrument
rho <- 0.5 # degree of endogeneity
N <- 1000
z <- rnorm(N)
res1 <- rnorm(N)
res2 <- res1*rho + sqrt(1-rho*rho)*rnorm(N)
x <- z*b + res1
ys <- x*a + res2
d <- (ys>0) #dummy variable
y <- round(10-(d*ys))
random_variable <- rnorm(100, mean = 0, sd = 1)
library(data.table)
DT_1 <- data.frame(y,x,z, random_variable)
DT_2 <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50), year = c(1995, 1995, 1995, 1995, 1995,
1995, 1995, 1995, 1995, 1995, 2000, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 2005, 2005, 2005, 2005, 2005, 2005, 2005,
2005, 2005, 2005, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015), Group = c("A", "A", "A", "A", "B", "B", "B", "B", "C",
"C", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "A", "A",
"A", "A", "B", "B", "B", "B", "C", "C", "A", "A", "A", "A", "B",
"B", "B", "B", "C", "C", "A", "A", "A", "A", "B", "B", "B", "B",
"C", "C"), event = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), win_or_lose = c(-1,
-1, -1, -1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, 1, 1, 1, 1, 0, 0,
-1, -1, -1, -1, 1, 1, 1, 1, 0, 0)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
DT_1 <- setDT(DT_1)
DT_2 <- setDT(DT_2)
DT_2 <- rbind(DT_2 , DT_2 [rep(1:50, 19), ])
sandbox <- cbind(DT_1, DT_2)
下面的解决方案呢
generate.values <- function(year) {
if (year == 1995)
return(rnorm(1, mean = 50))
if (year == 2000)
return(rnorm(1, mean = 100))
if (year == 2005)
return(rnorm(1, mean = 150))
if (year == 2010)
return(rnorm(1, mean = 200))
if (year == 2015)
return(rnorm(1, mean = 250))
}
sandbox$y <- apply(sandbox,
MARGIN = 1,
FUN = function(row) {
return(generate.values(row["year"]))
})
cor(sandbox$y, sandbox$year)
这给了我一个关于 0.99
的相关性。请注意我是如何增加每年正态分布的均值的。如果你想要负相关,那么你可以简单地改变 mean
参数的符号。所以函数变成了
generate.values <- function(year) {
if (year == 1995)
return(rnorm(1, mean = -50))
if (year == 2000)
return(rnorm(1, mean = -100))
if (year == 2005)
return(rnorm(1, mean = -150))
if (year == 2010)
return(rnorm(1, mean = -200))
if (year == 2015)
return(rnorm(1, mean = -250))
}
当然,您可以将其合并到一个函数中。然后我们有
generate.values <- function(year, negative.correlation = FALSE) {
means <- c(50, 100, 150, 200, 250)
names(means) <- c("1995", "2000", "2005", "2010", "2015")
if (negative.correlation)
means <- -means
if (year == 1995)
return(rnorm(1, mean = means["1995"]))
if (year == 2000)
return(rnorm(1, mean = means["2000"]))
if (year == 2005)
return(rnorm(1, mean = means["2005"]))
if (year == 2010)
return(rnorm(1, mean = means["2010"]))
if (year == 2015)
return(rnorm(1, mean = means["2015"]))
}
HTH!
此方法使用以下想法:
- 向
y
添加一个取决于 year
的确定性项。这大大提高了相关性。这里,取决于beta
。您可以调整 beta
来增加 year
的影响力。请注意,我与 year-mean(year)
一起工作,因此 y
的整体比例不会移动太多。如果您不关心 y
被移动,只需删除均值部分。
- 向
y
添加一些高斯噪声。您可以调整 sd
参数以增加噪声,从而降低相关性。
我将结果保存在y2
中,这样你就可以更轻松地玩了。当您对参数 beta
和 sd
感到满意时,您可以覆盖 y
.
noise = rnorm(n = nrow(sandbox), mean = 0, sd = 0.01)
beta = 0.1
sandbox$y2 = sandbox$y + beta * (sandbox$year - mean(sandbox$year)) + noise
cor(sandbox$y2, sandbox$year)
祝你好运,如果这不是我们想要的行为,请提供反馈或澄清。
编辑:
在这里您可以看到不同 beta
和 sigma
值的行为:
betas = seq(-.50, .50, by=.10)
sigmas = seq(0.0, 5.0, by=1.0)
M = matrix(data=NA, nrow=length(betas), ncol=length(sigmas))
for (b in 1:length(betas)){
for (s in 1:length(sigmas)){
noise = rnorm(n = nrow(sandbox), mean = 0, sd = sigmas[s])
sandbox$y2 = sandbox$y + betas[b] * (sandbox$year - mean(sandbox$year)) + noise
M[b,s] = round(cor(sandbox$y2, sandbox$year), 2)
}
}
rownames(M) = betas
colnames(M) = sigmas
M
导致以下矩阵输出。行是 beta,列是 sigma,单元格值是 y
和 year
的相关性:
0 1 2 3 4 5
-0.5 -0.86 -0.84 -0.77 -0.66 -0.62 -0.55
-0.4 -0.81 -0.78 -0.70 -0.61 -0.53 -0.47
-0.3 -0.71 -0.68 -0.61 -0.51 -0.45 -0.42
-0.2 -0.56 -0.51 -0.46 -0.32 -0.29 -0.25
-0.1 -0.32 -0.29 -0.25 -0.22 -0.12 -0.08
0 0.01 -0.01 0.00 -0.01 0.01 -0.01
0.1 0.33 0.31 0.24 0.21 0.19 0.12
0.2 0.57 0.52 0.45 0.38 0.33 0.27
0.3 0.72 0.66 0.59 0.48 0.44 0.33
0.4 0.81 0.78 0.71 0.62 0.54 0.48
0.5 0.86 0.84 0.78 0.69 0.61 0.53
编辑 2: 当然,您可以简单地使用负 beta
来实现负相关。您也可以只修复 sigma
并只调整 beta
.
我正在尝试创建一个示例数据集(大部分代码来自 this question)。这几乎就是我想要的样子。不过,还有两件事我还想做,但想不通。
我想在
y
和year
之间创建更高的相关性,而不需要重新排列整个数据集(因此只需更改 y 的值)。如果可能(我目前只是手动更改
set.seed()
直到我得到显着关系),我希望能够确定event
之间的真实相关性和y
。 (同样只能更改 y)。
有人可以帮我解释一下怎么做吗?
set.seed(2)
a <- 2 # structural parameter of interest
b <- 1 # strength of instrument
rho <- 0.5 # degree of endogeneity
N <- 1000
z <- rnorm(N)
res1 <- rnorm(N)
res2 <- res1*rho + sqrt(1-rho*rho)*rnorm(N)
x <- z*b + res1
ys <- x*a + res2
d <- (ys>0) #dummy variable
y <- round(10-(d*ys))
random_variable <- rnorm(100, mean = 0, sd = 1)
library(data.table)
DT_1 <- data.frame(y,x,z, random_variable)
DT_2 <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50), year = c(1995, 1995, 1995, 1995, 1995,
1995, 1995, 1995, 1995, 1995, 2000, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 2005, 2005, 2005, 2005, 2005, 2005, 2005,
2005, 2005, 2005, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015), Group = c("A", "A", "A", "A", "B", "B", "B", "B", "C",
"C", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "A", "A",
"A", "A", "B", "B", "B", "B", "C", "C", "A", "A", "A", "A", "B",
"B", "B", "B", "C", "C", "A", "A", "A", "A", "B", "B", "B", "B",
"C", "C"), event = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), win_or_lose = c(-1,
-1, -1, -1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, 1, 1, 1, 1, 0, 0,
-1, -1, -1, -1, 1, 1, 1, 1, 0, 0)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
DT_1 <- setDT(DT_1)
DT_2 <- setDT(DT_2)
DT_2 <- rbind(DT_2 , DT_2 [rep(1:50, 19), ])
sandbox <- cbind(DT_1, DT_2)
下面的解决方案呢
generate.values <- function(year) {
if (year == 1995)
return(rnorm(1, mean = 50))
if (year == 2000)
return(rnorm(1, mean = 100))
if (year == 2005)
return(rnorm(1, mean = 150))
if (year == 2010)
return(rnorm(1, mean = 200))
if (year == 2015)
return(rnorm(1, mean = 250))
}
sandbox$y <- apply(sandbox,
MARGIN = 1,
FUN = function(row) {
return(generate.values(row["year"]))
})
cor(sandbox$y, sandbox$year)
这给了我一个关于 0.99
的相关性。请注意我是如何增加每年正态分布的均值的。如果你想要负相关,那么你可以简单地改变 mean
参数的符号。所以函数变成了
generate.values <- function(year) {
if (year == 1995)
return(rnorm(1, mean = -50))
if (year == 2000)
return(rnorm(1, mean = -100))
if (year == 2005)
return(rnorm(1, mean = -150))
if (year == 2010)
return(rnorm(1, mean = -200))
if (year == 2015)
return(rnorm(1, mean = -250))
}
当然,您可以将其合并到一个函数中。然后我们有
generate.values <- function(year, negative.correlation = FALSE) {
means <- c(50, 100, 150, 200, 250)
names(means) <- c("1995", "2000", "2005", "2010", "2015")
if (negative.correlation)
means <- -means
if (year == 1995)
return(rnorm(1, mean = means["1995"]))
if (year == 2000)
return(rnorm(1, mean = means["2000"]))
if (year == 2005)
return(rnorm(1, mean = means["2005"]))
if (year == 2010)
return(rnorm(1, mean = means["2010"]))
if (year == 2015)
return(rnorm(1, mean = means["2015"]))
}
HTH!
此方法使用以下想法:
- 向
y
添加一个取决于year
的确定性项。这大大提高了相关性。这里,取决于beta
。您可以调整beta
来增加year
的影响力。请注意,我与year-mean(year)
一起工作,因此y
的整体比例不会移动太多。如果您不关心y
被移动,只需删除均值部分。 - 向
y
添加一些高斯噪声。您可以调整sd
参数以增加噪声,从而降低相关性。
我将结果保存在y2
中,这样你就可以更轻松地玩了。当您对参数 beta
和 sd
感到满意时,您可以覆盖 y
.
noise = rnorm(n = nrow(sandbox), mean = 0, sd = 0.01)
beta = 0.1
sandbox$y2 = sandbox$y + beta * (sandbox$year - mean(sandbox$year)) + noise
cor(sandbox$y2, sandbox$year)
祝你好运,如果这不是我们想要的行为,请提供反馈或澄清。
编辑:
在这里您可以看到不同 beta
和 sigma
值的行为:
betas = seq(-.50, .50, by=.10)
sigmas = seq(0.0, 5.0, by=1.0)
M = matrix(data=NA, nrow=length(betas), ncol=length(sigmas))
for (b in 1:length(betas)){
for (s in 1:length(sigmas)){
noise = rnorm(n = nrow(sandbox), mean = 0, sd = sigmas[s])
sandbox$y2 = sandbox$y + betas[b] * (sandbox$year - mean(sandbox$year)) + noise
M[b,s] = round(cor(sandbox$y2, sandbox$year), 2)
}
}
rownames(M) = betas
colnames(M) = sigmas
M
导致以下矩阵输出。行是 beta,列是 sigma,单元格值是 y
和 year
的相关性:
0 1 2 3 4 5
-0.5 -0.86 -0.84 -0.77 -0.66 -0.62 -0.55
-0.4 -0.81 -0.78 -0.70 -0.61 -0.53 -0.47
-0.3 -0.71 -0.68 -0.61 -0.51 -0.45 -0.42
-0.2 -0.56 -0.51 -0.46 -0.32 -0.29 -0.25
-0.1 -0.32 -0.29 -0.25 -0.22 -0.12 -0.08
0 0.01 -0.01 0.00 -0.01 0.01 -0.01
0.1 0.33 0.31 0.24 0.21 0.19 0.12
0.2 0.57 0.52 0.45 0.38 0.33 0.27
0.3 0.72 0.66 0.59 0.48 0.44 0.33
0.4 0.81 0.78 0.71 0.62 0.54 0.48
0.5 0.86 0.84 0.78 0.69 0.61 0.53
编辑 2: 当然,您可以简单地使用负 beta
来实现负相关。您也可以只修复 sigma
并只调整 beta
.