Subsequent loop passes to interpolate population counts are not populating dataframe as intended - logic is incorrect?
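I have a data frame with 6 variables:
Depr is a factor with 6 levels ("0", "1", "2", "3", "4", "5")
Sex is a factor with 3 levels ("Both sexes", "Female", "Male")
Age is a factor with 19 levels ("00-04", "05-09", "10-14", "15-19",
"20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54",
"55-59", "60-64", "65-69", "70-74", "75-79", "80-84", "85+", "Total")
GL is a factor (geographic level) with 5 levels ("HPE", "KFLA", "LGL", "ON", "Regional")
YR is an integer (year) with only two values, 2011 and 2016 (census years)
and Pop is the population count, an integer.
The data frame is in long format, with a population count for every combination of the factors in each of the two years.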
Depr Sex Age GL YR Pop
0 Both sexes 00-04 ON 2011 395
0 Both sexes 00-04 ON 2016 5550
...
1 Both sexes 00-04 ON 2011 495
1 Both sexes 00-04 ON 2016 3923
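I would like to interpolate the years between 2011 and 2016 (2012, 2013, 2014, 2015) for each combination of Depr, Sex, Age and GL in the data frame, so that I end up with something like this: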
Depr Sex Age GL YR Pop
0 Both sexes 00-04 ON 2011 395
0 Both sexes 00-04 ON 2012 456
0 Both sexes 00-04 ON 2013 689
0 Both sexes 00-04 ON 2014 2354
0 Both sexes 00-04 ON 2015 3446
0 Both sexes 00-04 ON 2016 5550
I have set up nested loops and use approx for linear interpolation.
# create an empty data frame to combine the results
fdepr <- data.frame(Depr = factor(levels = c("0", "1", "2", "3", "4", "5")),
                    Sex = factor(levels = c("Both sexes", "Female", "Male")),
                    Age = factor(levels = c("00-04", "05-09", "10-14",
                                            "15-19", "20-24", "25-29", "30-34", "35-39", "40-44",
                                            "45-49", "50-54", "55-59", "60-64", "65-69", "70-74",
                                            "75-79", "80-84", "85+", "Total")),
                    GL = factor(levels = c("HPE", "KFLA", "LGL", "ON", "Regional")),
                    YR = integer(),
                    Pop = integer())
# loops to subset Pop by grouping categories (depr is my original df)
for (i in unique(depr$Depr)) {
  for (j in unique(depr$Sex)) {
    for (k in unique(depr$Age)) {
      for (l in unique(depr$GL)) {
        temp <- subset(depr, subset = (Depr == i & Sex == j & Age == k & GL == l), select = c(YR, Pop))
        x <- temp$YR
        y <- temp$Pop
        t <- c(2011, 2012, 2013, 2014, 2015, 2016)
        points <- approx(x, y, method = 'linear', xout = t)
        results <- data.frame(Depr = rep(i, 6), Sex = rep(j, 6), Age = rep(k, 6), GL = rep(l, 6), YR = points$x, Pop = points$y)
        fdepr <- rbind(fdepr, results)
      }
    }
  }
}
It seems to get through the first pass and populates results and fdepr as expected, but then I get:
Error in approx(x, y, method = "linear", xout = t) :
need at least two non-NA values to interpolate
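At that point temp is empty, and so are x and y. I'm not sure whether the problem lies in how fdepr is defined or in the nested loops...
I'm not a data scientist, so complex logic and programming are not intuitive to me. Any insight is appreciated.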
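In my opinion, creating a new data frame and running nested for-loops makes this more complicated than it needs to be.
Here I use group_by and expand to get the intermediate years for each group of the data, then left_join the original data frame to add the corresponding Pop values. After that, you only need to apply na.approx to each group. The data are still grouped from the expand step, so you can use mutate directly.
Of course, you could overwrite Pop in the mutate call instead of creating a new variable; I only did it this way for illustration.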
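library(zoo)       # for na.approx
library(tidyverse) # for %>%, group_by, expand, left_join, and mutate
depr %>%
  group_by(Depr, Sex, Age, GL) %>%
  # one row per year between the two YR values of each group
  expand(YR = do.call(seq, as.list(YR))) %>%
  # bring back the known Pop values; the added years get NA
  left_join(depr, by = names(.)) %>%
  # still grouped, so the interpolation runs within each group
  mutate(Pop_interp = na.approx(Pop))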
# # A tibble: 12 x 7
# # Groups: Depr, Sex, Age, GL [2]
# Depr Sex Age GL YR Pop Pop_interp
# <int> <chr> <chr> <chr> <int> <int> <dbl>
# 1 0 Both 00-04 ON 2011 395 395
# 2 0 Both 00-04 ON 2012 NA 1426
# 3 0 Both 00-04 ON 2013 NA 2457
# 4 0 Both 00-04 ON 2014 NA 3488
# 5 0 Both 00-04 ON 2015 NA 4519
# 6 0 Both 00-04 ON 2016 5550 5550
# 7 1 Both 00-04 ON 2011 495 495
# 8 1 Both 00-04 ON 2012 NA 1181.
# 9 1 Both 00-04 ON 2013 NA 1866.
# 10 1 Both 00-04 ON 2014 NA 2552.
# 11 1 Both 00-04 ON 2015 NA 3237.
# 12 1 Both 00-04 ON 2016 3923 3923
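To make the interpolation step less of a black box, here is a minimal base R sketch of what filling the NA values in one group amounts to. The helper fill_linear is hypothetical (it is not part of zoo) and only stands in for na.approx in the simple case where the first and last values of the group are known.

```r
# Hypothetical helper, not from zoo: fill interior NAs by straight-line
# interpolation over the positions of the vector.
fill_linear <- function(v) {
  idx <- seq_along(v)
  # approx() ignores the NA pairs and interpolates between the known points
  approx(idx, v, xout = idx, method = "linear")$y
}

# Pop for one group after the expand/left_join step (values from the example)
pop <- c(395, NA, NA, NA, NA, 5550)
filled <- fill_linear(pop)
print(filled)

# The same result by hand: equal yearly steps between the two census counts
step <- (5550 - 395) / (2016 - 2011)
by_hand <- 395 + step * 0:5
print(by_hand)
```

Both vectors should agree with the Pop_interp values in rows 1 to 6 of the output above.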
Here is the same thing using data.table and magrittr instead of tidyverse.
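library(zoo)
library(magrittr)
library(data.table)
setDT(depr) # the syntax below needs depr to be a data.table, not a plain data.frame
depr[, .(YR = do.call(seq, as.list(YR))), .(Depr, Sex, Age, GL)] %>%
  .[depr, on = names(.), Pop := i.Pop] %>%
  # by keeps the interpolation within each group
  .[, Pop_Interp := na.approx(Pop), by = c("Depr", "Sex", "Age", "GL")] %>%
  print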
# Depr Sex Age GL YR Pop Pop_Interp
# 1: 0 Both 00-04 ON 2011 395 395.0
# 2: 0 Both 00-04 ON 2012 NA 1426.0
# 3: 0 Both 00-04 ON 2013 NA 2457.0
# 4: 0 Both 00-04 ON 2014 NA 3488.0
# 5: 0 Both 00-04 ON 2015 NA 4519.0
# 6: 0 Both 00-04 ON 2016 5550 5550.0
# 7: 1 Both 00-04 ON 2011 495 495.0
# 8: 1 Both 00-04 ON 2012 NA 1180.6
# 9: 1 Both 00-04 ON 2013 NA 1866.2
# 10: 1 Both 00-04 ON 2014 NA 2551.8
# 11: 1 Both 00-04 ON 2015 NA 3237.4
# 12: 1 Both 00-04 ON 2016 3923 3923.0
Input data used
depr <- data.table::fread('
Depr Sex Age GL YR Pop
0 Both 00-04 ON 2011 395
0 Both 00-04 ON 2016 5550
1 Both 00-04 ON 2011 495
1 Both 00-04 ON 2016 3923
', data.table = FALSE)
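For anyone who would rather keep the approx-based approach from the question, the sketch below is one possible base R variant, not the method used in the answer above. It assumes the error comes from the nested loops visiting combinations of Depr, Sex, Age and GL that have no rows in depr, which matches the empty temp the question describes. The toy data frame repeats the four example rows, and interpolate_group is a hypothetical helper name.

```r
# Toy input with the example rows from the question
depr <- data.frame(
  Depr = c("0", "0", "1", "1"),
  Sex  = "Both sexes",
  Age  = "00-04",
  GL   = "ON",
  YR   = c(2011L, 2016L, 2011L, 2016L),
  Pop  = c(395L, 5550L, 495L, 3923L),
  stringsAsFactors = TRUE
)
years <- 2011:2016

# An empty subset reproduces the error from the question
err <- tryCatch(
  approx(integer(0), integer(0), xout = years),
  error = function(e) conditionMessage(e)
)
print(err)

# split() with drop = TRUE keeps only the combinations that actually occur,
# so an empty subset never reaches approx()
groups <- split(depr, depr[c("Depr", "Sex", "Age", "GL")], drop = TRUE)

# Hypothetical helper: interpolate one group, or skip it if it has too few points
interpolate_group <- function(g) {
  if (sum(!is.na(g$Pop)) < 2) return(NULL)
  pts <- approx(g$YR, g$Pop, xout = years, method = "linear")
  data.frame(Depr = g$Depr[1], Sex = g$Sex[1], Age = g$Age[1], GL = g$GL[1],
             YR = pts$x, Pop = pts$y)
}

fdepr <- do.call(rbind, lapply(unname(groups), interpolate_group))
print(fdepr)
```

Building the pieces with lapply and binding them once at the end also avoids growing fdepr with rbind inside the loop.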