Cleaning up Census Bureau data frames
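I am trying to clean up and manipulate data frames of Census Bureau data. I am doing it with for loops in R, but so far it has taken more than 20 hours!
The problem is that I am working with two different data frames.
Here is my code: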
t=1
for(i in 1:25558){ # number of records in the Housing record
  family <- array(0,dim=c(0,12)) # empty array to store the row numbers of this family's members
  k=1
  n=0
  for(j in t:52608){ # number of records in the Personal record
    if(Housing[i,5] == Personal[j,2]) {
      family[k]=j
      k=k+1
      n=1
    } else if(n == 1) {
      # first non-matching row after a run of matches: remember it and stop
      t=j
      break
    }
  }
  a=0
  for(m in 1:length(family)){
    if(is.na(Personal[family[m],22])) { # some families have a mix of NA and numeric values
      break
    } else if(Personal[family[m],22] > 1){
      a=a+1
    }
  }
  # flag the family only if every member has a value greater than 1
  if(a == length(family)) {
    Housing[i,1]=1
  }
}
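(Edit - an example):
In the Housing record, I have the ID of each family. In the Personal record, the same family ID is repeated for every member of that family.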
Housing Record:
ID   Family Ability to Speak English
1    0
2    0
3    1

Personal Record:
ID   Member   Person Ability to Speak English
1    1        1
1    2        NA
1    3        2
2    1        4
2    2        1
3    1        3
3    2        2
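Note: "NA" here does not mean "Not Available"; it has a specific meaning (so basically I should not remove it).
I need to change the value of the "Family Ability to Speak English" column to 1 based on the English ability of that family's members. (See the last part of my code.)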
# some dummy data frame for families
family <- data.frame(famid=rep(1:5, each=3),
                     member=rep(1:3, 5),
                     speake=sample(c(1:4, NA, NA), 15, replace=TRUE))

# a function to calculate scores
# (modify according to your scoring algorithm)
english_score <- function(fam){
  # pull out the English scores for all members of fam
  data <- family$speake[which(family$famid==fam)]
  # I don't know how you want to score the families:
  # by their combined English score, or by whether every member has a score,
  # so demonstrate both
  eng_sum <- sum(na.omit(data))        # sum of the non-NA scores
  eng_present <- !any(is.na(data))     # TRUE only if no member's score is NA
  # return the result of the function as a vector
  c(fam, eng_sum, eng_present)
}

# apply the function to each unique family
housing <- sapply(unique(family$famid), english_score)
housing <- as.data.frame(t(housing))
colnames(housing) <- c("family", "eng_sum", "eng_present")
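
If the rule is the one the loop in the question seems to implement (a family gets 1 only when every member has a non-NA value greater than 1), the whole thing can probably be done without nested loops. This is only a sketch on toy data that mirrors the example tables; the column names `id`, `member`, `english` and `family_english` are made up for illustration and are not the real Census column names, so adjust them (and the rule) to your data.

```r
# Toy data mirroring the example tables in the question
# (hypothetical column names, not the real Census ones).
Personal <- data.frame(
  id      = c(1, 1, 1, 2, 2, 3, 3),
  member  = c(1, 2, 3, 1, 2, 1, 2),
  english = c(1, NA, 2, 4, 1, 3, 2)
)
Housing <- data.frame(id = 1:3, family_english = 0)

# Per person: TRUE only for a non-NA value greater than 1.
# NA rows become FALSE, so a family containing an NA never qualifies.
ok <- !is.na(Personal$english) & Personal$english > 1

# Per family: TRUE only if every member is TRUE (one pass over Personal).
all_ok <- tapply(ok, Personal$id, all)

# Look up each Housing row's family by ID and set the flag to 1.
# Families missing from Personal give NA here and are left unchanged.
hit <- all_ok[as.character(Housing$id)]
Housing$family_english[!is.na(hit) & hit] <- 1

print(Housing)
```

On the example data this flags only family 3, which matches the Housing table in the question. Because `tapply` groups by ID rather than scanning for consecutive rows, it does not rely on the Personal record being sorted by family.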