Get the median implicitly in R from 2 columns
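I'm trying to figure out how to manipulate the data here. The picture shows only one course, but I have multiple courses and course numbers, ranging from 2010 to 2017. How should I add a column that shows the median grade for a course by year, taught, and term? We have the number of kids who received each grade, but not the actual grades. I expect the median grade column to contain 11 repeated entries, one for each of the 11 different grades, per value of the "taught" variable. Taught can only have two values, "here" or "there".
I have tried using the aggregate function, but this does not seem to be a problem that a high-level function can solve. The database is DBKids in R. I can't think of an approach that would help me with this. Thanks!
Edit: reproducible code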
structure(list(sessionYear = c(2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2010, 2010, 2010), courseNumber = c("20", "20",
"20", "20", "20", "20", "20", "20", "20", "20", "20", "20", "20",
"20", "20", "20", "20", "20", "20", "20", "20", "20"),
courseName = c("KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn",
"KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn",
"KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn",
"KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn",
"KidsLearn", "KidsLearn", "KidsLearn"), Taught = c("There",
"Here", "There", "Here", "There", "Here", "There",
"Here", "There", "Here", "There", "Here", "There",
"Here", "There", "Here", "There", "Here", "There",
"Here", "There", "Here"), Term = c("1", "1", "1", "1", "1",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",
"1", "1", "1", "1"), averageGrade = c(83, 84, 83, 84, 83, 84,
83, 84, 83, 84, 83, 84, 83, 84, 83, 84, 83, 84, 83, 84, 83, 84
), Grade = c("F", "F", "D", "D", "C3", "C3", "C2", "C2", "C1",
"C1", "B3", "B3", "B2", "B2", "B1", "B1", "A3", "A3", "A2", "A2",
"A1", "A1"), numberOfKids = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
3, 0, 3, 2, 6, 0, 14, 7, 24, 4, 18, 4)), class = "data.frame", row.names = c(NA,
-22L), .Names = c("sessionYear", "courseNumber", "courseName",
"Taught", "Term", "averageGrade", "Grade", "numberOfKids"))
Hope this helps.
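First, we'll turn Grade into a factor, making sure the levels are in the right order. We can then convert it to numeric so that we have numbers to take the median of. (Here dd is the data frame from the question.)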
levels(factor(dd$Grade))
# [1] "A1" "A2" "A3" "B1" "B2" "B3" "C1" "C2" "C3" "D" "F"
## order seems good
dd$grade_numeric = as.numeric(factor(dd$Grade))
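The default alphabetical ordering works here only because these particular grade labels happen to sort correctly. A minimal sketch, assuming the grade scale is exactly the 11 labels in the sample data, sets the levels explicitly instead; `grade_levels` and `grades` are illustrative names, not from the question.

```r
# Assumed grade scale, best to worst (labels taken from the sample data)
grade_levels <- c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3", "D", "F")

# A small illustrative vector of grades
grades <- c("F", "D", "C3", "A1", "B2")

# Explicit levels make the numeric codes independent of alphabetical sorting
grade_factor <- factor(grades, levels = grade_levels)
grade_numeric <- as.numeric(grade_factor)

print(grade_numeric)
```

With explicit levels, a label outside the scale becomes NA rather than silently shifting the codes of the other grades.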
Now we compute the median by group, weighted by the number of kids, round it to the nearest integer, and convert it back to a letter grade.
library(dplyr)
group_by(dd, sessionYear, Taught, Term) %>%
  mutate(med = spatstat::weighted.median(x = grade_numeric, w = numberOfKids),
         med = round(med),
         median_Grade = levels(factor(Grade))[med]) %>%
  print.data.frame
# sessionYear courseNumber courseName Taught Term averageGrade Grade numberOfKids grade_numeric med median_Grade
# 1 2010 20 KidsLearn There 1 83 F 1 11 2 A2
# 2 2010 20 KidsLearn Here 1 84 F 0 11 2 A2
# 3 2010 20 KidsLearn There 1 83 D 0 10 2 A2
# 4 2010 20 KidsLearn Here 1 84 D 0 10 2 A2
# 5 2010 20 KidsLearn There 1 83 C3 1 9 2 A2
# 6 2010 20 KidsLearn Here 1 84 C3 0 9 2 A2
# 7 2010 20 KidsLearn There 1 83 C2 1 8 2 A2
# 8 2010 20 KidsLearn Here 1 84 C2 0 8 2 A2
# 9 2010 20 KidsLearn There 1 83 C1 0 7 2 A2
# 10 2010 20 KidsLearn Here 1 84 C1 0 7 2 A2
# 11 2010 20 KidsLearn There 1 83 B3 3 6 2 A2
# 12 2010 20 KidsLearn Here 1 84 B3 0 6 2 A2
# 13 2010 20 KidsLearn There 1 83 B2 3 5 2 A2
# 14 2010 20 KidsLearn Here 1 84 B2 2 5 2 A2
# 15 2010 20 KidsLearn There 1 83 B1 6 4 2 A2
# 16 2010 20 KidsLearn Here 1 84 B1 0 4 2 A2
# 17 2010 20 KidsLearn There 1 83 A3 14 3 2 A2
# 18 2010 20 KidsLearn Here 1 84 A3 7 3 2 A2
# 19 2010 20 KidsLearn There 1 83 A2 24 2 2 A2
# 20 2010 20 KidsLearn Here 1 84 A2 4 2 2 A2
# 21 2010 20 KidsLearn There 1 83 A1 18 1 2 A2
# 22 2010 20 KidsLearn Here 1 84 A1 4 1 2 A2
In this example there are only 2 groups (Term and Year have only one value each), and both have a median grade of A2. (Scroll right to see the added columns.)
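If an extra package for the weighted median is not wanted, the same idea can be sketched in base R by repeating each grade code by its count and taking an ordinary median. This is a sketch under stated assumptions, not the method used above: `count_median` and the `toy` data frame are hypothetical, and because the tie and interpolation rules differ, its result may not always match `weighted.median`.

```r
# Hypothetical helper: median of numeric codes x, where code x[i] occurs n[i] times.
# With an even total, median() averages the two middle codes, so the result
# can end in .5 and may still need rounding before mapping back to a grade.
count_median <- function(x, n) {
  median(rep(x, times = n))
}

# Toy data (not from the question): grade codes with counts per group
toy <- data.frame(
  Taught = c("Here", "Here", "Here", "There", "There", "There"),
  code   = c(1, 2, 3, 1, 2, 3),
  kids   = c(4, 1, 0, 1, 1, 5)
)

# One median per group, computed with base R split/sapply
by_group <- sapply(split(toy, toy$Taught),
                   function(g) count_median(g$code, g$kids))
print(by_group)
```

Expanding the counts is simple but uses memory proportional to the total number of kids, which is fine for class-sized data.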
So each entry of numberOfKids is the number of kids who received the corresponding grade in Grade? You can get the median "by hand" with:
get_median = function(numberOfKids, Grade){
  current_count = 0
  previous_grade = NULL
  middle = (sum(numberOfKids) + 1) / 2
  for (i in seq_along(numberOfKids)){
    # skip grades nobody received so they can never be returned
    if (numberOfKids[i] == 0) next
    previous_count = current_count
    current_count = current_count + numberOfKids[i]
    # if we're exactly halfway through the class, return the current grade
    if (current_count == middle) return(Grade[i])
    # if we're more than halfway through the class, the middle falls in the
    # current grade unless it sits exactly between it and the previous one (a tie)
    if (current_count > middle){
      if ((middle - previous_count) > 0.5) return(Grade[i])
      return(previous_grade)
    }
    previous_grade = Grade[i]
  }
}
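Normally there is a single median, and if there is a "tie" you take the average of the two values. But you can't really take the average of two grades, so you have to decide which one to pick. With this function, an exact tie resolves to the lower grade. You can change that by changing the last ">" to ">=".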
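The tie rule can also be made an explicit argument rather than a choice of comparison operator. The following is a minimal sketch, assuming the grades are sorted from lowest to highest, the counts are aligned with them, and at least one kid is counted; `median_grade` and its `tie` argument are hypothetical names, not part of the answer above.

```r
# Hypothetical helper: counts[i] kids received grades[i]; grades sorted low to high
median_grade <- function(counts, grades, tie = c("lower", "upper")) {
  tie <- match.arg(tie)
  total <- sum(counts)
  cum <- cumsum(counts)
  # 1-based position of the middle kid; with an even total there are two
  # middle kids, and `tie` decides which one is used
  pos <- if (tie == "lower") floor((total + 1) / 2) else ceiling((total + 1) / 2)
  # first grade whose cumulative count reaches that position
  grades[which(cum >= pos)[1]]
}

grades <- c("C", "B", "A")
print(median_grade(c(1, 9, 1), grades))            # odd total, no tie
print(median_grade(c(2, 0, 2), grades, "lower"))   # tie between C and A
print(median_grade(c(2, 0, 2), grades, "upper"))
```

Because the lookup uses cumulative counts, a grade that nobody received is never returned, even when it sits between the two tied grades.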