Test for set inclusion and process data simultaneously in tidyverse

Question

I almost have what I need. I need some help with one last detail! The dataset is produced by the following:

stu_vec <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
college_vec <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC", "USC", "Clemson", "Winthrop", "Allen")
sctcs <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")
Student <- sample(stu_vec, size = 100, replace = TRUE, prob = c(.08, .09, .06, .07, .12, .10, .07, .05, .11, .05))
College <- sample(college_vec, size = 100, replace = TRUE, prob = c(.08, .07, .13, .12, .11, .06, .05, .08, .02, .08))

test.dat1 <- as.data.frame(cbind(Student, College))

I am using the following code to create what I need:

library(dplyr)

set.seed(29)
test.dat2 <- test.dat1 %>%
  group_by(Student, .drop = FALSE) %>% # group by student
  mutate(semester = sequence(n())) %>% # set semester sequence
  summarise(
    home_school = College[min(which(College %in% sctcs))], # find first college in sctcs
    seq_home = min(which(College %in% sctcs)), # add column of sequence values
    # new_school should be the first non-sctcs school after the sctcs school
    # is found, or the last school for that student.
    new_school = if_else(n_distinct(College) > 1,
                         first(College[!(College %in% sctcs) & semester > seq_home]),
                         last(College))
  )

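It produces the following table

I would like the NAs to be filled in with that student's last college. I don't know how to get rid of the NAs. If you know a simpler way to produce the same thing, please share.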

Answer 1

It's not clear what you are trying to do. But when !(College %in% sctcs) & semester > seq_home is FALSE for every row, College[!(College %in% sctcs) & semester > seq_home] returns a zero-length character vector, so first(College[!(College %in% sctcs) & semester > seq_home]) returns NA.

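When there is no TRUE value in !(College %in% sctcs) & semester > seq_home, it is because there is no non-sctcs college in any semester after semester[seq_home]. If a student moves from home_school to one or more sctcs schools but never to any non-sctcs school, you will get an NA value.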
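To make that concrete, here is a minimal sketch with one made-up student (the College values below are hypothetical, not taken from your sample) who only ever attends sctcs schools. It assumes dplyr is installed and reuses your sctcs vector.

```r
library(dplyr)

sctcs <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")

# Hypothetical student who never leaves the sctcs system
College  <- c("ATC", "DTC", "GTC")
semester <- seq_along(College)
seq_home <- min(which(College %in% sctcs))  # position of the first sctcs school

# Logical index used in the question: non-sctcs AND after the home school
keep <- !(College %in% sctcs) & semester > seq_home

matches <- College[keep]   # nothing is selected: zero-length character vector
result  <- first(matches)  # first() of an empty vector is a missing value

print(keep)
print(length(matches))
print(result)
```

Every element of keep is FALSE here, so the subset is empty and first() has nothing to return except NA.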

You are effectively asking the wrong question. I'm not sure what question you mean to ask, but what you are asking right now is:

What's the first non-sctcs school this student attended after they attended their first sctcs school?

However, some students never attend a non-sctcs school after attending their first sctcs school. So you get NA back, which is the correct answer to that question.

Answer 2

This should do it:

test.dat2 <- test.dat1 |>
  group_by(Student, .drop = FALSE) |>
  # Additional mutate step for clarity & simplicity
  mutate(semester = sequence(n()),                    # per-student semester counter
         seq_home = which(College %in% sctcs)[1]) |>  # NA if no sctcs college at all
  summarise(home_school = College[first(seq_home)],
            new_school =
              College[
                coalesce(
                  # first non-sctcs college after the home school, if any
                  first(which(!(College %in% sctcs) & semester > seq_home)),
                  # otherwise the home school itself
                  first(seq_home),
                  # otherwise the last college attended
                  length(College))
              ]
            )

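We are using coalesce() to build the index into College; it returns the first non-missing value among its arguments. Initially, we look for the first non-sctcs college they attended after home_school. If that returns NA (i.e. there is no such college), we return seq_home, which gives their home school, that is, the first sctcs college they attended. If that is also NA (which will be the case if they never attended any sctcs college), we return length(College), which, when used to subset College, gives us the last college they attended.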
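As a small illustration of that fallback chain, here is a sketch with hand-picked index values (the student and the names first_non_sctcs, idx and idx_none are hypothetical, for illustration only). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical student who only attends sctcs schools
College <- c("ATC", "DTC", "GTC")

first_non_sctcs <- NA_integer_  # no non-sctcs college after the home school
seq_home        <- 1L           # home school is the first entry

# coalesce() walks its arguments left to right and keeps the first non-NA
idx <- coalesce(first_non_sctcs, seq_home, length(College))
print(College[idx])

# If the home-school index is missing too, the last position is used
idx_none <- coalesce(NA_integer_, NA_integer_, length(College))
print(College[idx_none])
```

All three arguments are integers on purpose: coalesce() needs its inputs to share a common type, so mixing in a character or logical fallback would not work here.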

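I'm still not 100% clear whether this is exactly what you want; I don't know if you have considered the case where a student has no sctcs college at all. There are none on this seed, but it could easily happen.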
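If that case matters to you, here is a rough sketch of how the two common ways of finding the first match behave when a student has no sctcs college. The College values are made up, and the names hits, seq_min and seq_idx are just for illustration; this is base R only.

```r
sctcs <- c("ATC", "CCTC", "DTC", "FDTC", "GTC", "NETC")

# Hypothetical student who never attends an sctcs school
College <- c("USC", "Clemson", "Allen")

hits <- which(College %in% sctcs)  # integer(0): no matches

# min() over an empty vector warns and returns Inf, which is not NA
seq_min <- suppressWarnings(min(hits))

# Taking the first element instead yields a proper missing value
seq_idx <- hits[1]

print(seq_min)
print(seq_idx)
print(College[seq_idx])
```

The difference matters for anything that checks for missing values, such as coalesce(): Inf is not missing, so it would not be skipped, while NA would be.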

