How to re-structure a table into a specific template format in R?

Question

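I have a dataset containing the results of a survey. Suppose a survey was sent to thousands of employees belonging to many different companies. I've processed the results, found some errors in the submitted surveys, and now want to send each employee a custom error summary so they can correct those errors.

To send these summaries, we use software that sends customized emails from a template in which you can specify custom fields.

For example:

Dear (name),

We found a total of (number_of_errors) errors in the surveys submitted by (company_name). Please find these below: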

(error_1_description)

(error_1_survey_IDs)

(error_2_description)

(error_2_survey_IDs)

(error_3_description)

(error_3_survey_IDs)

(error_4_description)

(error_4_survey_IDs)

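Once sent, the recipient sees a summary specific to their company, for example:

Dear Steve,

We found a total of 20 errors in the surveys submitted by Amazon. Please find these below:

Error in question 1. Affected survey IDs:

00100A, 00100B, 00100C

Error in question 2. Affected survey IDs:

00100A, 00100B

Error in question 3. Affected survey IDs:

00100A

Error in question 4. Affected survey IDs:

00100B, 00100C

My problem is that I need to re-structure the error summary into the template format the software accepts, and I'm struggling to work out how.

The table containing the error summary can be recreated with the following code: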

error_summary <- structure(list(organisation = c("Amazon", "Amazon", "Amazon", 
"Amazon", "Amazon", "Amazon", "Amazon", "Amazon", "Amazon", "Google", 
"Google", "Google", "Google", "Google", "Google", "Google", "Google", 
"Google", "Google", "Google", "Google", "Google", "Google", "Facebook", 
"Facebook", "Facebook", "Facebook", "Facebook", "Facebook", "Facebook", 
"Facebook", "Facebook", "Facebook", "Facebook", "Facebook", "Facebook"
), questionnaire_id = c("00100A", "00100A", "00100A", "00100B", 
"00100C", "00100C", "00100C", "00100D", "00100D", "00100E", "00100E", 
"00100E", "00100F", "00100G", "00100G", "00100G", "00100H", "00100H", 
"00100H", "00100H", "00100H", "00100J", "00100J", "00100K", "00100K", 
"00100K", "00100K", "00100L", "00100L", "00100L", "00100L", "00100M", 
"00100M", "00100M", "00100M", "00100M"), error_message = c("error found in question 1", 
"error found in question 2", "error found in question 4", "error found in question 1", 
"error found in question 2", "error found in question 5", "error found in question 6", 
"error found in question 1", "error found in question 2", "error found in question 1", 
"error found in question 2", "error found in question 4", "error found in question 1", 
"error found in question 2", "error found in question 5", "error found in question 6", 
"error found in question 1", "error found in question 2", "error found in question 3", 
"error found in question 4", "error found in question 5", "error found in question 5", 
"error found in question 6", "error found in question 1", "error found in question 2", 
"error found in question 4", "error found in question 5", "error found in question 2", 
"error found in question 5", "error found in question 6", "error found in question 7", 
"error found in question 2", "error found in question 3", "error found in question 4", 
"error found in question 5", "error found in question 6")), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -36L), spec = structure(list(
    cols = list(organisation = structure(list(), class = c("collector_character", 
    "collector")), questionnaire_id = structure(list(), class = c("collector_character", 
    "collector")), error_message = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

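This table has multiple rows per company and only three columns.

For the template, the data needs to be re-structured so that each company has exactly one row, with each error and the list of survey IDs containing that error in their own separate columns.

For example, the result for the data above would ideally look like this, where each column corresponds to a custom field that can be referenced in the body of the template: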

end_goal_template <- structure(list(organisation = c("Amazon", "Google", "Facebook"
), error_1 = c("error found in question 1", "error found in question 1", 
"error found in question 1"), error_1_survey_IDs = c("00100A 00100B 00100D", 
NA, "00100K"), error_2 = c("error found in question 2", "error found in question 2", 
"error found in question 2"), error_2_survey_IDs = c("00100A 00100C 00100D", 
NA, "00100K 00100L 00100M"), error_3 = c("error found in question 3", 
"error found in question 3", "error found in question 3"), error_3_survey_IDs = c(NA, 
NA, "00100M"), error_4 = c("error found in question 4", "error found in question 4", 
"error found in question 4"), error_4_survey_IDs = c("00100A", 
NA, "00100K 00100M"), error_5 = c("error found in question 5", 
"error found in question 5", "error found in question 5"), error_5_survey_IDs = c("00100C", 
NA, "00100K 00100L 00100M"), error_6 = c("error found in question 6", 
"error found in question 6", "error found in question 6"), error_6_survey_IDs = c("00100C", 
NA, "00100L 00100M"), error_7 = c("error found in question 7", 
"error found in question 7", "error found in question 7"), error_7_survey_IDs = c(NA, 
NA, "00100L")), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), spec = structure(list(cols = list(
    organisation = structure(list(), class = c("collector_character", 
    "collector")), error_1 = structure(list(), class = c("collector_character", 
    "collector")), error_1_survey_IDs = structure(list(), class = c("collector_character", 
    "collector")), error_2 = structure(list(), class = c("collector_character", 
    "collector")), error_2_survey_IDs = structure(list(), class = c("collector_character", 
    "collector")), error_3 = structure(list(), class = c("collector_character", 
    "collector")), error_3_survey_IDs = structure(list(), class = c("collector_character", 
    "collector")), error_4 = structure(list(), class = c("collector_character", 
    "collector")), error_4_survey_IDs = structure(list(), class = c("collector_character", 
    "collector")), error_5 = structure(list(), class = c("collector_character", 
    "collector")), error_5_survey_IDs = structure(list(), class = c("collector_character", 
    "collector")), error_6 = structure(list(), class = c("collector_character", 
    "collector")), error_6_survey_IDs = structure(list(), class = c("collector_character", 
    "collector")), error_7 = structure(list(), class = c("collector_character", 
    "collector")), error_7_survey_IDs = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1), class = "col_spec"))

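With all the information in a single row, the software can replace the custom fields in the template with the values from that row, producing the customized emails (the template would also have an email field, but I've left it out for demonstration purposes).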
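As a rough sketch of what that substitution amounts to, one row of the wide table can be dropped into the placeholders as shown below. This is not the mail software's actual mechanism: `fill_template` and the shortened template text are hypothetical, made up only to preview the result in R. The values are taken from the Amazon row of `end_goal_template`.

```r
# Hypothetical helper (not part of any package): replaces every "(field)"
# placeholder in the template with the matching value from one row.
fill_template <- function(template, row) {
  for (field in names(row)) {
    value <- row[[field]]
    if (is.na(value)) value <- ""  # assumption: empty fields render as blank
    # fixed = TRUE treats the parentheses literally instead of as regex groups
    template <- gsub(paste0("(", field, ")"), value, template, fixed = TRUE)
  }
  template
}

# Shortened, illustrative template using the column names as field names
template <- paste(
  "We found errors in the surveys submitted by (organisation).",
  "(error_1)",
  "(error_1_survey_IDs)",
  sep = "\n"
)

# One row of the wide table, written out as a named list
row <- list(
  organisation = "Amazon",
  error_1 = "error found in question 1",
  error_1_survey_IDs = "00100A 00100B 00100D"
)

email <- fill_template(template, row)
cat(email, "\n")
```

Because the closing parenthesis is part of the search string, `(error_1)` does not accidentally match inside `(error_1_survey_IDs)`.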

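I've been trying to solve this all day, mainly with tidyr functions like pivot_wider, but I feel it's beyond my current ability. Any pointers or ideas would be greatly appreciated, thanks!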

Answer 1

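Here is a pivot_wider solution. The columns are not in the same order as your template (and the names are not exactly the same), but this should get you 90% of the way there.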

library(tidyverse)
error_summary %>%
  # collapse the survey IDs for each organisation/error pair into one string
  group_by(organisation, error_message) %>%
  summarise(survey_IDs = paste(questionnaire_id, collapse = " "),
            .groups = "drop") %>%
  # "error found in question 1" -> "error_1", used to build the column names
  mutate(error = gsub(" found in question ", "_", error_message)) %>%
  rename(message = error_message) %>%
  # one row per organisation, with a message and a survey_IDs column per error
  pivot_wider(id_cols = organisation, names_from = error,
              values_from = c(message, survey_IDs),
              names_glue = "{error}_{.value}")
Answer 2

Here is a data.table solution:

library(data.table)
end_goal <- dcast(data.table(error_summary), organisation ~ error_message, 
    value.var="questionnaire_id", fun.aggregate = paste, collapse=", ", fill=NA)
# rename "error found in question N" to "error_N_survey_IDs"
setnames(end_goal, 
    sub("(error found in question )(.*)", "error_\\2_survey_IDs", colnames(end_goal)))
cn <- colnames(end_goal)[-1]
end_goal[,(sub("_survey_IDs", "", cn)):=data.table(t(cn))]
setcolorder(end_goal,  c("organisation", 
    as.vector(t(matrix(colnames(end_goal)[-1], ncol=2)[, 2:1]))))[]

r

tidyr

data-wrangling