How to set a new column based on multiple conditions in data.table?

Question

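I am trying to collect directory information based on a text search: search for a string in the column Text and put a description into a new column, C_Organization.

Sample data: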

# load packages:
pacman::p_load("data.table",
               "stringr")

# make sample data:
DE <- data.table(c("John", "Sussan", "Bill"),
                 c("Text contains MIT", "some text with Stanford University", "He graduated from Yale"))

colnames(DE) <- c("Name", "Text")

> DE
     Name                               Text
1:   John                  Text contains MIT
2: Sussan some text with Stanford University
3:   Bill             He graduated from Yale

Search for each string and create a new data.table with the new column:

mit <- DE[str_detect(DE$Text, "MIT"), .(Name, C_Organization = "MIT")]
yale <- DE[str_detect(DE$Text, "Yale"), .(Name, C_Organization = "Yale")]
stanford <- DE[str_detect(DE$Text, "Stanford"), .(Name, C_Organization = "Stanford")]

# bind them together:
combine_table <- rbind(mit, yale, stanford)

combine_table

     Name C_Organization
1:   John            MIT
2:   Bill           Yale
3: Sussan       Stanford

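This select-and-combine approach works, but it is tedious. Can data.table do it in one step?

Edit

My data-analysis experience is limited and the data is not clean, so I need to clarify the question.

The real data is a bit more complicated:

(1) Sometimes a person belongs to two or more organizations, e.g. Jack, UC Berkeley, Bell lab.

(2) The same person at the same organization can appear in different years, e.g. Steven, MIT, 2011 and Steven, MIT, 2014.

I want to find out:

(1) How many people belong to each organization. If a person belongs to several organizations, the organization that appears most often (i.e. the most popular one) should be taken as his organization. For example, for John, MIT, AMS, Bell lab: if MIT appears 30 times, AMS 12 times, and Bell lab 26 times, then MIT is set as his organization.

(2) How many people there are in each year. This is not directly related to my original question, but I do not want to discard these records because I need them for later calculations.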

Answer 1

Another solution, which takes multiple matches within one text into account, operates by row and pastes the matches together:

uni <- c("MIT","Yale","Stanford")
DE[,idx:=.I][, c_org := paste(uni[str_detect(Text, uni)], collapse=","), idx]

This gives:

> DE
     Name                                   Text idx             c_org
1:   John                      Text contains MIT   1               MIT
2: Sussan     some text with Stanford University   2          Stanford
3:   Bill He graduated from Yale, MIT, Stanford.   3 MIT,Yale,Stanford
4:   Bill                              some text   4
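If every text is expected to contain at most one organization, as in the question's original three-row sample, a regular-expression alternation is a shorter one-step alternative. The following is only a sketch under that assumption: `DE1` is a hypothetical copy of the question's sample data, and `str_extract` returns just the first match, so it does not replace the row-wise approach above when a text mentions several organizations.

```r
library(data.table)
library(stringr)

uni <- c("MIT", "Yale", "Stanford")

# Hypothetical copy of the question's original sample data
DE1 <- data.table(
  Name = c("John", "Sussan", "Bill"),
  Text = c("Text contains MIT",
           "some text with Stanford University",
           "He graduated from Yale")
)

# Build the pattern "MIT|Yale|Stanford"; str_extract() returns the first
# match in each text, or NA when none of the keywords occurs
DE1[, C_Organization := str_extract(Text, paste(uni, collapse = "|"))]
```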

The advantage of operating by row becomes clear when the same name occurs more than once in Name. If you group by Name instead:

DE[, uni[str_detect(Text, uni)], Name]

you get an incorrect result:

     Name       V1
1:   John      MIT
2: Sussan Stanford
3:   Bill      MIT
4:   Bill Stanford

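=> You cannot tell which Bill the fourth row belongs to. Moreover, Yale is missing for the 'first' Bill (i.e. row 3 of the original dataset). Within the Bill group, Text holds two strings, and these are compared element-wise against the three patterns in uni instead of each text being checked against every pattern. Recent stringr versions may refuse to recycle these lengths and raise an error instead.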
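Grouping by the row index and carrying Name along avoids both problems. As a sketch, the same detection can return one row per match (a "long" table) instead of a pasted string, which is usually easier to count afterwards. The names `DE2` and `long` are introduced here for illustration only.

```r
library(data.table)
library(stringr)

uni <- c("MIT", "Yale", "Stanford")

# Hypothetical copy of the answer's four-row data
DE2 <- data.table(
  Name = c("John", "Sussan", "Bill", "Bill"),
  Text = c("Text contains MIT",
           "some text with Stanford University",
           "He graduated from Yale, MIT, Stanford.",
           "some text")
)

DE2[, idx := .I]

# One group per original row: Text has length 1 inside each group, so it is
# checked against every pattern in uni. Rows without a match produce no
# output rows.
long <- DE2[, .(c_org = uni[str_detect(Text, uni)]), by = .(idx, Name)]
```

Here the second Bill (row 4) simply drops out because his text matches nothing, while the first Bill keeps all three organizations, each tied to idx 3.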

Data used:

DE <- data.table(Name = c("John", "Sussan", "Bill", "Bill"),
                 Text = c("Text contains MIT",
                          "some text with Stanford University",
                          "He graduated from Yale, MIT, Stanford.",
                          "some text"))
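For the follow-up in the question's edit, one possible sketch starts from a long table with one row per matched record. Everything here is an assumption made for illustration: the table `rec`, its `Year` column, and the sample values are hypothetical, and "most popular" is read as the organization with the highest overall count in the data.

```r
library(data.table)

# Hypothetical long-format records: one row per (person, organization, year)
rec <- data.table(
  Name  = c("John", "John", "John", "Steven", "Steven", "Jack", "Jack"),
  c_org = c("MIT", "Bell lab", "MIT", "MIT", "MIT", "UC Berkeley", "Bell lab"),
  Year  = c(2011, 2012, 2014, 2011, 2014, 2013, 2013)
)

# Overall number of records per organization
org_counts <- rec[, .(total = .N), by = c_org]

# Attach the overall count to every record (update join)
rec[org_counts, on = "c_org", total := i.total]

# Per person, keep the organization with the highest overall count
main_org <- rec[order(-total), .(c_org = c_org[1]), by = Name]

# (1) Number of people per organization, each person counted once
people_per_org <- main_org[, .(n_people = .N), by = c_org]

# (2) Number of distinct people per year, using all records
people_per_year <- rec[, .(n_people = uniqueN(Name)), by = Year]
```

Because `rec` itself is never filtered, the per-year records stay available for later calculations; only `main_org` collapses each person to a single organization.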


r

data.table