Pattern matching

pattern matching R

ca.df

id    Category
1     Noun
2     Negative
3     Positive
4     adj
5     word

Each term is assigned to one or more categories, so it corresponds to one or more ids. In terms.df, all of a term's ids are stored in a single column.

terms.df

Terms   id
 Love    1 4 5 3
 Hate    2 4 5
 ice     1 5

The ids in terms.df correspond to the categories in ca.df. I would like output like this:

x.df

Category      terms

Noun          ice Love
Negative      Hate
Positive      Love
adj           Hate Love
word          ice Hate Love

How can I do this?

You can use merge to combine the two data frames by id:

ca.df <- data.frame(id=1:5, Category=c("Noun", "Negative", "Positive", "adj", "word"))
terms.df <- data.frame(Terms=c(rep("Love", 3), rep("Hate", 3), rep("ice", 2)), 
        id = c(1,4,5,2,4,5,1,5))
x.df <- merge(ca.df, terms.df, by="id")
x.df

  id Category Terms
1  1     Noun  Love
2  1     Noun   ice
3  2 Negative  Hate
4  4      adj  Love
5  4      adj  Hate
6  5     word  Love
7  5     word  Hate
8  5     word   ice
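
The merged result still has one row per (id, term) pair rather than one row per category. A minimal sketch of collapsing it into the layout asked for in the question, assuming base R's aggregate() is acceptable; the name `collapsed` is only illustrative, and the data is the same long-format example used above (so no term carries id 3 and Positive does not appear):

```r
ca.df <- data.frame(id = 1:5,
                    Category = c("Noun", "Negative", "Positive", "adj", "word"),
                    stringsAsFactors = FALSE)
terms.df <- data.frame(Terms = c(rep("Love", 3), rep("Hate", 3), rep("ice", 2)),
                       id = c(1, 4, 5, 2, 4, 5, 1, 5),
                       stringsAsFactors = FALSE)
x.df <- merge(ca.df, terms.df, by = "id")

# Paste together all terms that share a category (one row per category)
collapsed <- aggregate(Terms ~ Category, data = x.df,
                       FUN = paste, collapse = " ")
collapsed
```

The row order of `collapsed` follows the sort order of Category, which can depend on the locale.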

A solution using tidyr and dplyr.

library(tidyr)
library(dplyr)
ca.df$id <- as.character(ca.df$id)

terms.df %>% separate(id,into=paste0("V",1:3),sep = " ",extra = "merge") %>%
  gather(var,id,-Terms) %>%
  filter(!is.na(id)) %>%
  left_join(ca.df,by="id") %>%
  select(-var,-id) %>%
  group_by(Category) %>%
  summarize(Terms=paste(Terms,collapse=" "))

Output:

Source: local data frame [4 x 2]

      Category         Terms
    1 Negative          Hate
    2     Noun      Love ice
    3      adj     Love Hate
    4     word ice Love Hate
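
Note that splitting into three fixed columns assumes no term has more than three ids, whereas Love in the question has four. A possible variant that avoids fixing the number of columns, assuming a tidyr version that provides separate_rows(); the name `by.category` is illustrative and the data below is retyped from the question:

```r
library(tidyr)
library(dplyr)

ca.df <- data.frame(id = 1:5,
                    Category = c("Noun", "Negative", "Positive", "adj", "word"),
                    stringsAsFactors = FALSE)
terms.df <- data.frame(Terms = c("Love", "Hate", "ice"),
                       id = c("1 4 5 3", "2 4 5", "1 5"),
                       stringsAsFactors = FALSE)

by.category <- terms.df %>%
  # one row per (term, id) pair; convert = TRUE turns the ids into integers
  separate_rows(id, sep = " ", convert = TRUE) %>%
  # attach the category name for each id
  inner_join(ca.df, by = "id") %>%
  group_by(Category) %>%
  # collapse the terms of each category into one string
  summarise(Terms = paste(Terms, collapse = " "))
by.category
```

Because the ids are converted to integers here, ca.df$id does not need to be coerced to character first.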

Data:

ca.df <- read.table(text = 
"id    Category
1     Noun
2     Negative
3     Positive
4     adj
5     word",head=TRUE,stringsAsFactors=FALSE)

terms.df <- read.table(text = 
"Terms   id
Love    '1 4 5'
Hate    '2 4 5'
ice     '1 5'
",head=TRUE,stringsAsFactors=FALSE)

Here is a possible solution using the data.table/splitstackshape packages

library(splitstackshape) ## loads `data.table` package too
terms.df <- cSplit(terms.df, "id", sep = " ", direction = "long")
setkey(terms.df, id)[ca.df, .(Category , Terms = toString(Terms)), by = .EACHI]

#    id Category           Terms
# 1:  1     Noun       Love, ice
# 2:  2 Negative            Hate
# 3:  3 Positive            Love
# 4:  4      adj      Love, Hate
# 5:  5     word Love, Hate, ice

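Some explanations

  1. We first split the id column on spaces, producing one row per id for each entry in Terms
  2. Then we perform a binary left join between the two data sets on the id column
  3. While joining, we concatenate the Terms column for each matched id using by = .EACHI, which lets us run an operation per join group while the join is being performed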
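
For readers without data.table, the same split, join, collapse sequence can be sketched in base R. This is not the data.table code above, only an illustration of the same steps; `id.list`, `long.df`, `joined` and `result` are illustrative names, and the data is retyped from the question:

```r
ca.df <- data.frame(id = 1:5,
                    Category = c("Noun", "Negative", "Positive", "adj", "word"),
                    stringsAsFactors = FALSE)
terms.df <- data.frame(Terms = c("Love", "Hate", "ice"),
                       id = c("1 4 5 3", "2 4 5", "1 5"),
                       stringsAsFactors = FALSE)

# Step 1: split each space-separated id string, giving one row per (term, id) pair
id.list <- strsplit(terms.df$id, " ", fixed = TRUE)
long.df <- data.frame(Terms = rep(terms.df$Terms, lengths(id.list)),
                      id = as.integer(unlist(id.list)),
                      stringsAsFactors = FALSE)

# Step 2: join on id to attach the category name to each pair
joined <- merge(ca.df, long.df, by = "id")

# Step 3: collapse the (sorted) terms within each category
result <- aggregate(Terms ~ Category, data = joined,
                    FUN = function(x) paste(sort(x), collapse = " "))
result
```

Unlike the data.table version, this builds the full joined table before aggregating, which is fine for small inputs but does more copying on large ones.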