Finding Combinations of Words in Articles in R

I would like to find the different combinations of "Words" by article and year. Any ideas?

I have a dataset that looks like this:

Year     Article     Word
2013    Article1    WordA
2013    Article1    WordB
2013    Article2    WordC
2013    Article2    WordD
2013    Article2    WordA
2014    Article1    WordC
2014    Article1    WordA
2014    Article4    WordE
2014    Article4    WordD
2014    Article4    WordB

I would like the result to look like this:

Year    Article    Source   Target
2013    Article1    WordA   WordB
2013    Article1    WordB   WordA
2013    Article2    WordC   WordD
2013    Article2    WordC   WordA
2013    Article2    WordD   WordC
2013    Article2    WordD   WordA
2013    Article2    WordA   WordC
2013    Article2    WordA   WordD
2014    Article1    WordC   WordA
2014    Article1    WordA   WordC
2014    Article4    WordE   WordD
2014    Article4    WordE   WordB
2014    Article4    WordD   WordE
2014    Article4    WordD   WordB
2014    Article4    WordB   WordE
2014    Article4    WordB   WordD

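Thanks!

You could try merge to join the data with itself on 'Year' and 'Article', and then subset the rows where the two 'Word' columns differ.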

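# Self-join: pair every word with every word in the same Year/Article
df2 <- merge(df1, df1, by=c('Year', 'Article'))
# Drop the rows that pair a word with itself
res <- subset(df2, Word.x!=Word.y)
# Reset the row names so they run 1..n after subsetting
row.names(res) <- NULL
res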
# Year  Article Word.x Word.y
#1  2013 Article1  WordA  WordB
#2  2013 Article1  WordB  WordA
#3  2013 Article2  WordC  WordD
#4  2013 Article2  WordC  WordA
#5  2013 Article2  WordD  WordC
#6  2013 Article2  WordD  WordA
#7  2013 Article2  WordA  WordC
#8  2013 Article2  WordA  WordD
#9  2014 Article1  WordC  WordA
#10 2014 Article1  WordA  WordC
#11 2014 Article4  WordE  WordD
#12 2014 Article4  WordE  WordB
#13 2014 Article4  WordD  WordE
#14 2014 Article4  WordD  WordB
#15 2014 Article4  WordB  WordE
#16 2014 Article4  WordB  WordD
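
The merge result above uses the default suffixes Word.x and Word.y, while the desired output in the question names those columns Source and Target. A minimal sketch of the renaming step follows, with a row-count check added for illustration. The names word_pairs, n_words and expected_rows are my own and not part of the original answer, and the data is the same sample as in the question, rebuilt with data.frame().

```r
# Sample data from the question, rebuilt with data.frame()
df1 <- data.frame(
  Year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 2014L, 2014L),
  Article = c("Article1", "Article1", "Article2", "Article2", "Article2",
              "Article1", "Article1", "Article4", "Article4", "Article4"),
  Word = c("WordA", "WordB", "WordC", "WordD", "WordA",
           "WordC", "WordA", "WordE", "WordD", "WordB"),
  stringsAsFactors = FALSE
)

# Self-join within each Year/Article group, then drop self-pairs
word_pairs <- merge(df1, df1, by = c("Year", "Article"))
word_pairs <- subset(word_pairs, Word.x != Word.y)

# Rename the suffixed columns to the names used in the desired output
names(word_pairs)[names(word_pairs) == "Word.x"] <- "Source"
names(word_pairs)[names(word_pairs) == "Word.y"] <- "Target"
row.names(word_pairs) <- NULL

# Sanity check (illustrative): an article with n distinct words
# should yield n * (n - 1) ordered pairs
n_words <- table(paste(df1$Year, df1$Article))
expected_rows <- sum(n_words * (n_words - 1))
stopifnot(nrow(word_pairs) == expected_rows)
```

The check assumes each word appears at most once per article and year, which holds for the sample data; with repeated words the merge would produce duplicated pairs.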

A similar option using the devel version of data.table (i.e. v1.9.5) is

library(data.table)#v1.9.5
setDT(df1)[df1, on= c('Year', 'Article'), allow.cartesian=TRUE][Word!=i.Word]
#    Year  Article  Word i.Word
# 1: 2013 Article1 WordB  WordA
# 2: 2013 Article1 WordA  WordB
# 3: 2013 Article2 WordD  WordC
# 4: 2013 Article2 WordA  WordC
# 5: 2013 Article2 WordC  WordD
# 6: 2013 Article2 WordA  WordD
# 7: 2013 Article2 WordC  WordA
# 8: 2013 Article2 WordD  WordA
# 9: 2014 Article1 WordA  WordC
#10: 2014 Article1 WordC  WordA
#11: 2014 Article4 WordD  WordE
#12: 2014 Article4 WordB  WordE
#13: 2014 Article4 WordE  WordD
#14: 2014 Article4 WordB  WordD
#15: 2014 Article4 WordE  WordB
#16: 2014 Article4 WordD  WordB
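
The data.table join above can also be relabelled to match the Source/Target layout from the question. This is only a sketch, assuming an installed data.table version that supports the `on` argument; the names dt and res are illustrative. It works on a copy made with as.data.table() so that df1 itself is left as a plain data.frame. In a join of the form x[i], the column taken from i carries the i. prefix, so i.Word is treated as Source here because that reproduces the pair order shown in the question.

```r
library(data.table)

# Sample data from the question
df1 <- data.frame(
  Year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 2014L, 2014L),
  Article = c("Article1", "Article1", "Article2", "Article2", "Article2",
              "Article1", "Article1", "Article4", "Article4", "Article4"),
  Word = c("WordA", "WordB", "WordC", "WordD", "WordA",
           "WordC", "WordA", "WordE", "WordD", "WordB"),
  stringsAsFactors = FALSE
)

# Work on a data.table copy instead of converting df1 by reference
dt <- as.data.table(df1)

# Self-join on Year and Article; allow.cartesian = TRUE is needed because
# the join returns more rows than the two inputs have combined
res <- dt[dt, on = c("Year", "Article"), allow.cartesian = TRUE][Word != i.Word]

# i.Word comes from the lookup table (i), Word from the joined table (x)
setnames(res, c("i.Word", "Word"), c("Source", "Target"))
setcolorder(res, c("Year", "Article", "Source", "Target"))
res
```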

NOTE: Instructions to install the devel version of data.table are here

data

df1 <- structure(list(Year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 
2014L, 2014L, 2014L, 2014L), Article = c("Article1", "Article1", 
"Article2", "Article2", "Article2", "Article1", "Article1", "Article4", 
"Article4", "Article4"), Word = c("WordA", "WordB", "WordC", 
"WordD", "WordA", "WordC", "WordA", "WordE", "WordD", "WordB"
)), .Names = c("Year", "Article", "Word"), class = "data.frame", 
row.names = c(NA, -10L))