Finding Combinations of Words in Articles in R
I want to find the different combinations of "Words" by article number and year. Any ideas?
I have a dataset that looks like this:
Year Article Word
2013 Article1 WordA
2013 Article1 WordB
2013 Article2 WordC
2013 Article2 WordD
2013 Article2 WordA
2014 Article1 WordC
2014 Article1 WordA
2014 Article4 WordE
2014 Article4 WordD
2014 Article4 WordB
I would like the result to look like this:
Year Article Source Target
2013 Article1 WordA WordB
2013 Article1 WordB WordA
2013 Article2 WordC WordD
2013 Article2 WordC WordA
2013 Article2 WordD WordC
2013 Article2 WordD WordA
2013 Article2 WordA WordC
2013 Article2 WordA WordD
2014 Article1 WordC WordA
2014 Article1 WordA WordC
2014 Article4 WordE WordD
2014 Article4 WordE WordB
2014 Article4 WordD WordE
2014 Article4 WordD WordB
2014 Article4 WordB WordE
2014 Article4 WordB WordD
Thanks!
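You could try merge to join the dataset with itself on 'Year' and 'Article', and then subset the rows in which the two 'Word' columns differ.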
# Self-join df1 on Year and Article: every pairing of words within an article
df2 <- merge(df1, df1, by = c('Year', 'Article'))
# Drop the rows that pair a word with itself
res <- subset(df2, Word.x != Word.y)
row.names(res) <- NULL
res
# Year Article Word.x Word.y
#1 2013 Article1 WordA WordB
#2 2013 Article1 WordB WordA
#3 2013 Article2 WordC WordD
#4 2013 Article2 WordC WordA
#5 2013 Article2 WordD WordC
#6 2013 Article2 WordD WordA
#7 2013 Article2 WordA WordC
#8 2013 Article2 WordA WordD
#9 2014 Article1 WordC WordA
#10 2014 Article1 WordA WordC
#11 2014 Article4 WordE WordD
#12 2014 Article4 WordE WordB
#13 2014 Article4 WordD WordE
#14 2014 Article4 WordD WordB
#15 2014 Article4 WordB WordE
#16 2014 Article4 WordB WordD
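The question asks for columns named Source and Target, whereas merge returns Word.x and Word.y. A minimal sketch of the renaming step follows. It assumes Word.x should become Source and Word.y should become Target; the question does not say which is which, and because every pair appears in both orders, either mapping yields the same set of rows. The name `pairs` is only illustrative.

```r
# Sample data, equivalent to the df1 given in the Data section
df1 <- data.frame(
  Year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 2014L, 2014L),
  Article = c("Article1", "Article1", "Article2", "Article2", "Article2",
              "Article1", "Article1", "Article4", "Article4", "Article4"),
  Word = c("WordA", "WordB", "WordC", "WordD", "WordA",
           "WordC", "WordA", "WordE", "WordD", "WordB"),
  stringsAsFactors = FALSE
)

# Self-join on Year and Article, then drop self-pairs
pairs <- merge(df1, df1, by = c("Year", "Article"))
pairs <- subset(pairs, Word.x != Word.y)

# Rename to the headers used in the question (assumed mapping, see above)
names(pairs)[names(pairs) == "Word.x"] <- "Source"
names(pairs)[names(pairs) == "Word.y"] <- "Target"
row.names(pairs) <- NULL
pairs
```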
A similar option, using the development version of data.table (i.e. v1.9.5), is
library(data.table)#v1.9.5
setDT(df1)[df1, on= c('Year', 'Article'), allow.cartesian=TRUE][Word!=i.Word]
# Year Article Word i.Word
# 1: 2013 Article1 WordB WordA
# 2: 2013 Article1 WordA WordB
# 3: 2013 Article2 WordD WordC
# 4: 2013 Article2 WordA WordC
# 5: 2013 Article2 WordC WordD
# 6: 2013 Article2 WordA WordD
# 7: 2013 Article2 WordC WordA
# 8: 2013 Article2 WordD WordA
# 9: 2014 Article1 WordA WordC
#10: 2014 Article1 WordC WordA
#11: 2014 Article4 WordD WordE
#12: 2014 Article4 WordB WordE
#13: 2014 Article4 WordE WordD
#14: 2014 Article4 WordB WordD
#15: 2014 Article4 WordE WordB
#16: 2014 Article4 WordD WordB
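Note that setDT converts df1 to a data.table by reference, so df1 itself is changed. If that side effect is unwanted, one possible variant is to join a copy and rename the columns to the Source/Target headers from the question. This is only a sketch: it assumes a data.table release that supports the `on` argument, and it assumes i.Word should become Source and Word should become Target. The names `dt` and `pairs_dt` are illustrative.

```r
library(data.table)

# Sample data, equivalent to the df1 given in the Data section
df1 <- data.frame(
  Year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 2014L, 2014L),
  Article = c("Article1", "Article1", "Article2", "Article2", "Article2",
              "Article1", "Article1", "Article4", "Article4", "Article4"),
  Word = c("WordA", "WordB", "WordC", "WordD", "WordA",
           "WordC", "WordA", "WordE", "WordD", "WordB"),
  stringsAsFactors = FALSE
)

# as.data.table() returns a copy, so df1 stays a plain data.frame
dt <- as.data.table(df1)

# Self-join on Year and Article, then drop self-pairs
pairs_dt <- dt[dt, on = c("Year", "Article"), allow.cartesian = TRUE][Word != i.Word]

# Rename and reorder by reference (assumed mapping, see above)
setnames(pairs_dt, c("Word", "i.Word"), c("Target", "Source"))
setcolorder(pairs_dt, c("Year", "Article", "Source", "Target"))
pairs_dt
```

allow.cartesian = TRUE is needed here because the self-join returns more rows than the two inputs have combined, which data.table otherwise treats as a likely mistake.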
Note: instructions for installing the development version of data.table are here
Data
df1 <- structure(list(Year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L,
2014L, 2014L, 2014L, 2014L), Article = c("Article1", "Article1",
"Article2", "Article2", "Article2", "Article1", "Article1", "Article4",
"Article4", "Article4"), Word = c("WordA", "WordB", "WordC",
"WordD", "WordA", "WordC", "WordA", "WordE", "WordD", "WordB"
)), .Names = c("Year", "Article", "Word"), class = "data.frame",
row.names = c(NA, -10L))
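As a quick sanity check on the 16 rows shown in both results: an article with n distinct words contributes n * (n - 1) ordered pairs, so the groups here give 2 + 6 + 2 + 6 = 16. The sketch below counts this directly from the data. It assumes no word is repeated within the same Year and Article, which holds for this sample, and the names `n_words` and `Pairs` are illustrative.

```r
# Sample data, equivalent to the df1 defined above
df1 <- data.frame(
  Year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 2014L, 2014L),
  Article = c("Article1", "Article1", "Article2", "Article2", "Article2",
              "Article1", "Article1", "Article4", "Article4", "Article4"),
  Word = c("WordA", "WordB", "WordC", "WordD", "WordA",
           "WordC", "WordA", "WordE", "WordD", "WordB"),
  stringsAsFactors = FALSE
)

# Number of words per Year/Article group
n_words <- aggregate(Word ~ Year + Article, data = df1, FUN = length)

# Ordered pairs of different words per group: n * (n - 1)
n_words$Pairs <- n_words$Word * (n_words$Word - 1)

sum(n_words$Pairs)  # 16, the row count of the results above
```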