运行 数据帧子集的聚类函数,并将每个子集的结果写入数据帧

Running a clustering function on subsets of dataframe and writing the results for each subset in the dataframe

数据

我在 R 中有一个具有以下结构的数据框:

ID   group           text
100    1    An apple is a sweet, edible fruit produced by an apple tree.
103    1    An apple is a sweet, edible fruit produced by an apple tree.
105    1    Some dog breeds show more variation in size than other dog breeds.
106    1    An apple is a sweet, edible fruit produced by an apple tree.
107    1    An apple is a sweet, edible fruit produced by an apple tree.
209    1    Some dog breeds show more variation in size than other dog breeds.
300    1    Some dog breeds show more variation in size than other dog breeds.
501    1    An apple is a sweet, edible fruit produced by an apple tree.
503    2    Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
711    2    Pizza is a savory dish of Italian origin.
799    2    Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
811    2    Ice cream is a sweetened frozen food typically eaten as a snack or dessert.

可以使用以下代码重现:

test_df <- data.frame(
  "ID" = c(100, 103, 105, 106, 107, 209, 300, 501, 503, 711, 799, 811,),
  "group" = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  "text" = c('An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Some dog breeds show more variation in size than other dog breeds.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Pizza is a savory dish of Italian origin.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.')
)

实际上,每个主题的文本都略有不同,有几十万个,分布在几十个组中。

我想做什么

我正在尝试编写一个执行以下操作的函数:

下面是分析后数据框中两行的示例:

ID   group    topic           text
100    1      apple    An apple is a sweet, edible fruit produced by an apple tree.
105    1       dog     Some dog breeds show more variation in size than other dog breeds.

我目前有什么

我可以使用下面的代码 运行 这种一般的函数,在一个完整的数据帧上(没有按组子集):

# Preparing the texts

library(tm)
corpus <- Corpus(VectorSource(test_df$text))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = 'english')
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Identifying topics

library(topicmodels)
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
lda.output <- LDA(TF, k=2, method = 'Gibbs')

# Inputting the topic classification into the dataframe

test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])

我尝试将它变成一个函数,然后 运行 数据帧上的函数,使用以下代码:

library(tm)
library(topicmodels)

topic_identifier <- function(text) {
  corpus <- Corpus(VectorSource(text))
    corpus <- tm_map(corpus, removeWords, stopwords('english'))
    corpus <- tm_map(corpus, stemDocument, language = 'english')
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
  TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
      lda.output <- LDA(TF, k=2, method = 'Gibbs')
  test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])
    }

by(test_df$text, test_df$group, topic_identifier)

但这不允许我在原始 df 中保存每个子集的相关输出。

by 函数在接收数据帧而不是列向量作为输入时效果最佳,text。然后,您可以为 return 而不是原始 test_df 操作此数据框。本质上,整个数据帧上的相同过程在子集数据帧上保留为 运行。

此外,您需要用 <- 分配结果以构建一个对象列表,这些对象甚至可以在最后用 do.call + rbind 绑定在一起(假设每个数据框架保持相同的列数和名称):

topic_identifier <- function(sub_df) { 
   corpus <- Corpus(VectorSource(sub_df$text)) 
   corpus <- tm_map(corpus, removeWords, stopwords('english')) 
   corpus <- tm_map(corpus, stemDocument, language = 'english') 
   corpus <- tm_map(corpus, removePunctuation) 
   corpus <- tm_map(corpus, stripWhitespace) 

   TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
   lda.output <- LDA(TF, k=2, method = 'Gibbs') 

   sub_df <- cbind(sub_df, terms(lda.output)[topics(lda.output)]) 
   return(sub_df)
} 

# BUILD LIST OF DFs, PASSING IN AND RETURNING A DATA FRAME
df_list <- by(test_df, test_df$group, topic_identifier)

# CONCATENATE ALL DFs INTO SINGLE DF
final_df <- do.call(rbind, unname(df_list))