Create a Corpus from a List of File Paths in R

Question

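I have 1030 individual .txt files in one directory, one for each participant in a study.

I have successfully created a corpus from all of the files in the directory, for use with the tm package in R.

Now I am trying to create corpora from many different subsets of these files, for example one corpus of all the female authors and one of all the male authors.

I was hoping I could pass the Corpus function a subset of the list of file paths, but that has not worked.

Any help is appreciated. Here is an example to build on: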

pathname <- c("C:/Desktop/Samples")

study.files <- list.files(path = pathname, pattern = NULL, all.files = T, full.names = T, recursive = T, ignore.case = T, include.dirs = T) 

### This gives me a character vector that is equivalent to:

study.files <- c("C:/Desktop/Samples/author1.txt","C:/Desktop/Samples/author2.txt","C:/Desktop/Samples/author3.txt","C:/Desktop/Samples/author4.txt","C:/Desktop/Samples/author5.txt")

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2,4,5)

### This creates new character vectors containing the file paths
women.files <- study.files[women]
men.files <- study.files[men]

### Here are the things I've tried to create a corpus from the subsetted list. None of these work.

women_corpus <- Corpus(women.files)
women_corpus <- Corpus(DirSource(women.files))
women_corpus <- Corpus(DirSource(unlist(women.files)))

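The subsets I need to create are fairly complex, so I cannot easily create new folders containing only the text files of interest for each corpus.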

Answer 1

This works the way I think you want it to.

pathname <- c("C:/data/test")

study.files <- list.files(path = pathname, pattern = NULL, all.files = T, full.names = T, recursive = T, ignore.case = T, include.dirs = F) 

### This gives me a character vector that is equivalent to:

study.files <- c("C:/data/test/test1/test1.txt",
                 "C:/data/test/test2/test2.txt",
                 "C:/data/test/test3/test3.txt")

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2)

### This creates new character vectors containing the file paths
women.files <- study.files[women]
men.files <- study.files[men]

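### Read each file, keep its text column, and build the corpus from the resulting list.
### Note: read.table() treats tabs, quotes and '#' specially, so this assumes plain text without them.

women_corpus <- NULL
# one data frame per file, one row per line of text
nedir <- lapply(women.files, function (filename) read.table(filename, sep="\t", stringsAsFactors = F))
# one character vector per file (the first and only column)
hepsi <- lapply( nedir, function(x) x$V1)
# each list element becomes one document in the corpus
women_corpus <- Corpus(VectorSource(hepsi))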
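
As a variation on the same idea, here is a minimal sketch that reads each file with readLines() instead of read.table(), so tabs, quotes and '#' characters in the text need no special handling. It assumes each file should become exactly one document. The temporary directory, the file contents and the name women.texts are made up for illustration and are not part of the original question.

```r
library(tm)

# Hypothetical stand-in data: three small files in a temporary directory
sample_dir <- file.path(tempdir(), "corpus_demo")
dir.create(sample_dir, showWarnings = FALSE)
study.files <- file.path(sample_dir, sprintf("author%d.txt", 1:3))
for (i in seq_along(study.files)) {
  writeLines(c(sprintf("Text written by author %d.", i), "A second line."),
             study.files[i])
}

# Subset the paths exactly as in the question
women <- c(1, 3)
women.files <- study.files[women]

# Read each file and collapse its lines into a single string,
# so that every file ends up as exactly one document
women.texts <- vapply(
  women.files,
  function(path) paste(readLines(path, warn = FALSE), collapse = "\n"),
  character(1),
  USE.NAMES = FALSE
)

# Build the corpus from the character vector of texts
women_corpus <- VCorpus(VectorSource(women.texts))
print(length(women_corpus))
```

Because the subsetting happens on the vector of paths, any selection logic (numeric indices, logical vectors, name matching) can be used without moving files on disk.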

Answer 2

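I had a similar problem: I was clustering documents by their cosine similarity and wanted to analyse each cluster separately, without reorganising the documents into separate folders.

Looking at the documentation for DirSource, there is an option to pass in a regular-expression pattern ("only file names which match the regular expression will be returned"), so I used the cluster information to group the documents and build one regex pattern per cluster.

Using the example above, you could do something like this: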

library(tidyverse)
library(tm)

study.files <- c(
  "C:/Desktop/Samples/author1.txt"
  ,"C:/Desktop/Samples/author2.txt"
  ,"C:/Desktop/Samples/author3.txt"
  ,"C:/Desktop/Samples/author4.txt"
  ,"C:/Desktop/Samples/author5.txt"
)

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2,4,5)

# putting this into a data.frame
doc_df <- data.frame(document = study.files) %>% 
  # categorise each of the documents using the numeric vectors
  # defined above, as per original example
  mutate(
    index = row_number()
    , gender = if_else(index %in% women, 'woman', 'man')
    # separate the file name from the full path
    , filename = basename(as.character(document))
    ) %>% 
  group_by(gender) %>%
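  # build the regex select pattern; the ^( ... )$ anchors keep
  # author1.txt from also matching names such as coauthor1.txt or author1.txt.bak
  mutate(select_pattern = paste0('^(', str_replace_all(paste0(filename, collapse = '|'), '[.]', "[.]"), ')$')) %>%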
  summarise(select_pattern = first(select_pattern))
  
men_df <- doc_df %>% filter(gender == 'man')
woman_df <- doc_df %>% filter(gender == 'woman')

# you can then use this to load a subset of documents from a single directory using regex
men_corpus <- Corpus(DirSource("C:/Desktop/Samples/", pattern = men_df$select_pattern[1]))
woman_corpus <- Corpus(DirSource("C:/Desktop/Samples/", pattern = woman_df$select_pattern[1]))
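
The pattern-building step can also be sketched in base R, without the tidyverse. The helper paths_to_pattern below is hypothetical (it is not part of tm), and it assumes the file names contain no regex metacharacters other than the dot. The directory listing in all_names is made up to show why anchoring the pattern matters.

```r
# Hypothetical helper (not part of tm): build one anchored regex that
# matches exactly the base names of the given paths.
# Assumes the names contain no regex metacharacters other than ".".
paths_to_pattern <- function(paths) {
  # replace each literal dot with a character class that matches only a dot
  escaped <- gsub(".", "[.]", basename(paths), fixed = TRUE)
  # join the names into one alternation and anchor it at both ends
  paste0("^(", paste(escaped, collapse = "|"), ")$")
}

women.files <- c("C:/Desktop/Samples/author1.txt",
                 "C:/Desktop/Samples/author3.txt")
women_pattern <- paths_to_pattern(women.files)

# Made-up directory listing that includes two near-miss names
all_names <- c("author1.txt", "author3.txt",
               "coauthor1.txt", "author3.txt.bak")

# Only the two intended files survive the filter
matched <- grep(women_pattern, all_names, value = TRUE)
print(matched)
```

The resulting string could then be supplied as the pattern argument of DirSource, in the same way as select_pattern in the code above.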
