.docx file chapter extraction

Question

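I want to extract the content of a .docx file, chapter by chapter. My .docx document has an index (table of contents), and every chapter has some content: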

 1. Intro
   some text about Intro, these things, those things
 2. Special information
   these information are really special
    2.1 General information about the environment
      environment should be also important
    2.2 Further information 
      and so on and so on

So in the end it would be great to get an Nx3 matrix containing the index number, the index name, and at least the content.

i_number     i_name                 content
1            Intro                  some text about Intro, these things, those things
2            Special Information    these information are really special
...

Thanks for your help.

Answer 1

You can export or copy-paste your .docx into a .txt file and apply this R script:

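library(stringr)
library(readr)

# Read the exported text file into a single string
doc <- read_file("filename.txt")

# A heading is a number followed by a dot, then 4 to 100 characters up to the line break.
# Backslashes must be doubled inside an R string; \r?\n accepts Windows and Unix line endings.
pattern_chapter <- regex("(\\d+\\.)(.{4,100}?)(?:\\r?\\n)", dotall = TRUE)

# Column 1 of the match matrix holds the full heading line
i_name <- str_trim(str_match_all(doc, pattern_chapter)[[1]][, 1])
# Splitting on the headings gives the text before the first heading plus one piece per heading
paragraphs <- str_split(doc, pattern_chapter)[[1]]
# Drop the leading piece so that content lines up with i_name
content <- str_trim(paragraphs[-1])

result <- data.frame(i_name, content)
result$i_number <- seq.int(nrow(result))

View(result)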

It won't work if your document contains any line that starts with a number but is not a heading (for example, footnotes or numbered lists).

(Please do not blindly downvote: this script works perfectly with the example given.)
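
To get closer to the three columns asked for in the question, the same idea can be sketched with a stricter pattern. This is only a minimal sketch under a few assumptions that are not in the original script: the sample text is typed inline instead of being read from a file, headings are assumed to sit at the start of a line (after optional indentation), and the name pattern_heading is made up for the illustration. Any line that starts with a number followed by a space is still treated as a heading, so footnotes and numbered lists remain a problem.

```r
library(stringr)

# Sample text mirroring the example in the question (assumption: Unix line endings)
doc <- paste(
  " 1. Intro",
  "   some text about Intro, these things, those things",
  " 2. Special information",
  "   these information are really special",
  "    2.1 General information about the environment",
  "      environment should be also important",
  "    2.2 Further information",
  "      and so on and so on",
  sep = "\n"
)

# Heading: optional indent, a dotted number such as "2" or "2.1", an optional
# trailing dot, whitespace, then the title up to the end of the line.
# multiline = TRUE lets ^ match at the start of every line.
pattern_heading <- regex(
  "^[ \\t]*(\\d+(?:\\.\\d+)*)\\.?[ \\t]+([^\\r\\n]+)\\r?\\n?",
  multiline = TRUE
)

# Match matrix: column 2 is the number, column 3 is the title
headings <- str_match_all(doc, pattern_heading)[[1]]
# One piece before the first heading, then one piece per heading
pieces <- str_split(doc, pattern_heading)[[1]]

result <- data.frame(
  i_number = headings[, 2],
  i_name = str_trim(headings[, 3]),
  content = str_trim(pieces[-1]),
  stringsAsFactors = FALSE
)

print(result)
```

Here i_number keeps the numbering as written in the document ("2.1", "2.2") instead of a running counter, which may or may not be what you want.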

