.docx file chapter extraction
I want to extract the content of a .docx file, chapter by chapter.
My .docx document has a table of contents, and each chapter has some content:
1. Intro
some text about Intro, these things, those things
2. Special information
these information are really special
2.1 General information about the environment
environment should be also important
2.2 Further information
and so on and so on
So in the end it would be great to get an Nx3 matrix containing the index number, the index name, and at least the content.
i_number  i_name               content
1         Intro                some text about Intro, these things, those things
2         Special Information  these information are really special
...
Thanks for your help.
You can export your .docx to .txt (or copy-paste its text into a .txt file) and apply this R script:
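library(stringr)
library(readr)

# Read the exported text file as a single string
doc <- read_file("filename.txt")

# A heading: a number and a dot, 4-100 more characters, then a Windows line break
pattern_chapter <- regex("(\\d+\\.)(.{4,100}?)(?:\r\n)", dotall = TRUE)

# Column 1 of the match matrix is the full heading line; trim the line break
i_name <- str_trim(str_match_all(doc, pattern_chapter)[[1]][, 1])

# Split at the headings and drop empty pieces (e.g. the one before the first heading)
paragraphs <- str_split(doc, pattern_chapter)[[1]]
content <- paragraphs[paragraphs != ""]

result <- data.frame(i_name, content)
result$i_number <- seq.int(nrow(result))
View(result)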
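It won't work if your document contains lines that start with a number but are not headings (e.g. footnotes or numbered lists).
(Please don't downvote blindly: this script works perfectly with the given example.)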
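
If the chapter numbers themselves are needed (including sub-chapters such as 2.1), a line-by-line variant may be easier to adapt. This is only a sketch under a few assumptions: it uses an inline copy of the question's sample instead of reading filename.txt, the names pattern_heading, is_heading and section are made up for illustration, and a heading is assumed to be any line that starts with a number followed by a dot. Numbered list items that look like that will still be mistaken for headings.

```r
library(stringr)

# Inline sample standing in for the exported text (taken from the question).
doc <- paste(
  "1. Intro",
  "some text about Intro, these things, those things",
  "2. Special information",
  "these information are really special",
  "2.1 General information about the environment",
  "environment should be also important",
  "2.2 Further information",
  "and so on and so on",
  sep = "\r\n"
)

# Split into lines, accepting both Windows and Unix line endings.
lines <- str_split(doc, "\r?\n")[[1]]

# Assumed heading shape: "1.", "2.1", "2.1.3" ..., whitespace, then the title.
pattern_heading <- "^(\\d+\\.(?:\\d+(?:\\.\\d+)*\\.?)?)\\s+(\\S.*)$"
m <- str_match(lines, pattern_heading)
is_heading <- !is.na(m[, 1])

# Each line belongs to the most recent heading seen so far.
section <- cumsum(is_heading)

# Join the body lines of every section; text before the first heading is dropped.
body <- !is_heading & section > 0
content <- vapply(
  seq_len(sum(is_heading)),
  function(k) paste(lines[body & section == k], collapse = "\n"),
  character(1)
)

result <- data.frame(
  i_number = str_remove(m[is_heading, 2], "\\.$"),  # "1." becomes "1"
  i_name = m[is_heading, 3],
  content = content,
  stringsAsFactors = FALSE
)
print(result)
```

Here i_number keeps the number as written in the document (1, 2, 2.1, 2.2) rather than a running counter, and a heading with no text under it gets an empty content string instead of shifting the later rows.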