Using gsub to replace string and following n words

Question

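I am trying to clean text from parliamentary protocols. Because the data comes from PDF files, it includes footers with the legislative period and a page reference: "18th legislative period page x of N". The total number of pages differs across the 600 protocols, so I cannot match one exact expression. Instead, I would like to use the gsub function to remove the beginning of the footer and the n words that follow it.

I have tried many solutions from other questions along similar lines, but could not get them to work.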

string <- "this is the first page. 18th legislative period page 1 of 44 
this is the second page. 18th legislative period page 2 of 44 and this is 
the third page"

gsub("18th legislative period page", "", string)

I would like the string to read

"this is the first page. this is the second page. and this is the third page."

Edit: Thank you very much for your time and patience!

Answer 1

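You can use

# backslashes must be doubled inside R string literals
gsub("18th legislative period page \\d+ of \\d+", "", string)
# or, to also collapse the leftover spaces and newlines ('\n') into single spaces
gsub("\\s{2,}", " ", gsub("18th legislative period page \\d+ of \\d+", "", string))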
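
If the footer should instead be treated literally as "the fixed opening plus the next n words", as the question puts it, a minimal sketch might look like the following. The helper name remove_footer and its arguments are made up for illustration; it assumes the footer prefix contains no regex metacharacters and that words are separated by whitespace.

```r
# Hypothetical helper (not from the original answer): drop a fixed footer
# prefix together with the n_words whitespace-separated words that follow it.
# Assumes `prefix` contains no regex metacharacters.
remove_footer <- function(text, prefix, n_words) {
  # (?:\s+\S+){n} matches exactly n words, each preceded by whitespace
  pattern <- paste0(prefix, "(?:\\s+\\S+){", n_words, "}")
  without_footer <- gsub(pattern, "", text, perl = TRUE)
  # collapse leftover runs of spaces/newlines and trim both ends
  trimws(gsub("\\s+", " ", without_footer, perl = TRUE))
}

string <- "this is the first page. 18th legislative period page 1 of 44 
this is the second page. 18th legislative period page 2 of 44 and this is 
the third page"

cleaned <- remove_footer(string, "18th legislative period page", 3)
print(cleaned)
```

When the footer always has the shape "page x of N", matching `\\d+ of \\d+` is the stricter choice, since a word-count pattern would also swallow ordinary text if a footer happened to be shorter than expected.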

Tags: regex, string, r, gsub