Possible to crawl/extract a line from a website's robots.txt file?

Question

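I have a website and I want to crawl the robots file in its root folder: www.foo.com/robots.txt
From that file, I want to grab a specific line (say, line 3) and extract its value to check whether it contains Disallow. Is it possible to do this in rvest? I would also like to crawl this page automatically at a scheduled frequency.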

Answer 1

This should be easy to do, for example:

txt <- readLines("https://www.whosebug.com/robots.txt")
txt

txt[3] # line 3 of the file
grepl("disallow", txt[3], ignore.case = TRUE) # check for "disallow"
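Since robots.txt is plain text, base R is enough here and rvest is not strictly needed. As a minimal sketch, the check above could be wrapped in a small function that also handles a file with fewer lines than expected. The name `line_is_disallow` and the sample lines are hypothetical, introduced only for illustration, and the stricter pattern (matching only a directive at the start of the line) is an assumption about what "contains Disallow" should mean:

```r
# Hypothetical helper: does line `n` of a robots.txt file hold a Disallow directive?
# `lines` is a character vector, e.g. the result of readLines(url).
line_is_disallow <- function(lines, n) {
  if (n < 1 || n > length(lines)) {
    return(NA)  # the file has no such line
  }
  # Anchor at the start so a comment that merely mentions "disallow" does not match
  grepl("^[[:space:]]*disallow[[:space:]]*:", lines[n], ignore.case = TRUE)
}

# Offline sample standing in for a downloaded file
sample_lines <- c("User-agent: *", "Allow: /", "Disallow: /private/")
line_is_disallow(sample_lines, 3)   # TRUE
line_is_disallow(sample_lines, 2)   # FALSE
line_is_disallow(sample_lines, 10)  # NA

# Against a live site (requires network access):
# txt <- readLines("https://www.foo.com/robots.txt")
# line_is_disallow(txt, 3)
```

Returning `NA` for a missing line keeps "no such line" distinct from "line present but not a Disallow rule", which may matter if the file changes between runs.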

To run this at scheduled times, use a cron job (Linux) or a Task Scheduler job (Windows).
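For scheduling, one possible arrangement is a standalone script run with `Rscript` that appends each result to a log file. This is only a sketch: the script name `check_robots.R`, the function names, the log file name, and the path in the crontab comment are all placeholders, not something from the original answer.

```r
# check_robots.R (hypothetical script name), meant to be run non-interactively.
# Example crontab entry, daily at 06:00 (the path is a placeholder):
#   0 6 * * * Rscript /path/to/check_robots.R

# Append one timestamped result row to a CSV log file.
log_disallow_check <- function(lines, n, log_file) {
  # FALSE when the line is missing or does not mention "disallow"
  found <- n <= length(lines) && grepl("disallow", lines[n], ignore.case = TRUE)
  row <- data.frame(
    checked_at = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    line = n,
    disallow = found
  )
  # Write the header only when the file is first created
  is_new <- !file.exists(log_file)
  write.table(row, log_file, sep = ",", row.names = FALSE,
              col.names = is_new, append = !is_new)
  invisible(found)
}

# Fetch the file and log the result; a failed download is reported, not fatal.
run_check <- function(url, n = 3, log_file = "robots_check.csv") {
  lines <- tryCatch(readLines(url, warn = FALSE), error = function(e) NULL)
  if (is.null(lines)) {
    message("Could not read ", url)
    return(invisible(NA))
  }
  log_disallow_check(lines, n, log_file)
}

# Uncomment in the scheduled script (requires network access):
# run_check("https://www.foo.com/robots.txt")
```

On Windows, the same script could be registered in Task Scheduler with `Rscript.exe` as the program and the script path as its argument.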

r

rvest