Can't access specific content in html page with rvest and selectorGadget

Question

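I am trying to scrape an NCBI page (https://www.ncbi.nlm.nih.gov/protein/29436380) to obtain information about a protein. I need to access the gene_synonyms and GeneID fields. I tried to find the relevant nodes with the selectorGadget add-on in Chrome and with the code inspector in Firefox. This is the code I tried: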

require("dplyr")
require("rvest")
require("stringr")
GIwebPage <- read_html("https://www.ncbi.nlm.nih.gov/protein/29436380")
TestHTML <- GIwebPage %>% html_node("div.grid , div#maincontent.col.nine_col , div.sequence , pre.genebank , .feature") %>% html_text(trim = TRUE)

Then I looked for the relevant text, but it is simply not there.

str_extract_all(TestHTML, pattern = "(synonym).{30}")
 [[1]]
 character(0)

str_extract_all(TestHTML, pattern = "(GeneID:).{30}")
 [[1]]
 character(0)

What I seem to be getting instead is some of the text from the right-hand sidebar.

str_extract_all(TestHTML, pattern = "(protein).{30}")
 [[1]]
 [1] "protein codes including ambiguities a"
 [2] "protein sequence for myosin-9  (NP_00"
 [3] "protein should not be confused with t"
 [4] "protein, partial [Homo sapiens]gi|294"
 [5] "protein codes including ambiguities a"

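I have tried many combinations of node selections with html_node() and I don't know what else to try. Is this content hidden in a structure I can't see? Or am I just not skilled enough to select the right node?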

Many thanks, Jose.

Answer 1

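The page loads this information dynamically, so it is not part of the HTML that read_html() downloads. The underlying data is stored at a different location.
Using the developer tools in your browser (for example, the list of network requests), look for the link the page requests in the background.

The information you are looking for comes from the "viewer.fcgi" request. Right-click it and copy the link.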
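Once you have that link, the fields can be pulled out of the returned text with regular expressions. The following is only a minimal sketch, assuming the response is GenBank-style flat text in which the values appear as `/gene_synonym="..."` and `/db_xref="GeneID:..."`. The helper name `extract_gene_fields`, the `viewer_url` placeholder and the sample record are made up for illustration. It uses base R only, so it runs without extra packages.

```r
# Hypothetical helper: extract gene_synonym and GeneID values from
# GenBank-style flat text (assumed format, see the note above).
extract_gene_fields <- function(flat_text) {
  # Everything between the quotes of /gene_synonym="..."
  synonyms <- regmatches(
    flat_text,
    gregexpr('(?<=/gene_synonym=")[^"]*', flat_text, perl = TRUE)
  )[[1]]
  # The digits that follow "GeneID:"
  gene_ids <- regmatches(
    flat_text,
    gregexpr("(?<=GeneID:)[0-9]+", flat_text, perl = TRUE)
  )[[1]]
  list(
    # Collapse line breaks and padding inside wrapped values
    gene_synonyms = unique(trimws(gsub("\\s+", " ", synonyms))),
    gene_ids = unique(gene_ids)
  )
}

# Made-up sample record, NOT real NCBI data, used only to show the idea.
sample_text <- paste(
  "     gene            1..100",
  '                     /gene="XYZ1"',
  '                     /gene_synonym="ABC1; ABC2"',
  '                     /db_xref="GeneID:12345"',
  sep = "\n"
)
fields <- extract_gene_fields(sample_text)
print(fields)

# Against the real page (needs network access); paste the link you copied
# from the developer tools in place of the placeholder:
# viewer_url <- "<copied viewer.fcgi link>"
# flat_text <- paste(readLines(viewer_url, warn = FALSE), collapse = "\n")
# extract_gene_fields(flat_text)
```

If the copied link returns HTML rather than plain text, reading it with read_html() and html_text() first and then passing the result to the helper should work the same way.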

See a similar question and its answers: R not accepting xpath query
