Using str_match_all to match beginning and end of characters in R

Question

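Dear Stack Overflow community,

I am trying to use stringr to extract unique digital identifiers from a website. The website lists several unique DOIs, and each DOI is followed by the word "Cite".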

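[1] I get the information from a website:

pg <- read_html("https://search.datacite.org/works?query=Movebank&resource-type-id=dataset")

[2] I am trying to get the 26 unique strings that start with "doi" from the website.

[3] I plan to use str_match_all, where the beginning must match "https://doi.org/", followed by some characters in between ("*"), and the end must match the word "Cite":

str_match_all( html_text(html_nodes(pg, "body")) , pattern = "^https://doi.org/*Cite$")

[4] An example of one of these DOIs looks like this:

https://doi.org/10.5441/001/1.41076dq1/6 Cite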

Any help is much appreciated!

Best regards,

Diego

Answer 1

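Using code similar to hrbrmstr's in the following answer, you can easily get all the URLs you want. Instead of running a regex over the page text, it selects the link nodes whose href contains "doi.org" and reads their href attributes.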

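# Select every <a> node whose href contains "doi.org" (pg comes from read_html() in the question; requires rvest)
fils <- html_nodes(pg, xpath=".//a[contains(@href, 'doi.org')]")

# Collect the href attribute of each matched link into a data frame
df <- data.frame(link= html_attr(fils, "href"))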

 df
                                          link
1  https://doi.org/10.25504/fairsharing.httzv2
2     https://doi.org/10.5441/001/1.41076dq1/6
3     https://doi.org/10.5441/001/1.q986rc29/3
4     https://doi.org/10.5441/001/1.q986rc29/4
5       https://doi.org/10.5441/001/1.25551gr6
6     https://doi.org/10.5441/001/1.25551gr6/1
7     https://doi.org/10.5441/001/1.25551gr6/2
8     https://doi.org/10.5441/001/1.q8b02dc5/4
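
If you would still rather use a regex, note why the pattern in the question finds nothing: "^" and "$" anchor to the start and end of the whole body text, which is one long string, and "/*" means "zero or more slashes" rather than "any characters". A minimal sketch of one possible fix is below. The sample text in txt is hypothetical and only stands in for html_text(html_nodes(pg, "body")), so the exact whitespace around "Cite" on the real page is an assumption.

```r
library(stringr)

# Hypothetical stand-in for the page text; on the real page this would be
# html_text(html_nodes(pg, "body")) from the question.
txt <- paste(
  "Some dataset https://doi.org/10.5441/001/1.41076dq1/6 Cite",
  "Another dataset https://doi.org/10.5441/001/1.q986rc29/3 Cite",
  "Some dataset again https://doi.org/10.5441/001/1.41076dq1/6 Cite",
  sep = "\n"
)

# No ^ or $ anchors, because the DOIs sit in the middle of the text.
# The dot in "doi.org" is escaped so it matches a literal dot.
# (\\S+?) lazily captures the non-whitespace DOI path, and \\s*Cite
# requires the word "Cite" to follow without putting it in the capture group.
pattern <- "(https://doi\\.org/\\S+?)\\s*Cite"

# str_match_all() returns a list with one matrix per input string:
# column 1 is the full match, column 2 is the first capture group.
m <- str_match_all(txt, pattern)[[1]]

# Keep each DOI URL only once.
dois <- unique(m[, 2])
dois
```

The capture group keeps "Cite" out of the result, and unique() drops DOIs that appear more than once in the text. Selecting the link nodes as shown above is usually more robust, because it does not depend on how the page text is laid out.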


regex

string-matching

stringr

rvest