Web scraping with rvest loses text due to quotes in content
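I am trying to use the web scraping package rvest to get species descriptions from the eBird website. My problem is that the description text gets truncated, I think because of the quotation marks in the content. Inspecting the page source and the tag I am looking for, I see: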
<meta name="description" content="Small flycatcher with a big, peaked head and relatively long bill. Extremely similar to several other species, especially Alder and Willow Flycatchers. Greenish-olive above and pale whitish below. Thin white eyering. Dark wings with distinct white wingbars. Very long wingtips. Best distinguished from other flycatchers by habitat and voice. Birds near the northern end of range prefer shaded ravines with mix of hemlocks and deciduous trees; farther south, found in mature deciduous forests. Tends to stay high in the canopy. Song is an explosive "pwit-SIP;" call note is a sharp "pweek." ">
library(rvest)
library(dplyr)
url <- "https://ebird.org/species/acafly"
# Get list of metatag tags
metatags <- read_html(url) %>%
  html_nodes('meta') %>%
  html_attr('name')
# Get which row has the description
rownum <- which(metatags == "description")
# Get content from meta tags
content <- read_html(url) %>%
  html_nodes('meta') %>%
  html_attr('content')
# Get description content
description <- content[rownum]
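The description I extract with this code is:
Small flycatcher with a big, peaked head and relatively long bill. Extremely similar to several other species, especially Alder and Willow Flycatchers. Greenish-olive above and pale whitish below. Thin white eyering. Dark wings with distinct white wingbars. Very long wingtips. Best distinguished from other flycatchers by habitat and voice. Birds near the northern end of range prefer shaded ravines with mix of hemlocks and deciduous trees; farther south, found in mature deciduous forests. Tends to stay high in the canopy. Song is an explosive
However, what I actually want is:
Small flycatcher with a big, peaked head and relatively long bill. Extremely similar to several other species, especially Alder and Willow Flycatchers. Greenish-olive above and pale whitish below. Thin white eyering. Dark wings with distinct white wingbars. Very long wingtips. Best distinguished from other flycatchers by habitat and voice. Birds near the northern end of range prefer shaded ravines with mix of hemlocks and deciduous trees; farther south, found in mature deciduous forests. Tends to stay high in the canopy. Song is an explosive "pwit-SIP;" call note is a sharp "pweek."
How can I get the full description, including the quoted parts?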
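One likely explanation for the truncation, assuming the served HTML really contains unescaped double quotes inside the content attribute as shown in the question: an HTML parser ends a double-quoted attribute value at the next double quote, so everything after "Song is an explosive " is no longer part of content. The sketch below reproduces this offline with a made-up, shortened meta tag (the html string is an illustration, not the real eBird page):

```r
library(rvest)

# Hypothetical, shortened stand-in for the eBird <meta> tag; the inner
# double quotes are deliberately left unescaped, as in the page source.
html <- '<html><head><meta name="description" content="Song is an explosive "pwit-SIP;" call note is a sharp "pweek." "></head><body></body></html>'

# Select the description meta tag directly and read its content attribute
truncated <- read_html(html) %>%
  html_node('meta[name="description"]') %>%
  html_attr("content")

# The attribute value stops at the first inner double quote
truncated
```

If this is what happens, the missing text never reaches html_attr() as part of content, so the description has to be read from somewhere else on the page.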
You can get the full description, including the quotes, from the first p tag with class u-stack-sm:
library(rvest)
library(dplyr)
url <- "https://ebird.org/species/acafly"
# Get description content
description <- read_html(url) %>%
  html_nodes('p.u-stack-sm') %>%
  html_text() %>%
  .[[1]]
description
#> [1] "Small flycatcher with a big, peaked head and relatively long bill. Extremely similar to several other species, especially Alder and Willow Flycatchers. Greenish-olive above and pale whitish below. Thin white eyering. Dark wings with distinct white wingbars. Very long wingtips. Best distinguished from other flycatchers by habitat and voice. Birds near the northern end of range prefer shaded ravines with mix of hemlocks and deciduous trees; farther south, found in mature deciduous forests. Tends to stay high in the canopy. Song is an explosive \"pwit-SIP;\" call note is a sharp \"pweek.\"\n\r\n"
url <- "https://ebird.org/species/siltea1/"
description <- read_html(url) %>%
  html_nodes('p.u-stack-sm') %>%
  html_text() %>%
  .[[1]]
description
#> [1] "Distinctive, but rather local and uncommon in Chile (more common in Argentina and elsewhere) in grassy wetlands, reedy marshes, and on lakes. Associates with other waterfowl, but usually is not out on open water and is easily overlooked. Readily identified by small size, dark cap, pale cheeks, and blue bill with yellow patch at base. Range does not overlap with the larger and more boldly patterned Puna Teal.\n\r\n\r\n\r\n"
Created on 2020-10-11 by the reprex package (v0.3.0)
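Both outputs above end in stray line breaks (\n\r\n). If those are unwanted, html_text() accepts trim = TRUE, and html_node() (singular) returns only the first match, so the .[[1]] step is not needed. A minimal offline sketch, assuming a page laid out like the eBird one; the html string here is invented for illustration:

```r
library(rvest)

# Hypothetical stand-in for the eBird page: two paragraphs with the same
# class, the first one ending in trailing line breaks.
html <- '<html><body><p class="u-stack-sm">Song is an explosive "pwit-SIP;" call note is a sharp "pweek."\n\r\n</p><p class="u-stack-sm">Some other paragraph.</p></body></html>'

description <- read_html(html) %>%
  html_node("p.u-stack-sm") %>%   # first matching <p> only
  html_text(trim = TRUE)          # strip leading/trailing whitespace

description
```

Wrapping the result of the original pipeline in trimws() should have a similar effect if you prefer to keep html_nodes() and .[[1]].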