Extract pseudo-list text from p tag
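I'm having trouble extracting text from a series of links, because the older pages use a different formatting style.
Is it possible to pick out the text I have highlighted here, which sits inside <p> tags? If it can't be combined with the ul li call, I can run the function more than once.
This is the function I'm using to extract the text; perhaps the solution can be added to the html_nodes call: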
link_df_reprex <- tribble(
  ~title, ~episode, ~link,
  "a", "1", "https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country",
  "b", "2", "https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight",
  "c", "3", "https://www.backlisted.fm/episodes/3-david-nobbs-1",
  "d", "4", "https://www.backlisted.fm/episodes/67-willa-cather-my-ntonia",
  "e", "5", "https://www.backlisted.fm/episodes/66-sebastian-faulks-the-fatal-englishman"
)
recs_extract <- function(df){
  pages <- df %>% map(read_html, url = link)
  pages_text <- pages %>%
    map_dfr(. %>%
      html_nodes(css = "ul li") %>%
      html_text() %>%
      tibble(text = .)
    )
}
#works for first 3
link_df_reprex %>% slice(1:3) %>% mutate(data = suppressWarnings(map(link, recs_extract)))
#doesn't work for last 2, extracts different text:
link_df_reprex %>% slice(4:5) %>% mutate(data = suppressWarnings(map(link, recs_extract))) %>% unnest()
Try this:
library(rvest)
library(tibble)
library(dplyr)
library(purrr)
library(tidyr)

link_df_reprex <- tribble(
  ~title, ~episode, ~link,
  "a", "1", "https://www.backlisted.fm/episodes/1-j-l-carr-a-month-in-the-country",
  "b", "2", "https://www.backlisted.fm/episodes/2-jean-rhys-good-morning-midnight",
  "c", "3", "https://www.backlisted.fm/episodes/3-david-nobbs-1",
  "d", "4", "https://www.backlisted.fm/episodes/67-willa-cather-my-ntonia",
  "e", "5", "https://www.backlisted.fm/episodes/66-sebastian-faulks-the-fatal-englishman"
)

# The XPath finds the <p> holding the "Books mentioned:" heading inside the
# content block, then returns the parent element of every link found in the
# siblings that follow that heading.
recs_extract <- function(url){
  url %>%
    read_html() %>%
    html_nodes(xpath = "//div[@class='sqs-block-content']/descendant::p[contains(., 'Books mentioned:') or contains(., 'Books Mentioned:')]/following-sibling::*/descendant::a/parent::*") %>%
    html_text() %>%
    tibble(text = .)
}

# One row per extracted entry; naming the column avoids the unnest() warning in tidyr >= 1.0
link_df_reprex %>% mutate(data = suppressWarnings(map(link, recs_extract))) %>% unnest(data)
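To see why one XPath can serve both page layouts, here is a minimal sketch that runs the same expression against two small inline HTML fragments instead of the live site. The fragments, the book titles, and the helper name extract_books are made up for illustration; the real pages are only assumed to follow roughly this shape.

```r
library(rvest)

# Same XPath as in the answer, split across lines for readability
books_xpath <- paste0(
  "//div[@class='sqs-block-content']",
  "/descendant::p[contains(., 'Books mentioned:') or contains(., 'Books Mentioned:')]",
  "/following-sibling::*/descendant::a/parent::*"
)

# Hypothetical old layout: each book sits in its own <p> after the heading
old_layout <- read_html(paste0(
  "<div class='sqs-block-content'>",
  "<p>Intro text</p>",
  "<p>Books mentioned:</p>",
  "<p><a href='#'>Book A</a> by Author A</p>",
  "<p><a href='#'>Book B</a> by Author B</p>",
  "</div>"
))

# Hypothetical new layout: the books are <li> items in a <ul> after the heading
new_layout <- read_html(paste0(
  "<div class='sqs-block-content'>",
  "<p>Books Mentioned:</p>",
  "<ul><li><a href='#'>Book C</a> by Author C</li></ul>",
  "</div>"
))

# Illustrative helper: return the text of every element that directly wraps a link
extract_books <- function(page) {
  page %>%
    html_nodes(xpath = books_xpath) %>%
    html_text()
}

extract_books(old_layout)  # "Book A by Author A" "Book B by Author B"
extract_books(new_layout)  # "Book C by Author C"
```

The final parent::* step is what makes the expression layout-agnostic: it returns whichever element directly wraps the link, a <p> on the old pages and an <li> on the new ones. The flip side is that an entry without a link, or a link wrapped in an extra inline tag, would be missed or truncated, so it is worth spot-checking a few real pages.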