How to extract text in second p element inside div

I have a div with 2 p tags.

I need to get the text of the second p element.

<div class="fb-price-list">
      <p class="fb-price">S/  1,699 (Internet)</p>
      <p class="fb-price">S/  2,399 (Normal)</p>
</div>

Expected result:

S/  2,399 (Normal)

I have this, but it doesn't work:

tvs_url <- read_html("https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1")

product_price_actual <- tvs_url %>% 
  html_nodes('div.pod-group pod-group__large-pod div.pod-body div.fb-price-list p.fb-price:nth-child(2)') %>%
  html_text()

html:

<div class="pod-item"><div class="fb-form__input--checkbox fb-pod__item__compare"><input id="fb-pod__item__input-16754140" class="fb-pod__item__compare__input" type="checkbox" name="fb-pod__item__input-16754140" value="16754140"><label for="fb-pod__item__input-16754140" class="fb-pod__item__compare__label">Comparar</label></div><div class="pod-head"><a class="pod-head__image" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="content__image"><img src="//falabella.scene7.com/is/image/FalabellaPE/16754140?wid=544&amp;hei=544&amp;qlt=70&amp;anchor=750,750&amp;crop=0,0,0,0" alt="img" class="image"></div></a><a href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140" class="pod-head__stickerslink"><div class="pod-head__stickers"><div class="fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff" data-discount-content="">29%</div></div></a></div><div class="pod-body"><a class="section__pod-top" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="section__pod-top-brand">SAMSUNG</div><div class="section__pod-top-title"><div class="LinesEllipsis  ">LED UHD 4K 55" Smart TV UN55RU7100GXPE SERIE RU7100<wbr></div></div></a><div class="section__pod-middle"><div class="section__pod-middle-content__stickers"><div class="fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff" data-discount-content="">29%</div></div><div class="section__information"><a class="section__information-link" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="fb-price-list"><p class="fb-price">S/  1,699 (Internet)</p><p class="fb-price">S/  2,399 (Normal)</p></div></a></div><div class="section__pod-middle-content__button"><button class="btn-add-to-basket">AGREGAR A TU BOLSA</button></div></div><div class="section__pod-bottom"><div 
class="fb-pod__rating" style="visibility: hidden;"><a href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140#comments"><div class="fb-rating-stars"><div class="fb-rating-stars__container"><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><p class="fb-rating-stars__count">0 <span class="fb-rating-stars__count__max"> / 5</span></p></div></div></a></div><a class="section__pod-bottom-descriptionlink" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><ul class="section__pod-bottom-description"><li>Modelo: UN55RU7100GXPE</li><li>Tamaño de la pantalla: 55"</li><li>Resolución: 4K Ultra HD</li><li>Tecnología: Led</li><li>Conexión bluetooth: Sí</li></ul></a></div></div></div>

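Update 1:

Based on the accepted answer, I use ifelse to check the number of characters at a given position:

The position being checked is the 4th. When there is no precio_antes (previous price), that position is occupied by another element, so in those cases we need to put NA.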

ifelse(nchar(sapply(splitted, "[", 4))>3, NA, sapply(splitted, "[", 6))

How I built the final df:

df <- data.frame(
    brand = sapply(splitted, "[", 2), #We don't need the "comparar" text so we start from 2
    product = sapply(splitted, "[", 3),
    precio_antes = ifelse(nchar(sapply(splitted, "[", 4))>3, NA, sapply(splitted, "[", 6)),
    precio_actual = ifelse(nchar(sapply(splitted, "[", 4))<=3, sapply(splitted, "[", 5), sapply(splitted, "[", 4))
  )
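
To make the position check concrete, here is a minimal sketch on two made-up entries. The second entry (its brand, product name, and price) is hypothetical and only illustrates a product without a discount flag; the sketch also assumes a discount flag such as 29% is never longer than three characters.

```r
# Toy stand-in for `splitted`: one character vector per product.
# The second entry is hypothetical and has no discount flag.
splitted <- list(
  c("Comparar", "SAMSUNG", "LED UHD 4K 55\" Smart TV", "29%",
    "S/ 1,699 (Internet)", "S/ 2,399 (Normal)"),
  c("Comparar", "LG", "LED 43\" Smart TV",
    "S/ 2,199 (Internet)", "AGREGAR A TU BOLSA")
)

# Position 4 holds either the discount flag (short) or already a price (long).
fourth <- sapply(splitted, "[", 4)
has_discount <- nchar(fourth) <= 3

# With a discount flag, the prices sit at positions 5 and 6;
# without it, position 4 is the current price and there is no previous price.
precio_antes  <- ifelse(has_discount, sapply(splitted, "[", 6), NA)
precio_actual <- ifelse(has_discount, sapply(splitted, "[", 5), fourth)

print(data.frame(precio_antes, precio_actual))
```

Computing the test once and reusing it keeps the two columns consistent with each other.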

Here I use a CSS selector to select the node with class fb-price-list and then select its second p child:

library(rvest)

"<div class=\"pod-item\"><div class=\"fb-form__input--checkbox fb-pod__item__compare\"><input id=\"fb-pod__item__input-16754140\" class=\"fb-pod__item__compare__input\" type=\"checkbox\" name=\"fb-pod__item__input-16754140\" value=\"16754140\"><label for=\"fb-pod__item__input-16754140\" class=\"fb-pod__item__compare__label\">Comparar</label></div><div class=\"pod-head\"><a class=\"pod-head__image\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"content__image\"><img src=\"//falabella.scene7.com/is/image/FalabellaPE/16754140?wid=544&amp;hei=544&amp;qlt=70&amp;anchor=750,750&amp;crop=0,0,0,0\" alt=\"img\" class=\"image\"></div></a><a href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\" class=\"pod-head__stickerslink\"><div class=\"pod-head__stickers\"><div class=\"fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff\" data-discount-content=\"\">29%</div></div></a></div><div class=\"pod-body\"><a class=\"section__pod-top\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"section__pod-top-brand\">SAMSUNG</div><div class=\"section__pod-top-title\"><div class=\"LinesEllipsis  \">LED UHD 4K 55\" Smart TV UN55RU7100GXPE SERIE RU7100<wbr></div></div></a><div class=\"section__pod-middle\"><div class=\"section__pod-middle-content__stickers\"><div class=\"fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff\" data-discount-content=\"\">29%</div></div><div class=\"section__information\"><a class=\"section__information-link\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"fb-price-list\"><p class=\"fb-price\">S/  1,699 (Internet)</p><p class=\"fb-price\">S/  2,399 (Normal)</p></div></a></div><div class=\"section__pod-middle-content__button\"><button 
class=\"btn-add-to-basket\">AGREGAR A TU BOLSA</button></div></div><div class=\"section__pod-bottom\"><div class=\"fb-pod__rating\" style=\"visibility: hidden;\"><a href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140#comments\"><div class=\"fb-rating-stars\"><div class=\"fb-rating-stars__container\"><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><p class=\"fb-rating-stars__count\">0 <span class=\"fb-rating-stars__count__max\"> / 5</span></p></div></div></a></div><a class=\"section__pod-bottom-descriptionlink\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><ul class=\"section__pod-bottom-description\"><li>Modelo: UN55RU7100GXPE</li><li>Tamaño de la pantalla: 55\"</li><li>Resolución: 4K Ultra HD</li><li>Tecnología: Led</li><li>Conexión bluetooth: Sí</li></ul></a></div></div></div>" %>% 
  read_html() %>% 
  html_nodes(".fb-price-list p:nth-child(2)") %>% 
  html_text()
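
If some products list only one price, a flat selection like the one above returns fewer results than there are products, so the values no longer line up. A possible sketch, using a made-up two-product snippet (the second product is hypothetical): select each fb-price-list container first, then ask every container for its second p with html_node, which yields a missing value where there is no match.

```r
library(rvest)

# Hypothetical snippet: the second product has no "Normal" price.
html <- '<div class="fb-price-list"><p class="fb-price">S/  1,699 (Internet)</p><p class="fb-price">S/  2,399 (Normal)</p></div>
<div class="fb-price-list"><p class="fb-price">S/  2,199 (Internet)</p></div>'

# One node per product container.
price_lists <- read_html(html) %>%
  html_nodes(".fb-price-list")

# html_node() returns exactly one result per container,
# so a product without a second p becomes NA after html_text().
normal_price <- price_lists %>%
  html_node("p:nth-child(2)") %>%
  html_text()

print(normal_price)
```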

tl;dr

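The content is loaded dynamically, but its source is a JavaScript dictionary that is available in the page as a string. Once that string has been extracted with a regex, it can be parsed with a JSON parser. This is the JSON currently extracted.

If you open the dev tools with F12 and inspect the page HTML, you will see a script tag containing a JavaScript dictionary that can be extracted and processed with a JSON parser. This does mean you could target that script tag, extract the text from the node, and take a substring, but I prefer using a regex on a string (note that I extract the body as a string; using regex on HTML is generally not recommended, but on a string it is fine).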


Code output:

json$state$searchItemList$resultList$prices

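This gives you a list of length 32 containing data frames. You can see that in each data frame originalPrice contains the information you want (the row whose label column == (Normal)).

Not every item has an original price. Here is a simple, though not necessarily the most efficient, way of writing out the values: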

l <- json$state$searchItemList$resultList$prices

for (i in l){
  if (length(i$originalPrice)>1){
    print(i$originalPrice[2])
  } else {
    print("No original price")
  }
}
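
If you would rather collect the values in a vector than print them, the same check can be written with vapply. This is only a sketch on a made-up list: the two data frames below are hypothetical stand-ins, and the real columns in the parsed JSON may look different.

```r
# Hypothetical stand-in for json$state$searchItemList$resultList$prices:
# the first item has two price rows, the second has only one.
prices <- list(
  data.frame(label = c("(Internet)", "(Normal)"),
             originalPrice = c("1,699", "2,399"),
             stringsAsFactors = FALSE),
  data.frame(label = "(Internet)",
             originalPrice = "2,199",
             stringsAsFactors = FALSE)
)

# Same rule as the loop: take the second value when there is one, else NA.
original_price <- vapply(
  prices,
  function(p) if (length(p$originalPrice) > 1) p$originalPrice[2] else NA_character_,
  character(1)
)

print(original_price)
```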

R

library(rvest)
library(jsonlite)
library(stringr)

url = 'https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1'
r <- read_html(url) %>%
  html_node('body') %>%
  html_text() %>%
  toString()
x <- str_match_all(r,'fbra_browseProductListConfig = (.*);')
json <- jsonlite::fromJSON(x[[1]][,2])
print(json$state$searchItemList$resultList$prices)

Regex explanation:
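
In short, the pattern looks for the literal text fbra_browseProductListConfig = and captures everything after it up to the last semicolon on that line. Here is a minimal sketch on a made-up one-line string. The JSON content is hypothetical, and base R's regexec stands in for str_match_all.

```r
library(jsonlite)

# Hypothetical one-line excerpt of the page body.
txt <- 'var fbra_browseProductListConfig = {"state":{"searchItemList":{"resultList":[{"brand":"SAMSUNG"}]}}};'

# Group 1, (.*), is the text between "= " and the final ";".
m <- regmatches(txt, regexec("fbra_browseProductListConfig = (.*);", txt, perl = TRUE))
captured <- m[[1]][2]

# The captured text is plain JSON, so a JSON parser can take over from here.
parsed <- fromJSON(captured)
print(parsed$state$searchItemList$resultList$brand)
```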

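The content appears to be dynamic, so the data comes from somewhere else. I looked for a GET response with JSON, XML, etc. in the data, but found nothing. At this point I would opt for RSelenium. The following should extract the correct nodes. You can use whatever method you like to extract the numbers from the resulting strings: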

# install.packages("RSelenium")
library(RSelenium)
library(rvest)

driver <- rsDriver(4444L, "firefox")
fox_client <- driver$client

url <- "https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1"
fox_client$navigate(url = url)

html <- fox_client$getPageSource()[[1]]

read_html(html) %>% 
    html_nodes(".fb-price:nth-child(2)") %>% 
    html_text()

#### OUTPUT ####

 [1] "S/  1,599 (Normal)"  "S/  3,999 (Normal)"  "S/  2,399 (Normal)"  "S/  1,149 (Normal)" 
 [5] "S/  1,399 (Normal)"  "S/  1,699 (Normal)"  "S/  4,999 (Normal)"  "S/  7,999 (Normal)" 
 [9] "S/  3,499 (Normal)"  "S/  12,999 (Normal)" "S/  9,798 (Normal)"  "S/  1,999 (Normal)" 
[13] "S/  2,499 (Normal)"  "S/  1,299 (Normal)"  "S/  2,499 (Normal)"  "S/  3,599 (Normal)" 
[17] "S/  8,999 (Normal)"  "S/  2,499 (Normal)"  "S/  8,599 (Normal)"  "S/  1,499 (Normal)" 
[21] "S/  2,199 (Normal)"  "S/  1,199 (Normal)"  "S/  699 (Normal)"    "S/  999 (Normal)"   
[25] "S/  29,999 (Normal)" "S/  499 (Normal)"    "S/  699 (Normal)"    "S/  4,999 (Normal)" 
[29] "S/  17,999 (Normal)" "S/  1,399 (Normal)" 
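
One possible way to turn such strings into numbers, as a sketch in base R. It assumes the prices never carry decimals, which holds for the values shown above but may not hold in general.

```r
# A few of the strings from the output above.
price_text <- c("S/  1,599 (Normal)", "S/  12,999 (Normal)", "S/  699 (Normal)")

# Drop everything that is not a digit (currency symbol, spaces,
# thousands separator, label), then convert to numeric.
price_value <- as.numeric(gsub("[^0-9]", "", price_text))

print(price_value)
```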

You can also navigate the page using findElement and clickElement.

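Since you are considering RSelenium, here is a solution using that package.

You can find these elements via XPath. In your case, the XPath would be: /html/body/div/main/div/div/div/section/div/div/div/div/div/a/div/p.

Similar to @gersht's solution, but using only RSelenium.

Reproducible example: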

library(RSelenium)

rD <- rsDriver() 
remDr <- rD$client

url <- "https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1"
remDr$navigate(url)
priceElems = remDr$findElements(
  using = "xpath", 
  value = "/html/body/div/main/div/div/div/section/div/div/div/div/div/a/div[@class = 'fb-price-list']"
)

rawPrices = sapply(
  X = priceElems, 
  FUN = function(elem) elem$getElementText()
)

splitted = sapply(
  X = rawPrices, 
  FUN = strsplit, 
  split = "\nS/"
)

prices = data.frame(
  internetPrices = sapply(splitted, "[", 1),
  normalPrices = sapply(splitted, "[", 2)
)

Result/output:

> head(prices, 8)
       internetPrices    normalPrices
1 S/ 1,099 (Internet)  1,599 (Normal)
2 S/ 2,299 (Internet)  3,999 (Normal)
3 S/ 1,699 (Internet)  2,399 (Normal)
4   S/ 999 (Internet)  1,149 (Normal)
5   S/ 999 (Internet)  1,399 (Normal)
6 S/ 1,399 (Internet)  1,699 (Normal)
7 S/ 2,199 (Internet)            <NA>
8 S/ 2,699 (Internet)  4,999 (Normal)
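
Note that splitting on "\nS/" consumes the currency prefix of the second price, which is why the normalPrices column has no "S/". If you want to keep it, one option is to split on the newline only. A minimal sketch on made-up element texts (the second one is hypothetical and has no normal price):

```r
# Toy stand-in for rawPrices: the text of each fb-price-list element.
rawPrices <- c("S/ 1,699 (Internet)\nS/ 2,399 (Normal)",
               "S/ 2,199 (Internet)")

# Split on the line break only, so each piece keeps its "S/" prefix.
splitted <- strsplit(rawPrices, "\n", fixed = TRUE)

prices <- data.frame(
  internetPrices = sapply(splitted, "[", 1),
  normalPrices = sapply(splitted, "[", 2),  # NA when there is no second line
  stringsAsFactors = FALSE
)

print(prices)
```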

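Setup:

If needed, see here for how to set up RSelenium: How to set up rselenium for R?

Edit:

Following the remarks in the comments, to also capture empty elements I fetch the parent element and then process the price text.

The parent element is /html/body/div/main/div/div/div/section/div/div/div/div/div/a/div[@class = 'fb-price-list'], which contains an empty string if one of the prices is not available.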