How to extract text in second p element inside div
I have a div that contains 2 p tags. I need to get the text of the second p element.
<div class="fb-price-list">
<p class="fb-price">S/ 1,699 (Internet)</p>
<p class="fb-price">S/ 2,399 (Normal)</p>
</div>
Expected result:
S/ 2,399 (Normal)
I have this, but it doesn't work:
tvs_url <- read_html("https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1")
product_price_actual <- tvs_url %>%
html_nodes('div.pod-group pod-group__large-pod div.pod-body div.fb-price-list p.fb-price:nth-child(2)') %>%
html_text()
html:
<div class="pod-item"><div class="fb-form__input--checkbox fb-pod__item__compare"><input id="fb-pod__item__input-16754140" class="fb-pod__item__compare__input" type="checkbox" name="fb-pod__item__input-16754140" value="16754140"><label for="fb-pod__item__input-16754140" class="fb-pod__item__compare__label">Comparar</label></div><div class="pod-head"><a class="pod-head__image" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="content__image"><img src="//falabella.scene7.com/is/image/FalabellaPE/16754140?wid=544&hei=544&qlt=70&anchor=750,750&crop=0,0,0,0" alt="img" class="image"></div></a><a href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140" class="pod-head__stickerslink"><div class="pod-head__stickers"><div class="fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff" data-discount-content="">29%</div></div></a></div><div class="pod-body"><a class="section__pod-top" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="section__pod-top-brand">SAMSUNG</div><div class="section__pod-top-title"><div class="LinesEllipsis ">LED UHD 4K 55" Smart TV UN55RU7100GXPE SERIE RU7100<wbr></div></div></a><div class="section__pod-middle"><div class="section__pod-middle-content__stickers"><div class="fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff" data-discount-content="">29%</div></div><div class="section__information"><a class="section__information-link" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="fb-price-list"><p class="fb-price">S/ 1,699 (Internet)</p><p class="fb-price">S/ 2,399 (Normal)</p></div></a></div><div class="section__pod-middle-content__button"><button class="btn-add-to-basket">AGREGAR A TU BOLSA</button></div></div><div class="section__pod-bottom"><div class="fb-pod__rating" 
style="visibility: hidden;"><a href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140#comments"><div class="fb-rating-stars"><div class="fb-rating-stars__container"><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><p class="fb-rating-stars__count">0 <span class="fb-rating-stars__count__max"> / 5</span></p></div></div></a></div><a class="section__pod-bottom-descriptionlink" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><ul class="section__pod-bottom-description"><li>Modelo: UN55RU7100GXPE</li><li>Tamaño de la pantalla: 55"</li><li>Resolución: 4K Ultra HD</li><li>Tecnología: Led</li><li>Conexión bluetooth: Sí</li></ul></a></div></div></div>
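Update 1:
Based on the selected answer, I use ifelse to check the number of characters at a given position. The position being checked is the 4th: when there is no precio_antes (previous price), that position is occupied by another element, so in those cases we need to put NA: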
ifelse(nchar(sapply(splitted, "[", 4))>3, NA, sapply(splitted, "[", 6))
How I built the final df:
df <- data.frame(
brand = sapply(splitted, "[", 2), #We don't need the "comparar" text so we start from 2
product = sapply(splitted, "[", 3),
precio_antes = ifelse(nchar(sapply(splitted, "[", 4))>3, NA, sapply(splitted, "[", 6)),
precio_actual = ifelse(nchar(sapply(splitted, "[", 4))<=3, sapply(splitted, "[", 5), sapply(splitted, "[", 4))
)
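To make the position check concrete, here is a minimal sketch on two made-up rows. The layout of splitted is an assumption inferred from the code above (position 4 holds a short discount badge such as "29%" when a previous price exists, and the current price otherwise); the sample values are hypothetical.

```r
# Hypothetical rows: one discounted item, one without a previous price
splitted <- list(
  c("Comparar", "SAMSUNG", "LED UHD 4K 55\" Smart TV", "29%",
    "S/ 1,699 (Internet)", "S/ 2,399 (Normal)"),
  c("Comparar", "LG", "LED 43\" Smart TV", "S/ 2,199 (Internet)")
)

pos4 <- sapply(splitted, "[", 4)

# A short string at position 4 is treated as the discount badge,
# so the previous price sits at position 6; otherwise there is none.
precio_antes  <- ifelse(nchar(pos4) > 3, NA, sapply(splitted, "[", 6))
precio_actual <- ifelse(nchar(pos4) <= 3, sapply(splitted, "[", 5), pos4)

print(data.frame(precio_antes, precio_actual))
```

The check relies on the badge never being longer than 3 characters, so it is worth verifying against the real data before trusting it.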
Here I use a CSS selector to select the node with class fb-price-list and then select its second p child node:
library(rvest)
"<div class=\"pod-item\"><div class=\"fb-form__input--checkbox fb-pod__item__compare\"><input id=\"fb-pod__item__input-16754140\" class=\"fb-pod__item__compare__input\" type=\"checkbox\" name=\"fb-pod__item__input-16754140\" value=\"16754140\"><label for=\"fb-pod__item__input-16754140\" class=\"fb-pod__item__compare__label\">Comparar</label></div><div class=\"pod-head\"><a class=\"pod-head__image\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"content__image\"><img src=\"//falabella.scene7.com/is/image/FalabellaPE/16754140?wid=544&hei=544&qlt=70&anchor=750,750&crop=0,0,0,0\" alt=\"img\" class=\"image\"></div></a><a href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\" class=\"pod-head__stickerslink\"><div class=\"pod-head__stickers\"><div class=\"fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff\" data-discount-content=\"\">29%</div></div></a></div><div class=\"pod-body\"><a class=\"section__pod-top\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"section__pod-top-brand\">SAMSUNG</div><div class=\"section__pod-top-title\"><div class=\"LinesEllipsis \">LED UHD 4K 55\" Smart TV UN55RU7100GXPE SERIE RU7100<wbr></div></div></a><div class=\"section__pod-middle\"><div class=\"section__pod-middle-content__stickers\"><div class=\"fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff\" data-discount-content=\"\">29%</div></div><div class=\"section__information\"><a class=\"section__information-link\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"fb-price-list\"><p class=\"fb-price\">S/ 1,699 (Internet)</p><p class=\"fb-price\">S/ 2,399 (Normal)</p></div></a></div><div class=\"section__pod-middle-content__button\"><button class=\"btn-add-to-basket\">AGREGAR A TU 
BOLSA</button></div></div><div class=\"section__pod-bottom\"><div class=\"fb-pod__rating\" style=\"visibility: hidden;\"><a href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140#comments\"><div class=\"fb-rating-stars\"><div class=\"fb-rating-stars__container\"><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><p class=\"fb-rating-stars__count\">0 <span class=\"fb-rating-stars__count__max\"> / 5</span></p></div></div></a></div><a class=\"section__pod-bottom-descriptionlink\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><ul class=\"section__pod-bottom-description\"><li>Modelo: UN55RU7100GXPE</li><li>Tamaño de la pantalla: 55\"</li><li>Resolución: 4K Ultra HD</li><li>Tecnología: Led</li><li>Conexión bluetooth: Sí</li></ul></a></div></div></div>" %>%
read_html() %>%
html_nodes(".fb-price-list p:nth-child(2)") %>%
html_text()
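The same selector can be tried on a much smaller document. The sketch below uses a made-up snippet with two price lists, one of which has no second p, and selects per parent with html_node so that the missing price comes back as NA instead of being dropped. The snippet and variable names are illustrative assumptions, not taken from the site.

```r
library(rvest)

# Hypothetical snippet: the second price list has only one <p>
doc <- read_html('
  <div class="fb-price-list">
    <p class="fb-price">S/ 1,699 (Internet)</p>
    <p class="fb-price">S/ 2,399 (Normal)</p>
  </div>
  <div class="fb-price-list">
    <p class="fb-price">S/ 2,199 (Internet)</p>
  </div>')

# html_node() returns one result per parent node, so the output
# stays aligned with the list of products
second_prices <- doc %>%
  html_nodes(".fb-price-list") %>%
  html_node("p:nth-child(2)") %>%
  html_text()

print(second_prices)
```

Selecting per parent keeps the result the same length as the product list, which matters when the prices are later combined with brand and product columns.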
tl;dr
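The content is loaded dynamically, but it is already present in the page source as a string: it sits in a JavaScript dictionary, which can be pulled out with a regex and then parsed with a JSON parser.
If you open the developer tools with F12 and inspect the page HTML, you will see a script tag containing a JavaScript dictionary that can be extracted and processed with a JSON parser. This means you could target that script tag, extract the text from the node, and take a substring, but I prefer to use a regex on the string (note that I extract the body as a string; using regex on HTML is generally not recommended, but on a plain string it is fine).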
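Code output:
json$state$searchItemList$resultList$prices
gives you a list of length 32 containing data frames. In each data frame, originalPrice contains the information you want (the row where the label column == (Normal)).
Not every item has an original price. The following is a simple, though not necessarily the most efficient, way to write out the values: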
l <- json$state$searchItemList$resultList$prices
for (i in l){
if (length(i$originalPrice)>1){
print(i$originalPrice[2])
} else {
print("No original price")
}
}
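If the values are needed as a vector rather than printed, the same length check can be wrapped in vapply. This is only a sketch: the two data frames below are hypothetical stand-ins whose columns (label, originalPrice) are assumed from the description above, and the real parsed structure may differ.

```r
# Hypothetical stand-in for json$state$searchItemList$resultList$prices
l <- list(
  data.frame(label = c("(Internet)", "(Normal)"),
             originalPrice = c("1,699", "2,399"),
             stringsAsFactors = FALSE),
  data.frame(label = "(Internet)",
             originalPrice = "2,199",
             stringsAsFactors = FALSE)
)

# Same test as the loop: take the second entry when there is one, else NA
normal_prices <- vapply(
  l,
  function(i) if (length(i$originalPrice) > 1) i$originalPrice[2] else NA_character_,
  character(1)
)

print(normal_prices)
```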
R
library(rvest)
library(jsonlite)
library(stringr)
url = 'https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1'
r <- read_html(url) %>%
html_node('body') %>%
html_text() %>%
toString()
x <- str_match_all(r,'fbra_browseProductListConfig = (.*);')
json <- jsonlite::fromJSON(x[[1]][,2])
print(json$state$searchItemList$resultList$prices)
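Regex explanation:
The pattern matches the literal text fbra_browseProductListConfig = and captures everything that follows it, up to the last ; on the same line, in group 1. That captured group is the JSON string passed to fromJSON.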
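The behaviour of the pattern can be checked on a small string. In the sketch below the page text and the JSON payload are invented for illustration; only the pattern itself comes from the code above. Because . does not match a newline in stringr, the greedy (.*) stops at the last ; on the line that holds the assignment.

```r
library(stringr)
library(jsonlite)

# Hypothetical page text: the assignment sits on its own line
page_text <- paste(
  "var a = 1;",
  "var fbra_browseProductListConfig = {\"state\":{\"count\":2}};",
  "var b = 3;",
  sep = "\n"
)

# Column 1 is the full match, column 2 is the captured group
m <- str_match(page_text, "fbra_browseProductListConfig = (.*);")
json_text <- m[, 2]

parsed <- fromJSON(json_text)
print(json_text)
print(parsed$state$count)
```

If the assignment shared a line with other statements, the greedy capture would run on to the last ; of that line, so a check that the captured text parses as JSON is a sensible safeguard.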
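The content appears to be dynamic, so the data comes from somewhere else. I looked for GET responses containing JSON, XML, etc., but found nothing. At this point I would go with RSelenium. The following should extract the correct nodes. You can use whatever method you like to extract the numbers from the resulting strings: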
# install.packages("RSelenium")
library(RSelenium)
library(rvest)
driver <- rsDriver(4444L, "firefox")
fox_client <- driver$client
url <- "https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1"
fox_client$navigate(url = url)
html <- fox_client$getPageSource()[[1]]
read_html(html) %>%
html_nodes(".fb-price:nth-child(2)") %>%
html_text()
#### OUTPUT ####
[1] "S/ 1,599 (Normal)" "S/ 3,999 (Normal)" "S/ 2,399 (Normal)" "S/ 1,149 (Normal)"
[5] "S/ 1,399 (Normal)" "S/ 1,699 (Normal)" "S/ 4,999 (Normal)" "S/ 7,999 (Normal)"
[9] "S/ 3,499 (Normal)" "S/ 12,999 (Normal)" "S/ 9,798 (Normal)" "S/ 1,999 (Normal)"
[13] "S/ 2,499 (Normal)" "S/ 1,299 (Normal)" "S/ 2,499 (Normal)" "S/ 3,599 (Normal)"
[17] "S/ 8,999 (Normal)" "S/ 2,499 (Normal)" "S/ 8,599 (Normal)" "S/ 1,499 (Normal)"
[21] "S/ 2,199 (Normal)" "S/ 1,199 (Normal)" "S/ 699 (Normal)" "S/ 999 (Normal)"
[25] "S/ 29,999 (Normal)" "S/ 499 (Normal)" "S/ 699 (Normal)" "S/ 4,999 (Normal)"
[29] "S/ 17,999 (Normal)" "S/ 1,399 (Normal)"
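One possible way to turn such strings into numbers is to strip every non-digit character. This is a sketch that assumes the prices are whole amounts with a comma as thousands separator, as in the output above; it would give wrong results for prices with decimals.

```r
# A few strings in the same format as the output above
price_text <- c("S/ 1,599 (Normal)", "S/ 12,999 (Normal)", "S/ 699 (Normal)")

# Drop everything that is not a digit, then convert
price_num <- as.numeric(gsub("[^0-9]", "", price_text))

print(price_num)
```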
You can also use findElement and clickElement to navigate the page.
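Since you were already considering RSelenium, here is a solution with that package.
You can find the elements via xpath. In your case the xpath would be: /html/body/div/main/div/div/div/section/div/div/div/div/div/a/div/p.
This is similar to @gersht's solution, but uses only RSelenium.
Reproducible example: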
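library(RSelenium)
# Category page from the question
url <- "https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1"
rD <- rsDriver()
remDr <- rD$client
remDr$navigate(url)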
priceElems = remDr$findElements(
using = "xpath",
value = "/html/body/div/main/div/div/div/section/div/div/div/div/div/a/div[@class = 'fb-price-list']"
)
rawPrices = sapply(
X = priceElems,
FUN = function(elem) elem$getElementText()
)
splitted = sapply(
X = rawPrices,
FUN = strsplit,
split = "\nS/"
)
prices = data.frame(
internetPrices = sapply(splitted, "[", 1),
normalPrices = sapply(splitted, "[", 2)
)
Result/output:
> head(prices, 8)
internetPrices normalPrices
1 S/ 1,099 (Internet) 1,599 (Normal)
2 S/ 2,299 (Internet) 3,999 (Normal)
3 S/ 1,699 (Internet) 2,399 (Normal)
4 S/ 999 (Internet) 1,149 (Normal)
5 S/ 999 (Internet) 1,399 (Normal)
6 S/ 1,399 (Internet) 1,699 (Normal)
7 S/ 2,199 (Internet) <NA>
8 S/ 2,699 (Internet) 4,999 (Normal)
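The splitting step can be reproduced without a browser. The sketch below uses two made-up element texts in the format getElementText appears to return here (prices separated by a newline); it shows where the NA in row 7 comes from and how the "S/" prefix, which is consumed by the split, could be restored. The sample strings and the normalClean name are assumptions for illustration.

```r
# Hypothetical element texts: the second item has no normal price
rawPrices <- c("S/ 1,699 (Internet)\nS/ 2,399 (Normal)",
               "S/ 2,199 (Internet)")

# Splitting on "\nS/" removes the prefix from the second piece
splitted <- strsplit(rawPrices, split = "\nS/", fixed = TRUE)

internetPrices <- sapply(splitted, "[", 1)
normalPrices   <- sapply(splitted, "[", 2)  # NA when there is no second piece

# Put the prefix back where a normal price exists
normalClean <- ifelse(is.na(normalPrices), NA, paste0("S/", normalPrices))

print(data.frame(internetPrices, normalClean))
```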
Setup:
If needed, see here for how to set up RSelenium: How to set up rselenium for R?
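Edit:
Following the comments, to capture empty elements as well, I fetch the parent element and then process the price text.
The parent element is /html/body/div/main/div/div/div/section/div/div/div/div/div/a/div[@class = 'fb-price-list'], which contains an empty string if one of the prices is unavailable.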