How to exclude some nodes in R?

Question

I am scraping a website and want to exclude some nodes.

library(rvest)   # read_html(), html_nodes(), html_text() and the %>% pipe
library(tibble)  # tibble()

url <- "https://www.n11.com/telefon-ve-aksesuarlari/cep-telefonu?q=iphone+11"

gettitles <- read_html(url) %>% 
  html_nodes("div.productArea") %>% 
  html_nodes(":not(div.group.listingGroup.set6.promoGroup)") %>%
  html_nodes(xpath = '//*[@class="plink"]') %>% 
  html_text() %>% 
  tibble()

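I don't want the titles at the bottom of the page, below the pagination, but the exclusion doesn't work. There should be 28 titles, yet I get 42; 14 of them are the ones I wanted to leave out. What is wrong with this code? Thanks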

Answer 1

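A few attributes distinguish the titles at the bottom of the page from those in the main node. We can pick one of those attributes and filter the nodes on it. I have used "data-ctgid" here.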

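library(rvest)

url <- "https://www.n11.com/telefon-ve-aksesuarlari/cep-telefonu?q=iphone+11"
# Select all div.columnContent nodes on the page
nodes <- read_html(url) %>% html_nodes("div.columnContent")

# Keep only the nodes that carry a data-ctgid attribute,
# then read the title attribute of each product link inside them
nodes[!is.na(nodes %>% html_attr('data-ctgid'))] %>%
   html_nodes('div.pro a') %>%
   html_attr('title')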

#[1] "iPHONE 11 128 GB APPLE TÜRKİYE GARANTİLİ"                         
#[2] "APPLE İPHONE 11 64 GB (APPLE TÜRKİYE GARANTİLİ)"                  
#[3] "Apple iPhone 11 128GB (2 Yıl Apple Türkiye Garantili)"            
#[4] "Apple iPhone 11 Pro Max 64 GB (2 Yıl Apple Türkiye Garantili)"    
#[5] "Apple iPhone 11 Pro 64 GB (2 Yıl Apple Türkiye Garantili)"        
#[6] "APPLE İPHONE 11 64 GB (2 YIL APPLE TÜRKİYE GARANTİLİ)"       
#...
#...
#[27] "iPhone 11 Pro 64 GB"                                              
#[28] "Apple iPhone 11 64 GB (Distribütör Garantili)"
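
As for why the code in the question still returns all 42 titles, a likely reason is the XPath step: an expression that starts with `//` searches the whole document rather than the nodes selected so far, so the filtering before it is ignored. Separately, `html_nodes(":not(...)")` selects every descendant that is not the promo div itself, which still includes the elements inside that div. The sketch below reproduces the XPath behaviour offline. The markup is invented for illustration and is only an assumption about how the real page is structured (one container with `data-ctgid`, one without).

```r
library(rvest)

# Hypothetical miniature page, NOT the real n11.com markup:
# the main listing carries data-ctgid, the promo listing does not.
html <- '
<div class="productArea">
  <div class="columnContent" data-ctgid="1000">
    <div class="pro"><a class="plink" title="Phone A">Phone A</a></div>
    <div class="pro"><a class="plink" title="Phone B">Phone B</a></div>
  </div>
  <div class="columnContent promoGroup">
    <div class="pro"><a class="plink" title="Promo C">Promo C</a></div>
  </div>
</div>'
page <- read_html(html)

# CSS attribute selector: only containers that have a data-ctgid attribute
main <- html_nodes(page, "div.columnContent[data-ctgid]")

# "//" starts from the document root, so the promo link is found as well
absolute <- main %>%
  html_nodes(xpath = '//*[@class="plink"]') %>%
  html_attr("title")

# ".//" stays inside the nodes that were already selected
relative <- main %>%
  html_nodes(xpath = './/*[@class="plink"]') %>%
  html_attr("title")

print(absolute)
print(relative)
```

With this toy input, `absolute` has three titles and `relative` only the two from the main container. The `[data-ctgid]` selector is just a shorter way to write the `!is.na(html_attr(...))` filter from the answer above; whether it gives exactly 28 results on the live page depends on the site's current markup.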

html

r

rvest