How to select and scrape specific texts out of a bunch of <ul> and <li>?

Question

I need to scrape "2015" and "09/09/2015" from the link below:

lacentrale.fr/auto-occasion-annonce-87102353714.html

But because there are so many li and ul elements, I can't scrape the exact text. I used the following code. Thanks a lot for your help.

from bs4 import BeautifulSoup

# HTML holds the page source as a string
soup = BeautifulSoup(HTML, "html.parser")
soup.find('span', {'class':'optionLabel'}).find_next('span').get_text()

Answer 1

Try:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
}

url = "https://www.lacentrale.fr/auto-occasion-annonce-87102353714.html"

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

v1 = soup.select_one('.optionLabel:-soup-contains("Année") + span')
v2 = soup.select_one(
    '.optionLabel:-soup-contains("Mise en circulation") + span'
)

print(v1.text)
print(v2.text)

Prints:

2015
09/09/2015
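
The same selector can be tried without a network request. The sketch below runs it against a small hand-written HTML fragment. The markup is an assumption inferred from the selectors used in this thread (an `.optionLabel` span wrapping a button, immediately followed by a span holding the value), not a copy of the real page, and `value_for` is a hypothetical helper name. It also returns `None` instead of raising when a label is not found.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page's markup may differ.
SAMPLE_HTML = """
<ul>
  <li><span class="optionLabel"><button>Année</button></span><span>2015</span></li>
  <li><span class="optionLabel"><button>Mise en circulation</button></span><span>09/09/2015</span></li>
  <li><span class="optionLabel"><button>Énergie</button></span><span>Electrique</span></li>
</ul>
"""


def value_for(soup, label):
    """Return the text of the span right after the matching label, or None."""
    # ':-soup-contains' filters labels by their text; '+ span' picks the
    # span that is the immediately following sibling of that label.
    node = soup.select_one(f'.optionLabel:-soup-contains("{label}") + span')
    return node.get_text(strip=True) if node else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
year = value_for(soup, "Année")
first_registration = value_for(soup, "Mise en circulation")
missing = value_for(soup, "Garantie")  # label not present in the fragment

print(year, first_registration, missing, sep="\n")
```

Checking for `None` before reading `.text` avoids an `AttributeError` if the site renames or drops a label.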

Answer 2

I'm a fan of CSS selectors and :-soup-contains(), as mentioned in @Andrej's answer. So, just in case, here is an alternative approach if at some point more of the options are needed.

Generate a dict of all options, with the option label as the key, and pick the relevant values from it:

data = dict((e.button.text,e.find_next('span').text) for e in soup.select('.optionLabel'))

The data looks like this:

{'Année': '2015', 'Mise en circulation': '09/09/2015', 'Contrôle technique': 'requis', 'Kilométrage compteur': '68 736 Km', 'Énergie': 'Electrique', 'Rechargeable': 'oui', 'Autonomie batterie': '190 Km', 'Capacité batterie': '22 kWh', 'Boîte de vitesse': 'automatique', 'Couleur extérieure': 'gris foncé metal', 'Couleur intérieure': 'cuir noir', 'Nombre de portes': '5', 'Nombre de places': '4', 'Garantie': '6 mois', 'Première main (déclaratif)': 'non', 'Nombre de propriétaires': '2', 'Puissance fiscale': '3 CV', 'Puissance din': '102 ch', 'Puissance moteur': '125 kW', "Crit'Air": '0', 'Émissions de CO2': '0 g/kmA', 'Norme Euro': 'EURO6', 'Prime à la conversion': ''}

Example

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'}
url = 'https://www.lacentrale.fr/auto-occasion-annonce-87102353714.html'

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")

data = dict((e.button.text,e.find_next('span').text) for e in soup.select('.optionLabel'))

print(data['Année'], data['Mise en circulation'], sep='\n')

Output

2015
09/09/2015
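
The one-liner above assumes every `.optionLabel` contains a `<button>` and is followed by a `<span>`. A slightly more defensive variant is sketched below on a hand-written fragment. The markup is an assumption inferred from the code in this answer, and `options_to_dict` is a hypothetical helper name. It reads the label text from the label element itself and skips labels that have no following span.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page's markup may differ.
SAMPLE_HTML = """
<ul>
  <li><span class="optionLabel"><button>Année</button></span><span>2015</span></li>
  <li><span class="optionLabel"><button>Mise en circulation</button></span><span>09/09/2015</span></li>
  <li><span class="optionLabel">Garantie</span><span> 6 mois </span></li>
  <li><span class="optionLabel">Norme Euro</span></li>
</ul>
"""


def options_to_dict(soup):
    """Map each option label to the text of the next span in the document."""
    data = {}
    for label in soup.select(".optionLabel"):
        value = label.find_next("span")
        if value is None:
            # No span follows this label, so there is nothing to record.
            continue
        # get_text() works whether or not the label wraps a <button>.
        data[label.get_text(strip=True)] = value.get_text(strip=True)
    return data


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
data = options_to_dict(soup)

print(data["Année"], data["Mise en circulation"], sep="\n")
```

Using `data.get("Garantie")` rather than `data["Garantie"]` at the call site gives `None` instead of a `KeyError` for options that a particular listing does not have.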


python

beautifulsoup

web-scraping
