Real estate website scrape with links using Python

Question

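I am trying to scrape a commercial real estate website with Python and BeautifulSoup so that the corresponding href also appears in the final CSV. However, the link column is always empty. How can I extract the href, and how can I schedule this task to run weekly across the whole website? Thanks in advance!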

from bs4 import BeautifulSoup
import requests
from csv import writer
import re


url = "https://objektvision.se/lediga_lokaler/stockholm/city"
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all('a', class_ ="ov--list-item d-flex")

with open('lokal_stockholm_city_v11.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['title', 'location', 'area','link']
    thewriter.writerow(header)
    
    
    for list in lists:
        title = list.find('div', class_="font-weight-bold text-ov street-address").text.replace('\r\n','')
        location = list.find('div', class_="text-ov-dark-grey area-address").text.replace('\r\n','')
        area = list.find('div', class_="font-weight-bold size").text.replace('\r\n','')
        link =list.find('a', attrs_={'href': re.compile("^https://objektvision.se/Beskriv/")})
            
       
      
        info = [title,location, area,link]
        thewriter.writerow(info)

The final csv looks like this

Answer 1

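Focusing on getting the href, there are two points you should be aware of: the href values in your soup do not start with the domain (they are relative), and you do not have to find the <a>, because each element in your ResultSet already is that <a> and you are processing it directly.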

So to get your href, simply call .get('href') or ['href'] on the element and concatenate it with the base URL:

    link = 'https://objektvision.se/'+e['href']
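
To make the two points concrete, here is a minimal sketch on inline markup. The HTML string is a hypothetical, trimmed-down stand-in for the real page (only the class names are taken from the question). It shows that find('a') on an element that is itself the <a> returns None, because find() only searches descendants, and csv.writer writes None as an empty field, which would explain the empty link column. Note also that the documented keyword for attribute filters is attrs, not attrs_.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names used in the question.
html = """
<a class="ov--list-item d-flex" href="/Beskriv/218003079?IsPremium=True">
  <div class="font-weight-bold text-ov street-address">Kungsgatan 49</div>
</a>
<a class="ov--list-item d-flex">
  <div class="font-weight-bold text-ov street-address">No link here</div>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("a", class_="ov--list-item d-flex")

# find() only looks at descendants, so searching for an <a> inside the <a> finds nothing.
nested = items[0].find("a")

# Both forms read the attribute from the element itself.
href_by_key = items[0]["href"]
href_by_get = items[0].get("href")

# .get() returns None when the attribute is missing; ["href"] would raise KeyError.
missing = items[1].get("href")

print(nested, href_by_key, missing)
```

Which form to use is a judgment call: ['href'] fails loudly on a listing without a link, while .get('href') lets you skip or default such rows.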

Example

Note: Do not use list as a variable name, since it shadows the built-in type. Change it to e (for element).

from bs4 import BeautifulSoup
import requests
from csv import writer

url = "https://objektvision.se/lediga_lokaler/stockholm/city"
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all('a', class_ ="ov--list-item d-flex")

with open('lokal_stockholm_city_v11.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['title', 'location', 'area','link']
    thewriter.writerow(header)

    for e in lists:
        title = e.find('div', class_="font-weight-bold text-ov street-address").text.replace('\r\n','')
        location = e.find('div', class_="text-ov-dark-grey area-address").text.replace('\r\n','')
        area = e.find('div', class_="font-weight-bold size").text.replace('\r\n','')
        link = 'https://objektvision.se/'+e['href']

        info = [title,location, area,link]
        thewriter.writerow(info)

Output

title	location	area	link
Kungsgatan 49	City , Stockholm	923 m²	https://objektvision.se//Beskriv/218003079?IsPremium=True
Sveavägen 20	City , Stockholm	1 000 - 2 200 m²	https://objektvision.se//Beskriv/218017049?IsPremium=True
Sergelgatan 8-14/Sveavägen 5-9 /Mäste...	City , Stockholm	1 373 m²	https://objektvision.se//Beskriv/218030745?IsPremium=True
Adolf Fredriks Kyrkogata 13	Stockholm	191 m²	https://objektvision.se//Beskriv/218031939
Arena Sergel - Malmskillnadsgatan 36	City , Stockholm	1 - 3 000 m²	https://objektvision.se//Beskriv/218006788
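
The links in the output above contain a double slash after the domain, because the base string ends in / and each relative href starts with /. One possible way to avoid this, not part of the original answer, is to let the standard library's urllib.parse.urljoin combine the two. The helper name absolute_link and the constant BASE_URL are made up for this sketch.

```python
from urllib.parse import urljoin

# Assumed base URL, taken from the domain used in the question.
BASE_URL = "https://objektvision.se/"


def absolute_link(href):
    """Resolve a relative href against the site's base URL."""
    return urljoin(BASE_URL, href)


print(absolute_link("/Beskriv/218031939"))
# → https://objektvision.se/Beskriv/218031939
```

In the loop this would replace the string concatenation with link = absolute_link(e['href']). urljoin also leaves an href that is already absolute unchanged.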


python

beautifulsoup

web-scraping
