How can I get the first string from a div that has another element embedded (BeautifulSoup 4)

Question

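I'm trying to extract prices from a website.

The code I wrote does this, but when the site also shows an old price next to the current one, it returns None instead of the price string.

Here is a sample of the markup without an old price (my code returns the price as a string):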

<div class="xl-price rangePrice">
                            535.000 €  
                        </div>

Here is a sample of the markup with an old price (my code returns None):

<div class="xl-price rangePrice">
    487.000 €
    <span class="old-price">497.000 €<br></span>
</div>

The page I'm trying to extract the prices from: pagelink

My code:

prices = []
for items in soup.find_all("div", {"class": "xl-price rangePrice"}):
    prices.append(items.string)

print(prices)

Another problem I have is that the returned values look like this:

'\r\n\t\t\t\t\t\t\t\t298.000 € \r\n\t\t\t\t\t\t\t', '\r\n\t\t\t\t\t\t\t\t145.000 € \r\n\t\t\t\t\t\t\t'

when I only want the numbers.

Thanks a lot for your help!

Answer 1

Here is sample code for your problem.

import re

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000")
soup = BeautifulSoup(page.content, "html.parser")

# Digits with optional dot-separated groups, e.g. 487.000 or 1.250.000
price_re = r"\d+(?:\.\d+)*"

prices = []
for items in soup.find_all("div", {"class": "xl-price rangePrice"}):
    if items.string:
        # The div holds a single text node: no old price is shown.
        result = re.findall(price_re, items.string)
        if result:
            prices.append(result[0])
    else:
        # .string is None when the div has several children (text plus <span>),
        # so walk the direct children and keep the first price found.
        for item in items.children:
            if item.string:
                result = re.findall(price_re, item.string)
                if result:
                    prices.append(result[0])
                    break

print(prices)
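The same idea can be tried offline. The following is only a sketch, assuming markup shaped like the two snippets in the question; the `HTML` sample and the `first_price` helper are made up for illustration and are not part of the original answer. It also shows why the question's code got None: `.string` is only set when a tag has exactly one child.

```python
import re

from bs4 import BeautifulSoup

# Sample markup modelled on the two snippets in the question (assumption).
HTML = """
<div class="xl-price rangePrice">
    535.000 €
</div>
<div class="xl-price rangePrice">
    487.000 €
    <span class="old-price">497.000 €<br></span>
</div>
"""

# Digits with optional dot-separated groups, e.g. 487.000 or 1.250.000
PRICE_RE = re.compile(r"\d+(?:\.\d+)*")


def first_price(div):
    """Return the first price-like string inside div, or None (hypothetical helper)."""
    # stripped_strings yields every text node in document order, whitespace removed,
    # so the current price comes before the old price in the nested <span>.
    for text in div.stripped_strings:
        match = PRICE_RE.search(text)
        if match:
            return match.group(0)
    return None


soup = BeautifulSoup(HTML, "html.parser")
divs = soup.find_all("div", {"class": "xl-price rangePrice"})

# The second div has more than one child, so .string is None there.
print([div.string is None for div in divs])

prices = [first_price(div) for div in divs]
print(prices)
```

Because the helper stops at the first match, the old price inside the `<span>` is never collected.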

Answer 2

import requests
from bs4 import BeautifulSoup

r = requests.get(
    'https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000')
soup = BeautifulSoup(r.text, 'html.parser')

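for item in soup.find_all('div', attrs={'class': 'xl-price rangePrice'}):
    # contents[0] is the first child: the text node that holds the current price
    item = item.contents[0]
    # drop surrounding whitespace and the trailing euro sign
    print(item.strip().rstrip('€').strip())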
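To see what `contents[0]` returns without hitting the site, here is a small sketch on an inline sample. The `HTML` string (including its `\r\n\t` padding, copied from the style of the values shown in the question) and the helpers `clean_price` and `to_int` are assumptions for illustration. `to_int` additionally assumes that the dot is a thousands separator.

```python
from bs4 import BeautifulSoup

# Inline sample with the same kind of whitespace padding as in the question (assumption).
HTML = (
    '<div class="xl-price rangePrice">\r\n\t\t\t487.000 € \r\n\t\t\t'
    '<span class="old-price">497.000 €<br></span>\r\n\t\t</div>'
)


def clean_price(raw):
    """Strip surrounding whitespace and a trailing euro sign (hypothetical helper)."""
    return raw.strip().rstrip("€").strip()


def to_int(price):
    """Assumption: '.' is a thousands separator, so '487.000' means 487000."""
    return int(price.replace(".", ""))


soup = BeautifulSoup(HTML, "html.parser")
div = soup.find("div", {"class": "xl-price rangePrice"})

raw = div.contents[0]  # first child: the text node in front of the <span>
price = clean_price(raw)
print(repr(raw))
print(price, to_int(price))
```

Taking `contents[0]` relies on the price text being the very first child of the div; if the markup ever starts with another tag, the index would need to change.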

Answer 3

I don't have access to a computer right now, so please treat this as quasi-pseudocode:

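# div_elem is one <div class="xl-price rangePrice"> element
# first text node that is a direct child of the div (the current price)
new_price = div_elem.find(string=True, recursive=False)

# the old price, if any, lives in the nested <span class="old-price">
find_res = div_elem.find('span', attrs={'class': 'old-price'})

old_price = None
if find_res:
    old_price = find_res.get_text(strip=True)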

I tried to keep things easy to follow.

Let me know if you have any questions :)
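A runnable take on the pseudocode above might look like the sketch below. It assumes a BeautifulSoup version that accepts the `string=` argument (4.4.0 or later), and both the `HTML` sample and the `split_prices` wrapper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's snippet with an old price (assumption).
HTML = """
<div class="xl-price rangePrice">
    487.000 €
    <span class="old-price">497.000 €<br></span>
</div>
"""


def split_prices(div_elem):
    """Return (new_price, old_price); old_price is None without a span (hypothetical helper)."""
    # recursive=False restricts the search to direct children of the div,
    # so text inside the nested <span> is not considered here.
    direct_text = div_elem.find(string=True, recursive=False)
    new_price = direct_text.strip() if direct_text else None

    old_span = div_elem.find("span", attrs={"class": "old-price"})
    old_price = old_span.get_text(strip=True) if old_span else None
    return new_price, old_price


soup = BeautifulSoup(HTML, "html.parser")
div_elem = soup.find("div", attrs={"class": "xl-price rangePrice"})
new_price, old_price = split_prices(div_elem)
print(new_price, old_price)
```

Keeping the two lookups separate makes it possible to report both prices, or to ignore the old one when only the current price is wanted.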

python

text-extraction

beautifulsoup

data-extraction