Using BeautifulSoup to get text and image URLs from an HTML string

import json
import re
import urllib.request

from bs4 import BeautifulSoup


class BlogApi(object):
    def __init__(self):
        # Named posts_url so it does not shadow the json module.
        posts_url = "https://remaster.realmofthemadgod.com/index.php?rest_route=/wp/v2/posts/"
        with urllib.request.urlopen(posts_url) as response:
            self.post_json = json.loads(response.read().decode())

    async def content(self, thread=0, parse=True):
        """Returns content of blog post as string.
        Thread is 0 (latest) by default.
        Parse is True by default."""
        dirty_content = self.post_json[thread]['content']['rendered']
        if not parse:
            return dirty_content
        soup = BeautifulSoup(dirty_content, features="html.parser")
        # Collect the src of every .png image, separately from the text.
        images = []
        for img in soup.find_all('img', {'src': re.compile(r'\.png')}):
            images.append(img.get('src'))
        return images, soup.text

I'm using the class above to get all the text and image URLs from an HTML string. The full string looks like this: https://controlc.com/c3cdf2ef

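My problem is that the image URLs are returned separately, not in the same string as the text. My goal is to have each URL appear at the position its image occupies in the web page, in line with the text. For example, the string I get back should look like this: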

https://remaster.realmofthemadgod.com/wp-content/uploads/2022/05/steam_forgerenovation.png
Realmers,
The Forge is about to change. Coming in June the blacksmith will present the Heroes with her renovated forge. She’s equipped it with better and more reliable equipment, capable of making items no one thought could be made that way. 
Here’s what’s going to change:
https://remaster.realmofthemadgod.com/wp-content/uploads/2022/04/c5c640b1-a033-4547-aabb-5af37a8ce4c5-1024x616.png ...

It's actually longer and has more images, but that's the idea.

You can go ahead and replace the <img src=my_image.png/> elements with their src text, e.g.

for image in (images := soup.find_all('img', {'src':re.compile('.png')})):
    image.replace_with(image.get('src'))

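After that, calling soup.text gives you plain text only, with each image URL sitting where its <img> element used to be. It's more of a "pragmatic solution" than anything fancy, let alone a recommended approach.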
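
To see the replacement end to end, here is a minimal, self-contained sketch. SAMPLE_HTML and the helper name text_with_image_urls are made up for illustration (they stand in for the 'rendered' field of a real post and are not part of the question's code), and using get_text with a newline separator instead of soup.text is an assumption about the desired layout, so that each URL and paragraph lands on its own line.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the 'rendered' field of a post.
SAMPLE_HTML = (
    '<p><img src="https://example.com/a.png"></p>'
    '<p>Realmers,</p>'
    '<p>Here is what will change:</p>'
    '<p><img src="https://example.com/b.png"></p>'
)


def text_with_image_urls(html):
    """Return the text of html with each .png <img> replaced by its src URL."""
    soup = BeautifulSoup(html, features="html.parser")
    for image in soup.find_all('img', {'src': re.compile(r'\.png')}):
        # Swap the tag for a plain string holding its URL.
        image.replace_with(image.get('src'))
    # A newline separator keeps each URL and paragraph on its own line.
    return soup.get_text(separator="\n", strip=True)


print(text_with_image_urls(SAMPLE_HTML))
```

In the class from the question, the same loop could run right after the soup is built, and content() would then return a single string rather than the (images, text) pair.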