Scrape emojis with text in Beautiful Soup

Question

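I'm trying to scrape a page using Python and Beautiful Soup (bs4).

I want to keep the text in the page's <p> elements as well as the emojis in that text.

My first attempt was: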

import urllib
import urllib.request
from bs4 import BeautifulSoup

urlobject = urllib.request.urlopen("https://example.com")

soup = BeautifulSoup(urlobject, "lxml")

result= list(map(lambda e: e.getText(), soup.find_all("p", {"class": "text"})))

But this doesn't include the emojis. I then tried removing .getText() and keeping the elements themselves:

result= list(map(lambda e: e, soup.find_all("p", {"class": "text"})))

This made me realize that on this site the emojis are in the alt attribute of img tags:

<p class="text">I love the night<img alt="" class="emoji" src="etc"/><span>!</span></p>

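So what I want to do is:

- call getText() on each p with class text
- but for each img with class=emoji, take its alt

and keep the text and the emojis together as one sentence.

Is there a way to do this?

Any help would be appreciated.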

Answer 1

How about the following, which returns a tuple of the target data for each p? I just used your example p element twice as input for this test:

from bs4 import BeautifulSoup

s = """
<p class="text">I love the night<img alt="" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the night<img alt="" class="emoji" src="etc"/><span>!</span></p>
"""

soup = BeautifulSoup(s, 'lxml')

elements = soup.find_all('p', {'class': 'text'})
print(list(map(lambda e: (e.getText(), e.find('img', {'class': 'emoji'})['alt']), elements)))

Result:

[('I love the night!', ''), ('I love the night!', '')]
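
One thing to watch with the tuple approach: e.find(...) returns None when a p has no emoji img, so indexing ['alt'] would raise a TypeError on such paragraphs. A minimal sketch that tolerates this by collecting the alt values into a list is shown here. The sample emoji, the second paragraph, and the "html.parser" backend are my own assumptions for illustration, not taken from the original page.

```python
from bs4 import BeautifulSoup

# Made-up sample emoji (assumption): the alt values in the post are empty
MOON = "\N{CRESCENT MOON}"

html = f"""
<p class="text">I love the night<img alt="{MOON}" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
"""

# "html.parser" is used only to avoid the lxml dependency; "lxml" should work too
soup = BeautifulSoup(html, "html.parser")

# One (text, [alt, ...]) tuple per paragraph; the list is empty if there is no emoji
pairs = [
    (p.get_text(), [img.get("alt", "") for img in p.select("img.emoji")])
    for p in soup.find_all("p", class_="text")
]
print(pairs)
```

This still keeps the text and the emojis separate, so it does not preserve where each emoji sat in the sentence.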

Answer 2

If img.emoji is optional, you can try the code below, which keeps each emoji in its original position:

from bs4 import BeautifulSoup

urlobject = '''<p class="text">I love the night<img alt="" class="emoji" src="etc"/><span>!</span></p>
<p class="text">I love the day<span>!</span></p>
<p class="text">I love the music<img alt="" class="emoji" src="etc"/> <img alt="" class="emoji" src="etc"/><span>!</span></p>
'''

soup = BeautifulSoup(urlobject, 'lxml')

result = []
for p in soup.find_all('p', {'class': 'text'}):
    # Swap each emoji <img> for its alt text so it stays in position;
    # the loop simply does nothing when the paragraph has no emoji
    for em in p.select('img.emoji'):
        em.replace_with(em['alt'])
    result.append(p.getText())

print(result)

Result:

['I love the night!', 'I love the day!', 'I love the music !']
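
The replacement approach modifies the parsed tree. If you would rather leave the soup untouched, a rough alternative is to walk each paragraph's descendants and join the text nodes with the alt values as you go. This is only a sketch: text_with_emoji is a hypothetical helper name, and the sample emoji and the "html.parser" backend are my assumptions rather than anything from the original page.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up sample emoji (assumption): the alt values in the post are empty
NOTE = "\N{MUSICAL NOTE}"

html = (
    f'<p class="text">I love the music<img alt="{NOTE}" class="emoji" src="etc"/> '
    f'<img alt="{NOTE}" class="emoji" src="etc"/><span>!</span></p>'
)


def text_with_emoji(p):
    """Hypothetical helper: text of p with emoji alt values kept inline."""
    parts = []
    for node in p.descendants:
        if isinstance(node, Comment):
            continue  # skip HTML comments, as get_text() does
        if isinstance(node, NavigableString):
            parts.append(str(node))
        elif (isinstance(node, Tag) and node.name == "img"
              and "emoji" in node.get("class", [])):
            parts.append(node.get("alt", ""))
    return "".join(parts)


soup = BeautifulSoup(html, "html.parser")
result = [text_with_emoji(p) for p in soup.find_all("p", class_="text")]
print(result)
```

Because nothing is replaced, the img tags are still in the soup afterwards, which helps if you also need their src later.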


python

beautifulsoup

emoji