Scraping Text incl. Emojis with BeautifulSoup
Thanks a lot for any help with this. I am trying to scrape forum posts including the emojis they contain. Getting the text works, but the emojis are not included, and I would like to stitch them back into the text using the function you see below. Thanks for your help!
For the link below, the smiley images carry class="smilies".
Here is my code:
### import
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
### first, create an empty dataframe where the final results will be stored
df = pd.DataFrame()
### second, create a function to get all the user comments
def get_comments(lst_name):
    # find all user comments and save them to a list
    comment = bs.find_all(class_="content")
    # iterate over the comments, extract the text and strip the strings
    for c in comment:
        lst_name.append(c.get_text(strip=True))
    # return the list
    return lst_name
### third, start the scraping
link = 'https://vegan-forum.de/viewtopic.php?f=54&t=8325&start=120'
# create the lists for the functions
user_comments = []
# get the content
page = requests.get(link)
html = page.content
bs = BeautifulSoup(html, 'html.parser')
# call the functions to get the information
get_comments(user_comments)
# create a pandas dataframe for the comments
comments_dict = {
    'user_comments': user_comments
}
df_comments_info = pd.DataFrame(data=comments_dict)
# concatenate the temporary dataframe with the dataframe created earlier
# (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, df_comments_info], ignore_index=True)
One way is to replace every <img class="smilies"> with its alt text. For example:
### import
import requests
import pandas as pd
from bs4 import BeautifulSoup
### first, create an empty dataframe where the final results will be stored
df = pd.DataFrame()
### second, create a function to get all the user comments
def get_comments(lst_name):
    # replace every <img class="smilies"> with its alt text:
    for img in bs.select("img.smilies"):
        img.replace_with(img["alt"])
    # merge adjacent text nodes so the alt text joins the surrounding text
    bs.smooth()
    # find all user comments and save them to a list
    comment = bs.find_all(class_="content")
    # iterate over the comments, extract the text and strip the strings
    for c in comment:
        lst_name.append(c.get_text(strip=True))
    # return the list
    return lst_name
### third, start the scraping
link = "https://vegan-forum.de/viewtopic.php?f=54&t=8325&start=120"
# create the lists for the functions
user_comments = []
# get the content
page = requests.get(link)
html = page.content
bs = BeautifulSoup(html, "html.parser")
# call the functions to get the information
get_comments(user_comments)
# create a pandas dataframe for the comments
comments_dict = {"user_comments": user_comments}
df_comments_info = pd.DataFrame(data=comments_dict)
# concatenate the temporary dataframe with the dataframe created earlier
# (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, df_comments_info], ignore_index=True)
print(df)
Prints:
...
Danke!Erst Mal sollte ich bei den Tabletten bleiben. Hab die ja schon Mal genommen. Genau die gleichen wie ihr mir empfiehlt. Aber die sind fast leer und auf Amazon gibt's die nicht mehr.Soll ich bei der Quelle nachfragen oder denkst du findest es günstiger? :)
...
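The replacement step can be checked offline on a small snippet, without hitting the forum at all. The markup below is a hypothetical stand-in for the forum's HTML; as on the real page, the smiley's text lives in the img tag's alt attribute:

```python
from bs4 import BeautifulSoup

# hypothetical snippet mimicking the forum markup
html = '<div class="content">Danke! <img class="smilies" alt=":)" src="s.gif"/></div>'
soup = BeautifulSoup(html, "html.parser")

# replace each smiley image with its alt text so get_text() keeps it
for img in soup.select("img.smilies"):
    img.replace_with(img["alt"])
# merge the now-adjacent text nodes into one string
soup.smooth()

text = soup.find(class_="content").get_text(strip=True)
print(text)  # → Danke! :)
```

Without the smooth() call, "Danke! " and ":)" remain separate text nodes, and get_text(strip=True) strips each one individually, producing "Danke!:)" with the space lost.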