Scraping Text incl. Emojis with BeautifulSoup
Thanks a lot for any help with this. I am trying to scrape forum posts including the emojis they contain. Getting the text works, but the emojis are not included, and I would like to stitch them back into the text using the function you see below. Thanks for your help!
For the link below, the smiley images carry class="smilies".
Here is my code:
### import
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
### first, create an empty dataframe where the final results will be stored
df = pd.DataFrame()
### second, create a function to get all the user comments
def get_comments(lst_name):
    # find all user comments and save them to a list
    comment = bs.find_all(class_="content")
    # iterate over the comments, extract the text and strip the strings
    for c in comment:
        lst_name.append(c.get_text(strip=True))
    # return the list
    return lst_name
### third, start the scraping
link = 'https://vegan-forum.de/viewtopic.php?f=54&t=8325&start=120'
# create the lists for the functions
user_comments = []
# get the content
page = requests.get(link)
html = page.content
bs = BeautifulSoup(html, 'html.parser')
# call the functions to get the information
get_comments(user_comments)
# create a pandas dataframe for the comments
comments_dict = {
    'user_comments': user_comments
}
df_comments_info = pd.DataFrame(data=comments_dict)
# concatenate the temporary dataframe with the dataframe created earlier
# (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, df_comments_info], ignore_index=True)
One way is to replace every <img class="smilies"> with its alt text. For example:
### import
import requests
import pandas as pd
from bs4 import BeautifulSoup
### first, create an empty dataframe where the final results will be stored
df = pd.DataFrame()
### second, create a function to get all the user comments
def get_comments(lst_name):
    # replace every <img class="smilies"> with its alt text:
    for img in bs.select("img.smilies"):
        img.replace_with(img["alt"])
    # merge adjacent text nodes so the alt text joins the surrounding text
    bs.smooth()
    # find all user comments and save them to a list
    comment = bs.find_all(class_="content")
    # iterate over the comments, extract the text and strip the strings
    for c in comment:
        lst_name.append(c.get_text(strip=True))
    # return the list
    return lst_name
### third, start the scraping
link = "https://vegan-forum.de/viewtopic.php?f=54&t=8325&start=120"
# create the lists for the functions
user_comments = []
# get the content
page = requests.get(link)
html = page.content
bs = BeautifulSoup(html, "html.parser")
# call the functions to get the information
get_comments(user_comments)
# create a pandas dataframe for the comments
comments_dict = {"user_comments": user_comments}
df_comments_info = pd.DataFrame(data=comments_dict)
# concatenate the temporary dataframe with the dataframe created earlier
# (DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, df_comments_info], ignore_index=True)
print(df)
Prints:
...
Danke!Erst Mal sollte ich bei den Tabletten bleiben. Hab die ja schon Mal genommen. Genau die gleichen wie ihr mir empfiehlt. Aber die sind fast leer und auf Amazon gibt's die nicht mehr.Soll ich bei der Quelle nachfragen oder denkst du findest es günstiger? :)
...
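The replacement step can be checked offline on a small snippet, without hitting the forum at all. The markup below is a hypothetical stand-in for the forum's HTML; as on the real page, the smiley's text lives in the img tag's alt attribute:

```python
from bs4 import BeautifulSoup

# hypothetical snippet mimicking the forum markup
html = '<div class="content">Danke! <img class="smilies" alt=":)" src="s.gif"/></div>'
soup = BeautifulSoup(html, "html.parser")

# replace each smiley image with its alt text so get_text() keeps it
for img in soup.select("img.smilies"):
    img.replace_with(img["alt"])
# merge the now-adjacent text nodes into one string
soup.smooth()

text = soup.find(class_="content").get_text(strip=True)
print(text)  # → Danke! :)
```

Without the smooth() call, "Danke! " and ":)" remain separate text nodes, and get_text(strip=True) strips each one individually, producing "Danke!:)" with the space lost.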