Scraping multiple things with BeautifulSoup getText in a for loop
I want to get the first 3 reviews from each of these pages, but this code is the problem:
for i in range(0, 2):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    reple = soup.find("span", {"id":re.compile("^_filtered_ment_"+[i])}).getText()
When I run this code, this error message appears:
TypeError: must be str, not list
And the whole code is:
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.request import urljoin
import pandas as pd
import requests
import re

#url_base = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=25917&type=after&page=1'
base_url = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=' #review page
base_url2 = 'https://movie.naver.com/movie/bi/mi/basic.nhn?code=' #movie title
pages = ['177374','164102']

df = pd.DataFrame()

for n in pages:
    # Create url
    url = base_url + n
    url2 = base_url2 + n
    for i in range(0, 2):
        res = requests.get(url)
        soup = BeautifulSoup(res.text, "html.parser")
        reple = soup.find("span", {"id":re.compile("^_filtered_ment_"+[i])}).getText()
        res2 = requests.get(url2)
        soup = BeautifulSoup(res2.text, "html.parser")
        title = soup.find('h3', 'h_movie')
        for a in title.find_all('a'):
            #print(a.text)
            title = a.text
        data = {'title':[title], 'reviewn':[reple]}
        df = df.append(pd.DataFrame(data))

df.to_csv('./title.csv', sep=',', encoding='utf-8-sig')
The error is very simple; it tells you exactly what the problem is. The problem is this line:

reple = soup.find("span", {"id":re.compile("^_filtered_ment_"+[i])}).getText()

You have the variable i wrapped in a list ([i]), but it needs to be a string. Also, the text has a lot of whitespace before/after the content, so I removed it with .strip().

Change it to:
reple = soup.find("span", {"id":re.compile("^_filtered_ment_"+str(i))}).getText().strip()
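To make the fix concrete, here is a minimal, self-contained sketch: it first reproduces the str-plus-list TypeError from the question, then applies the str(i) fix against a tiny inline HTML sample that mimics the span id pattern from the real page (the sample markup is invented for illustration; the exact TypeError wording varies by Python version):

```python
import re
from bs4 import BeautifulSoup

# Concatenating a str and a list is what raised the original error:
try:
    "^_filtered_ment_" + [0]
except TypeError as exc:
    print(exc)  # wording varies by Python version

# With str(i) the pattern is a valid string; a tiny sample mimicking the markup:
html = '<span id="_filtered_ment_0">  Great movie!  </span>'
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", {"id": re.compile("^_filtered_ment_" + str(0))})
print(span.getText().strip())  # Great movie!
```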
I think you want to scrape the first 3 reviews for each of these pages. I have written the code, printed the output, and added some comments to help you understand it. If there is something you don't understand, leave a comment and I will help you. You can run it and see the output in the console.
import urllib3
from bs4 import BeautifulSoup

# scrape urls
base_url_one = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code='  # review page
base_url_two = 'https://movie.naver.com/movie/bi/mi/basic.nhn?code='  # movie title
pages = ['177374', '164102']

# how many reviews to get per page
review_count = 3

# create request pool
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
http = urllib3.PoolManager()

for page in pages:
    # create urls (fresh variables, so the base urls are not overwritten on each pass)
    url_one = base_url_one + page
    url_two = base_url_two + page
    # request web pages
    response_one = http.request('GET', url_one)
    response_two = http.request('GET', url_two)
    # check response one status and scrape data
    if response_one.status == 200:
        page_data = response_one.data
        soup = BeautifulSoup(page_data, "lxml")
        comment_list = soup.find_all('div', {'class': 'score_reple'})
        for index in range(0, review_count):
            try:
                comment_text = comment_list[index].find('p').text.strip()
                print(comment_text)
            except IndexError:
                pass
    print("-------------")
    # check response two status and scrape data
    if response_two.status == 200:
        page_data = response_two.data
        soup = BeautifulSoup(page_data, "lxml")
        comment_list = soup.find_all('div', {'class': 'score_reple'})
        for index in range(0, review_count):
            try:
                comment_text = comment_list[index].find('p').text.strip()
                print(comment_text)
            except IndexError:
                pass
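The answer above only prints to the console, while the question also wanted a CSV. One way this could be restructured is to factor the parsing into small functions and collect rows into a single DataFrame (using pd.concat, since DataFrame.append was removed in pandas 2.0). This is a sketch under the assumption that the pages use the same div.score_reple and h3.h_movie markup as the answers above; the parse_reviews and parse_title names and the inline HTML samples are invented for illustration. For live use, the samples would be replaced by requests.get(url).text for the same URLs:

```python
import pandas as pd
from bs4 import BeautifulSoup

def parse_reviews(html, limit=3):
    """Return up to `limit` review texts from a review-page HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    comments = soup.find_all("div", {"class": "score_reple"})
    return [c.find("p").text.strip() for c in comments[:limit]]

def parse_title(html):
    """Return the movie title from a basic-info-page HTML string."""
    heading = BeautifulSoup(html, "html.parser").find("h3", "h_movie")
    return heading.find("a").text if heading else None

# Tiny inline samples standing in for the real pages (hypothetical markup):
review_html = """
<div class="score_reple"><p>  Great movie!  </p></div>
<div class="score_reple"><p>Not bad.</p></div>
"""
title_html = '<h3 class="h_movie"><a>Some Title</a></h3>'

# one row per review, paired with the movie title
rows = [{"title": parse_title(title_html), "review": r}
        for r in parse_reviews(review_html)]
df = pd.DataFrame(rows)
df.to_csv("./title.csv", sep=",", encoding="utf-8-sig")
print(df)
```

Keeping the parsers separate from the HTTP calls also makes them easy to test against canned HTML without hitting the site.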