Scraping two pages at the same time: pandas error
I want to save the reviews and movie titles from those two pages.
https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=~
https://movie.naver.com/movie/bi/mi/basic.nhn?code=~
When I run this code and open the csv file, I get:
ValueError: Shape of passed values is (2, 6), indices imply (2, 10)
from bs4 import BeautifulSoup
from urllib.request import urlopen
from selenium import webdriver
from urllib.request import urljoin
import pandas as pd
import requests
#url_base = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=25917&type=after&page=1'
base_url = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=' #review page
base_url2 = 'https://movie.naver.com/movie/bi/mi/basic.nhn?code=' #movie title
pages =['177374','164102']
#print(soup.find_all('div', 'score_reple'))
#div = soup.find('h3', 'h_movie')
df = pd.DataFrame()
for n in pages:
    # Create url
    url = base_url + n
    url2 = base_url2 + n
    # Parse data using BS
    print('Downloading page %s...' % url)
    print('Downloading page %s...' % url2)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    reple = soup.find_all('div', 'score_reple')
    res2 = requests.get(url2)
    soup = BeautifulSoup(res2.text, "html.parser")
    title = soup.find('h3', 'h_movie')
    #ratesc = soup.find('','')
    #story=rname.getText()
    #data = [title,reple]
    data = {'title':[title], 'reviewn':[reple]}
    df = df.append(pd.DataFrame(data), sort=True).reset_index(drop=True)
df.to_csv('./title.csv', sep=',', encoding='utf-8-sig')
How can I fix this code?
One thing you can try to clean it up is to convert to a string first, then slice it using markers from the surrounding html, like this:
title = str(soup.find('h3', 'h_movie'))
start = '" title="'
end = ' , 2018">'
newTitle = title[title.find(start)+len(start):title.rfind(end)]
Then try the same thing for the review section. You'll need to narrow down the result set, convert the part containing the review to a string, and slice it with similar constraints.
After that your data will be clean and ready to add to the DataFrame.
Hope this helps get you on the right track!
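To make the slicing idea above concrete, here is a minimal, self-contained sketch run against a made-up snippet of the title markup (the real Naver page will differ; the html string here is purely illustrative):

```python
# Hypothetical markup mimicking the <h3 class="h_movie"> element; not the real page.
title_html = '<h3 class="h_movie"><a href="#" title="Some Movie , 2018">Some Movie</a></h3>'

start = '" title="'
end = ' , 2018">'

# Slice out the text between the two markers.
new_title = title_html[title_html.find(start) + len(start):title_html.rfind(end)]
print(new_title)  # Some Movie
```

The same pattern applies to the review markup once you have narrowed the soup down to the element that holds the review text.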
Now it's clean.... just strip the tags with something like this:
from bs4 import BeautifulSoup
from urllib.request import urlopen
#from selenium import webdriver
from urllib.request import urljoin
import pandas as pd
import requests
import re
#url_base = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=25917&type=after&page=1'
base_url = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=' #review page
base_url2 = 'https://movie.naver.com/movie/bi/mi/basic.nhn?code=' #movie title
pages =['177374','164102']
df = pd.DataFrame()
for n in pages:
    # Create url
    url = base_url + n
    url2 = base_url2 + n
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    reple = soup.find("span", {"id":re.compile("^_filtered_ment")}).getText()
    res2 = requests.get(url2)
    soup = BeautifulSoup(res2.text, "html.parser")
    title = soup.find('h3', 'h_movie')
    for a in title.find_all('a'):
        #print(a.text)
        title = a.text
    data = {'title':[title], 'reviewn':[reple]}
    df = df.append(pd.DataFrame(data))
df.to_csv('./title.csv', sep=',', encoding='utf-8-sig')
I added `import re` for the regex, since the review spans have ids matching `_filtered_ment_*`.
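One caveat if you are on a recent pandas: `DataFrame.append` was deprecated in 1.4 and removed in 2.0. A version-proof pattern is to collect one dict per row in a plain list and build the frame once after the loop. A minimal sketch with dummy values standing in for the scraped title/review pairs:

```python
import pandas as pd

rows = []
# In the real script this loop would iterate over the scraped pages;
# the (title, review) pairs here are dummy placeholders.
for title, reple in [('Movie A', 'great'), ('Movie B', 'bad')]:
    rows.append({'title': title, 'reviewn': reple})

# Build the DataFrame once, instead of appending row by row.
df = pd.DataFrame(rows)
print(df)
```

Building the frame once is also faster, since each `append` copies the entire DataFrame.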