Saving and scraping multiple pages with BeautifulSoup and pandas

I tested my code with this snippet in a Jupyter notebook:

...
rname = soup.find('p', 'con_tx')
#rnamelis = rname.findAll('p')
rname
from urllib.request import urljoin
story = []
#review_text = lis[0].find('p').getText()
#list_soup = soup.find_all('p', 'con_tx')
story = rname.getText()
story

and it works fine.

(result) '전 여친에 ...'

But when I try to scrape multiple pages:

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.request import urljoin
import pandas as pd
import numpy as np
import requests


base_url = 'https://movie.naver.com/movie/bi/mi/basic.nhn?code='
pages =['177374','164102']
url = base_url + pages[0]
story = []
for n in pages:
    # Create url
    url = base_url + n

    # Parse data using BS
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()
    html = urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    #print(soup.find('p', 'con_tx'))

    rname = soup.find('p', 'con_tx')
    story=rname.getText()
    data = {story}
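    # {story} is a set literal, not a dict; pd.DataFrame rejects sets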
    df = pd.DataFrame(data)
    df.head()
    df.to_csv('./moviestory.csv', sep=',', encoding='EUC-KR')

I get this error message:

ValueError: DataFrame constructor not properly called!

How do I fix my code?

Not sure exactly what you're trying to do, but one thing I noticed is that you overwrite the DataFrame on every pass through the loop. I also don't know why you initialize story as a list and then wrap it in a set literal ({story}) inside the loop; that set is what makes the DataFrame constructor fail.
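Here is a minimal repro of the error, independent of the scraping; the Korean string is just a stand-in for whatever rname.getText() returns:

import pandas as pd

story = '전 여친에 ...'   # stand-in for rname.getText()

# {story} is a Python set literal; the DataFrame constructor rejects sets:
# pd.DataFrame({story})  raises  ValueError: DataFrame constructor not properly called!

# Either of these works instead:
df1 = pd.DataFrame([story])              # list -> one row, default column name 0
df2 = pd.DataFrame({'story': [story]})   # dict of columns -> named column

With that cleared up, the corrected loop: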

from bs4 import BeautifulSoup
import pandas as pd
import requests


base_url = 'https://movie.naver.com/movie/bi/mi/basic.nhn?code='
pages =['177374','164102']

df = pd.DataFrame()
for n in pages:
    # Create url
    url = base_url + n

    # Parse data using BS
    print('Downloading page %s...' % url)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    rname = soup.find('p', 'con_tx')
    story=rname.getText()
    data = [story]
    # df.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, pd.DataFrame(data)], ignore_index=True)

df.to_csv('./moviestory.csv', sep=',', encoding='EUC-KR')
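
Appending inside the loop works for two pages, but each concat copies every row collected so far. A sketch of a variant (same base_url, pages, and selector as above) that gathers plain strings and builds the DataFrame once at the end:

from bs4 import BeautifulSoup
import pandas as pd
import requests

base_url = 'https://movie.naver.com/movie/bi/mi/basic.nhn?code='
pages = ['177374', '164102']

stories = []
for n in pages:
    url = base_url + n
    res = requests.get(url)
    res.raise_for_status()  # fail fast on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(res.text, "html.parser")
    stories.append(soup.find('p', 'con_tx').getText())

# Build the frame once, with a named column, then write it out
df = pd.DataFrame({'story': stories})
df.to_csv('./moviestory.csv', sep=',', encoding='EUC-KR')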