I'm trying to save scraped data to a CSV file with Python but get a TypeError

I'm trying to save the scraped data to a CSV file. However, I get the following error:

TypeError: list indices must be integers or slices, not str

I think the error comes from this code:

csv_writer.writerow(str(row['url']), str(row['img']), str(row['text']))
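For context, this is the TypeError Python raises whenever a list is indexed with a string instead of an integer. A minimal sketch (hypothetical values, not from my scrape) that reproduces it:

row = [{'url': 'https://example.com'}]   # a dict wrapped in a list
row['url']      # TypeError: list indices must be integers or slices, not str
row[0]['url']   # works: index the list first, then the dict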

The full code is below:

import requests
from bs4 import BeautifulSoup
import csv
page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4514.131 Safari/537.36'}
rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html5lib')
content = soup.article
link_list = []
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.get('src')
        text = link.span.text
        link_list.append([{'url':url, 'img':img, 'text':text}])
    except AttributeError:
        pass
with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    csv_writer.writerow(['url', 'img', 'text'])
for row in link_list:
    csv_writer.writerow(str(row['url']), str(row['img']), str(row['text']))
print('All done')

Please note: the following code does create a file and write the row:

with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    csv_writer.writerow(['url', 'img', 'text'])

Update

Use csv.DictWriter():

with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    i = csv.DictWriter(csv_out, fieldnames = set().union(*(d.keys() for d in link_list)))
    i.writeheader()
    i.writerows(link_list)

You can use set().union(*(d.keys() for d in link_list)) to collect the keys from your dicts, or simply pass ['url', 'img', 'text'] as the fieldnames.
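For illustration, here is how that expression behaves on a couple of sample dicts (hypothetical values); note that a set carries no guaranteed order, so pass an explicit list like ['url', 'img', 'text'] if the column order matters:

rows = [{'url': 'a', 'img': 'b'}, {'url': 'c', 'text': 'd'}]
fields = set().union(*(d.keys() for d in rows))
print(fields)   # e.g. {'img', 'url', 'text'} -- order is arbitrary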

Example
import requests
from bs4 import BeautifulSoup
import csv
page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4514.131 Safari/537.36'}
rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html5lib')
content = soup.article
link_list = []
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        link_list.append({'url':url, 'img':img, 'text':text})
    except AttributeError:
        pass
with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    i = csv.DictWriter(csv_out, fieldnames = set().union(*(d.keys() for d in link_list)))
    i.writeheader()
    i.writerows(link_list)
print('All done')

Alternative approaches:

Store your data as a plain dict instead of a dict wrapped in a list:

link_list.append({'url':url, 'img':img, 'text':text})

Then write it like this:

csv_writer.writerow([row['url'], row['img'], row['text']])

Or, even simpler, store it directly as a list:

link_list.append([url,img,text])

And write it as a list:

csv_writer.writerow(row)
Example
import requests
from bs4 import BeautifulSoup
import csv
page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4514.131 Safari/537.36'}
rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html5lib')
content = soup.article
link_list = []
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        link_list.append({'url':url, 'img':img, 'text':text})
    except AttributeError:
        pass
with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    csv_writer.writerow(['url', 'img', 'text'])
    for row in link_list:
        csv_writer.writerow([row['url'], row['img'], row['text']])
print('All done')
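For completeness, the example above uses the dict-storage variant; the list-based variant only changes the append and the write. A minimal sketch with hypothetical rows standing in for the scraping loop:

import csv

link_list = [['https://example.com', 'logo.png', 'home']]   # hypothetical scraped rows

with open('links.csv', 'w', encoding='utf-8', newline='') as csv_out:
    csv_writer = csv.writer(csv_out)
    csv_writer.writerow(['url', 'img', 'text'])
    for row in link_list:
        csv_writer.writerow(row)   # each row is already a flat list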

Bug fix: replace the line img = link.get('src') with img = link.img.get('src')
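The src attribute lives on the nested <img> tag, not on the <a> itself, so link.get('src') returns None here. A small sketch (hypothetical markup) showing the difference:

from bs4 import BeautifulSoup

html = '<a href="/page"><img src="/thumb.png"><span>Page</span></a>'   # hypothetical markup
link = BeautifulSoup(html, 'html.parser').a
print(link.get('src'))       # None -- the <a> tag has no src attribute
print(link.img.get('src'))   # /thumb.png -- the nested <img> does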

Updated code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

page_url = 'https://alansimpson.me/python/scrape_sample.html'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4514.131 Safari/537.36'}
rawpage = requests.get(page_url, headers=headers)
soup = BeautifulSoup(rawpage.content, 'html5lib')
content = soup.article
link_list = []
for link in content.find_all('a'):
    try:
        url = link.get('href')
        img = link.img.get('src')
        text = link.span.text
        link_list.append({
            'url':url,
            'img':img,
            'text':text,
        })
    except AttributeError:
        pass

df = pd.DataFrame(link_list)
print(df)
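If you also want the CSV file on disk rather than just the printed DataFrame, pandas can write it in one call (index=False drops the row-index column):

df.to_csv('links.csv', index=False)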