Need help scraping information from multiple webpages and importing it into a CSV file in tabular form - Python

I have been working on scraping infobox information from Wikipedia. Here is the code I have been using:

import requests
import csv
from bs4 import BeautifulSoup

URL = ['https://en.wikipedia.org/wiki/Workers_Credit_Union',
       'https://en.wikipedia.org/wiki/San_Diego_County_Credit_Union',
       'https://en.wikipedia.org/wiki/USA_Federal_Credit_Union',
       'https://en.wikipedia.org/wiki/Commonwealth_Credit_Union',
       'https://en.wikipedia.org/wiki/Center_for_Community_Self-Help',
       'https://en.wikipedia.org/wiki/ESL_Federal_Credit_Union',
       'https://en.wikipedia.org/wiki/State_Employees_Credit_Union',
       'https://en.wikipedia.org/wiki/United_Heritage_Credit_Union']

for url in URL:
    headers = []
    rows = []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', class_='infobox')
    credit_union_name = soup.find('h1', id="firstHeading")
    header_tags = table.find_all('th')
    headers = [header.text.strip() for header in header_tags]
    data_rows = table.find_all('tr')
    for row in data_rows:
        value = row.find_all('td')
        beautified_value = [dp.text.strip() for dp in value]
        if len(beautified_value) == 0:
            continue
        rows.append(beautified_value)
    rows.append("")
    rows.append([credit_union_name.text.strip()])
    rows.append([url])

    with open(r'credit_unions.csv', 'a+', newline="") as output:
        writer = csv.writer(output)
        writer.writerow(headers)
        writer.writerow(rows)

However, when I checked the csv file, the information was not displayed in tabular form. The scraped elements are stored in nested lists rather than in a single list. I need the scraped information for each URL to be stored in a single list, and that list printed to the csv file in tabular form with headers. I need help with this.
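To illustrate what goes wrong (a minimal sketch with hypothetical values, not the real scraped data): `rows` is a list of lists, one sub-list per `<tr>`, so `writer.writerow(rows)` stringifies each sub-list into a single cell of one CSV row. `writer.writerows(rows)` would at least emit one row per sub-list:

import csv

# Hypothetical shape of `rows` after the scraping loop above
rows = [["Type", "Credit union"], ["Founded", "1914"], ["Members", "100,000"]]

with open('demo.csv', 'w', newline="") as output:
    writer = csv.writer(output)
    writer.writerow(rows)   # one row; each nested list becomes one stringified cell
    writer.writerows(rows)  # one row per nested list: three rows of two cells each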

The infoboxes have different structures and labels, so I think the best way to solve this problem is to use dictionaries and csv.DictWriter.

import requests
import csv
from bs4 import BeautifulSoup

URL = ['https://en.wikipedia.org/wiki/Workers_Credit_Union',
       'https://en.wikipedia.org/wiki/San_Diego_County_Credit_Union',
       'https://en.wikipedia.org/wiki/USA_Federal_Credit_Union',
       'https://en.wikipedia.org/wiki/Commonwealth_Credit_Union',
       'https://en.wikipedia.org/wiki/Center_for_Community_Self-Help',
       'https://en.wikipedia.org/wiki/ESL_Federal_Credit_Union',
       'https://en.wikipedia.org/wiki/State_Employees_Credit_Union',
       'https://en.wikipedia.org/wiki/United_Heritage_Credit_Union']

csv_headers = set()  # union of every infobox label seen across all pages
csv_rows = []        # one dict per page, written out by DictWriter at the end

for url in URL:
    csv_row = {}
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    credit_union_name = soup.find('h1', id="firstHeading")
    table = soup.find('table', class_='infobox')
    data_rows = table.find_all('tr')
    # Pair each row's <th> label with its <td> value; skip rows that
    # lack either (section headers, image rows, etc.)
    for data_row in data_rows:
        label = data_row.find('th')
        value = data_row.find('td')
        if label is None or value is None:
            continue
        beautified_label = label.text.strip()
        beautified_value = value.text.strip()
        csv_row[beautified_label] = beautified_value
        csv_headers.add(beautified_label)
    csv_row["name"] = credit_union_name.text.strip()
    csv_row["url"] = url
    csv_rows.append(csv_row)

# Open with 'w' (not 'a+') so re-running the script does not append a
# duplicate header block; encoding='utf-8' handles non-ASCII infobox text.
with open(r'credit_unions.csv', 'w', newline="", encoding="utf-8") as output:
    headers = ["name", "url"]
    headers += sorted(csv_headers)  # fixed columns first, then all labels seen
    writer = csv.DictWriter(output, fieldnames=headers)
    writer.writeheader()
    writer.writerows(csv_rows)
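Since the labels differ from page to page, csv.DictWriter fills any missing column with an empty string by default (restval=''), so a page without a given label simply gets a blank cell. One remaining fragile spot: if a page has no infobox at all, `soup.find('table', class_='infobox')` returns None and `table.find_all('tr')` raises AttributeError. A minimal guard, as a hedged sketch (the `scrape_infobox` helper is hypothetical, not part of the answer above):

import requests
from bs4 import BeautifulSoup

def scrape_infobox(url):
    """Return a {label: value} dict for the page's infobox, or None if the
    page has no infobox. Hypothetical helper extending the answer above."""
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    table = soup.find('table', class_='infobox')
    if table is None:
        return None  # no infobox on this page; the caller can skip it
    row = {}
    for tr in table.find_all('tr'):
        label, value = tr.find('th'), tr.find('td')
        if label is not None and value is not None:
            row[label.text.strip()] = value.text.strip()
    return row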