Need help in scraping information from multiple webpages and import to csv file in tabular form - Python
I have been working on scraping infobox information from Wikipedia pages. Here is the code I have been using:
import requests
import csv
from bs4 import BeautifulSoup

URL = ['https://en.wikipedia.org/wiki/Workers_Credit_Union',
       'https://en.wikipedia.org/wiki/San_Diego_County_Credit_Union',
       'https://en.wikipedia.org/wiki/USA_Federal_Credit_Union',
       'https://en.wikipedia.org/wiki/Commonwealth_Credit_Union',
       'https://en.wikipedia.org/wiki/Center_for_Community_Self-Help',
       'https://en.wikipedia.org/wiki/ESL_Federal_Credit_Union',
       'https://en.wikipedia.org/wiki/State_Employees_Credit_Union',
       'https://en.wikipedia.org/wiki/United_Heritage_Credit_Union']

for url in URL:
    headers = []
    rows = []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', class_='infobox')
    credit_union_name = soup.find('h1', id="firstHeading")
    header_tags = table.find_all('th')
    headers = [header.text.strip() for header in header_tags]
    data_rows = table.find_all('tr')
    for row in data_rows:
        value = row.find_all('td')
        beautified_value = [dp.text.strip() for dp in value]
        if len(beautified_value) == 0:
            continue
        rows.append(beautified_value)
    rows.append("")
    rows.append([credit_union_name.text.strip()])
    rows.append([url])
    with open(r'credit_unions.csv', 'a+', newline="") as output:
        writer = csv.writer(output)
        writer.writerow(headers)
        writer.writerow(rows)
However, when I checked the csv file, the information is not displayed in tabular form. The scraped elements are stored in nested lists rather than in a single list. I need the scraped information for each URL to be stored in a single list, and that list written to the csv file in tabular form with headers. Need help with this.
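The root cause is that csv.writer.writerow expects one flat sequence of field values for a single row. Your loop builds rows as a list of lists, so writerow writes the repr of each inner list into one cell. A minimal sketch of the effect (the labels and values here are made up for illustration):

import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# rows built the way the question's loop builds them: a list of lists
rows = [['Founded', '1914'], ['Members', '100,000'], ['Workers Credit Union']]
writer.writerow(rows)  # writes ONE row; each inner list becomes one cell's repr
print(buf.getvalue())
# "['Founded', '1914']","['Members', '100,000']",['Workers Credit Union']

writer.writerows(rows) (plural) would at least write one infobox row per csv line, but there is a deeper problem: different pages have different labels, so positional columns would not line up across pages anyway.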
The infoboxes have different structures and labels, so I think the best way to solve this is with a dictionary and csv.DictWriter:
import requests
import csv
from bs4 import BeautifulSoup

URL = ['https://en.wikipedia.org/wiki/Workers_Credit_Union',
       'https://en.wikipedia.org/wiki/San_Diego_County_Credit_Union',
       'https://en.wikipedia.org/wiki/USA_Federal_Credit_Union',
       'https://en.wikipedia.org/wiki/Commonwealth_Credit_Union',
       'https://en.wikipedia.org/wiki/Center_for_Community_Self-Help',
       'https://en.wikipedia.org/wiki/ESL_Federal_Credit_Union',
       'https://en.wikipedia.org/wiki/State_Employees_Credit_Union',
       'https://en.wikipedia.org/wiki/United_Heritage_Credit_Union']

csv_headers = set()  # union of all infobox labels seen across pages
csv_rows = []        # one dict per page

for url in URL:
    csv_row = {}
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    credit_union_name = soup.find('h1', id="firstHeading")
    table = soup.find('table', class_='infobox')
    data_rows = table.find_all('tr')
    for data_row in data_rows:
        # only keep infobox rows that have both a label (th) and a value (td)
        label = data_row.find('th')
        value = data_row.find('td')
        if label is None or value is None:
            continue
        beautified_label = label.text.strip()
        beautified_value = value.text.strip()
        csv_row[beautified_label] = beautified_value
        csv_headers.add(beautified_label)
    csv_row["name"] = credit_union_name.text.strip()
    csv_row["url"] = url
    csv_rows.append(csv_row)

with open(r'credit_unions.csv', 'a+', newline="") as output:
    # fixed columns first, then every infobox label in a stable order;
    # DictWriter leaves the cell empty when a page lacks that label
    headers = ["name", "url"]
    headers += sorted(csv_headers)
    writer = csv.DictWriter(output, fieldnames=headers)
    writer.writeheader()
    writer.writerows(csv_rows)
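One caveat with the write-out: the file is opened in 'a+' mode, so re-running the script appends a second header row and duplicate data below the first run's output. If each run is meant to produce a fresh file, 'w' mode avoids that; a minimal variation, assuming overwrite-per-run is what you want:

# assumes each run should overwrite credit_unions.csv rather than append to it
with open('credit_unions.csv', 'w', newline='', encoding='utf-8') as output:
    headers = ["name", "url"] + sorted(csv_headers)
    writer = csv.DictWriter(output, fieldnames=headers)
    writer.writeheader()
    writer.writerows(csv_rows)

Passing encoding='utf-8' explicitly can also avoid a UnicodeEncodeError on platforms where the default locale encoding does not cover every character that appears in the infobox text.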