Attempting to export parsed data to CSV file with Python and I can't figure out how to export more than one row
I'm fairly new to Beautiful Soup/Python/web scraping. I've been able to scrape data from the site, but I can only ever export the first row to the CSV file (I want all of the scraped data exported to the file).
I'm stumped on how to get this code to export all of the scraped data into multiple, separate rows:
r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")

for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):
        name_and_date = link.text.split('(')
        name = name_and_date[0].strip()
        date = name_and_date[1].replace(')', '').strip()

        base_url = "https://www.infoplease.com"
        links = link['href']
        links = urljoin(base_url, links)

        pres_data = {'Name': [name],
                     'Date': [date],
                     'Link': [links]
                     }

df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])
df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)
print(df)
Any ideas? I believe I need to loop through the parsed data, grab each set, and push it in.
Am I doing this correctly?
Appreciate any insight.
The way this is currently set up, you aren't adding each link as a new entry; you only keep the last one. If you initialize a list and append a dictionary like the one you already build on each iteration of the inner for loop, you will add every row and not just the last one.
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")

pres_data = []
for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):
        name_and_date = link.text.split('(')
        name = name_and_date[0].strip()
        date = name_and_date[1].replace(')', '').strip()

        base_url = "https://www.infoplease.com"
        links = link['href']
        links = urljoin(base_url, links)

        this_data = {'Name': name,
                     'Date': date,
                     'Link': links
                     }
        pres_data.append(this_data)

df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])
df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)
print(df)
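For what it's worth, the reason this works is that pandas builds one row per dictionary when it is handed a list of dictionaries. A minimal sketch with placeholder values (not real data from the site):

import pandas as pd

# Placeholder rows purely for illustration
rows = [
    {'Name': 'First President', 'Date': 'January 8, 1790', 'Link': 'https://example.com/first'},
    {'Name': 'Second President', 'Date': 'December 8, 1801', 'Link': 'https://example.com/second'},
]

df = pd.DataFrame(rows, columns=['Name', 'Date', 'Link'])
print(df)                              # two rows, one per dictionary
df.to_csv('example.csv', index=False)  # the CSV gets both rows as well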
You don't need to use pandas here, since you aren't applying any kind of data manipulation. In general, try to stick to the built-in libraries when the task is this short.
import requests
from bs4 import BeautifulSoup
import csv

def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    target = [([x.a['href']] + x.a.text[:-1].split(' ('))
              for x in soup.select('span.article')]
    with open('data.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Url', 'Name', 'Date'])
        writer.writerows(target)

main('https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses')
Sample output:
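As a side note on the target comprehension in the code above: each element ends up as [href, name, date]. Here is the same slicing and splitting applied to a made-up title (the text and href are placeholders, not data from the site):

# Link text on the page has the form "Some Name (Some Date)"
text = "George Washington (January 8, 1790)"
href = "/primary-sources/example"        # placeholder href

row = [href] + text[:-1].split(' (')     # drop the trailing ')' and split on ' ('
print(row)  # ['/primary-sources/example', 'George Washington', 'January 8, 1790']

Note that the hrefs on that page may be relative, so if you need absolute URLs you can still join them with urljoin as in the earlier answer.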