Scraping with BeautifulSoup: want to scrape entire column including header and title rows

Question

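I am trying to get the data under the columns whose code is "SVENYXX", where "XX" is the number that follows (e.g. 01, 02, etc.), from the site below using Python.

With the code below I can get the first row of all the column data I want. However, is there a way to include the header and the row titles as well?

I know I already have the headers, but is there a way to include them in the output? Also, how can I include all rows?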

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page)

desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for th in headers:
    if 'SVENY' in th.string:
        desired_columns.append(headers.index(th))

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    cells = row.findAll('td')
    for column in desired_columns:
        print(cells[column].text)

Answer 1

How about this?

I added th.getText() and stored each desired column as a list holding its index and its name, then added row_name = row.findNext('th').getText() to get the title of each row.

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning

desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for th in headers:
    # getText() also works when a <th> contains nested tags, where .string is None
    if 'SVENY' in th.getText():
        desired_columns.append([headers.index(th), th.getText()])

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    cells = row.findAll('td')
    row_name = row.findNext('th').getText()
    for column in desired_columns:
        print(cells[column[0]].text, row_name, column[1])
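
The same idea can be tried offline. The sketch below is not the answer's exact method: it collects a header row plus one row per data row into a list, instead of printing cell by cell. The sample markup, the helper name extract_columns, and the layout it assumes (the first cell of every row is a th holding the row title, followed by td cells) are all assumptions for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page; the actual
# table layout on the site is an assumption here.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>SVENY01</th><th>BETA0</th><th>SVENY02</th></tr>
  <tr><th>2014-01-02</th><td>0.13</td><td>4.01</td><td>0.42</td></tr>
  <tr><th>2014-01-03</th><td>0.14</td><td>4.02</td><td>0.44</td></tr>
</table>
"""


def extract_columns(html, prefix):
    """Return a header row followed by one list per data row.

    Each row holds the row title (first cell) and the cells of every
    column whose header text starts with `prefix`.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")

    # Take every cell of the first row, th or td, in document order,
    # so header positions line up with the cells of the data rows.
    header_cells = rows[0].find_all(["th", "td"])
    wanted = [i for i, cell in enumerate(header_cells)
              if cell.get_text(strip=True).startswith(prefix)]

    result = [[header_cells[0].get_text(strip=True)]
              + [header_cells[i].get_text(strip=True) for i in wanted]]
    for row in rows[1:]:
        cells = row.find_all(["th", "td"])
        if len(cells) < len(header_cells):
            continue  # skip spacer or malformed rows
        result.append([cells[0].get_text(strip=True)]
                      + [cells[i].get_text(strip=True) for i in wanted])
    return result


table_rows = extract_columns(SAMPLE_HTML, "SVENY")
for line in table_rows:
    print("\t".join(line))
```

Selecting th and td cells together keeps one index space for the header row and the data rows, so no offset is needed for the row-title cell. To run it against the live page, pass the downloaded HTML instead of SAMPLE_HTML and select the right table first, as the answer does with soup.findAll('table')[2].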

python

beautifulsoup

web-scraping