Need help getting tr values when scraping

I have the following code:

# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp'

page = requests.get(url)

soup = BeautifulSoup(page.text, 'lxml')
# Get the content for tab_Co id 
temp_table = soup.find('table', id='tab_Co')
# Create Headers
headers = []
for i in temp_table.find_all('th'):
    title = i.text
    headers.append(title)
# Create DataFrame with the headers as columns 
mydata = pd.DataFrame(columns = headers)

# This is where the script goes wrong
# Loop over the table rows and append each one to the DataFrame
for j in temp_table.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row

What am I doing wrong? The end goal is a DataFrame from which I can extract the top 4 values of each of these columns:

['Temperatura Max (ºC)',
 'Temperatura Min (ºC)',
 'Prec. acumulada (mm)',
 'Rajada máxima (km/h)',
 'Humidade Max (%)',
 'Humidade Min (%)',
 'Pressão atm. (hPa)']

I then want to use those values to generate daily images. Any ideas? Thanks in advance!

Disclaimer: this is a non-profit project, and the solution will not be used commercially.

From the page source (view-source:https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp), it looks like the data is in th elements rather than td elements, so try scraping with row_data = j.find_all('th').
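To illustrate the suggestion, here is a minimal sketch run against a small, made-up HTML fragment (not copied from the IPMA page). It assumes the row cells may be a mix of th and td elements, so it asks BeautifulSoup for both tag names at once. The fragment, the station names, and the values are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment for illustration only; the real page's markup may differ.
sample_html = """
<table id="tab_Co">
  <tr><th>Station</th><th>Temp</th></tr>
  <tr><th>Lisboa</th><td>21.5</td></tr>
  <tr><th>Porto</th><td>18.0</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table', id='tab_Co')

rows = []
for tr in table.find_all('tr')[1:]:       # skip the header row
    cells = tr.find_all(['th', 'td'])     # match either tag name
    rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

If the rows still come back empty on the live page, the table is probably filled in by JavaScript after load, in which case parsing the static HTML cannot see the data.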

So this works, based on a solution by Falsovsky on GitHub:
# Import libraries
import json
import re

import pandas as pd
import requests

# Define target URL
url = 'https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp'

# Get URL information
page = requests.get(url)

# After inspecting the page, apply a regex search for the embedded JavaScript variable
search = re.search(r'var observations = (.*?);', page.text, re.DOTALL)

# Create dict by loading the JSON information
json_data = json.loads(search.group(1))

# Create DataFrame from the JSON result (one block of rows per top-level key)
df1 = pd.concat({k: pd.DataFrame(v).T for k, v in json_data.items()}, axis=0)
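As a follow-up on the original goal (the top 4 values of each column), here is a rough sketch that runs offline. It assumes the embedded JSON is nested as date, then station, then field. The sample text, the station keys, and the field names `temp_max` and `temp_min` are invented for illustration, so the real page may use different names.

```python
import json
import re

import pandas as pd

# Hypothetical stand-in for page.text; keys and values are made up.
sample_page = """
<script>
var observations = {"2021-01-01": {
    "s1": {"temp_max": 10.5, "temp_min": 1.0},
    "s2": {"temp_max": 14.0, "temp_min": 3.5},
    "s3": {"temp_max": 12.2, "temp_min": -0.5},
    "s4": {"temp_max": 9.1, "temp_min": 2.0},
    "s5": {"temp_max": 15.3, "temp_min": 4.1}}};
</script>
"""

match = re.search(r'var observations = (.*?);', sample_page, re.DOTALL)
json_data = json.loads(match.group(1))

# Rows are indexed by (date, station); columns are the measured fields.
df = pd.concat({k: pd.DataFrame(v).T for k, v in json_data.items()}, axis=0)

# Coerce to numbers so that missing or non-numeric entries become NaN.
df = df.apply(pd.to_numeric, errors='coerce')

# Four largest values per column, each kept with its (date, station) label.
top4 = {col: df[col].nlargest(4) for col in df.columns}

print(top4['temp_max'])
```

For columns where the lowest readings are the interesting ones (minimum temperature, for example), `nsmallest(4)` can be swapped in for `nlargest(4)`.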