抓取时需要帮助获取 tr 值
Need help getting tr values when scraping
我有以下代码
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
# Get the content for tab_Co id
temp_table = soup.find('table', id='tab_Co')
# Create Headers
headers = []
for i in temp_table.find_all('th'):
title = i.text
headers.append(title)
# Create DataFrame with the headers as columns
mydata = pd.DataFrame(columns = headers)
# This is where the script goes wrong
# Create loop that retrieves information and appends it to the DataFrame
for j in table1.find_all('tr')[1:]:
row_data = j.find_all('td')
row = [i.text for i in row_data]
length = len(mydata)
mydata.loc[length] = row
我做错了什么?最终目的是拥有一个数据框,我可以在其中提取每列的前 4 个值
'Temperatura Max (ºC)',
'Temperatura Min (ºC)',
'Prec. acumulada (mm)',
'Rajada máxima (km/h)',
'Humidade Max (%)',
'Humidade Min (%)',
'Pressão atm. (hPa)']
然后使用这些生成每日图像。
有什么想法吗?提前致谢!
免责声明:这是一个非营利项目,不会将该解决方案用于商业用途。
从来源view-source:https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp
,很明显数据在 th
属性中,所以尝试使用 row_data = j.find_all('th')
进行抓取
所以这有效,基于 Falsovsky on GitHub
的解决方案
# Import libraries
import requests
import pandas as pd
import regex
# Define target URL
url = 'https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp'
# Get URL information
page = requests.get(url)
# After inspecting the page apply a regex search
search = re.search('var observations = (.*?);',page.text,re.DOTALL);
# Create dict by loading the json information
json_data = json.loads(search.group(1))
# Create Dataframe from json result
df1 = pd.concat({k: pd.DataFrame(v).T for k, v in json_data.items()}, axis=0)
我有以下代码
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
# Get the content for tab_Co id
temp_table = soup.find('table', id='tab_Co')
# Create Headers
headers = []
for i in temp_table.find_all('th'):
title = i.text
headers.append(title)
# Create DataFrame with the headers as columns
mydata = pd.DataFrame(columns = headers)
# This is where the script goes wrong
# Create loop that retrieves information and appends it to the DataFrame
for j in table1.find_all('tr')[1:]:
row_data = j.find_all('td')
row = [i.text for i in row_data]
length = len(mydata)
mydata.loc[length] = row
我做错了什么?最终目的是拥有一个数据框,我可以在其中提取每列的前 4 个值
'Temperatura Max (ºC)',
'Temperatura Min (ºC)',
'Prec. acumulada (mm)',
'Rajada máxima (km/h)',
'Humidade Max (%)',
'Humidade Min (%)',
'Pressão atm. (hPa)']
然后使用这些生成每日图像。 有什么想法吗?提前致谢!
免责声明:这是一个非营利项目,不会将该解决方案用于商业用途。
从来源view-source:https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp
,很明显数据在 th
属性中,所以尝试使用 row_data = j.find_all('th')
所以这有效,基于 Falsovsky on GitHub
的解决方案# Import libraries
import requests
import pandas as pd
import regex
# Define target URL
url = 'https://www.ipma.pt/pt/otempo/obs.superficie/table-top-stations-all.jsp'
# Get URL information
page = requests.get(url)
# After inspecting the page apply a regex search
search = re.search('var observations = (.*?);',page.text,re.DOTALL);
# Create dict by loading the json information
json_data = json.loads(search.group(1))
# Create Dataframe from json result
df1 = pd.concat({k: pd.DataFrame(v).T for k, v in json_data.items()}, axis=0)