Beautifulsoup Scraping Table with Pagination
I'm trying to scrape this site URL: https://statusinvest.com.br/fundos-imobiliarios/urpr11 to get the dividend information for this particular REIT from the table (I will generalize this later). This is the table containing the information:
dividends table
I'm able to get the dates and values from the table, but only for the first page. When I change the table page, nothing changes in the site URL, so I don't really know how to deal with this. Any help would be appreciated.
Note: it would be nice if the solution didn't depend on the number of pages, since some REITs can have more than 2 pages of information.
This is how I currently get the information from the first page:
from bs4 import BeautifulSoup
import requests

URL = "https://statusinvest.com.br/fundos-imobiliarios/urpr11"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

test = soup.find_all("tr", class_="")
rows = []
for r in test:
    if not r.find("td", title="Rendimento"):
        continue
    row = []
    for child in r.findChildren():
        if child.text.lower() == "rendimento":
            continue
        print(child.text)
        row.append(child.text)
    rows.append(row)
The content is provided dynamically by JavaScript and is not rendered by requests itself, so you cannot get all the data this way.
How to fix?
You could use selenium to interact with the website the way a human would in a browser - useful for later, more complex problems.
But in this case it is simpler and selenium is not needed. Just grab the JSON data that the JavaScript uses to fill the table:
data = json.loads(soup.select_one('#results')['value'])
Convert it to a DataFrame, adjust it to your needs, and save it as csv, json, ...:
pd.DataFrame(data).to_csv('yourFile.csv', index=False)
The JSON contains more columns than the website displays; check the example output. The following adjustment gives you the expected result by reading only the specific fields and renaming the column headers:
df = pd.DataFrame(data, columns=['et','ed', 'pd', 'v'])
df.columns = ['TIPO','DATA COM','PAGAMENTO','VALOR']
df.to_csv('yourFile.csv', index=False)
TIPO | DATA COM | PAGAMENTO | VALOR |
---|---|---|---|
Rendimento | 25/02/2022 | 15/03/2022 | 1.635 |
Rendimento | 31/01/2022 | 14/02/2022 | 1.63 |
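Note that the scraped columns arrive as strings. As a minimal follow-up sketch (the two sample rows below are copied from the expected output above, so it runs without hitting the live site), you can convert the dates and values to proper dtypes for further analysis:

```python
import pandas as pd

# Sketch only: sample rows copied from the expected output above.
df = pd.DataFrame(
    [["Rendimento", "25/02/2022", "15/03/2022", "1.635"],
     ["Rendimento", "31/01/2022", "14/02/2022", "1.63"]],
    columns=["TIPO", "DATA COM", "PAGAMENTO", "VALOR"],
)

# Dates are day-first (dd/mm/yyyy); VALOR here uses a dot as decimal separator.
df["DATA COM"] = pd.to_datetime(df["DATA COM"], format="%d/%m/%Y")
df["PAGAMENTO"] = pd.to_datetime(df["PAGAMENTO"], format="%d/%m/%Y")
df["VALOR"] = df["VALOR"].astype(float)

print(df.dtypes)
```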
Example
from bs4 import BeautifulSoup
import requests, json
import pandas as pd
URL = "https://statusinvest.com.br/fundos-imobiliarios/urpr11"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
data = json.loads(soup.select_one('#results')['value'])
pd.DataFrame(data)
#or with adjustment as mentioned above
#df = pd.DataFrame(data, columns=['et','ed', 'pd', 'v'])
#df.columns = ['TIPO','DATA COM','PAGAMENTO','VALOR']
#df.to_csv('yourFile.csv', index=False)
Output
y | m | d | ad | ed | pd | et | etd | v | ov | sv | sov | adj |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 25/02/2022 | 15/03/2022 | Rendimento | Rendimento | 1.635 | 1,63500000 | - | False | ||
0 | 0 | 0 | 31/01/2022 | 14/02/2022 | Rendimento | Rendimento | 1.63 | 1,63000000 | - | False | ||
0 | 0 | 0 | 30/12/2021 | 14/01/2022 | Rendimento | Rendimento | 1.67 | 1,67000000 | - | False | ||
0 | 0 | 0 | 30/11/2021 | 14/12/2021 | Rendimento | Rendimento | 1.869 | 1,86900000 | - | False | ||
0 | 0 | 0 | 29/10/2021 | 16/11/2021 | Rendimento | Rendimento | 1.37 | 1,37000000 | - | False | ||
0 | 0 | 0 | 30/09/2021 | 15/10/2021 | Rendimento | Rendimento | 2.17 | 2,17000000 | - | False | ||
0 | 0 | 0 | 31/08/2021 | 15/09/2021 | Rendimento | Rendimento | 2.01 | 2,01000000 | - | False | ||
0 | 0 | 0 | 30/07/2021 | 13/08/2021 | Rendimento | Rendimento | 1.48 | 1,48000000 | - | False | ||
0 | 0 | 0 | 30/06/2021 | 14/07/2021 | Rendimento | Rendimento | 2.4 | 2,40000000 | - | False | ||
0 | 0 | 0 | 31/05/2021 | 15/06/2021 | Rendimento | Rendimento | 2.06 | 2,06000000 | - | False | ||
0 | 0 | 0 | 30/04/2021 | 14/05/2021 | Rendimento | Rendimento | 1.185 | 1,18500000 | - | False | ||
0 | 0 | 0 | 31/03/2021 | 15/04/2021 | Rendimento | Rendimento | 2.87 | 2,87000000 | - | False | ||
0 | 0 | 0 | 26/02/2021 | 12/03/2021 | Rendimento | Rendimento | 2.09 | 2,09000000 | - | False | ||
0 | 0 | 0 | 29/01/2021 | 12/02/2021 | Rendimento | Rendimento | 2.25 | 2,25000000 | - | False | ||
0 | 0 | 0 | 30/12/2020 | 15/01/2021 | Rendimento | Rendimento | 2.01 | 2,01000000 | - | False | ||
0 | 0 | 0 | 30/11/2020 | 14/12/2020 | Rendimento | Rendimento | 2.03668 | 2,03668260 | - | False | ||
0 | 0 | 0 | 30/10/2020 | 13/11/2020 | Rendimento | Rendimento | 3.24 | 3,24000000 | - | False | ||
0 | 0 | 0 | 30/09/2020 | 15/10/2020 | Rendimento | Rendimento | 2.15 | 2,15000000 | - | False | ||
0 | 0 | 0 | 31/08/2020 | 15/09/2020 | Rendimento | Rendimento | 1.35 | 1,35000000 | - | False | ||
0 | 0 | 0 | 31/07/2020 | 14/08/2020 | Rendimento | Rendimento | 0.814098 | 0,81409811 | - | False | ||
0 | 0 | 0 | 30/06/2020 | 15/07/2020 | Rendimento | Rendimento | 1.56063 | 1,56063128 | - | False | ||
0 | 0 | 0 | 29/05/2020 | 15/06/2020 | Rendimento | Rendimento | 0.778074 | 0,77807445 | - | False | ||
0 | 0 | 0 | 30/04/2020 | 11/05/2020 | Rendimento | Rendimento | 0.615445 | 0,61544523 | - | False | ||
0 | 0 | 0 | 14/04/2020 | 15/04/2020 | Rendimento | Rendimento | 0.189474 | 0,18947368 | - | False | ||
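This ticker's JSON only contains "Rendimento" rows, but since the `et` field carries the event type, you can filter the raw list before building the DataFrame, mirroring the filter in the original scraper. A small sketch (the sample records below are hypothetical and only mimic the field names shown in the output above; "Amortização" is an invented event type used for illustration):

```python
import json

# Hypothetical sample mimicking the structure of the #results JSON value.
raw = json.dumps([
    {"et": "Rendimento", "ed": "25/02/2022", "pd": "15/03/2022", "v": 1.635},
    {"et": "Amortização", "ed": "30/12/2021", "pd": "14/01/2022", "v": 0.5},
    {"et": "Rendimento", "ed": "31/01/2022", "pd": "14/02/2022", "v": 1.63},
])

data = json.loads(raw)

# Keep only dividend ("Rendimento") events, as the original scraper did.
rendimentos = [d for d in data if d["et"] == "Rendimento"]
print(len(rendimentos))
```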