Issue Scraping with Python and BeautifulSoup
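I have been scraping this website for a year, but they recently changed the site's layout and, for some reason, my scraper no longer works. I am using Python and BeautifulSoup.
I mainly want to get the data from the tables at this link: https://www.loto.ro/?p=3872
This is the code I used for the old layout, adapted to the site's current layout: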
import requests
from bs4 import BeautifulSoup

website_result = requests.get("https://www.loto.ro/?p=3872")
src = website_result.content
soup = BeautifulSoup(src, 'lxml')
for i in range(0, 8):
    # Title of the i-th results block
    table_title = soup.select(".content .content-info .rezultate-extrageri-content.resultDiv .button-open-details")[i].get_text().strip()
    if "6/49" in table_title:
        # Images of the drawn numbers
        images = soup.select(".content-info .rezultate-extrageri-content.resultDiv "
                             ".info-rezultat .numere-extrase img[src]")
        if len(images) > 0:
            table = soup.select(".content .content-info .rezultate-extrageri-content.resultDiv .results-table")[i]
In debug mode, my code gets stuck on the "table_title" line without giving me any error or traceback, so I don't even know where the problem is.
Any ideas? Thanks.
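One way to narrow down a failure on the "table_title" line is to check how many elements the selector matches before indexing into the result. If the new layout no longer contains those classes, soup.select(...) returns an empty list and [i] raises an IndexError. The sketch below is only an illustration: SAMPLE_HTML is a made-up stand-in for the downloaded page (the real markup may differ), and select_titles is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<div class="content"><div class="content-info">
  <div class="rezultate-extrageri-content resultDiv">
    <div class="button-open-details"> Loto 6/49 </div>
  </div>
</div></div>
"""

# Same selector as in the question, split across two lines for readability.
TITLE_SELECTOR = (
    ".content .content-info .rezultate-extrageri-content.resultDiv "
    ".button-open-details"
)


def select_titles(html, selector=TITLE_SELECTOR):
    """Return the stripped text of every element matching the selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [element.get_text().strip() for element in soup.select(selector)]


# A page that still has the old classes yields one title per results block.
titles = select_titles(SAMPLE_HTML)
print(len(titles), titles)

# A page without those classes yields an empty list, so [i] would fail.
empty = select_titles("<html><body></body></html>")
print(len(empty))
```

Calling select_titles(website_result.text) on the real response before the loop would show whether the selector still matches anything in the HTML that requests receives.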
The URL the results come from is indeed new, since it has "newLotoSite" in it.
Try this:
from io import StringIO

import pandas as pd
import requests
from tabulate import tabulate

new_url = "https://www.loto.ro/loto-new/newLotoSiteNexioFinalVersion/web/app2.php/jocuri/649_si_noroc/rezultate_extragere.html"
# Browser-like headers; the referer is the page from the question.
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36",
    "referer": "https://www.loto.ro/?p=3872",
}
response = requests.get(new_url, headers=headers, timeout=30)
response.raise_for_status()
# Recent pandas versions deprecate passing a raw HTML string, so wrap it in StringIO.
df = pd.read_html(StringIO(response.text), flavor="lxml")[0]
print(tabulate(df, headers="keys", tablefmt="psql"))
This should output:
+----+---------------------------------------+---------------------------------------+---------------------------------------+---------------------------------------+
| | CAT. | Numar castiguri | Valoare castig | Report |
|----+---------------------------------------+---------------------------------------+---------------------------------------+---------------------------------------|
| 0 | I (6/6) | REPORT | 272.80920 | 4.289.31280 |
| 1 | II (5/6) | 5 | 18.18728 | - |
| 2 | III (4/6) | 285 | 31907 | - |
| 3 | IV (3/6) | 4.563 | 3000 | - |
| 4 | Fond total de castiguri: 4.608.075,60 | Fond total de castiguri: 4.608.075,60 | Fond total de castiguri: 4.608.075,60 | Fond total de castiguri: 4.608.075,60 |
+----+---------------------------------------+---------------------------------------+---------------------------------------+---------------------------------------+
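The last row of that table repeats the same summary string in every column, and the amount in it uses "." as the thousands separator and "," as the decimal mark. If the total is needed as a number, a small helper along these lines could extract it. This is a sketch, not part of the original answer: parse_ro_amount is a hypothetical name, and it assumes every amount follows the format seen in the summary row.

```python
import re
from decimal import Decimal

# Digits with optional "."-separated thousands groups and an optional ","
# followed by decimal digits, e.g. "4.608.075,60".
_RO_NUMBER = re.compile(r"\d+(?:\.\d{3})*(?:,\d+)?")


def parse_ro_amount(text):
    """Return the first Romanian-formatted number in text as a Decimal, or None."""
    match = _RO_NUMBER.search(text)
    if match is None:
        return None
    # Drop the thousands separators, then turn the decimal comma into a dot.
    normalized = match.group(0).replace(".", "").replace(",", ".")
    return Decimal(normalized)


summary = "Fond total de castiguri: 4.608.075,60"
print(parse_ro_amount(summary))  # 4608075.60
```

The other amounts in the output (for example 31907) look as if their decimal commas were dropped. pd.read_html treats "," as the thousands separator by default, so passing thousands="." and decimal="," may preserve them, although that has not been checked against this site.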