How to crawl multiple pages/cities from a website (BeautifulSoup, Requests, Python 3)
I would like to know how to crawl multiple different pages/cities from one website using BeautifulSoup/Requests without having to repeat my code over and over.
This is my code right now:
Region = "Marrakech"
Spider = 20
def trade_spider(max_pages):
page = -1
partner_ID = 2
location_ID = 25
already_printed = set()
while page <= max_pages:
page += 1
response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(Region) +"&page=" + str(page))
jsondata = json.loads(response.read().decode("utf-8"))
format = (jsondata['activities'])
g_data = format.strip("'<>()[]\"` ").replace('\'', '\"')
soup = BeautifulSoup(g_data)
hallo = soup.find_all("article", {"class": "activity-card"})
for item in hallo:
headers = item.find_all("h3", {"class": "activity-card"})
for header in headers:
header_final = header.text.strip()
if header_final not in already_printed:
already_printed.add(header_final)
deeplinks = item.find_all("a", {"class": "activity"})
for t in set(t.get("href") for t in deeplinks):
deeplink_final = t
if deeplink_final not in already_printed:
already_printed.add(deeplink_final)
end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
print(end_final)
trade_spider(int(Spider))
My goal is ideally to crawl multiple cities/regions from one specific website.
Right now I could do this manually by repeating my code over and over, crawling each city separately and then concatenating the resulting dataframes together, but that seems very unpythonic. I was wondering if anyone has a faster way or any advice.
I tried adding a second city to my Region variable, but it did not work:
Region = "Marrakech","London"
Could someone help me with this? Any feedback is appreciated.
Region = ["Marrakech","London"]
Put your while loop inside a for loop over the regions, and reset page to -1 at the start of each region:
for reg in Region:
    page = -1
Then replace Region with reg in the URL when making the request. Putting it all together:
Region = ["Marrakech","London"]
Spider = 20
def trade_spider(max_pages):
partner_ID = 2
location_ID = 25
already_printed = set()
for reg in Region:
page = -1
while page <= max_pages:
page += 1
response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(reg) +"&page=" + str(page))
jsondata = json.loads(response.read().decode("utf-8"))
format = (jsondata['activities'])
g_data = format.strip("'<>()[]\"` ").replace('\'', '\"')
soup = BeautifulSoup(g_data)
hallo = soup.find_all("article", {"class": "activity-card"})
for item in hallo:
headers = item.find_all("h3", {"class": "activity-card"})
for header in headers:
header_final = header.text.strip()
if header_final not in already_printed:
already_printed.add(header_final)
deeplinks = item.find_all("a", {"class": "activity"})
for t in set(t.get("href") for t in deeplinks):
deeplink_final = t
if deeplink_final not in already_printed:
already_printed.add(deeplink_final)
end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
print(end_final)
trade_spider(int(Spider))
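Since the title mentions Requests, here is a minimal sketch of the same loop rewritten with the requests library, collecting the header/deeplink pairs in a list instead of only printing them (which also avoids the dataframe-concatenation workaround mentioned in the question). It assumes the same jsox.com endpoint and the same HTML structure inside the 'activities' field; switching to requests and html.parser is my own choice, not part of the original code.

import requests
from bs4 import BeautifulSoup

regions = ["Marrakech", "London"]
max_pages = 20

def trade_spider(regions, max_pages):
    results = []          # collected (header, deeplink) pairs instead of print-only output
    already_seen = set()
    for reg in regions:
        for page in range(max_pages + 1):
            # requests builds the query string and decodes the JSON for us
            resp = requests.get(
                "http://www.jsox.com/s/search.json",
                params={"q": reg, "page": page},
            )
            jsondata = resp.json()
            # Clean up the embedded HTML as in the original code, then parse it
            html = jsondata["activities"].strip("'<>()[]\"` ").replace("'", '"')
            soup = BeautifulSoup(html, "html.parser")
            for item in soup.find_all("article", {"class": "activity-card"}):
                for header in item.find_all("h3", {"class": "activity-card"}):
                    header_final = header.text.strip()
                    if header_final in already_seen:
                        continue
                    already_seen.add(header_final)
                    for link in set(a.get("href") for a in item.find_all("a", {"class": "activity"})):
                        if link not in already_seen:
                            already_seen.add(link)
                            results.append((header_final, link))
    return results

for header, link in trade_spider(regions, max_pages):
    print("Header:", header, "|", "Deeplink:", link)

Because trade_spider now returns a list, you can feed the results for all regions into a single DataFrame in one step rather than concatenating per-city frames afterwards.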