Need to eliminate duplicate content from web scraping
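I'm new to BeautifulSoup in Python and I'm trying to extract certain information from a website: the deeplink, the header, and the price.
It works fine, except that the crawler returns duplicate content that I want to remove from the output.
Example below: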
Header: Splendid Imlil: Mount Toubkal Day Trip from Marrakech | Price: 83 | Deeplink: http://www.isa.com/marrakech-l208/splendid-imlil-mount-toubkal-t41589/
Header: Morocco - The Imperial Cities 7-Day Tour | Price: 653 | Deeplink: http://www.isa.com/fuengirola-l1160/morocco-the-imperial-cities-7-day-tour-t15167/
Header: Ourika Valley Full-Day Private Tour | Price: 27 | Deeplink: http://www.isa.com/marrakech-l208/ourika-valley-full-day-private-tour-lunch-t19152/
Header: Sunday market Had Draa & Oasis of Ain el Hajar | Price: 39 | Deeplink: http://www.isa.com/essaouira-l877/sunday-market-had-draa-oasis-of-ain-el-hajar-t51987/
Header: Marrakech: 4-Day Long Weekend Tour | Price: 646 | Deeplink: http://www.isa.com/marrakech-l208/long-weekend-tour-in-marrakech-t54831/
Header: From Agadir: Marrakech Excursion Full-Day Trip | Price: 113 | Deeplink: http://www.isa.com/agadir-l1413/marrakech-express-bus-and-walking-tour-from-agadir-t28772/
Header: Sahara Desert 4-Day New Years Eve Tour from Marrakech | Price: 422 | Deeplink: http://www.isa.com/marrakech-l208/sahara-desert-4-day-new-years-eve-tour-from-marrakech-t24757/
Header: Essaouira: VIP Gnawa Music Experience Festival Tour | Price: 122 | Deeplink: http://www.isa.com/essaouira-l877/essaouira-vip-gnawa-music-experience-festival-tour-t50983/
Header: Marrakech: Full Day Private Tour | Price: 235 | Deeplink: http://www.isa.com/marrakech-l208/marrakech-full-day-private-tour-t56646/
Header: Marrakech Palmeraie 3-Hour Bike Tour | Price: 79 | Deeplink: http://www.isa.com/marrakech-l208/marrakech-palmeraie-bike-tour-t53282/
Header: Marrakech: 4-Day Desert Safari and Overnight Camp | Price: 484 | Deeplink: http://www.isa.com/marrakech-l208/desert-tour-from-marrakech-t54706/
Header: Private Transfer between Marrakech Airport to Palmeraie | Price: 23 | Deeplink: http://www.isa.com/marrakech-l208/private-transfer-between-marrakech-airport-to-palmeraie-t55781/
Header: Splendid Imlil: Mount Toubkal Day Trip from Marrakech | Price: 83 | Deeplink: http://www.isa.com/marrakech-l208/splendid-imlil-mount-toubkal-t41589/
Header: Morocco - The Imperial Cities 7-Day Tour | Price: 653 | Deeplink: http://www.isa.com/fuengirola-l1160/morocco-the-imperial-cities-7-day-tour-t15167/
Header: Ourika Valley Full-Day Private Tour | Price: 27 | Deeplink: http://www.isa.com/marrakech-l208/ourika-valley-full-day-private-tour-lunch-t19152/
Header: Sunday market Had Draa & Oasis of Ain el Hajar | Price: 39 | Deeplink: http://www.isa.com/essaouira-l877/sunday-market-had-draa-oasis-of-ain-el-hajar-t51987/
Header: Marrakech: 4-Day Long Weekend Tour | Price: 646 | Deeplink: http://www.isa.com/marrakech-l208/long-weekend-tour-in-marrakech-t54831/
Header: From Agadir: Marrakech Excursion Full-Day Trip | Price: 113 | Deeplink: http://www.isa.com/agadir-l1413/marrakech-express-bus-and-walking-tour-from-agadir-t28772/
Header: Sahara Desert 4-Day New Years Eve Tour from Marrakech | Price: 422 | Deeplink: http://www.isa.com/marrakech-l208/sahara-desert-4-day-new-years-eve-tour-from-marrakech-t24757/
Header: Essaouira: VIP Gnawa Music Experience Festival Tour | Price: 122 | Deeplink: http://www.isa.com/essaouira-l877/essaouira-vip-gnawa-music-experience-festival-tour-t50983/
Header: Marrakech: Full Day Private Tour | Price: 235 | Deeplink: http://www.isa.com/marrakech-l208/marrakech-full-day-private-tour-t56646/
Header: Marrakech Palmeraie 3-Hour Bike Tour | Price: 79 | Deeplink: http://www.isa.com/marrakech-l208/marrakech-palmeraie-bike-tour-t53282/
I'd like to remove the duplicates before these items end up in the output.
This is my logic so far:
hallo = soup.find_all("article", {"class": "activity-card activity-card-horizontal "})
for item in hallo:
    headers = item.find_all("h3", {"class": "activity-card-title"})
    for header in headers:
        header_final = header.text.strip()
        #print(header_final)
    prices = item.find_all("span", {"class": "price"})
    for price in prices:
        price_final = price.text.strip().replace(",","")[3:]
        #print(price_final)
    deeplinks = item.find_all("a", {"class": "activity-card-link"})
    for t in set(t.get("href") for t in deeplinks):
        deeplink_final = t
        #print(deeplink_final)
    print("Header: " + header_final + " | " + "Price: " + str(price_final) + " | " + "Deeplink: " + deeplink_final)
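Can anyone give me feedback on how to remove the duplicates? Any feedback is appreciated. I tried to maintain a set of results, but apparently I'm making some mistake that I can't figure out.
Edit
I adjusted my code based on the feedback: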
for item in hallo:
    headers = item.find_all("h3", {"class": "activity-card-title"})
    for header in headers:
        item = header.text.strip()
        if item not in already_printed:
            print(item)
            already_printed.add(item)
    prices = item.find_all("span", {"class": "price"})
    for price in prices:
        item2 = price.text.strip().replace(",","")[3:]
        if item2 not in already_printed:
            print(item2)
            already_printed.add(item2)
It works for the header items, but for the prices I get the following error message:
File "C:/Users/hmattu/PycharmProjects/untitled1/Duplicates remove.py", line 52, in trade_spider
prices = item.find_all("span", {"class": "price"})
AttributeError: 'str' object has no attribute 'find_all'
What am I doing wrong? Thanks for any feedback.
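Instead of printing each item on every iteration, store the items in a dictionary first, using the header or the url as the key. (You could also use a set().)
Once you have finished iterating over the hallo list, print the dictionary entries one by one.
That way only one entry is kept in the dictionary/set for any duplicated content.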
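The dictionary approach described above could be sketched roughly as follows. This is only an illustration, not the asker's actual spider: the function name dedupe_by_deeplink and the scraped list are hypothetical stand-ins for the (header, price, deeplink) values collected inside the loop over hallo, and the sketch assumes the deeplink is the field that identifies a duplicate.

```python
def dedupe_by_deeplink(records):
    """Keep the first (header, price) seen for each deeplink.

    `records` is an iterable of (header, price, deeplink) tuples.
    A plain dict preserves insertion order (Python 3.7+), so the
    result keeps the order in which items were first scraped.
    """
    unique = {}
    for header, price, deeplink in records:
        # setdefault stores the value only if the key is not present yet,
        # so later duplicates of the same deeplink are ignored.
        unique.setdefault(deeplink, (header, price))
    return unique


# Hypothetical sample standing in for the values extracted from each article;
# the first and third entries are the same activity.
scraped = [
    ("Ourika Valley Full-Day Private Tour", "27",
     "http://www.isa.com/marrakech-l208/ourika-valley-full-day-private-tour-lunch-t19152/"),
    ("Marrakech: Full Day Private Tour", "235",
     "http://www.isa.com/marrakech-l208/marrakech-full-day-private-tour-t56646/"),
    ("Ourika Valley Full-Day Private Tour", "27",
     "http://www.isa.com/marrakech-l208/ourika-valley-full-day-private-tour-lunch-t19152/"),
]

unique = dedupe_by_deeplink(scraped)

# Print only after all items have been collected.
for deeplink, (header, price) in unique.items():
    print("Header: " + header + " | Price: " + price + " | Deeplink: " + deeplink)
```

As for the AttributeError in the edited code: the inner loop rebinds the name item to the stripped header text, so the later item.find_all(...) call is made on a str rather than on the article element. Using a different name for the text (for example header_text, a hypothetical name) would keep item pointing at the element.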