Iterate through multiple URLs with BS4 and store the results in CSV format

Hello, I'm currently working on a small scraper and putting the pieces together. I have a URL that holds the record of a so-called Digital Innovation Hub: see https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view

I want to export all 700 records to CSV format, i.e. into an Excel spreadsheet. So far so good:

I did some preliminary experiments, and they look quite promising.

# Python program to print all heading tags
import requests
from bs4 import BeautifulSoup

# scrape the content of a single hub page
url_link = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
request = requests.get(url_link)

soup = BeautifulSoup(request.text, 'lxml')

# collect all common heading tags
heading_tags = ["h1", "h2", "h3", "h4"]
for tag in soup.find_all(heading_tags):
    print(tag.name + ' -> ' + tag.text.strip())

This delivers:

h1 -> Smart Specialisation Platform
h1 -> Digital Innovation Hubs

Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies

I want to fetch the whole dataset from https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool: I need to iterate over roughly 700 URLs, see

https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/17865/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1416/view
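Since all record pages appear to share the same pattern and differ only in the numeric hub id, the full list could presumably be built like this (a minimal sketch; the three ids are just the examples from above, the real list of ~700 ids would have to be collected first):

# Sketch: build the record URLs from the numeric hub ids.
# The ids below are only the three examples from above; the full
# run would need the complete list of ~700 ids.
BASE = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/{}/view'
hub_ids = [1096, 17865, 1416]
URLs = [BASE.format(hub_id) for hub_id in hub_ids]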

Well, for the iteration part: I think I can iterate over multiple URLs this way, defining them up front and working through them with requests and BeautifulSoup. Attached is what I have so far, i.e. the attempt to put the URLs into a list....

import requests
import bs4

URLs = ["https://example-url-1.com", "https://example-url-2.com"]

# requests.get() takes a single URL, not a list, so fetch each page in turn
for url in URLs:
    result = requests.get(url)
    soup = bs4.BeautifulSoup(result.text, "lxml")
    print(soup.find_all('p'))

Well, frankly: I'm also looking for a way to include a delay between the requests so as not to spam the server. So I could do something like this:

import requests
import bs4
from time import sleep

URLs = ['https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view',
        'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/17865/view',
        'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1416/view'
        ]

def getPage(url):
    print('Indexing {0}......'.format(url))
    result = requests.get(url)
    print('URL indexed... now pausing 50 secs before the next one')
    sleep(50)
    return result

# map() is lazy in Python 3: each page is only fetched (and the pause
# only happens) when the loop below asks for the next result
results = map(getPage, URLs)
for result in results:
    soup = bs4.BeautifulSoup(result.text, "html.parser")
    print(soup.find_all('p'))

Now for the parser part: parsing the data, for example from here: https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view

import requests
from bs4 import BeautifulSoup

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
r = requests.get(url)
html_as_string = r.text
soup = BeautifulSoup(html_as_string, 'html.parser')

# dump the text of every paragraph on the page
for link in soup.find_all('p'):
    print(link.text)

The result is pretty awesome but completely unsorted. I want to store all the results in CSV format, i.e. an Excel sheet with the following columns:

Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies

Looking at the results of that parse:

Click on the following link if you want to propose a change of this HUB

You need an EU Login account for request proposals for editions or creations of new hubs. If you already have an ECAS account, you don't have to create a new EU Login account. In EU Login, your credentials and personal data remain unchanged. You can still access the same services and applications as before. You just need to use your e-mail address for logging in. If you don't have an EU Login account please use the following link. you can create one by clicking on the Create an account hyperlink. If you already have a user account for EU Login please login via https://webgate.ec.europa.eu/cas/login Sign in New user? Create an account Coordinator (University) Robotic Competence Center of Technical University of Munich, TUM CC Coordinator website http://www6.in.tum.de/en/home/ Year Established 2017 Location Schleißheimer Str. 90a, 85748, Garching bei München (Germany) Website http://www.robot.bayern Social Media

Contact information Adam Schmidt adam.schmidt@tum.de +49 (0)89 289-18064

Year Established 2017 Location Schleißheimer Str. 90a, 85748, Garching bei München (Germany) Website http://www.robot.bayern Social Media

Contact information Description BaRoN is an initiative bringing together several actors in Bavaria: the TUM Robotics Competence Center founded within the HORSE project, Bavarian Research Alliance (BayFOR), ITZB (Projektträger Bayern) and Bayerische Patentallianz, the latter three being members of the Bavarian Research and Innovation Agency) in order to facilitate the process of robotizing Bavarian manufacturing sector. In its current form it is an informal alliance of established institutions with a vast experience in the field of bringing and facilitating innovation in Bavaria. The mission of the network is to make Bavaria the forerunner of the digitalized and robotized European industry. The mission is realized by offering services ranging from providing the technological expertise, access to the robotic equipment, IPR advice and management, and funding facilitation to various entities of the Bavarian manufacturing ecosystem – start-ups, SMEs, research institutes, universities and other institutions interested in embracing the Industry 4.0 revolution. BaRoN verbindet mehrere Bayerische Akteure mit einem gemeinsamen Ziel – die Robotisierung des Bayerischen produzierenden Gewerbes voranzutreiben. OP Bayern ERDF 2014-2020 Enhancing the competitiveness of SMEs through the creation and the extension of advanced capacities for product and service developments and through internationalisation initiatives Budget 1.4B€

To sum up: BS4 is returning one awful, unsorted blob of data. How can I clean this up with pandas so that I end up with a tidy table containing the following columns (a sketch follows after the list):

h1 -> Smart Specialisation Platform
h1 -> Digital Innovation Hubs

Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies
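One way that should get there (a minimal sketch, not tested against the full dataset): collect one dict per hub, keyed by exactly these labels, and let pandas write the table. The rows list below holds dummy data just to show the shape; in the real run the dicts would come out of the scraping loop:

import pandas as pd

# One dict per hub, keyed by the column labels above; dummy data
# here, in the real run filled by the scraping loop.
rows = [
    {'Hub': 'Bavarian Robotic Network (BaRoN)',
     'Contact Data': 'Adam Schmidt, adam.schmidt@tum.de',
     'Description': 'BaRoN is an initiative ...'},
]

df = pd.DataFrame(rows)

# utf-8-sig makes Excel pick up the German umlauts correctly
df.to_csv('hubs.csv', index=False, encoding='utf-8-sig')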

Update: thanks to Tim Roberts I can see that we have the following combination:

class: hubCard
class: hubCardTitle
class: hubCardContent
class: infoLabel >Description>
<p> Text - data - content </p>

With this, we can extend the parsing job step by step. Many thanks, Tim!
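Based on that structure, a minimal sketch for walking every card on the page (assuming each <div class='hubCard'> really contains a hubCardTitle and a hubCardContent as described):

import requests
from bs4 import BeautifulSoup

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
soup = BeautifulSoup(requests.get(url).text, 'html5lib')

# Walk every card: the hubCardTitle names the section, the
# hubCardContent holds the <p> tags with the actual data.
for card in soup.find_all('div', class_='hubCard'):
    title = card.find(class_='hubCardTitle')
    content = card.find('div', class_='hubCardContent')
    if title and content:
        print(title.get_text(strip=True))
        for p in content.find_all('p'):
            print('   ', p.get_text(strip=True))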

That said, I now want to fetch the data from the other fields of interest as well, e.g. the description text that starts like this:

BaRoN is an initiative bringing together several actors in Bavaria: the TUM Robotics Competence Center founded within the HORSE project, Bavarian Research Alliance (BayFOR), ITZB (Projektträger Bayern)

I applied your idea, Tim, to the code, but it does not work:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html5lib')

# The name of the hub is in the <h4> tag.

hubname = soup.find('h4').text

# All contact info is within a <div class='hubCard'>.

description = soup.find("div", class_="hubCardContent")

cardinfo = {}

# Grab all the <p> tags inside that div.  The infoLabel class marks
# the section header.

for data in description.find_all('p'):
    if 'infoLabel' in data.attrs.get('class', []):
        Description = data.text
        cardinfo[Description] = []
    else:
        cardinfo[Description].append( data.text )

# The contact info is in a <div> inside that div.

#for data in contact.find_all('div', class_='infoMultiValue'):
#    cardinfo['Description'].append( data.text)

print("---")
print(hubname)
print("---")
pprint(cardinfo)

It always returns the contact information, but not the data I'm looking for, the text of the description. What am I doing wrong...
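One suspicion, judging from the class structure above: soup.find('div', class_='hubCardContent') returns only the first matching div on the page, and that first card happens to be the contact card, so the loop never reaches the description. A sketch of a possible fix (reusing the soup object from above, and assuming the wanted card says 'Description' in its hubCardTitle):

# find() stops at the FIRST hubCardContent (the contact card), so
# pick the card whose hubCardTitle actually says 'Description' instead.
description_text = None
for card in soup.find_all('div', class_='hubCard'):
    title = card.find(class_='hubCardTitle')
    if title and 'Description' in title.get_text():
        content = card.find('div', class_='hubCardContent')
        if content:
            description_text = content.get_text(strip=True)
        break

print(description_text)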

Maybe this will give you a start. You'll have to dig through the HTML to find the key markers for the information you need. My sense is that you want the title and the contact info. The title is in the <h2> tag, which is the only such tag on the page. The contact info is inside a <div class='hubCard'> tag, so we can grab that and pull out the pieces.

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html5lib')

# The name of the hub is in the <h2> tag.

hubname = soup.find('h2').text

# All contact info is within a <div class='hubCard'>.

contact = soup.find("div", class_="hubCard")

cardinfo = {}

# Grab all the <p> tags inside that div.  The infoLabel class marks
# the section header.

for data in contact.find_all('p'):
    if 'infoLabel' in data.attrs.get('class', []):
        title = data.text
        cardinfo[title] = []
    else:
        cardinfo[title].append(data.text)

# The contact info is in a <div> inside that div.

for data in contact.find_all('div', class_='infoMultiValue'):
    cardinfo['Contact information'].append(data.text)

print("---")
print(hubname)
print("---")
pprint(cardinfo)

Output:

---
 Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
---
{'Contact information': [' Adam Schmidt',
                         ' adam.schmidt@tum.de',
                         ' +49 (0)89 289-18064'],
 'Coordinator (University)': ['',
                              'Robotic Competence Center of Technical '
                              'University of Munich, TUM CC'],
 'Coordinator website': ['http://www6.in.tum.de/en/home/\n'
                         '\t\t\t\t\t\n'
                         '\t\t\t\t\t\n'
                         '\t\t\t\t\t'],
 'Location': ['Schleißheimer Str. 90a, 85748, Garching bei München (Germany)'],
 'Social Media': ['\n'
                  '\t\t\t\t\t\t\n'
                  '\t\t\t\t\t\t\t\t\t\t \n'
                  '\t\t\t\t\t\t\t\t\t\t\n'
                  '\t\t\t\t\t\t\t\t\t\t \n'
                  '\t\t\t\t\t\t\t\t\t\t\n'
                  '\t\t\t\t\t\t\t\t\t\t \n'
                  '\t\t\t\t\t\t\t\t\t\t\n'
                  '\t\t\t\t\t\t'],
 'Website': ['http://www.robot.bayern'],
 'Year Established': ['2017']}