beautifulsoup returns None for any element I try

Question

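I'm building a fully automated get-a-job application. Funnily enough, the automation part is fairly simple, but the scraping part is not.

Long story short, requests + beautifulsoup works for most of the domains I'm scraping, but nothing works when I try the same process on Workable pages: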

import requests
from bs4 import BeautifulSoup as bs

session = requests.Session()
url = 'https://apply.workable.com/breederdao-1/j/602097ACC9/'
req = session.get(url)
soup = bs(req.text, 'html.parser')

title = soup.find('h1', {'data-ui': 'job-title'})
print(title)

>>> None

details = soup.find('span', {'data-ui': 'job-location'})
print(details)

>>> None

Both elements are under body, but when I try to get the page title, I do get the result I expect:

title_0 = soup.find('title')
print(title_0)

>>> <title>Data Analyst (Fully Remote) - BreederDAO</title>

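I also tried using await + HTMLSession / AsyncHTMLSession, but as long as the element is inside body, every find() still returns None.

Can anyone educate me on this? My current assumption is that the site has some kind of anti-scraping mechanism, but I don't even know where to start looking. That said, this element does stand out: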

<html...
  <head>...</head>
  <body>
    .
    .
    .
    <noscript>
      <iframe height="0" width="0" src="https://www.googletagmanager.com/ns.html?id=GTM-WKS7WTT&amp;gtm_auth=SGnzIn3pcB7S4fevFXOKPQ&amp;gtm_preview=env-2&amp;gtm_cookies_win=x" style="display: none; visibility: hidden;">
        #document
          <!DOCTYPE html>
          <html lang="en">
            <head>
              <meta charset="utf-8">
              <title>ns</title>
            </head>
            <body>
              " "
            </body>
          </html>
      </iframe>
    </noscript>
    .
    .
    .
  </body>
</html>

Answer 1

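The data you see on the page is loaded via JavaScript from an external URL, so it is not in the HTML that requests receives. You can fetch that URL directly with the requests module. For example: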

import json
import requests


# 602097ACC9 is from your URL
url = "https://apply.workable.com/api/v2/accounts/breederdao-1/jobs/602097ACC9"
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

print(data["title"])
print(", ".join(data["location"].values()))

Prints:

Data Analyst (Fully Remote)
Philippines, PH, Makati, Metro Manila
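
To apply the same idea to other job pages, the endpoint URL can be derived from the page URL instead of being written by hand. The following is only a sketch: the helper name job_api_url is hypothetical, and it assumes that page URLs always have the form https://apply.workable.com/<account>/j/<shortcode>/ and that the /api/v2/accounts/<account>/jobs/<shortcode> pattern from the example above holds for other postings, which is not guaranteed.

```python
from urllib.parse import urlparse


def job_api_url(page_url: str) -> str:
    """Build the JSON endpoint URL for a job page URL (hypothetical helper).

    Assumption: the page path is /<account>/j/<shortcode>/ and the endpoint
    follows the pattern used in the example above.
    """
    parts = urlparse(page_url)
    # Drop empty segments produced by leading/trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) != 3 or segments[1] != "j":
        raise ValueError(f"unexpected job page URL: {page_url}")
    account, _, shortcode = segments
    return f"{parts.scheme}://{parts.netloc}/api/v2/accounts/{account}/jobs/{shortcode}"


if __name__ == "__main__":
    print(job_api_url("https://apply.workable.com/breederdao-1/j/602097ACC9/"))
```

The returned string can then be passed to requests.get(...).json() as in the example above. Raising ValueError on an unexpected path keeps a malformed URL from silently producing a request to the wrong endpoint.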
