Problems getting links from a YouTube channel with BeautifulSoup

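I'm trying to scrape a YouTube channel and return the links to all of its videos, but when I print the links I only get a handful that have nothing to do with the videos. I suspect the videos are loaded by JavaScript, so is there any way to do this with BeautifulSoup? Do I have to use Selenium? Can someone help me out by running some tests? This is my code so far: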

import requests
from bs4 import BeautifulSoup

print('scanning page...')

youtuber = 'memeulous'
result = requests.get('https://www.youtube.com/c/' + youtuber + '/videos')
status = result.status_code
src = result.content
soup = BeautifulSoup(src, 'lxml')
links = soup.find_all('a')

if status == 200:
    print('valid URL, grabbing uploads...')
else:
    print('invalid URL, status code: ' + str(status))
    quit()

print(links)

Here is my output:

scanning page...
valid URL, grabbing uploads...
[<a href="https://www.youtube.com/about/" slot="guide-links-primary" style="display: none;">About</a>, <a href="https://www.youtube.com/about/press/" slot="guide-links-primary" style="display: none;">Press</a>, <a href="https://www.youtube.com/about/copyright/" slot="guide-links-primary" style="display: none;">Copyright</a>, <a href="/t/contact_us" slot="guide-links-primary" style="display: none;">Contact us</a>, <a href="https://www.youtube.com/creators/" slot="guide-links-primary" style="display: none;">Creators</a>, <a href="https://www.youtube.com/ads/" slot="guide-links-primary" style="display: none;">Advertise</a>, <a href="https://developers.google.com/youtube" slot="guide-links-primary" style="display: none;">Developers</a>, <a href="/t/terms" slot="guide-links-secondary" style="display: none;">Terms</a>, <a href="https://www.google.co.uk/intl/en-GB/policies/privacy/" slot="guide-links-secondary" style="display: none;">Privacy</a>, <a href="https://www.youtube.com/about/policies/" slot="guide-links-secondary" style="display: none;">Policy and Safety</a>, <a href="https://www.youtube.com/howyoutubeworks?utm_campaign=ytgen&amp;utm_source=ythp&amp;utm_medium=LeftNav&amp;utm_content=txt&amp;u=https%3A%2F%2Fwww.youtube.com%2Fhowyoutubeworks%3Futm_source%3Dythp%26utm_medium%3DLeftNav%26utm_campaign%3Dytgen" slot="guide-links-secondary" style="display: none;">How YouTube works</a>, <a href="/new" slot="guide-links-secondary" style="display: none;">Test new features</a>]
[Finished in 4.0s]

As you can see, there are no video links.

One way to do this is to skip scraping and query the YouTube Data API instead, with the following code:

import requests

api_key = "PASTE_YOUR_API_KEY_HERE!"
yt_user = "memeulous"
api_url = f"https://www.googleapis.com/youtube/v3/channels?part=contentDetails&forUsername={yt_user}&key={api_key}"

response = requests.get(api_url).json()

playlist_id = response["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

channel_url = f"https://www.googleapis.com/youtube/v3/playlistItems?" \
              f"part=snippet%2CcontentDetails&maxResults=50&playlistId={playlist_id}&key={api_key}"


def get_video_ids(vid_data: dict) -> list:
    return [_id["contentDetails"]["videoId"] for _id in vid_data["items"]]


def build_links(vid_ids: list) -> list:
    return [f"https://www.youtube.com/watch?v={_id}" for _id in vid_ids]


def get_all_links() -> list:
    all_links = []
    url = channel_url
    while True:
        # Each request returns at most 50 items (maxResults=50).
        res = requests.get(url).json()
        all_links.extend(build_links(get_video_ids(res)))
        try:
            # nextPageToken is only present while more pages remain.
            paging_token = res["nextPageToken"]
            url = f"{channel_url}&pageToken={paging_token}"
        except KeyError:
            break
    return all_links


print(get_all_links())

This gives you all the video links (469) for the user memeulous.

['https://www.youtube.com/watch?v=4L8_isnyGfg', 'https://www.youtube.com/watch?v=ogpaiD2e-ss', 'https://www.youtube.com/watch?v=oH-nJe9XMN0', 'https://www.youtube.com/watch?v=kUcbKl4qe5g', ...
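If you want to check the paging logic without spending API quota, one option is to pass the page-fetching function in as a parameter. This is only a sketch: `collect_video_links`, `fake_fetch`, and `FAKE_PAGES` are names I made up for illustration, and the fake pages merely mimic the `items` / `nextPageToken` shape that the code above relies on.

```python
from typing import Callable, Optional

WATCH_URL = "https://www.youtube.com/watch?v="


def collect_video_links(fetch_page: Callable[[Optional[str]], dict]) -> list:
    """Follow nextPageToken until a page arrives without one."""
    links = []
    token = None
    while True:
        page = fetch_page(token)
        for item in page.get("items", []):
            links.append(WATCH_URL + item["contentDetails"]["videoId"])
        token = page.get("nextPageToken")
        if token is None:
            break
    return links


# Hypothetical two-page response, keyed by the page token that requests it.
FAKE_PAGES = {
    None: {"items": [{"contentDetails": {"videoId": "aaa"}}], "nextPageToken": "p2"},
    "p2": {"items": [{"contentDetails": {"videoId": "bbb"}}]},
}


def fake_fetch(token: Optional[str]) -> dict:
    # Stand-in for a real HTTP call; returns a canned page for the given token.
    return FAKE_PAGES[token]


demo_links = collect_video_links(fake_fetch)
print(demo_links)
```

In real use you would replace `fake_fetch` with a function that calls `requests.get(...)` on the playlistItems URL and returns the parsed JSON.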

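You can get the total number of videos from the pageInfo field of a playlistItems response (stored here in videos_data):

videos_data = requests.get(channel_url).json()
print(f"Total videos: {videos_data['pageInfo']['totalResults']}")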
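One thing to watch for: the channel lookup indexes `response["items"][0]` directly, which raises an error if the lookup comes back without any items. As far as I know that can happen when `forUsername` does not match a channel, but treat that as an assumption. A minimal guard might look like this; `extract_uploads_playlist` and both sample dictionaries are hypothetical, hand-written to match only the fields used in the answer.

```python
from typing import Optional


def extract_uploads_playlist(channel_response: dict) -> Optional[str]:
    """Return the uploads playlist ID, or None if the response has no items."""
    items = channel_response.get("items") or []
    if not items:
        return None
    return items[0]["contentDetails"]["relatedPlaylists"]["uploads"]


# Hand-written samples (not real API output).
found = {"items": [{"contentDetails": {"relatedPlaylists": {"uploads": "UU_example"}}}]}
not_found = {"pageInfo": {"totalResults": 0}}

print(extract_uploads_playlist(found))
print(extract_uploads_playlist(not_found))
```

If the function returns None, you can print a clear message and stop instead of hitting an IndexError or KeyError further down.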

Hope this helps and gets you started. All you need to do is get an API key for the YouTube Data API.
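Once you have the key, you may prefer not to paste it into the script. Here is a rough sketch that reads it from an environment variable and builds the playlistItems URL with `urllib.parse.urlencode` instead of hand-escaping the query string. The variable name `YOUTUBE_API_KEY`, the helper `build_playlist_items_url`, and the placeholder playlist ID are my own choices, not anything the API requires.

```python
import os
from typing import Optional
from urllib.parse import urlencode

API_BASE = "https://www.googleapis.com/youtube/v3/playlistItems"


def build_playlist_items_url(playlist_id: str, api_key: str,
                             page_token: Optional[str] = None) -> str:
    """Build a playlistItems request URL; urlencode handles the escaping."""
    params = {
        "part": "snippet,contentDetails",
        "maxResults": 50,
        "playlistId": playlist_id,
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    return f"{API_BASE}?{urlencode(params)}"


# YOUTUBE_API_KEY is an assumed variable name; fall back to the placeholder.
api_key = os.environ.get("YOUTUBE_API_KEY", "PASTE_YOUR_API_KEY_HERE!")
example_url = build_playlist_items_url("UPLOADS_PLAYLIST_ID", api_key)
print(example_url)
```

The same helper can produce the follow-up requests by passing the `nextPageToken` value as `page_token`.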