How do I scrape a video URL from a web page using Python?

Question

I want to download a video from a website.

Here is my code. Every time I run it, it returns an empty result. Here is the live code: https://colab.research.google.com/drive/19NDLYHI2n9rG6KeBCiv9vKXdwb5JL9Nb?usp=sharing

from bs4 import BeautifulSoup
import requests

url = requests.get("https://www.mxtakatak.com/xt0.3a7ed6f84ded3c0f678638602b48bb1b840bea7edb3700d62cebcf7a400d4279/video/20000kCCF0")

page = url.content

soup = BeautifulSoup(page, "html.parser")

#print(soup.prettify())

result = soup.find_all('video', class_="video-player")

print(result)

Answer 1

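You always get an empty result because soup.find_all() doesn't find anything. You should probably inspect the url.content you received by hand and then decide what to search for with find_all().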
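
One way to do that inspection without scrolling through the whole page is to count the tags that actually arrive and note which inline scripts carry data. The following is only a sketch using the standard library; `summarize_html` and the sample markup are hypothetical, and in practice you would pass it `url.text` from the request above.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags and keeps the body of every <script> element."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def summarize_html(html):
    """Report how many <video> and <script> tags exist and which scripts set window._state."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {
        "video_tags": parser.counts["video"],
        "script_tags": parser.counts["script"],
        "state_scripts": [i for i, body in enumerate(parser.scripts)
                          if "window._state" in body],
    }


# Hypothetical stand-in for the downloaded page: no <video> tag, data lives in a script.
sample = ('<html><head><script src="a.js"></script>'
          '<script>window._state = {"entities": {}}</script></head>'
          '<body><div id="root"></div></body></html>')
print(summarize_html(sample))
```

If `video_tags` is zero, searching for `video` elements cannot succeed, and the indices in `state_scripts` show which script tags are worth reading instead.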

Edit: after some digging, I found out how to get content_url_orig:

from bs4 import BeautifulSoup
import requests
import json

url = requests.get("https://www.mxtakatak.com/xt0.3a7ed6f84ded3c0f678638602b48bb1b840bea7edb3700d62cebcf7a400d4279/video/20000kCCF0")

page = url.content

soup = BeautifulSoup(page, "html.parser")



result = str(soup.find_all('script')[1])  # the second <script> tag in the HTML holds the page state
# Cut the JSON out of the script string; the split points were found by inspecting the page source
result = result.split('window._state = ')[1].split("</script>")[0].split('\n')[0]

result = json.loads(result)


#navigating in the json to get the video-url
entity = list(result['entities'].items())[0][1]
download_url = entity['content_url_orig']

print(download_url)
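
The chained split() calls depend on the exact line layout of the script. A slightly more tolerant variant, sketched here under the assumption that the JSON object directly follows `window._state = `, lets the JSON decoder find the end of the object itself. `extract_state` and the sample string are hypothetical names used only for illustration.

```python
import json


def extract_state(script_text, marker="window._state = "):
    """Return the JSON value that follows `marker`, ignoring whatever comes after it."""
    # str.index raises ValueError if the marker is missing, which signals a layout change.
    start = script_text.index(marker) + len(marker)
    # raw_decode requires the value to begin exactly at `start`, so skip stray whitespace.
    while script_text[start].isspace():
        start += 1
    # raw_decode parses a single JSON value and returns it with the index where it ended,
    # so trailing JavaScript or a closing </script> tag is simply left unread.
    state, _end = json.JSONDecoder().raw_decode(script_text, start)
    return state


# Hypothetical script text shaped like the one described in this answer.
sample_script = ('<script>window._state = {"entities": {"a": '
                 '{"content_url_orig": "https://example.com/a.mp4"}}};\n'
                 'window.other = 1;</script>')
state = extract_state(sample_script)
print(state["entities"]["a"]["content_url_orig"])
```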

Fun side note: if I'm reading the JSON correctly, it contains all of the creator's uploaded videos, including their download URLs :)
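
If that reading is right, collecting every URL is a loop over the entities. This is a sketch that assumes each entry under `entities` is a dict which may carry a `content_url_orig` key; the real structure has not been verified beyond the first entity, and `collect_download_urls` and the sample data are hypothetical.

```python
def collect_download_urls(state):
    """Gather content_url_orig from every entity that has one."""
    urls = []
    for entity in state.get("entities", {}).values():
        # Skip anything that is not a dict or has no download URL (assumed possible).
        if isinstance(entity, dict) and entity.get("content_url_orig"):
            urls.append(entity["content_url_orig"])
    return urls


# Hypothetical data in the shape assumed above; in the answer's code you would pass `result`.
sample_state = {
    "entities": {
        "v1": {"content_url_orig": "https://example.com/v1.mp4"},
        "v2": {"title": "no url here"},
        "v3": {"content_url_orig": "https://example.com/v3.mp4"},
    }
}
print(collect_download_urls(sample_state))
```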

Answer 2

Use a regular expression:

import requests
import re

response = requests.get("....../video/20000kCCF0")
videoId = '20000kCCF0'
videos = re.findall(r'https://[^"]+' + videoId + '[^"]+mp4', response.text)
print(videos)
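
To see what the pattern matches without making a request, it can be run against a small string. This is a sketch with made-up sample text and a hypothetical helper `find_mp4_urls`; it also passes the ID through re.escape() so that an ID containing regex metacharacters is matched literally. It assumes the URLs appear in the page with plain, unescaped slashes.

```python
import re


def find_mp4_urls(text, video_id):
    """Return every quoted https URL that contains video_id and ends in mp4."""
    # [^"]+ keeps each match inside a single quoted string.
    pattern = r'https://[^"]+' + re.escape(video_id) + r'[^"]+mp4'
    return re.findall(pattern, text)


# Hypothetical snippet of page source: one video URL and one thumbnail URL.
sample_text = ('{"content_url_orig": "https://cdn.example.com/v/20000kCCF0/orig.mp4", '
               '"thumb": "https://cdn.example.com/v/20000kCCF0/t.jpg"}')
print(find_mp4_urls(sample_text, "20000kCCF0"))
```

Only the URL ending in mp4 is returned; the thumbnail is skipped because the match cannot run past its closing quote.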

python

web-scraping